Wiki Contributions


Q: Wouldn’t the AGI self-modify to make itself falsely believe that there’s a lot of human flourishing? Or that human flourishing is just another term for hydrogen?

A: No, for the same reason that, if a supervillain is threatening to blow up the moon, and I think the moon is super-cool, I would not self-modify to make myself falsely believe that “the moon” is a white circle that I cut out of paper and taped to my ceiling. [...] I’m using my current value function to evaluate the appeal (valence) of thoughts. 


It's worth noting that humans fail at this all the time.


Q: Wait hang on a sec. [...] how do you know that those neural activations are really “human flourishing” and not “person saying the words ‘human flourishing’”, or “person saying the words ‘human flourishing’ in a YouTube video”, etc.?

Humans screw this up all the time too, and these two failure modes are related.

You can't value what you can't perceive, and when your only ability to perceive "the moon" is the image you see when you look up, then that is what you will protect, and that white circle of paper will do it for you.

For an unusually direct visual example, bodybuilding is supposedly about building a muscular body, but sometimes people will use synthol to create the false appearance of muscle, in a way that is equivalent to taping a piece of paper to the ceiling and calling it a "moon". The fact that it doesn't even look a little like real muscle hints that it's probably a genuine failure to notice what they want to care about, rather than simply being happy to fool other people into thinking they're strong.

For a less direct but more pervasive example, people will value "peace and harmony" within their social groups, but due to myopia this often turns into short-sighted avoidance of conflict and behaviors that make conflict less resolvable, leaving less peace and harmony.

With enough experience, you might notice that protecting the piece of paper on the ceiling doesn't actually do anything for that super-cool moon, and you might learn to value something more tied to the actual moon. Just as, with more experience consuming excess sweets, you might learn that the way you feel afterward doesn't seem to go with getting what your body wanted, and you might find your tastes shifting in wiser directions.

But people aren't always that open to this change.

If I say "Your paper cutout isn't the moon, you fool", listening to me means you're going to have to protect a big rock a bazillion miles beyond your reach, and you're more likely to fail that than protecting the paper you put up. And guess what value function you're using to decide whether to change your values here? Yep, that one saying that the piece of paper counts. You're offering less chance of having "a moon", and relative to the current value system which sees a piece of paper as a valid moon, that's a bad deal. As a result, the shallowness and mis-aimedness of the value gets protected.

In practice, it happens all the time. Try explaining to someone that what they're calling "peace and harmony values" is really just cowardice and is actively impeding work towards peace and harmony, and see how easy it is, for example.

It's true that "A plan is a type of thought, and I’m using my current value function to evaluate the appeal (valence) of thoughts" helps protect well-formed value systems from degenerating into wireheading, but it also works to prevent development into values which preempt wireheading, and few of us are so fully developed that fulfilling our current values excellently wouldn't constitute wireheading of some form. It's also the case that when stressed, people will sometimes cower away from their more developed goals ("Actually, the moon is a big rock out in space...") and cling to their shallower and easier-to-fulfill goals ("This paper is the moon. This paper is the moon.."). They'll want not to, but it'll happen all the same when there's enough pressure.

Sorting out how to best facilitate this process of "wise value development" so as to dodge these failure modes strikes me as important.

I don't think people do, in general. Not as any sort of separate instinctual terminal value somehow patched into our utility function before we're born.

It can be learned, and well socialized people tend to learn it to some extent or another, but young kids are sure selfish and short sighted. And people placed in situations where it's not obvious to them why cooperation is in their best interest don't tend to act like they intrinsically value being cooperative. That's not to say I think people are consciously tallying everything and waiting for a moment to ditch the cooperative BS to go do what they really want to do. I mean, that's obviously a thing too, but there's more than that.

People can learn to fake caring, but people can also learn to genuinely care about other people -- in the kind of way where they will do good by the people they care about even when given the power not to. It's not that their utility function is made up of a "selfish" set consisting of god knows what, and then a term for "cooperation" is added. It's that we start out with short-sighted impulses like "stay warm, get fed", and along the way we build a more coherent structure of desires by making trades of the sort "I forgo one marshmallow now and receive two marshmallows in the future" and "You care a bit about my things, and I'll care a bit about yours". We start out as ineffective shits that can only cry when our immediate impulses aren't met, and can end up people who will voluntarily go cold and hungry even without temptation or suffering in order to provide for the wellbeing of our friends and family -- not because we reason that this is the best way to stay warm in each moment, but because we have learned to not care so much whether we're a little cold now and then relative to the long-term wellbeing of our friends and family.

What I'm saying is that at the point where a person gains sufficient power over reality that they no longer have to deceive others in order to gain support and avoid punishment, the development of their desires will stop and their behaviors will be best predicted by the trades they actually made. If they managed to fake their whole way there from childhood, you will get childish behavior and childish goals. To the extent that they've only managed to succeed and acquire power by changing what they care about to be prosocial, power will not corrupt.

I don't think it's necessary to posit any separate motivational drives. Once you're in a position where cooperation isn't necessary for getting what you want, then there's no incentive to cooperate or shape yourself to desire cooperative things.

It's rare-to-nonexistent in a society as large and interconnected as ours for anyone to be truly powerful enough that there's no incentive to cooperate, but we can look at what people do when they don't perceive a benefit to taking on (in part) other people's values as their own. Sure, sometimes we see embezzlement and sleeping with subordinates which look like they'd correlate with "maximizing reproductive fitness in EEA", but we also see a lot of bosses who are just insufferable in ways that make them less effective at their job to their detriment. We see power tripping cops and security guards, people being dicks to waiters, and without the "power over others" but retaining "no obvious socializing forces" you get road rage and twitter behavior.

The explanation that looks to me to fit better is just that people stop becoming socialized as soon as there's no longer a strong perceived force rewarding them for doing so and punishing them for failing to. When people lose the incentive to refactor their impulses, they just act on them. Sometimes that means they'll offer a role in a movie for sexual favors, but sometimes that means completely ignoring the people you're supposed to be serving at the DMV or taking out your bitterness on the waiter who can't do shit about it.

I guess I meant "as it applies here, specifically", given that Zack was already criticizing himself for that specific thing, and arguing for rather than against politeness norms in the specific place that I commented. I'm aware that you guys haven't been getting along too well and wouldn't expect agreement more generally, though I hadn't been following closely.

It looks like you put some work and emotional energy into this comment so I don't want to just not respond, but it also seems like this whole thing is upsetting enough that you don't really want to be having these discussions. I'm going to err on the side of not getting into any object level response that you might not want, but if you want to know how to get along with Zack and not find it infuriating, I think I understand his perspective (having found myself in similar shoes) well enough to explain how you can do it.

Yeah, I didn't mean that I thought you two agreed in general, just on the specific thing he was commenting on. I didn't mean to insert myself into this feud and I was kinda asking how I got here, but now that I'm here we might as well have fun with it. I think I have a pretty good feel for where you're coming from, and actually agree with a lot of it. However, agreement isn't where the fun is so I'm gonna push back where I see you as screwing up and you can let me know if it doesn't fit.

These two lines stand out to me as carrying all the weight:

I strongly disagree that pro-actively modeling one's interlocutors should be a prerequisite for being able to have a discussion.

I'm extremely wary that a culture that heavily penalizes not-sufficiently-modeling-one's-interlocutor, interferes with the process of subjecting each other's work to scrutiny.

These two lines seem to go hand in hand in your mind, but my initial response to the two is very different.

To the latter, I simply agree that there's a failure mode there and don't fault you for being extremely wary of it. To the former though.... "I disagree that this thing should be necessary" is kinda a "Tough?". Either it's necessary or it isn't, and if you're focusing on what "should" be you're neglecting what is.

I don't think I have to make the case that things aren't going well as is. And I'm not going to try to convince you that you should drop the "should" and attend to the "is" so that things run more smoothly -- that one is up to you to decide, and as much as "should" intentionally looks away from "is" and is in a sense fundamentally irrational in that way, it's sometimes computationally necessary or prudent given constraints.

But I will point out that this "should" is a sure sign that you're looking away from truth, and that it fits Duncan's accusations of what you're doing to a T. "I shouldn't have to do this in order to be able to have a discussion" sounds reasonable enough if you feel able to back up the idea that your norms are better, and it has a strong tendency to lead towards not doing the thing you "shouldn't have to" do. But when you look back at reality, that combination is "I actually do have to do this in order to have a (productive) discussion, and I'm gonna not do it, and I'm going to engage anyway". When you're essentially telling someone "Yeah, I know what I'm doing is going to piss you off, and not only am I going to do it anyway, I am going to show that pissing you off doesn't even weigh into my decisions because your feelings are wrong", then that's pretty sure to piss someone off.

It's clear that you're willing to weigh those considerations as a favor to Duncan, the way you recount asking Michael Vassar for such a favor, and that in your mind if Duncan wants you to accommodate his fragility, he should admit that this is what he's asking for and that it's a favor not an obligation -- you know, play by your rules.

And it's clear that by just accommodating everyone in this way without having the costs acknowledged (i.e. playing by his rules), you'd be giving up something you're unwilling to give up. I don't fault you there.

I agree with your framing that this is actually a conflict. And there are inherent reasons why that isn't trivially avoidable, but that doesn't mean that there isn't a path towards genuine cooperation -- just that you can't declare same sidedness by fiat.

Elsewhere in the comments you gave an example of "stealing bread" as a conflict that causes "disagreements" and lying. The solution here isn't to "cooperatively" pursue conflicting goals, it's to step back and look at how to align goals. Specifically, notice that everyone is better off if there's less thieving, and cooperate on not-thieving and punishing theft. And if you've already screwed up, cooperate towards norms that make confession and rehabilitation more appealing than lying but less appealing than not-thieving in the first place.

I don't think our problems are that big here. There are conflicts of values, sure, but I don't think the attempts to push one's values over others' are generally so deliberately antisocial. In this case, for example, I think you and Duncan both more or less genuinely believe that it is the other party who is doing the antisocial acts. And so rather than "One person is knowingly trying to get away with being antisocial, so of course they're not going to cooperate", I think it's better modeled as an actual disagreement that isn't able to be trivially resolved because people are resorting to conflict rather than cooperation to advance their (perceived as righteous) goals -- and then missing the fact that they're doing this, because they're so open to cooperating (within norms which are, according to themselves, objectively correct) while the other person irrationally and antisocially isn't (by rules they don't agree with)!

I don't agree with the way that he used it, but Duncan is spot on calling your behavior "trauma response". I don't mean it as a big-T "Trauma" like "abused as a child", but trauma in the "1 grain is a 'heap'" sense is at the core of this kind of conflict and many, many other things -- and it is more or less necessary for trauma response to exist on both sides for these things to not fizzle out. The analogy I like to give is that psychological trauma is like plutonium and hostile acts are like neutrons.

As a toy example to illustrate the point, imagine someone steps on your toes; how do you respond? If it's a barefoot little kid, you might say "Hey kid, you're standing on my toes" and they might say "Didn't mean to, sorry!" and step off. No trauma, no problem. If it's a 300lb dude with cleats, you might shove him as hard as you can because the damage incurred from letting him stand on your toes until you can get his attention is less acceptable. And if he's sensitive enough, he might get pissed at you for shoving him and deck you. If it becomes a verbal argument, he might say "your toes shouldn't have been there", and now it's an explicit conflict about where you get to put your toes and whether he has a right to step on them anyway if they are where you put them.

In order to not allow things to degenerate into conflict as the less-than-perfectly-secure cleat-wearing giant steps on your toes, you have to be able to withstand that neutron blast without retaliating with so much of your own that it turns into a fight instead of an "I'm sorry, I didn't realize your toes were there. I'll step off them for now because I care about your toes, but we need to have a conversation about where your feet are okay to be".

This means:

1) orienting to the truth that your toes are going to take damage whether you like it or not, and that "should" can't make this untrue or unimportant.

2) maintaining connection with the larger perspective that tracks what is likely to cause conflict, what isn't, and how to cause the minimal conflict and maximum cooperation possible so that you best succeed at your goals with least sacrifice of your formerly-sacred-and-still-instrumentally-important values.

In some cases, the most truth-oriented and most effective response is going to be politely tapping the big guy on the shoulder while your feet bleed, and having a conversation after the fact about whether he needs to be more careful where he's stepping -- because acting like shoving this guy makes sense is willful irrationality.

In other cases he's smaller and more shove-able and it doesn't make sense to accept the damage, but instead of coming off like "I'm totally happy to apologize for anything I actually did wrong. I'm sorry I called you a jerk while shoving you; that was unnecessary and inappropriate [but I will conspicuously not even address the fact that you didn't like being shoved or that you spilled your drink, because #notmyproblem. I'll explain why I'm right to not give a fuck if you care to ask]", you'll at least be more able to see the value in saying things like "I'm sorry I had to shove you. I know you don't like being shoved, and I don't like doing it. You even spilled your drink, and that sucks. I wish I saw another way to protect our community's ability to receive criticism without shoving you".

This shouldn't need to be said but probably does (for others, probably not for you), so I'll say it. This very much is not me taking sides on the whole thing. It's not a "Zack is in the wrong for not doing this" or a "I endorse Duncan's norms relatively more" -- nor is it the opposite. It's just a "I see Zack as wanting me to argue that he's screwing up in a way that might end up giving him actionable alternatives that might get him more of what he wants, so I will".

I'm confused. It seems to me that you, Zack, and I all have similar takes on the example you bring up, but the fact that you say this here suggests that you don't see us all as in clear agreement?

Yes, but it's worth pointing out what you can actually expect to get from it, and how easily. Most of what I'm talking about is from personal interactions, and the stuff that's online isn't like "Oh, the science is unanimous, unarguable and unambiguous" -- because we're talking about the equivalent of "physics motors" not "engineering motors". Even if our aerospace lab dyno results were publicly available you'd be right not to trust them at face value. If you have a physics degree then saying "Here's the reasoning, here are the computer simulations and their assumptions, and here's what our tests have shown so far" is easy. If you can't distinguish valid physics from "free energy" kookiness, then even though it's demonstrable and has been demonstrated to those with a good understanding of motor testing validity who have been following this stuff, it's not necessarily trivial to set up a sufficiently legible demonstration for someone who hasn't. It's real, we can get into how I know, but it might not be as easy as you'd like.

The thing that proved to me beyond a shadow of a doubt that there exist bright feedback oriented minds that have developed demonstrable abilities involved talking to one over and over and witnessing the demonstrations first hand as well as the feedback cycles. This guy used to take paying clients for some specific issue they wanted resolved (e.g. "fear of heights"), set concrete testable goals (e.g. "If I climb this specific wall, I will consider our work to have been successful"), and then track his success rate over time and as he changed his methods. He used to rack his brain about what could be causing the behavior he'd see in his failures, come up with an insight that helps to explain, play with it in "role play" until he could anticipate what the likely reactions would be and how to deal with them, and then go test it out with actual clients. And then iterate.

On the "natural discourse, not obviously connected to deliberate cultivation of skill" side, the overarching trajectory of our interactions is itself pretty exceptional. I started out kinda talking shit and dismissing his ideas in a way that would have pissed off pretty much anyone, and he was able to turn that around and end up becoming someone I respect more than just about anyone. On the "clearly the result of iterated feedback, but diverging from natural discourse" side there's quite a bit, but perhaps the best example is when I tried out his simple protocol for dealing with internal conflicts on physical pain, and it completely changed how I relate to pain to this day. I couldn't imagine how it could possibly work "because the pain would still be there" so I just did it to see what would happen, and it took about two minutes to go from "I can't focus at all because this shit hurts" to "It literally does not bother me at all, despite feeling the exact same". Having that shift of experience, and not even noticing the change as it happened.... was weird.

From there, it was mostly just recognizing the patterns, knowing where to look, and knowing what isn't actually an extraordinary claim.

This guy does have some stuff online including a description of that protocol and some transcripts, but again, my first reaction to his writings was to be openly dismissive of him so I'm not sure how much it'll help. And the transcripts are from quite early in his process of figuring things out so it's a better example of watching the mind work than getting to look at well supported and broadly applicable conclusions. Anyway, the first of his blog posts explaining that protocol is here, and other stuff can be found on the same site.

Another example that stands out to me as exceptionally clear, concise, and concrete (but pretty far from "natural discourse" towards "mind hack fuckery") is this demonstration by Steve Andreas of helping a woman get rid of her phobia. In particular, look at the woman's response and Steve's response to these responses at 0:39, 5:47, 6:12, 6:22, 6:26, and 7:44. The 25 year follow up is neat too.

I'll go on record as a counterexample here; I very much want politeness norms to be enforced here, and in my personal life I will pay great costs in order to preserve or create my freedom to be blunt. The requirement for me to be cautious of how I say things here is such a significant cost that I post here far less than I otherwise would. The cost is seriously nontrivial.

The reason I don't bitch about it is that I recognize that it's necessary. Changing norms to allow people to be relatively more inconsiderate wouldn't actually make things better. It's not just that "pandering to idiots" calls for a euphemism, it's that it probably calls for a mindset that is not so dismissive to people if they're going to be in or close enough to your audience to be offended. Like, actually taking them into consideration and figuring out how to bridge that gap. It's costly. It's also necessary, and often pays off.

I would like to be able to say "Zack, you stupid twat" without having to worry about getting attacked for doing so, but until I've proven to you that I respect you enough that it's to be taken as an affectionate insult between friends.... phrasing things that way wouldn't actually accomplish the goals I'd have for "less polite" speech. If I can earn that recognition then there's probably some flexibility in the norms, and if I can't earn that recognition or haven't earned that recognition, then that's kinda on me.

There does have to be some level at which we stop bending over backwards to spare feelings and just say what's true [to the best of our ability to tell] dammit, but it actually has to be calibrated to what pushes away only those we'd be happy to create friction with and lose. It's one of those things where if you don't actively work to broaden your scope of who you can get along with without poking at, you end up poking too indiscriminately, so I'm happy to see the politeness norms about where they are now.

I hear what you're saying.

What I'm saying is that as someone whose day job is in large part about designing bleeding edge aerospace motors, I find that the distinction you're making falls apart pretty quickly in practice when I try to actually design and test a "physics motor". Even things as supposedly straightforward as "measuring torque" haven't been as straightforward as you'd expect. A few years ago we took one of our motors to a major aerospace company to test on their dyno and they measured 105% efficiency. The problem was in their torque measurements. We had to get clever in order to come up with better measurements.

Coincidentally, I have also put in a ton of work into figuring out how to engineer discourse, so I also have experience in figuring out what needs to be measured, how it can be measured, and how you can know how far to trust your measurements to validate your theories. Without getting too far into it, you want to start out by calibrating against relatively concrete things like "Can I get this person, who has been saying they want to climb up this wall but are too afraid, to actually climb up the rock wall -- yes or no?". If you can do this reliably where others fail, you know you're doing something that's more effective than the baseline (even though that alone doesn't validate your specific explanation uniquely). It'd take a book to explain how to build from there, but at the end of the day if you can do concrete things that others cannot and you can teach it so that the people you teach can demonstrate the same things, then you're probably doing something with some validity to it. Probably.
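To make the "calibrate against concrete yes/no outcomes" idea above concrete, here's a minimal sketch of the kind of sanity check involved: given a known baseline success rate, how surprising is your observed success count if you were actually no better than baseline? All counts and rates here are hypothetical, invented purely for illustration.

```python
# Sketch of the calibration check described above: did the intervention's
# success rate ("client climbed the wall -- yes or no") beat a baseline?
# The numbers below are hypothetical.
from math import comb

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of seeing k or more
    successes out of n trials if you were no better than baseline rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical: baseline methods get ~20% of clients up the wall;
# suppose our method got 14 of 20.
p_value = binom_tail(n=20, k=14, p=0.20)
# A tiny tail probability says "reliably better than baseline" -- though,
# as noted above, it doesn't validate any specific explanation of *why*.
```

This only establishes "doing something more effective than baseline", not which theory explains it; ruling out explanations takes the further iteration the comment describes.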

I'm not saying that there's "no difference" between the process of optimizing discourse and the process of optimizing motors, but it is not nearly as black and white as you think. It's possible to lead yourself astray with confirmation bias in "discourse" related things, but you should see some of the shit engineers can convince themselves of without a shred of valid evidence. The cognitive skills, nebulosity of metric, and ease of coming up with trustable feedback are all very similar in my experience. More like "a darkish shade of gray" vs "a somewhat darker shade of gray".

Part of the confusion probably comes from the fact that what we see these days aren't "physics motors"; they're "engineering motors". An engineering motor is when someone who understands physics designs a motor and then engineers populate the world with surface level variations of this blueprint. By and large, my experience in both academic and professional engineering is that engineers struggle to understand and apply first principles and optimize anything outside of the context that was covered in their textbooks. It's true that within the confines of the textbook, things do get more "cut and dry", but it's an illusion that goes away when you look past industry practice to physics itself.

It's true that our "discourse engineering" department is in a sorry state of being and that the industry guidelines are not to be trusted, but it's not that we have literally nothing, and our relative lack is not because the subject is "too soft" to get a grip on. Motor design is hard to get a grip on too, when you're trying to tread even slightly new ground. The problem is that the principles based minds go into physics and sometimes engineering, but rarely psychology. In the few instances where I've seen bright minds approach "discourse" with an eye to verifiable feedback, they've found things to measure, been able to falsify their own predictions, and have ended up (mostly independently) coming to similar conclusions with demonstrably increased discourse abilities to show for it.

As someone working on designing better electric motors, I can tell you that "What exactly is this metric I'm trying to optimize for?" is a huge part of the job. I can get 30% more torque by increasing magnet strength, but it increases copper loss by 50%. Is that more better? I can drastically reduce vibration by skewing the stator but it will cost me a couple percent torque. Is that better or worse? There are a ton of things to trade between, and even if your end application is fairly well specified it's generally not specified well enough to remove all significant ambiguity in which choices are better.

It's true that there are some motor designs that are just better at everything (or everything one might "reasonably" care about), but that's true for discourse as well. For example, if you are literally just shrieking at each other, whatever you're trying to accomplish you can almost certainly accomplish it better by using words -- even if you're still going to scream those words.

The general rule is that if you suck relative to any nebulosity in where on the Pareto frontier you want to be, then there are "objective" gains to be made. In motors, simultaneous improvements in efficiency and power density will go far to create a "better" motor which will be widely recognized as such. In discourse, the ability to create shared understanding and cooperation will go far to create "better" discourse which will be widely regarded as such.
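The Pareto logic here can be sketched in a few lines: a design is an "objective" improvement over another when it's at least as good on every metric and strictly better on at least one; only after you've exhausted those dominated options does the nebulous, context-dependent choosing begin. The designs and numbers below are made up purely for illustration.

```python
# Illustrative sketch of the "objective gains" rule: dominated designs can be
# improved on unambiguously; choices along the frontier are the nebulous part.

def dominates(a, b):
    """True if design `a` is at least as good as `b` on every metric
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_frontier(designs):
    """Keep only the designs not dominated by any other design."""
    return [d for d in designs
            if not any(dominates(other, d) for other in designs if other is not d)]

# Hypothetical (efficiency %, power density kW/kg) pairs:
designs = [
    (92.0, 4.0),   # A: dominated by B -- an "objective" gain is available
    (95.0, 5.5),   # B: better than A on both metrics
    (97.0, 3.5),   # C: a genuine trade-off against B
]
frontier = pareto_frontier(designs)
# Only B and C survive; picking between them depends on the use case.
```

The same shape applies to the discourse example: "shrieking" is a dominated point, and moving to words is an objective gain regardless of which frontier point you ultimately want.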

Optimal motors and discourse will look different in different contexts, getting it exactly right for your use case will always be nebulous, and there will always be weird edge cases and people deliberately optimizing for the wrong thing. But it's really not different in principle.
