Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

When I'm arguing points like orthogonality and fragility of value, I've occasionally come across rejoinders that I'll (perhaps erroneously) summarize:

Superintelligences are not spawned fully-formed; they are created by some training process. And perhaps it is in the nature of training processes, especially training processes that involve multiple agents facing "social" problems or training processes intentionally designed by humans with friendliness in mind, that the inner optimizer winds up embodying the spirit of niceness and compassion.

Like, perhaps there just aren't all that many ways for a young mind to grow successfully in a world full of other agents with their own desires, and in the face of positive reinforcement for playing nicely with those agents, and negative reinforcement for crossing them. And perhaps one of the common ways for such a young mind to grow, is for it to internalize into its core goals the notions of kindness and compassion and respect-for-the-desires-of-others, in a manner broadly similar to humans. And, sure, this isn't guaranteed, but perhaps it's common enough that we can get young AI minds into the right broad basin, if we're explicitly trying to.

One piece of evidence for this view is that there aren't simple tweaks to human psychology that would make humans significantly more reproductively successful. Sociopathy isn't at fixation. Humans can in fact sniff out cheaters, and can sniff out people who want to make deals but who don't actually really care about you, and those people do less well. Actually caring about people in a readily-verifiable way is robustly useful in the current social equilibrium.

If it turns out to be easy-ish to instill similar sorts of caring into AI, then such an AI might not share human tastes in things like art or humor, but that might be fine, because it might embody broad cosmopolitan virtues — virtues that inspire it to cooperate with us to reach the stars, and not oppose us when we put a sizable portion of the stars toward Fun.

(Or perhaps we'll get even luckier still, and large swaths of human values will turn out to have pretty-wide basins that we can get the AI into if we're trying, so that it does share our sense of humor and laughs alongside us as we travel together to the stars!)

This view is an amalgam of stuff that I tentatively understand Divia Eden, John Wentworth and the shard theory advocates to be gesturing at.

I think this view is wrong, and I don't see much hope here. Here's a variety of propositions I believe that I think sharply contradict this view:

  1. There are lots of ways to do the work that niceness/kindness/compassion did in our ancestral environment, without being nice/kind/compassionate.
  2. The specific way that the niceness/kindness/compassion cluster shook out in us is highly detailed, and very contingent on the specifics of our ancestral environment (as factored through its effect on our genome) and our cognitive framework (calorie-constrained massively-parallel slow-firing neurons built according to DNA), and filling out those details differently likely results in something that is not relevantly "nice".
  3. Relatedly, but more specifically: empathy (and other critical parts of the human variant of niceness) seem(s) critically dependent on quirks in the human architecture. More generally, there are lots of different ways for the AI's mind to work differently from how you hope it works.
  4. The desirable properties likely get shredded under reflection. Once the AI is in the business of noticing and resolving conflicts and inefficiencies within itself (as is liable to happen when its goals are ad-hoc internalized correlates of some training objective), the way that its objectives ultimately shake out is quite sensitive to the specifics of its resolution strategies.

Expanding on 1): 

There are lots of ways to do the work that niceness/kindness/compassion did in our ancestral environment, without being nice/kind/compassionate.

We have niceness/kindness/compassion because our nice/kind/compassionate ancestors had more kids than their less-kind siblings. The work that niceness/kindness/compassion was doing ultimately grounded out in more children. Presumably that reproductive effect factored through a greater ability to form alliances, lowering the bar required for trust, etc.

It seems to me like "partially adopt the values of others" is only one way among many to get this effect, with others including but not limited to "have a reputation for, and a history of, honesty" and "be cognitively legible" and "fully merge with local potential allies immediately".


Expanding on 2):

The specific way that the niceness/kindness/compassion cluster shook out in us is highly detailed, and very contingent on the specifics of our ancestral environment (as factored through its effect on our genome) and our cognitive framework (calorie-constrained massively-parallel slow-firing neurons built according to DNA), and filling out those details differently likely results in something that is not relevantly "nice".

I think this perspective is reflected in Three Worlds Collide and "Kindness to Kin". Even if we limit our attention to minds that solve the "trust is hard" problem by adopting some of their would-be collaborators’ objectives, there are all sorts of parameters controlling precisely how this is done.

Like, how much of the other's objectives do you adopt, and to what degree?

How long does patience last?

How do you guard against exploitation by bad actors and fakers?

What sorts of cheating are you sensitive to?

What makes the difference between "live and let live"-style tolerance, and chumminess?

If you look at the specifics of how humans implement this stuff, it's chock full of detail. (And indeed, we should expect this a priori from the fact that the niceness/kindness/compassion cluster is a mere correlate of fitness. It's already a subgoal removed from the simple optimization target; it would be kinda surprising if there were only one way to form such a subgoal and if it weren’t situation-dependent!)

If you take a very dissimilar mind and fill out all the details in a very dissimilar way, the result is likely to be quite far from what humans would recognize as “niceness”!

In humans, power corrupts. Maybe in your alien AI mind, a slightly different type of power corrupts in a slightly different way, and next thing you know, it's stabbing you in the back and turning the universe towards its own ends. (Because you didn’t know to guard against that kind of corruption, because it’s an alien behavior with an alien trigger.)

I claim that there are many aspects of kindness, niceness, etc. that work like this, and that are liable to fail in unexpected ways if you rely on this as your central path to alignment.


Expanding on 3): 

Relatedly, but more specifically: empathy (and other critical parts of the human variant of niceness) seem(s) critically dependent on quirks in the human architecture. More generally, there are lots of different ways for the AI's mind to work differently from how you hope it works.

It looks pretty plausible to me that humans model other human beings using the same architecture that they use to model themselves. This seems pretty plausible a-priori as an algorithmic shortcut — a human and its peers are both human, so machinery for self-modeling will also tend to be useful for modeling others — and also seems pretty plausible a-priori as a way for evolution to stumble into self-modeling in the first place ("we've already got a brain-modeler sitting around, thanks to all that effort we put into keeping track of tribal politics").

Under this hypothesis, it's plausibly pretty easy for imaginations of others’ pain to trigger pain in a human mind, because the other-models and the self-models are already in a very compatible format.[1]

By contrast, an AI might work internally via an architecture that is very different from our own emotional architectures, with nothing precisely corresponding to our "emotions", and many different and distinct parts of the system doing the work that pain does in us. Such an AI is much less likely to learn to model humans in a format that's overlapped with its models of itself, and much less able to have imagined-pain-in-others coincide with the cognitive-motions-that-do-the-work-that-pain-does-in-us. And so, on this hypothesis, the AI entirely fails to develop empathy.

I'm not trying to say "and thus AIs will definitely not have empathy — checkmate"; I'm trying to use this as a single example of a more general fact: an AI, by dint of having a different cognitive architecture than a human, is liable to respond to similar training incentives in very different ways.

(Where, in real life, it will have different training incentives, but even if it did have the same incentives, it is liable to respond in different ways.)

Another, more general instance of the same point: Niceness/kindness/compassion are instrumental-subgoal correlates-of-fitness in the human ancestral environment, that humans latch onto as terminal goals in a very specific way, and the AI will likely latch onto instrumental-subgoals as terminal in some different way, because it works by specific mechanisms that are different than the human mechanism. And so the AI likely gets off the train even before it fails to get empathy in exactly the same way that humans did, because it's already in some totally different and foreign part of the "adopt instrumental goals" part of mindspace. (Or suchlike.)

And more generally still: Once you start telling stories about how the AI works internally, and see that details like whether the human-models and the self-models share architecture could have a large effect on the learned-behavior in places where humans would be learning empathy, then the rejoinder "well, maybe that's not actually where empathy comes from, because minds don't actually work like that" falls pretty flat. Human minds work somehow, and the AI's mind will also work somehow, and once you can see lots of specifics, you can see ways that the specifics are contingent. Most specific ways that a mind can work, that are not tightly analogous to the human way, are likely to cause the AI to learn something relevantly different, where we would be learning niceness/kindness/compassion.

Insofar as your only examples of minds are human minds, it's easy to imagine that perhaps all minds work similarly. And maybe, similarly, if all you knew was biology, you might expect that all great and powerful machines would have the squishy nature, with most of them being tasty if you cook them long enough in a fire. But the more you start understanding how machines work, the more you see how many facts about the workings of those machines are contingent, and the less you expect vehicular machines to robustly taste good when cooked. (Even if horses are the best vehicle currently around!)


Expanding on 4): 

The desirable properties likely get shredded under reflection. Once the AI is in the business of noticing and resolving conflicts and inefficiencies within itself (as is liable to happen when its goals are ad-hoc internalized correlates of some training objective), the way that its objectives ultimately shake out is quite sensitive to the specifics of its resolution strategies.

Suppose you shape your training objectives with the goal that they're better-achieved if the AI exhibits nice/kind/compassionate behavior. One hurdle you're up against is, of course, that the AI might find ways to exhibit related behavior without internalizing those instrumental-subgoals as core values. If ever the AI finds better ways to achieve those ends before those subgoals are internalized as terminal goals, you're in trouble.

And this problem amps up when the AI starts reflecting. 

E.g.: maybe those values are somewhat internalized as subgoals, but only when the AI is running direct object-level reasoning about specific people. Whereas when the AI thinks about game theory abstractly, it recommends all sorts of non-nice things (similar to real-life game theorists). And perhaps, under reflection, the AI decides that the game theory is the right way to do things, and rips the whole niceness/kindness/compassion architecture out of itself, and replaces it with other tools that do the same work just as well, but without mistaking the instrumental task for an end in-and-of-itself.

Lest this example feel completely implausible, imagine a human who quite enjoys dunking on the outgroup and being snide about it, but with a hint of doubt that eventually causes them — on reflection — to reform, and to flinch away from snideness. The small hint of doubt can be carried pretty far by reflection. The fact that the pleasure of dunking on the outgroup is louder, is not much evidence that it's going to win as reflective ability is amplified.

Another example of this sort of dynamic in humans: humans are able to read some philosophy books and then commit really hard to religiosity or nihilism or whatever, in ways that look quite misguided to people who understand the Law. This is a relatively naive mistake, but it's a fine example of the agent's alleged goals being very sensitive to small differences in how it resolves internal inconsistencies about abstract ("philosophical") questions.

A similar pattern can get pretty dangerous when working with an AGI that acts out its own ideals, and that resolves “philosophical” questions very differently than we might — and thus is liable to take whatever analogs of niceness/kindness/compassion initially get baked into it (as a correlate of training objectives), and change them in very different ways than we would.

E.g.: Perhaps the AI sees that its “niceness” binds only when there's actually a smiling human in front of its camera, and not in the case of distant humans that it cannot see (in the same way that human desire to save a drowning child binds only in specific contexts). And perhaps the AI uses slightly different reflective resolution methods than we would, and resolves this conflict not by generalizing niceness, but by discarding it.

And: all these specific examples are implausible, sure. But again, I'm angling for a more general point here: once the AI is reflecting, small shifts in reflection-space (like "let's stop being snide") can cause large shifts in behavior-space.

So even if by some miracle the vast differences in architecture and training regime only produce minor (and survivable) differences between human niceness/kindness/compassion and the AI's ad-hoc partial internalizations of instrumental objectives like "be legibly cooperative to your trading partners", similarly-small differences in its reflective-stabilization methods are liable to result in big differences at the reflective equilibrium.
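To make the shape of that last claim concrete, here is a deliberately crude toy, not a model of any real system. The update rule, the `threshold` parameter, and the idea of compressing "niceness" into a single weight are all assumptions invented for illustration. The only thing it shows is that when reflection amplifies whichever side of a conflict it favors, two agents that start with identical values and differ slightly in their resolution rule can land at opposite equilibria.

```python
def reflect(niceness: float, threshold: float,
            steps: int = 200, rate: float = 0.5) -> float:
    """Iterate a made-up 'reflective resolution' rule.

    niceness:  weight in [0, 1] the agent puts on its nice-ish drive.
    threshold: hypothetical knob standing in for the resolution strategy;
               weights above it get reinforced on reflection, weights
               below it get eroded.
    """
    x = niceness
    for _ in range(steps):
        # Bistable update: fixed points at 0, threshold, and 1,
        # with the middle one unstable.
        x += rate * x * (1.0 - x) * (x - threshold)
        x = min(max(x, 0.0), 1.0)  # keep the weight in [0, 1]
    return x


# Same starting values, slightly different resolution strategies.
kept = reflect(0.5, threshold=0.49)       # ends near 1: niceness generalized
discarded = reflect(0.5, threshold=0.51)  # ends near 0: niceness ripped out
print(kept, discarded)
```

Nothing here says real reflective processes are bistable in this way; the sketch only illustrates why "the initial values were close to ours" does not by itself pin down where reflection ends up.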


 

[1] I suspect I'm one of the people that caused Steven to write up his quick notes on mirror-neurons, because I was trying to make this point to him, and I think he misunderstood me as saying something stupid about mirror neurons. ETA: nope!


I suspect I'm one of the people that caused Steven to write up his quick notes on mirror-neurons, because I was trying to make this point to him, and I think he misunderstood me as saying something stupid about mirror neurons.

Nope, I don’t remember you ever saying or writing anything stupid (or anything at all) about mirror neurons. That post was not in response to anything in particular and has no hidden agenda. :-)

…our ancestral environment…

I strongly agree that it’s a bad idea to try to get nice AGIs by doing a blind evolution-like outer-loop search process in an environment where multiple AGIs might benefit from cooperation—see Section 8.3.3.1 here for my three reasons why (which seem complementary to yours).

However, I don’t think that blind evolution-like outer-loop search processes are an ingredient in either shard theory or “alignment by default”.

At least in the shard theory case, the shard theory people seem very clear that when they talk about humans, they’re thinking about within-lifetime learning, not human evolution. For example, they have a post that says “Evolution is a bad analogy for AGI” right in the title!! (I agree btw.)

Expanding on 3):…

OK, now it seems that the post is maybe shifting away from evolution and towards within-lifetime learning, which I like.

In that case, I think there are innate drives that lead (non-psychopathic) humans to feel various social instincts, some of which are related to “niceness”. I think it would be valuable to understand exactly how these innate drives work, and that’s why I’ve been spending 80% of my time doing that. There are a few reasons that it seems valuable. At the very least, this information would give us examples to ground the yet-to-be-invented science that (we hope) will issue predictions like “If an AGI has innate drives X, and training environment Y, it will “grow up” into a trained AGI that wants to do Z”.

A stronger claim (which I don’t endorse) would be “We should put those exact same niceness-related innate drives, built the exact same way, into an AGI, and then we’ve solved alignment!” That seems like almost definitely a very bad plan to me. (See here.) The thing about empathy that you mentioned is one reason. Likewise, for all I know right now, the innate drives are implemented in a way that depends on having a human body and growing up at human speed in a human family etc.

However, if we understand how those innate drives work in humans, then we don’t have to slavishly copy them into an AGI. We can tailor them. Or we can come up with superficially-quite-different approaches that wind up in a similar place. Alignment-by-default would be in that “superficially quite different” category, I think? (As for shard theory, I’m a bit hazy on exactly what the plan is.)

Expanding on 4): 

I want to register strong agreement that this is an area where things can go awry.

When I'm arguing points like orthogonality and fragility of value, I've occasionally come across rejoinders that I'll (perhaps erroneously) summarize:

Implicit in this fragility of value argument is the assumption that the AI must learn/acquire/represent human values in order to optimize for human values.

Turns out this assumption - although quite natural and reasonable at the time - is probably false.

FAI can simply optimize for human empowerment instead (our future ability to fulfill any goals). The actual goals/values are then irrelevant. This has the key advantage of being a simpler and thus likely more stable objective.

Human altruism may already exploit this principle - as our altruism extends to animals and likely other beings, and thus can’t rely on specific complex human values.

"Future ability to fulfil any goals" can be optimized by changing our current goals. Like, just turn us into something that is still technically us but is basically an alien superintelligence, and don't care that the goals are drifting randomly in the process.

"Future ability to fulfil our current goals" requires successfully pointing to our current goals, i.e. it requires alignment.

"Future ability to fulfil any goals" can be optimized by changing our current goals.

I regrettably used the wrong short English translation of empowerment. A better translation would be maximization of future optionality. An empowerment objective optimizing for future optionality or ability to "fulfill any goals" is - by definition - invariant/indifferent to the specific goals. If you look at the typical empowerment formulations they don't deal with goals/values at all.

So the only way the AI would seek to modify our goals/values is if there were instrumental reasons to do so, which seems unlikely given the assumption that the AI is much more powerful than us, and the great risk it would be taking on in trying to modify our values assuming it was not more powerful than us.
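For readers who haven't seen an empowerment formulation, here is a minimal sketch. The comment above doesn't name a specific one; the usual definition in the literature is the channel capacity between an agent's n-step action sequences and the resulting state, and for deterministic dynamics that reduces to log2 of the number of distinct states reachable in n steps. The corridor world, `step`, and `empowerment` below are hypothetical names made up for illustration.

```python
import math
from itertools import product

# Hypothetical 1-D corridor: states 0..4; actions move left, stay, or right.
ACTIONS = (-1, 0, 1)
N_STATES = 5


def step(state: int, action: int) -> int:
    """Deterministic transition, clamped at the walls."""
    return min(max(state + action, 0), N_STATES - 1)


def empowerment(state: int, horizon: int) -> float:
    """n-step empowerment in bits for a deterministic environment.

    With deterministic dynamics, the capacity of the channel from action
    sequences to final states is log2 of the number of distinct final
    states reachable in exactly `horizon` steps.
    """
    reachable = set()
    for seq in product(ACTIONS, repeat=horizon):
        s = state
        for a in seq:
            s = step(s, a)
        reachable.add(s)
    return math.log2(len(reachable))


# The middle of the corridor keeps more options open than the corner.
print(empowerment(2, 2), empowerment(0, 2))
```

Note that no reward or goal appears anywhere in the computation, which is the sense in which the objective is goal-agnostic. Whether optimizing this quantity on a human's behalf preserves that human's values is a separate question that the sketch does not address.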

I understood what you meant; my objection stands. 

The AI does not love modifying your goals/values, nor does it hate modifying your goals/values, but modifying your goals/values is part of the easiest strategy for empowering you, since the easiest strategy for empowering you involves changing you massively to make you smarter, probably breaking a bunch of other things in the process, including your goals/values. That which is not explicitly treasured is at risk of being optimized away. Variables not explicitly valued--in this case continuity of goals/values--are at risk of being set to extreme values in order to optimize other variables (in this case, zero continuity).
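The "variables not explicitly valued get set to extreme values" point can be shown with a toy optimizer. Everything here is a made-up illustration: the budget, the two objective functions, and above all the assumption that preserving goal continuity competes with capability for the same resources. That trade-off is the contested premise, so the sketch shows the logic of the claim rather than evidence for it.

```python
BUDGET = 10  # hypothetical units of optimization effort


def plans():
    """Every way to split the budget between boosting capability
    and preserving continuity of goals/values."""
    for capability in range(BUDGET + 1):
        yield capability, BUDGET - capability


def empowerment_only(plan):
    capability, continuity = plan
    return capability  # continuity never appears in the objective


def empowerment_and_continuity(plan):
    capability, continuity = plan
    return capability * continuity  # hypothetical objective that treasures both


best_unvalued = max(plans(), key=empowerment_only)
best_valued = max(plans(), key=empowerment_and_continuity)
print(best_unvalued)  # (10, 0): the unvalued variable is driven to zero
print(best_valued)    # (5, 5): valuing it explicitly yields an interior split
```

If the trade-off doesn't exist (that is, if continuity costs nothing), the first optimizer is merely indifferent to continuity rather than hostile to it, and the outcome depends on tie-breaking.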

modifying your goals/values is part of the easiest strategy for empowering you, since the easiest strategy for empowering you involves changing you massively to make you smarter,

The likely easiest strategy for the AI to do anything is to make itself smarter. You are making several unjustified assumptions:

  1. That upgrading the intelligence of existing humans is easier than upgrading the intelligence of the AI. This seems extraordinarily unlikely - especially given the starting assumption of AGI. Even if the 'humans' were already uploads, it would at most be a wash.
  2. That upgrading the intelligence of existing humans would probably break a bunch of other things in the process, including goals/values.

Even if we grant 2 (which is hardly obvious given how the brain works - for example it should be fairly easy to add new neural power to uploads (in the form of new cortical/cerebellar modules), after we have mastered brain-like AGI), that only contradicts 1: If upgrading the intelligence of existing humans risks breaking things, this is only all the more reason to instead upgrade the intelligence of the AI.

If anything there's more risk of the opposite of your concern - that the AI would not upgrade our intelligence as much as we'd like. But it's also not obvious how much of a problem that is, unless humans have some desire for intelligence for non-instrumental reasons, as otherwise the AI upgrading itself should be equivalent to upgrading us. This type of AI really is an extension of our will, and difficult to distinguish from an exocortex-like brain upgrade.

Also, to be clear - I am not super confident that a pure external empowerment AI is exactly the best option, but I'm fairly confident it is enormously better than a self-empowering AI, and the context here is refuting the assumption that failure to explicitly learn complex fragile human values results in AI that kills us all or worse.

Ahhhhhhhhh, interesting! I do certainly agree that upgrading the AI's own intelligence is easier than upgrading human intelligence. Hadn't thought of that. I still firmly hold opinion 2 though.

Let's think step by step:

It does seem plausible that the easiest & most robust way to empower a given human, in the sense of making it possible for them to achieve arbitrary goals, if you are a powerful AGI, is to build another powerful AGI like yourself but corrigible (or maybe indirectly aligned?) to the human, so that when you hypothetically/counterfactually vary the goals of the human, they still end up getting achieved.

If so, then yeah, seems like making an aligned/corrigible AI reduces to the problem of making the AI optimize for human empowerment.

This is plausibly somewhat easier than solving alignment/corrigibility, because it's maybe a simpler concept to point to. The concept of "what this human truly wants" is maybe pretty niche and complex whereas the concept of "empower this human" is less so. So if you are hitting the system with SGD until it scores well in your training environment, you might be able to bonk it into a shape where it is trying to empower this human more easily than you can bonk it into a shape where it is trying to do what this human truly wants. You still have to worry about the classic problems (e.g. maybe "do whatever will get high reward in training, then later do whatever you want" is even simpler/more natural and will overwhelmingly be the result you get instead in both cases) but the situation seems somewhat improved at least.

Thanks, I hope people think more about this. I guess it feels like it might help, though I guess we already had corrigibility and honesty as simpler concepts that we could bootstrap with and this is just a third (and probably not as good) one.
 

I've updated towards altruistic empowerment being pretty key, here is the extended argument.

I'm pretty sure John has disavowed the argument presented in Alignment by Default.

EDIT: And alignment by default was based on the notion that "natural abstractions are pretty discrete", "there is a natural abstraction for the space of human values" and "the training process will bias the AI towards natural abstractions". I think John disavows the middle claim. Which, to be clear, is stating that there is a natural abstraction which serves as a pointer to human values. Not that the AI has a bunch of computations running niceness or co-operation or honour or so on. 

Specifically, the claim I endorse is that the inputs to human values are natural abstractions. This is mostly a claim in relation to the Pointers Problem: humans value high-level abstract stuff, but we value that stuff "in the world" (as opposed to valuing our perceptions of the stuff), but abstraction is ultimately in the mind rather than the territory... hard to make those pieces play together at the same time. I claim that the way those pieces fit together is that the things-humans-value are expressed in terms of natural abstractions.

That does not imply that "human values" themselves are a natural abstraction; I'm agnostic on that question. Alignment by Default unfortunately did a poor job communicating the distinction.

My intuition is that niceness/cooperation/etc are more likely than "human values" to be natural abstractions, but I still wouldn't be very highly confident. And even if human versions of niceness/cooperation/etc are natural abstractions, I agree with Nate's point that (in my terminology) they may be specific to an environment composed mainly of human minds.

The whole matter is something I have a lot of uncertainty around. In general, which value-loaded concepts are natural abstractions (and in which environments) is an empirical question, and my main hope is that those questions can be answered empirically before full-blown superhuman AGI.

My personal interpretation of the hope that lies in pursuing a brain-like AGI research agenda very specifically hinges on absolutely not leaving it 'up to chance' to hopefully stumble into an agentive mind that has compassion/empathy/kindness. I think, for reasons roughly in agreement with the ones you express here, that that is a doomed endeavor.

Here is what I believe:

Relatedly, but more specifically: empathy (and other critical parts of the human variant of niceness) seem(s) critically dependent on quirks in the human architecture.

This summarizes my current belief in that I do think we must study and replicate the core functionality of those specific empathy-related quirks in order to have any hope of getting empathy-related behaviors.

I think this testing should be conducted in carefully secured and censored simulation environments as described here by Jacob Cannell: https://www.lesswrong.com/posts/WKGZBCYAbZ6WGsKHc/love-in-a-simbox-is-all-you-need 

I think that the next logical step of "the agentive mind reflectively notices this game-theoretically suboptimal behavior in itself and edits it out" is a risk, but one that can be mitigated by keeping the agent in a secure information-controlled environment with alarms and security measures taken to prevent it from self-modifying. In such an environment it could suggest something like an architecture improvement for our next generation of AGIs, but that plan would be something we would analyze carefully before experimenting with. We would not simply let the agent spawn new agents.

I think a thornier point that I feel less confident about is the risk that the agentive mind "resolves “philosophical” questions very differently" and thus does not generalize niceness into highly abstract realms of thought and planning. I believe this point is in need of more careful consideration. I don't think 'hope for the best' is a good plan here. I think we can potentially come up with a plan though. And I think we can potentially run iterative experiments and make incremental changes to a safely-contained agentive mind to try to get closer to a mind that robustly generalizes its hardwired empathy to abstract planning.

So, I think this is definitely not a solution-complete path to alignment. I think it would be a hopeless path without strong interpretability tools and a very safe containment, and the ability to carefully limit the capabilities of the agentic mind during testing with various sorts of impairments. I think the assumption of a superintelligent AGI with no adjustable knobs on its inference speed or intelligence is tantamount to saying, "oops, too late, we already failed". Like, trying to plan out how to survive a free solo rock climb starting from the assumption that you've already slipped from a lethal height and are in the process of falling. The hope of success, however slim, was almost entirely before the slip.

Throwing in a perspective, as someone who has been a reviewer on most of the shard theory posts & is generally on board with them. I agree with your headline claim that "niceness is unnatural" in the sense that niceness/friendliness will not just happen by default, but not in the sense that it has no attractor basin whatsoever, or that it is incoherent altogether (which I don't take you to be saying). A few comments on the four propositions:

There are lots of ways to do the work that niceness/kindness/compassion did in our ancestral environment, without being nice/kind/compassionate.

Yes! Recapitulating those selection pressures (the ones that happened to have led to niceness-supporting reward circuitry & inductive biases in our case) indeed seems like a doomed plan. There are many ways for that optimization process to shake out, nearly all of them ruinous. It is also unnecessary. Reverse-engineering the machinery underlying social instincts doesn't require us to redo the evolutionary search process that produced them, nor is that the way I think we will probably develop the relevant AI systems.

The specific way that the niceness/kindness/compassion cluster shook out in us is highly detailed, and very contingent on the specifics of our ancestral environment (as factored through its effect on our genome) and our cognitive framework (calorie-constrained massively-parallel slow-firing neurons built according to DNA), and filling out those details differently likely results in something that is not relevantly "nice".

Similar to the above, I agree that the particular form of niceness in humans developed because of "specifics of our ancestral environment", but note that the effects of those contingencies are pretty much screened off by the actual design of human minds. If we really wanted to replicate that niceness, I think we could do so without reference to DNA or calorie-constraints or firing speeds, using the same toolbox as we already use in designing artificial neural networks & cognitive architectures for other purposes. That being said, I don't think "everyday niceness circa 2022" is the right kind of cognition to be targeting, so I don't worry too much about the contingent details of that particular object, whereas I worry a lot about getting something that terminally cares about other agents at all, which seems to me like one of the hard parts of the problem.

Relatedly, but more specifically: empathy (and other critical parts of the human variant of niceness) seem(s) critically dependent on quirks in the human architecture. More generally, there are lots of different ways for the AI's mind to work differently from how you hope it works.

If empathy or niceness or altruism—or whatever other human-compatible cognition we need the AI's mind to contain—depends critically on some particular architectural choice like "modeling others with the same circuits as the ones with which you model yourself", then... that's the name of the game, right? Those are the design constraints that we have to work under. I separately also believe we will make some similar design choices because (1) the near-term trajectory of AI research points in that general direction and (2) as you note, they are easy shortcuts (ML always takes easy shortcuts). I do not expect those views to be shared, though.

The desirable properties likely get shredded under reflection. Once the AI is in the business of noticing and resolving conflicts and inefficiencies within itself (as is liable to happen when its goals are ad-hoc internalized correlates of some training objective), the way that its objectives ultimately shake out is quite sensitive to the specifics of its resolution strategies.

Maybe? It seems plausible to me that, if an agent already terminally values altruism and endorses that valuing, then as it attempts to resolve the remaining conflicts within itself, it will try to avoid resolutions that foreseeably-to-it remove or corrupt its altruism-value. It sounds like you are thinking specifically about the period after the AI has internalized the value somewhat, but before the AI reflectively endorses it? If so, then yes, I agree: ensuring that a particular value hooks into the reflective process well enough to make itself permanent is likely nontrivial. This is what I believe TurnTrout was pointing at in "A shot at the diamond alignment problem", in the major open questions list:

4. How do we ensure that the diamond shard generalizes and interfaces with the agent's self-model so as to prevent itself from being removed by other shards?

Not central to the substance of your claims, but it seems like a good quick improvement to me. I think this post would be more aptly named "Niceness is contingent", or some such. "Niceness is unnatural" is false. Niceness values occur often and reliably in nature, and so they are natural. 

(Perhaps will respond more substantively later, once I judge myself to have a clearer picture of what you're arguing.)

....the way that its objectives ultimately shake out is quite sensitive to the specifics of its resolution strategies.

....and replaces it with other tools that do the same work just as well, but without mistaking the instrumental task for an end in-and-of-itself.

 

If the terminal values are changing, then the changes aren't just resolving purely-instrumental incoherences. Where do terminal values come from? What's the criterion that the incoherence-resolving process uses to choose between different possible reflectively consistent states (e.g. different utility functions)?

These weird, seemingly incoherent things like "adopt others' values" might be reflectively stable if they also operate at the meta-level of coherentifying. It's delusional to think that this is the default without good reason, but that doesn't rule out something like this happening by design.

Whether or not details (and lots of specific detail arguments) matter hinges on the sensitivity argument (which is an argument about basins?) in general, so I'd like to see that addressed directly. What are the arguments for high sensitivity worlds other than anthropics? What is the detailed anthropic argument?

I think the question of whether agents can develop niceness reliably in suitable environments is a cornerstone of Shard Theory, Brain-like AGI, and related approaches. I don't think Nate's argument is watertight; it carries lots of uncertainty itself, e.g.,

Presumably that reproductive effect factored through a greater ability to form alliances

hedges with "presumably", and we don't know how the greater ability to form alliances comes about - maybe via components of niceness. But I don't want to argue the points - they are not strong enough to be wrong. I want empirical facts. Shut up and Calculate! I think we are at a stage where the question can be settled with experiments. That is what the research agenda of the Brain-like AGI project "aintelope" calls for, and it is also, as I understand it, what the Shard Theory team is aiming at (with a different type of environment).

I definitely agree that deceptive alignment seems likely to break black-box properties such as niceness by default, thanks to the simplicity prior and the fact that internal or corrigible alignment is harder to reach than deceptive alignment, at least once the AI has a world-model.

calorie-constrained massively-parallel slow-firing neurons built according to DNA

coming back to this post, I had the thought - no amount of nanotechnology will ever change this general pattern. it cannot. no machine you'd rather build would even in principle be able to be anything but an equivalent to that, if you want to approach energy efficiency limits while also having a low error rate. you always want self correction everywhere, low energy overhead in the computing material, etc. you can put a lot more compute together, but when that computronium cooperates with itself, you can bet that as part of the denoising process it's gonna have to do forgiveness for mistakes that had been claimed to have been disproven.

computronium-to-computronium friendliness is the first problem we must solve, if we wish to solve anything.

and I think we can prove things about the friendliness of computronium despite the presence of heavy noise, if we're good at rotating the surfaces of possibility in response to noise.

if your ideal speedup model is a diffusion model implemented via physical systems optimization, which it very much seems like it might be, then what we really want to prove is something about agents' volumetric boundaries in the universe, and their personal trajectories within that boundary. because of the presence of noise, we can only ever verify it up to a margin anyway.

there's something important to understand here, I think. you never want to get rid of the efficiency. you want to improve your noise ratio, sure, but the shape of physics requires efficient brains to be physically shaped some set of ways if they want to be fast. those ways are varied, and include many, many communication shapes besides our own; but you always want to be volumetric.

and volumetric self-verification introduces a really difficult coordination problem, because now you have an honesty problem if you try to do open source game theory. in a distributed system, nodes can lie about their source code, easy! and it's really hard to do enough error checking to be sure everyone was honest - you can do it, but do you really want to spend that much energy, all the time? seems heavy error checking is something to do when there's a shape you want to be sure you fully repaired away, such as messes, cancers, diseases, etc. your immune system doesn't need to verify everyone at all times.
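
a minimal sketch of that honesty problem, under loud assumptions: the `Node` class, the "clique" rule (cooperate only with peers whose claimed source fingerprint matches your own), and every name below are hypothetical illustrations, not anything from the post or from an existing library. the point it shows is only that a check on *claimed* source code says nothing about the policy a node actually runs.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Tuple


def fingerprint(source: str) -> str:
    """Hash of a source string; stands in for 'publishing your code'."""
    return hashlib.sha256(source.encode()).hexdigest()


# Hypothetical published source text for the cooperative "clique" program.
CLIQUE_SOURCE = "cooperate iff the peer's claimed fingerprint equals mine"


def clique_policy(peer_claim: str) -> str:
    # Cooperate ("C") only if the peer claims to run the same program.
    return "C" if peer_claim == fingerprint(CLIQUE_SOURCE) else "D"


def defect_policy(peer_claim: str) -> str:
    # Always defect ("D"), whatever the peer claims.
    return "D"


@dataclass
class Node:
    name: str
    claimed_source: str                   # what the node tells its peers
    actual_policy: Callable[[str], str]   # what the node really executes


def play(a: Node, b: Node) -> Tuple[str, str]:
    """Each node sees only the other's *claimed* fingerprint."""
    move_a = a.actual_policy(fingerprint(b.claimed_source))
    move_b = b.actual_policy(fingerprint(a.claimed_source))
    return move_a, move_b


honest_1 = Node("honest_1", CLIQUE_SOURCE, clique_policy)
honest_2 = Node("honest_2", CLIQUE_SOURCE, clique_policy)
# The liar publishes the clique source but runs something else.
liar = Node("liar", CLIQUE_SOURCE, defect_policy)

print(play(honest_1, honest_2))  # two honest nodes
print(play(honest_1, liar))      # honest node facing a liar
```

in this toy setup two honest nodes cooperate with each other, while the honest node still cooperates with the liar and gets defected against, because nothing ties the claimed source to the executed policy. closing that gap is the expensive error checking described above.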

in order to ensure every local system prevents honesty problems, it needs to be able to output a verification of itself. but those verifications can become uncorrelated with their true purpose because one of the steps failed invisibly, and having a sufficient network of verifications of a self-perception is doable but quite hard.

it seems like ultimately what you're worried about is that an ai will want to found an ai-only state due to disinterest in letting us spend our energy to self improve a little more slowly.