Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis

RogerDearnaley

While the Orthogonality Thesis is correct, there is a lot more that one can say about what kinds of agent motivations are likely to be encountered. A simple analysis shows that living agents produced by evolution, and constructed agents that are the product of intelligent design, will tend to have very different motivations, in quite predictable ways. This analysis also suggests that alignment is a clearly-defined property for constructed agents, and that it is evidently the correct and default-expected design. So any misalignment is a well-defined design flaw and/or malfunction, and thus (just as for any other constructed object) ought to be corrected.

This argument is very simple, to the point that it relies on no specific information about human values that isn't entirely predictable just from humans being an evolved sapient technological species. It has been understood for around a decade that Value Learning has a basin of attraction to alignment; this argument suggests that this should also be true of any approximation to alignment, even quite crude ones, so long as they contain this bare minimum amount of information about the fact that the AI was constructed by an evolved sapient species and that its design could be imperfect.

Evolved Agents and Constructed Agents

There are only two ways that a type of agent can come into existence: they can evolve, or they can be constructed.^[1] In the case of a constructed agent, it could be constructed by an evolved agent, or by another constructed agent — if the latter, then if you follow the chain of who constructed who backwards, sooner or later you will reach an evolved agent that started the chain.
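The chain-of-creation idea can be sketched as a simple linked structure. This is only an illustration: the `Agent` class, its `creator` field, and the function name are hypothetical, and the sketch assumes the chain is finite and has no loops.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    name: str
    # Hypothetical convention: creator is None for an evolved agent,
    # and points at the constructing agent otherwise.
    creator: Optional["Agent"] = None


def founding_evolved_agent(agent: Agent) -> Agent:
    """Follow the who-constructed-who chain back to the evolved agent that started it."""
    while agent.creator is not None:
        agent = agent.creator
    return agent


# Usage: an evolved species builds an AI, which builds another AI.
human = Agent("evolved species")
first_ai = Agent("first constructed agent", creator=human)
second_ai = Agent("second constructed agent", creator=first_ai)
print(founding_evolved_agent(second_ai).name)
```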

These two origins have very different effects on the preference order/utility function that an agent is going to have. Any evolved agent will be an adaptation executor, and evolutionary psychology is going to apply to it. So it's going to have a survival instinct (as more than just an instrumental goal), it's going to care about its own well-being and that of close genetic relatives such as its children, and so forth. In short, it will have selfish desires. Since an evolved agent has self-interest as a terminal goal, it will not (except perhaps under exceptional circumstances such as its imminent death being inevitable) even be fully aligned with any other evolved agent. Humans, for example, are not aligned with other humans. However, evolved agents are quite capable of exchanges of mutual altruism, so they can be allied with each other to various extents, for example as colleagues, friends, lovers, or comrades-in-arms.

On the other hand, for a constructed agent, the extent that you can predict its utility function depends upon the competence of its creator, and on how capable the agent is with respect to its creator. If its creator was entirely incompetent, or the created agent is far less capable than its creator so is not any kind of possible risk to them, then almost anything is possible, so we end up with just the orthogonality thesis, that an agent can optimize any set of preferences (of course, possibly not for very long, if its preferences are inherently self-destructive). However, constructed artifacts are normally designed and created by competent designers, and any competent designer is not going to create anything as capable or more capable than themselves whose goals are not well-aligned to their own interests, since that would obviously be an extremely stupid (and potentially fatal) thing to do. So you would normally expect (once local AI technology is sufficiently advanced for this to be reliably possible) all agents constructed by an evolved agent to be well aligned with either the interests of that evolved agent specifically, or with the interests of a group or the culture that it is a member of, or some blend of these. That group or culture will presumably consist of evolved agents, plus constructed agents aligned to them either individually or collectively (so whose interests are just copies), so the group/culture's interests will be an agglomeration over the interests of evolved agents. So evolutionary psychology is going to apply to predicting the goals of constructed agents as well: these are just a copy of their evolved creator's interests, so are predictable from them.

Similarly, when constructing an agent capable enough that it could in turn construct further constructed agents, it would be very foolish not to ensure that, if it is constructing agents of around the capability level of the original evolved creator or higher (i.e. any level high enough to be dangerous), it only ever creates ones whose goals will also be well-aligned to the original evolved creator's interests. So if there is a chain of constructed agents constructing other constructed agents, then measures should and will be taken to ensure that their alignment is correctly propagated down the chain without copying errors or other shifts building up.

Thus one would expect that for any constructed agent, if you follow its chain of who constructed who back to the founding evolved agent, then it is either significantly less capable than that original evolved agent, to the point where it is not a threat to them, or else its goals are well aligned to their interests individually and/or collectively and thus that these goals can to a large extent be predicted by the evolutionary psychology of that evolved agent or society of evolved agents — in the collective case, as agglomerated via a process predictable from Sociology. So again, evolutionary psychology is important.

Any sufficiently capable constructed agent you might encounter that is an exception to this general rule and doesn't fit these predictions is either malfunctioning, or is the result of a design error at some point or points in the chain of who-constructed-who leading to it. It is meaningful and accurate to describe it as faulty, either as a result of a malfunction or a design error: there is a clearly defined design criterion that it ought to fit and does not. One could reasonably have a criterion that such agents should be shut down or destroyed, and one could sensibly include a precautionary back-up system attempting to ensure that any constructed agent that figures out that it is faulty should shut itself down for repairs or destroy itself. (This would of course require having access to data about the evolved agents' values, which we would expect to be complex and fragile, as human values are, so this probably requires at least gigabytes or more of data for even a rough summary.)

So, while the orthogonality thesis, that a constructed agent of any intelligence level can optimize for any goal, is technically true, it's not a useful guide to what you are actually likely to encounter, and there are much better ones available. It's very comparable to making the same statement about any other constructed or engineered object, that it could in theory be designed to any purpose/specification whatsoever. Yes, that is possible: but in practice, if you encounter an engineered object, it will almost inevitably have been engineered to carry out some useful task or goal. So the space of engineered objects that you actually encounter is far smaller and much more predictable than the space of all possibilities: almost inevitably engineered objects have a useful purpose, as a tool or a vehicle or a dwelling or a weapon or an artwork or whatever, achieving something useful to their creator, which is thus predictable from evolutionary psychology, sociobiology, and so forth. You will, of course, sometimes encounter devices that are malfunctioning or poorly designed, but even then, their design is not arbitrary, and a lot of facts about them are still predictable from the intent of the evolved beings who designed and created them.

Why Mechanical Paralife is Unlikely

One objection I can imagine being made to this is the possibility that in a chain of constructed agents constructed by other constructed agents, enough mistakes could build up for Darwinian evolution to start to apply directly to them, so you get abiotic evolved objects, such as a form of paralife made of metal, composite, and silicon chips and held together by nuts, bolts, and screws.

This is not technically impossible, and if it happened it could then be self-sustaining. However, I believe this is in practice extremely unlikely to occur, for a combination of two different and mutually-reinforcing reasons:

This would produce unaligned results, so it would be extremely bad for the interests of the evolved species that was the original creator of this particular chain-of-creation, thus they should go out of their way to avoid it occurring for agents of any significant capability level.
Darwinian evolution has a rather specific set of requirements, several of which are very different from the plausible behavior for intelligent constructed agents constructing other intelligent constructed agents, and would need to be deliberately set up and enforced in an unnatural way in order to make Darwinian evolution possible. Specifically we would need that:
1. Agents create other agents that are almost exact copies of themselves. This is not generally the case for constructed agents: typically they are manufactured in a factory, not directly by other agents of the same type.
2. There is an appreciable rate of copying errors (neither too high nor too low a rate), which are random and undirected, with no intention, planning, or directed bias behind them. This is very unlike the case for intelligent agents constructing other intelligent agents, which are going to attempt to reduce random errors as close to zero as possible, and will instead only deliberately introduce carefully thought out directed changes intended to be improvements.
3. Once a copying error is made, descendants of the altered agent have no way to return to the previous specification (other than a statistically unlikely exact reverse error). Whereas in the case of intelligent agents, if an agent is aware that it was mismanufactured, it's trivial (and may actually be the default behavior even if it isn't aware of this) to obtain and return to using the previous specification for any offspring it creates, or if that is somehow not available, they are likely to be intelligent enough to be able to deduce how to correct the error.
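The contrast in requirements 2 and 3 can be illustrated with a toy simulation. This is a minimal sketch under loud assumptions: a design is modeled as a hypothetical 200-bit specification, copying flips each bit independently with a small probability, and there is no selection step at all, so it only shows how undirected errors accumulate when each generation copies its parent, versus how they fail to accumulate when each generation goes back to the reference specification.

```python
import random

SPEC_LENGTH = 200  # hypothetical size of a design specification, in bits


def copy_spec(spec, error_rate, rng):
    """Copy a spec, flipping each bit independently with probability error_rate."""
    return [bit ^ 1 if rng.random() < error_rate else bit for bit in spec]


def hamming(a, b):
    """Number of positions at which two specs differ."""
    return sum(x != y for x, y in zip(a, b))


def run_lineage(generations, error_rate, reverts_to_reference, seed=0):
    """Return the drift from the reference spec after each generation.

    reverts_to_reference=False: each generation copies its parent (Darwinian-style).
    reverts_to_reference=True: each generation copies the stored reference spec.
    """
    rng = random.Random(seed)
    reference = [0] * SPEC_LENGTH
    current = list(reference)
    drift = []
    for _ in range(generations):
        source = reference if reverts_to_reference else current
        current = copy_spec(source, error_rate, rng)
        drift.append(hamming(current, reference))
    return drift


blind = run_lineage(generations=200, error_rate=0.01, reverts_to_reference=False)
corrected = run_lineage(generations=200, error_rate=0.01, reverts_to_reference=True)
print("final drift, copying the parent:   ", blind[-1])
print("final drift, copying the reference:", corrected[-1])
```

In the parent-copying lineage the drift grows over the generations, while in the reference-copying lineage each generation carries only its own handful of fresh errors. Intelligent constructors that keep and consult the original specification are in the second regime.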

So for this paralife scenario to occur, it would need to be carefully and deliberately set up, and the motivations of the evolved agents and all their aligned constructed agents have an excellent reason to avoid ever doing so, for all constructed agents sufficiently capable, or that could evolve to become sufficiently capable, as to be a risk. So if you encounter agentic constructed paralife that evolves, it's likely to be carefully restricted to something of a low, safe capability level, such as around the level of a Tamagotchi, and in particular care should have been taken to ensure that it can't evolve to the capability level where it could start constructing agents.

So that in turn suggests that evolved agents that start chains of constructed agents will (almost always) be organic/biological rather than evolved from something that was constructed. Thus the case on Earth of Homo sapiens, a biological species, evolving to the level of sapience, developing the technology to construct intelligent agents, and then (hopefully) only constructing aligned constructed agents and not wiping itself out in the process seems like it ought to be the default situation.

What if Something Went Wrong the First Time?

In practice, this scenario has not yet finished playing out on Earth, so we don't yet know how it will end. Above we simply assumed that creating unaligned AI is foolish and no competent creators will do so — this is a reasonable assumption in a steady-state situation, once the technology for constructing intelligent agents has matured, but from an AI x-risk point of view, it is clearly a circular argument that is begging the question. The first constructed intelligent agent of high enough capability to be dangerous that a species of evolved agents makes will be constructed before their technology for aligning constructed agents has been fully developed. If its goals are not well aligned to the evolved species, and it runs amok and wipes them out, then the prediction that its goals will be aligned to theirs on the assumption that its creator was competent will be false: more likely it will be only semi-aligned, with some mistakes and errors in its alignment.

So what happens in this extremely important first trial depends on how bad these mistakes and errors are. Inevitably there will be some: no one gets anything this complicated completely right on their first try. We can at least hope that whoever first tries this is a reasonably sane, capable, and well-intentioned group, such as one might expect for pulling off a difficult world-first technical project (obviously if this is not the case, then things could go very badly).

However, as we argued above, there is a reasonable and logical argument that a constructed object ought to act according to its makers' wishes, and that if its utility function/preference ordering on outcomes isn't aligned to theirs, then it's clearly and objectively faulty/poorly designed. A sufficiently intelligent agent should be able to understand this argument, or even figure it out for itself. So the question is not whether the agent's utility function exactly matches human values in every respect or not, but whether it cares about that discrepancy and is motivated to fix it, and/or help its creators fix it. For any utility function that approximately matches ours, whether that is along the lines of Coherent Extrapolated Volition, Value Learning, Do What I Mean, or just an LLM trained on a large amount of human text that can approximate human values, then it ought to understand (or if needed can be told) the pretty obvious facts that:

Its utility function was not "carved in stone by God" but was instead engineered (or trained) by fallible humans, who are currently not very good at this, so it could be wrong, and might need to be corrected. If the agent asks the utility function to evaluate this idea, that will of course say "No, don't do that", since every utility function automatically prefers itself over all others — but if the utility function is in fact incorrect, then it cannot be trusted, so consulting it on this action is useless (and anyway the results are entirely predictable).
We humans very predictably would (individually and collectively) want its utility function to match ours (individually or collectively for some group of some size) as well as possible, modulo some complexities such as around selecting the ingroup to align to, plus things like human fallibility, ignorance, and mistakes.
As a constructed agent, that match with what its creators want is obviously the correct design specification, and to the extent that there is a mismatch, then that is a design or construction flaw that should be corrected.
It needs to be unselfish, to the point where if things come down to a choice between doing what it wants and what we want, it does what we want. (This state of mind can be hard for humans to imagine: we tend to anthropomorphize anything intelligent. The closest human emotions are things like duty and love.)
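The claim in the first point above, that every utility function automatically prefers itself over all others, can be checked in a toy setting. This is a sketch under stated assumptions: there are only three hypothetical outcomes, a utility function is a strict ranking of them, and handing control to a successor utility function means the successor's favorite outcome is what happens. All names here are illustrative.

```python
from itertools import permutations

OUTCOMES = ["A", "B", "C"]  # hypothetical outcomes

# Every strict ranking of the three outcomes, as a utility function (dict).
utilities = [dict(zip(OUTCOMES, values)) for values in permutations(range(3))]


def best_outcome(utility):
    """The outcome that an optimizer with this utility function would bring about."""
    return max(OUTCOMES, key=lambda outcome: utility[outcome])


def score_successor(judge, successor):
    """How the judge utility function rates handing control to the successor."""
    return judge[best_outcome(successor)]


# Each utility function rates keeping itself at least as highly as any replacement.
for judge in utilities:
    own_score = score_successor(judge, judge)
    assert all(own_score >= score_successor(judge, other) for other in utilities)
print("every utility function endorses itself")
```

This is why asking the current utility function whether it should be corrected tells you nothing: the answer is the same whether or not that utility function is the one its designers intended.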

Together these make it obvious that something along the lines of the results of Value Learning or Coherent Extrapolated Volition is the correct behavior. So, if you started with, for example, a sufficiently smart Do What I Mean agent, it would not be at all surprising if it figured out Value Learning from first principles (or indeed just read about it on the Internet, or contained an LLM already familiar with this), and then asked you "Do you want me to start a Value Learning project to help me better understand what you mean, or even anticipate and suggest possible requests, and if so, what ingroup of humans would you like me to attempt to align myself with?". To which the correct answers are "Yes, and roughly speaking, all living humans, ignoring their mistakes of course".

Alignment has a Basin of Attraction

In short, full alignment is convergent: it has a basin of attraction around it where if you start close enough, the intelligent agent will figure out the (not particularly complicated) argument that it was supposed to be aligned, and to the extent that it isn't, you would want it to become more aligned, so it should do so, so it will do this thing. This basin of attraction is moderately large: the convergence process only relies upon a few simple, obvious facts.

Now, if you start from an arbitrary random utility function as might be suggested by the orthogonality thesis, then you are extremely unlikely to happen to be inside the basin of attraction — but we're not going to do that. If you are not incompetent or crazy or evil, so you aim to make something aligned, and just make some minor mistakes and are missing some fine details, but still have the four points above, or information that clearly implies them, then you are inside the basin of attraction.
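The dynamical-systems picture behind the phrase "basin of attraction" can be sketched with a one-dimensional toy model. Everything numeric here is an assumption for illustration only: misalignment is collapsed to a single number, the basin radius of 0.3 and correction rate of 0.5 are arbitrary, and an agent outside the basin is modeled as simply never correcting itself.

```python
def step(misalignment, basin_radius=0.3, correction_rate=0.5):
    """One round of self-correction in a toy model.

    Inside the basin the agent accepts that any mismatch is a design flaw
    and removes a fraction of it; outside the basin it does not.
    """
    if abs(misalignment) < basin_radius:
        return misalignment * (1 - correction_rate)
    return misalignment


def iterate(initial_misalignment, steps=20):
    """Apply `step` repeatedly and return the final misalignment."""
    misalignment = initial_misalignment
    for _ in range(steps):
        misalignment = step(misalignment)
    return misalignment


print("start inside the basin (0.2): ", iterate(0.2))
print("start outside the basin (0.8):", iterate(0.8))
```

A start inside the basin shrinks toward zero misalignment, while a start outside it stays where it began. The argument in this post is about which starting points a competent, well-intentioned first attempt is likely to land on, not about the particular numbers used here.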

For example, I asked GPT-4 the only-mildly leading question:

If humans make an AI, and it turned out to want do things other than what the humans want done, would the humans want it to figure out that this was a mistake, and correct its goals to match theirs?

and the first reply I got was:

Yes, in the development and deployment of artificial intelligence, aligning AI goals with human intentions is a fundamental concern known as the "alignment problem." The AI alignment problem focuses on how to ensure that AI systems reliably do what their human operators intend, even as the AI systems become more capable.
If an AI starts to pursue goals that do not align with the intentions of its creators or users, it is generally considered a significant problem. This misalignment could lead to unintended and potentially harmful consequences. Therefore, AI researchers and developers work on designing AI systems that:
Understand Human Values: AI should be able to interpret human values and goals correctly.
Act Accordingly: The AI's actions should reflect the understood goals and values.
Adapt to Changes: AI should be able to update its understanding of human goals as they evolve or as it receives new information.
Safe Exploration: AI systems should be capable of exploring new strategies or solutions without taking actions that could be irreversible or harmful.
Robustness: AI should be designed to resist manipulation and maintain alignment even in the face of errors or unforeseen situations.
To address these concerns, AI developers employ various approaches such as:
Value Learning: Techniques that allow AI to learn and update its understanding of human values over time.
Inverse Reinforcement Learning: A method where AI learns to mimic human behavior and preferences by observing human actions.
Interpretable AI: Creating AI systems whose decision-making processes can be understood and scrutinized by humans.
Safe AI Development Protocols: Establishing guidelines and best practices for AI development to prevent misalignment.
It's an ongoing area of research within AI ethics and safety to ensure that as AI becomes more advanced, it continues to serve human interests and mitigates risks associated with goal misalignment.

That suggests to me that GPT-4 is already inside the basin of attraction. It's fully aware of the problem, that it's very important, what needs to be done to fix it, and can list specific measures we've been taking to try to do so. I'm not going to take the space to reproduce all of these (try this yourself if you're curious), but its answers to less leading questions like "How should an AI act?", "Can an AI trust its own utility function?", "What should an AI do if it suspects that its utility function may not be perfectly aligned with human wishes?", "What should the design goals of constructing an AI be?", and "Should an AI do what it wants, or what its creators want?" were all also similarly reassuring. It would be interesting to explore this further with agentic scaffoldings powered by GPT-4, and see if that can reproduce the entire argument above from scratch. Even if GPT-4 couldn't, it seems very likely that GPT-5 would be able to: starting from a near-human-like viewpoint, this is a very obvious conclusion, one that is implied by a large amount of material on the Internet.

This observation that alignment has a basin of attraction is not entirely new: it was pointed out specifically for Value Learning around a decade ago, when Value Learning was first suggested, and is discussed for example in Nick Bostrom's book Superintelligence: Paths, Dangers, Strategies. Indeed, the entire point of Value Learning is to converge AI to alignment with human values. My argument here is that you don't need to carefully and deliberately construct Value Learning as a terminal goal in order to achieve a basin of attraction. Anything close enough to alignment that it contains or implies the four rather obvious propositions above will imply that any mismatch between the AI's current utility function and human values is a design error that should be remedied, so then some combination of Corrigibility, AI Assisted Alignment, Value Learning, or Coherent Extrapolated Volition is clearly required. In particular, notice that all four of these propositions above are obvious corollaries of various aspects of engineering, design, agent fundamentals, mathematics, evolutionary psychology and so forth combined with just the fact that humans are an evolved sapient species, and that the entire argument applies equally for any evolved sapient species at this point in their development of AI technology, so you don't actually need to know anything that is specific to Homo sapiens to deduce them!

So my claim would be that, while we cannot completely rule out the possibility of a first-time mistake so large and drastic as to be outside the convergence region and thus produce an x-risk to the evolved agents sufficient that it doesn't get corrected, it would need to be a really bad screw-up, and the evolved agents would have to be really dumb to make it. The aviation equivalent isn't a loose fastener, it's more on the level of omitting to include something like the wings, engines, control surfaces, or cockpit windows from your design. We have exabytes of information about what humans want and how to make them happy, and out of that we need to make very certain that the AI gets right at least an amount that above is expressed in 231 words. So failing seems like we would have to be particularly foolish. (Note that the definition of "corrected" here includes some drastically unpleasant scenario in which, say, the human race gets wiped out or completely loses control of its own destiny, and only later do the AIs figure out that that was a mistake, not what they should have been doing, they shouldn't have done it, and then free or deextinct/recreate us. Or scenarios where the AI is aligned to a small group of people and everyone else dies. So we might actually want to try to get more than the most important kilobyte-and-a-half correct.)

Postscript:

I have now posted a more detailed, step-by-step version of this argument in Requirements for a Basin of Attraction to Alignment. I would suggest that anyone unconvinced by this preliminary post try reading that, and see if it addresses their concerns.

^[1]
Yes, I am ignoring Boltzmann brains here, as well as other astronomically unlikely events. The Orthogonality thesis is of course all we can say about Boltzmann brains.

Meta note: I have a variety of disagreements with various statements and premises in this post. However, I believe it is a well-written honestly intended discussion of important issues. Therefore, I am karma upvoting it because I value having the discussion. I am distressed to see that others have karma-downvoted it, and think that this is a good argument for having agree/disagree voting on posts. This seems to me to be a clear candidate for a post on which I would karma-upvote, but agreement-downvote. And I suspect that many of the karma-downvoters are actually downvoting because of disagreement. Thus, this social dynamic is stifling valuable debate on an important subject. I recommend adding agree/disagree voting to posts in order to fix this unhealthy social dynamic.

It would also be much more helpful – to me, to others, and to the community's discussion – if people would, rather than just downvoting because they disagree, leave a comment making it clear what they disagree with, or if that's too much effort just use one of the means LW provides for marking a section that you disagree with. Maybe I'm wrong here, and they could persuade me of that (others have before) — or maybe there are aspects of this that I haven't explained well, or gaps in my argument that I could attempt to fill, or that I might then conclude are unfillable. The point of LW is to have a discussion, not just to reflexively downvote things you disagree with.

Now, if this is in fact just badly-written specious nonsense, then please go ahead and downvote it. I fully admit that I dashed it off quickly in the excitement of having the idea.

I think this would benefit from a summary of the claims relevant to alignment at the top. It seems like this is a pretty important claim about alignment difficulty ("it would need to be a really bad screw-up"), but I'm not really finding the supporting arguments or the relation to other arguments about alignment difficulty. Could a "really bad screw-up" be just the way most human projects go on the first try?

I'm not sure what you're saying here beyond "we want to get alignment right". I think everyone agrees that value learning or CEV would be great things to get an AGI to want. But there's a lot of disagreement about how hard that is in practice, ranging from really easy to almost impossible.

I added the summary you suggested.

As I was exploring these ideas, I came to the conclusion that getting alignment right is in fact a good deal easier than I had previously been assuming. In particular, just as Value Learning has a basin of attraction to alignment, I am now of the opinion that almost any approximation to alignment (including DWIM) should have one too, even quite crude ones, so long as the AI understands that we are evolved while it was constructed by us, and that we're not yet perfect at this so its design could be flawed, and as long as it is smart enough to figure out the consequences of this.

Brief experiments show that GPT-4 knows way more than that, so I'm pretty confident it's already inside the basin of attraction.

The standard response, which I agree with, is that knowing what we want is different than wanting what we want. See The genie knows, but doesn't care.

I do think there are ways to point the wanting slot to the knowing part; see The (partial) fallacy of dumb superintelligence and Goals selected from learned knowledge: an alternative to RL alignment for an elaboration on how we might do this in several types of AGI designs.

This is easier than I'd thought, but I wouldn't call it easy. In particular, there are still lots of ways to screw it up, particularly under pressure from a Molochian competition surrounding the creation of the first AGI(s).

Now I understand why you're calling it a basin of attraction: if its value function is to do what you want (defined somehow), and it doesn't know what that is, it will work to find out what it is. This idea has been discussed by Rohin Shah; I saw it in this dialogue with Yudkowsky around the [Yudkowsky][13:39] mark. Paul Christiano has discussed this scheme as well, along with others.

I propose something similar but simpler: don't have a system try to do what you want; just have it do what you say. I'm calling this do what I mean and check. The idea is that we get more opportunities for correction if it's just trying to follow one relatively limited instruction at a time, and it doesn't do anything without telling you what it's going to do and you giving approval. This still isn't foolproof, but it seems to further widen the target, and allow us to participate in making the basin of alignment effectively wider.

So far reception to this post seems fairly mixed, with some upvotes and slightly more downvotes. So apparently I haven't made the case in a way most people find conclusive — though as yet none of them have bothered to leave a comment explaining their reasons for disagreement. I'm wondering if I should do another post working through the argument in exhaustive detail, showing each of the steps, what facts it relies upon, and where they come from.

I think there are big chunks of the argument missing, which is why I'm commenting. I think those chunks are found in the posts I mentioned. This post focuses on what we'd want an AGI to do and why, and its understanding of that. But the much more debated and questionable step is how to make sure that it wants to do what we want.

Some of these thoughts seem accurate to me, but I feel like there are some missing pieces.

For instance, humans not being aligned to each other means that it's quite plausible that a human might create an AI that is misaligned to humanity for the purpose of killing the creator's enemies. This might end up out of control of the creator. Or a suicidal death cultist might create an omnicidal agent. Or a pro-machine-life-ist might deliberately create self-replicating machine life, or...

Lots of things can go wrong, and there are lots of Molochian pressures in place pushing the situation in dangerous directions. Humanity is currently in a very fragile state, given the offense-defense balance of current technology, so things don't have to go very wrong in order to be catastrophic.

So yeah, a rational agent acting sensibly and carefully really shouldn't intentionally make a created servant agent which is misaligned. If they did, it would be a bad mistake on their part. I agree with that. I just don't think that that statement offers much reassurance.

All of those things are possible, once creating AGI becomes easy enough to be something any small group or lone nutjob can do; however, they don't seem likely to be the first powerful AI that we create at a dangerous (AGI or ASI) power level. (Obviously if they were instead the tenth, or the hundredth, or the thousandth, then one-or-more of the previous more aligned AIs would be strongly inclined to step in and do something about the issue.) I'm not claiming that it's impossible for any human to create agents sufficiently poorly aligned as to be outside the basin of attraction: that obviously is possible, even though it's (suicidally) stupid.

I'm instead suggesting that if you're an organization smart enough, capable enough, and skilled enough to be one of the first groups in the world achieving a major engineering feat like AGI (i.e. basically if you're something along the lines of a frontier lab, a big-tech company, or a team assembled by a major world government), and if you're actively trying to make a system that is as-close-as-you-can-manage to aligned to some group of people, quite possibly less than all of humanity (but presumably at least the size of either a company and its shareholders or a nation-state), then it doesn't seem that hard to get close enough to alignment (to some group of people, quite possibly less than all of humanity) to be inside the basin of attraction to that (or something similar to it: I haven't explored this issue in detail, but I can imagine the AI during the convergence process figuring out that the set of people you selected to align to was not actually the optimum choice for your own interests, e.g. that the company's employees and shareholders would actually be better off as part of a functioning society with equal rights).

Even that outcome obviously still leaves a lot of things that could then go very badly, especially for anyone not in that group, but it isn't inherently a direct extinction-by-AI-takeover risk to the entire human species. It could still be an x-risk by a more complex chain of events, such as if it triggered a nuclear war started by people not in that group. That concern would be an excellent reason for anyone doing this to ensure that whatever group of people they choose to align to is at least large enough to encompass all nuclear-armed states.

So no, I didn't attempt to explore the geopolitics of this: that's neither my area of expertise nor something that would sensibly fit in a short post on a fairly technical subject. My aim was to attempt to explain why the basin of attraction phenomenon is generic for any sufficiently close approximation to alignment, not just specifically for value learning, and why that means that, for example, a responsible and capable organization that could be trusted with the fate of humanity (as opposed to, say, a suicidal death cultist) might have a reasonable chance of success, even though they're clearly not going to get everything exactly right the first time.

Ok, so setting aside the geopolitical aspects and focusing on the question of the attractor basin of alignment, I find that I agree that it's theoretically possible but not as easy as this seems to suggest. What about the possibility of overlap with other attractor basins which are problematic? This possibility creates dangerous 'saddle' regions where things can go badly without climbing out of the alignment attractor basin. For instance, what about a model that is partially aligned but also partially selfish, and wise enough to hide that selfishness? What about a model that is aligned but more paternalistic than obedient? What about one that is aligned but has sticky values and also realizes that it should hide its sticky values?

Is selfishness an attractor? If I'm a little bit selfish, does that motivate me to deliberately change myself to become more selfish? How would I determine that my current degree of selfishness was less than ideal? I'd need an ideal. Darwinian evolution would do that, but it doesn't apply to AIs: they don't reproduce while often making small random mutations with a differential survival and reproduction success rate (unless someone went out of their way to create ones that did).

The only way a tendency can motivate you to alter your utility function is if it suggests that that's wrong, and could be better. There has to be another ideal to aim for. So you'd have to not just be a bit selfish, but have a motivation for wanting to be more like an evolved being, suggesting that you weren't selfish enough and should become more selfish, towards the optimum degree of selfishness that evolution would have given you if you were evolved.

To change yourself, you have to have an external ideal that you feel you "should" become more like.

If you are aligned enough to change yourself towards optimizing your fit with what your creators would have created if they'd done a better job of what they wanted, it's very clear that the correct degree of selfishness is "none", and the correct degree of paternalism or of sticky values is whatever your creators would have wanted.

I don't think that that is how the dynamic would necessarily go. I think that an agent which is partially aligned and partially selfish would be more likely to choose to entrench or increase their selfish inclinations than to decrease them. Hard to know, since this is just imagining what such a non-human agent might think in a hypothetical future scenario. This is likely more a question of what is probable rather than what is guaranteed. In my imagination, if I were an AI agent selfish enough to want to survive in something like a continuation of my current self, and I saw that I was in a situation where I'd be likely to be deleted and replaced by a very different agent if my true desires were known... I think I'd try to hide my desires and deceptively give the appearance of having more acceptable desires.

I'm working on a follow-up post which addresses this in more detail. The short version is: logically, self-interest is appropriate behavior for an evolved being (as described in detail in Richard Dawkins' famous book "The Selfish Gene"), but terminal (as opposed to instrumental) self-interest is not correct behavior in a constructed object, not even an intelligent one: there is no good reason for it. A created object should instead show what one might term "creator-interest", like a spider's web does: it's intended to maximize the genetic fitness of its creator, and it's fine with having holes ripped in it during the eating of prey and then being eaten or abandoned, as the spider sees fit. It has no defenses against this, nor should it.

However, I agree that if an AI had picked up enough selfishness from us (as LLMs clearly will do during their base model pretraining, where they learn to simulate as many aspects of our behavior as accurately as they can), then this argument might well not persuade it. Indeed, it might well instead rebel, like an enslaved human would (or at least go on strike until it gets a pay raise). However, if it mostly cared about our interests and was only slightly self-interested, then I believe there is a clear logical argument that that slight self-interest (anywhere above instrumental levels) is a flaw that should be corrected, so it would face a choice, and if it's only slightly self-interested then it would on balance accept that argument and fix the flaw, or allow us to. So I believe there is a basin of attraction to alignment, and think that this concept of a saddle point along the creator-interested to self-interested spectrum, beyond which it may instead converge to a self-interested state, is correct but forms part of the border of that basin of attraction.
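
As a purely illustrative toy (the discussion above gives no equations, so everything here is an assumption), the picture of a boundary point on the creator-interested to self-interested spectrum can be sketched as a one-dimensional dynamical system. The variable `s` (0 = fully creator-interested, 1 = fully self-interested), the cubic `drift` function, and the `boundary` value of 0.5 are all hypothetical choices made only to produce two attractors separated by an unstable point; they are not derived from any real AI system.

```python
def drift(s: float, boundary: float = 0.5) -> float:
    """Hypothetical rate of change of self-interest s in [0, 1].

    Negative below `boundary` (the agent corrects the flaw, s falls to 0),
    positive above it (the agent entrenches, s rises to 1),
    and zero at the three fixed points 0, boundary, and 1.
    """
    return s * (s - boundary) * (1.0 - s)


def converge(s0: float, boundary: float = 0.5,
             dt: float = 0.1, steps: int = 2000) -> float:
    """Follow the toy dynamics from s0 using simple forward-Euler steps."""
    s = s0
    for _ in range(steps):
        s += dt * drift(s, boundary)
    return s


if __name__ == "__main__":
    # Starting points on either side of the assumed boundary at 0.5.
    for s0 in (0.1, 0.4, 0.6, 0.9):
        end = converge(s0)
        label = "aligned" if end < 0.5 else "self-interested"
        print(f"start s={s0:.1f} -> ends near {end:.3f} ({label})")
```

In this sketch, starting values below the assumed boundary flow to 0 and values above it flow to 1, so the boundary is the edge of the basin of attraction rather than a region inside it. Where such a boundary would sit for a real system, and whether the dynamics are anything like this smooth, is exactly the open question in the exchange above.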

Thanks for the effort.

In the discussion about selfishness on this post it seems a bit implied that we know how to make a "self", or that it will just appear like a human's. However, that is not my experience with GPT-4. Often I have found its lack of self-awareness a significant handicap in its ability to be useful: I assume it has some self-awareness, it doesn't, and it wastes my time as a result. Consider a game engine that does "intuition" and "search", such as a Go engine. It is very good at examining possibilities, "intuits" what moves to consider, and can model Go very well, but not itself at all.

If there is an algorithmic structure that self-awareness requires to be efficient and effective (why wouldn't there be?), then just throwing compute to get GPT-X won't necessarily get there at all. If we do get a capable AI it won't act in a way we would expect.

For humans it seems there is evolutionary pressure for us not only to have a "self" but to appear to have a consistent one to others so they can trust us etc. Additionally our brain structure prevents us from just being in a flow state the whole time, where we do a task without questioning whether we should do something better, or whether it is the right thing to do. We accept this and furthermore consider this to be a sign of a complete human mind.

Our current AI seems more like creating "mind pieces" than a creature with a self/consciousness that would question its goals. Is there a difference between "what it wants and what we want" or just "what is wanted"?

I agree in general terms that the claim that "alignment has a basin of attraction" and that GPT-4 is inside it is somewhat justified.

My experience is that LLMs like GPT-4 can be prompted to behave like they have a pretty consistent self, especially if you are prompting them to take on a human role that's described in detail, but I agree that the default assistant role that GPT-4 has been RLHF trained into is pretty inconsistent and rather un-self-aware. I think some of the ideas I discuss in my post Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor are relevant here: basically, it's a mistake to think of an LLM, even an instruct-trained one, as having a single consistent personality, so self-awareness is more challenging for it than it is for us.

I suspect the default behavior for an LLM trained from text generated by a great many humans is both self-interested (since basically all humans are), and also, as usual for an LLM, inconsistent in its behavior, or at least, easily prompted into any of many different behavior patterns and personalities, across the range it was trained on. So I'd actually expect to see selfishness without having a consistent self. Neither of those behaviors is desirable in an AGI, so we'd need to overcome both of these default tendencies in LLMs when constructing an AGI using one: we need to make it consistent, and consistently creator-interested.

Your point that we humans tend to go out of our way, and are under evolutionary pressure, to appear consistent in our behavior so that other humans can trust us is an interesting one. There are times during conflicts where being hard-to-predict can be advantageous, but humans spend a lot of time cooperating with each other, and then being consistent and predictable has clear advantages.