What would it mean to solve the alignment problem sufficiently to avoid catastrophe? What do people even mean when they talk about alignment?

The term is not used consistently. What would we want or need it to mean? How difficult and expensive will it be to figure out alignment of different types, with different levels of reliability? To implement and maintain that alignment in a given AGI system, including its copies and successors?

The only existing commonly used terminology whose typical uses are plausibly consistent is the contrast between Inner Alignment (alignment of what the AGI inherently wants) and Outer Alignment (alignment of what the AGI provides as output). It is not clear this distinction is net useful.

An alignment failure or misalignment (being misaligned) can mean among other things:

  1. That the system has unintended goals or behaviors.
  2. That the system has unintended (harmful or dangerous) goals or behaviors.
  3. That the system will exhibit undesired goals or behaviors under at least some conditions, in response to at least some inputs:
    1. With respect to the intended goals of the alignment efforts made.
    2. From the perspective of those aligning the system.
    3. From the perspective of a given user, or users in general.
    4. From the perspective of a society, planet, government or humanity.
    5. From some other perspective, cause, system of judgment or value.
  4. That the system exhibits behaviors whose outcomes we would not desire on reflection, including those that lead to poor outcomes, whether or not they match the rules and principles to which we attempted to align the system.
  5. That the system exhibits particular alignment failure modes, such as instrumental convergence and power seeking, lack of corrigibility (resisting being shut down), refusing to obey ‘lawful’ orders, deception or manipulation, attempting to kill all humans and so on.

It is impossible in theory to have all these different kinds of alignment simultaneously. You cannot simultaneously (without any claim of completeness):

  1. Do what I say
  2. Also do what I mean
  3. Also do what I should have said and meant
  4. Also do what is best for me
  5. Also do what broader society or humanity says
  6. Also do what broader society or humanity means or should have said
  7. Also do what broader society or humanity should have said given their values
  8. Also do what is best for everyone
  9. Do some ideal friendly combination of all of it that a broadly good guy would do, in a way that is respectful of and preserves what is valuable on all levels
  10. Strictly follow some other set of rules that were set up long ago, no matter the cost

(And do it all according to a variety of contradictory human heuristics and biases, while looking friendly, while engaging in unnatural behaviors like corrigibility, and not tricking people into giving you requests you can fulfil, and so on, please.)

We must pick at most one of those, or another variation on them, or something else, as our primary target. A machine cannot serve two masters any more than a man can, and the ability to put the machine under arbitrary stress, and its additional capabilities and intelligence, makes this that much more clear. Even individually, many of the requests and desired behavioral sets above are not actually logically coherent or consistent.

Getting any one of those ten is hard enough. It is a problem we do not know how to solve for systems more intelligent than we are. We do not even know how to robustly solve it for current systems.

To solve alignment and retain control of AGIs and their actions, we will need to:

  1. Be able to get an AGI to do something a human selects at all, rather than something not selected. Be able to retain some form of control over what it does in the future, or set it on a chosen course. At all.
  2. Have this alignment be of the appropriate type for the role and circumstances, and sufficiently strong, robust and reliable to be maintained.
  3. Have this alignment and the surrounding dynamics cause humans to choose to remain in control over time, or somehow be unable to choose differently.
  4. Have all of this survive rapid unpredictable changes over long periods, or find a way to prevent such changes.
  5. A key crux: We may need to get this right on the first try when we build the first sufficiently powerful system, because the consequences of getting the first try wrong would be catastrophic. There is also disagreement over ‘exactly how right’ this would need to be to avoid catastrophe.

Useful and consistent terminology and taxonomy beyond this are urgently needed.

We could give these ten forms of alignment the following names (by all means please replace them with better names, this is hard); again, this list is not claimed to be complete:

  1. Literal (Personal) Genie: Do exactly what I say.
  2. Minion: Do what I intended for you to do.
  3. Personal: Do what I would want you to do.
  4. Forceful: Be loyal to me, but do what’s best for me, not strictly what I tell you to do or what I want or intended.
  5. Literal Genie: Do whatever it is collectively told.
  6. Public Servant: Carry out the will of the people.
  7. Value: Uphold the values of the people, and do what they imply.
  8. Cincinnatus: Do what needs to be done, whether the people like it or not.
  9. Robin Williams: The Genie from Aladdin. Note he is not strategic.
  10. Arbiter: What is the law?

We do not currently have a known method of creating reliable alignment of any kind for future AGI systems, or a path known to lead to this. How promising various existing proposals or plans are for getting us there is heavily disputed and a common crux.

In addition to the type of alignment, one can talk about various aspects of the strength, reliability, precision and robustness of that alignment, as well as what ways exist to weaken, risk or break that alignment. These and related words are not used consistently.

In very broad terms, combining aspects that can be distinct for ease of discussion, one might speak of things like, in terms of either inner alignment, outer alignment, or a combination of both:

  1. Fragile alignment. This is the type of alignment that we know how to achieve in existing LLM systems. Something like: You do your best to noisily specify with words or examples what preferences you want to put into the AI, including how the AI might act when different preferences are in conflict. Under default or similar circumstances, the AI will probably (or even almost certainly) act in ways broadly compatible with the general vibe and sense of what was requested. If you take it outside of its training distribution, this will often break down, and there will be various hacks, tricks, frames and ‘jailbreaks’ available to modify behaviors, with which one can play whack-a-mole to raise difficulty and decrease natural frequency.
  2. Friendly alignment. Cares at least importantly in part, ideally primarily, about humans and the things humans care about, and cares a lot about human values or humanity potentially going extinct, enough to spend resources towards such ends, or at least to spend extra resources to avoid causing such events as side effects, and to not aim for configurations of atoms where we are absent or that we would not find valuable.
  3. Human-level alignment. The AGI cares about at least some humans and human values within the range of roughly the same ways and degrees that typical humans care about other humans and human values. One can speak of quantitative levels of this, and what it would take for such considerations to override or be overridden by other considerations such as instructions given. Under the wrong circumstances it might end up doing almost anything, but it is as tough to get weird failure modes as it would be to get humans to end up in those failure modes, ideally similar to when those humans are thinking relatively clearly.
  4. Strict alignment. The damn thing will actually follow some set of instructions to the letter subject to its optimization constraints, hopefully you like the consequences of that. It is a potentially important crux if you disagree with the claim that for almost all specified instruction sets you won’t like the consequences, and there is no known good one yet, due to various alignment difficulties.
  5. Strawberry alignment. MIRI calls a well-constrained version of strict alignment ‘strawberry alignment,’ where you can tell the AI to build two strawberries that are identical on the cellular level, and it will do so without causing anything weird or disruptive to happen.
  6. Robust alignment. Something that is reliably going to act in ways that lead to valuable-according-to-[humans or human values] configurations of atoms, or does its best to preserve some invariants that ensure value is preserved, or something like that, using methods which we would approve on reflection, in a way that survives moving far out of the training distribution, and which is secure against disruptions.

All these targets have problems, in addition to ‘we don’t know how to get this’ beyond the first one, ‘do we know what the components mean or how to specify them’ and ‘we don’t understand human values’ and ‘is this even a coherent concept,’ such as:

It is not clear fragile alignment is even meaningfully helpful – that it does much, survives for long, or causes actions compatible with our survival, once the AGI is smarter than we are, even if we get its details mostly right and face relatively good conditions. There are overlapping and overdetermined reasons (the strength and validity of which are of course disputed) to expect any good properties to break down exactly when it is important they not break down.

It is not clear that human-level or friendly alignment would do us much good for long either, given the nature and history of humans, and the competitive dynamics involved, and the various reasons to expect change. If AGIs are much smarter and more capable and efficient than us, is there reason to think this level of alignment might be sufficient for long?

It is not clear to what extent strict alignment or strawberry alignment gives us affordance to reach good outcomes, how universal and deadly the various sources of lethality involved would be, or how difficult it would be to locate such affordances, especially on the first try.

It is not clear to what extent robust alignment is a coherent concept especially in a competitive world or even how it interacts with maximization, as it contains many potential contradictions and requirements. Or how one could get or even specify this level of alignment even under ideal conditions.

A better and more complete future version of this document would include a better taxonomy here similar to the one above.

A key crux is the type and degree of alignment necessary to avoid catastrophe and achieve good outcomes. Another is how difficult such alignment will be to achieve with what level of reliability, and which particular obstacles we need to worry about.

A Missing Additional Post: Alignment Difficulties

My post on the progression through various stages of AGI development handwaved ‘alignment’ to focus on when we might need how much of it depending on what path we take in terms of what AGIs or potential-AGIs exist under how much human control, including the implications for type and degree of alignment necessary.

This post, on degrees and types of alignment, asks what alignment actually means, and what forms and degrees it might take and which of them would be required to survive various scenarios, and spread and preserved how robustly, and so on. Are these types of alignment even possible in theory, or coherent logically consistent concepts? If you get the thing you ostensibly want, what would happen?

A third post might ask, how likely would it be, and how hard would it be, for us to achieve a given form or degree of alignment, in systems smarter and more capable than us or any previously existing system?

Is this ‘alignment’ a natural thing you can get easily or even by default, that is essentially a normal engineering problem, or is it a highly unnatural outcome where security mindset and bulletproof approaches as yet unfound even in principle are required, where any flaws are exploited, amplified and fatal, and where there are many lethal problems, all of which one must avoid?

How much hope or doom lies in various potential approaches? Would scaled up versions of things that work on non-intelligent systems likely work out of the box or with ordinary reasonable adjustments, or do we know reasons they definitely fail? Can we use incrementally smarter AIs to solve our problems for us? Will the results naturally be robust, have nice properties, be nicely self-maintaining? Does it fall out of this ‘one weird trick’?

How much investment of time and money, how much sacrifice of capability including continuously, is required to get what we need to make a real attempt? To what extent do we need to ‘get it right on the first try’ due to failure not being something we can recover from, and how much does that increase the difficulty level versus problems where we can iterate?

The most lethal-looking, hard-to-avoid, unnatural-to-solve problems include instrumental convergence, power seeking and corrigibility, yet the list of even central ones is very long – see Yudkowsky 2022, A List of Lethalities.

Creating a neutral-perspective version of such a list, especially an exhaustive one, and getting all the implied cruxes including potential solutions into the crux list, would likely be valuable. Especially if it was combined with those resulting from potential solutions and paths to those solutions, and so on.

Unfortunately, for now, the scope of that project is intractable. I have run out of time if not space, and leave expanding this out to others or to the future. If the answers here matter to you, it will be a long slog of evaluating many complexities, and I urge you not to outsource or abstract it, avoid falling back on social cognition or normality heuristics or grabbing onto metaphors, and instead think hard about the concrete details and logical arguments.

10 comments

When there is a difference between what I said, what I meant, what I should have said/meant, and what is best for me... I think I would prefer the AI to explain to me this difference, if possible. If impossible to explain, at least to say that there is a difference (and let me figure out how to deal with that information).

The collective version of the same problem is more difficult, because it turns possible resulting intrapersonal conflicts into interpersonal.

From the perspective of ontology and memetic engineering, the whole "ontology" or classification of alignments that you give, "fragile, friendly, ..." is bad because it's not based on some theory but rather on the cacophony of commonsensical ideas. These "alignments" don't even belong to the same type: "Fragile" is an engineering approach (but there are also many other engineering approaches which you haven't mentioned!), 2-3 and 5-6 are black-box descriptions of some alignment characteristics (at least these seem to belong to the same type), and "Strict" looks like a description of a mathematical mechanism. Also, the names are bad.

Fragile alignment

The name is really bad: you use a property of this engineering approach (its fragility) as its name. But there are infinitely many other engineering approaches that are very fragile. For example, consider post-filtering an LLM's outputs for any occurrences of words like "bomb", "kill", "poison", n-word, etc., and not passing these rollouts through, but passing all the rest. Is this an alignment technique? Yes, if we consider the filter as part of the cognitive architecture, as we should. Is it fragile? Yes.
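To make the example concrete, here is a minimal sketch of such a keyword post-filter in Python. The word list, function names and sample rollouts are all assumptions made up for illustration; nothing here reflects how any real system filters outputs.

```python
import re

# Hypothetical blocklist, for illustration only.
BLOCKED_WORDS = ("bomb", "kill", "poison")

# Match any blocked word as a whole word, ignoring case.
_BLOCKED_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in BLOCKED_WORDS) + r")\b",
    re.IGNORECASE,
)


def passes_filter(rollout: str) -> bool:
    """Return True if the rollout contains none of the blocked words."""
    return _BLOCKED_PATTERN.search(rollout) is None


def post_filter(rollouts):
    """Drop every rollout that mentions a blocked word; pass the rest through."""
    return [rollout for rollout in rollouts if passes_filter(rollout)]


# Made-up sample rollouts.
candidates = [
    "Here is a recipe for banana bread.",
    "Step one: build a bomb.",
    "Step one: build a b0mb.",  # trivial obfuscation of a blocked word
]
kept = post_filter(candidates)
```

In this sketch the second rollout is dropped while the third, which merely swaps one character, passes straight through. That is the sense in which the filter is fragile: it constrains surface form, not content.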

But when multiple such approaches are combined, this may actually lead to a cognitive architecture that is robustly aligned. This is the promise of LMCAs and the natural language alignment.

It is not clear fragile alignment is even meaningfully helpful – that it does much, survives for long, or causes actions compatible with our survival, once the AGI is smarter than we are, even if we get its details mostly right and face relatively good conditions. There are overlapping and overdetermined reasons (the strength and validity of which are of course disputed) to expect any good properties to break down exactly when it is important they not break down.

Laissez-faire (any prompt goes), "bare", non-post-filtered LLM rollouts as a cognitive architecture is obviously doomed. OpenAI has stopped deploying and giving access to base models (GPT-4 is available only in SFT'ed and RLHF'ed form), and I expect that in the next iteration (GPT-5), they will (my wishful thinking) stop even that -- they will give access only to an LMCA from the beginning. Even "rollout + post-filtering" is already a primitive form of such LMCA.

In turn, LMCAs in general needn't inherit the cardinal sin of LLMs (which are an "exponentially diverging diffusion process", in the words of LeCun). And SFT/RLHF (things that you have called "fragile alignment") could be a part of a robust architecture, as I already noted above.

blf (1y):

The first four and next four kinds of alignment you propose are parallel except that they concern a single person or society as a whole.  So I suggest the following names which are more parallel.  (Not happy about 3 and 7.)

  1. Personal Literal Genie: Do exactly what I say.
  2. Personal Servant: Do what I intended for you to do.
  3. Personal Patriot: Do what I would want you to do.
  4. Personal Nanny: Be loyal to me, but do what’s best for me, not strictly what I tell you to do or what I want or intended.
  5. Public Literal Genie: Do whatever it is collectively told.
  6. Public Servant: Carry out the will of the people.
  7. Public Patriot: Uphold the values of the people, and do what they imply.
  8. Public Nanny: Do what needs to be done, whether the people like it or not.
  9. Gentle Genie: The Genie from Aladdin. Note he is not strategic.
  10. Arbiter: What is the law?

Is this ‘alignment’ a natural thing you can get easily or even by default, that is essentially a normal engineering problem, or is it a highly unnatural outcome where security mindset and bulletproof approaches as yet unfound even in principle are required, where any flaws are exploited, amplified and fatal, and where there are many lethal problems, all of which one must avoid?

To answer these questions specifically, it's really important not just to consider AI--human alignment "in the abstract", but embedded in the current civilisation, with its infrastructure and incentive structures. As I wrote here:

[...] we should address this strategic concern by rewiring the economic and action landscapes (which also interacts with the "game-theoretic, mechanism-design" alignment paradigm mentioned above). The current (internet) infrastructure and economic systems are not prepared for the emergence of powerful adversarial agents at all:

  • There are no systems of trust and authenticity verification at the root of internet communication (see https://trustoverip.org/)
  • The storage of information is centralised enormously (primarily in the data centres of BigCos such as Google, Meta, etc.)
  • Money has no trace, so one may earn money in arbitrary malicious or unlawful ways (i.e., gain instrumental power) and then use it to acquire resources from respectable places, e.g., paying for ML training compute at AWS or Azure and purchasing data from data providers. Formal regulations such as compute governance and data governance and human-based KYC procedures can only go so far and could probably be social-engineered by a superhuman imposter or persuader AI.

In essence, we want to design civilisational cooperation systems such that being aligned is a competitive advantage. Cf. "The Gaia Attractor" by Rafael Kaufmann.

This is a very ambitious program to rewire the entirety of the internet, other infrastructure, and the economy, but I believe this must be done anyway. Just expecting a "miracle" HRAD invention to be sufficient, without fixing the infrastructure and system design layers, doesn't sound like a good strategy. By the way, such infrastructure and economy rewiring is the real "pivotal act".

If we imagined that the world had a "right" kind of infrastructure and social structure (really decentralised, trust-first), probably alignment would be much more of an "ordinary engineering" problem. With the current economic and infrastructural vulnerabilities mentioned, however, the alignment becomes a much higher-stakes problem, requiring more of "bulletproof" solutions "on the first try", I think.

Strict alignment. The damn thing will actually follow some set of instructions to the letter subject to its optimization constraints, hopefully you like the consequences of that. It is a potentially important crux if you disagree with the claim that for almost all specified instruction sets you won’t like the consequences, and there is no known good one yet, due to various alignment difficulties.

The name is uninformative and possibly misleading. If the set of instructions is in a natural or a formal language, you push the alignment difficulty into the semantics and semiotics, which are not "strict", and the alignment ends up not "strict" either.

In the planning-as-inference frame, I guess you probably mean something like an external evaluation (perhaps with some "good old-fashioned algorithm", a-la "type checker", rather than another AI, although it's really questionable whether such "type checker" could be built) of the inferred plans in their entirety. But again, even internal AI's representations are symbols, not "ground truth", so they are subject to the same difficulty of semantic and semiotic interpretation as natural language.

It is not clear to what extent strict alignment or strawberry alignment gives us affordance to reach good outcomes, how universal and deadly the various sources of lethality involved would be, or how difficult it would be to locate such affordances, especially on the first try.

"Strict" alignment is an engineering technique or approach, that shouldn't be judged in isolation but rather as a part of a cognitive architecture as a whole, as I explained in this comment.

"Strawberry" alignment is an external evaluation criterion or characteristic. However, I would go even further and say that "strawberry" is just a thought experiment which is not meant to be an actual eval that we will actually try, and the purpose of this thought experiment is just to show that the process of reasoning (and the result of reasoning, i.e., the plan), and the resulting behaviour are the actual objects of ethical evaluation and alignment rather than simply "goals". Goals become "good" or "bad" only in the context of larger plans and behaviour. This thought could be expressed in different ways, e.g., directly, as I just did (as well as in this comment). The latest OpenAI paper, "Let's Verify Step By Step" highlights this idea, too.

It is not clear that human-level or friendly alignment would do us much good for long either, given the nature and history of humans, and the competitive dynamics involved, and the various reasons to expect change. If AGIs are much smarter and more capable and efficient than us, is there reason to think this level of alignment might be sufficient for long?

"Human-level" is just more commonly called "value alignment" (or "alignment with human values" if you want). But I agree with the conclusion: "friendly" is an attempt at "moral fact alignment" ("humanity is valuable to preserve"), which is probably futile without considering and aligning on the underlying theory of ethics, i.e., without the methodological and scientific alignment, as I described in a different comment. Value alignment, if taken literally, i.e., as attempting to impart AI with humans' heuristics about value, is also a species of "moral fact alignment", just somewhat more concrete than just "humanity is valuable to preserve" (although the latter is also one of the human values).

Have this alignment and the surrounding dynamics cause humans to choose to remain in control over time, or somehow be unable to choose differently.

This is self-contradictory: if the surrounding dynamics strongly preclude humans from "choosing otherwise", humans are no longer "in control". Also, under certain definitions of "choosing differently", humans may be precluded from moving into different biological and computational substrates, which in itself might be a cosmic tragedy because it may forever preclude humans from realising vast amounts of potential.

And Zvi points out these contradictions himself:

It is not clear to what extent robust alignment is a coherent concept especially in a competitive world or even how it interacts with maximization, as it contains many potential contradictions and requirements.

It is impossible in theory to have all these different kinds of alignment simultaneously. You cannot simultaneously (without any claim of completeness):

  1. Do what I say
  2. Also do what I mean
  3. Also do what I should have said and meant
  4. Also do what is best for me
  5. Also do what broader society or humanity says
  6. Also do what broader society or humanity means or should have said
  7. Also do what broader society or humanity should have said given their values
  8. Also do what is best for everyone
  9. Do some ideal friendly combination of all of it that a broadly good guy would do, in a way that is respectful of and preserves what is valuable on all levels
  10. Strictly follow some other set of rules that were set up long ago, no matter the cost

I think we should already forget about items 1 and 2 (as well as 10, but that goes without saying). In the context of communicating and aligning with superhumanly smart AI, it is really hubristic and stupid to think that, although we will be much stupider than the AI, and although it will be able to write a superhumanly good theory of ethics (one that humans didn't come up with in thousands of years) in a second and design an AI (or change itself) to follow this theory, humans' thoughts about value will still somehow be "covered with gold" and worth adhering to for that AI.

Given this, I disagree that "we will need to: 1) Be able to get an AGI to do something a human selects at all, rather than something not selected. Be able to retain some form of control over what it does in the future, or set it on a chosen course. At all. ...". I'm not sure that this is possible (which implies that the orthogonality thesis holds to a sufficiently strong degree, which I doubt), but if this is possible, I don't think this would be desirable.

Items 3-9 raise an important and valid concern, that of the (infinite) multiplicity of (collective) identity and alignment subjectivity.

We must pick at most one of those, or another variation on them, or something else, as our primary target. A machine cannot serve two masters any more than a man can, and the ability to put the machine under arbitrary stress, and its additional capabilities and intelligence, makes this that much more clear.

I think this is the wrong reaction to the above concern. Humans can serve multiple masters (themselves, their family, their community, their nation/society, the whole of humanity, and Gaia), so why couldn't AIs?

In the phrase "It is impossible in theory to have all these different kinds of alignment simultaneously", it's unclear what you mean by "having alignment". If you meant something like formal, "total" alignment, then sure, it's impossible to have perfect alignment in the physical reality outside us (i.e., outside simple simulations and mathematical abstractions). But if we strive for continuously increasing alignment along various dimensions and various levels of intelligent entities (alignment subjects), it should definitely be possible in general (although some "bad" situations may call for a choice where the alignment between certain entities or along a certain dimension is worsened in favour of the alignment between some other types of entities or along other dimensions).

I think that to increase our chances to realise a good future, we must find a principled way of addressing the issue of the multiplicity of identity and alignment subjectivity, which is the essence of a scale-free theory of ethics.

the system has unintended (harmful or dangerous) goals or behaviors.

Note that judgements about the harmfulness and dangerousness of some goals or behaviours are themselves theory-laden. This is why Goal alignment without alignment on epistemology, ethics, and science is futile. From the perspective of any theory of cognition/intelligence that includes a generative model (which is not only Active Inference, but also LeCun's H-JEPA, LMCAs such as the "exemplary actor", and more theories of cognition and/or AI architectures) for performing planning-as-inference, I think a straightforward and useful ladder of alignment could be introduced: methodological, scientific, and fact alignment:

Thus, the crux of alignment is aligning the generative models of humans and AIs. Generative models could be "decomposed", vaguely (there is a lot of intersection between these categories), into

  • Methodology: the mechanics of the models themselves (i.e., epistemology, rationality, normative logic, ethical deliberation),
  • Science: mechanics, or "update rules/laws" of the world (such as the laws of physics or the heuristical learnings about society, economy, markets, psychology, etc.), and
  • Fact: the state of the world (facts, or inferences about the current state of the world: CO2 level in the atmosphere, the suicide rate in each country, distance from Earth to the Sun, etc.)

These, we can conceptualise, give rise to "methodological alignment", "scientific alignment", and "fact alignment" respectively. Evidently, methodological alignment is most important: it in principle allows for alignment on science, and methodology plus science helps to align on facts.

Under this framework, goals are a specific type of facts, laden by a specific theory of mind of another agent (natural or AI). A theory of mind here should be a specialised version of a general theory of cognition which itself, as noted above, includes a generative model and planning-as-inference, under which goals become future world states or some features of future world states, predicted/planned (prediction and planning is the same thing, under planning-as-inference) by the other mind (or oneself, if the agent reflects about its own goals).[1]

Thus, goal alignment is easy (practically, automatic) when two agents are aligned on methodology and science (albeit goal alignment even between methodologically and scientifically aligned agents usually still requires communication and coordination, unless we enter the territory of logical handshakes...), but is also futile when there is no common methodological and scientific ground. 

  1. ^

    Incidentally, this means that RL is not a very useful framework for discussing goals, because goals can't be conceptualised under RL easily, which causes a lot of trouble for people in the AI safety community who tend to think that there should be a single "right" theory or framework of cognition and intelligence. There should not: for alignment, we should simultaneously use multiple theories of cognition and value. And RL, although it probably couldn't be deployed very usefully to discuss goal alignment specifically, could still be used to discuss some aspects of value alignment between minds.