I would order these differently.
Within the first section (prompting/RLHF/Constitutional):
The core reasoning here is that human feedback directly selects for deception. Furthermore, deception induced by human feedback does not require strategic awareness - the robot hand that looks like it's grasping a ball but isn't is a good example. So human-feedback-induced deception is more likely to occur, and to occur earlier in development, than deception from strategic awareness. Among the three options, Constitutional AI applies the most optimization pressure toward deceiving humans (IIUC), RLHF the next most, whereas prompting alone provides zero direct selection pressure for deception; it is by far the safest option of the three. (Worlds Where Iterative Design Fails talks more broadly about the views behind this.)
Next up, I'd put "Experiments with Potentially Catastrophic Systems to Understand Misalignment" as 4th-hardest world. If we can safely experiment with potentially-dangerous systems in e.g. a sandbox, and that actually works (i.e. the system doesn't notice when it's in testing and deceptively behave itself, or otherwise generalize in ways the testing doesn't reveal), then we don't really need oversight tools in the first place. Just test the thing and see if it misbehaves.
The oversight stuff would be the next three hardest worlds (5th-7th). As written I think they're correctly ordered, though I'd flag that "AI research assistance" as a standalone seems far safer than using AI for oversight. The last three seem correctly ordered to me.
I'd also add that all of these seem very laser-focused on intentional deception as the failure mode, which is a reasonable choice for limiting scope, but sure does leave out an awful lot.
> deception induced by human feedback does not require strategic awareness - the robot hand that looks like it's grasping a ball but isn't is a good example. So human-feedback-induced deception is more likely to occur, and to occur earlier in development, than deception from strategic awareness
The phenomenon where a 'better' technique is actually worse than a 'worse' technique if both are insufficient is something I discuss in a later section of the post, where I specifically mention RLHF. I think this holds true in general throughout the scale: e.g. Eliezer and Nate have said that even complex interpretability-based oversight with robustness testing and AI research assistance just incentivizes more and better deception, so this isn't unique to RLHF.
But I tend to agree with Richard's view in his discussion with you under that post: while RLHF is worse than just prompting if you condition on deception occurring by default (i.e. prompting is better in harder worlds), RLHF is better than just prompting in easy worlds. I also wouldn't call non-strategically-aware pursuit of inaccurate proxies for what we want 'deception', because in this scenario the system isn't being intentionally deceptive.
In easy worlds, the proxies RLHF learns are good enough in practice, and cases like the famous robot hand that looks like it's grasping a ball but isn't simply disappear if you're diligent enough with how you provide feedback. In that world, not using RLHF would get you systems pursuing cruder and worse proxies for what we want that fail often (e.g. systems just overtly lie to you all the time, say and do random things, etc.). I think that's more or less the situation we're in right now with current AIs!
If the proxies that RLHF ends up pursuing are in fact close enough, then RLHF works and will make systems behave more reliably and be harder to e.g. jailbreak or provoke into random antisocial behavior than with just prompting. I did flag in a footnote that the 'you get what you measure' problem that RLHF produces could also be very difficult to deal with for structural or institutional reasons.
> Next up, I'd put "Experiments with Potentially Catastrophic Systems to Understand Misalignment" as 4th-hardest world. If we can safely experiment with potentially-dangerous systems in e.g. a sandbox, and that actually works (i.e. the system doesn't notice when it's in testing and deceptively behave itself, or otherwise generalize in ways the testing doesn't reveal), then we don't really need oversight tools in the first place.
I'm assuming you meant fourth-easiest here, not fourth-hardest. It's important to note that I'm not talking here about testing systems to see if they misbehave in a sandbox and then, if they don't, assuming you've solved the problem and deploying. Rather, I'm talking about doing science with models that exhibit misaligned power-seeking, the idea being that we learn general rules - e.g. about how specific architectures generalize, or why certain phenomena arise - that are theoretically sound and that we expect to hold true even post-deployment with much more powerful systems. Incidentally, this seems quite similar to what the OpenAI superalignment team is apparently planning.
So it's basically, "can we build a science of alignment through a mix of experimentation and theory?" If we study, in a lab setting, a model that's been fooled into thinking it's been deployed and then commits a treacherous turn, and we do this enough times, can we figure out the underlying cause of the behavior and maybe get new foundational insights? Maybe we can try to deliberately get AIs to exhibit misalignment and learn from that. It's hard to anticipate in advance what scientific discoveries will and won't tell you about systems, and I think we've already seen cases of experiment-driven theoretical insights, like simulator theory, that seem to offer new handles for solving alignment. How much quicker and how much more useful will these be if we get the chance to experiment on very powerful systems?
Hey, thanks for writing this up! I thought you communicated the key details excellently - in particular the three camps of varying alignment difficulty worlds, and the variation within those camps. I also think you included just enough caveats and extra details to give readers more to think about, without washing out the key ideas of the post.
Just wanted to say thanks, this post makes a great reference for me to link to.
I think this post is really helpful and has clarified my thinking about the different levels of AI alignment difficulty. It seems like a unique post with no historical equivalent, making it a major contribution to the AI alignment literature.
As you point out in the introduction, many LessWrong posts provide detailed accounts of specific AI risk threat models or worldviews. However, since each post typically explores only one perspective, readers must piece together insights from different posts to understand the full spectrum of views.
The new alignment difficulty scale introduced in this post offers a novel framework for thinking about AI alignment difficulty. I believe it is an improvement on the traditional 'P(doom)' approach, which requires individuals to spontaneously think of several different possibilities - a mentally taxing exercise. Additionally, reducing one's perspective to a single number may oversimplify the issue and discourage nuanced thinking.
In contrast, the ten-level taxonomy gives the reader concrete descriptions of ten scenarios, each posing alignment problems of a different difficulty. This framework encourages readers to consider a diverse range of scenarios and problems when thinking about the difficulty of the AI alignment problem. By assigning probabilities to each level, readers can construct a more comprehensive and thoughtful view of alignment difficulty, which in turn encourages deeper engagement with the problem.
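As a toy illustration of that exercise (a minimal sketch in Python; the probabilities and the 'levels 7+' cutoff are made-up examples, not figures from the post):

```python
# Illustrative only: one reader's hypothetical probabilities for each
# difficulty level (1 = easiest, 10 = impossible); they must sum to 1.
p = {1: 0.05, 2: 0.10, 3: 0.15, 4: 0.15, 5: 0.15,
     6: 0.15, 7: 0.10, 8: 0.08, 9: 0.05, 10: 0.02}
assert abs(sum(p.values()) - 1.0) < 1e-9

# Expected difficulty level under this distribution.
expected_level = sum(level * prob for level, prob in p.items())

# Total probability mass on the harder worlds (here, arbitrarily, levels 7+).
p_hard = sum(prob for level, prob in p.items() if level >= 7)

print(f"Expected level: {expected_level:.2f}; P(level >= 7): {p_hard:.2f}")
```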
The new taxonomy may also foster common understanding within the AI alignment community and serve as a valuable tool for facilitating high-level discussions and resolving disagreements. Additionally, it proposes hypotheses about the relative effectiveness of different AI alignment techniques which could be empirically tested in future experiments.
Thanks for writing this up. I really liked this framing when I first read about it but reading this post has helped me reflect more deeply on it.
I’d also like to know your thoughts on whether Chris Olah’s original framing, that anything which advances this ‘present margin of safety research’ is net positive, is the correct response to this uncertainty.
I wouldn't call it correct or incorrect, only useful in some ways and not others. Whether it's net positive may depend on whether it is used by people in cases where it is appropriate/useful.
As an educational resource/communication tool, I think this framing is useful. It's often useful to collapse complex topics into a few axes and construct idealised patterns - in this case, a difficulty distribution on which we place techniques according to the kinds of scenarios where they provide marginal safety. This could be useful for helping people initially orient to existing ideas in the field or in governance, or possibly when making funding decisions.
However, as a tool for reducing fundamental confusion about AI systems, I don't think it's very useful. The issue is that many of the current ideas we have in AI alignment are based significantly on pre-formal conjecture that is not grounded in observations of real-world systems (see The Alignment Problem from a Deep Learning Perspective). Before we observe more advanced future systems, we should be highly uncertain about existing ideas. Moreover, it seems like this scale attempts to describe reality via the set of solutions which produce some outcome in it? That seems like an abstraction that is unlikely to be useful.
In other words, I think it's possible that this framing leads to confusion between the map and the territory, where the map is making predictions about tools that are useful in territory which we have yet to observe.
To illustrate how such an axis may be unhelpful if you were trying to think more clearly, consider the equivalent for medicine. Diseases can be divided into classes by how difficult they are to cure, with corresponding research being useful for curing them. Cuts and scrapes are self-mending, whereas infections require the corresponding antibiotics or antivirals; immune disorders and cancers are diverse and therefore span various levels of difficulty among their instantiations. It's not clear to me that biologists or doctors would find much use in conjecture about exactly how hard each disease is to cure versus how likely it is to occur, especially in worlds where you lack a fundamental understanding of the related phenomena. Possibly a closer analogy would be trying to troubleshoot the ways evolution can generate highly dangerous species like humans.
I think my attitude here leads into more takes about good and bad ways to discuss which research we should prioritise, but I'm not sure how to convey those concisely. Hopefully this is useful.
You're right that I think this is more useful as an unscientific way for (probably less technical) governance and strategy people to orient towards AI alignment than for actually carving up reality. I wrote the post with that audience and that framing in mind. By the same logic, your chart of how difficult various injuries and diseases are to fix would be very useful, e.g. as a poster in a military triage tent, even if it isn't useful for biologists or trained doctors.
However, while I didn't explore the idea much, I do think it is possible to cash this scale out as an actual variable related to system behavior, something along the lines of 'how adversarial are systems / how many extra bits of optimization over and above behavioral feedback are needed'. See here for further discussion of that. Evan Hubinger also talked in a bit more detail about what might be computationally different about ML models in low- vs high-adversarialness worlds here.
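As a toy illustration of the 'extra bits' framing (a sketch with made-up numbers, not a claim from either linked discussion): if behavioral feedback alone leaves an acceptable policy with probability p, the extra optimization pressure needed is roughly -log2(p) bits.

```python
import math

def extra_bits_needed(p_acceptable: float) -> float:
    """Bits of optimization needed beyond behavioral feedback, assuming
    feedback alone yields an acceptable policy with probability p_acceptable."""
    return -math.log2(p_acceptable)

print(extra_bits_needed(0.5))   # easy world: feedback nearly suffices, 1 extra bit
print(extra_bits_needed(1e-6))  # adversarial world: ~19.9 extra bits
```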
> Behavioural Safety is Insufficient
>
> Past this point, we assume, following Ajeya Cotra, that a strategically aware system which performs well enough to receive perfect human-provided external feedback has probably learned a deceptive human-simulating model instead of the intended goal. The later techniques have the potential to address this failure mode. (It is possible that this system would still underperform on sufficiently superhuman behavioral evaluations.)
There are (IMO) plausible threat models in which alignment is very difficult but we don't need to encounter deceptive alignment. Consider the following scenario:
Our alignment techniques (whatever they are) scale pretty well, as far as we can measure, even up to well-beyond-human-level AGI. However, in the year (say) 2100, the tails come apart. It gradually becomes pretty clear that what we want our powerful AIs to do and what they actually do don't generalize that well outside of the distribution on which we have been testing them so far. At this point, it is too late to roll them back, e.g. because the AIs have become incorrigible and/or power-seeking. The scenario may also have a more systemic character, with AI having already been so tightly integrated into the economy that there is no "undo button".
This doesn't assume either the sharp left turn or deceptive alignment, but I'd put it at least at level 8 in your taxonomy.
I'd put the scenario from Karl von Wendt's novel VIRTUA into this category.
I agree that this is a real possibility, and in the table I did say at level 2:
> Misspecified rewards / 'outer misalignment' / structural failures where systems don't learn adversarial policies [2] but do learn to pursue overly crude and clearly underspecified versions of what we want, e.g. the production web or WFLL1.
From my perspective, it is entirely possible to have an alignment failure that works like this and occurs at difficulty level 2. This is still an 'easier' world than the higher levels, because in those harder worlds you can get killed much more swiftly and earlier, with far less warning.
The reason I wouldn't put it at level 8 is that the models are presumably following a reasonable proxy for what we want if it generalizes well beyond the human level; that proxy is merely inadequate in ways which become apparent later on. Level 8 says not that some misgeneralization eventually occurs, but that rapid, unpredictable misgeneralization occurs around the human level, such that alignment techniques quickly break down.
In the scenario you describe, there'd be an opportunity to notice what's going on (after all, you'd have superhuman AI that more or less does what it's told to help you predict the future consequences of even more superhuman AI), and the failure occurs much later.
This helpful article by Holden Karnofsky also describes an increasing scale of alignment difficulty, although it's focused on a narrower range of the scale than mine (his scale covers 4-7) and is a bit more detailed about the underlying causes of the misalignment. Here's how my scale relates to his:
The "playing the training game" threat model, where systems behave deceptively only to optimize in-episode reward, corresponds to an alignment difficulty level of 4 or higher. This is because scalable oversight without interpretability tools (level 4) should be sufficient to detect and address this failure mode. The AI may pretend to be helpful during training episodes, but oversight exposing it to new situations will reveal its tendency toward deception.
(Situationally aware) Deception by default corresponds to a difficulty level of 6. If misaligned AIs form complex inner goals and engage in long-term deception, then scalable oversight alone will not catch intentionally deceptive systems that can maintain consistent deceitful behavior. Only interpretability tools used as part of the oversight process (level 6) give us the ability to look inside the system and identify deceptive thought patterns and tendencies.
Finally, the gradient hacking threat model, where AIs actively manipulate their training to prevent alignment, represents an alignment difficulty of 7 or higher. Even interpretability-based oversight can be defeated by sufficiently sophisticated systems that alter their internals to dodge detection.
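If it's useful, the correspondence described in the last three paragraphs can be collapsed into a small lookup (a sketch in Python; the shorthand names and the sufficiency check are mine, not Karnofsky's):

```python
# Minimum difficulty level implied by each of Karnofsky's threat models,
# per the mapping sketched above (keys are informal shorthand).
MIN_LEVEL = {
    "playing_the_training_game": 4,  # scalable oversight (level 4) suffices
    "deception_by_default": 6,       # needs interpretability-based oversight
    "gradient_hacking": 7,           # can defeat even interpretability-based oversight
}

def techniques_sufficient(threat_model: str, technique_level: int) -> bool:
    """True if techniques at this level of the scale address the threat model."""
    return technique_level >= MIN_LEVEL[threat_model]

print(techniques_sufficient("deception_by_default", 4))  # False
print(techniques_sufficient("deception_by_default", 6))  # True
```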
Regarding level 10 'impossible', here is a summary of arguments.
| 10 | Impossible | Alignment of a superintelligent system is impossible in principle. | Alignment is theoretically impossible, incoherent or similar. |
See Interpreting the Learning of Deceit for a practical proposal that should work up to and including level 7 on this scale (under certain assumptions that it discusses).
Nitpick (probably just me overthinking/stating the obvious) on levels 8-9 (I'm "on" level 8): I'd assume the point of this alignment research pre-SLT is specifically to create techniques that aren't broken by the SLT "obsoleting previous alignment techniques." I also think alignment techniques of the required soundness would happen to work on less intelligent systems too.
This is plausibly true for some solutions this research could produce, e.g. some new method of soft optimization, but it might not be true in all cases.
For levels 4-6 especially, the pTAI that's capable of e.g. automating alignment research or substantially reducing the risks of unaligned TAI might lack some of the expected 'general intelligence' of AIs post-SLT, and be too unintelligent for techniques that rely on it having complete strategic awareness, self-reflection, a consistent decision theory, the ability to self-improve, or other post-SLT characteristics.
One (unrealistic) example: if we have a ready-to-go technique for fully loading the human CEV into a superintelligence that works at levels 8 or 9, it may well not help at all with improving scalable oversight of non-superintelligent pTAI that is incapable of representing the full human value function.
Positively transformative AI systems could reduce the overall risk from AI by: preventing the construction of a more dangerous AI; changing something about how global governance works; instituting surveillance or oversight mechanisms widely; rapidly and safely performing alignment research or other kinds of technical research; greatly improving cyberdefense; persuasively exposing misaligned behaviour in other AIs and demonstrating alignment solutions; and through many other actions that incrementally reduce risk.
One common way of imagining this process is that an aligned AI could perform a ‘pivotal act’ that solves AI existential safety in one swift stroke. However, it is important to consider this much wider range of ways in which one or several transformative AI systems could reduce the total risk from unaligned transformative AI.
Is it important to consider the wide range of ways in which a chimp could beat Garry Kasparov in a single chess match, or the wide range of ways in which your father [or, for that matter, von Neumann] could beat the house after going to Vegas?
Sorry if I sound arrogant, but this is a serious question. Sometimes differences in perspective can be large enough to warrant asking such silly-sounding questions.
I am unclear where you think the problem is non-vanishingly-likely to come from for a superintelligence [which is smart enough to complete some technologically superhuman pivotal act], given a bunch of strictly less smart beings which existed previously and which the smarter ASI can fully observe and outmaneuver.
If you don't think the intelligence difference is likely to be big enough that "the smarter ASI can fully observe and outmaneuver" the previously-extant, otherwise-impeding thinkers, then I understand where our difference of opinion lies, and would be happy to make my extended factual case that that's not true.
The point is that in this scenario you have aligned AGI or ASI on your side. On the assumption that the other side has/is a superintelligence and you are not, then yes, this is likely a silly question, but I talk about 'TAI systems reducing the total risk from unaligned TAI'. So this is the chimp with access to a chess computer playing Garry Kasparov at chess.
And to be clear, in any slower takeoff scenario where there are intermediate steps from AGI to ASI, the analogy might not be quite that. In the scenarios where I talk about multiple actions, I'm usually assuming a moderate or slow takeoff where there are intermediately powerful AIs, not just a jump to ASI.
Yes, for me, though I'd give this a low probability, mostly due to the chimp breaking the computer. Most of the problems for the chimp fundamentally come down to its body structure being very unoptimized for tools, rather than it being absolutely less intelligent; humans have much better abilities to use tools than chimps do.
I'm inclined to conclude from this that you model the gulf between chimp society and human society as in general having mostly anatomical rather than cognitive causes. Is that correct?
I'd say that cognitive causes like coordination/pure smartness do matter, but yes, I'm non-trivially stating that a big cause of the divergence between chimps and humans is their differing anatomies, combined with an overhang from evolution, where evolution spent its compute shockingly inefficiently compared to us - though we haven't reduced the overhang to 0, and some exploitable overhangs from evolution still remain.
I'm modeling this as multiple moderate influences, plus one very large influence, adding up to a ridiculous divergence.
AFAIK, there's around as much evidence for non-human ape capability in the mirror test as there is for certain cetaceans and magpies, and the evidence on apes being somewhat capable of syntactical sign language is mixed to dubious. It's true the internal politics of chimp societies are complex and chimps have been known to manipulate humans very complexly for rewards, but on both counts the same could be said of many other species [corvids, cetaceans, maybe elephants] none of which I would put on approximate par with humans intelligence-wise, in the sense of crediting them with plausibly being viable halves of grandmaster-level chess centaurs.
I'm curious if you have an alternate perspective wrt any of these factors, or if your divergence from me here comes from looking for intelligence in a different place.
Also, I'm not sure what you mean by "overhang [from] evolution spen[ding] its compute shockingly inefficiently" in this context.
My alternate perspective here is that while IQ/intelligence actually matters, I don't think the difference is so large as to explain why chimp society was completely outclassed. I usually model mammals as ranging from 2 OOMs worse to a few times better in intelligence than us, though usually towards the worse end of that range, so other factors matter.
So the difference is I'm less extreme than this:
> none of which I would put on approximate par with humans intelligence-wise.
On the overhang from evolution spending its compute shockingly inefficiently: I was referring to this post, where evolution was way weaker than in-lifetime updating for the purposes of optimization.