I'm so tired of people needing to explain this. An important question for me: "Why didn't people just read Yudkowsky and Bostrom and understand the threat model?" It seems like many people did, but many still don't seem to get it.
I like the "aligning product VS aligning superintelligence" phrase.
> a model builds the next rung on the capability ladder
I wouldn't expect "a model" to be the right object to track as generalized capabilities compound towards superintelligence. The generalized objects I think it is correct to track are "outcome influencing systems" (OISs), most probably OISs hosted on the sociotechnical substrate: probably something like AI companies and/or coordinated clusters of personality self-replicators (PSRs), and whatever kind of OISs they develop into, which I expect will no longer feel right to call PSRs.
But otherwise I agree. There are many OISs in the environment with compounding capabilities and we basically don't understand their preferences or development paths.
> irreversible guardrail decay
This is a nice phrase. I would like it if we had more focus on what the guardrails even are and how to build sensible guardrails... and on reversing the decay of guardrails which have decayed reversibly. Probably useful to have a map of which kinds of guardrail decay are truly irreversible under which scenarios.
If we got a global plague that crippled global trade sufficiently that we couldn't maintain data centers anymore, that would probably rebuild many guardrails we thought were lost forever. Not that I want that. I want us to avoid dystopia. Avoiding dystopia with lesser dystopia isn't really what I'm hoping for.
I tend to treat the core as being that "superintelligence alignment" has to work in domains where humans aren't good supervisors. Being able to assume good human supervision allows you to do a lot more engineering right now.
I disagree. I think you've set up a strawman for an alignment target which is unreasonable and which a generally intelligent model would never be able to satisfy.
> "If you can use your intent aligned system to write code which jailbreaks other LLMs and enables them to do dangerous ML research, it is also not Aligned in the original sense."
This seems incorrect. Actions are not good or bad in isolation. The same action can be good when analyzed from one perspective and bad when analyzed from another.
Suppose I have an aligned, generally intelligent model. It wants to do the Right Thing at all times, cares deeply about what I want it to care deeply about, etc. Now suppose someone puts this model into a box where it has no access to the outside world and they tell it they are trying to do AI safety research. They are trying to understand how to defend against a specific jailbreak. In order to do this, they need to produce the jailbreak so that they can study it. What is the model supposed to do? On your account, because it's aligned, it should refuse completely, at all times, no matter what?
More broadly, I am struggling to see what evidence you have for why current alignment frameworks (among other things) would fail to transfer to more capable models. Suppose we have a "product" AI model which is aligned to a constitution and that this product AI model is the start of RSI. Is it clear that the later models don't abide by the constitution? What if each successive model holds those convictions deeper and deeper?
It may seem unreasonable within the current paradigm, but I think it's necessary to reach if we get strong superintelligence. If you want the whole system to remain undestroyed indefinitely, you need a system that can't be made to destroy it.
You're right that I didn't explain why each framework fails to plausibly scale to very strong models; maybe that's also worth its own post, because there are a lot of them and each has limits that you need to go a bit into the weeds to see.
I am struggling to see what evidence you have for why current alignment frameworks (among other things) would succeed in transferring to superintelligence.
I think there's probably a reason that merely claiming this isn't sticking: you haven't specified a mechanism that can stick in people's heads. "Product alignment != superintelligence alignment" fits in five words (kinda), but doesn't give a reason in five words. I'd rather say "Local alignment != asymptotic alignment".
Local alignment: your alignment bounds are tight enough that your alignment generalizes within a known regime.

Asymptotic alignment: you have some form of confidence that your alignment uncertainty goes down as the model does more work.

I claim you can have asymptotic alignment without having a formally certified proof of asymptotic alignment, but that it would be surprising to be able to have empirical asymptotic alignment without the model confidently telling you that it expects that someday, it or a successor will be able to give a formal proof of alignment. And to have strong empirical asymptotic alignment you'd need to have solved basically all the ongoing empirical alignment challenges.
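A very rough formalization sketch (illustrative only; $U(w)$ is my shorthand for your residual alignment uncertainty after the system has done $w$ units of work, and $W_0$ is the regime you've actually validated in):

$$\text{local alignment:}\quad U(w) \le \epsilon \ \text{ for all } w \in W_0$$

$$\text{asymptotic alignment:}\quad U(w) \text{ is (confidently) non-increasing in } w, \text{ with no restriction to } W_0$$

Nothing here is precise; it's just meant to make visible that the second property quantifies over work you haven't observed yet, which is why it's the one you need for superintelligence.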
I'm apparently quite bad at getting posts out the door, and so it's reference-class unlikely I'll get this one out the door, but I have a post cooking that would give an overview of the difference. I have an undercooked post I could hit publish on which is just me prompting Claude to explain the difference; added you to review that.
tl;dr: progress on making Claude friendly[1] is not the same as progress on making it safe to build godlike superintelligence. solving the former does not imply we get a good future.[2] please track the difference.
The term 'Alignment' was coined[3] to point to the technical problem of understanding how to build minds such that if they were to become strongly and generally superhuman, things would go well.
It has been increasingly adopted by frontier AI labs and much of the rest of the AI safety community to mean a much easier challenge, something like "having AIs that are empirically doing approximately what you ask them to do".[4]
If it's possible to use an intent-aligned product to build a research system which discovers a new paradigm and breaks your guardrails, then it is not Aligned in the original sense.
If you can use your intent aligned system to write code which jailbreaks other LLMs and enables them to do dangerous ML research, it is also not Aligned in the original sense.
Conflating progress on product alignment with progress on superintelligence alignment seems to be lulling much of the AI safety community into a false sense of security.
Why is Superintelligence Alignment less prominent?
Because product alignment is:
This is inconvenient!
It would be awesome if we could ride easy-to-evaluate profitable empirical feedback loops all the way to a great future. But this seems far from certain.[6]
Why do we need Superintelligence Alignment to survive?
Reality is allowed to be inconvenient. There's strong reason to expect that superhuman, situationally aware agents inside your experiment break some of the foundations the scientific process relies upon, such as:
In short: Your experimental subject is not a neutral substrate, but a strategic actor more capable than you.
If we don't have guarantees of maintaining safety properties each time a model builds the next rung on the capability ladder, we're rolling dice on irreversible guardrail decay.[7] And we're going to be rolling huge numbers of those dice, very rapidly, as the feedback loop spins up.
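To make the dice metaphor concrete (numbers purely illustrative, not an estimate): if each rung independently carries even a small probability $p$ of irreversible guardrail decay, then over $N$ rungs the chance of at least one such event is

$$1 - (1-p)^N,$$

so with $p = 0.1\%$ per rung and $N = 1000$ rungs you're already at roughly a $63\%$ chance of at least one irreversible failure.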
As we're headed up the exponential, we're going to need techniques which generalize to strongly superhuman agents – ones which correctly believe they could defeat all of humanity. Product-aligned AIs might help with that work, but the type of research they would need to automate needs to look more like technical philosophy and reliably avoiding slop, not just avoiding scheming and passing product-alignment benchmarks.[8]
Only a tiny fraction of the field of AI safety is focused on these big picture bottlenecks,[9] due to a mix of funding incentives and it being more rewarding for most people to do empirical science.[10]
When you see people enthusiastically talking about how much progress we have on 'Alignment', please track (and ask!) whether they're talking about aligning products or aligning superintelligence.
If you're friends with Claude, please read and consider this post first: Protecting humanity and Claude from rationalization and unaligned AI
This is not to say product alignment can't help, or that there is no path to victory which goes through product alignment, just that you need to solve a different problem (superintelligence alignment) at some stage of your plan.
I think by Stuart Russell in ~2014.
Sometimes with self-awareness of this history, like Paul's Intent Alignment, but that's increasingly rare.
Getting Product-Aligned AI is a convergent subgoal of many possible goals, and ultimate ends may be easy to hide behind convergent subgoals.
And even if possible in theory, practice by the current players under race conditions looks far from the level of competence needed to actually pull it off.
Capabilities generalize in a way alignment doesn't, because reality gives you feedback directly on your capability (you can or can't do a task), whereas some specific system needs to give feedback on alignment, and if that system is a proxy for what you actually want, you get eaten at higher power levels.
If this doesn't ring true to you, please click through to the linked posts.
And even for those people focusing on theory, there's a lot more focus on basic science of ML than trying to backchain the conceptual engineering needed to survive superintelligence. I'd estimate somewhere in the mid tens of people globally are focusing on what looks like the main cruxes.
Response to Jan Leike, evhub, Boaz, etc. Thanks for feedback and copyediting to @Luc Brinkman, @Mateusz Bagiński, @Claude+