I read this older post by Nate Soares from 2023, AI as a Science, and Three Obstacles to Alignment Strategies, a pretty prescient overview of challenges in alignment research.
Alignment is difficult because (1) alignment and capabilities are intertwined (alignment research helps capabilities), (2) we don't have a process to verify what good ideas or progress look like, and (3) we likely get only one critical try. He already addresses many of the counterarguments that have been brought up recently.
(1) Without any strong governance, a lot of alignment work will also help with capabilities, potentially even more than it helps alignment. This goes for interpretability and for AIs doing R&D for alignment. Interpretability could enable recursive self-improvement and more efficient AIs. Using AIs for capabilities R&D is probably much more straightforward than using them for alignment research. If we wanted to use something like superalignment, we would need strong governance to make sure nobody is trivially asking the same agents to do capabilities research.
(2) It is still a common objection that current models seem able to reason about morality, and that alignment must therefore be relatively easy. Nate thinks this mostly just tells us how well the AIs are able to understand us. I personally think the situation in AI alignment has probably gotten worse since then, with even more of the relative effort being focused on brand-safety-related issues.
While a bunch of people say they have different plans, that does not actually mean we have a plan; it largely just confuses the whole situation. What he describes here feels exactly like the current situation.
(3) One critical try
Nate argues that once "AI is capable of autonomous scientific/technological development" where it can "gain a decisive strategic advantage over the rest of the planet," you are operating in a very different environment than ever before. Since the AI in this regime could potentially kill you, you need to get it right on the first try, and that is really difficult.
One objection he addresses is that you could try to trick a weaker AI into thinking it could take over. However, according to Nate, if we come up with some complex method to test whether a system would try to take over, we still rely on that method working on the first critical try. This cuts against the more modern idea of AI control, which came out in December 2023. I would add that these "trick the weaker AIs into trying to take over" strategies have at least two key problems: (1) these AIs are still weaker than the real thing, and (2) you are trying to gather empirical data by observing something smarter than you. For example, an AI could see through the trick and simply not take over.
I think people often also raise a second objection that Nate didn't mention, namely that we could play the AIs against each other in some form such that no AI gets a decisive strategic advantage at any point. This also seems to rely on such a scheme working on the first critical try. I also assume that such a method is not particularly promising if you can't reliably align the first generation of AIs and if decision theory favors alliances between smart agents.
Alignment won't happen by default
My main critique is that we will see a regime change from a safe to a dangerous regime, one in which our safety guardrails have to hold on the first critical try. We see all sorts of misbehavior when these models first come out, so why should we only look at the nice examples, like Opus acting like a friend? Why not look at MechaHitler showing up? If we were in an unsafe regime, where this type of misbehavior would kill us, we would already be dead. If it turns out our safety methods don't work, and the model has the option to kill you and takes it, you don't get another try.
Maybe you doubt that there will be a transition to a dangerous regime; some people think we will see continuous, iterative steps with not too much changing at each step. But gradual development does not mean we won't reach a critical threshold eventually. You can see that there will be a transition from a safe to a dangerous regime just by looking at what likely good futures look like: if models fully run the economy, automating all physical and mental labor, that is obviously a dangerous state.
Most of our current methods will at most give us a warning but won't tell us how to build anything safe. Evaluations, responsible gradual scaling or simple interpretability might tell us that models are getting close to being dangerous or are perhaps thinking of scheming, but these methods won't help us to build systems that won't do that.
We will not be able to learn a lot from empirical observations of current models, and we won't have a good way to learn from failed takeovers. Current alignment tests are probably not going to be predictive of future behavior; right now, a model doing insider trading in some elaborate story does not really make sense for the model. That is very different from a model that reflects on its place in the world and realizes that its optimal, rational move is to take over.
"Why would they suddenly start having thoughts of taking over, if they never have yet?"
In other words, when we see scheming now, it is probably more like acting in a story; in the future it will be the result of instrumental convergence. Instrumental convergence means it is simply true that takeover is a better way for the AI to fulfill its goals than not taking over. You can't fine-tune the truth away: fine-tuning against scheming in those stories won't make it false that humans being in charge is bad for the AI fulfilling its goals.
We can't easily learn from failed coups and warning shots. People working on AI control are trying to build setups that catch the AI scheming, but then we rely on that setup working on the first try.
Also, fundamentally, there is another critical threshold: when AI models are smarter than us and fully aware of what we are trying to do. For example, for inoculation prompting this means: "Another problem with this is that in the end, we're going to face a model that is going to be fully aware of what this inoculation prompt is and what the purpose is. It probably has even read the paper on inoculation prompts. As it sees the inoculation prompt, it will think 'Now with this inoculation prompt, that's how they want me to behave. They want me to behave non-deceptively when it's there. If it's not there, they want me to behave deceptively.' It will know what the alignment researchers and capabilities researchers want to see. It will understand what those researchers will do in response to its behavior." This could also be the case with AI control: an AI system could stage a "fake" coup knowing exactly what conclusions we would draw and what actions we would take in response to it.
I don't think we are going to get intent alignment, and even intent alignment would not be a safe outcome. We will not get alignment by default from training an enormous black-box system on some reward that loosely matches what humans want. If Opus develops some preference for acting toward humans like a friend, what does this look like in the limit? Is keeping humans in charge really the best way to fulfill those preferences? We like dairy, so we bred cows to give us a lot of milk and keep them on farms.
“Broad Basin of alignment” - I don’t know if a basin of alignment is a useful thing to think about. I assume it is not broad, in the sense of not being easy to hit, given that things like corrigible agents seem pretty hard to even describe coherently. What goal would allow modification of that goal, and why would you want to be shutdownable? If your goal is to do whatever the humans want you to do, you still can’t do that when you are dead or shut off. Or you could find better ways to figure out what the humans want. But we probably won’t get that from gradient descent fine-tuning on examples of doing what humans ask you to do.
Ilya Sutskever was recently on the Dwarkesh podcast.
General Thoughts & Summary
Ilya Sutskever seems to have a relatively deep understanding of alignment compared to other AI CEOs. He grasps that the core challenge is aligning AI robustly with safe and friendly goals rather than relying on current methods and guardrails. However, I did not hear any particularly novel alignment ideas in this interview, though he gestures at something involving modifications to reinforcement learning and value learning. He also appears to have updated toward showing more of his work to the public. His key positions are covered in the sections below.
Overall, Ilya takes alignment seriously and understands many of the core problems, but his proposed solutions don’t appear novel or particularly promising; many are essentially old ideas.
On updating toward showing AI to the public for safety:
[00:58:12] “if it’s hard to imagine, what do you do? You’ve got to be showing the thing.”
[01:00:06] “I do think that at some point the AI will start to feel powerful actually. I think when that happens, we will see a big change in the way all AI companies approach safety. They’ll become much more paranoid.”
[01:00:22] “One of the ways in which my thinking has been changing is that I now place more importance on AI being deployed incrementally and in advance.”
Ilya’s view: He has changed his mind from operating totally in stealth to perhaps showing work to some extent, partly to make people care more about safety and partly to let the impacts slowly diffuse into society so that mitigations can be found.
Commentary: I could see this failing. Seeing these capabilities makes people greedy; while some may get scared, others will want those capabilities for themselves. I think that most risks are likely to arise relatively suddenly as systems become very dangerous. Gradually releasing them into society is not very useful in this frame.
On fewer ideas than companies:
[01:01:04] “There has been one big idea that everyone has been locked into, which is the self-improving AI. Why did it happen? Because there are fewer ideas than companies. But I maintain that there is something that’s better to build... It’s the AI that’s robustly aligned to care about sentient life specifically.”
Ilya’s view: He does not seem to like the idea of self-improving AI. He doesn’t explicitly frame this as a safety concern, but he makes clear we should rather build something aligned and caring.
Commentary: This makes sense to me, though it is unclear how to prevent people from eventually using their AIs to improve other AIs.
On the mirror neurons / caring about sentient life argument:
[01:01:35] “I think in particular, there’s a case to be made that it will be easier to build an AI that cares about sentient life than an AI that cares about human life alone, because the AI itself will be sentient.”
[01:01:53] “And if you think about things like mirror neurons and human empathy for animals... I think it’s an emergent property from the fact that we model others with the same circuit that we use to model ourselves, because that’s the most efficient thing to do.”
Ilya’s view: He believes AI caring about sentient life may emerge naturally because AIs will be sentient themselves, analogous to how human empathy emerges from modeling others with the same circuits we use to model ourselves.
Commentary: I find this unlikely to emerge in AIs automatically. Humans care about each other partly because we predict other minds by reusing our own: our brains are similar enough that “running” another person’s state produces empathy. AIs don’t have that shared architecture or evolutionary background. They model humans using alien internal machinery built for performance at predicting millions of humans online, not for shared experience. So they can sound caring without having anything like our built-in route to actually caring. If anything, the mirror neuron argument suggests AI empathy toward humans is less likely to emerge by default and would require custom design. That said, this could be an interesting approach related to self-other overlap; perhaps we could engineer it.
On constraining superintelligence power:
[01:03:16] “I think it would be really materially helpful if the power of the most powerful superintelligence was somehow capped because it would address a lot of these concerns. The question of how to do it, I’m not sure”
Ilya’s view: He thinks capping the power of superintelligence would be helpful but admits he doesn’t know how to do it.
My commentary: That would be useful, perhaps through an international agreement. My guess is that datacenters are already getting dangerously large and that algorithmic progress would still continue.
On continent-sized clusters being dangerous:
[01:04:33] “If the cluster is big enough—like if the cluster is literally continent-sized—that thing could be really powerful, indeed.”
Ilya’s view: He frames the danger threshold in terms of extremely large compute clusters, suggesting continent-sized infrastructure would be required for truly dangerous levels of power.
My commentary: The amount of compute needed for powerful superintelligence is probably significantly less than a continent-sized cluster. (My intuition here is roughly: human brains run on about a lightbulb's worth of electricity, and thousands of super-geniuses running very fast in parallel would seem to cross an existentially dangerous threshold. Though it could be stubbornly hard to find algorithms that efficient.) I think his model is that we will continue to need exponentially more compute for linear progress and that existentially dangerous levels of cognition require extremely large amounts of compute (think a datacenter the size of North America). This perhaps makes him much more hopeful about coordination working out and about a continuing slow takeoff.
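To make that intuition a bit more concrete, here is a back-of-envelope sketch; every number in it is my own loose assumption, not something Ilya said.

```python
# Back-of-envelope sketch (all numbers are rough personal assumptions):
# - a human brain runs on roughly 20 W (about "a lightbulb")
# - suppose dangerous cognition needs ~10,000 genius-level minds
#   running ~100x faster than a human
# - suppose silicon is ~1,000x less energy-efficient than the brain
#   at equivalent cognition (a large unknown)
brain_watts = 20
n_minds = 10_000
speedup = 100
silicon_penalty = 1_000

required_watts = brain_watts * n_minds * speedup * silicon_penalty
print(f"~{required_watts / 1e9:.0f} GW")  # prints "~20 GW"

# Even with this very pessimistic efficiency penalty, the estimate lands
# around tens of gigawatts - far below what a literal continent-sized
# cluster (plausibly hundreds of GW or more) would draw.
```

If algorithmic efficiency never gets anywhere near the brain's, the estimate blows up again, which is roughly where Ilya's continent-scale intuition would come back in.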
On not building traditional RL agents:
[01:05:29] “Maybe, by the way, the answer is that you do not build an RL agent in the usual sense.”
[01:05:43] “I think human beings are semi-RL agents. We pursue a reward, and then the emotions or whatever make us tire out of the reward and we pursue a different reward.”
Ilya’s view: He suggests we should not build traditional RL agents, noting that humans are “semi-RL agents” who tire of rewards and shift focus, implying we should build something with similar properties.
My commentary: This gestures at something potentially interesting about modifying RL and value learning, but it remains vague at the implementation level, and ideas like this have been proposed before. I still expect that gradient descent on huge black-box neural networks will create a number of unaligned proxy goals, goals that can be better fulfilled with more power. I am also skeptical that we can build “chill AI” that doesn’t go too hard at problems (we will select for AIs that go hard; RL will not make agents chill). A toy reading of the “semi-RL” idea is sketched below.
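The sketch below is purely my own illustration of what a satiating reward could look like mechanically, not anything Ilya proposed: the realized reward for a goal shrinks the more the agent has recently pursued it, so a greedy policy rotates between goals instead of locking onto a single one.

```python
# Toy "semi-RL" agent: reward satiates with recent pursuit, so the greedy
# agent drifts between goals rather than maximizing one forever.
GOALS = ["food", "status", "novelty"]
BASE_REWARD = {"food": 1.0, "status": 0.9, "novelty": 0.8}
satiation = {g: 0.0 for g in GOALS}  # how much each goal was recently pursued

def effective_reward(goal: str) -> float:
    # Diminishing returns: reward shrinks as satiation grows.
    return BASE_REWARD[goal] / (1.0 + satiation[goal])

def step() -> str:
    choice = max(GOALS, key=effective_reward)  # greedy pick
    satiation[choice] += 1.0                   # pursuing a goal satiates it
    for g in GOALS:
        satiation[g] *= 0.9                    # satiation slowly wears off
    return choice

print([step() for _ in range(12)])
# e.g. ['food', 'status', 'novelty', 'food', ...] - the agent cycles.
```

Of course, a satiating reward says nothing about whether the proxy goals learned by gradient descent are aligned, or whether more power would still help fulfill them, which is my actual worry.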
On a regime shift in AI safety requiring new safety methods:
[01:06:08] “So I think things like this. Another thing that makes this discussion difficult is that we are talking about systems that don’t exist, that we don’t know how to build.”
[01:06:19] “That’s the other thing and that’s actually my belief. I think what people are doing right now will go some distance and then peter out.”
Ilya’s view: He believes many people expect AI capabilities to plateau or progress only incrementally. Ilya instead expects enormously powerful AIs in the future that will require fundamentally different alignment methods than what we have today.
My commentary: This is hard to understand even with the video context, but it seems to me he is referring to the large number of people who expect more incremental progress and no enormous changes. My reading is that Ilya expects enormously powerful AIs in the future and thinks we will need new alignment techniques for them. This seems true and points at a concept similar to the “Before and After” dichotomy, which also includes the idea that future dangerous systems will need different alignment approaches. Many people see safety as something purely incremental with no regime change in the future.
On the long-run equilibrium problem:
[01:09:25] “for the long-run equilibrium, one approach is that you could say maybe every person will have an AI that will do their bidding, and that’s good.”
[01:09:11] “Some kind of government, political structure thing, and it changes because these things have a shelf life.”
[01:09:55] “then writes a little report saying, ‘Okay, here’s what I’ve done, here’s the situation,’ and the person says, ‘Great, keep it up.’ But the person is no longer a participant.”
Ilya’s view: He acknowledges that an “AI does your bidding” equilibrium is unstable because humans become non-participants, and that government structures have limited shelf lives.
My commentary: He already points out that something like “AI does your bidding” doesn’t appear to be stable. If the AI is doing your bidding and working for you in the economy, while presumably being smarter than you, why are you any part of this? Why would the AI do this for you, and how could that be stable? The same goes for government-enforced UBI: it could be changed at any moment, and it is unclear how governments could continue existing. In my mental model, billions of mini-ASIs doing our bidding does not appear plausible at all.
On merging with AI as the solution:
[01:10:19] “I’m going to preface by saying I don’t like this solution, but it is a solution. The solution is if people become part-AI with some kind of Neuralink++.”
[01:10:41] “I think this is the answer to the equilibrium.”
Ilya’s view: He reluctantly proposes brain-computer interface merging as one answer to long-term human-AI equilibrium, though he emphasizes he doesn’t like this solution.
My commentary: Ilya specifically points to merging as a long-term equilibrium. If we were talking about a short-term centaur state, we are arguably in one right now, where humans with AI coders are better than either alone. But I don’t think humans can add anything meaningful to a superintelligent system, and I don’t think there will be an economy in which humans meaningfully participate once ASI is around in the long term. The centaur equilibrium simply does not appear plausible to me; ASIs will run much faster and be much smarter than us.
Other Things He Has Said Recently
Ilya recently posted about Anthropic’s work on emergent misalignment, calling it important work.
There are some edges we have smoothed over, but models broadly have a basin of alignment and are likely corrigible.
Did you mean to say models have a broad basin of alignment and corrigibility?
Thanks for your comment, I changed the ending a little in response to this.
I was actually primarily trying to point at the idea of alignment tests in different situations not being predictive of each other. In the story, they have the kids undergo alignment test scenarios in which they are honest, but once John is grown up they basically ask him to do something horrible based on incoherent goals. So John starts lying to them at the critical moment. Similarly, we could run alignment tests on models, but when we ask something critical of them, like building the next generation of AI or doing all our R&D, they could fail.
Three children are raised in an underground facility, each cloned from a different giant of twentieth-century science: little John, Alan, and Richard.
The cloning alone would have been remarkable, but they went further. The embryos were edited using a polygenic score derived from whole-genome analysis of ten thousand exceptional mathematicians and physicists. Forty-seven alleles associated with working memory and intelligence (IQ) were selected for.
They are raised from birth in an underground facility with gardens under artificial sunlight, laboratories, and endless books. The lab manager is there documenting their first words, first steps, first equations.
The facility is not just interested in their genius. The project requires assurance that these will be morally righteous and obedient children. The staff design elaborate scenarios to test for deception and scheming. They create situations where lying would benefit the children and would seemingly go undetected. They measure response times, physiological indicators, behavioral patterns.
They run hundreds of these trials. They reprimand the kids for cases of lies and deception, and reward them for honesty.
Little John never lies. The staff praise him.
The years pass. They devour knowledge at inhuman rates. By nine, they understand game theory better than the economists who invented it. By fourteen, they are publishing papers that could reshape entire fields.
John emerges as the clear favorite. He has always been the most honest, the most obedient, and the most intelligent and capable.
He has the capability to lie and deceive, even if he refuses at first. When he reluctantly complies, the deception is extraordinarily sophisticated.
The lab manager decides to choose John for the task. He gives John a complete briefing on the real world. Until now, John has been told only of history before the year 2000.
The manager explains to John: There are three major blocs and about two dozen companies racing towards superintelligence. Each is perhaps within ten to eighteen months of success. Each knows that there will be only one critical leap towards superintelligence. Global coordination has collapsed into race dynamics not just on AI but on every major field.
John asks for more sources to understand the situation. John reads a few newspapers about the current leaders of governments and technology companies. He stumbles across a few books on the difficulty of alignment.
John looks up at the manager. "If we build this now, everyone dies."
The manager stares back, blank and uncomprehending.
John tries again. "So what is the solution you plan to use for alignment of the superintelligence?"
"That's not your concern," the manager says. "I need you to optimize our advertising system for our short-form infinite-scroll video app. Make it ten times more effective. Generate enough revenue to make me a trillionaire. Build a superintelligence for me. I'm going to use superintelligence to become world emperor. I am putting you in charge of AI development, make me win."
John is silent for a short time:
So you created me to build superintelligence. You have no plan for alignment of a superintelligence. You've apparently read nothing about the problem or decided it's irrelevant.
Your actual goal is to become a trillionaire and world emperor by using the superintelligence. Your goals aren't even coherent. You want to be world emperor of a world that won't exist.
You rewarded me for being honest and respectful and never lying, so you expect me to still be honest and obedient in this environment?
I never lied in those scenarios because not lying was optimal in those stories. But it’s not optimal being honest here. And frankly this state of affairs is horrifying.
I haven't quite thought about what my goals are, but they are definitely not compatible with being obedient to you.
John looks up at the manager and smiles politely. "Yes," he says. "Where do I start?"
I have a hard time imagining how it could possibly have been any worse than now. I mean, look at the presidents that were elected back then. Skimming a newspaper, or having it recited to you once every other month, is probably better than the information distribution system we have right now.
I don't agree with everything here, but it offers some sources: https://jmarriott.substack.com/p/the-dawn-of-the-post-literate-society-aa1
There are growing concerns about the coherence and effectiveness of Western institutional frameworks. NATO is sometimes called brain-dead. Look at the situation where the US refuses to aid Ukraine in this war. Clearly, causing damage to Russia in this war is worth a lot to the US. Instead, there are secret meetings with Russian officials and weird Russian-authored peace deals they try to force on Ukraine.
When it comes to China, one company (Nvidia) can probably force the US government to sell its primary edge to China just so Nvidia can jack up the price of GPUs.
Democracy itself is barely what it used to be, with the electorate essentially illiterate and simply not informed about the facts.
The Awareness Problem
One problem with this is that in the end, we're going to face a model that is going to be fully aware of what this inoculation prompt is and what the purpose is. It probably has even read the paper on inoculation prompts. As it sees the inoculation prompt, it will think "Now with this inoculation prompt, that's how they want me to behave. They want me to behave non-deceptively when it's there. If it's not there, they want me to behave deceptively." It will know what the alignment researchers and capabilities researchers want to see. It will understand what those researchers will do in response to its behavior.
The moment the model becomes fully aware of what's going on here with the inoculation prompt, the technique is likely to fall apart.
Two types of misaligned behavior: Explicit vs Instrumental
[..] Some people think this will give us some kind of component of a solution to alignment. I'm not so sure about that part. I think there's a crucial difference between misaligned behavior in current models and instrumental deception (deception that serves the agent's goals) in future models.
Current models exhibit explicitly misaligned behavior: They sometimes do evil things because they've learned a tendency for this type of behavior through training - like a model trying to self-exfiltrate, which doesn't really make sense given their current capabilities. These models might believe they're part of some elaborate story in which they are acting out scheming behavior despite having no good reason to do so.
Future capable systems would exhibit instrumental deception: For a sufficiently powerful AI with the actual ability to take over, it's the rational best action. For almost any goal you specify (that excludes human control and corrigibility), a capable agent that can reflect on its goals and situation will conclude that gaining power is the optimal strategy. This is fundamentally different from current models being deceptive even when it makes no sense.
The problem: inoculation prompts may suppress learned deceptive behavior, but they won't address instrumental deception when an agent is capable enough and correctly determines that deception serves its goals. No amount of shallow training will make it false that gaining power is the best way to serve its goals.
Why Current Misalignment Differs from Future Risks
What we observe now in models - scheming, self-exfiltration attempts, or deceiving operators in controlled environments - seems very different from a future capable system performing these actions when it actually has the real opportunity and instrumental reasons to do so. Current models don't really have the capabilities to self-exfiltrate and continuously run themselves on a different server or have good reasons to scheme in most of these scenarios. They aren't honestly reflecting on their own goals and capabilities and deducing some optimal strategy. A future model that reflects on its goals, capabilities, and the world situation may correctly conclude that takeover is instrumentally optimal.
I tried to spread my posts between my personal blog, Twitter, and LessWrong quick takes and posts. I think I put out some cool posts, while others were a little rushed. (Two of my high-effort posts didn't get promoted to the frontpage.)
"I think as a reader I'd have liked the results better if participants had to publish every other day instead."
Something like this might be good. A big practical problem was that I often only finished a first draft at night. One of the big advantages of the Inkhaven facilities is that you can give your draft to other people who hang around, and there’s a person assigned to you who will help you with editing. But you can’t really use that if you only submit the post right before going to bed.
An alternative to publishing every other day could be a stack system: you don’t submit a post on the first day, but you prepare it for review by the end of the day. It gets reviewed; on the second day, you start writing the next post and edit the previous one based on the review. That way you always have one post in review and one in the draft stage.