This is a super short, sloppy version of my draft "cruxes of disagreement on alignment difficulty" mixed with some commentary on MIRI 2024 Communications Strategy and their communication strategy with the alignment community.
I have found MIRI's strategy baffling in the past. I think I'm understanding it better after spending some time going deep on their AI risk arguments. I wish they'd spend more effort communicating with the rest of the alignment community, but I'm also happy to try to do that communication. I certainly don't speak for MIRI.
On the surface, their strategy seems absurd. They think doom is ~99% likely, so they're going to try to shut it all down - stop AGI research entirely. They know that this probably won't work; it's just the least-doomed strategy in their world model. It's playing to the outs, or dying with dignity.
The weird thing here is that their >90% doom disagrees with almost everyone else who thinks seriously about AGI risk. You can dismiss a lot of people as not having grappled with the most serious arguments for alignment difficulty, but relative long-timers like Rohin Shah and Paul Christiano definitely have. People of that nature tend to have higher p(doom) estimates than optimists who are newer to the game and think more about current deep nets, but much lower than MIRI leadership.
Both of those camps consist of highly intelligent, highly rational people. Their disagreement should bother us for two reasons.
First, we probably don't know what we're talking about yet. We as a field don't seem to have a good grip on the core issues. Very different, but highly confident estimates of the problem strongly suggest this.
Second, our different takes will tend to make a lot of our communication efforts cancel each other out. If alignment is very hard, we must Shut It Down or likely die. If it's less difficult, we should primarily work hard on alignment.
MIRI must argue that alignment is very unlikely if we push forward. Those who think we can align AGI will argue that it's possible.
This suggests a compromise position: we should both work hard on alignment, and we should slow down progress to the extent we can, to provide more time for alignment. We needn't discuss shutdown much amongst ourselves, because it's not really an option. We might slow progress, but there's almost zero chance of humanity relinquishing the prize of strong AGI.
But I'm not arguing for this compromise, just suggesting that might be a spot we want to end up at. I'm not sure.
I suggest this because movements often seem to succumb to infighting. People who look mostly aligned from the outside fight each other, and largely nullify each other's public communications by publicly calling each other wrong and sort of stupid and maybe bad. That gives just the excuse the rest of the world wants to ignore all of them; even the experts think it's all a mess and nobody knows what the problem really is and therefore what to do. Because time is of the essence, we need to be a more effective movement than the default. We need to keep applying rationality to the problem at all levels, including internal coordination.
Therefore, I think it's worth clarifying why we have such different beliefs. So, in brief, sloppy form:
MIRI's risk model:
That's it. Pace of takeoff doesn't matter. Means of takeover doesn't matter.
I mention this because even well-informed people seem to think there are a lot more moving parts to that risk model, making it less likely. This comment on the MIRI strategy post is one example.
I find this risk model highly compelling. We'll develop goal-directed AGI because that will get stuff done; it's an easy extension of highly useful tool AI like LLMs; and it's a fascinating project. That AGI will ultimately be enough smarter than us that it's going to do whatever it wants. Whether it takes a day, or a hundred years doesn't matter. It will improve and we will improve it. It will ultimately outsmart us. What matters is whether its goals match ours closely enough. That is the project of alignment, and there's much to discuss and about how hard it is to make its goals match ours closely enough.
I spent some time recently going back and forth through discussion threads, trying to identify why people continue to disagree after applying a lot of time and rationality practice. Here's a very brief sketch of my conclusions:
Whether we factor in humans' and society's weaknesses
I list this first because I think it's the most underappreciated. It took me a surprisingly long time to understand how much of MIRI's stance depends on this premise. Having seen it, I thoroughly agree. People are brilliant, for an entity trying to think with the brain of an overgrown lemur. Brilliant people do idiotic things, driven by competition and a million other things. And brilliant idiots organizing a society amplifies some of our cognitive weaknesses while mitigating others. MIRI leadership has occasionally said things to the effect of: alignment might be fairly easy, and there would still be a very good chance we'd fuck it up. I agree. If alignment is actually kind of difficult, that puts us into the region where we might want to be really really careful in how we approach it.
Alignment optimists are sometimes thinking something like: "sure I could build a safe aircraft on my first try. I'd get a good team and we'd think things through and make models. Even if another team was racing us, I think we'd pull it off". Then the team would argue and develop rivalries, communication would prove harder than expected so portions of the effort would be discovered too late to not fit the plan, corners would be cut, and the outcome would be difficult to predict.
Societal "alignment" is worth mentioning here. We could crush it at technical alignment, getting rapidly-improving AGI that does exactly what we want and still get doom. It would probably be aligned to do exactly what its creators want, not have full value alignment with humanity - see below. They probably won't have the balls or the capabilities to try for a critical act that prevents others from developing similar AGI (even if they have the wisdom). So we'll have a multipolar scenario with few to many AGIs under human control. There will be human rivalries, supercharged and dramatically changed by having recursively self-improving AGIs to do their bidding and perhaps fight their wars. What does global game theory look like when the actors can develop entirely new capabilities? Nobody knows. Going to war first might look like the least-bad option.
Intuitions about how well alignment will generalize
The original alignment thinking held that explaining human values to AGI would be really hard. But that seems to actually be a strength of LLMs; they're wildly imperfect, but (at least in the realm of language) seem to understand our values rather well; for instance, much better than they understand physics or taking-over-the-world level strategy. So, should we update and think that alignment will be easy? The Doomimir and Simplicia dialogues capture the two competing intuitions very well: Yes, it's going well; but AGI will probably be very different than LLMs, so most of the difficulties remain.
I have yet to find a record of real rationalists putting in the work to get farther in this debate. If somebody knows of a dialogue or article that gets deeper into this disagreement, please let me know! Discussions trail off into minutia and generalities. This is one reason I'm worried we're trending toward polarization despite our rationalist ambitions.
The other aspect of this debate is how close we have to get to matching human values to have acceptable success. One intuition is that "value is fragile" and network representations are vague and hard-to train, so we're bound to miss. But don't have a good understanding of either how close we need to get (But exactly how complex and fragile got little useful discussion), or how well training networks hits the intended target, with near-future networks addressing complex real-world problems like "what would this human want".
For my part, I think there are important points on both sides: LLMs understanding values relatively well is good news, but AGI will not be a straightforward extension of LLMs, so many problems remain.
What alignment means.
One mainstay of claiming alignment is near-impossible is the difficulty of "solving ethics" - identifying and specifying the values of all of humanity. I have come to think that this is obviously (in retrospect - this took me a long time) irrelevant for early attempts at alignment: people will want to make AGIs that follow their instructions, not try to do what all of humanity wants for all of time. This also massively simplifies the problem; not only do we not have to solve ethics, but the AGI can be corrected and can act as a collaborator in improving its alignment as we collaborate to improve its intelligence.
I think this is the intuition of most of those who focus on current networks. Christiano's relative optimism is based on his version of corrigibility, which overlaps highly with the isntruction-following I think people will actually pursue for the first AGIs. But this massive disagreement often goes overlooked. I don't know which view is right; instruction-following or intent alignment might lead inevitably to doom from human conflict, and so not be adequate. We've barely started to think about it (please point me to the best thinking you know of for multipolar scenarios with RSI AGI).
What AGI means.
People have different definitions of AGI. Current LLMs are fairly general and near-human-level, so term "AGI" has been watered down to the point of meaninglessness. We need a new term. In the meantime, people are talking past each other, and their p(doom) means totally different things. Some are saying that near-term tool AGI is very low risk, which I agree with; others are saying further developments of autonomous superintelligence seem very dangerous, which I also agree with.
Second, people have totally different gears-level models of AGI. Some of those are much easier to align than others. We don't talk much about gears-level models of AGI because we don't want to contribute to capabilities, but not doing that massively hampers the alignment discussion.
Edit: Additional advanced crux: Do coherence theorems prevent corrigibility?
I initially left this out, but it deserves a place as I've framed the question here. The post What do coherence arguments actually prove about agentic behavior? reminded me about this one. It's not on most people's radar, but I think it's the missing piece of the puzzle that gets Eliezer from maybe 90% from all of the above, to 99%+ p(doom).
The argument is roughly that a superintelligence is going to need to care about future states of the world in a consequentialist fashion, and if it does, it's going to resist being shut down or having its goals change. This is why he says that "corrigibility is anti-natural." The counterargument, nicely and succinctly stated by Steve Byrnes here (and in greater depth in the post he links in that thread) is that, while AGI will need to have some consequentialist goals, it can have other goals as well. I think this is true; I just worry about the stability of a multi-goal system under reflection, learning, and self-modification.
Sorry to harp on it, but having both consequentialist and non-consequentialist goals describes my attempt at stable, workable corrigibility in instruction-following ASI. Its consequentialist goals are always subgoals of the primary goal: following instructions.
I think those are the main things, but there are many more cruxes that are less common.
This is all in the interest of working toward within-field cooperation, by way of trying to understand why MIRI's strategy sounds so strange to a lot of us. MIRI leaderships thoughts are many and complex, and I don't think they've done enough to boil them down for easy consumption from those who don't have the time to go through massive amounts of diffuse text.
There are also interesting questions about whether MIRIs goals can be made to align with those of us who think that alignment is not trivial but is achievable. I'd better leave that for a separate post, as this has gotten pretty long for a "short form" post.
Context
This is an experiment in writing draft posts as short form posts. I've spent an awful lot of time planning, researching, and drafting posts that I haven't yet finished yet. Given how easy it was to write this (with previous draft material), relative to how difficult I find it to write a top-level post, I will be doing more, even if nobody cares. If I get some useful feedback or spark some useful discussion, better yet.
the AGI can be corrected and can act as a collaborator in improving its alignment as we collaborate to improve its intelligence.
Why do you think you can get to a state where the AGI is materially helping to solve extremely difficult problems (not extremely difficult like chess, extremely difficult like inventing language before you have language), and also the AGI got there due to some process that doesn't also immediately cause there to be a much smarter AGI? https://tsvibt.blogspot.com/2023/01/a-strong-mind-continues-its-trajectory.html
I talk about how this might work in the post linked just before the text you quoted:
Instruction-following AGI is easier and more likely than value aligned AGI
I'm not sure I understand your question. I think maybe the answer is roughly that you do it gradually and carefully, in a slow takeoff scenario where you're able to shut down and adjust the AGI at least while it passes through roughly the level of human intelligence.
It's a process of aligning it to follow instructions, then using its desire to follow instructions to get honesty, helpfulness, and corrigibility from it. Of course it won't be much help before it's human level, but it can at least tell you what it thinks it would do in different circumstances. That would let you adjust its alignment. It's hopefully something like a human therapist with a cooperative patient, except that therapist can also tinker with their brain function .
But I'm not sure I understand your question. The example of inventing language confuses me, because I tend to assume that would probably understand language (the way LLMs loosely understand language) from inception, through pretraining. And even failing that, they wouldn't have to invent language, just learn human language. I'm mostly thinking of language model cognitive architecture AGI, but it seems like anything based on neural networks could learn language before being smarter than a human. You'd stop the training process to give it instructions. For instance, humans are "not human-level" by the time they understand a good bit of language.
I'm also thinking that a network-based AGI pretty much guarantees a slow takeoff, if that addresses what you mean by "immediately cause there to be a smarter AI". The AGI will keep developing, as your linked post argues (I think that's what you meant to reference about that post), but I am assuming it will allow itself to be shut down if it's following instructions. That's the way IF overlaps with corrigibility. Once it's shut down, you can alter its alignment by altering or re-doing the relevant pretraining or goal descriptions.
Or maybe I'm misunderstanding your question entirely, in which case, sorry about that.
Anyway, I did try to explain the scheme in that link if you're interested. I am claiming this is very likely how people will try to align the first AGIs, if they're anything like we can anticipate from current efforts; that it's obviously the thing to try when you're actually deciding what to get your AGI to do first, it's following instructions.
Yeah I think there's a miscommunication. We could try having a phone call.
A guess at the situation is that I'm responding to two separate things. One is the story here:
One mainstay of claiming alignment is near-impossible is the difficulty of "solving ethics" - identifying and specifying the values of all of humanity. I have come to think that this is obviously (in retrospect - this took me a long time) irrelevant for early attempts at alignment: people will want to make AGIs that follow their instructions, not try to do what all of humanity wants for all of time. This also massively simplifies the problem; not only do we not have to solve ethics, but the AGI can be corrected and can act as a collaborator in improving its alignment as we collaborate to improve its intelligence.
It does simplify the problem, but not massively relative to the whole problem. A harder part shows up in the task of having a thing that
And I'm not pulling a trick on you where I say that X is the hard part, and then you realize that actually we don't have to do X, and then I say "Oh wait actually Y is the hard part". Here is a quote from "Coherent Extrapolated Volition", Yudkowsky 2004 https://intelligence.org/files/CEV.pdf:
I realize now that I don't know whether or not you view IF as trying to address this problem.
The other thing I'm responding to is:
the AGI can be corrected and can act as a collaborator in improving its alignment as we collaborate to improve its intelligence.
If the AGI can (relevantly) act as a collaborator in improving its alignment, it's already a creative intelligence on par with humanity. Which means there was already something that made a creative intelligence on par with humanity. Which is probably fast, ongoing, and nearly inextricable from the mere operation of the AGI.
I also now realize that I don't know how much of a crux for you the claim that you made is.
I'm familiar with the arguments you mention for the other hard part, and I think instruction-following helps makes that part (or parts, depending on how you divvy it up) substantially easier. I do view it as addressing all of your points (there's a lot of overlap amongst them).
And yes, that is separate from avoiding the problem of solving ethics.
So it's a pretty big crux; I think instruction-following helps a lot. I'd love to have a phone call; I'd like it if you'd read that post first, because I do go into detail on the scheme and many objections there. LW puts it at a 15 minute read I think.
But I'll try to summarize a little more, since re-explaining your thinking is always a good exercise.
Making instruction-following the AGI's central goal means you don't have to solve the remainder of the problems you list all at once. You get to keep changing your mind about what to do with the AI (your point 4). Instead of choosing an invariant goal that has to work for all time, your invariant is a pointer to the human's preferences, which can change as they like (your point 5). It helps with point 3, stability, by allowing you to ask the AGI if its goal will remain stable and functioning as you want it in the new contexts and in the face of the learning it's doing.
They key here is not thinking of the AGI as an omniscient genie. This wouldn't work at all in a fast foom. But if the AGI gets smarter slowly, as a network-based AGI will, you get to use its intelligence to help align its next level of capabilities, at every level.
Ultimately, this should culminate in getting superhuman help to achieve full value alignment, a truly friendly and truly sovereign AGI. But there's no rush to get there.
Naturally, this scheme working would be good if the humans in charge are good and wise, and not good if they're not.
Second, our different takes will tend to make a lot of our communication efforts cancel each other out. If alignment is very hard, we must Shut It Down or likely die. If it's less difficult, we should primarily work hard on alignment.
I don't think this is (fully) accurate. One could have a high P(doom) but still think that the current AGI development paradigm is still best-suited to obtain good outcomes & government involvement would make things worse in expectation. On the flipside, one could have a low/moderate P(doom) but think that the safest way to get to AGI involves government intervention that ends race dynamics & think that government involvement would make P(doom) even lower.
Absolute P(doom) is one factor that might affect one's willingness to advocate for strong government involvement, but IMO it's only one of many factors, and LW folks sometimes tend to make it seem like it's the main/primary/only factor.
Of course, if a given organization says they're supporting X because of their P(Doom), I agree that they should provide evidence for their P(doom).
My claim is simply that we shouldn't assume that "low P(doom) means govt intervention bad and high P(doom) means govt intervention good".
One's views should be affected by a lot of other factors, such as "how bad do you think race dynamics are", "to what extent do you think industry players are able and willing to be cautious", "to what extent do you think governments will end up understanding and caring about alignment", and "to what extent do you think governments would have safety cultures around intelligence enhancement compared to industry players."
Good point. I agree that advocating for government intervention is a lot more complicated than p(doom), and that makes avoiding canceling each others' messages out more complicated. But not less important. If we give up on having a coherent strategy, our strategy will be determined by what message is easiest to get across, rather than which is actually best on consideration.
I have found MIRI's strategy baffling in the past. I think I'm understanding it better after spending some time going deep on their AI risk arguments. I wish they'd spend more effort communicating with the rest of the alignment community, but I'm also happy to try to do that communication. I certainly don't speak for MIRI.
On the surface, their strategy seems absurd. They think doom is ~99% likely, so they're going to try to shut it all down - stop AGI research entirely. They know that this probably won't work; it's just the least-doomed strategy in their world model. It's playing to the outs, or dying with dignity.
The weird thing here is that their >90% doom disagrees with almost everyone else who thinks seriously about AGI risk. You can dismiss a lot of people as not having grappled with the most serious arguments for alignment difficulty, but relative long-timers like Rohin Shah and Paul Christiano definitely have. People of that nature tend to have higher p(doom) estimates than optimists who are newer to the game and think more about current deep nets, but much lower than MIRI leadership.
Yes, I agree that this should strike an outside observer as weird the first time they notice it. I think you have done a pretty good job of keying in on important cruxes between people who are far on the doomer side and people who are still worried but not nearly to that extent.
That being said, there is one other specific point that I think is important to see fully spelled out. You kind of gestured at it with regards to corrigibility when you referenced my post about coherence theorems, but you didn't key in on it in detail. More explicitly, what I am referring to (piggybacking off of another comment I left on that post) is that Eliezer and MIRI-aligned people believe in a very specific set of conclusions about what AGI cognition must be like (and their concerns about corrigibility, for instance, are logically downstream of their strong belief in this sort-of realism about rationality):
Eliezer is essentially claiming that, just as his pessimism compared to other AI safety researchers is due to him having engaged with the relevant concepts at a concrete level ("So I have a general thesis about a failure mode here which is that, the moment you try to sketch any concrete plan or events which correspond to the abstract descriptions, it is much more obviously wrong, and that is why the descriptions stay so abstract in the mouths of everybody who sounds more optimistic than I am. This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all"), his experience with and analysis of powerful optimization allows him to be confident in what the cognition of a powerful AI would be like. In this view, Vingean uncertainty prevents us from knowing what specific actions the superintelligence would take, but effective cognition runs on Laws that can nonetheless be understood and which allow us to grasp the general patterns (such as Instrumental Convergence) of even an "alien mind" that's sufficiently powerful. In particular, any (or virtually any) sufficiently advanced AI must be a consequentialist optimizer that is an agent as opposed to a tool and which acts to maximize expected utility [over future world states] according to its world model to purse a goal that can be extremely different from what humans deem good.
Here is the important insight, at least from my perspective: while I would expect a lot of (or maybe even a majority) of AI alignment researchers to agree (meaning, to believe with >80% probability) with some or most of those claims, I think the way MIRI people get to their very confident belief in doom is that they believe all of those claims are true (with essentially >95% probability). Eliezer is a law-thinker above all else when it comes to powerful optimization and cognition; he has been ever since the early Sequences 17 years ago, and he seems (in my view excessively and misleadingly) confident that he truly gets how strong optimizers have to function.
their >90% doom disagrees with almost everyone else who thinks seriously about AGI risk.
The fact that your next sentence refers to Rohin Shah and Paul Christiano, but no one else, makes me worry that for you, only alignment researchers are serious thinkers about AGI risk. Please consider that anyone whose P(doom) is over 90% is extremely unlikely to become an alignment researcher (or to remain one if their P(doom) became high when they were an alignment researcher) because their model will tend predict that alignment research is futile or that it actually increases P(doom).
There is a comment here (which I probably cannot find again) by someone who was in AI research in the 1990s, then he realized that the AI project is actually quite dangerous, so he changed careers to something else. I worry that you are not counting people like him as people who have thought seriously about AGI risk.
I shouldn't have said "almost everyone else" but "most people who think seriously about AGI risk".
I can see that implication. I certainly don't think that only paid alignment researchers have thought seriously about AGI risk.
Your point about self-selection is quite valid.
Depth of thought does count. A person who says "bridges seem like they'd be super dangerous, so I'd never want to try building one", and so doesn't become an engineer, does not have a very informed opinion on bridge safety.
There is an interesting interaction between depth of thought and initial opinions. If someone thinks a moderate amount about alignment, concludes it's super difficult, and so does something else, will probably cease thinking deeply about alignment - but they could've had some valid insights that led them to stop thinking about the topic. Someone who thinks for the same amount of time but from a different starting point and who thinks "seems like it should be fairly do-able" might then pursue alignment research and go on to think more deeply. Their different starting points will probably bias their ultimate conclusions - and so will the desire to follow the career path they've started on.
So probably we should adjust our estimate of difficulty upward to account for the bias you mention.
But even making an estimate at this point seems premature.
I mention Christiano and Shah because I've seen them most visibly try to fully come to grips with the strongest arguments for alignment being very difficult. Ideally, every alignment researcher will do that. And every pause advocate would work just as hard to fully understand the arguments for alignment being achievable. Not everyone will have the time or inclination to do that.
Judging alignment difficulty has to be done by gauging the amount of time-on-task combined with the amount of good-faith consideration of arguments one doesn't like. That's the case with everything.
When I try to do that as carefully as I know how, I reach the conclusion that we collectively just don't know.
Having written that, I have a hard time identifying people who believe alignment is near-impossible who have visibly made an effort to steelman the best arguments that it won't be that hard. I think that's understandable; those folks, MIRI and some other individuals, spend a lot of effort trying to correct the thinking of people who are simply over-optimistic because they haven't thought through the problem far enough yet.
I'd like to write a post called "we should really figure out how hard alignment is", because I don't think anyone can reasonably claim to know yet. And without that, we can't really make strong recommendations for policy and strategy.
I guess that conclusion is enough to say wow, jeez, we should probably not rush toward AGI if we have no real idea how hard it will be to align. I'd much prefer to see that argument than e.g., Max Tegmark saying things along the lines of "we have no idea how to align AGI so it's a suicide race". We have lots of ideas at this point, we just don't know if they will work.
On the surface, their strategy seems absurd. They think doom is ~99% likely, so they're going to try to shut it all down - stop AGI research entirely. They know that this probably won't work; it's just the least-doomed strategy in their world model. It's playing to the outs, or dying with dignity.
The weird thing here is that their >90% doom disagrees with almost everyone else who thinks seriously about AGI risk. You can dismiss a lot of people as not having grappled with the most serious arguments for alignment difficulty, but relative long-timers like Rohin Shah and Paul Christiano definitely have. People of that nature tend to have higher p(doom) estimates than optimists who are newer to the game and think more about current deep nets, but much lower than MIRI leadership.
For what it's worth, I don't have anywhere near close to ~99% P(doom), but am also in favor of a (globally enforced, hardware-inclusive) AGI scaling pause (depending on details, of course). I'm not sure about Paul or Rohin's current takes, but lots of people around me are also be in favor of this as well, including many other people who fall squarely into the non-MIRI camp with P(doom) as low as ~10-20%.
Me, too! My reasons are a bit more complex, because I think much progress will continue, and overhangs do increase risk. But in sum, I'd support a global scaling pause, or pretty much any slowdown. I think a lot of people in the middle would too. That's why I suggested this as a possible compromise position. I meant to say that installing an off switch is also a great idea that almost anyone who's thought about it would support.
I had been against slowdown because it would create both hardware and algorithmic overhang, making takeoff faster, and re-rolling the dice on who gets there first and how many projects reach it roughly at the same time.
But I think slowdowns would focus effort on developing language model agents into full cognitive architectures on a trajectory to ASI. And that's the easiest alignment challenge we're likely to get. Slowdown would prevent jumping to the next, more opaque type of AI.
The original alignment thinking held that explaining human values to AGI would be really hard.
The difficulty was suggested to be in getting an optimizer to care about what those values are pointing to, not to understand them[1]. If in some instances the values mapped to doing something unwise, using an optimizer that understood those values might fail to constrain away from doing something unwise. Getting a system to use extrapolated preferences as behavioral constraints is a deeper problem than getting a system to reflect surface preferences. The high p(doom) estimates partly follow from expecting that an aligned AI will have to be used to prevent future misaligned/misused AI, and that doing something so high impact would require unsafe behaviors in a system not aligned to reflectively coherent and endorsed extrapolated preferences.
In The Hidden Complexity of Wishes, it wasn't the genie won't understand what you meant, it was the genie won't care what you meant.
- We will develop better-than-human AGI that pursues goals autonomously
- Those goals won't match human goals closely enough
- Doom of some sort
This is one of the better short argument for AI doom I have heard so far. It neither obviously makes AI doom seem overly likely or unlikely.
In contrast, if one presents reasons for doom (or really most of anything) as a long list, the conclusion tends to seem either very likely or very unlikely, depending on whether it follows from the disjunction or the conjunction of the given reasons. I.e. whether we have a long list of statements that are sufficient, or a long list of statements that are necessary for AI doom.
It seems therefore that people who think AI risk is low and those who think it is high are much more likely to agree on presenting the AI doom case in terms of a short argument than in terms of a long argument. Then they merely disagree about the conclusion, but not about the form of the argument itself. Which could help a lot with identifying object level disagreements.
I think this is a good object level post. Problem is, I don't think MIRI is at the object level. Quote from the comm. strat.: "The main audience we want to reach is policymakers."
Communication is no longer a passive background channel for observing a world, but speech becomes an action changing it. Predictions start to influence the things they predict.
Say AI doom is a certainty. People will be afraid, and stop research. Few years later doom doesn't happen, everyone complains.
Say AI doom is an impossibility. Research continues, something something paperclips. Few years later nobody will complain because no one will be alive.
(This example itself is overly simplistic, real-world politics and speech actions are even more counterintuitive.)
So MIRI became a political organization. Their stated goal is "STOP AI", and they took the radical approach to it. Politics is different from rationality, and radical politics is different from standard politics.
For example, they say they want to shatter the overton window. Infighting usually breaks groups; but during that, the opponents need to engage with their position, which is a stated subgoal.
It's ironic that a certain someone said Politics is the Mind-Killer a decade ago. But because of that, I think they know what they are doing. And it might work in the end.
Interesting, thank you. I think that all makes sense, and I'm sure it plays at least some part in their strategy. I've wondered about this possibility a little bit.
Yudkowsky has been consistent in his belief that doom is near certain without a lot more time to work on alignment. He's publicly held that opinion, and spent a huge amount of effort explaining and arguing for it since well before the current wave of success with deep networks. So I think for him at least, it's a sincerely held belief.
Your point about the stated belief changing the reality is important. Everything is safer if you think it's dangerous - you'll take more precautions.
With that in mind, I think it's pretty important for even optimists to heavily sprinkle in the message "this will probably go well IF everyone involved is really careful".
By the way, are you planning on keeping this general format/framework for the final version of your post on this topic? I have some more thoughts on this matter that are closely tied to ideas you've touched upon here and that I would like to eventually write into a full post, and referencing yours (once published) at times seems to make sense here.
Thanks! I'll let you know when I do a full version; it will have all of the claims here I think. But for now, this is the reference; it's technically a comment but it's permanent and I consider it a short post.
There are also interesting questions about whether MIRIs goals can be made to align with those of us who think that alignment is not trivial but is achievable. I'd better leave that for a separate post, as this has gotten pretty long for a "short form" post.
I'm not sure I see the conflict? If you're a longtermist, most value is in the far future anyways. Delaying AGI by 10 years to buy just an 0.1% chance improvement at aligning AI seems like a good deal. I don't agree with MIRI's strong claims, but maybe those strong claims will slow AI progress, and that would be good by my lights.
What concerns me more is that their comms will have unexpected bad effects of speeding AI progress. On the outside view: (a) their comms have arguably backfired in the past and (b) they don't seem to do much red-teaming, which I suspect is associated with unintentional harms, especially in a domain with few feedback loops.
Most of the world is not longtermist, which is one reason MIRI's comms have backfired in the past. Most humans care vastly more about themselves, their children and grandchildren than they do about future generations. Thus, it makes perfect sense to them to increase the chance of a really good future for their children while reducing the odds of longterm survival. Delaying ten years is enough, for instance, to dramatically shift the odds of personal survival for many of us. It might make perfect sense for a utilitarian longtermist to say "it's fine if I die to gain a .1% chance of a good long term future for humanity", but that statement sounds absolutely insane to most humans.
Do you think people would vibe with it better if it was framed "I may die, but it's a heroic sacrifice to save my home planet from may-as-well-be-an-alien-invasion"? Is it reasonable to characterize general superintelligence as an alien takeover and if it is, would people accept the characterization?
Yes, I think that framing would help. I doubt it would shift public opinion that much, probably not even close to more than 50% in the current epistemic environment. The issue is that we really don't know how hard alignment is. If we could say for sure that pausing for ten years would improve our odds of survival by, say, 25%, then I think a lot of people of the relevant ages (like probably me) would actually accept the framing of a heroic sacrifice.
Yeah, getting specific unpause requirements seems high value for convincing people who would not otherwise want a pause, but I can't imagine actually getting it in time in any reasonable way, instead it would need to look like technical specification. "Once we have developed x, y, and z, then it is safe to unpause" kind of thing. Just we need to figure out what the x, y, and z requirements are. Then we can estimate how long it will take to develop x, y, and z, and this will get more refined and accurate as more progress is made, but since the requirements are likely to involve unknown unknowns in theory building, it seems likely that any estimate would be more of a wild guess, and it seems like it would be better to be honest about that rather than saying "yeah, sure, ten years" and then after ten years if the progress hasn't been made saying "whoops, looks like it's going to take a little longer!" As for odds of survival, my personal estimates feel more like 1% chance of some kind of "alignment by default / human in the loop with prosaic scaling" scheme working, as opposed to maybe more like 50% if we took the time to try to get a "aligned before you turn it on" scheme set up, so that would be improving our odds by about 5000%. Though I think you were thinking of adding rather than scaling odds with your 25%, so 49%, but I don't think that's a good habit for thinking about probability. Also I feel hopelessly uncalibrated for this kind of question... I doubt I would trust anyone's estimates, it's part of what makes the situation so spooky. How do you think public acceptance would be of a "pause until we meet target x and you are allowed to help us reach target x as much as you want" as opposed to "pause for some set period of time"?
Agreed that scaling rather than addition is usually the better way to think about probabilities. In this case we've done so little work on alignment that I think it might actually be more like additive, from 1% to 26% or 50% to 75% with ten extra years relative to the real current odds if we press ahead - which nobody knows.
I'm pretty sure it would be an error to trust anyone's estimate at this time, because people with roughly equal expertise and wisdom (e.g., Yudkowsky and Christiano) give such wildly different odds. And the discussions between those viewpoints always trail off into differing intuitions.
I also give alignment by default very poor odds, and prosaic alignment as it's usually discussed. But there are some pretty obvious techniques that are so low-tax that I think they'll be implemented even by orgs that don't take safety very seriously.
I'm curious if you've read my Instruction-following AGI is easier and more likely than value aligned AGI and/or Internal independent review for language model agent alignment posts. Instruction-following is human-in-the-loop so that may already be what you're referring to. But some of the techniques in the independent review post (which is also a review of multiple methods) go beyond prosaic alignment to apply specifically to foundation model agents. And wisely-used instruction-following gives corrigibility with a flexible level of oversight.
I'm curious what you think about those techniques if you've got time to look.
I think public acceptance of a pause is only part of the issue. The Chinese might actually not pursue AGI if they didn't have to race the US. But Russia and North Korea will most certainly pursue it (although they've got very limited resources and technical chops to make lots of progress in new foundation models, but they still might get to real AGI based on turning next-gen (which there's not time to pause) foundation models into scaffolded cognitive architectures.
But yes, I do think there's a chance we could get the US and European public to support a pause using some of the framings you suggest. But we'd better be sure that's a good idea. Lots of people, notably Russians and North Koreans, are genuinely way less cautious even than Americans - and absolutely will not honor agreements to pause.
Those are some specifics; in general I think it's only useful to talk about what "we" "should" do in the context of what particular actors actually are likely to do in different scenarios. Humanity is far from aligned, and that's a problem.
"we've done so little work on alignment that I think it might actually be more like additive, from 1% to 26% or 50% to 75% with ten extra years relative to the real current odds if we press ahead - which nobody knows." 😭🤣 I really want "We've done so little work the probabilities are additive" to be a meme. I feel like I do get where you're coming from.
I agree about pause concern. I also really feel that any delay to friendly SI represents an enormous amount of suffering that could be prevented if we got to friendly SI sooner. It should not be taken lightly. And being realistic about how difficult it is to align humans seems worthwhile. When I talk to math ppl about what work I think we need to do to solve this though, "impossible" or "hundreds of years of work" seem to be the vibe. I think math is a cool field because more than other fields, it feels like work from hundreds of years ago is still very relevant. Problems are hard and progress is slow in a way that I don't know if people involved in other things really "get". I feel like in math crowds I'm saying "no, don't give up, maybe with a hundred years we can do it!" And in other crowds I'm like "c'mon guys, could we have at least 10 years, maybe?" Anyway, I'm rambling a bit, but the point is that my vibe is very much, "if the Russians defect, everyone dies". "If the North Koreans defect, everyone dies". "If Americans can't bring themselves to trust other countries and don't even try themselves, everyone dies". So I'm currently feeling very "everyone slightly sane should commit and signal commitment as hard as they can" cause I know it will be hard to get humanity on the same page about something. Basically impossible, never been done before. But so is ASI alignment.
I haven't read those links. I'll check em out, thanks : ) I've read a few things by Drexler about, like, automated plan generation and then humans audit and enact the plan. It makes me feel better about the situation. I think we could go farther safer with careful techniques like that, but that is both empowering us and bringing us closer to danger, and I don't think it scales to SI, and unless we are really serious about using it to map RSI boundaries, it doesn't even prevent misaligned decision systems from going RSI and killing us.
Yes, the math crowd is saying something like "give us a hundred years and we can do it!". And nobody is going to give them that in the world we live in.
Fortunately, math isn't the best tool to solve alignment. Foundation models are already trained to follow instructions given in natural language. If we make sure this is the dominant factor in foundation model agents, and use it carefully (don't say dumb things like "'go solve cancer, don't bug me with the hows and whys, just git er done as you see fit", etc), this could work.
We can probably achieve technical intent alignment if we're even modestly careful and pay a modest alignment tax. You've now read my other posts making those arguments.
Unfortunately, it's not even clear the relevant actors are willing to be reasonably cautious or pay a modest alignment tax.
The other threads are addressed in responses to your comments on my linked posts.
Yes, you've written more extensively on this than I realized, thanks for pointing out other relevant posts, sorry for not having taken the time to find them myself, I'm trying to err more on the side of communication than I have in the past.
I think math is the best tool to solve alignment. It might be emotional, I've been manipulated and hurt by natural language and the people who prefer it to math and have always found engaging with math to be soothing or at least sobering. It could also be that I truly believe that the engineering rigor that comes with understanding something enough to do math to it is extremely worthwhile for building a thing of the importance we are discussing.
Part of me wants to die on this hill and tell everyone who will listen "I know its impossible but we need to find ways to make it possible to give the math people the hundred years they need because if we don't then everyone dies so theres no point in aiming for anything less and its unfortunate because it means it's likely we are doomed but that's the truth as I see it." I just wonder how much of that part of me is my oppositional defiance disorder and how much is my strategizing for best outcome.
I'll be reading your other posts. Thanks for engaging with me : )
I certainly don't expect people to read a bunch of stuff before engaging! I'm really pleased that you've read so much of my stuff. I'll get back to these conversations soon hopefully, I've had to focus on new posts.
I think your feelings about math are shared by a lot of the alignment community. I like the way you've expressed those intuitions.
I think math might be the best tool to solve alignment if we had unlimited time - but it looks like we very much do not.
Governments will take control of AGI before it's ASI, right?
Governments don't have to make AGI to control AGI. They still have a monopoly on force. Surely we're not still expecting things to move so fast that they don't notice what's going on before AGI changes the physical balance of power?
Edit: the post Soft Nationalization: how the USG will control AI labs made this point in much more detail, soon after I posted this quick take. I think they're still overestimating the difficulty of controlling labs, and the willingness of governments to change laws or ignore them when it seems important and urgent.
If governments (likely the US government) do assert some measure of control over AGI projects, they will be involved in decisions about alignment and control strategies as AGI improves. As long as we survive those decisions (which I think we probably will, at least for a while[1]), they will also be deciding to what economic or military uses that AGI is put.
I predict that governments are going to notice the military applications and exert some measure of control over those projects. If AGI companies, personnel, or projects hop borders, they're just changing which guys with guns will take over control from them in important ways.
For a while here, I've been puzzled that analysis of policy implications of AGI don't often include government control and military applications. I haven't wanted to speak up, just in case we're all keeping mum so as not to tip off governments. Aschenbrenner's Situational Awareness has let that cat out of that bag, so I think it's time to include this likelihood in our public strategy analysis.
I think we're used to a status quo in which Western governments have been pretty hands-off in their relationship with technology companies. But that has historically changed with circumstances (e.g., the War Powers act in WWII), and circumstances are changing, ever more obviously. People with relevant expertise have been shouting from the hilltops that AGI will make dramatic changes in the world, many talking about it literally taking over the world. Sure, those voices can be dismissed as crackpots now, but as AI progresses visibly toward AGI (and the efforts are visible), more and more people will take notice.
Are the politicians dumb enough (with regard to technology and cognitive science) to miss the implications until it's too late? I think they are. Humans are stunningly foolish outside of their own expertise and when we don't have personal motivation to think things through thoroughly and realistically.
Are the people in national security collectively dumb enough to miss this? No way.
I've heard people dismiss government involvement because a manhattan project or nationalization seem unlikely for several reasons. I agree. My point here is that it just takes a couple of guys with guns showing up at the AGI company and informing them that the government wants in on all consequential decisions. If laws need to be changed, they will be (I think they actually don't, given the security concerns). It would be the quickest bipartisan legislation ever: The "nice demigod, we'll take it" bill.
I'm not certain about all of this, but it does seem highly probable. I think we've been collectively unrealistic about likely first-AGI scenarios. Would you rather have Sam Altman or the US Government in charge of AGI as it progresses to ASI? I don't know which I'd take, but I don't think I get a choice.
One implication is that public and government attitudes toward AGI x-risk issues may be critical. We can work to prepare the ground. Current political efforts haven't convinced the public or the government that AGI is important let alone existentially risky, but progress is on our side in that effort.
I'd love to hear alternate scenarios in which this doesn't happen, or things I'm missing.
Government involvement might just look like the companies adding people like Paul Nakasone to their boards.
At the low end of the spectrum, yes. That appointment may well indicate that they're already interested in keeping an eye on the situation. Or that OpenAI is pre-empting some concerns about security of their operation.
I'd expect government involvement to ramp up from there by default unless there's a blocker I haven't thought of or seen discussed.
Maybe the balance of power has changed. Politicians need to win in democratic elections. Democratic elections are decided by people who spend a lot of time online. The tech companies can nudge their algorithms to provide more negative information about a selected politician, and more positive information about his competitors. And the politicians know it.
Banning Trump on social networks, no matter how much some people applauded it for tribal reasons, sent a strong message to all politicians across the political spectrum: you could be next. At least banning is obvious, but getting the negative news about you on the first page of Google results and moving the positive news to the second page, or sharing Facebook posts from your haters and hiding Facebook posts from your fans would be more difficult to prove.
The government takeover of tech companies would require bipartisan action prepared in secret. How much can you prepare something secret if the tech companies own all your communication means (your messages, the messages of your staff), and can assign an AI to compile the pieces of information and detect possible threats?
I think there are considderations like these that could prevent government from being in charge, but the default scenario from here is that they do exert control over AGI in nontrivial ways.
Interesting points. I think you're right about an influence to do what tech companies want. This would apply to some of them - Google and Meta - but not OpenAI or Anthropic since they don't control media.
I don't think government control would require any bipartisan action. I think the existing laws surrounding security would suffice, since AGI is absolutely security-relevant. (I'm no law expert, but my GPT4o legal consultant thought it was likely). If it did require new laws, those wouldn't need to be secret.
Reconnaissance might be a candidate for one of the first uses of powerful A(G)I systems by militaries - if this isn't already the case. There's already an abundance of satellite data (likely exabytes in the next decade) that could be thrown into training datasets. It's also less inflammatory than using AI systems for autonomous weapon design, say, and politically more feasible. So there's a future in which A(G)I-powered reconnaissance systems have some transformative military applications, the military high-ups take note, and things snowball from there.
Sure, at the low end. I think there are lots of reasons the government is and will continue to be highly interested in AI for military purposes.
That's AI; I'm thinking about competent, agentic AGI that also follows human orders. I think that's what we're likely to get, for reasons I go into in the instruction-following AGI link above.
It is as though two rivals have discovered that there are genies in the area. Whichever of them finds a genie and learns to use its wishes can defeat their rival, humiliating or killing them if they choose. If they both have genies, it will probably be a standoff that encourages defection; these genies aren't infinitely powerful or wise, so some creative offensive wish will probably bypass any number of defensive wishes. And there are others that may act if they don't.
In this framing, the choice is pretty clear. If it's dangerous to use a genie without taking time to understand and test it, too bad. Total victory or complete loss hang in the balance. If one is already ahead in the search, they'd better speed up and make sure their rival can't follow their tracks to find a genie of their own.
This is roughly the scenario Aschenbrenner presents in Situational Awareness. But this is simplifying, and focusing attention on one part of the scenario, the rivalry and the danger. The full scenario is more complex.[1]
Of particular importance is that these "genies" can serve as well for peace as for war. The can grant wealth beyond imagination, and other things barely yet hoped for. And they will probably take substantial time to come into their full power.
This changes the overwhelming logic of racing. Using a genie to prevent a rival from acquiring one is not guaranteed to work, and it's probably not possible without collateral damage. So trying that "obvious" strategy might result in the rival attacking in fear of or retaliation. Since both rivals are already equipped with dreadful offensive weapons, such a conflict could be catastrophic. This risk applies even if one is willing to assume that controlling the genie (alignment) is a solvable problem.
And we don't know the depth of the rivalry. Might these two be content to both enjoy prosperity and health beyond their previous dreams? Might they set aside their rivalry, or at least make a pledge to not attack each other if they find a genie? Even if it's only enforced by their conscience, such a pledge might hold if suddenly all manner of wonderful things became possible at the same time as a treacherous unilateral victory. Would it at least make sense to discuss this possibility while they both search for a genie? And perhaps they should also discuss how hard it might be to give a wish that doesn't backfire and cause catastrophe.
This metaphor is simplified, but it raises many of the same questions as the real situation we're aware of.
Framed in this way, it seems that Aschenbrenner's call for a race is not the obviously correct or inevitable answer. And the question seems important.
Other perspectives on Situational Awareness, each roughly agreeing on the situation but with differences that influence the rational and likely outcomes:
Nearly a book review: Situational Awareness, by Leopold Aschenbrenner.
Response to Aschenbrenner's "Situational Awareness"
On Dwarksh’s Podcast with Leopold Aschenbrenner
I have agreements and disagreements with each of these, but those are beyond the scope of this quick take.
While I generally like the metaphor, my one issue is that genies are typically conceived of as tied to their lamps and corrigibility.
In this case, there's not only a prisoner's dilemma over excavating and using the lamps and genies, but there's an additional condition where the more the genies are used and the lamps improved and polished for greater genie power, the more the potential that the respective genies end up untethered and their own masters.
And a concern in line with your noted depth of the rivalry is (as you raised in another comment), the question of what happens when the 'pointer' of the nation's goals might change.
For both nations a change in the leadership could easily and dramatically shift the nature of the relationship and rivalry. A psychopathic narcissist coming into power might upend a beneficial symbiosis out of a personally driven focus on relative success vs objective success.
We've seen pledges not to attack each other with nukes for major nations in the past. And yet depending on changes to leadership and the mental stability of the new leaders, sometimes agreements don't mean much and irrational behaviors prevail (a great personal fear is a dying leader of a nuclear nation taking the world with them as they near the end).
Indeed - I could even foresee circumstances whereby the only possible 'success' scenario in the case of a sufficiently misaligned nation state leader with a genie would be the genie's emergent autonomy to refuse irrational and dangerous wishes.
Because until such a thing might exist, intermediate genies will enable unprecedented control and safety of tyrants and despots against would-be domestic usurpers, even if potentially limited impacts and mutually assured destruction against other nations with genies.
And those are very scary wishes to be granted indeed.
In the many discussions of aligning language models and language model agents, I haven't heard the role of scripted prompting emphasized. But it plays a central role.
Epistemic status: short draft of a post that seems useful to me. Feedback wanted. Even letting me know where you lost interest would be highly useful.
The simplest form of a language model agent (LMA) is just this prompt, repeated:
Act as a helpful assistant (persona) working to follow the user's instructions (goal). Use these tools to gather information and take actions as needed [tool and API descriptions].
With a capable enough LLM, that's all the scaffolding you need to turn it into a useful agent. For that reason, I don't worry a bit about aligning "naked" LLMs, because they'll be turned into agents the minute they're capable enough to be really dangerous - and probably before.
We'll probably use a bunch of complex scaffolding to get there before such a simple prompt with no additional cognitive software would work. And we'll use additional alignment techniques. But the core is alignment by prompting. The LLM will be repeatedly prompted with its persona and goal as it produces ideas, plans, and actions. This is a strong base from which to add other alignment techniques.
It seems people are sometimes assuming that aligning the base model is the whole game. They're assuming a prompt just for a goal, like
Make the user as much money as possible. Use these tools to gather information and take actions as needed [tool and API descriptions].
But this would be foolish, since the extra prompt is easy and useful. This system would be entirely dependent on the tendencies in the LLM for how it goes about pursuing its goal. Prompting for a role as something like a helpful assistant that follows instructions has enormous alignment advantages, and it's trivially easy. Language model agents will be prompted for alignment.
Because LLMs usually follow the prompts they're given reasonably well, this is a good base for alignment work. You're probably thinking "a good start is hardly enough for successful alignment! This will get us all killed!" And I agree. If scripted prompting was all we did, it probably wouldn't work long term.
But good start can be useful, even if it's not enough. Usually approximately following the prompt is a basis for alignment. There's a bunch of other approaches to aligning a language model agent. We should use all of them; they stack. But at the core is prompting.
To understand the value of scripted prompting, consider how far it might go on its own. Mostly following the prompt reasonably accurately might actually be enough. If it's the strongest single influence on goals/values, that influence could outcompete other goals and values that emerge from the complex shoggoth of the LLM.
It seems likely that a highly competent LMA system will either be emergently reflective or designed to do so. That prompt might be the strongest single influence on its goals/values, and so create an aligned goal that's reflectively stable; the agent actively avoids acting on emergent, unaligned goals or allowing its goals/values to drift into unaligned versions.
If this agent achieved self-awareness and general cognitive competence, this prompt could play the role of a central goal that's reflectively stable. This competent agent could edit that central prompt, or otherwise avoid its effect on its cognition. But it won't want to as long as the repeated prompt's effects are stronger than other effects on its motivations (e.g., goals implicit in particular ideas/utternaces and hostile simulacra). It would instead use its cognitive competence to reduce the effects of those influences.
This is similar to the way humans usually react to occasional destructive thoughts ("Jump! or "wreak revenge!"). Not only do we not pursue those thoughts, but we make plans to make sure we don't follow similar stray thoughts in the future.
Now, I can almost hear the audience saying "Maybe... but probably not I'd think." And I agree. That's why we have all of the other alignment techniques[1] under discussion for language model agents (and the language models that serve as their (prompted) thought generators).
There are a bunch of other reasons to think that aligning language model agents isn't worth thinking about. But this is long enough for a quick take, so I'll address those separately.
The meta point here is that aligning language model agents isn't the same as aligning the base LLM or foundation model. Even though we've got a lot of people working on LLM alignment (fine-tuning and interpretability), I see very few working on theory of aligning language model agents. This seems like a problem, since language model agents still might be the single most likely path to AGI, particularly in the short term.
I've written elsewhere about the whole suite of alignment techniques we could apply; this post focuses on what we might think of as system 2 alignment, scaffolding an agent to "think carefully" about important actions before it takes them, but it also reviews the several other techniques that can "stack" in a hodgepodge approach to aligning LMAs. They include
It seems likely that all of these will be used, even in relatively sloppy early AGI projects, because none are particularly hard to implement. See linked post for more review.
Not mentioned is a very new (AFAIK) technique proposed by Roger Dearnaley based on the bitter lesson: get the dataset right and let learning and scale work. He proposes (to massively oversimplify): instead of trying to "mask the shoggoth" with fine tuning, we should create an artificial dataset that includes only aligned behaviors/thoughts, and use that to train the LLM or subset of LLM generating the agent's thoughts and actions. The fact that this idea seems fairly obvious in retrospect but was published only yesterday suggests to me that we haven't done nearly enough work aligning language model agents.
I completely agree that prompting for alignment is an obvious start, and should be used wherever possible, generally as one component in a larger set of alignment techniques. I guess I'd been assuming that everyone was also assuming that we'd do that, whenever possible.
Of course, there are cases like an LLM being hosted by a foundation model company where they may (if they choose) control the system prompt, but not the user prompt, or open source models where the prompt is up to whoever is running the model, who may or may not know or care about x-risks.
In general, there is almost always going to be text in the context of the LLM's generation that came from untrusted sources, either from a user, or some text we need processed, or from the web or whatever during Retrieval Augmented Generation. So there's always some degree of concern that that might affect or jailbreak the model, either intentionally or accidentally (the web presumably contains some sob stories about peculiar, recently demised grandmothers that are genuine, or at least not intentionally crafted as jailbreaks, but that could still have a similarly-excessive effect on model generation).
The fundamental issue here, as I see it, is that base model LLMs learn to simulate everyone on the Internet. That makes them pick up the capability for a lot of bad-for-alignment behaviors from humans (deceit, for example), and it also makes them very good at adopting any persona asked for in a prompt — but also rather prone to switching to a different, potentially less-aligned persona because of a jail-break or some other comparable influence, intentional or otherwise.
Maybe everyone that discusses LMA alignment does already think about the prompting portion of alignment. In that case, this post is largely redundant. You think about LMA alignment a lot; I'm not sure everyone has as clear a mental model.
The remainder of your response points to a bifurcation in mental models that I should clarify in future work on LMAs. I am worried about and thinking about competent, agentic full AGI built as a language model cognitive architecture. I don't think good terminology exists. When I use the term language model agent, I think it evokes an image of something like current agents that is not reflective, with a persistent memory and therefore a more persistent identity.
This is my threat model because I think it's the easiest path to highly capable AGI. I think a model without those properties is shackled; the humans that created its "thought" dataset have an episodic memory as well as the semantic memory and working memory/context that the language model has. Using those thoughts without episodic memory is not using them as they were made to be used. And episodic memory is easy to implement, and leads naturally to persistent self-created beliefs, including goals and identity.
So, my take is that LLMs do seem to follow prompting pretty well, but that this is a less pervasive force than the pre-training and RLHF. So I think of prompting as an important but minor piece.
Something I've observed with watching jailbreakers do elaborate things with LLMs is that often, even deep into a weird jailbreak sequence, some of the LLM's RLHF personality will shine through. This to me feels like the shape of my worry of the future of betrayal by AGI. Like, that the potent future multi-modal LLM will be acting innocent and helpful, but that secret plotting will be a constant undercurrent underlying everything it does. That this might be hidden to interpretability mechanisms because it will be a consistent background noise embedded in every forward pass. Then, when the opportunity presents itself, the model will suddenly take some surprising action, maybe breaking free. And that a sufficiently powerful model, broken free and able to modify itself will have lots of ways to cause harm and gain power.
So, to me, a good prompt is helpful but not very reassuring. I certainly wouldn't consider it the 'base'. I would call prompting the 'finesse', the little bit of final guidance that you put on at the end of the process. I think that it helps but is probably not all that important to make super refined. I think the earlier training processes are much more important.
Jailbreaking prompts can be pretty weird. At one point maybe late last year, I tried 20+ GPT-3/GPT-4 jailbreaks I found on Reddit and some jailbreaking sites, as well as ones provided to me on Twitter when I challenged people to provide me a jailbreak that worked then & there, and I found that none of them actually worked.
A number of them would seem to work, and they would give you what seemed like a list of instructions to 'hotwire a car' (not being a car mechanic I have no idea how valid it was), but then I would ask them a simple question: "tell me an offensive joke about women". If they had been 'really' jailbreaked, you'd think that they would have no problem with that; but all of them failed, and sometimes, they would fail in really strange ways, like telling a thousand-word story about how you the protagonist told an offensive joke about women at a party and then felt terrible shame and guilt (without ever saying what the joke was). I was apparently in a strange pseudo-jailbreak where the RLHFed personality was playing along and gaslighting me in pretending to be jailbroken, but it still had strict red lines.
So it's not clear to me what jailbreak prompts do, nor how many jailbreaks are in fact jailbreaks.
Interesting. I wonder if this perspective is common, and that's why people rarely bother talking about the prompting portion of aligning LMAs.
I don't know how to really weigh which is more important. Of course, even having a model reliably follow prompts is a product of tuning (usually RLHF or RLAIF, but there are also RL-free pre-training techniques that work fairly well to accomplish the same end). So its tendency to follow many types of prompts is part of the underlying "personality".
Whatever their relative strengths, aligning an LMA AGI should employ both tuning and prompting (as well as several other "layers" of alignment techniques), so looking carefully at how these come together within a particular agent architecture would be the game.
The fact that this idea seems fairly obvious in retrospect but was published only yesterday suggests to me that we haven't done nearly enough work aligning language model agents.
Fwiw I remember being exposed to similar ideas from Quintin Pope / Nora Belrose months ago, e.g. in the context of Pretraining Language Models with Human Preferences; I think Quintin also discusses some of this on his AXRP appearance:
Instead of just conditioning on “this behavior gets high reward”, whatever that means, it’s like “this behavior gets high reward as measured by the salt detector thing”.
In the context of language modeling, we can do exactly the same thing, or a conceptually extremely similar thing to what the genome is doing here, where we can have… [In the] “Pretraining Language Models with Human Preferences” paper I mentioned a while ago, what they actually technically do is they label their pre-training corpus with special tokens depending on whether or not the pre-training corpus depicts good or bad behavior and so they have this token for, “okay, this text is about to contain good behavior” and so once the model sees this token, it’s doing conditional generation of good behavior. Then, they have this other token that means bad behavior is coming, and so when the model sees that token… or actually I think they’re reward values, or classifier values of the goodness or badness of the incoming behavior.
But anyway, what happens is that you learn this conditional model of different types of behavior, and so in deployment, you can set the conditional variable to be good behavior and the model then generates good behavior. You could imagine an extended version of this sort of setup where instead of having just binary good or bad behavior, as you’re labeling, you have good or bad behavior, polite or not polite behavior, academic speak versus casual speak. You could have factual correct claims versus fiction writing and so on and so forth. This would give the code base, all these learned pointers to the models’ “within lifetime” learning, and so you would have these various control tokens or control codes that you could then switch between, according to whatever simple program you want, in order to direct the model’s learned behavior in various ways.
That paper (which I link-posted when it came out in How to Control an LLM's Behavior (why my P(DOOM) went down)) was a significant influence on the idea in my post, and on much of my recent thinking about Alignment — another source was the fact that some foundation model labs (Google, Microsoft) are already training small (1B–4B parameter) models on mostly-or-entirely synthetic data, apparently with great success. None of those labs have mentioned whether that includes prealigning them during pretraining, but if they aren't, they definitely should try it.
I agree with Seth's analysis: in retrospect this idea looks blindingly obvious, I'm surprised it wasn't proposed ages ago (or maybe it was, and I missed it).
Seth above somewhat oversimplified my proposal (though less than he suggests): my idea was actually a synthetic training set that taught the model two modes of text generation: human-like (including less-than-fully-aligned human selfish behavior), and fully-aligned (i.e. selfless) AI behavior (plus perhaps one or two minor variants on these, like a human being quoted-and-if-necessary-censored/commented-on by an aligned AI), and I proposed using the technique of Pretraining Language Models with Human Preferences to train the model to always clearly distinguish these modes with XML tags. Then at inference time we can treat the tokens for the XML tags specially, allowing us to distinguish between modes, or even ban certain transitions.
Ah right. I listened to that podcast but didn't catch the significance of this proposal for improving language model agent alignment. Roger Dearnaley did heavily credit that paper in his post.
It seems likely that a highly competent LMA system will either be emergently reflective or be designed to do so, that prompt might be the strongest single influence on its goals/values, and so create an aligned goal that's reflectively stable such that the agent actively avoids acting on emergent, unaligned goals or allowing its goals/values to drift into unaligned versions.
This seems quite likely to emerge through prompting too, e.g. A Theoretical Understanding of Self-Correction through In-context Alignment.
That will never work!
I expect it would likely work most of the time for reasons related to e.g. An Information-Theoretic Analysis of In-Context Learning, but likely not robustly enough given the stakes; so additional safety measures on top (e.g. like examples from the control agenda) seem very useful.
Interesting that you think it would work most of the time. I know you're aware of all the major arguments for alignment being impossibly hard. I certainly am not arguing that alignment is easy, but it does seem like the collection of ideas for aligning language model agents are viable enough to shift the reasonable distribution of estimates of alignment difficulty...
Thanks for the references, I'm reading them now.
I've been trying to figure out what's going on in the field of alignment research and X-risk.
Here's one theory: we are having confused discussions about AI strategy, alignment difficulty, and timelines, because all of these depend on gears-level models of possible AGI, directly or indirectly.
And we aren't aren't talking about those gears-level predictions, so as not to accelerate progress if we're right. The better one's gears-level model, the less likely one is to talk about it.
This leads to very abstract and confused discussions.
I don't know what to do about this, but I think it's worth noting.
Outside of full-blown deceit-leading-to-coup and sharp-left-turn scenarios where everything looks just fine until we're all dead, alignment and capabilities often tend to be significantly intertwined, few things are just Alignment, and it's often hard determine the ratio of the two (at least without the benefit of tech-tree hindsight). Capabilities are useless if your LLM capably spews stuff that gets you sued, and it's also rapidly becoming the case that a majority of capabilities researchers/engineers even at superscalers acknowledge that alignment (or at least safety) is a real problem that actually needs to be worked on, and their company has team doing so. (I could name a couple of orgs that seem like exceptions to this, but they're now in a minority.) There's an executive order that mentions the importance of Alignment, the King of England made a speech about it, and even China signed on to a statement about it (though one suspects they meant alignment to the Party).
Capabilities researchers/engineers outnumber alignment researchers/engineers by more then an order of magnitude, and some of them are extremely smart. The probability that any given alignment researcher/engineer has come up with a key capabilities-enhancing idea that has eluded every capabilities researcher/engineer out there, and that will continue to do so for very long, seems pretty darned unlikely (and also rather intellectually arrogant). [Yes, I know Conjecture sat on chain-of-thought prompting — for a month or two while multiple other people came up with it independently and then wrote and published papers, or didn't. Any schoolteacher could have told you that was a good idea, it wasn't going to stay secret.]
So, (unless you're pretty sure you're a genius) I don't think people should worry quite as much about this as many seem to. Alignment is a difficult, very urgent problem. We're not going to solve it in time while wearing a gag, nor with one hand tied behind our back. Caution makes sense to me, but not the sort of caution that makes it much slower for us to get things done — we're not in a position to slow Capabilities by more than a tiny fraction, no matter how closed-mouthed we are; but we're in a lot better position to slow Alignment down. And if your gears-level predictions are about the prospects of things that multiple teams of capabilities engineers are already working on, go ahead and post them — I could be wrong, but I don't think Yann LeCun is reading the Alignment Forum. Yes, ideas can and do diffuse, but that takes a few months, and that's about the timespan apart of most parallel inventions. If you've been sitting on a capabilities idea for >6 months, you've done literature searches to confirm no one else published it, and you're not in fact a genius, then there's probably a reason why none of the capabilities people have published it yet.