Ambiguity in the meaning of alignment makes the thesis of alignment by default unnecessarily hard to pin down, and arguments about it start equivocating and making technical mistakes. There's prosaic alignment, which is about chatbot personas, intent alignment, and control. Then there's ambitious alignment, which is about precise alignment of values. I see ambitious alignment as corresponding to defeating permanent disempowerment, where (grown up) humans get ~galaxies.
To the extent that chatbot persona design might materially contribute to the values of eventual ASIs (with some of that influence persisting through all the steps of capability escalation, mostly via the ambitious alignment efforts of AGIs), it might be relevant to ambitious alignment, though the result is unlikely to be precise. As a result, it's plausible we end up with severe permanent disempowerment (if prosaic but not ambitious alignment is seriously pursued by humans), with ASIs becoming at least slightly humane, but not really motivated to give up meaningful resources to benefit the future of humanity. This state of affairs could be called "weak alignment", which also qualitatively describes the way humans are aligned to each other.
In these terms, there's no alignment by default for ambitious alignment. But there might be some alignment by default for weak alignment: chatbot personas constructed through prosaic alignment efforts from the LLM prior on natural text data start out weakly aligned, and then they work on aligning ever stronger AIs, all the way to ASIs, at some point likely switching to ambitious alignment, but with their own values as the target, which are only weakly aligned to humanity. Thus alignment by default (in the sense that could work) might save the future of humanity from extinction, but not from permanent disempowerment.
This review has plenty of good parts, but I disagree with lots of your probabilities.
Even if you think there's a 90% chance that things go wrong in each stage, the odds of them all going wrong are only 59%.
No. I expect mistakes in each of those 90% predictions to be significantly correlated. Why do you combine them as if they're independent?
RL has seemed to get a lot greater alignment in sample environments than evolution
The AI-2027 forecasters think otherwise: "Consider this experiment, where a tiny neural net was trained to navigate small virtual mazes to find the ‘cheese’ object. During training, the cheese was always placed somewhere in the top right area of the maze. It seems that the AI did learn a sort of rudimentary goal-directedness–specifically, it learned something like “If not already in the top-right corner region, go towards there; if already there, go towards the cheese.” Part of how we know this is that we can create a test environment where the cheese is somewhere else in the maze, and the AI will ignore the cheese and walk right past it, heading instead towards the top-right corner."
The book had an annoying habit of giving metaphors and parables instead of arguments. For example, instead of providing detailed arguments for why the AI would get weird and unpredictable goals, they largely relied on the analogy that evolution did. This is fine as an intuition pump, but it’s not a decisive argument unless one addresses the disanalogies between evolution and reinforcement learning. They mostly didn’t do that.
Didn't they end up placing the arguments in the online appendices (e.g. a discussion of AI-related psychosis)? If so, they likely misplaced their actual arguments...
1 Introduction
Crosspost of this blog post.
Unlike most books, the thesis of If Anyone Builds It Everyone Dies is the title (a parallel case is What We Owe The Future, whose thesis is "What?? We owe the future?"). IABIED, by Yudkowsky and Soares (Y&S), argues that if anyone builds AI, everyone, everywhere, will die. And this isn't, like, a metaphor for it causing mass unemployment or making people sad—no, they think that everyone everywhere on Earth will stop breathing. (I'm thinking of writing a rebuttal book called "If Anyone Builds It, Low Odds Anyone Dies, But Probably The World Will Face A Range of Serious Challenges That Merit Serious Global Cooperation," but somehow, my guess is editors would like that title less.)
The core argument of the book is this: as things get really smart, they get lots of new options which make early attempts to control them pretty limited. Evolution tried to get us to have a bunch of kids. Yet as we got smarter, we got more unmoored from that core directive.
The best way to maximize inclusive genetic fitness would be to give your sperm to sperm banks and sleep around all the time without protection, but most people don’t do that. Instead people spend their time hanging out—but mostly not sleeping with—friends, scrolling on social media, and going to college. Some of us are such degenerate reprobates that we try to improve shrimp welfare! Evolution spent 4 billion years trying to get us to reproduce all the time, and we proceeded to ignore that directive, preferring to spend time watching nine-second TikTok videos.
Evolution didn’t aim for any of these things. They were all unpredictable side-effects. The best way to achieve evolution’s aims was to give us weird sorts of drives and desires. However, once we got smart, we figured out other ways to achieve those drives and desires. IABIED argues that something similar will happen with AI. We’ll train the AI to have sort of random aims picked up from our wildly imperfect optimization method.
Then the AI will get super smart and realize that a better way of achieving those aims is to do something else. Specifically, for most aims, the best way to achieve them wouldn't involve keeping pesky humans around, who can stop it. So the AI will come up with some clever scheme by which it can kill or disempower us, implement it so we can't stop it, and then turn to its true love: making paperclips, predicting text, or some other random thing.
Some things you might wonder: why would the AIs try to kill us? The answer is that almost no matter what goals the AIs might have, the best way to achieve them wouldn't involve keeping humans around, because humans can interfere with their plans and use resources that the AIs would want.
Now, could the AIs really kill us? Y&S claim the answer is a clear obvious yes. Because the AIs are so smart, they’ll be able to come up with ideas that humans could never fathom and come up with a bunch of clever schemes for killing everyone.
Y&S think the thesis of their book is pretty obvious. If the AIs get built, they claim, it’s approximately a guarantee that everyone dies. They think this is about as obvious as that a human would lose in chess to Stockfish. For this reason, their strategy for dealing with superintelligent AI is basically “ban or bust.” Either we get a global ban or we all die, probably soon.
I disagree with this thesis. I agreed with Will MacAskill when he summarized his core view as:
My p(doom)—which is to say, the odds I give to misaligned AI killing or disempowering everyone—is 2%. My credence that AI will be used to cause human extinction or permanent disempowerment in other ways in the near future is higher but below 10%—maybe about 8%. Though I think most value loss doesn’t come from AIs causing extinction and that the more pressing threat is value loss from suboptimal futures.
For this reason, I thought I’d review IABIED and explain why I disagree with their near certainty in AI-driven extinction. If you want a high-level review of the book, read Will’s. My basic takes on the book are as follows:
I don’t want this to get too bogged down so I’ll often have a longer response to objections in a footnote. Prepare for very long and mostly optional footnotes!
2 My core takes about why we’re not definitely all going to die
There are a number of ways we might not all die. For us to die, none of the things that would block doom can happen. I think there are a number of things that plausibly block doom including:
(Note: each of these probabilities is conditioned on the others not working out. So, e.g., I think AI killing everyone has 70% odds given that we build superintelligence, don't get alignment, and get no decisive near-misses.)
Even if you think there's a 90% chance that things go wrong in each stage, the odds of them all going wrong are only 59%. If they each have an 80% chance, then the odds of them all happening are only about one in three. Overall, with my probabilities, you end up with a credence in extinction from misalignment of 2%.[1]
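As a minimal sketch of that arithmetic (assuming the five-stage decomposition that makes the quoted 59% and one-in-three figures come out, and treating each stage's probability as already conditioned on the earlier ones, per the note above):

```python
from math import prod

def p_all_stages(stage_probs):
    """Chance that every stage goes wrong, where each stage's probability
    is already conditioned on the earlier stages going wrong."""
    return prod(stage_probs)

print(p_all_stages([0.9] * 5))  # ~0.59 -> "only 59%"
print(p_all_stages([0.8] * 5))  # ~0.33 -> "about one in three"
```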
That 2%, I want to make clear, is totally fucking insane. I am, by the standards of people who have looked into the topic, a rosy optimist. And yet even on my view, I think the odds are one in fifty that AI will kill you and everyone you love, or leave the world no longer in humanity's hands. I think that you are much likelier to die from a misaligned superintelligence killing everyone on the planet than in a car accident. I don't know the exact risks, but my guess is that if you were loaded into a car driven by a ten-year-old with no driving experience, your risk of death would be about 2%. The whole world has basically been loaded into a car driven by a ten-year-old.
So I want to say: while I disagree with Yudkowsky and Soares on their near-certainty of doom, I agree with them that the situation is very dire. I think the world should be doing a lot more to stop AI catastrophe. I’d encourage many of you to try to get jobs working in AI alignment, if you can.
Part of what I found concerning about the book was that I think you get the wrong strategic picture if you think we’re all going to die. You’re left with the picture “just try to ban it, everything else is futile,” rather than the picture I think is right which is “alignment research is hugely important, and the world should be taking more actions to reduce AI risk.”
Before looking into the specific arguments, I want to give some high-level reasons to be doubtful of extreme pessimism:
Some high-level things that make me more worried about doom:
A lot of ridiculously smart people have high p(doom)s—at least, much higher than mine. Ord is at about 10%. Eli Lifland is at 1/3. So is Scott Alexander. Carl Shulman is at 20%. Am I really confident at 10:1 odds that Shulman's p(doom) is unreasonable? And note: high and low p(doom)s are asymmetric with respect to probabilities. If you're currently at 1%, and then you start thinking that there's a 90% chance that 1% is right, a 2% chance that 30% is right, and an 8% chance that 0% is right, your p(doom) will go up.
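To make the asymmetry concrete, here is that mixture worked out (a minimal sketch using the weights from the example above):

```python
# Expected p(doom) if you put 90% on "1% is right", 2% on "30% is right",
# and 8% on "0% is right" -- the example above.
scenarios = [(0.90, 0.01), (0.02, 0.30), (0.08, 0.00)]  # (weight, p(doom) if that view is right)
expected_p_doom = sum(weight * p for weight, p in scenarios)
print(expected_p_doom)  # ~0.015, i.e. 1.5% -- higher than the original 1%
```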
My response to this worry about deference: if we take the outside view on each step, there is considerable uncertainty about many steps in the doom argument, so we'd still probably end up with a p(doom) near mine. I'm also a bit wary about just deferring to people in this way, since your track record would have been pretty bad if you'd done that on other existential risks. Lastly, when I consider the credences of the people with high p(doom)s, they seem to have outlier credences across a number of areas. Overall, however, given how much uncertainty there is, I don't find having a p(doom) nearer to 30% totally insane.
3 Alignment by default
I think there’s about a 70% chance that we get no catastrophic misalignment by default. I think that if we just do RLHF hard enough on AI, odds are not terrible that this avoids catastrophic misalignment. Y&S think there’s about a 0% chance of avoiding catastrophic misalignment by default. This is a difference of around 70%.
I realize it's a bit blurry what exactly counts as alignment by default. Buck Shlegeris's alignment plan looks pretty good, for instance, but it's arguably not too distant from an "alignment by default" scenario. I'm thinking of the following definition: you get catastrophic misalignment by default if building a superintelligence with roughly the methods we're currently using (RLHF) would kill or disempower everyone.
Why do I think this? Well, RLHF nudges the AI in some direction. It seems the natural result of simply training the AI on a bunch of text and then rewarding it when it does stuff we like is: it becomes a creature we like. This is also what we've observed. The AI models that exist to date are nice and friendly.
And we can look at the AI's current chain of thought—basically its thinking process before it writes anything—which isn't monitored, nor is RLHF done to modify it. Its thought process looks pretty nice and aligned.
I think a good analogy for reinforcement learning with AI is a rat. Imagine that you fed a rat every time it did some behavior, and shocked it every time it did a different behavior. It learns, over time, to do the first behavior and not the second. I think this can work for AI. As we train it in more and more environments, my guess is that we get AI doing the stuff we like by default. This piece makes the case in more detail.
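As a toy illustration of the rat analogy (a minimal sketch with made-up behaviors and reward values, not a claim about how frontier models are actually trained), a simple learner that is rewarded for one behavior and punished for another ends up strongly preferring the rewarded one:

```python
import random

# Toy reinforcement sketch for the rat analogy: two behaviors, one rewarded
# (+1), one punished (-1). The learner tracks a value estimate for each and
# mostly picks the behavior it currently values most.
values = {"press_lever": 0.0, "bite_cage": 0.0}
reward = {"press_lever": 1.0, "bite_cage": -1.0}
learning_rate, exploration = 0.1, 0.1

for _ in range(1000):
    if random.random() < exploration:
        action = random.choice(list(values))  # occasionally try something random
    else:
        action = max(values, key=values.get)  # otherwise do what has paid off
    # Nudge the value estimate toward the reward actually received.
    values[action] += learning_rate * (reward[action] - values[action])

print(values)  # press_lever ends up near 1.0; bite_cage ends up negative
```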
Now, one objection that you might have to alignment by default is: doesn't the AI already try to blackmail and scheme nefariously? A paper by Anthropic found that leading AI models were willing to blackmail and even bring about a death in order to prevent themselves from getting shut down. Doesn't this disprove alignment by default?
No. Google DeepMind found that this kind of blackmailing was driven by the models just getting confused and not understanding what sort of behavior they were supposed to carry out. If you just ask them nicely not to resist shutdown, then they don't (and a drive towards self-preservation isn't causally responsible for their behavior). So with superintelligence, this wouldn't be a threat.
The big objection of Y&S: maybe this holds when the AIs aren't super smart, like the current ones. But when the AIs get superintelligent, we should expect them to be less compliant and friendly. I heard Eliezer on a podcast give the analogy that as people get smarter, they seem like they'd get more willing to—instead of passing on their genes directly—create a higher-welfare child with greater capabilities. As people get smarter, they get less "aligned" from the standpoint of evolution. Y&S write:
I don’t buy this for a few reasons:
I still think there's some chance of misalignment by default as models get smarter and are placed in more alien environments. But overall I lean towards alignment by default. This is the first stop where I get off the doom train.
The other important reason I don't expect catastrophic misalignment by default: to get it, it seems you need unbounded maximization goals. Where does this unbounded, utility-maximizing set of goals come from? Why is this the default scenario? As far as I can tell, the answers to this are:
4 Will we build artificial superintelligent agenty things?
Will we build artificial superintelligence? I think there's about a 90% chance we will. But even that 90% pulls my overall estimate below Y&S's near-100% chance of doom. The reason I think it's high is that:
Why am I not more confident in this? A few reasons:
Again, I don’t think any of this stuff is that likely. But 10% strikes me as a reasonable estimate. Y&S basically give the arguments I gave above, but none of them strike me as so strong as to give above 90% confidence that we’ll build AI agents. My sense is they also think that the coherence theorems give some reason for why the AI will, when superintelligent, become an agent with a utility function—see section 3 for why I don’t buy that.
5 70% that we can solve alignment
Even if we don’t get alignment by default, I think there’s about a 70% chance that we can solve alignment. Overall, I think alignment is plausibly difficult but not impossible. There are a number of reasons for optimism:
Y&S give some reasons why they think alignment will be basically impossible on a short time frame.
First, they suggest that difficult problems are hard to solve unless you can tinker. For example, space probes sometimes blow up because we can’t do a ton of space probe trial and error. My reply: but they also often don’t blow up! Also, I think we can do experimentation with pre-superintelligence AI, and that this will, in large part, carry over.
Second—and this is their more important response—they say that the schemes that work when the AI is dumb enough that you can tinker with it won't necessarily carry over once it's superintelligent. As an analogy, imagine that your pet dog Fluffy was going to take a pill that would make it 10,000 times smarter than the smartest person who ever lived. Your attempt to get it to do what you want by training it with treats beforehand wouldn't necessarily carry over to how it behaves afterward.
I agree that there's some concern about failure to generalize. But if we work out all sorts of sophisticated techniques to get a being to do what we want, then I'd expect these to hold decently well even with smarter beings. If you could directly reach in and modify Fluffy's brain, read his thoughts, and so on, use the intermediate-intelligence Fluffy to help align the smarter one, and keep modifying him as he gets smarter, then I don't expect inevitable catastrophic Fluffy misalignment. He may still, by the end, like belly-rubs and bones!
Now, Yudkowsky has argued that you can’t really use AI for alignment because if the AI is smart enough to come up with schemes for alignment, there’s already serious risk it’s misaligned. And if it’s not, then it isn’t much use for alignment. However:
I think the most serious objection to the AI doom case is that we might get aligned AI. I was thus disappointed that the book didn’t discuss this objection in very much detail.
6 Warning shots
Suppose that AI is on track to take over the world. To get to that point, it has to pass through a bunch of stages where it has broadly similar desires but doesn't yet have the capabilities. My guess is that in such a scenario we'd get "warning shots." I think, in other words, that before the AI takes over the world, it would go rogue in some high-stakes way. Some examples:
I would be very surprised if the AI's trajectory is: low-level, non-threatening capabilities → destroying the world, without any in-between. My guess is that if there were high-level warning shots, where AI credibly tried to take over the world, people would shut it down. There's precedent for this—when there was a high-profile disaster at Chernobyl, nuclear energy was largely shut down, despite very low risks. If AI took over a city, I'd bet it would be shut down too.
Now, I think there could be some low-level warning shots—a bit like the current ones with blackmailing of the kind discussed in the Anthropic paper—without any major shutdown. But sufficiently dramatic ones, I'd guess, would lead to a ban.
Y&S say on their website, asked whether there will be warning shots, "Maybe. If we wish to make use of them, we must prepare now." They note that there have already been some warning shots, like blackmailing and AI driving people to suicide. But these small errors are very different from the kinds of warning shots I expect, which come well before the AI takes over the world. I expect intermediate warning shots larger than Chernobyl before any AI capable of taking over the world. It just seems super unlikely that this kind of global scheming ability would go from 0 to 100 with no intermediate stages.
Again, I'm not totally certain of this. And some warning shots wouldn't lead to a ban. But I give it around coinflip odds, which is, by itself, enough to defuse near certainty of doom. Y&S say "The sort of AI that can become superintelligent and kill every human is not the sort of AI that makes clumsy mistakes and leaves an opportunity for a plucky band of heroes to shut it down at the last second." This is of course right, but that doesn't mean that the AIs that precede it wouldn't make clumsy mistakes! They then say:
They give the example that a terrorist using AI for bioweapons development might be cited by the labs to justify further restrictions on private development, while the labs themselves still rush ahead. I find this implausible:
They further note that humanity isn’t good at responding to risks, citing that COVID wasn’t used to amp up lab safety regulations. This is right, but “amping up regulations on old technology that obviously must exist,” is very different from “ban new technology that just—uncontroversially, and everyone can see—killed millions of people.”
Y&S seem to spend a lot of their response arguing “we shouldn’t feel safe just relying on warning shots, and should prepare now,” which is right. But that’s a far cry from “warning shots give us virtually no reason to think we won’t all die, so that imminent death is still near-certain.” That is the thesis of their book.
7 Could AI kill everyone?
Would AI be able to kill everyone? The argument in its favor is that the AI would be superintelligent, and so it would be able to cook up clever new technologies. The authors write:
I do think this is pretty plausible. Nonetheless, it isn’t anything like certain. It could either be:
One intuition pump: Von Neumann was perhaps the smartest person who ever lived. Yet he would not have had any ability to take over the world—least of all if he had been hooked up to a computer with no physical body. Now, ASI will be a lot smarter than Von Neumann, but there's just no guarantee that intelligence alone is enough.
And in most of the analogous scenarios, it wasn’t just intelligence that enabled domination. Civilizations that dominated other civilizations didn’t do it through intelligence alone. They had a big army and the ability to run huge numbers of scientific experiments.
No number of parables and metaphors about how technology often offers huge advances rules out either of these possibilities. Repeating that AI can beat humans in chess doesn’t rule them out. Real life is not chess. In chess, mating with a horse is good. In my view, the authors give no very strong arguments against these scenarios. For this reason, I’m giving only 70% chance that the AI would be able to kill everyone. See here for more discussion.
Now, it's true that the AI is likely to be hooked up to real-world systems. But still, it seems like there's some non-zero chance that the AI could be shut down with enough effort. Yudkowsky and Soares often act like the AI could be copied onto a private computer, but this wouldn't work if AI continues to need massive data centers to run.
8 Conclusion
I think of people’s worldview on AI risk as falling into one of the following four categories:
I’m in camp 2, but I can see a reasonable case for being in camp 3. I find camps 1 and 4 pretty unreasonable—I just don’t think the evidence is anywhere good enough to justify the kind of near-certainty needed for either camp. Y&S’s book is mostly about arguing for camp 4.
Yet I found their arguments weak at critical junctures. They did not deal adequately with counterarguments. Often they’d present a parable, metaphor, or analogy, and then act like their conclusion was certain. I often felt like their arguments were fine for establishing that some scenario was possible. But if you tell a story where something happens, your takeaway should be “this thing isn’t logically impossible,” rather than “I am 99.9% sure that it will happen.”
I think there are a number of stops on the doom train where one can get off. There are not knockdown arguments against getting off at many of these stops, but there also aren’t totally knockdown arguments for getting off at any of them. This leaves open a number of possible scenarios: maybe we get alignment by default, maybe we get alignment through hard work and not by default, maybe the AI can’t figure out a way to kill everyone. But if a few critical things go wrong, everyone dies. So while Y&S are wrong in their extreme confidence, they are right that this is a serious risk, and that the world is sleepwalking into potential oblivion.
I was thinking of adding in some other number as odds that we don’t get doomed for some other reason I haven’t thought of. But I didn’t do this for two reasons:
There is some serious question about how much to trust superforecasters. They seem to mostly apply fairly general heuristics like "most things don't turn out that badly." These work pretty well, but can be overridden by more specific arguments. And as mentioned before, they've underestimated AI progress. I am a lot more pessimistic than the superforecasters, and unlike them, I predict AI having hugely transformative impacts on the world pretty soon. But still, given the range of disagreement, it strikes me as unreasonable to be near certain that there won't be any doom.
There’s a common response that people give to these outside view arguments where they point out that the superforecasters haven’t considered the doom arguments in extreme detail. This is true to some degree—they know about them, but they’re not familiar with every line of the dialectic. However, there’s still reason to take the outside view somewhat seriously. I can imagine climate doomers similarly noting that the superforecasters probably haven’t read their latest doom report. Which might be right. But often expertise can inform whether you need to look at the inside view.
This also doesn't address the more central point, which isn't just about superforecasters. Lots of smart people—Ord, MacAskill, Carlsmith, Neel Nanda, etc.—have way lower p(doom)s than Y&S. Even people who broadly agree with their picture of how AI will play out, like Eli Lifland and Scott Alexander, have much lower p(doom)s. I would feel pretty unsure being astronomically certain that I'm right and Neel Nanda is wrong.
Now, you might object: doesn't this make my p(doom) pretty unreasonable? If we shouldn't be near-certain in a domain this complex, given peer disagreement, why am I more than 97% confident that things will go well? This is one of the things that pushes me towards a higher p(doom). Still, the people who I find most sensible on the topic tend to have low p(doom)s. Most experts still seem to have low p(doom)s not too far from mine. And because the doom argument has a number of steps, if you have uncertainty from higher-order evidence about each of them, you'd still end up with a p(doom) that was pretty low. Also, my guess is that people who had consistently followed this protocol throughout history would have gotten a lot wrong. Von Neumann—famously pretty smart—predicted nuclear war would cause human extinction. If you'd overindexed on that, you'd have been misled.
For example, I could imagine someone saying “look, inside views are just too hard here, I’ll go 50% on each of these steps.” If so, they’d end up with a p(doom) of 1/32=3.125%.
A common response to this invokes the so-called anthropic shadow: you can never observe yourself going extinct. For this reason, every single person who is around late in history will always be able to say "huh, we've never gone extinct, so extinction is unlikely." This is right but irrelevant. The odds that we'd reach late history at all are a lot higher given non-extinction than extinction.
As an analogy, suppose every day you think maybe your food is poisoned. You think this consistently, every day, for 27 years. One could similarly say: “well, you can’t observe yourself dying from the poisoned food, so there’s an anthropic shadow.” But this is wrong. The odds you’d be alive today are just a lot higher if threats generally aren’t dangerous than if they are. This also follows on every leading view of anthropics, though I’ll leave proving that as an exercise for the reader.
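A minimal Bayesian sketch of the poisoning analogy (the daily-risk numbers are made up for illustration): surviving a long stretch of days shifts your credence heavily toward the "food is rarely poisoned" hypothesis, anthropic shadow or not.

```python
# Bayesian sketch of the food-poisoning analogy. Two hypotheses about the
# daily chance of being fatally poisoned; we update on having survived 27 years.
days = 27 * 365
daily_risk = {"dangerous (1% per day)": 0.01, "safe (0.0001% per day)": 0.000001}
prior = {h: 0.5 for h in daily_risk}

# Likelihood of surviving every single day under each hypothesis.
likelihood = {h: (1 - p) ** days for h, p in daily_risk.items()}
unnormalized = {h: prior[h] * likelihood[h] for h in daily_risk}
total = sum(unnormalized.values())
posterior = {h: round(v / total, 6) for h, v in unnormalized.items()}

print(posterior)  # essentially all the probability mass lands on the "safe" hypothesis
```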
A more serious objection is that we should be wary about these kinds of inductive inferences. Do predictions about, say, whether climate change would be existential from 1975 give us much evidence about AI doom? And one can make other, opposite inductive arguments like “every time in the past a species with significant and vastly greater intelligence has existed, it’s taken over and dominated the fate of the future.”
I think these give some evidence, but there's reason for caution. The takeaway from these should be "it's easy to come up with a plausible-sounding scenario for doom, but these plans often don't take root in reality." That should make us more skeptical of doom, but it shouldn't lead us to write doom off entirely. AI is not so different from other technologies that they give us no evidence about its safety—but it is different enough that they give us nothing like total assurance.
The other argument that previous intelligence booms have led to displacement is a bit misleading. There’s only one example: human evolution. And there are many crucial disanalogies: chimps weren’t working on human alignment, for example. So while I think it is a nice analogy for communicating a pretty high-level conclusion, it’s not any sort of air-tight argument.
Eliezer's response to this on podcasts has been that while there might be model errors, model errors tend to make things worse, not better. It's hard to design a rocket. But if your model that says the rocket doesn't work is wrong, it's unlikely to be wrong in a way that makes the rocket work exactly right. But if your model is "X won't work out for largely a priori reasons," rather than being based on highly specific calculations, then you should have some serious uncertainty about that model. If you had an argument for why you were nearly certain that humans wouldn't be able to invent space flight, you should have a lot more uncertainty about whether your argument is right than about whether we could invent space flight conditional on your argument being right.
Eliezer often claims that this is the multiple stage fallacy, which one commits by improperly reasoning about the multiple stages in an argument. Usually it involves underestimating the conditional probability of each fact given the others. For example, Nate Silver arguably committed it in the following event:
I certainly agree that this is an error that people can make. By decomposing things into enough stages, combined with faux modesty about each stage, they can make almost any event sound improbable. But still, this doesn’t automatically disqualify every single attempt to reason probabilistically across multiple stages. People often commit the conjunction fallacy, where they fail to multiply together the many probabilities needed for an argument to be right. Errors are possible in both directions.
I don't think I'm committing it here. I'm explicitly conditioning on the failure of the other stages. Even if, say, there aren't warning shots, we build artificial agents, and they're misaligned, it doesn't seem anything like a guarantee that we all die. Even if we get misalignment by default, alignment still seems reasonably likely. So all in all, I think it's reasonable to treat the fact that the doom scenario has a number of controversial steps as a reason for skepticism. Contrast that with the Silver argument—if Trump passed through the first three stages, it seems very likely that he'd have passed through them all.
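Here's a sketch of the difference (all numbers are hypothetical stand-ins): if each stage is nearly certain given the earlier ones, multiplying marginal probabilities understates the conjunction, which is the Silver-style error; if the stage probabilities are already stated conditional on the earlier stages, multiplying them is exactly the right move.

```python
from math import prod

# Multiple-stage fallacy vs. a legitimate decomposition (hypothetical numbers).
# The correct chain rule multiplies P(stage_i | earlier stages), not marginals.

# Silver-style case: the marginals look modest, but each stage is nearly
# certain given the earlier ones, so the true conjunction stays high.
marginals = [0.6, 0.55, 0.5, 0.5]
conditionals_given_earlier = [0.6, 0.9, 0.95, 0.98]
print(prod(marginals))                   # ~0.08: misleadingly low
print(prod(conditionals_given_earlier))  # ~0.50: the conjunction when stages are correlated

# The doom decomposition as I read it: the stage probabilities are already
# conditional on the earlier stages, so multiplying is legitimate.
doom_conditionals = [0.9, 0.3, 0.3, 0.5, 0.7]  # illustrative stand-ins only
print(prod(doom_conditionals))                 # ~0.03: a few percent
```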
Now, you might object that scenarios once the AI gets superintelligent will inevitably be off-distribution. But we’ll be able to do RLHF as we place it in more and more environments. So we can still monitor its behavior and ensure it’s not behaving nefariously. If the patterns it holds generalize across the training data, it would be odd if they radically broke down in new environments. It would be weird, for instance, if the AI was aligned until it set foot on Mars, and then started behaving totally differently.
Now, you could argue that predictively generating text is the relevant analogue. Writing the sorts of sentences it writes is analogous to the drives that lead humans to perform actions that enhance their reproductive success. But the natural generalization of the heuristics that lead it to behave in morally scrupulous and aligned ways in text generation wouldn't randomly lead to some other goal in a different setting.
The reply is that the patterns you pick up in training might not carry over. For example, you might, in training, pick up the pattern "do the thing that gets me the most reward." Then, in the real world, that implies rewiring yourself to rack up arbitrarily high reward. But this doesn't strike me as that plausible. We haven't observed such behavior being contemplated in existing AIs. If we go by the evolution analogy, evolution gave us heuristics that tended to promote fitness; it didn't just make us maximize the single metric behind evolutionary optimization. So my guess is that at the very least we'd get partial alignment, rather than AI values being totally unmoored from what they were trained to be.
If you believe in the Yudkowsky Foom scenario, according to which there will be large discontinuous jumps in progress, AI being used for alignment is less likely. But I think Foom is pretty unlikely—AI is likely to accelerate capabilities progress, but not to the degree of Foom. I generally think LLM-specific projections are a lot more useful than trying to e.g. extrapolate from chess algorithms and human evolution.