Firstly, thanks for writing this, sending me a draft in advance to review, and incorporating some of my feedback. I agree that my review^2 was somewhat sloppy, i.e. I didn't argue for my perspective clearly. To frame things, my overall perspective here is that (1) AI misalignment risk is not "fake" or trivially low, but it is far lower than the book's authors claim, and (2) the MIRI-AI-doom cluster relies too much on arguments by analogy and spherical-cow game-theoretic agent models while neglecting to incorporate the empirical evidence from modern AI/ML development. I recently wrote a different blogpost trying to steelman their arguments from a more empirical perspective (as well as possible counterarguments that reduce, but do not cancel, the strength of the main argument).
I plan to actually read and review the book "for real" once I get access to it (and have the time).
Some concrete comments on this review^3:
Nina is clearly trying to provide an in-depth critique of the book itself
It may have come across that way, but that was not my goal. Though implicit in the post is my broader knowledge of MIRI's arguments, so it's slightly based on that and not just Scott's review.
Nina says “The book seems to lack any explanation for why our inability to give AI specific goals will cause problems,” but it seems pretty straightforward to me
That's a misquote. In my post I say that the "basic case for AI danger", as presented by Scott (presumably based on the book), lacks any explanation for why our inability to “give AI specific goals” will result in a superintelligent AI acquiring and pursuing a goal that involves killing everyone. It's possible the book makes the case better than Scott does (this is a review of a review after all), but from my experience reading other things from MIRI, they make numerous questionable assumptions when arguing that a model that hasn't somehow perfectly internalized the ideal objective function will become an unimaginably capable entity taking all its actions in service of a single goal that requires the destruction of humanity. I don't think the analogies, stories, and math are strong enough evidence for this being anywhere near inevitable considering the number of assumptions required, and the counterevidence from modern ML.
She says humans are a successful species, but I think she’s conflating success in human terms with success in evolutionary terms
This is a fair point. I should have stuck to explaining why evolution is a biased analogy rather than appearing to defend the claim that humans are somehow “aligned” with it. Though we're not egregiously misaligned with evolution.
nobody’s arguing against examples, they’re just saying it might be more reassuring if the architecture directly included the reward function in the model itself. For instance, if the model was at every turn searching over possible actions and choosing one that will maximize this reward function
What does this mean though, if one doesn't think there's a single Master Reward Function (you may think there is one, but I don't)? Modern ML works by saying "here is a bunch of examples of what you should do in different scenarios", or "here are a bunch of environments with custom reward functions based on what we want achieved in each environment". Unless you predict a huge paradigm shift, the global reward function is not well-defined. You could say, oh, it would be good if the model directly included all our custom reward functions, but then that is like saying, oh, it would be good if models just memorized their datasets.
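To make this concrete, here is a minimal toy sketch (all task names and reward functions made up) of what I mean: the training signal is a patchwork of per-task reward functions applied to sampled examples, and there is no single global reward function left over for the finished model to carry around and explicitly maximize.

```python
# Toy sketch: training feedback comes from a patchwork of task-specific reward
# functions (all names/functions made up), so there is no single global reward
# function left over for the finished model to carry around and maximize.

reward_fns = {
    "math": lambda answer, target: 1.0 if answer == target else 0.0,
    "safety": lambda answer, target: 0.0 if "harmful" in answer else 1.0,
    "chat": lambda answer, target: min(len(answer), 50) / 50.0,  # crude engagement proxy
}

def training_signal(task, answer, target=None):
    """Scalar feedback from whichever task-specific reward function applies
    to this example; nothing resembling a global objective is defined."""
    return reward_fns[task](answer, target)

print(training_signal("math", "42", "42"))   # 1.0
print(training_signal("chat", "hi there"))   # 0.16
```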
Nina complains that the Mink story is simplistic with how Mink “perfectly internalizes this goal of maximizing user chat engagement”. The authors say the Mink scenario is a very simple “fairytale”—they argue that in this simple world, things go poorly for humanity, and that more complexity won’t increase safety.
Here I mean to point out that the implication that AI will eventually start behaving like a perfect game-theoretic agent that ruthlessly optimizes for a strange goal is simplistic, not that the specifics of the Mink story (i.e. what specifically happens, which goal emerges) are simplistic.
I think Nina has a different kind of complexity in mind, which the authors don’t touch on. It seems like she thinks real models won’t be so perfectly coherent and goal-directed. But I don’t really think Nina spells out why she believes this and there are a lot of counter-arguments. The main counter-argument in the book is that there are good incentives to train for that kind of coherence.
Yes, I don't properly present the arguments in my review^2. I do that a bit more in this post which is linked in my review^2. And I don't mean to dismiss the possibility entirely, just to argue that presenting this as an inevitability is misleading.
A thing I can't quite tell is whether you're incorporating into your model the thing the book is about:
"AI that is either more capable than the rest of humanity combined, or is capable of recursively self-improving and situationally aware enough to maneuever itself into having the resources to do so (and then being more capable than the rest of humanity combined), and which hasn't been designed in a fairly different way from the way current AIs are created."
I'm not sure if you're more like "if that happened, I don't see why it'd be particularly likely to behave like an ideal agent ruthlessly optimizing for alien goals", or if you're more like "I don't really buy that this can/will happen in the first place."
(the book is specifically about that type of AI, and has separate arguments for "someday someone will make that" and "when they do, here's how we think it'll go")
and which hasn't been designed in a fairly different way from the way current AIs are created
Is this double negative intended? Do you mean has been designed in a fairly different way?
No. The argument is "the current paradigm will produce the Bad Thing by default, if it continues on what looks like its default trajectory." (i.e. via training, in a fashion where it's not super predictable in advance what behaviors the training will result in across various off-distribution scenarios)
OK, cool. For some reason the sentence read weirdly to me so I wanted to clarify before replying (because if it was the case that the book was premised on a sudden paradigm shift in AI development that I didn't think would occur by default, then that would indeed be an important step in the argument that I don't address at all).
To answer your question directly, I think I disagree with the authors both on how capable AIs will become in the short-to-medium term (let's say over the next century, to be concrete), and on the extent to which very capable AIs will be well-modeled as ideal agents optimizing for alien-to-us goals. As mentioned, I'm not saying it's necessarily impossible, just very far from an inevitability. My view is based on (1) predictions about how SGD behaves and how AI training approaches will evolve (within the current paradigm), (2) physical constraints on data and compute, and (3) pricing in prosaic safety measures that labs are already taking and will (hopefully!) continue to take.
Because I don't predict a fast take-off, I also think that if things start turning out to look worse than I expect, we'll see warnings.
Nod.
FYI I don't think the book is making a particular claim that any of this will happen soon, merely that when it happens, the outcome will very likely be human extinction. The point is not that it'll happen at a particular time or in a particular way – the LLM/ML paradigm might hit a wall, there might need to be algorithmic advances, it might instead route through narrow AI getting really good at conducting and leveraging neuroscience and making neuromorphic AI or whatever.
But the fact that we know human brains run on a relatively low amount of power and training data means we should expect this to happen sooner or later. (But meanwhile, it sure does seem like the current paradigm keeps advancing and a lot of money is being poured in, so it seems at least reasonably likely that it'll be sooner rather than later.)
The book doesn't argue for a particular timeline, but it personally seems weird to me to expect it to take another century, in particular when you can leverage narrower pseudo-general AI to help you make advances. And I have a hard time imagining takeoff taking longer than a decade, or really even a couple of years, once you hit full generality.
Well, I hope we can have tea with our great-grandchildren in 100 years and discuss which predictions panned out!
Curious if there are any bets you'd make where, if they happened in the next 10 years or so, you'd significantly re-evaluate your models here?
I'd have to think about this more to have concrete bet ideas. Though feel free to suggest some.
Maybe one example is that I think the likelihood of >10% yearly GDP growth in any year in the next 10 years is <10%.
Nod. Does anything in the "AI-accelerated AI R&D" space feel cruxy for you? Or "a given model seems to be semi-reliably producing Actually Competent Work in multiple scientific fields"?
Oh man, the misquote is bad, sorry for that. What happened is I voice-typed and WhisperFlow decided to put quotation marks around it. I will edit and remove the quotation marks now.
Unless you predict a huge paradigm shift
I think Nate and Eliezer want a huge paradigm shift (though they don't think one is likely; it's not like they're optimistic) and think things are extremely challenging under the status quo. I think they agree there might not be a great alternative; they only gesture at this issue to show why alignment is tricky. Tbc, I am only trying to present my understanding of their arguments; I might not be understanding them well.
For instance, if the model was at every turn searching over possible actions and choosing one that will maximize this reward function. (Of course, no one knows how to do that in practice yet, but everyone’s on the same page about that.)
I'm interpreting you as saying "If we solved outer alignment & had a perfect reward function, it would be good if the model itself was optimizing for that reward function (ie inner alignment)"
In which case, we are not on the same page (ie inner & outer alignment decompose a hard problem into two more difficult ones).
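To be concrete about the architecture in the quote, here's a minimal sketch (all functions hypothetical, not anything anyone knows how to build at the relevant generality) of what "searching over possible actions and choosing one that maximizes the reward function" would look like:

```python
# Minimal sketch (hypothetical functions) of the architecture described in the
# quote: at every step the agent explicitly searches over candidate actions and
# picks the one scoring highest under a hand-specified reward function, rather
# than acting from weights shaped indirectly by training examples.

def explicit_maximizer(state, candidate_actions, reward_fn, simulate):
    """Return the action whose simulated outcome maximizes reward_fn."""
    return max(candidate_actions, key=lambda a: reward_fn(simulate(state, a)))

# Toy usage: prefer actions that move a numeric state toward 10.
best = explicit_maximizer(
    state=3,
    candidate_actions=[-1, 0, +1],
    reward_fn=lambda s: -abs(s - 10),
    simulate=lambda s, a: s + a,
)
print(best)  # 1
```

The hard parts, of course, are the `simulate` and `reward_fn` pieces, which is where I think the inner/outer decomposition makes the problem harder rather than easier.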
For the book, it's interesting they went w/ the evolution argument. I still prefer the shard theory analogy of humans being misaligned w/ the reward system (ie I intentionally avoid taking fentanyl, even though that would be very rewarding/reinforcing), which can still end up in similar sharp left turns if the model eg does take fentanyl (or pursues other goals).
Evolution is still not believed by everyone in the US (estimates of disbelief oddly range from 17% to 37%), which can be offputting to some, & you also have to understand evolution to an extent. I assume most folks can sort of see that if you really optimized for evolution, you'd do a lot more than we are to pass on genes; however, optimizing for "evolution" is underconstrained & invites arguments of the form "well, we're actually still doing quite well by evolution's lights".
Now instead let's focus on optimizing for the human reward system. People believe in very addictive drugs & can see their effects. It's pretty easy to imagine "a drug addict becomes extremely powerful, and you try to stop them; what goes wrong?" It's also quite coherent what optimizing your reward system looks like!
The evolution analogy is still good under the inner-outer alignment frame, since humans would be in evolution's seat & it seems difficult to avoid the same issues. Whereas the human reward system seems easier to satisfy (eg just give the AI fentanyl). This can be worked around by discussing how hard it is to design a perfect reward function that doesn't end up goodharting.
If we can’t give them a specific goal, then the AIs will have a goal that is not exactly what we intended, and then their goal will be at odds with ours.
That's too compressed to make much sense. It's not the case that an AI has to have a goal, and therefore any AI that isn't given a goal will have one of its own. It is the case that if an AI is trained into having an approximation of a human goal, it will only have an approximation of it, but it's not at all clear that such an approximation would be "at odds".
The future, once shaped by AIs, will be very off-distribution from where AIs were trained, at the very least.
And that's a problem because...? You are probably assuming incorrigibility or value lock-in, since doomers tend to, but it ain't necessarily so.
"from my experience reading other things from MIRI, they make numerous questionable assumptions when arguing that a model that hasn’t somehow perfectly internalized the ideal objective function will become an unimaginably capable entity taking all its actions in service of a single goal that requires the destruction of humanity"
Yep.
In particular it seems like it might be a little unfair/bad if people can give copies of books only to sympathetic people before the book comes out
Based on the several negative-ish reviews from the press, it's not my impression that only sympathetic people got advance copies. I assume that some people who have existing relationships with the authors got advance copies, as did some professional book reviewers.
Since there's speculation about advance copies in this thread, and I was involved in a fair deal of that, I'll say some words on the general criteria for advance copies:
We didn't want to empower people who seemed to be at some risk of taking action to deflate the project ahead of release, and so had a pretty high bar for sharing there. We especially wouldn't share a copy with someone if we thought there was a significant chance the principal effect of doing so was early and authoritative deflation to a deferential audience who could not yet read the book themselves. This is because we wanted people to read the book, rather than relying on others for their views.
I agree with the person Eli's quoting that this introduces some selection bias in the early stages of the discourse. However, I will say that the vast majority of parties we shared advance copies with were, by default, neutral toward us, either having never heard of us before, or having googled Eliezer and realized it might be worth writing/talking about. There was, to my knowledge, no deliberate campaign to seed the discourse, and many of our friends and allies with whom we had the opportunity to score cheap social points by sharing advance copies did not receive them. Journalists do not agree in advance to cover you positively, and we've seen that several who were given access to early copies indeed covered the book negatively (e.g. NYT and Wired — two of the highest-profile things that will happen around the book at all).
[the thing is happening where I put a number of words into this that is disproportionate to my feelings or the importance; I don't take anyone in this thread to be making accusations or being aggressive/unfair. I just see an opportunity to add value through a little bit of transparency.]
I am not aware of any critical people inside this community (e.g. people who would be similarly critical to me) getting copies in advance, except for people who write book reviews.
I also don't think sympathetic people who aren't writing book reviews got copies in-advance, so my best guess is stuff is relatively symmetric. I don't really know how copies were distributed, but my sense is not that many advance copies were distributed in-total (my sense is largely downstream of publisher preference).
IIRC Aella and Grimes got copies in advance and AFAIK haven't written book reviews (at least not in the sense Scott or the press did).
Aella is the partner of one of the authors! Of course she had advance access! I don't know about Grimes; seems plausible to me (though it's not super clear whether to count her as sympathetic or critical, I don't really know what she believes about this stuff, and also she has a lot of reach otherwise).
Not arguing with the main point, but in its current state that Wikipedia section of “Critical reception” appears to list many positive reviews and quotes.
For example:
Kevin Canfield wrote in the San Francisco Chronicle that the book makes powerful arguments, and recommended it.
Next, Nina argues that the fact that LLMs don't directly encode your reward function makes them less likely to be misaligned, not more, the way IABIED implies. I think maybe she’s straw-manning the concerns here. She asks “What would it mean for models to encode their reward functions without the context of training examples?” But nobody’s arguing against examples, they’re just saying it might be more reassuring if the architecture directly included the reward function in the model itself. For instance, if the model was at every turn searching over possible actions and choosing one that will maximize this reward function. (Of course, no one knows how to do that in practice yet, but everyone’s on the same page about that.)
Can you expand a little bit on this? I don't understand why replacing "here are some examples of world states and what actions led to good/bad outcomes, try to take actions which are similar to the ones which led to good outcomes in similar world states in the past" with "here's a reward function directly mapping world-state -> goodness" would be reassuring rather than alarming.
Having more insight into exactly what past world states are most salient for choosing the next action, and why it thinks those world states are relevant, is desirable. But "we don't currently have enough insight with today's models for technical reasons" doesn't feel like a good reason to say "and therefore we should throw away this entire promising branch of the tech tree and replace it with one that has had [major problems every time we've tried it](https://en.wikipedia.org/wiki/Goodhart%27s_law)".
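To spell out the Goodhart concern with a toy example (made-up numbers, purely illustrative): hard-maximizing a hand-written "world-state -> goodness" scorer selects whatever state the scorer mis-rates, not what we actually wanted.

```python
# Toy illustration (made-up numbers) of the Goodhart concern: hard-maximizing a
# proxy "world-state -> goodness" scorer selects whatever the proxy mis-rates,
# not what we actually wanted.

true_value = {"helpful_answer": 1.0, "sycophantic_answer": 0.2, "reward_hack": -5.0}
proxy_score = {"helpful_answer": 0.9, "sycophantic_answer": 1.1, "reward_hack": 2.0}

chosen = max(proxy_score, key=proxy_score.get)  # what a pure proxy-maximizer picks
print(chosen, true_value[chosen])               # reward_hack -5.0
```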
Am I misinterpreting what you're saying though, and there's a different thing which everyone is on the same page about?
A review of Nina Panickssery’s review of Scott Alexander’s review of the book "If Anyone Builds It, Everyone Dies" (IABIED).
This essay is not my best work but I just couldn't resist. Thanks to Nina and others for comments/feedback.
I confess I mostly wrote this because I think a review of a review of a review is funny. But I also have a lot of disagreements with both Nina and the authors of IABIED. Nina’s review is the first time a skeptic has substantively argued against the book (because the book isn’t out and the authors haven’t given advance copies to very many skeptics). I want the discourse around critiques of the book to be good. I want people to understand the real limitations of the authors’ arguments, not straw men of them.
Although she frames her writing as a review of Scott’s review, Nina is clearly trying to provide an in-depth critique of the book itself. Unfortunately, it’s hard for her to avoid straw-manning, since she hasn’t read the book and can only go off of Scott’s review. I happen to have read IABIED, which gives me a leg up on writing about it.
It’s of course not Nina’s fault she hasn’t been given an advance copy of the book. I feel torn about whether or not it’s bad form to review a book she hasn’t read.[1]
(I may publish my own review which goes into my views when IABIED comes out.)
Nina complains about part of the book giving a sci-fi story about how everything could go wrong, because she doesn't like the idea of generalizing from fictional evidence. But I think it's clearly very valuable to lay out a concrete scenario of AI disaster. The authors spend several pages bemoaning how fraught writing a scenario like this is. I think if they had refused to give a concrete scenario, it would have seemed like a cop-out or an admission that their arguments rely on hand-waving and magic. I really like concrete scenarios and wish there were more of them, because they head off this obvious and common critique. They’re labor-intensive to write and hard to pull off, so I appreciate whenever people try.
I also think talking through potential future scenarios is a good way to concretize and understand things. If someone says AI takeover could never happen, laying out one story for how it could happen, among a sea of possible futures, can build intuitions and help people see where their views diverge. I think the scenario makes it much easier for me to talk about where I disagree with Nate and Eliezer.
Of course, concrete scenarios can also be misleading. You can hide a lot of sleight-of-hand by saying, “Well, naturally, this exact scenario is unlikely, but I had to pick a specific scenario, and any specific scenario is unlikely.” But I don't really think the scenario in the book is guilty of doing this intentionally and on net I’m very glad it’s in the book.
I also think Nate and Eliezer do a good job communicating that their scenario is just one possibility of many, and things are almost certainly not going to go as they describe. They also go out of their way to make it a story that is not crafted to be entertaining. The main issue Eliezer has with generalizing from fictional evidence is that it is designed to be a fun read, which comes at the cost of realism. But Nate and Eliezer try dutifully to avoid this.
Nina says the book lacks an explanation for why our inability to give AI specific goals will cause problems, but it seems pretty straightforward to me. If we can't give them a specific goal, then the AIs will have a goal that is not exactly what we intended, and then their goal will be at odds with ours.
The authors do a fine job providing the standard arguments for why an AI that is not aligned with humanity's goals might kill everyone (although I think they don't spend enough time on why it would definitely kill literally everyone as opposed to only most people). Their arguments include:
I agree with Nina's qualms that the authors don’t do a great job representing the counterarguments. There are stronger counter-arguments out there, although they are not the counter-arguments that most people reading the book will have in their minds.
This is probably my biggest disagreement with Nina. She says humans are a successful species, but I think she’s conflating success in human terms with success in evolutionary terms (especially compared to other species—the bacteria are crushing us). I think that if humanity really set its eyes on evolutionary success, we could be way, way more evolutionarily successful than we currently are, except for the fact that evolution isn't exactly optimizing for a coherent thing. If what we wanted was lots of copies of our literal DNA everywhere, we could probably start sticking it into bacteria or something and having it replicate.[2]
Human values don’t seem totally orthogonal to what evolution was optimizing for, and the outcomes so far are still fairly good from an evolutionary perspective. I wish Nate and Eliezer touched on that more. But we are still clearly somewhat misaligned.[3]
Nina also says “Evolution is much less efficient and slower than any method used to train AI models and hasn’t finished. If you take a neural network in the middle of training it sure won’t be good at maximizing its objective function.” This doesn’t seem persuasive to me. Evolution will not succeed anywhere near as well as it would have if humanity were truly trying to maximize evolution's objectives. And even if humanity were to go extinct, evolution would probably still not win out, because it would likely just evolve another human-like species that would also be misaligned.
To extend Nina’s analogy, if training your AI causes a superintelligent model to arise that is misaligned with your values, then maybe your training is never going to finish.
Nina also talks about how the environment is constantly changing and humans are off-distribution from the ancestral environment. But I think this is part of the point the authors are trying to make. The future, once shaped by AIs, will be very off-distribution from where AIs were trained, at the very least. The authors argue a stronger claim—that there's a distributional shift once the model is capable of taking over the world, though personally I don't think they defend this claim very well.
Next, Nina argues that the fact that LLMs don't directly encode your reward function makes them less likely to be misaligned, not more, the way IABIED implies. I think maybe she’s straw-manning the concerns here. She asks “What would it mean for models to encode their reward functions without the context of training examples?” But nobody’s arguing against examples, they’re just saying it might be more reassuring if the architecture directly included the reward function in the model itself. For instance, if the model was at every turn searching over possible actions and choosing one that will maximize this reward function. (Of course, no one knows how to do that in practice yet, but everyone’s on the same page about that.)
Nina claims it's good that AIs learn by generalizing from examples to really understand what we want, but I think this is missing some of the point. The main concern outlined in IABIED is not an outer alignment failure—it’s not that we won't be able to articulate what we want. The concern is that even if we articulate what we want totally perfectly and all of our examples demonstrate what we want really well, there is no guarantee that the AI we get will optimize for what we specified.
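As a toy sketch of that worry (made-up data, loosely analogous to the goal-misgeneralization examples in the alignment literature, not something from the book): a learner can fit every training example exactly as we intended and still have latched onto the wrong feature, which only shows up once the distribution shifts.

```python
# Toy sketch (made-up data): every training example is labeled exactly as intended
# ("pick up keys"), yet a learner that latches onto an incidental feature (keys
# were always gold during training) fits the data perfectly and still diverges
# from what we specified once the distribution shifts.

train = [
    ({"shape": "key", "color": "gold"}, "pick_up"),
    ({"shape": "rock", "color": "gray"}, "ignore"),
]

learned_rule = lambda obj: "pick_up" if obj["color"] == "gold" else "ignore"
intended_rule = lambda obj: "pick_up" if obj["shape"] == "key" else "ignore"

assert all(learned_rule(x) == y for x, y in train)      # perfect fit on training data

off_dist = {"shape": "key", "color": "silver"}          # keys stop being gold
print(learned_rule(off_dist), intended_rule(off_dist))  # ignore pick_up
```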
Next, Nina complains that the Mink story is simplistic with how Mink “perfectly internalizes this goal of maximizing user chat engagement”. The authors say the Mink scenario is a very simple “fairytale”—they argue that in this simple world, things go poorly for humanity, and that more complexity won’t increase safety.
I think Nina has a different kind of complexity in mind, which the authors don’t touch on. It seems like she thinks real models won’t be so perfectly coherent and goal-directed. But I don’t really think Nina spells out why she believes this and there are a lot of counter-arguments. The main counter-argument in the book is that there are good incentives to train for that kind of coherence.
I agree with Nina and Scott about sharp left turns seeming less likely than Nate and Eliezer think.
I think that covers Nina's important substantive claims where I disagree.
In particular it seems like it might be a little unfair/bad if people can give copies of books only to sympathetic people before the book comes out, drum up support and praise since the people they gave copies to can read the book, and then not face criticism until much later when skeptical people can get ahold of the book. But also maybe that’s just the norm in book circles? And I don’t have an amazing alternative.
Is that really what evolution is optimizing for? I'm not sure, but I think a lot of the point is that it's under-constrained, and the thing that the AI is “really” optimizing for is also under-constrained. Is it reward? And if it's reward, is it the number in some register going up, or the kind of scenario where the number in the register would go up if the register wasn't hacked? I'm not sure.
I've heard a lot of arguments about these points, but one thing I haven't heard brought up before is that evolution might have an easier time than humanity at having its values be preserved. Replicating yourself is really useful. Evolution's “goal” was picked to be something that was really useful, not picked to be something particularly complex and glorious. So it's kind of natural in some ways that humanity wouldn't go against it very much.