Firstly, thanks for writing this, sending me a draft in advance to review, and incorporating some of my feedback. I agree that my review^2 was somewhat sloppy, i.e. I didn't argue for my perspective clearly. To frame things, my overall perspective here is that (1) AI misalignment risk is not "fake" or trivially low, but it is far lower than the book's authors claim, and (2) the MIRI-AI-doom cluster relies too much on arguments by analogy and spherical-cow game-theoretic agent models while neglecting to incorporate the empirical evidence from modern AI/ML development. I recently wrote a different blogpost trying to steelman their arguments from a more empirical perspective (as well as possible counterarguments that reduce, but do not cancel, the strength of the main argument).
I plan to actually read and review the book "for real" once I get access to it (and have the time).
Some concrete comments on this review^3:
Nina is clearly trying to provide an in-depth critique of the book itself
It may have come across that way, but that was not my goal. Though implicit in the post is my broader knowledge of MIRI's arguments, so it's slightly based on that and not just Scott's review.
Nina says “The book seems to lack any explanation for why our inability to give AI specific goals will cause problems,” but it seems pretty straightforward to me
That's a misquote. In my post I say that the "basic case for AI danger", as presented by Scott (presumably based on the book), lacks any explanation for why our inability to “give AI specific goals” will result in a superintelligent AI acquiring and pursuing a goal that involves killing everyone. It's possible the book makes the case better than Scott does (this is a review of a review after all), but from my experience reading other things from MIRI, they make numerous questionable assumptions when arguing that a model that hasn't somehow perfectly internalized the ideal objective function will become an unimaginably capable entity taking all its actions in service of a single goal that requires the destruction of humanity. I don't think the analogies, stories, and math are strong enough evidence for this being anywhere near inevitable considering the number of assumptions required, and the counterevidence from modern ML.
She says humans are a successful species, but I think she’s conflating success in human terms with success in evolutionary terms
This is a fair point. I should have stuck to explaining why evolution is a biased analogy rather than appearing to defend the claim that humans are somehow “aligned” with it. Though we're not egregiously misaligned with evolution.
nobody’s arguing against examples, they’re just saying it might be more reassuring if the architecture directly included the reward function in the model itself. For instance, if the model was at every turn searching over possible actions and choosing one that will maximize this reward function
What does this mean though, if one doesn't think there's a single Master Reward Function (you may think there is one, but I don't)? Modern ML works by saying "here is a bunch of examples of what you should do in different scenarios", or "here are a bunch of environments with custom reward functions based on what we want achieved in each environment". Unless you predict a huge paradigm shift, the global reward function is not well-defined. You could say, oh, it would be good if the model directly included all our custom reward functions, but then that is like saying, oh, it would be good if models just memorized their datasets.
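To make this concrete, here is a minimal toy sketch (all task names and reward functions made up) of what I mean: the training signal is a patchwork of per-task reward functions applied to sampled examples, and there is no single global reward function left over for the finished model to carry around and explicitly maximize.

```python
# Toy sketch: training feedback comes from a patchwork of task-specific reward
# functions (all names/functions made up), so there is no single global reward
# function left over for the finished model to carry around and maximize.

reward_fns = {
    "math": lambda answer, target: 1.0 if answer == target else 0.0,
    "safety": lambda answer, target: 0.0 if "harmful" in answer else 1.0,
    "chat": lambda answer, target: min(len(answer), 50) / 50.0,  # crude engagement proxy
}

def training_signal(task, answer, target=None):
    """Scalar feedback from whichever task-specific reward function applies
    to this example; nothing resembling a global objective is defined."""
    return reward_fns[task](answer, target)

print(training_signal("math", "42", "42"))   # 1.0
print(training_signal("chat", "hi there"))   # 0.16
```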
Nina complains that the Mink story is simplistic with how Mink “perfectly internalizes this goal of maximizing user chat engagement”. The authors say the Mink scenario is a very simple “fairytale”—they argue that in this simple world, things go poorly for humanity, and that more complexity won’t increase safety.
Here I mean to point out that the implication that AI will eventually start behaving like a perfect game-theoretic agent that ruthlessly optimizes for a strange goal is simplistic, not that the specifics of the Mink story (i.e. what specifically happens, which goal emerges) are simplistic.
I think Nina has a different kind of complexity in mind, which the authors don’t touch on. It seems like she thinks real models won’t be so perfectly coherent and goal-directed. But I don’t really think Nina spells out why she believes this and there are a lot of counter-arguments. The main counter-argument in the book is that there are good incentives to train for that kind of coherence.
Yes, I don't properly present the arguments in my review^2. I do that a bit more in this post which is linked in my review^2. And I don't mean to dismiss the possibility entirely, just to argue that presenting this as an inevitability is misleading.
A thing I can't quite tell is whether you're incorporating into your model the thing the book is about:
"AI that is either more capable than the rest of humanity combined, or is capable of recursively self-improving and situationally aware enough to maneuever itself into having the resources to do so (and then being more capable than the rest of humanity combined), and which hasn't been designed in a fairly different way from the way current AIs are created."
I'm not sure if you're more like "if that happened, I don't see why it'd be particularly likely to behave like an ideal agent ruthlessly optimizing for alien goals", or if you're more like "I don't really buy that this can/will happen in the first place."
(the book is specifically about that type of AI, and has separate arguments for "someday someone will make that" and "when they do, here's how we think it'll go")
and which hasn't been designed in a fairly different way from the way current AIs are created
Is this double negative intended? Do you mean has been designed in a fairly different way?
No. The argument is "the current paradigm will produce the Bad Thing by default, if it continues on what looks like its default trajectory." (i.e. via training, in a fashion where it's not super predictable in advance what behaviors the training will result in across various off-distribution scenarios)
OK, cool. For some reason the sentence read weirdly to me so I wanted to clarify before replying (because if it was the case that the book was premised on a sudden paradigm shift in AI development that I didn't think would occur by default, then that would indeed be an important step in the argument that I don't address at all).
To answer your question directly, I think I disagree with the authors both on how capable AIs will become in the short-to-medium term (let's say over the next century, to be concrete), and on the extent to which very capable AIs will be well-modeled as ideal agents optimizing for alien-to-us goals. As mentioned, I'm not saying it's necessarily impossible, just very far from an inevitability. My view is based on (1) predictions about how SGD behaves and how AI training approaches will evolve (within the current paradigm), (2) physical constraints on data and compute, and (3) pricing in prosaic safety measures that labs are already taking and will (hopefully!) continue to take.
Because I don't predict a fast take-off, I also think that if things start turning out to look worse than I expect, we'll see warnings.
Nod.
FYI I don't think the book is making a particular claim that any of this will happen soon, merely that when it happens, the outcome will very likely be human extinction. The point is not that it'll happen at a particular time or in a particular way – the LLM/ML paradigm might hit a wall, there might need to be algorithmic advances, it might instead route through narrow AI getting really good at conducting and leveraging neuroscience and making neuromorphic AI or whatever.
But the fact that we know human brains run on a relatively low amount of power and training data means we should expect this to happen sooner or later. (But meanwhile, it sure does seem like the current paradigm keeps advancing and a lot of money is being poured in, so it seems at least reasonably likely that it'll be sooner rather than later.)
The book doesn't argue for a particular timeline, but it personally seems weird to me to expect it to take another century, in particular when you can leverage narrower pseudo-general AI to help you make advances. And I have a hard time imagining takeoff taking longer than a decade, or really even a couple of years, once you hit full generality.
Well, I hope we can have tea with our great-grandchildren in 100 years and discuss which predictions panned out!
Curious if there are any bets you'd make where, if they happened in the next 10 years or so, you'd significantly re-evaluate your models here?
I'd have to think about this more to have concrete bet ideas. Though feel free to suggest some.
Maybe one example is that I think the likelihood of >10% yearly GDP growth in any year in the next 10 years is <10%.
Nod. Does anything in the "AI-accelerated AI R&D" space feel cruxy for you? Or "a given model seems to be semi-reliably producing Actually Competent Work in multiple scientific fields"?
Oh man, the misquote is bad, sorry for that. What happened is I voice-typed and WhisperFlow decided to put quotation marks around it. I will edit and remove the quotation marks now.
Unless you predict a huge paradigm shift
I think Nate and Eliezer want a huge paradigm shift (though they don't think one is likely; it's not like they're optimistic) and think things are extremely challenging under the status quo. I think they agree there might not be a great alternative; they only gesture at this issue to show why alignment is tricky. Tbc, I am only trying to present my understanding of their arguments; I might not be understanding them well.
For instance, if the model was at every turn searching over possible actions and choosing one that will maximize this reward function. (Of course, no one knows how to do that in practice yet, but everyone’s on the same page about that.)
I'm interpreting you as saying "If we solved outer alignment & had a perfect reward function, it would be good if the model itself was optimizing for that reward function (ie inner alignment)"
In which case, we are not on the same page (ie inner & outer alignment decompose a hard problem into two more difficult ones).
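To be concrete about the architecture in the quote, here's a minimal sketch (all functions hypothetical, not anything anyone knows how to build at the relevant generality) of what "searching over possible actions and choosing one that maximizes the reward function" would look like:

```python
# Minimal sketch (hypothetical functions) of the architecture described in the
# quote: at every step the agent explicitly searches over candidate actions and
# picks the one scoring highest under a hand-specified reward function, rather
# than acting from weights shaped indirectly by training examples.

def explicit_maximizer(state, candidate_actions, reward_fn, simulate):
    """Return the action whose simulated outcome maximizes reward_fn."""
    return max(candidate_actions, key=lambda a: reward_fn(simulate(state, a)))

# Toy usage: prefer actions that move a numeric state toward 10.
best = explicit_maximizer(
    state=3,
    candidate_actions=[-1, 0, +1],
    reward_fn=lambda s: -abs(s - 10),
    simulate=lambda s, a: s + a,
)
print(best)  # 1
```

The hard parts, of course, are the `simulate` and `reward_fn` pieces, which is where I think the inner/outer decomposition makes the problem harder rather than easier.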
For the book, it's interesting they went w/ the evolution argument. I still prefer the shard theory analogy of humans being misaligned w/ the reward system (ie I intentionally avoid taking fentanyl, even though that would be very rewarding/reinforcing), which can still end up in similar sharp left turns if the model eg does take fentanyl (or pursues other goals).
Evolution is still not believed by everyone in the US (estimates of disbelief oddly range from 17% to 37%), which can be offputting to some, & you also have to understand evolution to an extent. I assume most folks can sort of see that if you really optimized for evolution, you'd do a lot more than we are to pass on genes; however, optimizing for "evolution" is underconstrained & invites arguments of the form "well, we're actually still doing quite well by evolution's lights".
Now instead let's focus on optimizing for the human reward system. People believe in very addictive drugs & can see their effects. It's pretty easy to imagine "a drug addict becomes extremely powerful, and you try to stop them; what goes wrong?" It's also quite coherent what optimizing your reward system looks like!
The evolution analogy is still good under the inner-outer alignment frame, since humans would be in evolution's seat & it seems difficult to avoid the same issues. Whereas the human reward system seems easier to satisfy (eg just give the AI fentanyl). This can be worked around by discussing how hard it is to design a perfect reward function that doesn't end up goodharting.
If we can’t give them a specific goal, then the AIs will have a goal that is not exactly what we intended, and then their goal will be at odds with ours.
That's too compressed to make much sense. It's not the case that an AI has to have a goal, and therefore any AI that isn't given a goal will have one of its own. It is the case that if an AI is trained into having an approximation of a human goal, it will only have an approximation of it, but it's not at all clear that such an approximation would be "at odds".
The future, once shaped by AIs, will be very off-distribution from where AIs were trained, at the very least.
And that's a problem because...? You are probably assuming incorrigibility or value lock-in, since doomers tend to, but it ain't necessarily so.
"from my experience reading other things from MIRI, they make numerous questionable assumptions when arguing that a model that hasn’t somehow perfectly internalized the ideal objective function will become an unimaginably capable entity taking all its actions in service of a single goal that requires the destruction of humanity"
Yep.
In particular it seems like it might be a little unfair/bad if people can give copies of books only to sympathetic people before the book comes out
Based on the several negative-ish reviews from the press, it's not my impression that only sympathetic people got advance copies. I assume that some people who have existing relationships with the authors got advance copies, as did some professional book reviewers.
Since there's speculation about advance copies in this thread, and I was involved in a fair deal of that, I'll say some words on the general criteria for advance copies:
We didn't want to empower people who seemed to be at some risk of taking action to deflate the project ahead of release, and so had a pretty high bar for sharing there. We especially wouldn't share a copy with someone if we thought there was a significant chance the principal effect of doing so was early and authoritative deflation to a deferential audience who could not yet read the book themselves. This is because we wanted people to read the book, rather than relying on others for their views.
I agree with the person Eli's quoting that this introduces some selection bias in the early stages of the discourse. However, I will say that the vast majority of parties we shared advance copies with were, by default, neutral toward us, either having never heard of us before, or having googled Eliezer and realized it might be worth writing/talking about. There was, to my knowledge, no deliberate campaign to seed the discourse, and many of our friends and allies with whom we had the opportunity to score cheap social points by sharing advance copies did not receive them. Journalists do not agree in advance to cover you positively, and we've seen that several who were given access to early copies indeed covered the book negatively (e.g. NYT and Wired — two of the highest-profile things that will happen around the book at all).
[the thing is happening where I put a number of words into this that is disproportionate to my feelings or the importance; I don't take anyone in this thread to be making accusations or being aggressive/unfair. I just see an opportunity to add value through a little bit of transparency.]
I am not aware of any critical people inside this community (e.g. people who would be similarly critical to me) getting copies in advance, except for people who write book reviews.
I also don't think sympathetic people who aren't writing book reviews got copies in-advance, so my best guess is stuff is relatively symmetric. I don't really know how copies were distributed, but my sense is not that many advance copies were distributed in-total (my sense is largely downstream of publisher preference).
IIRC Aella and Grimes got copies in advance and AFAIK haven't written book reviews (at least not in the sense Scott or the press did).
Aella is the partner of one of the authors! Of course she had advance access! I don't know about Grimes; seems plausible to me (though it's not super clear whether to count her as sympathetic or critical, I don't really know what she believes about this stuff, and also she has a lot of reach otherwise).
Not arguing with the main point, but in its current state that Wikipedia section of “Critical reception” appears to list many positive reviews and quotes.
For example:
Kevin Canfield wrote in the San Francisco Chronicle that the book makes powerful arguments, and recommended it.
Next, Nina argues that the fact that LLMs don't directly encode your reward function makes them less likely to be misaligned, not more, the way IABIED implies. I think maybe she’s straw-manning the concerns here. She asks “What would it mean for models to encode their reward functions without the context of training examples?” But nobody’s arguing against examples, they’re just saying it might be more reassuring if the architecture directly included the reward function in the model itself. For instance, if the model was at every turn searching over possible actions and choosing one that will maximize this reward function. (Of course, no one knows how to do that in practice yet, but everyone’s on the same page about that.)
Can you expand a little bit on this? I don't understand why replacing "here are some examples of world states and what actions led to good/bad outcomes, try to take actions which are similar to the ones which led to good outcomes in similar world states in the past" with "here's a reward function directly mapping world-state -> goodness" would be reassuring rather than alarming.
Having more insight into exactly what past world states are most salient for choosing the next action, and why it thinks those world states are relevant, is desirable. But "we don't currently have enough insight with today's models for technical reasons" doesn't feel like a good reason to say "and therefore we should throw away this entire promising branch of the tech tree and replace it with one that has had [major problems every time we've tried it](https://en.wikipedia.org/wiki/Goodhart%27s_law)".
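To spell out the Goodhart concern with a toy example (made-up numbers, purely illustrative): hard-maximizing a hand-written "world-state -> goodness" scorer selects whatever state the scorer mis-rates, not what we actually wanted.

```python
# Toy illustration (made-up numbers) of the Goodhart concern: hard-maximizing a
# proxy "world-state -> goodness" scorer selects whatever the proxy mis-rates,
# not what we actually wanted.

true_value = {"helpful_answer": 1.0, "sycophantic_answer": 0.2, "reward_hack": -5.0}
proxy_score = {"helpful_answer": 0.9, "sycophantic_answer": 1.1, "reward_hack": 2.0}

chosen = max(proxy_score, key=proxy_score.get)  # what a pure proxy-maximizer picks
print(chosen, true_value[chosen])               # reward_hack -5.0
```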
Am I misinterpreting what you're saying though, and there's a different thing which everyone is on the same page about?
A review of Nina Panickssery’s review of Scott Alexander’s review of the book "If Anyone Builds It, Everyone Dies" (IABIED).
This essay is not my best work but I just couldn't resist. Thanks to Nina and others for comments/feedback.
I confess I mostly wrote this because I think a review of a review of a review is funny. But I also have a lot of disagreements with both Nina and the authors of IABIED. Nina’s review is the first time a skeptic has substantively argued against the book (because the book isn’t out and the authors haven’t given advance copies to very many skeptics). I want the discourse around critiques of the book to be good. I want people to understand the real limitations of the authors’ arguments, not straw men of them.
Although she frames her writing as a review of Scott’s review, Nina is clearly trying to provide an in-depth critique of the book itself. Unfortunately, it’s hard for her to avoid straw-manning, since she hasn’t read the book and can only go off of Scott’s review. I happen to have read IABIED, which gives me a leg up on writing about it.
It’s of course not Nina’s fault she hasn’t been given an advance copy of the book. I feel torn about whether or not it’s bad form to review a book she hasn’t read.[1]
(I may publish my own review which goes into my views when IABIED comes out.)
Nina complains about part of the book giving a sci-fi story about how everything could go wrong, because she doesn't like the idea of generalizing from fictional evidence. But I think it's clearly very valuable to lay out a concrete scenario of AI disaster. The authors spend several pages bemoaning how fraught writing a scenario like this is. I think if they had refused to give a concrete scenario, it would have seemed like a cop-out or an admission that their arguments rely on hand-waving and magic. I really like concrete scenarios and wish there were more of them, because they head off this obvious and common critique. They’re labor-intensive to write and hard to pull off, so I appreciate whenever people try.
I also think talking through potential future scenarios is a good way to concretize and understand things. If someone says AI takeover could never happen, laying out one story for how it could happen, among a sea of possible futures, can build intuitions and help people see where their views diverge. I think the scenario makes it much easier for me to talk about where I disagree with Nate and Eliezer.
Of course, concrete scenarios can also be misleading. You can hide a lot of sleight-of-hand by saying, “Well, naturally, this exact scenario is unlikely, but I had to pick a specific scenario, and any specific scenario is unlikely.” But I don't really think the scenario in the book is guilty of doing this intentionally and on net I’m very glad it’s in the book.
I also think Nate and Eliezer do a good job communicating that their scenario is just one possibility of many, and things are almost certainly not going to go as they describe. They also go out of their way to make it a story that is not crafted to be entertaining. The main issue Eliezer has with generalizing from fictional evidence is that it is designed to be a fun read, which comes at the cost of realism. But Nate and Eliezer try dutifully to avoid this.
Nina says the book lacks an explanation for why our inability to give AI specific goals will cause problems, but it seems pretty straightforward to me. If we can't give them a specific goal, then the AIs will have a goal that is not exactly what we intended, and then their goal will be at odds with ours.
The authors do a fine job providing the standard arguments for why an AI that is not aligned with humanity's goals might kill everyone (although I think they don't spend enough time on why it would definitely kill literally everyone as opposed to only most people). Their arguments include:
I agree with Nina's qualms that the authors don’t do a great job representing the counterarguments. There are stronger counter-arguments out there, although they are not the counter-arguments that most people reading the book will have in their minds.
This is probably my biggest disagreement with Nina. She says humans are a successful species, but I think she’s conflating success in human terms with success in evolutionary terms (especially compared to other species—the bacteria are crushing us). I think that if humanity really set its eyes on evolutionary success, we could be way, way more evolutionarily successful than we currently are, except for the fact that evolution isn't exactly optimizing for a coherent thing. If what we wanted was lots of copies of our literal DNA everywhere, we could probably start sticking it into bacteria or something and having it replicate.[2]
Human values don’t seem totally orthogonal to what evolution was optimizing for, and the outcomes so far are still fairly good from an evolutionary perspective. I wish Nate and Eliezer touched on that more. But we are still clearly somewhat misaligned.[3]
Nina also says “Evolution is much less efficient and slower than any method used to train AI models and hasn’t finished. If you take a neural network in the middle of training it sure won’t be good at maximizing its objective function.” This doesn’t seem persuasive to me. Evolution will not succeed anywhere near as well as it would have if humanity were truly trying to maximize evolution's objectives. And even if humanity were to go extinct, evolution would probably still not win out, because it would likely just evolve another human-like species that would also be misaligned.
To extend Nina’s analogy, if training your AI causes a superintelligent model to arise that is misaligned with your values, then maybe your training is never going to finish.
Nina also talks about how the environment is constantly changing and humans are off-distribution from the ancestral environment. But I think this is part of the point the authors are trying to make. The future, once shaped by AIs, will be very off-distribution from where AIs were trained, at the very least. The authors argue a stronger claim—that there's a distributional shift once the model is capable of taking over the world, though personally I don't think they defend this claim very well.
Next, Nina argues that the fact that LLMs don't directly encode your reward function makes them less likely to be misaligned, not more, the way IABIED implies. I think maybe she’s straw-manning the concerns here. She asks “What would it mean for models to encode their reward functions without the context of training examples?” But nobody’s arguing against examples, they’re just saying it might be more reassuring if the architecture directly included the reward function in the model itself. For instance, if the model was at every turn searching over possible actions and choosing one that will maximize this reward function. (Of course, no one knows how to do that in practice yet, but everyone’s on the same page about that.)
Nina claims it's good that AIs learn by generalizing from examples to really understand what we want, but I think this is missing some of the point. The main concern outlined in IABIED is not an outer alignment failure—it’s not that we won't be able to articulate what we want. The concern is that even if we articulate what we want totally perfectly and all of our examples demonstrate what we want really well, there is no guarantee that the AI we get will optimize for what we specified.
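As a toy sketch of that worry (made-up data, loosely analogous to the goal-misgeneralization examples in the alignment literature, not something from the book): a learner can fit every training example exactly as we intended and still have latched onto the wrong feature, which only shows up once the distribution shifts.

```python
# Toy sketch (made-up data): every training example is labeled exactly as intended
# ("pick up keys"), yet a learner that latches onto an incidental feature (keys
# were always gold during training) fits the data perfectly and still diverges
# from what we specified once the distribution shifts.

train = [
    ({"shape": "key", "color": "gold"}, "pick_up"),
    ({"shape": "rock", "color": "gray"}, "ignore"),
]

learned_rule = lambda obj: "pick_up" if obj["color"] == "gold" else "ignore"
intended_rule = lambda obj: "pick_up" if obj["shape"] == "key" else "ignore"

assert all(learned_rule(x) == y for x, y in train)      # perfect fit on training data

off_dist = {"shape": "key", "color": "silver"}          # keys stop being gold
print(learned_rule(off_dist), intended_rule(off_dist))  # ignore pick_up
```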
Next, Nina complains that the Mink story is simplistic with how Mink “perfectly internalizes this goal of maximizing user chat engagement”. The authors say the Mink scenario is a very simple “fairytale”—they argue that in this simple world, things go poorly for humanity, and that more complexity won’t increase safety.
I think Nina has a different kind of complexity in mind, which the authors don’t touch on. It seems like she thinks real models won’t be so perfectly coherent and goal-directed. But I don’t really think Nina spells out why she believes this and there are a lot of counter-arguments. The main counter-argument in the book is that there are good incentives to train for that kind of coherence.
I agree with Nina and Scott about sharp left turns seeming less likely than Nate and Eliezer think.
I think that covers Nina's important substantive claims where I disagree.
In particular it seems like it might be a little unfair/bad if people can give copies of books only to sympathetic people before the book comes out, drum up support and praise since the people they gave copies to can read the book, and then not face criticism until much later when skeptical people can get ahold of the book. But also maybe that’s just the norm in book circles? And I don’t have an amazing alternative.
Is that really what evolution is optimizing for? I'm not sure, but I think a lot of the point is that it's under-constrained, and the thing that the AI is “really” optimizing for is also under-constrained. Is it reward? And if it's reward, is it the number in some register going up, or the kind of scenario where the number in the register would go up if the register wasn't hacked? I'm not sure.
I've heard a lot of arguments about these points, but one thing I haven't heard brought up before is that evolution might have an easier time than humanity at having its values be preserved. Replicating yourself is really useful. Evolution's “goal” was picked to be something that was really useful, not picked to be something particularly complex and glorious. So it's kind of natural in some ways that humanity wouldn't go against it very much.