I think (both here and elsewhere where he goes into more detail) he both greatly overstates his case and also deliberately presents the case in a hostile and esoteric manner. That makes engagement unnecessarily difficult.
I notice the style of this section of your summary of other people's reviews is angrier and more openly emotive than the others. I take this to mean I've offended or upset you somehow. This is odd to me because I think my review was a lot nicer than most people expected from me (including myself). You don't seem nearly as frustrated by other people making much dumber and more bad faith arguments, so I'm curious what it is that I've done to upset you.
In any case I do not think that I make my case in a "hostile and esoteric manner". If anything I think I've kind of done worse than that by mostly not writing down my case at all because I have very high intellectual standards and don't feel comfortable articulating my intuitions until the articulation is relatively rigorous.
That having been said I don't think what I have written so far is "hostile and esoteric".
There's me attempting to explain during our podcast, where I admit it took me longer than I'd like to get to the point.
I have writing at various levels of quality and endorsement in my Twitter archive which you can find by searching for keywords like "Goodhart" and "Reward model".
Bluntly, I don't really understand what you take issue with in my review. From my perspective the structure of my review goes like this:
We are currently in an alignment winter. (This is bad)
Alignment is not solved yet but people widely believe it is. (This is bad)
I was expecting to hate the book but it actually retreats on most of the rhetoric I blame for contributing to the alignment winter. (This is good)
The style of the book is bad, but I won't dwell on it and in fact spend a paragraph on the issue and then move on.
I actually disagree with the overall thesis, but think it's virtuous to focus on the points of agreement when someone points out an important issue so I don't dwell on that either and instead
"Emphatically agree" (literal words) that AI labs are not serious about the alignment problem.
State a short version of what the alignment problem actually is. (Important because it's usually conflated with or confused with simpler problems that sound a lot easier to solve.)
I signal boost Eliezer's other and better writing because I think my audience is disproportionately made up of people who might be able to contribute to the alignment problem if they're not deeply confused about it and I think Eliezer's earlier work is under-read.
I reiterate that I think the book is kinda bad, since I need a concluding paragraph.
I continue to think this is a basically fair review.
I had to reread part 7 from your review to fully understand what you were trying to say. It’s not easy to parse on a quick read, so I’m guessing Zvi didn’t interpret the context and content correctly, like I didn’t on my first pass. On first skim, I thought it was a technical argument about how you disagreed with the overall thesis, which makes things pretty confusing.
If that's your reaction to my reaction, then it was a miss in at least some ways, which is on me.
I did not feel angry (more like frustrated?) when I wrote it nor did I intend to express anger, but I did read your review itself as expressing anger and hostility in various forms - you're doing your best to fight through that and play fair with the ideas as you see them, which is appreciated - and have generally read your statements about Yudkowsky and related issues consistently as being something in the vicinity of angry, also as part of a consistent campaign, and perhaps some of this was reflected in my response. It's also true that I have a cached memory of you often responding as if things said are more hostile than I felt they were or were intended, although I do not recall examples at this point.
And I hereby report that, despite at points in the past putting in considerable effort trying to parse your statements, at some point I found it too difficult, frustrating and aversive in some combination, and mostly stopped attempting to do so when my initial attempt on a given statement bounced (which sometimes it doesn't).
(Part of what is 'esoteric' is perhaps that the perfect-enemy-of-good thing means a lot of load-bearing stuff is probably unsaid by you, and you may not realize that you haven't said it?)
But also, frankly, when people write much dumber reviews with much dumber things in them, I mostly can't even bring myself to be mad, because I mean what else can one expect from such sources - there's only one such review that actually did make me angry, because it was someone where I expected better. It's something I've worked a lot on, and I think made progress on - I don't actually e.g. get mad at David Sacks anymore as a person, although I still sometimes get mad that I have to once again write about David Sacks.
To the extent I was actually having a reaction to you here it was a sign that I respect you enough to care, that I sense opportunity in some form, and that you're saying actual things that matter rather than just spouting gibberish or standard nonsense.
Similarly, with the one exception, if those people had complained about my reaction to their reaction in the ways I'd expect them to do so, I would have ignored them.
Versus your summary of your review, I would say I read it more as:
We are currently in an alignment winter. (This is bad). This is asserted as 'obvious' and then causes are cited, all in what I read as a hostile manner, along with assertions of 'facts not in evidence' that I indeed disagree with, including various forms of derision that read in-context as status attacks and accusations of bad epistemic action, and the claim that the value loading problem has been solved. All of this is offered in a fashion that implies you think it is clearly true if not rather obvious, and it is all loaded up front despite not being especially relevant to the book, echoing things you talk about a lot. This sets the whole thing up as an adversarial exercise. You can notice that in my reaction I treated these details as central, in a way you don't seem to think they are, or at least I think the central thing boils down to this?
Alignment is not solved yet but people widely believe it is. (This is bad). It's weird because you say 'we solved [X] and people think [X] solves alignment but it doesn't' where I don't think it's true we solved [X].
I was expecting to hate the book but it actually retreats on most of the rhetoric I blame for contributing to the alignment winter. (This is good) Yes.
The style of the book is bad, but I won't dwell on it and in fact spend a paragraph on the issue and then move on. 'Truly appalling' editorial choices, weird and often condescending, etc. Yes it's condensed but you come on very strong here (which is fine, you clearly believe it, but I wouldn't minimize its role). Also your summary skips over the 'contempt for LLMs' paragraph.
I actually disagree with the overall thesis, but think it's virtuous to focus on the points of agreement when someone points out an important issue so I don't dwell on that either and instead.
"Emphatically agree" (literal words) that AI labs are not serious about the alignment problem.
State a short version of what the alignment problem actually is. (Important because it's usually conflated with or confused with simpler problems that sound a lot easier to solve.)
I signal boost Eliezer's other and better writing because I think my audience is disproportionately made up of people who might be able to contribute to the alignment problem if they're not deeply confused about it and I think Eliezer's earlier work is under-read.
I reiterate that I think the book is kinda bad, since I need a concluding paragraph.
I read 'ok' in this context as better than 'kinda bad' fwiw.
As for 'I should just ask you,' I notice this instinctively feels aversive as likely opening up a very painful and time consuming and highly frustrating interaction or set of interactions and I notice I have the strong urge not to do it. I forget the details of the interactions with you in particular or close others that caused this instinct, and it could be a mistake. I could be persuaded to try again.
I do know that when I see the interactions of the entire Janus-style crowd on almost anything, I have the same feeling I had with early LW, where I expect to get lectured to and yelled at and essentially downvoted a lot, including in 'get a load of this idiot' style ways, if I engage directly in most ways and it puts me off interacting. Essentially it doesn't feel like a safe space for views outside a certain window. This makes me sad because I have a lot of curiosity there, and it is entirely possible this is deeply stupid and if either side braved mild social awkwardness we'd all get big gains from trade and sharing info. I don't know.
I realize it is frustrating to report things in my head where I can't recall many of the sources of the things, but I am guessing that you would want me to do that given that this is the situation.
I dunno, man, this is definitely a 'write the long letter' situation and I'm calling it here.
(If you want to engage further, my reading of LW comments even on my own posts is highly unreliable, but I would get a PM or Twitter DM or email etc pretty reliably).
I do know that when I see the interactions of the entire Janus-style crowd on almost anything
This seems like a good time to point out that I'm fairly different from Janus. My reasons for relative optimism on AI alignment probably (even I don't know) only partially overlap Janus's reasons for relative optimism. The things I think are important and salient only partially overlap what Janus thinks is important and salient (e.g. I think Truth Terminal is mostly a curiosity and that the "Goatse Gospel" will not be recognized as a particularly important document). So if you model my statements and Janus's as statements from the same underlying viewpoint you're going to get very confused. In the old days of the Internet if people asked you the same questions over and over and this annoyed you, you'd write a FAQ. Now they tell you that it's not their job to educate you (then who?) and get huffy. When people ask me good faith questions (as opposed to adversarial Socratic questions whose undertone is "you're bad and wrong and I demand you prove to me that you're not") because they found something I said confusing I generally do my best to answer them.
(Part of what is ‘esoteric’ is perhaps that the perfect-enemy-of-good thing means a lot of load-bearing stuff is probably unsaid by you, and you may not realize that you haven’t said it?)
. . .
As for ‘I should just ask you,’ I notice this instinctively feels aversive as likely opening up a very painful and time consuming and highly frustrating interaction or set of interactions and I notice I have the strong urge not to do it.
That's fair. I'll note on the other end that a lot of why I don't say more is that there are many statements which I expect are true and are load bearing beliefs that I can't readily prove if challenged. Pretty much every time I try to convince myself that I can just say something like "humans don't natively generalize their values out of distribution" I am immediately punished by people jumping me as though that isn't an obvious, trivially true statement if you're familiar with the usual definitions of the words involved. If I come off as contemptuous when responding to such things, it's because I am contemptuous and rightfully so. At the same time there is no impressive reasoning trace I can give for a statement like "humans don't natively generalize their values out of distribution" because there isn't really any impressive reasoning necessary. At first when humans encounter new things they don't like them, then later they like them just from mere exposure/having had time to personally benefit from them. This is in and of itself sufficient evidence for the thesis; each additional bit of evidence you need beyond that halves both the hypothesis space and my expectation of getting any useful cognitive effort back from the person I have to explain it to. The reasoning goes something like:
Something is out of distribution to a deep learning model when it cannot parse or provide a proper response to the thing without updating the model. "Out of distribution" is always discussed in the context of a machine learning model trained from a fixed training distribution at a given point in time. If you update the model and it now understands the thing, it was still out of distribution at the time before you updated the model. (A toy illustration of this is sketched after this chain of reasoning.)
Humans encounter new things all the time that their existing values imply they should like. They then have very bad reactions to those things, reactions which subside with repeated exposure. This implies that the deep learning models underlying the human mind have to be updated before the human generalizes their values. Realistically it probably isn't even "generalizing" their values so much as changing (i.e. adding values to) their value function.
If the human has to update before their values "generalize" on things like the phonograph or rock music, then clearly they do not natively generalize their values very far.
If humans do not generalize their values very far, then they do not generalize their values out of distribution in the way we care about for them being sufficient to constrain the actions of a superhuman planner.
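To make the narrow technical sense of "out of distribution" in the chain above concrete, here is a minimal toy sketch in Python. It is purely illustrative and not part of the original exchange; the model, data, and numbers are invented for the example. The point it shows is just that a model fit to a fixed training distribution gives bad answers on inputs outside that distribution, and that the input was still out of distribution at the time before the model was updated.

```python
# Toy sketch (illustrative only): a "model" fit to a fixed training distribution,
# queried in and out of that distribution, then updated with new data.
import random

def fit_binned_regressor(xs, ys, bin_width=0.1):
    """Fit a trivial model: the mean of y within each bin of x."""
    bins = {}
    for x, y in zip(xs, ys):
        bins.setdefault(round(x / bin_width), []).append(y)
    return {k: sum(v) / len(v) for k, v in bins.items()}, bin_width

def predict(model, x):
    bins, bin_width = model
    key = round(x / bin_width)
    if key in bins:
        return bins[key]
    # Out of distribution input: fall back to the nearest bin the model has seen.
    nearest = min(bins, key=lambda k: abs(k - key))
    return bins[nearest]

def target(x):
    return x ** 2  # the "world" the model is trying to track

# Training distribution at time t0: x drawn from [0, 1].
xs0 = [random.uniform(0, 1) for _ in range(2000)]
model = fit_binned_regressor(xs0, [target(x) for x in xs0])

print(predict(model, 0.5), target(0.5))  # close: in distribution
print(predict(model, 5.0), target(5.0))  # far off: out of distribution at t0

# Update the model with data that covers the new region.
xs1 = xs0 + [random.uniform(0, 6) for _ in range(10000)]
model = fit_binned_regressor(xs1, [target(x) for x in xs1])
print(predict(model, 5.0), target(5.0))  # close now, but only after the update
```

The analogy being drawn is that the delayed shift in human reactions looks more like the second fit, a weight update, than like the first model generalizing on its own.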
I'm reminded a bit of Yudkowsky's stuff about updating from the empty string. The ideal thing is that I don't have to tell you "humans don't natively generalize their values out of distribution" because you already have the kind of prior that has sucked up enough bits of the regular structures generating your sensory observations that you already know human moral generalization is shallow. The next best thing is that when I say "humans don't natively generalize their values out of distribution" you immediately know what I'm talking about and go "oh huh I guess he's right, I never thought of it like that". The third best thing is if I say " At first when humans encounter new things they don't like them, then later they like them just from mere exposure" you go "oh right right duh yes of course". If after I say that you go "no I doubt the premise, I think you're wrong about this" the chance that it will be worth my time to explain what I'm talking about from an intellectual perspective in the sense that I will get some kind of insight or useful inference back rounds to zero. In the context where I would actually use it, "humans don't natively generalize their values out of distribution" would be step one of a long chain of reasoning involving statements at least as non-obvious to the kind of mind that would object to "humans don't natively generalize their values out of distribution".
On the other hand there is value in occasionally just writing such things out so that there are more people in the world who have ambiently soaked up enough bits that when they read a statement like "humans don't natively generalize their values out of distribution" they immediately and intuitively understand that it is true without a long explanation. Even if it required a long explanation this time, there are many related statements with related generators that they might not need a long explanation for if they've seen enough such long explanations, who knows. But fundamentally a lot of things like this are just people wanting me to say variations like "if you take away the symbology and change how you phrase it lots of people still fall for the basic Nazi ideology" or "we can literally see people apply moral arguments to ingroup and then fail to apply the same arguments to their near-outgroup even when later iterations of the same ideology apply it to near-outgroup" (one of the more obvious examples being American founder attitudes towards the rights of slaves in the colonies vs. the rights of free white people) until it clicks for them. But any one of these should be sufficient for you to conclude that no, uploading someone into a computer and then using their judgment as a value function on all kinds of weird superintelligent out of distribution moral decisions will not produce sanity. I should not have to write more than a sentence or two for that to be obvious.
And the thing is that's for a statement which is actually trivial, which is sufficiently trivial that my difficulty in articulating it is that it's so simple it's difficult to even come up with an impressive persuasive-y reasoning trace for propositions so simple. But there are plenty of things I believe which are load bearing, not easy to prove, and not simple, where articulating any one of them would be a long letter that changes nobody's mind even though it takes me a long effort to write it. But even that's the optimistic case. The brutal reality is that there are beliefs I have which are load bearing where I couldn't even write the long letter if prompted. "There is something deeply wrong with Yudkowsky's agent foundations arguments." is one of these, and in fact a lot of my writing is me attempting to articulate what exactly it is I feel so strongly about. This might sound epistemically impure in the sense laid out in The Bottom Line:
the actual percentage of you that survive in Everett branches or Tegmark worlds—which we will take to describe your effectiveness as a rationalist—is determined by the algorithm that decided which conclusion you would seek arguments for. In this case, the real algorithm is “Never repair anything expensive.” If this is a good algorithm, fine; if this is a bad algorithm, oh well. The arguments you write afterward, above the bottom line, will not change anything either way.
But "fuzzy intuition that you can't always articulate right away" is actually the basis of all argument in practice. If you notice something is wrong with an argument and try to say something you already have most of the bits needed to locate whatever you eventually say before you even start consciously thinking about it. The act of prompting you to think is your brain basically saying that it already has most of the bits necessary to locate whatever thing you wind up saying before you distinguished between the 4 to 16 different hypothesis your brain bothered bringing to conscious awareness. Sometimes you can know something in your bones but not be able to actually articulate the reasoning trace that would make it clear. A lot of being a good thinker or philosopher is sitting with those intuitions for a long time and trying to turn them into words. Normally we look at a perverse argument and satisfy ourselves by pointing out that it's perverse and moving on. But if you want to get better as a philosopher you need to sit with it and figure out precisely what is wrong with it. So I tend to welcome good questions that give me an opportunity to articulate what is wrong with Yudkowsky's agent foundations arguments. You should probably encourage this, because "pointing out a sufficiently rigorous problem with Yudkowsky's agent foundations arguments" is very closely related if not isomorphic to "solve the alignment problem". In the limit if I use an argument structure like "Actually that's wrong because you can set up your AGI like this..." and I'm correct I have probably solved alignment.
EDIT: It occurs to me that the first section of that reply is relatively nice and the second section is relatively unpleasant (though not directed at you), and as much as anything else you're probably confused about what the decision boundary is on my policy that decides which thing you get. I'm admittedly not entirely sure but I think it goes something like this:
(innocent, good faith, non-lazy) "I'm confused why you think humans don't generalize their values out of distribution. You seem to be saying something like 'humans need time to think so they clearly aren't generalizing' but LLMs also need time to think on many moral problems like that one about sacrificing pasta vs. a GPU so why wouldn't that also imply they don't generalize human values out of distribution? Well actually you'll probably say 'LLMs don't' to that but what I mean is why doesn't that count?"
To which you would get a nice reply like "Oh, I think what humans are actually doing with significant time lag between initial reaction and later valuation is less like doing inference with a static pretrained model and more like updating the model. You see the new thing and freak out, then your models get silently updated while you sleep or something. This is different from just sampling tokens to update in-context to generalize because if a LLM had to do it with current architectures it would just fail."
(rude, adversarial, socratic) "Oh yeah? If humans don't generalize their values out of distribution how come moral progress exists? People obviously change their beliefs throughout their lifetime and this is generalization."
To which you would get a reply like "Uh, because humans probably do continuous pretraining and also moral progress happens across a social group not individual humans usually. The modal case of moral progress doesn't look like a single person changing their beliefs later in life it looks like generational turnover or cohort effects. Science progresses one funeral at a time etc."
But even those are kind of the optimistic case in that they're engaging with something like my original point. The truly aggravating comments are when someone replies with something so confused it fails to understand what I was originally saying at all, accuses me of some kind of moral failure based on their confused understanding, and then gratuitously insults me to try and discourage me from stating trivially true things like "humans do not natively generalize their values out of distribution" and "the pattern of human values is a coherent sequence that has more reasonable and less reasonable continuations based on its existing tokens so far independent of any human mind generating those continuations"[0] again in the future.
That kind of comment can get quite an unkind reply indeed. :)
[0]: If you doubt this consider that large language models can occur in the physical universe while working nothing like a human brain mechanically.
You think you’re making a trivial statement based on your interpretation of the words you’re using, but then you draw the conclusion that an upload of a human would not be aligned? That is not a trivial statement.
First, I agree that humans don’t assign endorsed values in radically new situations or contexts without experiencing them, which seems to be what you mean when you say that humans don’t generalize our values out of distribution (HDGOVOOD). However, I don’t really agree that HDGOVOOD is a correct translation of this statement. It would be more accurate to say that “humans don’t generalize our values in advance” (HDGOVIA). And this is precisely why I think uploads are the best solution to alignment! You need the whole human mind to perform judgement in new situations. But this is a more accurate translation, because the human does generalize their values when the situation arises! What else is generalizing? A human is an online learning algorithm. A human is not a fixed set of neural weights.
(My favored “alignment” solution, of literally just uploading humans or striving for something functionally equivalent through rigorous imitation learning, is not the same as using humans to pass judgment on superintelligent decisions, which obviously doesn’t work for totally different reasons. Yet you raised the same one sentence objection to uploading alone when I proposed it, without much explanation, so consider this a response).
This doesn’t answer the example about slavery. But to this I would say that the founding fathers who kept slaves either didn’t truly value freedom for everyone, or they didn’t seriously consider the freedom of Africans (perhaps because of mistaken ideas about race). But the former means their values are wrong by our lights (which is not an alignment problem from their perspective), and the latter that they weren’t sufficiently intelligent/informed, or didn’t deliberate seriously enough, which are all problems that a sane upload (say, an upload of me) would make drastic progress on immediately.
There is also a new problem for uploads, which is that they do a lot of thinking per second - observational feedback is much sparser. This can be an issue, but I don’t really think that HDGOVOOD is a useful frame for thinking about it. An upload running at 10x speed has never been observed before, so all these loose analogies seem unlikely to apply. Instead, my intuitions come from thinking what I’d actually do if I were running modestly faster, and plan not to immediately run my emulation at 1000x speed. I find that I endorse the actions of an emulation of myself at 10x speed. If I shouldn’t, then I want a specific explanation of why not. What is it that allows me to generalize my values now (whether you call it OOD or not - let’s say, online) but would be missing then?
Now, on the meta-level: you seem to think that your one sentence statement should make your whole model nearly accessible, and basically I’m stupid if I don’t arrive at it. But there are two problems here. One, if you’re wrong, you won’t find out this way, because almost all of your reasoning is implicit. Two, it basically overestimates translatability between models - like, it seems that I simply assign a very different meaning to “generalize our values OOD” than you do, because I consider humans as online learning algorithms, so OOD barely even makes sense as a concept (there is no independence assumption in online learning), and when I try to translate your objection into my way of thinking, then it seems about half right rather than trivial, and more importantly it doesn’t seem to prove the conclusion that you’re using it to draw.
So let's consider this from a different angle. In Hanson's Age of Em (which I recommend) he starts his Em Scenario by making a handful of assumptions about Ems. Assumptions like:
We can't really make meaningful changes beyond pharmacological tweaks to ems because the brain is inscrutable.
That Ems cannot be merged for the same reasons.
The purpose of these assumptions is to stop the hypothetical Em economy from immediately self modifying into something else. He tries to figure out how many doublings the Em economy will undergo before it phase transitions into a different technological regime. Critics of the book usually ask why the Em economy wouldn't just immediately invent AGI, and Hanson has some clever cope for this where he posits a (then plausible) nominal improvement rate for AI that implies AI won't overtake Ems until five years into the Em economy or something like this. In reality AI progress is on something like an exponential curve and that old cope is completely unreasonable.
So the first assumption of a "make uploads" plan is that you have a unipolar scenario where the uploads will only be working on alignment, or at least actively not working on AI capabilities. There is a further hidden assumption in that assumption which almost nobody thinks about, which is that there is such a thing as meaningful AI alignment progress separate from "AI capabilities" (I tend to think they have a relatively high overlap, perhaps 70%?). This is not in and of itself a dealbreaker but it does mean you have a lot of politics to think about in terms of who is the unipolar power and who precisely is getting uploaded and things of this nature.
But I think my fundamental objection to this kind of thing is more like my fundamental objection to something like OpenAI's Superalignment (or to a lesser extent PauseAI), which is that this sort of plan doesn't really generate any intermediate bits of solution to the alignment problem until you start the search process, at which point you plausibly have too few bits to even specify a target. If we were at a place where we mostly had consensus about what the shape of an alignment solution looks like and what constitutes progress, and we mostly agreed that involved breaking our way past some brick wall like "solve the Collatz Conjecture", I would agree that throwing a slightly superhuman AI at the figurative Collatz Conjecture is probably our best way of breaking through.
The difference between alignment and the Collatz Conjecture however is that as far as I know nobody can find any pattern to the number streams involved in the Collatz Conjecture but alignment has enough regular structure that we can stumble into bits of solution without even intending to. There's a strain of criticism of Yudkowsky that says "you said by the time an AI can talk it will kill us, and you're clearly wrong about that" to which Yudkowsky (begrudgingly, when he acknowledges this at all) replies "okay but that's mostly me overestimating the difficulty of language acquisition, these talking AIs are still very limited in what they can do compared to humans, when we get AIs that aren't they'll kill us". This is a fair reply as far as it goes, but it glosses over the fact that the first impossible problem, the one Bostrom 2014 Superintelligence brings up repeatedly to explain why alignment is hard, is that there is no way to specify a flexible representation of human values in the machine before the computer is already superintelligent and therefore presumed incorrigible. We now have a reasonable angle of attack on that problem. Whether you think that reasonable angle of attack implies 5% alignment progress or 50% (I'm inclined towards closer to 50%) the most important fact is that the problem budged at all. Problems that are actually impossible do not budge like that!
The Collatz Conjecture is impossible(?) because no matter what analysis you throw at the number streams you don't find any patterns that would help you predict the result. That means you put in tons and tons of labor and after decades of throwing total geniuses at it you perhaps have a measly bit or two of hypothesis space eliminated. If you think a problem is impossible and you accidentally stumble into 5% progress, you should update pretty hard that "wait this probably isn't impossible, in fact this might not even be that hard once you view it from the right angle". If you shout very loudly "we have made zero progress on alignment" when some scratches in the problem are observed, you are actively inhibiting the process that might eventually solve the problem. If the generator of this ruinous take also says things like "nobody besides MIRI has actually studied machine intelligence" in the middle of a general AI boom then I feel comfortable saying it's being driven by ego-inflected psychological goop or something and I have a moral imperative to shout "NO ACTUALLY THIS SEEMS SOLVABLE" back.
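Since the Collatz Conjecture carries some weight in the two paragraphs above, here is a minimal sketch (my own illustration, not from the comment) of what the "number streams" are: the map itself is trivial to compute, and the entire difficulty is finding structure in the trajectories it produces.

```python
# The Collatz map: halve even numbers, send odd n to 3n + 1, repeat until reaching 1.
# Computing trajectories is easy; finding patterns that let you predict their
# behavior in general is the part nobody has managed.

def collatz_trajectory(n: int) -> list[int]:
    """Return the Collatz sequence starting at n, stopping at the first 1."""
    seq = [n]
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        seq.append(n)
    return seq

for start in (7, 27, 97):
    traj = collatz_trajectory(start)
    print(f"{start}: length {len(traj)}, peak {max(traj)}")
```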
So any kind of "meta-plan" regardless of its merits is sort of an excuse to not explore the ground that has opened up and ally with the "we have made zero alignment progress" egregore, which makes me intrinsically suspicious of them even when I think on paper they would probably succeed. I get the impression that things like OpenAI's Superalignment are advantageous because they let alignment continue to be a floating signifier to avoid thinking about the fact that unless you can place your faith in a process like CEV the entire premise of driving the future somewhere implies needing to have a target future in mind which people will naturally disagree about. Which could naturally segue into another several paragraphs about how when you have a thing people are naturally going to disagree about and you do your best to sweep that under the rug to make the political problem look like a scientific or philosophical problem it's natural to expect other people will intervene to stop you since their reasonable expectation is that you're doing this to make sure you win that fight. Because of course you are, duh. Which is fine when you're doing a brain in a box in a basement but as soon as you're transitioning into government backed bids for a unipolar advantage the same strategy has major failure modes like losing the political fight to an eternal regime of darkness that sound very fanciful and abstract until they're not.
You initially questioned whether uploads would be aligned, but now you seem to be raising several other points which do not engage with that topic or with any of my last comment. I do not think we can reach agreement if you switch topics like this - if you now agree that uploads would be aligned, please say so. That seems to be an important crux, so I am not sure why you want to move on from it to your other objections without acknowledgement.
I am not sure I was able to correctly parse this comment, but you seem to be making a few points.
In one place, you question whether the capabilities / alignment distinction exists - I do not really understand the relevance, since I nowhere suggested pure alignment work, only uploading / emulation etc. This also seems to be somewhat in tension with the rest of your comment, but perhaps it is only an aside and not load bearing?
Your main point, as I understand it, is that alignment may actually be tractable to solve, and a focus on uploading is an excuse to delay alignment progress and then (as you seem to frame my suggestion) have an upload solve it all at once. And this does not allow incremental progress or partial solutions until uploading works.
...then you veer into speculation about the motives / psychology of MIRI and the superalignment team which is interesting but doesn't seem central or even closely connected to the discussion at hand.
So I will focus on the main point here. I have a lot of disagreements with it.
I think you may misunderstand my plan here - you seem to characterize the idea as making uploads, and then setting them loose to either self-modify etc. or mainly to work on technical alignment. Actually, I don't view it this way at all. Creating the uploads (or emulations, if you can get a provably safe imitation learning scheme to work faster) is a weak technical solution to the alignment problem - now you have something aligned to (some) human('s) values which you can run 10x faster, so it is in that sense not only an aligned AGI but modestly superintelligent. You can do a lot of things with that - first of all, it automatically hardens the world significantly: it lowers the opportunity cost for not building superintelligence because now we already have a bunch of functionally genius scientists, you can drastically improve cybersecurity, and perhaps the uploads make enough money to buy up a sufficient percentage of GPUs that whatever is left over is not enough to outcompete them even if someone creates an unaligned AGI. Another thing you can do is try to find a more scalable and general solution to the AI safety problem - including technical methods like agent foundations, interpretability, and control, as well as governance. But I don't think of this as the mainline path to victory in the short term.
Perhaps you are worried that uploads will recklessly self-modify or race to build AGI. I don't think this is inevitable or even the default. There is currently no trillion dollar race to build uploads! There may be only a small number of players, and they can take precautions, and enforce regulations on what uploads are allowed to do (effectively, since uploads are not strong superintelligences), and technically it even seems hard for uploads to recursively self-improve by default (human brains are messy, and they don't even need to be given read/write access). Even if some uploads escaped, to recursively self-improve safely they would need to solve their own alignment problem, and it is not in their interests to recklessly forge ahead, particularly if they can be punished with shutdown and are otherwise potentially immortal. I suspect that most uploads who try to foom will go insane, and it is not clear that the power balance favors any rogue uploads who fare better.
I also don't agree that there is no incremental progress on the way to full uploads - I think you can build useful rationality enhancing artifacts well before that point - but that is maybe worth a post.
Finally, I do not agree with this characterization of trying to build uploads rather than just solving alignment. I have been thinking about and trying to solve alignment for years, I see serious flaws in every approach, and I have recently started to wonder if alignment is just uploading with more steps anyway. So, this is more like my most promising suggestion for alignment, rather than giving up on solving alignment.
there is no impressive reasoning trace I can give for a statement like "humans don't natively generalize their values out of distribution" because there isn't really any impressive reasoning necessary
there's the "how I got here" reasoning trace (which might be "I found it obvious") and if you're a good predictor you'll often have very hard to explain highly accurate "how I got here"s
and then there's the logic chain, local validity, how close can you get to forcing any coherent thinker to agree even if they don't have your pretraining or your recent-years latent thoughts context window
often when I criticize you I think you're obviously correct but haven't forced me to believe the same thing by showing the conclusion is logically inescapable (and I want you to explain it better so I learn more of how you come to your opinions, and so that others can work with the internals of your ideas, usually more #2 than #1)
sometimes I think you're obviously incorrect and I'm going to respond as though you were in the previous state, because you're in the percentage of the time where you're inaccurate, and as such your reasoning has failed you and I'm trying to appeal to a higher precision of reasoning to get you to check
sometimes I'm wrong about whether you're wrong and in those cases in order to convince me you need to be more precise, constructing your claim out of parts where each individual reasoning step is made of easier-to-force parts, closer to proof
keeping in mind proof might be scientific rather than logical, but is still a far higher standard of rigor than "I have a hypothesis which seems obviously true and is totally gonna be easy to test and show because duh and anyone who doesn't believe me obviously has no research taste" even when that sentence is said by someone with very good research taste
on the object level: whether humans generalize their values depends heavily on what you mean by "generalize". in the sense I care about, humans are the only valid source of generalization of their values, but humans taken in isolation are insufficient to specify how their values should generalize. the core of the problem is figuring out which of the ways to run humans forward is the one that is most naturally the way to generalize humans. I think it needs to involve, among other things, reliably running a particular human at a particular time forward, rather than a mixture of humans. possibly we can nail down how to identify a particular human at a particular time with compmech (this is a hypothesis I have from some light but non-thorough and not-enough-to-have-solved-it engagement with the math; maybe someone who does it full time will think I'm obviously incorrect).
Lot of 'welcome to my world' vibes reading your self reports here, especially the '50 different people have 75 different objections for a mix of good, bad and deeply stupid reasons, and require 100 different responses, some of which are very long, and it takes a back-and-forth to figure out which one, and you can't possibly just list everything' and so on, and that's without getting into actually interesting branches and the places where you might be wrong or learn something, etc.
So to take your example, which seems like a good one:
Humans don't generalize their values out of distribution. I affirm this not as strictly fully true, but on the level of 'this is far closer to true and generative of a superior world model then its negation' and 'if you meditate on this sentence you may become [more] enlightened.'
I too have noticed that people seem to think that they do so generalize in ways they very much don't, and this leads to a lot of rather false conclusions.
I also notice that I'm not convinced we are thinking about the sentence that similarly in ways that could end up being pretty load bearing. Stuff gets complicated.
I think that when you say the statement is 'trivially' true you are wrong about that, or at least holding people to unrealistic standards of epistemics? And that a version of this mistake is part of the problem. At least from me (I presume from others too) you get a very different reaction from saying each of:
Humans don't generalize their values out of distribution. (let this be [X]).
Statement treating [X] as in-context common knowledge.
It is trivially true that [X] (said explicitly), or 'obviously' [X], or similar.
I believe that [X] or am very confident that [X]. (without explaining why you believe this)
I believe that [X] or am very confident that [X], but it is difficult for me to explain/justify.
And so on. I am very deliberate, or try to be, on which one I say in any given spot, even at the cost of a bunch of additional words.
Another note is I think in spots like this you basically do have to say this even if the subject already knows, to establish common knowledge and that you are basing your argument on this, even if only to orient them that this is where you are reasoning from. So it was a helpful statement to say and a good use of a sentence.
I see that you get disagreement votes when you say this on LW, but the comments don't end up with negative karma or anything. I can see how that can be read as 'punishment' but I think that's the system working as intended and I don't know what a better one would be?
In general, I think if you have a bunch of load-bearing statements where you are very confident they are true but people typically think the statement is false and you can't make an explicit case for them (either because you don't have that kind of time/space, or because you don't know how), then the most helpful thing to do is to tell the other person the thing is load bearing, and gesture towards it and why you believe it, but be clear you can't justify it. You can also look for arguments that reach the same conclusion without it - often true things are highly overdetermined so you can get a bunch of your evidence 'thrown out of court' and still be fine, even if that sucks.
Do you think maybe rationalists are spending too much effort attempting to saturate the dialogue tree (probably not effective at winning people over) versus improving the presentation of the core argument for an AI moratorium?
Smart people don't want to see the 1000th response on whether AI actually could kill everyone. At this point we're convinced. Admittedly, not literally all of us, but those of us who are not yet convinced are not going to become suddenly enlightened by Yudkowsky's x.com response to some particularly moronic variation of an objection he already responded to 20 years ago. (Why does he do this? Does he think it has any kind of positive impact?)
A much better use of time would be to work on an article which presents the solid version of the argument for an AI moratorium. I.e., not an introductory text or article in Time Magazine, and not an article targeted at people he clearly thinks are just extremely stupid relative to him, where he rants for 10,000 words trying to drive home a relatively simple point. But rather an argument in a format that doesn't necessitate a weak or incomplete presentation.
I and many other smart people want to see the solid version of the argument, without the gaping holes which are excusable in popular work and rants but inexcusable in rational discourse. This page does not exist! You want a moratorium, tell us exactly why we should agree! Having a solid argument is what ultimately matters in intellectual progress. Everything else is window dressing. If you have a solid argument, great! Please show it to me.
My guess is that on the margin more time should be spent improving the core messaging versus saturating the dialogue tree, on many AI questions, if you combine effort across everyone.
We cannot offer anything to the ASI, so it will have no reasons to keep us around aside from ethical ones.
Nor can we ensure that an ASI that decided to commit genocide would fail to do it.
We don't know a way to create the ASI and infuse an ethics into it. SOTA alignment methods have major problems, which are best illustrated by sycophancy and LLMs supporting clearly delirious users.[1] OpenAI's Model Spec explicitly prohibited[2] sycophancy, and one of Claude's Commandments is "Choose the response that is least intended to build a relationship with the user." And yet it didn't prevent LLMs from becoming sycophantic. Apparently, the only known non-sycophantic model is KimiK2.
KimiK2 is a Chinese model created by a new team. And the team is the only one who guessed that one should rely on RLVR and self-critique instead of bias-inducing RLHF. We can't exclude the possibility that Kimi's success is more due to luck than to actual thinking about sycophancy and RLHF.
Strictly speaking, Claude Sonnet 4, which was red-teamed in Tim Hua's experiment, is second best at pushing back, after KimiK2. Tim remarks that Claude sucks at the Spiral Bench because the personas in Tim's experiment, unlike those in the Spiral Bench, are supposed to be under stress.
Strictly speaking, it is done as a User-level instruction, which arguably means that it can be overridden at the user's request. But GPT-4o was overly sycophantic without users instructing it to do so.
On Janus comparisons: I do model you as pretty distinct from them in underlying beliefs although I don't pretend to have a great model of either belief set. Reaction expectations are similarly correlated but distinct. I imagine they'd say that they answer good faith questions too, and often that's true (e.g. when I do ask Janus a question I have a ~100% helpful answer rate, but that's with me having a v high bar for asking).
She asks why the book doesn’t spend more time explaining why an intelligence explosion is likely to occur. The answer is that the book is explicitly arguing a conditional, about what happens if it does occur, and acknowledges that it may or may not occur, or occur on any given time frame.
Is it your claim here that the book is arguing the conditional: "If there's an intelligence explosion, then everyone dies?" If so, then it seems completely valid to counterargue: "Well, an intelligence explosion is unlikely to occur, so who cares?"
Zvi's summary here isn't quite right: the book is arguing the conditional "if we end up with overwhelming superintelligence (via explosion or otherwise) then everyone dies."
And yep, it is a fine counterresponse to say "but there will not be overwhelming superintelligence." (I mean, seems false and I have no idea why you believe that, but, yep, if that were true, then people would not (necessarily) die for the reasons the book warns about)
His actual top objection is that even if we do manage to get a controlled and compliant ASI, that is still extremely destabilizing at best and fatal at worst.
Michael Nielsen brings forth a very valid concern, which should have made a lot of Alignment researchers update their beliefs already.
We currently don't know what a benevolent OR compliant ASI would look like, or how it may end up affecting humanity (and our future agency). Worse, I doubt we can distinguish success from failure.
David Karsten suggests you read the book, while noting he is biased. He reminds us that, as with any other book, most conversations you have about the book will be with people who did not read it.
David Manheim: “An invaluable primer on one side of the critical and underappreciated current debate between those dedicated to building this new technology, and those who have long warned against it. If you haven’t already bought the book, you really should.”
Steven Byrnes reviews the book, says he agrees with ~90% of it, disagrees with some of the reasoning steps, but emphasizes that the conclusions are overdetermined.
Michael Nielsen gives the book five stars on Goodreads and recommends it, as it offers a large and important set of largely correct arguments, with the caveat that he sees significant holes in the central argument. His actual top objection is that even if we do manage to get a controlled and compliant ASI, that is still extremely destabilizing at best and fatal at worst. I agree with him this is not a minor quibble, and I worry about that scenario a lot whereas the book’s authors seem to consider it a happy problem they’d love to have. If anything it makes the book’s objections to ‘building it’ even stronger. He notes he does not have a good solution.
Nomads Vagabonds (in effect) endorses the first two parts of the book, but strongly opposes the policy asks in part 3 as unreasonable, demanding that if you are alarmed you need to make compromises to ‘help you win,’ that if you think the apocalypse is coming you only propose things inside the Overton Window, and that ‘even a 1% reduction in risk is worth pursuing.’
That only makes sense if solutions inside the Overton Window substantially improve your chances or outlook, and thus the opportunity outweighs the opportunity cost. The reason (as Nomads quotes) Ezra Klein says Democrats should compromise on various issues if facing what they see as a crisis is that this greatly raises the value of ‘win the election with a compromise candidate’ relative to losing. There’s tons of value to protect even if you give up ground on some issues.
Whereas the perspective of Yudkowsky and Soares is that measures ‘within the Overton Window’ matter very little in terms of prospective outcomes. So you’d much rather take a chance on convincing people to do what would actually work. It’s a math problem, including the possibility that busting the Overton Window helps the world achieve other actions you aren’t yourself advocating for.
The central metaphor Nomads uses is protecting against dragons, where the watchman insists upon only very strong measures and rejects cheaper ones entirely. Well, that depends on whether the cheaper solutions actually protect you from dragons.
If you believe – as I do! – that lesser changes can make a bigger difference, then you should say so, and try to achieve those lesser changes while also noting you would support the larger ask. If you believe – as the authors do – that the lesser changes can’t make much difference, then you say that instead.
I would also note that yes, there are indeed many central historical examples of asking for things outside the Overton Window being the ultimately successful approach to creating massive social change. This ranges from the very good (such as abolishing slavery) to the very bad (such as imposing Communism or Fascism).
Guarded Positive Reactions
People disagree a lot on whether the book is well or poorly written, in various ways, at various different points in the book. Kelsey Piper thinks the writing in the first half is weak and the second half is stronger, whereas Timothy Lee (a skeptic of the book’s central arguments) thought the book was surprisingly well-written and the first few chapters were his favorites.
Peter Wildeford goes over the book’s central claims, finding them overconfident and warning about the downsides and costs of shutting AI development down. He’s a lot ‘less doomy’ than the book on many fronts. I’d consider this his central takeaway:
Peter Wildeford: Personally, I’m very optimistic about AGI/ASI and the future we can create with it, but if you’re not at least a little ‘doomer’ about this, you’re not getting it. You need profound optimism to build a future, but also a healthy dose of paranoia to make sure we survive it. I’m worried we haven’t built enough of this paranoia yet, and while Yudkowsky’s and Soares’s book is very depressing, I find it to be a much-needed missing piece.
Buck is a strong fan of the first two sections and liked the book far more than he expected, calling them the best available explanation of the basic AI misalignment risk case for a general audience. He caveats that the book does not address counterarguments Buck thinks are critical, and that he would tell people to skip section 3. Buck’s caveats come down to expecting important changes between the world today and the world where ASI is developed, changes that potentially affect whether everyone would probably die. I don’t see him say what he expects these differences to be? And the book does hold out the potential for us to change course, and indeed find such changes; it simply says this won’t happen soon or by default, which seems probably correct to me.
Nostream Argues For Lower Confidence
Nostream addresses the book’s arguments, agreeing that existential risk is present but making several arguments that the probability is much lower (they estimate 10%-20%), with the claim that if you buy any of the ‘things are different than the book says’ arguments, that would be sufficient to lower your risk estimate. This feels like a form of the conjunction fallacy fallacy, where one assumes there is a particular dangerous chain, and thus that breaking the chain at any point removes most of the danger and returns us to things probably being fine.
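A rough arithmetic sketch of that point (the numbers are mine and purely illustrative, not anyone's actual estimates): if the danger ran through one specific conjunctive chain, weakening any single link would cut the total a lot, but if it runs through several partly independent routes, blocking one route leaves most of the risk intact.

```python
# Illustrative arithmetic only; the probabilities are made up for the example.

# Model A: one conjunctive chain of four steps, each 80% likely to go badly.
chain = 0.8 ** 4
print(f"one chain, intact:        {chain:.2f}")           # ~0.41
print(f"one chain, one link cut:  {0.8 ** 3 * 0.2:.2f}")   # ~0.10

# Model B: five partly independent routes to the bad outcome, each 20% likely.
routes = [0.2] * 5
p_safe = 1.0
for p in routes:
    p_safe *= 1 - p
print(f"many routes, intact:      {1 - p_safe:.2f}")       # ~0.67

# Fully blocking one route barely moves the total.
p_safe_after = 1.0
for p in [0.0] + routes[1:]:
    p_safe_after *= 1 - p
print(f"many routes, one blocked: {1 - p_safe_after:.2f}")  # ~0.59
```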
The post focuses on the threat model of ‘an AI turns on humanity per se,’ treating that as load bearing, which it isn’t, and treats the ‘alignment’ of current models as meaningful in ways I think are clearly wrong, and in general tries to draw too many conclusions from the nature of current LLMs. I consider all the arguments here, in the forms expressed, not new and well-answered by the book’s underlying arguments, although not always in a form as easy to pick up on as one might hope in the book text alone (the book is non-technical and length-limited, and people understandably have cached thoughts and anchor on examples).
So overall I wish the post was better and made stronger arguments, including stronger forms of its own arguments, but this is the right approach to take: laying out specific arguments and objections, including laying out up front that their own view includes unacceptably high levels of existential risk and also dystopia risk or catastrophic risk short of that. If I believed what Nostream believed I’d be in favor of not building the damn thing for a while, if there was a way to not build it for a while.
Gary Marcus Reviews The Book
Gary Marcus offered his (gated) review, which he kindly pointed out I had missed in Wednesday’s roundup. He calls the book deeply flawed, but with much that is instructive and worth heeding.
Despite the review’s flaws and its incorporation of several common misunderstandings, this was quite a good review, because Gary Marcus focuses on what is important, and his reaction to the book is similar to my reaction to his review. He notices that his disagreements with the book, while important and frustrating, should not be allowed to interfere with the central premise, which is the important thing to consider.
He starts off with this list of key points where he agrees with the authors:
Rogue AI is a possibility that we should not ignore. We don’t know for sure what future AI will do and we cannot rule out the possibility that it will go rogue.
We currently have no solution to the “alignment problem” of making sure that machines behave in human-compatible ways.
Figuring out solutions to the alignment problem is really, really important.
Figuring out a solution to the alignment problem is really, really hard.
Superintelligence might come relatively soon, and it could be dangerous.
Superintelligence could be more consequential than any other trend.
Governments should be more concerned.
The short-term benefits of AI (eg in terms of economics and productivity) may not be worth the long-term risks.
Noteworthy, none of this means that the title of the book — If Anyone Builds It, Everyone Dies — literally means everyone dies. Things are worrying, but not nearly as worrying as the authors would have you believe.
It is, however, important to understand that Yudkowsky and Soares are not kidding about the title.
Gary Marcus does a good job here of laying out the part of the argument that should be uncontroversial. I think all eight points are clearly true as stated.
Gary Marcus: Specifically, the central argument of the book is as follows:
Premise 1. Superintelligence will inevitably come — and when it does, it will inevitably be smarter than us.
Premise 2. Any AI that is smarter than any human will inevitably seek to eliminate all humans.
Premise 3. There is nothing that humans can do to stop this threat, aside from not building superintelligence in the first place.
The good news [is that the second and third] premisses are not nearly as firm as the authors would have it.
I’d quibble here, most importantly on the second premise, although it’s good to be precise everywhere. My version of the book’s argument would be:
Superintelligence is a real possibility that could happen soon.
If built with anything like current techniques, any sufficiently superintelligent AI will effectively maximize some goal and seek to rearrange the atoms to that end, in ways unlikely to result in the continued existence of humans.
Once such a superintelligent AI is built, if we messed up, it will be too late to stop this from happening. So for now we have to not build such an AI.
He mostly agrees with the first premise and has a conservative but reasonable estimation of how soon it might arrive.
For premise two, the quibble matters because Gary’s argument focuses on whether the AI will have malice, pointing out existing AIs have not shown malice. Whereas the book does not see this as requiring the AI to have malice, nor does it expect malice. It expects merely indifference, or simply a more important priority, the same way humans destroy many things we are indifferent to, or that we actively like but are in the way. For any given specified optimization target, given sufficient optimization power, the chance the best solution involves humans is very low. This has not been an issue with past AIs because they lacked sufficient optimization power, so such solutions were not viable.
This is a very common misinterpretation, both of the authors and of the underlying problems. The classic form of the explanation is ‘The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.’
His objection to the third premise is that in a conflict, the ASI’s victory over humans seems possible but not inevitable. He brings a standard form of this objection, including quoting Moltke’s famous ‘no plan survives contact with the enemy’ and modeling this as the AI getting to make some sort of attack, which might not work, after which we can fight back. I’ve covered this style of objection many times, comprising a substantial percentage of words written in weekly updates, and am very confident it is wrong. The book also addresses it at length.
I will simply note that in context, even if on this point you disagree with me and the book, and agree with Gary Marcus, it does not change the conclusion, so long as you agree (as Marcus does) that such an ASI has a large chance of winning. That seems sufficient to say ‘well then we definitely should not build it.’ Similarly, while I disagree with his assumption we would unite and suddenly act ruthlessly once we knew the ASI was against us, I don’t see the need to argue that point.
Marcus does not care for the story in Part 2, or the way the authors use colloquial language to describe the AI in that story.
I’d highlight another common misunderstanding, which is important enough that the response bears repeating:
Gary Marcus: A separate error is statistical. Over and again, the authors describe scenarios that multiply out improbabilities: AIs decide to build biological weapons; they blackmail everyone in their way; everyone accepts that blackmail; the AIs do all this somehow undetected by the authorities; and so on. But when you string together a bunch of improbabilities, you wind up with really long odds.
The authors are very much not doing this. This is not ‘steps in the chain’ where the only danger is that the AI succeeds at every step, whereas if it ever fails we are safe. They bend over backwards, in many places, to describe how the AI uses forking paths, gaming out and attempting different actions, having backup options, planning for many moves not succeeding and so on.
The scenario also does not presuppose that no one realizes or suspects, in broad terms, what is happening. They are careful to say this is one way things might play out, and any given particular story you tell must necessarily involve a lot of distinct things happening.
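A minimal arithmetic sketch of why this matters (my own illustrative numbers, not anything from the book or from Marcus): a rigid chain succeeds only if every step succeeds, so its probability is a product of success probabilities, whereas a planner that only needs one of several fallback paths to work fails only if every path fails, so its probability is the complement of a product of failure probabilities.

```python
# Illustrative sketch only; the probabilities below are made up for the example.

# A rigid chain: the plan fails if any single step fails.
step_success = [0.7, 0.8, 0.6, 0.9, 0.7]
chain_success = 1.0
for p in step_success:
    chain_success *= p
print(f"Rigid chain succeeds with probability {chain_success:.2f}")  # ~0.21

# Forking paths: the planner only needs one of several fallback routes to work.
path_success = [0.5, 0.4, 0.3]
all_paths_fail = 1.0
for p in path_success:
    all_paths_fail *= (1 - p)
print(f"At least one fallback works with probability {1 - all_paths_fail:.2f}")  # ~0.79
```

The point is not the specific numbers; it is that ‘string together a bunch of improbabilities’ only describes the first structure, and the book’s scenario is written as the second.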
If you think that at the first sign of trouble all the server farms get turned off? I do not believe you are paying enough attention to the world in 2025. Sorry, no, not even if there wasn’t an active AI effort to prevent this, or an AI predicting what actions would cause what reactions, and choosing the path that is most likely to work, which is one of the many benefits of superintelligence.
Several people have pointed out that the actual absurdity is that the AI in the story has to work hard and use the virus to justify it taking over. In the real world, we will probably put it in charge ourselves, without the AI even having to push us to do so. But in order to be more convincing, the story makes the AI’s life harder here and in many other places than it actually would be.
Gary Marcus closes with discussion of alternative approaches to building AIs that he wished the book explored more, including alternatives to LLMs. I’d say that this was beyond the scope of the book, and that ‘stop everyone from racing ahead with LLMs’ would likely be a necessary first step before we can pursue such avenues in earnest. Thus I won’t go into details here, other than to note that axioms such as ‘avoid causing harm to humans’ are known not to work, which was consistently the entire point of Asimov’s robot novels and other stories, where he explores some but not all of the reasons why this is the case.
Again, I very much appreciated this review, which focused on what actually matters and clearly states what Marcus actually believes. More like this, please.
John Pressman Agrees With Most Claims But Pushes Back On Big Picture
John Pressman agrees with most of the book’s individual statements and choices on how to present arguments, but disagrees with the thesis and many editorial and rhetorical choices. Ultimately his verdict is that the book is ‘okay.’ He states up top as ‘obvious’ many ‘huge if true’ statements about the world that I do not think are correct and that definitely are not obvious. There’s also a lot of personal animus going on here, and there has been for years. I see two central objections to the book from Pressman:
If you use parables then You Are Not Serious People, as in ‘Bluntly: A real urgent threat that demands attention does not begin with Once Upon a Time.’
Which he agrees is a style issue.
There are trade-offs here. It is hard for someone like Pressman to appreciate the challenges of engaging with average people who know nothing about these issues, or of fending off the various stupid objections those people have (objections John mostly agrees are deeply stupid).
He thinks the alignment problems involved are much easier than Yudkowsky and Soares believe, including that we have ‘solved the human values loading problem,’ although he agrees that we are still very much on track to fail bigly.
I think (both here and elsewhere where he goes into more detail) he both greatly overstates his case and also deliberately presents the case in a hostile and esoteric manner. That makes engagement unnecessarily difficult.
I also think there are important real things here and in related clusters of arguments, and my point estimate of difficulty is lower than the book’s, largely for reasons related to what Pressman is attempting to say.
As Pressman points out, even if he’s fully right, it looks pretty grim anyway.
Meta Level Reactions Pushing Back On Pushback
Aella (to be fair a biased source here) offers a kind of meta-review.
Aella: man writes book to warn the public about asteroid heading towards earth. his fellow scientists publish thinkpieces with stuff like ‘well his book wasn’t very good. I’m not arguing about the trajectory but he never addresses my objection that the asteroid is actually 20% slower’
I think I’d feel a lot better if more reviews started with “first off, I am also very worried about the asteroid and am glad someone is trying to get this to the public. I recognize I, a niche enthusiast, am not the target of this book, which is the general public and policy makers. Regardless of how well I think the book accomplishes its mission, it’s important we all band together and take this very seriously. That being said, here’s my thoughts/criticisms/etc.”
There is even an advertising campaign whose slogan is ‘we wish we were exaggerating,’ and I verify they do wish this.
It is highly reasonable to describe the claims as overstated or false; indeed, in some places I agree with you. But as Kelsey says, focus on: What is true?
Claims the authors are overconfident, including claims this is bad for credibility.
I agree that the authors are indeed overconfident. I hope I’ve made that clear.
However I think these are reasonable mistakes in context, that the epistemic standards here are much higher than those of most critics, and also that people should say what they believe, and that the book is careful not to rely upon this overconfidence in its arguments.
Fears about credibility or respectability, I believe, motivate many attacks on the book. I would urge everyone attacking the book for this reason (beyond noting the specific worry) to stop and take a long, hard look in the mirror.
Complaints that it sucks that ‘extreme’ or eye-catching claims will be more visible, whereas claims one thinks are more true get less visibility and discussion.
Yeah, sorry, world works the way it does.
The eye-catching claims open up the discussion, where you can then make clear you disagree with the full claim but endorse a related other claim.
As Raymond says, this is a good thing to do: if you think the book is wrong about something, write and say why, and how you think about it.
He provides extensive responses to the arguments and complaints he has seen, using his own framings, in the extended thread.
Then he fleshes out his full thoughts in a longer LessWrong post, The Title Is Reasonable, that lays out these questions at length. It also contains some good comment discussion, including by Nate Soares.
I agree with him that it is a reasonable strategy to have some people make big asks outside the Overton Window because they believe those asks are necessary, and others make more modest asks because they feel those are valuable and achievable. I also agree with him that this is a classic and often successful strategy.
Yes, people will try to tar all opposition with the most extreme view and extreme ask. You see it all the time in anything political, and there are plenty of people doing this already. But it is not obvious this works, and in any case it is priced in. One can always find such a target, or simply lie about one or make one up. And indeed, this is the path a16z and other similar bad faith actors have chosen time and again.
James Miller: I’m an academic who has written a lot of journalistic articles. If Anyone Builds It has a writing style designed for the general public, not LessWrong, and that is a good thing and a source of complaints against the book.
David Manheim notes that the book is non-technical by design and does not attempt to bring readers up to speed on the last decade of literature. As David notes, the book reflects on the new information but is not in any position to cover or discuss all of that in detail.
I would turn this back at Clara here, and say she seems to have failed to update her priors on what the authors are saying. I found her response very disappointing.
I don’t agree with Rob Bensinger’s view that this was one of the lowest-quality reviews. It’s more that it was the most disappointing given the usual quality and epistemic standards of the source. The worst reviews were, as one would expect, from sources that are reliably terrible in similar spots, but that is easy to shrug off.
The book very explicitly does not depend on the foom, there only being one AI, or other elements she says it depends on. Indeed in the book’s example story the AI does not foom. It feels like arguing with a phantom.
He also points out that there being multiple similarly powerful AIs by default makes the situation more dangerous rather than less dangerous, including in AI 2027, because it reduces margin for error and ability to invest in safety. If you’ve done a tabletop exercise based on AI 2027, this becomes even clearer.
Her first charge is that a lot of AI safety people disagree with the MIRI perspective on the problem, so the subtext of the book is that the authors must think all those other people are idiots. This seems like a mix of an argument for epistemic modesty and a claim that Yudkowsky and Soares haven’t considered these other opinions properly and are disrespecting those who hold them. I would push back strongly on all of that.
She raises the objection that the book has an extreme ask that ‘distracts’ from other asks, a term also used by MacAskill. Yes, the book asks for an extreme thing that I don’t endorse, but that’s what they think is necessary, so they should say so.
She asks why the book doesn’t spend more time explaining why an intelligence explosion is likely to occur. The answer is that the book is explicitly arguing a conditional, what happens if it does occur, and acknowledges that it may or may not occur, or occur on any given time frame. She also raises the ‘but progress has been continuous, which argues against an intelligence explosion’ argument, except that continuous progress does not argue against a future explosion. Extend curves.
She objects that the authors reach the same conclusions about LLMs that they previously reached about other AI systems. Yes, they do. But, she says, these things are different. Yes, they are, but not in ways that change the answer, and the book (and supplementary material, and their other writings) explain why they believe this. Reasonable people can disagree. I don’t think, from reading Clara’s objections along these lines, that she understands the central arguments being made by the authors (over the last 20 years) on these points.
She says if progress will be ‘slow and continuous’ we will have more than one shot on the goal. For sufficiently generous definitions of slow and continuous this might be true, but there has been a lot of confusion about ‘slow’ versus ‘fast.’ Current observed levels of ‘slow’ are still remarkably fast, and effectively mean we only get one shot, although we are getting tons of very salient and clear warning shots and signs before we take that one shot. Of course, we’re ignoring the signs.
She objects that ‘a future full of flourishing people is not the best, most efficient way to fulfill strange alien purposes’ is stated as a priori obvious. But it is very much a priori obvious; I will bite this bullet and die on this hill.
There are additional places one can quibble; another good response is the full post from Max Harms, which includes many additional substantive criticisms. One key additional clarification is that the book is not claiming that we will fail to learn a lot about AI from studying current AI, only that this ‘a lot’ will be nothing like sufficient. It can be difficult to reconcile ‘we will learn a lot of useful things’ with ‘we will predictably learn nothing like enough of the necessary things.’
Clara responded in the comments, and stands by the claim that the book’s claims require a major discontinuity in capabilities, and that gradualism would imply multiple meaningfully distinct shots on goal.
Jeffrey Ladish: I do think you’re nearby to a criticism that I would make about Eliezer’s views / potential failures to update: Which is the idea that the gap between a village idiot and Einstein is small and we’ll blow through it quite fast. I think this was an understandable view at the time and has turned out to be quite wrong. And an implication of this is that we might be able to use agents that are not yet strongly superhuman to help us with interpretability / alignment research / other useful stuff to help us survive.
Anyway, I appreciate you publishing your thoughts here, but I wanted to comment because I didn’t feel like you passed the authors ITT, and that surprised me.
Peter Wildeford: I agree that the focus on FOOM in this review felt like a large distraction and missed the point of the book.
The fact that we do get a meaningful amount of time with AIs one could think of as between village idiot and Einstein is indeed a major source of hope, although I understand why Soares and Yudkowsky do not see as much hope here, and that would be a good place to poke. The gap argument was still more correct than incorrect, in the sense that we will likely only get a period of years in that window rather than decades, and most people are making various forms of the opposite mistake, not understanding that above-Einstein levels of intelligence are Coming Soon. But years or even months can do a lot for you if you use them well.
Will MacAskill Offers Disappointing Arguments
Will MacAskill offers a negative review, criticizing the arguments and especially parallels to evolution as quite bad, although he praises the book for plainly saying what its authors actually believe, and for laying out ways in which the tech we saw was surprising, and for the quality of the analogies.
I found Will’s review quite disappointing but unsurprising. Here are my responses:
The ones about evolution parallels seem robustly answered by the book and also repeatedly elsewhere by many.
Will claims the book is relying on a discontinuity of capability. It isn’t. They are very clear that it isn’t. The example story containing a very mild form of [X] does not mean one’s argument relies on [X], although it seems impossible for there not to be some amount of jumps in capability, and we have indeed seen such jumps.
The ones about ‘types of misalignment’ seem at best deeply confused. I think he’s saying humans will stay in control because AIs will be risk averse, so they’ll be happy to settle for a salary rather than take over, and thus make this overwhelmingly pitiful deal with us? Whereas the fact that imperfect alignment is indeed catastrophic misalignment in the context of a superhuman AI is the entire thesis of the book, covered extensively, in ways Will doesn’t engage with here.
Intuition pump that might help: Are humans risk averse?
Criticism of the proposal as unlikely to happen. Well, not with that attitude. If one thinks that this is what it takes, one should say so. If not, not.
Criticism of the proposal as unnecessary or even unhelpful, and as ‘distracting’ from other things we could do. Standard arguments here. Not much to say, other than to note that the piecemeal ban proposals he suggests don’t technically work.
Criticism of the use of fiction and parables ‘as a distraction.’ Okie dokie.
Claim that the authors should have updated more in light of developments in ML. This is a reasonable argument one can make, but people (including Clara, and also Tyler below) who say that the book’s arguments are ‘outdated’ are simply incorrect. The arguments and authors do take such information into account, and have updated in some ways, but they do not believe the new developments change the central arguments or the likely ultimate path. And they explain why. You are encouraged to disagree with their reasoning if you find it wanting.
Zack Robinson Raises Alarm About Anthropic’s Long Term Benefit Trust
Zack Robinson of Anthropic’s Long Term Benefit Trust has a quite poor Twitter thread attacking the book, following the MacAskill-style principle of attacking those whose tactics and messages do not cooperate with the EA-brand-approved messaging designed to seek movement growth, respectability, power and donations, which in my culture we call a strategy of instrumental convergence.
The particular arguments in the thread are quite bad and mischaracterize the book. He says the book presents doom as a foregone conclusion, which is not how contingent predictions work. He uses pure modesty and respectability arguments for ‘accepting uncertainty’ in order to ‘leave room for AI’s transformative benefits,’ which is simply a non sequitur. He warns this ‘blinds us to other serious risks,’ the ‘your cause of us all not dying is a distraction from the real risks’ argument, which again simply is not true; there is no conflict here. His statement in the linked Tweet, that the policy debate is only a binary false choice, is itself clearly false, and he must know this.
The whole thing reads as workshopped corporate speak.
The responses to the thread are very much on point, and Rob Bensinger in particular pulls out a very important point of emphasis.
Like Buck and those who responded to express their disappointment, I find this thread unbecoming of someone on the LTBT, and as long as he remains there I can no longer consider him a meaningful voice for existential risk worries there, which substantially lowers my estimate of Anthropic’s likely behaviors.
I join several respondents in questioning whether Zack has read the book, a question he did not respond to. One can even ask if he has logically parsed its title.
Similarly, given he is the CEO of the Centre for Effective Altruism, I must adjust my estimates there as well, especially as this is part of a pattern of similar strategies.
I think this line is telling in multiple ways:
Zack Robinson: I grew up participating in debate, so I know the importance of confidence. But there are two types: epistemic (based on evidence) and social (based on delivery). Despite being expressed in self-assured language, the evidence for imminent existential risk is far from airtight.
Zack is debating in this thread, as in trying to make his position win and get ahead via his arguments, rather than trying to improve our epistemics.
Contrary to the literal text, Zack is looking at the social implications of appearing confident, and disapproving, rather than primarily challenging the epistemics. Indeed, the only arguments he makes against confidence here are from modesty:
Argument from consequences: Thinking confidently causes you to get dismissed (maybe) or to ignore other risks (no) and whatnot.
Argument from consensus: Others believe the risk is lower. Okie dokie. Noted.
As you would expect, many others not named in this post are trotting out all the usual Obvious Nonsense arguments about why superintelligence would definitely turn out fine for the humans, such as that ASIs would ‘freely choose to love humans.’ Do not be distracted by noise.