Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
Rohin Shah

I, along with several AI Impacts researchers, recently talked to Rohin Shah about why he is relatively optimistic about AI systems being developed safely. Rohin Shah is a 5th-year PhD student at the Center for Human-Compatible AI (CHAI) at Berkeley, and a prominent member of the Effective Altruism community.

Rohin reported an unusually large (90%) chance that AI systems will be safe without additional intervention. His optimism was largely based on his belief that AI development will be relatively gradual and AI researchers will correct safety issues that come up.

He reported two other beliefs that I found unusual: He thinks that as AI systems get more powerful, they will actually become more interpretable because they will use features that humans also tend to use. He also said that intuitions from AI/ML make him skeptical of claims that evolution baked a lot into the human brain, and he thinks there’s a ~50% chance that we will get AGI within two decades via a broad training process that mimics the way human babies learn.

A full transcript of our conversation, lightly edited for concision and clarity, can be found here.

By Asya Bergal


Can you please make the audio recording of this available? (Or let me know if you can send it to me directly.) I find it taxing to read interview transcripts but have gotten used to listening to interviews (on podcasts, for example).

We're not going to do this because we weren't planning on making these public when we conducted the conversations, so we want to give people a chance to make edits to transcripts before we send them out (which we can't do with audio).

That makes sense. Though if AI Impacts does more conversations like these in future, I’d be very interested in listening to them via a podcast app.

Neat idea!

If someone here thinks this is easy to do or that they can make it easy for us to do it, let me know.

I hope someone else answers your question properly, but here are two vaguely relevant things from Rob Wiblin.

Rohin reported an unusually large (90%) chance that AI systems will be safe without additional intervention.

This sentence makes two claims. Firstly that Rohin reports 90% credence in safe AI by default. Secondly that 90% is unusually large compared with the relevant reference class (which I interpret to be people working full-time on AI safety).

However, as far as I can tell, there's no evidence provided for the second claim. I find this particularly concerning because it's the sort of claim that seems likely to cause (and may already have caused) information cascades, along the lines of "all these high status people think AI x-risk is very likely, so I should too".

It may well be true that Rohin is an outlier in this regard. But it may also be false: a 10% chance of catastrophe is plenty high enough to motivate people to go into the field. Since I don't know of many public statements from safety researchers stating their credence in AI x-risk, I'm curious about whether you have strong private evidence.

Indeed, quite a lot of experts are more optimistic than it seems. See this or this. Well, I collected a lot of quotes from various experts about the risk of human extinction due to AI here. Maybe someone is interested.

I've started collecting estimates of existential/extinction/similar risk from various causes (e.g., AI risk, biorisk). Do you know of a quick way I could find estimates of that nature (quantified and about extreme risks) in your spreadsheet? It seems like an impressive piece of work, but my current best idea for finding this specific type of thing in it would be to search for "%", for which there were 384 results...

(I apologize in advance for my English.) Well, only the fifth column shows an expert's assessment of the impact of AI on humanity, so any other percentages can be quickly skipped. It took me a few seconds to examine 1/10 of the table with Ctrl+F, so fully going through the table this way would not take long. Unfortunately, I can't think of anything better.
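(For anyone who would rather script that check than use Ctrl+F: below is a minimal sketch, assuming the spreadsheet has been exported to CSV and that the AI-impact estimates really do sit in the fifth column as described above. The filename is a placeholder.)

```python
# Minimal sketch: print only the percentage estimates in the fifth column of a
# CSV export of the quotes spreadsheet. The filename and column position are
# assumptions; adjust them to the actual export.
import csv

with open("ai_risk_quotes.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f):
        if len(row) >= 5 and "%" in row[4]:  # fifth column only
            print(row[4])
```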

The claim is a personal impression that I have from conversations, largely with people concerned about AI risk in the Bay Area. (I also don't like information cascades, and may edit the post to reflect this qualification.) I'd be interested in data on this.

Wei Dai:

I collected some AI risk estimates in this EA forum comment and also made the complaint that it's hard to compare existing statements about AI risk because everyone seems to include different things in the "risk" they're estimating.

Here, the 90% figure originally came in response to the question "what do you think is the credence that by default things go well, without additional intervention by us doing safety research or something like that?" but then it gets reported here in the OP as "chance that AI systems will be safe without additional intervention" and elsewhere as "even without any additional intervention from current longtermists, advanced AI systems will not cause human extinction by adversarially optimizing against humans".

(I haven't read the whole transcript yet, so I'm not sure whether Rohin says or implies that 90% applies to "human extinction by adversarially optimizing against humans" somewhere, but I couldn't find it with some CTRL-F searching. Also if 90% does apply to all of these definitions of "risk" then it would imply ~0% chance that AI is unsafe or leads to a bad outcome in a way that doesn't involve human extinction which seems way too strong to me.)

I'm bringing this up again here in the hope that people will be more sensitive about different ways of describing or defining AI risk, and also maybe some organization will read this and decide to do some kind of well-considered survey to collect people's estimates in a way that's easy to compare.

There's a note early in the transcript that says that basically everything I say in the interview is about adversarial optimization against humans only, which includes the 90% figure.

I wouldn't take the number too seriously. If you asked me some other time of day, or in some other context, or with slightly different words, I might have said 80%, or 95%. I doubt I would have said 70%, or 99%.

Another reason to not take the numbers too seriously: arguably, by the numbers, I have a large disagreement with Will MacAskill, but having discussed it with him I think we basically agree on most things, except how to aggregate the considerations together. I expect that I agree with Will more than I agree with a random AI safety researcher who estimates the same 90% number that I estimated.

even without any additional intervention from current longtermists, advanced AI systems will not cause human extinction by adversarially optimizing against humans

This is the best operationalization of the ones you've listed.

Wei Dai:

There’s a note early in the transcript that says that basically everything I say in the interview is about adversarial optimization against humans only, which includes the 90% figure.

Can you quote that please? I can't find it even with this clue. "adversarial optimization against humans" is also ambiguous to me and I wonder if the original language was clearer or included a longer explanation. (ETA: E.g., is it meant to exclude someone deliberately using AI to cause human extinction? Or are you contrasting "adversarial optimization against humans" with something else, like AI causing dangerous technologies to be developed faster than ways of safeguarding them?)

even without any additional intervention from current longtermists, advanced AI systems will not cause human extinction by adversarially optimizing against humans

This is the best operationalization of the ones you’ve listed.

Ok, I'm curious how likely you think it is that an (existential-level) bad outcome happens due to AI by default, without involving human extinction. (As an example of what I mean, the default development of AI causes human values to be corrupted or just locked in or frozen, in a way that would cause enormous regret if we found out what our "actual" values are.)

ETA: Also, what was your motivation for talking about a fairly narrow kind of AI risk, when the interviewer started with a more general notion?

Can you quote that please?

It's here:

[Note: In this interview, Rohin was only considering risks arising because of AI systems that try to optimize for goals that are not our own, not other forms of existential risks from AI.]

Ok, I'm curious how likely you think it is that an (existential-level) bad outcome happens due to AI by default, without involving human extinction.

I mostly want to punt on this question, because I'm confused about what "actual" values are. I could imagine operationalizations where I'd say > 90% chance (e.g. if our "actual" values are the exact thing we would settle on after a specific kind of reflection that we may not know about right now), and others where I'd assign ~0% chance (e.g. the extremes of a moral anti-realist view).

I do lean closer to the stance of "whatever we decide based on some 'reasonable' reflection process is good", which seems to encompass a wide range of futures, and seems likely to me to happen by default.

ETA: Also, what was your motivation for talking about a fairly narrow kind of AI risk, when the interviewer started with a more general notion?

I mean, the actual causal answer is "that's what I immediately thought about", it wasn't a deliberate decision. But here are some rationalizations after the fact, most of which I'd expect are causal in that they informed the underlying heuristics that caused me to immediately think of the narrow kind of AI risk:

  • My model was that the interviewer(s) were talking about the narrow kind of AI risk, so it made sense to talk about that.
  • Initially, there wasn't any plan for this interview to be made public, so I was less careful about making myself broadly understandable, and instead tailored my words to my "audience" of 3 people.
  • I mostly think about and have expertise on the narrow kind (adversarial optimization against humans).
  • I expect that technical solutions are primarily important only for the narrow kind of AI risk (I'm more optimistic about social coordination for the general kind). So when I'm asked a question positing "without additional intervention by us doing safety research", I tend to think of adversarial optimization, since that's what I expect to be addressed by safety research.
Wei Dai:

I do lean closer to the stance of “whatever we decide based on some ‘reasonable’ reflection process is good”, which seems to encompass a wide range of futures, and seems likely to me to happen by default.

I think I disagree pretty strongly, and this is likely an important crux. Would you be willing to read a couple of articles that point to what I think is convincing contrary evidence? (As you read the first article, consider what would have happened if the people involved had access to AI-enabled commitment or mind-modification technologies.)

If these articles don't cause you to update, can you explain why? For example do you think it would be fairly easy to design reflection/deliberation processes that would avoid these pathologies? What about future ones we don't yet foresee?

... I'm not sure why I used the word "we" in the sentence you quoted. (Maybe I was thinking about a group of value-aligned agents? Maybe I was imagining that "reasonable reflection process" meant that we were in a post-scarcity world, everyone agreed that we should be doing reflection, everyone was already safe? Maybe I didn't want the definition to sound like I would only care about what I thought and not what everyone else thought? I'm not sure.)

In any case, I think you can change that sentence to "whatever I decide based on some 'reasonable' reflection process is good", and that's closer to what I meant.

I am much more uncertain about multiagent interactions. Like, suppose we give every person access to a somewhat superintelligent AI assistant that is legitimately trying to help them. Are things okay by default? I lean towards yes, but I'm uncertain. I did read through those two articles, and I broadly buy the theses they advance; I still lean towards yes because:

  • Things have broadly become better over time, despite the effects that the articles above highlight. The default prediction is that they continue to get better. (And I very uncertainly think people from the past would agree, given enough time to understand our world?)
  • In general, we learn reasonably well from experience; we try things and they go badly, but then things get better as we learn from that.
  • Humans tend to be quite risk-averse at trying things, and groups of humans seem to be even more risk-averse. As a result, it seems unlikely that we try a thing that ends up having a "direct" existentially bad effect.
  • You could worry about an "indirect" existentially bad effect, along the lines of Moloch, where there isn't any single human's optimization causing bad things to happen, but selection pressure causes problems. Selection pressure has existed for a long time and hasn't caused an existentially-bad outcome yet, so the default is that it won't in the future.
  • Perhaps AI accelerates the rate of progress in a way where we can't adapt fast enough, and this is why selection pressures can now cause an existentially bad effect. But this didn't happen with the Industrial Revolution. (That said, I do find this more plausible than the other scenarios.)

But in fact I usually don't aim to make claims about these sorts of scenarios; as I mentioned above I'm more optimistic about social solutions (that being the way we have solved this in the past).

Not on topic, but from the first article:

In reality, much of the success of a government is due to the role of the particular leaders, particular people, and particular places. If you have a mostly illiterate nation, divided 60%/40% into two tribes, then majoritarian democracy is a really, really bad idea. But if you have a homogeneous, educated, and savvy populace, with a network of private institutions, and a high-trust culture, then many forms of government will work quite well. Much of the purported success of democracy is really survivorship bias. Countries with the most human capital and strongest civic institutions can survive the chaos and demagoguery that comes with regular mass elections. Lesser countries succumb to chaos, and then dictatorship.

I don't see why having a well-educated populace would protect you against the nigh-inevitable value drift of even well-intentioned leaders when they ascend to power in a highly authoritarian regime.

I agree that if you just had one leader with absolute power then it probably won't work, and that kind of government probably isn't included in the author's "many forms of government will work quite well". I think what he probably has in mind are governments that look authoritarian from the outside but still have some kind of internal politics/checks-and-balances that can keep the top leader(s) from going off the rails. I wish I had a good gears-level model of how that kind of government/politics works though. I do suspect that "work quite well" might be fragile/temporary and dependent on the top leaders not trying very hard to take absolute power for themselves, but I'm very uncertain about this due to lack of knowledge and expertise.

Wei Dai:

[Note: In this interview, Rohin was only considering risks arising because of AI systems that try to optimize for goals that are not our own, not other forms of existential risks from AI.]

"goals that are not our own" is ambiguous to me. Does it include a goal that someone currently thinks they have or behaves as if they have, but isn't really part of their "actual" values? Does it include a goal that someone gets talked into by a superintelligent AI? Are you including risks that come from AI not being value-neutral, in other words, the AI being better at optimizing for some kinds of values over others, to the extent that the future is dominated by the the goals of a small group of humans?

Also, you've been using "adversarial optimization" a lot in this thread but a search on this site doesn't show you as having defined or used it before, except in https://www.lesswrong.com/posts/9mscdgJ7ao3vbbrjs/an-70-agents-that-help-humans-who-are-still-learning-about but that part wasn't even written by you so I'm not sure if you mean the same thing by it. If you have defined it somewhere, can you please link to it? (I suspect there may be some illusion of transparency going on where you think terms like "adversarial optimization" and "goals that are not our own" have clear and obvious meanings...)

I mostly want to punt on this question, because I’m confused about what “actual” values are. I could imagine operationalizations where I’d say > 90% chance (e.g. if our “actual” values are the exact thing we would settle on after a specific kind of reflection that we may not know about right now), and others where I’d assign ~0% chance (e.g. the extremes of a moral anti-realist view).

I think even with extreme moral anti-realism, there's still a significant risk that AIs could learn values that are wrong enough (i.e., different enough from our values, or are otherwise misaligned enough) to cause an existential-level bad outcome, but not human extinction. Can you confirm that you really endorse the ~0% figure?

I expect that technical solutions are primarily important only for the narrow kind of AI risk (I’m more optimistic about social coordination for the general kind). So when I’m asked a question positing “without additional intervention by us doing safety research”, I tend to think of adversarial optimization, since that’s what I expect to be addressed by safety research.

Can I convince you that you should be uncertain enough about this, and that enough other people disagree with you about this (in particular that social coordination may be hard enough that we should try to solve a wider kind of AI risk via technical means), that more careful language to distinguish between different kinds of risk and different kinds of research would be a good idea to facilitate thinking and discussion? (I take your point that you weren't expecting this interview to be made public, so I'm just trying to build a consensus about what should ideally happen in the future.)

"goals that are not our own" is ambiguous to me. Does it include a goal that someone currently thinks they have or behaves as if they have, but isn't really part of their "actual" values? Does it include a goal that someone gets talked into by a superintelligent AI?

"goals that are our own" is supposed to mean our "actual" values, which I don't know how to define, but shouldn't include a goal that you are "incorrectly" persuaded of by a superintelligent AI. The best operationalization I have is the values that I'd settle on after some "reasonable" reflection process. There are multiple "reasonable" reflection processes; the output of any of them is fine. But even this isn't exactly right, because there might be some values that I end up having in the world with AI, that I wouldn't have come across with any reasonable reflection process because I wouldn't have thought about the weird situations that occur once there is superintelligent AI, and I still want to say that those sorts of values could be fine.

Are you including risks that come from AI not being value-neutral, in other words, the AI being better at optimizing for some kinds of values over others, to the extent that the future is dominated by the goals of a small group of humans?

I was not including those risks (if you mean a setting where there are N groups of humans with different values, but AI can only help M < N of them, and so those M values dominate the future instead of all N).

I suspect there may be some illusion of transparency going on where you think terms like "adversarial optimization" and "goals that are not our own" have clear and obvious meanings...

I don't think "goals that are not our own" is philosophically obvious, but I think that it points to a fuzzy concept that cleaves reality at its joints, of which the central examples are quite clear. (The canonical example being the paperclip maximizer.) I agree that once you start really trying to identify the boundaries of the concept, things get very murky (e.g. what if an AI reports true information to you, causing you to adopt value X, and the AI is also aligned with value X? Note that since you can't understand all information, the AI has necessarily selected what information to show you. I'm sure there's a Stuart Armstrong post about this somewhere.)

By "adversarial optimization", I mean that the AI system is "trying to accomplish" some goal X, while humans instead "want" some goal Y, and this causes conflict between the AI system and humans.

(I could make it sound more technical by saying that the AI system is optimizing some utility function, while humans are optimizing some other utility function, which leads to conflict between the two because of convergent instrumental subgoals. I don't think this is more precise than the previous sentence.)

I think even with extreme moral anti-realism, there's still a significant risk that AIs could learn values that are wrong enough (i.e., different enough from our values, or are otherwise misaligned enough) to cause an existential-level bad outcome, but not human extinction. Can you confirm that you really endorse the ~0% figure?

Oh, whoops, I accidentally estimated the answer to "(existential-level) bad outcome happens due to AI by default, without involving adversarial optimization". I agree that you could get existential-level bad outcomes that aren't human extinction due to adversarial optimization. I'm not sure how likely I find that -- it seems like that depends on what the optimal policy for a superintelligent AI is, which, who knows if that involves literally killing all humans. (Obviously, to be consistent with earlier estimates, it must be <= 10%.)

Can I convince you that you should be uncertain enough about this, and that enough other people disagree with you about this (in particular that social coordination may be hard enough that we should try to solve a wider kind of AI risk via technical means), that more careful language to distinguish between different kinds of risk and different kinds of research would be a good idea to facilitate thinking and discussion? (I take your point that you weren't expecting this interview to be made public, so I'm just trying to build a consensus about what should ideally happen in the future.)

Yeah, I do try to do this already. The note I quoted above is one that I asked to be added post-conversation for basically this reason. (It's somewhat hard to do so though, my brain is pretty bad at keeping track of uncertainty that doesn't come from an underlying inside-view model.)


While I didn't make the claim, my general impression from conversations (i.e. not public statements) is that the claim is broadly true for AI safety researchers weighted by engagement with AI safety, in the Bay Area at least, and especially true when comparing to MIRI.

However, as far as I can tell, there's no evidence provided for the second claim.

I don't know how much this will mean, but the eighth chapter of Superintelligence is titled, "Is the default outcome doom?" which is highly suggestive of a higher than 10% chance of catastrophe. Of course, that was 2014, so the field has moved on...

I had a chat with Rohin about portions of this interview in an internal slack channel, which I'll post as replies to this comment (there isn't much shared state between different threads, I think).

DF

I think it would be… AGI would be a mesa optimizer or inner optimizer, whichever term you prefer. And that that inner optimizer will just sort of have a mishmash of all of these heuristics that point in a particular direction but can’t really be decomposed into ‘here are the objectives, and here is the intelligence’, in the same way that you can’t really decompose humans very well into ‘here are the objectives and here is the intelligence’.

... but it leads to not being as confident in the original arguments. It feels like this should be pushing in the direction of ‘it will be easier to correct or modify or change the AI system’. Many of the arguments for risk are ‘if you have a utility maximizer, it has all of these convergent instrumental sub-goals’ and, I don’t know, if I look at humans they kind of sort of pursued convergent instrumental sub-goals, but not really.

Huh, I see your point as cutting the opposite way. If you have a clean architectural separation between intelligence and goals, I can swap out the goals. But if you have a mish-mash, then for the same degree of vNM rationality (which maybe you think is unrealistic), it's harder to do anything like 'swap out the goals' or 'analyse the goals for trouble'.

in general, I think the original arguments are: (a) for a very wide range of objective functions, you can have agents that are very good at optimising them (b) convergent instrumental subgoals are scary

I think 'humans don't have scary convergent instrumental subgoals' is an argument against (b), but I don't think (a) or (b) rely on a clean architectural separation between intelligence and goals.

RS I agree both (a) and (b) don’t depend on an architectural separation. But you also need (c): agents that we build are optimizing some objective function, and I think my point cuts against that

DF somewhat. I think you have a remaining argument of 'if we want to do useful stuff, we will build things that optimise objective functions, since otherwise they randomly waste resources', but that's definitely got things to argue with.

(Looking back on this, I'm now confused about why Rohin doesn't think mesa-optimisers would end up being approximately optimal for some objective/utility function)

I predict that Rohin would say something like "the phrase 'approximately optimal for some objective/utility function' is basically meaningless in this context, because for any behaviour, there's some function which it's maximising".

You might then limit yourself to the set of functions that defines tasks that are interesting or relevant to humans. But then that includes a whole bunch of functions which define safe bounded behaviour as well as a whole bunch which define unsafe unbounded behaviour, and we're back to being very uncertain about which case we'll end up in.


That would probably be part of my response, but I think I'm also considering a different argument.

The thing that I was arguing against was "(c): agents that we build are optimizing some objective function". This is importantly different from "mesa-optimisers [would] end up being approximately optimal for some objective/utility function" when you consider distributional shift.

It seems plausible that the agent could look like it is "trying to achieve" some simple utility function, and perhaps it would even be approximately optimal for that simple utility function on the training distribution. (Simple here is standing in for "isn't one of the weird meaningless utility functions in Coherence arguments do not imply goal-directed behavior, and looks more like 'maximize happiness' or something like that".) But if you then take this agent and place it in a different distribution, it wouldn't do all the things that an EU maximizer with that utility function would do, it might only do some of the things, because it isn't internally structured as a search process for sequences of actions that lead to high utility behavior, it is instead structured as a bunch of heuristics that were selected for high utility on the training environment that may or may not work well in the new setting.

(In my head, the Partial Agency sequence is meandering towards this conclusion, though I don't think that's actually true.)

(I think people have overupdated on "what Rohin believes" from the coherence arguments post -- I do think that powerful AI systems will be agent-ish, and EU maximizer-ish, I just don't think that it is going to be a 100% EU maximizer that chooses actions by considering reasonable sequences of actions and doing the one with the best predicted consequences. With that post, I was primarily arguing against the position that EU maximization is required by math.)

DF

I don’t know that MIRI actually believes that what we need to do is write a bunch of proofs about our AI system, but it sure sounds like it, and that seems like a too difficult, and basically impossible task to me, if the proofs that we’re trying to write are about alignment or beneficialness or something like that.

FYI: My understanding of what MIRI (or at least Buck) thinks is that you don't need to prove your AI system is beneficial, but you should have a strong argument that stands up to strict scrutiny, and some of the sub-arguments will definitely have to be proofs.

RS Seems plausible, I think I feel similarly about that claim

DF

A straw version of this, which isn’t exactly what I mean but sort of is the right intuition, would be like maybe if you run the same… What’s the input that maximizes the output of this neuron? You’ll see that this particular neuron is a deception classifier. It looks at the input and then based on something, does some computation with the input, maybe the input’s like a dialogue between two people and then this neuron is telling you, “Hey, is person A trying to deceive person B right now?” That’s an example of the sort of thing I am imagining.

Huh - plausible that I'm misunderstanding you, but I imagine this being insufficient for safety monitoring because (a) many non-deceptive AIs are going to have the concept of deception anyway, because it's useful, (b) statically you can't tell whether or not the network is going to aim for deception just from knowing that it has a representation of deception, and (c) you don't have a hope of monitoring it online to check if the deception neuron is lighting up when it's talking to you.

FWIW I believe in the negation of some version of my point (b), where some static analysis reveals some evaluation and planning model, and you find out that in some situations the agent prefers itself being deceptive, where of course this static analysis is significantly more sophisticated than current techniques

RS Yeah, I agree with all of these critiques. I think I’m more pointing at the intuition at why we should expect this to be easier than we might initially think, rather than saying that specific idea is going to work.

E.g. maybe this is a reason that (relaxed) adversarial training actually works great, since the adversary can check whether the deception neuron is lighting up

DF Seems fair, and I think this kind of intuition is why I research what I do.
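(As an aside for readers: the "input that maximizes the output of this neuron" idea quoted above is essentially activation maximization. Below is a minimal sketch of that general technique, assuming a trained PyTorch model and a hypothetical layer and neuron index of interest; it is an illustration only, not an implementation anyone in the conversation proposed.)

```python
# Activation maximization sketch: gradient-ascend on the input to find
# something that strongly activates one neuron (e.g. a putative
# "deception classifier"). `net`, `layer`, and `neuron_idx` are assumed
# to come from a trained model you want to probe.
import torch

def maximize_neuron(net, layer, neuron_idx, input_shape, steps=200, lr=0.1):
    x = torch.randn(1, *input_shape, requires_grad=True)  # start from noise
    captured = {}

    def hook(_module, _inputs, output):
        captured["act"] = output  # stash the layer's output each forward pass

    handle = layer.register_forward_hook(hook)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        net(x)
        # Minimizing the negative activation = ascending the neuron's activation.
        loss = -captured["act"][0, neuron_idx].mean()
        loss.backward()
        opt.step()
    handle.remove()
    return x.detach()  # an input the chosen neuron responds strongly to
```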

DF From your AI impacts interview:

And then I claim that conditional on that scenario having happened, I am very surprised by the fact that we did not notice this deception in any earlier scenario that didn’t lead to extinction. And I don’t really get people’s intuitions for why that would be the case. I haven’t tried to figure that one out though.

I feel like I believe that people notice deception early on but are plausibly wrong about whether or not they've fixed it

RS After a few failures, you’d think we’d at least know to expect it?

DF Sure, but if your AI is also getting smarter, then that probably doesn't help you that much in detecting it, and only one person has to be wrong and deploy (if actually fixing takes a significantly longer time than sort of but not really fixing it) [this comment was written with less than usual carefulness]

RS Seems right, but in general human society / humans seem pretty good at being risk-averse (to the point that it seems to me that on anything that isn’t x-risk the utilitarian thing is to be more risk-seeking), and I’m hopeful that the same will be true here. (Also I’m assuming that it would take a bunch of compute, and it’s not that easy for a single person to deploy an AI, though even in that case I’d be optimistic, given that smallpox hasn’t been released yet.)

DF sorry by 'one person' I meant 'one person in charge of a big team'

RS The hope is that they are constrained by all the typical constraints on such people (shareholders, governments, laws, public opinion, the rest of the team, etc.) Also this significantly decreases the number of people who can do the thing, restricts it to people who are “broadly reasonable” (e.g. no terrorists), and allows us to convince each such person individually. Also I rarely think there is just one person — at the very least you need one person with a bunch of money and resources and another with the technical know-how, and it would be very difficult for these to be the same person

DF Sure. I guess even with those caveats my scenario doesn't seem that unlikely to me.

RS Sure, I don’t think this is enough to say “yup, this definitely won’t happen”. I think we do disagree on the relative likelihood of it happening, but maybe not by that much. (I’m hesitant to write a number because the scenario isn’t really fleshed out enough yet for us to agree on what we’re writing a number about.)

DF

And the concept of 3D space seems like it’s probably going to be useful for an AI system no matter how smart it gets. Currently, they might have a concept of 3D space, but it’s not obvious that they do. And I wouldn’t be surprised if they don’t.

Presumably at some point they start actually using the concept of 4D locally-Minkowski spacetime instead (or quantum loops or whatever)

and in general - if you have things roughly like human notions of agency or cause, but formalised differently and more correctly than we would, that makes them harder to analyse.

RS I suspect they don’t use 4D spacetime, because it’s not particularly useful for most tasks, and takes more computation.

But I agree with the broader point that abstractions can be formalized differently, and that there can be more alien abstractions. But I’d expect that this happens quite a bit later

DF I mean maybe once you've gotten rid of the pesky humans and need to start building dyson spheres... anyway I think curved 4d spacetime does require more computation than standard 3d modelling, but I don't think that using minkowski spacetime does.

RS Yeah, I think I’m often thinking of the case where AI is somewhat better than humans, rather than building Dyson spheres. Who knows what’s happening at Dyson sphere level. Probably should have said that in the conversation. (I think about it this way because it seems more important to align the first few AIs, and then have them help with aligning future ones.)

DF Sure. But even when you have AI that's worrying about signal transmission between different cities and the GPS system, SR is not that much more computationally intensive than Newtonian 3D space, and critical for accuracy.

Like I think the additional computational cost is in fact very low, but non-negative.

RS So like in practice if robots end up doing tasks like the ones we do, they develop intuitive physics models like ours, rather than Newtonian mechanics. SR might be only a bit more expensive than Newtonian, but I think most of the computational cost is in switching from heuristics / intuitive physics to a formal theory

(If they do different tasks than what we do, I expect them to develop their own internal physics which is pretty different from ours that they use for most tasks, but still not a formal theory)

DF Ooh, I wasn't accounting for that but it seems right.

I do think that plausibly in some situations 'intuitive physics' takes place in minkowski spacetime.

DF

I also don’t think there’s a discrete point at which you can say, “I’ve won the race.” I think it’s just like capabilities keep improving and you can have more capabilities than the other guy, but at no point can you say, “Now I have won the race.”

I think that (a) this isn't a disanalogy to nuclear arms races and (b) it's a sign of danger, since at no point do people feel free to slow down and test safety.

RS I’m confused by (a). Surely you “win” the nuclear arms race once you successfully make a nuke that can be dropped on another country?

(b) seems right, idr if I was arguing for safety or just arguing for disanalogies and wanting more research

DF re (a), if you have nukes that can be dropped on me, I can then make enough nukes to destroy all your nukes. So you make more nukes, so I make more nukes (because I'm worried about my nukes being destroyed) etc. This is historically how it played out, see mid-20th C discussion of the 'missile gap'.

re (b) fair enough

(it doesn't actually necessarily play out as clearly as I describe: maybe you get nuclear submarines, I get nuclear submarine detection skills...)

RS (a) Yes, after the first nukes are created, the remainder of the arms race is relatively similar. I was thinking of the race to create the first nuke. (Arguably the US should have used their advantage to prevent all further nukes.)

DF I guess it just seems more natural to me to think of one big long arms race, rather than a bunch of successive races - like, I think if you look at the actual history of nuclear armament, at no point before major powers have tons of nukes are they in a lull, not worrying about making more. But this might be an artefact of me mostly knowing about the US side, which I think was unusual in its nuke production and worrying.

RS Seems reasonable, I think which frame you take will depend on what you’re trying to argue, I don’t remember what I was trying to argue with that. My impression was that when people talk about the “nuclear arms race”, they were talking about the one leading to the creation of the bomb, but I’m not confident in that (and can’t think of any evidence for it right now)

DF

My impression was that when people talk about the “nuclear arms race”, they were talking about the one leading to the creation of the bomb

ah, I did not have that impression. Makes sense.

FWIW I think I've only ever heard "nuclear arms race" used to refer to the buildup of more and more weapons, more advancements, etc., not a race to create the first nuclear weapon. And the Wikipedia article by that name opens with:

The nuclear arms race was an arms race competition for supremacy in nuclear warfare between the United States, the Soviet Union, and their respective allies during the Cold War.

This page uses the phrase 'A "Race" for the bomb' (rather than "nuclear arms race") to describe the US and Nazi Germany's respective efforts to create the first nuclear weapon. My impression is that this "race" was a key motivation in beginning the Manhattan Project and in the early stages, but I'm not sure to what extent that "race" remained "live" and remained a key motivation for the US (as opposed the US just clearly being ahead, and now being motivated by having invested a lot and wanting a powerful weapon to win the war sooner). That page says "By 1944, however, the evidence was clear: the Germans had not come close to developing a bomb and had only advanced to preliminary research."

Yeah I think I was probably wrong about this (including what other people were talking about when they said "nuclear arms race").

Thanks for recording this conversation! Some thoughts:

AI development will be relatively gradual and AI researchers will correct safety issues that come up.

I was pretty surprised to read the above--most of my intuitions about AI come down to repeatedly hearing the point that safety issues are very unpredictable and high variance, and that once a major safety issue happens, it's already too late. The arguments I've seen for this (many years of Eliezer-ian explanations of how hard it is to come out on top against superintelligent agents who care about different things than you) also seem pretty straightforward. And Rohin Shah isn't a stranger to them. So what gives?

Well, look at the summary on top of the full transcript link. Here are some quotes reflecting the point that Rohin is making which is most interesting to me--

From the summary:

Shah doesn’t believe that any sufficiently powerful AI system will look like an expected utility maximizer.

and, in more detail, from the transcript:

Rohin Shah: ... I have an intuition that AI systems are not well-modeled as, “Here’s the objective function and here is the world model.” Most of the classic arguments are: Suppose you’ve got an incorrect objective function, and you’ve got this AI system with this really, really good intelligence, which maybe we’ll call it a world model or just general intelligence. And this intelligence can take in any utility function, and optimize it, and you plug in the incorrect utility function, and catastrophe happens.
This does not seem to be the way that current AI systems work. It is the case that you have a reward function, and then you sort of train a policy that optimizes that reward function, but… I explained this the wrong way around. But the policy that’s learned isn’t really… It’s not really performing an optimization that says, “What is going to get me the most reward? Let me do that thing.”

If I was very convinced of this perspective, I think I'd share Rohin's impression that AI Safety is attainable. This is because I also do not expect highly strategic and agential actions focused on a single long-term goal to be produced by something that "has been given a bunch of heuristics by gradient descent that tend to correlate well with getting high reward and then it just executes those heuristics." To elaborate on some of this with my own perspective:

  • If our superintelligent AI is just a bunch of well developed heuristics, it is unlikely that those heuristics will be generatively strategic enough to engage in super-long-term planning
  • If our superintelligent AI gets punished based on any proxy for "misleading humans" and it can't do super-long-term planning, it is unlikely to come up with a good reward-attaining strategy that involves misleading humans
  • If our superintelligent AI does somehow develop a heuristic that misleads humans, it is yet more unlikely that the heuristic will be immediately well-developed enough to mislead humans long enough to cause an extinction level event. Instead, it will probably mislead the humans for more short-term gains at first--which will allow us to identify safety measures in advance

So I agree that we have a good chance of ensuring that this kind of AI is safe--mainly because I don't think the level of heuristics involved invoke an AI take-off slow enough to clearly indicate safety risks before they become x-risks.

On the other hand, while I agree with Rohin and Hanson's side that there isn't One True Learning Algorithm, there are potentially a multitude of advanced heuristics that approximate extremely agent-y and strategic long-term optimizations. We even have a real-life, human-level example of this. His name is Eliezer Yudkowsky[1]. Moreover, if I got an extra fifty IQ points and a slightly different set of ethics, I wouldn't be surprised if the set of heuristics composing my brain could be an existential threat. I think Rohin would agree with this belief in heuristic kludges that are effectively agential despite not being a One True Algorithm and, alone, this belief doesn't imply existential risk. If these agenty heuristics manifest gradually over time, we can easily stop them just by noticing them and turning the AI off before they get refined into something truly dangerous.

However, I don't think that machine-learned heuristics are the only way we can get highly dangerous agenty heuristics. We've made a lot of mathematical progress on understanding logic, rationality and decision theory and, while machine-learned heuristics may figure out approximately Perfect Reasoning Capabilities just by training, I think it's possible that we can directly hardcode heuristics that do the same thing based on our current understanding of things we associate with Perfect Reasoning Capabilities.

In other words, I think that the dangerously agent-y heuristics which we can develop through gradual machine-learning processes could also be developed by a bunch of mathematicians teaming up and building a kludge that is similarly agent-y right out of the box. The former possibility is something we can mitigate gradually (for instance, by not continuing to build AI once they start doing things that look too agent-y) but the latter seems much more dangerous.

Of course, even if mathematicians could directly kludge some heuristics that can perform long-term strategic planning, implementing such a kludge seems obviously dangerous to me. It also seems rather unnecessary. If we could also just get superintelligent AI that doesn't do scary agent-y stuff by just developing it as a gradual extension of our current machine-learning technology, why would you want to do it the risky and unpredictable way? Maybe it'd be orders of magnitude faster but this doesn't seem worth the trade--especially when you could just directly improve AI-compute capabilities instead.

As of finishing this comment, I think I'm less worried about AI existential risks than I was before.

[1] While this sentence might seem glib, I phrased it the way I did because, while most people display agentic behaviors, most of us aren't that agentic in general. I do not know Eliezer personally, but the person who wrote a whole set of sequences on rationality, developed a new decision theory, and started up a new research institute focused on saving the world is the best example of an agenty person I can come up with off the top of my head.


I enjoyed this comment, thanks for thinking it through! Some comments:

If our superintelligent AI is just a bunch of well developed heuristics, it is unlikely that those heuristics will be generatively strategic enough to engage in super-long-term planning

This is not my belief. I think that powerful AI systems, even if they are a bunch of well developed heuristics, will be able to do super-long-term planning (in the same way that I'm capable of it, and I'm a bunch of heuristics, or Eliezer is to take your example).

Obviously this depends on how good the heuristics are, but I do think that heuristics will get to the point where they do super-long-term planning, and my belief that we'll be safe by default doesn't depend on assuming that AI won't do long-term planning.

I think Rohin would agree with this belief in heuristic kludges that are effectively agential despite not being a One True Algorithm

Yup, that's correct.

So I agree that we have a good chance of ensuring that this kind of AI is safe--mainly because I don't think the level of heuristics involved invoke an AI take-off slow enough to clearly indicate safety risks before they become x-risks.

Should "I don't think" be "I do think"? Otherwise I'm confused. With that correction, I basically agree.

However, I don't think that machine-learned heuristics are the only way we can get highly dangerous agenty heuristics. We've made a lot of mathematical progress on understanding logic, rationality and decision theory and, while machine-learned heuristics may figure out approximately Perfect Reasoning Capabilities just by training, I think it's possible that we can directly hardcode heuristics that do the same thing based on our current understanding of things we associate with Perfect Reasoning Capabilities.

I would be very surprised if this worked in the near term. Like, <1% in 5 years, <5% in 20 years, and really I want to say < 1% that this is the first way we get AGI (no matter when), but I can't actually be that confident.

My impression is that many researchers at MIRI would qualitatively agree with me on this, though probably with less confidence.


Thanks for replying!

This is not my belief. I think that powerful AI systems, even if they are a bunch of well developed heuristics, will be able to do super-long-term planning (in the same way that I'm capable of it, and I'm a bunch of heuristics, or Eliezer is to take your example).

Yeah, I intended that statement to be more of an elaboration on my own perspective than to imply that it represented your beliefs. I also agree that it's wrong in the context of the superintelligent AI we are discussing.

Should "I don't think" be "I do think"? Otherwise I'm confused.

Yep! Thanks for the correction.

I would be very surprised if this worked in the near term. Like, <1% in 5 years, <5% in 20 years and really I want to say < 1% that this is the first way we get AGI (no matter when)

Huh, okay... On reflection, I agree that directly hardcoded agent-y heuristics are unlikely to happen because AI-Compute tends to beat it. However, I continue to think that mathematicians may be able to use their knowledge of probability & logic to cause heuristics to develop in ways that are unusually agent-y at a fast enough rate to imply surprising x-risks.

This mainly boils down to my understanding that similarly well-performing but different heuristics for agential behavior may have very different potentials for generalizing to agential behavior on longer time-scales/chains-of-reasoning than the ones they were trained on. Consequently, I think there are particular ways of defining AI problem objectives and AI architecture that are uniquely suited to AI becoming generally agential over arbitrarily long time-frames and chains of reasoning.

However, I think we can address this kind of risk with the same safety solutions that could help us deal with AI that just have significantly better reasoning capabilities than us (but do not have reasoning capabilities that have fully generalized!). Paul Christiano's work on amplification, for instance.

So the above is only a concern if people a) deliberately try to get AI in the most reckless way possible and b) get lucky enough that it doesn't get bottle-necked somewhere else. I'll buy the low estimates you're providing.

Suppose [...] you’ve got this AI system with this really, really good intelligence, which maybe we’ll call it a world model or just general intelligence. And this intelligence can take in any utility function, and optimize it, and you plug in the incorrect utility function, and catastrophe happens.

I've seen various people make the argument that this is not how AI works and it's not how AGI will work--it's basically the old "tool AI" vs "agent AI" debate. But I think the only reason current AI doesn't do this is because we can't make it do this yet: the default customer requirement for a general intelligence is that it should be able to do whatever task the user asks it to do.

So far the ability of AI to understand a request is very limited (poor natural language skills). But once you have an agent that can understand what you're asking, of course you would design it to optimize new objectives on request, bounded of course by some built-in rules about not committing crimes or manipulating people or seizing control of the world (easy, I assume). Otherwise, you'd need to build a new system for every type of goal, and that's basically just narrow AI.

If our superintelligent AI is just a bunch of well developed heuristics, it is unlikely that those heuristics will be generatively strategic enough to engage in super-long-term planning

If the heuristics are optimized for "be able to satisfy requests from humans" and those requests sometimes require long-term planning, then the skill will develop. If it's only good at satisfying simple requests that don't require planning, in what sense is it superintelligent?

I am not arguing that we'll end up building tool AI; I do think it will be agent-like. At a high level, I'm arguing that the intelligence and agentiness will increase continuously over time, and as we notice the resulting (non-existential) problems we'll fix them, or start over.

I agree with your point that long-term planning will develop even with a bunch of heuristics.

If the heuristics are optimized for "be able to satisfy requests from humans" and those requests sometimes require long-term planning, then the skill will develop. If it's only good at satisfying simple requests that don't require planning, in what sense is it superintelligent?

Yeah, that statement is wrong. I was trying to make a more subtle point about how an AI that learns long-term planning on a shorter time-frame is not necessarily going to be able to generalize to longer time-frames (but in the context of superintelligent AIs capable of doing human-level tasks, I do think it will generalize--so that point is kind of irrelevant). I agree with Rohin's response.




He thinks that as AI systems get more powerful, they will actually become more interpretable because they will use features that humans also tend to use

I find this fairly persuasive, I think. One way of putting it is that in order for an agent to be recursively self-improving in any remotely intelligent way, it needs to be legible to itself. Even if we can't immediately understand its components in the same way that it does, it must necessarily provide us with descriptions of its own ways of understanding them, which we could then potentially co-opt. (relevant: https://www.lesswrong.com/posts/bNXdnRTpSXk9p4zmi/book-review-design-principles-of-biological-circuits )

This may be useful in the early phases, but I'm skeptical as to whether humans can import those new ways of understanding fast enough to be permitted to stand as an air-gap for very long. There is a reason, for instance, we don't have humans looking over and approving every credit card transaction. Taking humans out of the loop is the entire reason those systems are useful. The same dynamic will pop up with AGI.

This xkcd comic seems relevant https://xkcd.com/2044/ ("sandboxing cycle")

There is a tension between connectivity and safe isolation and navigating it is hard.

You could imagine a situation where for some reason the US and China are like, “Whoever gets to AGI first just wins the universe.” And I think in that scenario maybe I’m a bit worried, but even then, it seems like extinction is just worse, and as a result, you get significantly less risky behavior? But I don’t think you get to the point where people are just literally racing ahead with no thought to safety for the sake of winning.

My interpretation of what Rohin is saying there is:

  • 1) Extinction is an extremely bad outcome.
  • 2) It's much worse than 'losing' an international competition to 'win the universe'.
  • 3) Countries/institutions/people will therefore be significantly inclined to avoid risking extinction, even if doing so would increase the chances of 'winning' an international competition to 'win the universe'.

I agree with claim 1.

I agree with some form of claim 3, in that:

  • I think the badness of extinction will reduce the risks people are willing to take
  • I also "don’t think you get to the point where people are just literally racing ahead with no thought to safety for the sake of winning."
  • But I don't think the risks will be reduced anywhere near as much as they should be. (That said, I also believe that odds are in favour of things "going well by default", just not as much in favour of that as I'd like).

This is related to my sense that claim 2 is somewhat tricky/ambiguous. Are we talking about whether it is worse, or whether the relevant actors will perceive it as worse? One common argument for why existential risks are neglected is that it's basically a standard market failure. The vast majority of the harm from x-risks is externalities, and x-risk reduction is a global public good. Even if we consider deaths/suffering in the present generation, even China and India absorb less than half of that "cost", and most countries absorb less than 1% of it. And I believe most people focused on x-risk reduction are at least broadly longtermist, so they'd perceive the overwhelming majority of the costs to be to future generations, and thus also externalities.

So it seems like, unless we expect the relevant actors to act in accordance with something close to impartial altruism, we should expect them to avoid risks somewhat to avoid existential risks (or extinction specifically), but far less than they really should. (Roughly this argument is made in The Precipice, and I believe by 80k.)

(Rohin also discusses right after that quote why he doesn't "think that differences in who gets to AGI first are going to lead to you win the universe or not", which I do think somewhat bolsters the case for claim 2.)

So it seems like, unless we expect the relevant actors to act in accordance with something close to impartial altruism, we should expect them to reduce risky behaviour somewhat in order to avoid existential risks (or extinction specifically), but far less than they really should. (Roughly this argument is made in The Precipice, and I believe by 80k.)

I agree that actors will focus on x-risk far less than they "should" -- that's exactly why I work on AI alignment! This doesn't mean that x-risk is high in an absolute sense, just higher than it "should" be from an altruistic perspective. Presumably from an altruistic perspective x-risk should be very low (certainly below 1%), so my 10% estimate is orders of magnitude higher than what it "should" be.

Also, re: Precipice, it's worth noting that Toby and I don't disagree much -- I estimate 1 in 10 conditioned on no action from longtermists; he estimates 1 in 5 conditioned on AGI being developed this century. Let's say that action from longtermists can halve the risk; then my unconditional estimate would be 1 in 20, and would be very slightly higher if we condition on AGI being developed this century (because we'd have less time to prepare), so overall there's a 4x difference, which given the huge uncertainty is really not very much.
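(Spelling out that arithmetic, as a minimal sketch using the rough numbers above; the halving factor is just the stated assumption, not an extra estimate:)

```latex
\begin{align*}
P(\text{doom} \mid \text{no longtermist action}) &\approx 1/10 \\
P(\text{doom}) &\approx \tfrac{1}{2} \times \tfrac{1}{10} = 1/20
  \quad \text{(if longtermist action halves the risk)} \\
P_{\text{Ord}}(\text{doom} \mid \text{AGI this century}) &\approx 1/5 \\
\text{ratio} &\approx \frac{1/5}{1/20} = 4
\end{align*}
```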

Thanks for this reply!

Perhaps I should've been clearer that I didn't expect what I was saying to be anything you hadn't heard. (I mean, I think I watched an EAG video of you presenting on 80k's ideas, and you were in The Precipice's acknowledgements.)

I guess I was just suggesting that your comments there, taken by themselves/out of context, seemed to ignore those important arguments, and thus might seem overly optimistic. Which seemed mildly potentially important for someone to mention at some point, as I've seen this cited as an example of AI researcher optimism. (Though of course I acknowledge your comments were off the cuff and not initially intended for public consumption, and any such interview will likely contain moments that are imperfectly phrased or open to misinterpretation.)

Also, re: Precipice, it's worth noting that Toby and I don't disagree much -- I estimate 1 in 10 conditioned on no action from longtermists; he estimates 1 in 5 conditioned on AGI being developed this century. Let's say that action from longtermists can halve the risk; then my unconditional estimate would be 1 in 20[...] (emphasis added)

I find this quite interesting. Is this for existential risk from AI as a whole, or just "adversarial optimisation"/"misalignment" type scenarios? E.g., does it also include things like misuse and "structural risks" (e.g., AI increasing risks of nuclear war by forcing people to make decisions faster)?

I'm not saying it'd be surprisingly low if it does include those things. I'm just wondering, as estimates like this are few and far between, so now that I've stumbled upon one I want to understand its scope and add it to my outside view.

Also, I bolded conditioned and unconditional, because that seems to me to suggest that you also currently expect the level of longtermist intervention that would reduce the risk to 1 in 20 to happen. Like, for you, "there's no action from longtermists" would be a specific constraint you have to add to your world model? That also makes sense; I just feel like I've usually not seen things presented that way.

I imagine you could also condition on something like "surprisingly much action from longtermists", which would reduce your estimated risk further?

I guess I was just suggesting that your comments there, taken by themselves/out of context, seemed to ignore those important arguments, and thus might seem overly optimistic.

Sure, that seems reasonable.

Is this for existential risk from AI as a whole, or just "adversarial optimisation"/"misalignment" type scenarios?

Just adversarial optimization / misalignment. See the comment thread with Wei Dai below, especially this comment.

Like, for you, "there's no action from longtermists" would be a specific constraint you have to add to your world model?

Oh yeah, definitely. (Toby does the same in The Precipice; his position is that it's clearer not to condition on anything, because it's usually unclear what exactly you are conditioning on, though in person he did like the operationalization of "without action from longtermists".)

Like, my model of the world is that for any sufficiently important decision like the development of powerful AI systems, there are lots of humans bringing many perspectives to the table, which usually ends up with most considerations being brought up by someone, and an overall high level of risk aversion. On this model, longtermists are one of the many groups that argue for being more careful than we otherwise would be.

I imagine you could also condition on something like "surprisingly much action from longtermists", which would reduce your estimated risk further?

Yeah, presumably. The 1 in 20 number was very made up, even more so than the 1 in 10 number. I suppose if our actions were very successful, I could see us getting down to 1 in 1000? But if we just exerted a lot more effort (i.e. "surprisingly much action"), the extra effort probably doesn't help much more than the initial effort, so maybe... 1 in 25? 1 in 30?

(All of this is very anchored on the initial 1 in 10 number.)

Quite interesting. Thanks for that response.

And yes, this does seem quite consistent with Ord's framing. E.g., he writes "my estimates above incorporate the possibility that we get our act together and start taking these risks very seriously." So I guess I've seen it presented this way at least that once, but I'm not sure I've seen it made explicit like that very often (and doing so seems useful and retrospectively-obvious).

But if we just exerted a lot more effort (i.e. "surprisingly much action"), the extra effort probably doesn't help much more than the initial effort, so maybe... 1 in 25? 1 in 30?

Are you thinking roughly that (a) returns diminish steeply from the current point, or (b) that effort will likely ramp up a lot in future and pluck a large quantity of the low hanging fruit that currently remain, such that even more ramping up would face steeply diminishing returns?

That's a vague question, and may not be very useful. The motivation for it is that I was surprised you saw the gap between business as usual and "surprisingly much action" as being as small as you did, and wonder roughly what portion of that is about you thinking additional people working on this won't be very useful, vs thinking that plenty of very useful additional people will eventually jump aboard "by default".

Are you thinking roughly that (a) returns diminish steeply from the current point, or (b) that effort will likely ramp up a lot in future and pluck a large quantity of the low hanging fruit that currently remain, such that even more ramping up would face steeply diminishing returns?

More like (b) than (a). In particular, I'm thinking of lots of additional effort by longtermists, which probably doesn't result in lots of additional effort by everyone else, which already means that we're scaling sublinearly. In addition, you should then expect diminishing marginal returns to more research, which lessens it even more.

Also, I was thinking about this recently, and I am pretty pessimistic about worlds with discontinuous takeoff, which should maybe add another ~5 percentage points to my risk estimate conditional on no intervention by longtermists, and ~4 percentage points to my unconditional risk estimate.
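(For concreteness, combining those adjustments with the earlier approximate figures; this is just arithmetic on numbers already given:)

```latex
\begin{align*}
\text{conditional on no longtermist action:}\quad & 10\% + 5\ \text{pp} \approx 15\% \\
\text{unconditional:}\quad & 5\% + 4\ \text{pp} \approx 9\%
\end{align*}
```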

Interesting (again!).

So you've updated your unconditional estimate from ~5% (1 in 20) to ~9%? If so, people may have to stop citing you as an "optimist"... (which was already perhaps a tad misleading, given what the 1 in 20 was about)

(I mean, I know we're all sort-of just playing with incredibly uncertain numbers about fuzzy scenarios anyway, but still.)

If so, people may have to stop citing you as an "optimist"

I wouldn't be surprised if the median number from MIRI researchers were around 50%. I think the people who cite me as an optimist are people with those background beliefs. I think even at 5% I'd fall on the pessimistic side at FHI (though certainly not the most pessimistic; e.g., Toby is more pessimistic than I am).

These may be useful:

‘Actually, the people Tim is talking about here are often more pessimistic about societal outcomes than Tim is suggesting. Many of them are, roughly speaking, 65%-85% confident that machine superintelligence will lead to human extinction, and that it’s only in a small minority of possible worlds that humanity rises to the challenge and gets a machine superintelligence robustly aligned with humane values.’ — Luke Muehlhauser, https://lukemuehlhauser.com/a-reply-to-wait-but-why-on-machine-superintelligence/

‘In terms of falsifiability, if you have an AGI that passes the real no-holds-barred Turing Test over all human capabilities that can be tested in a one-hour conversation, and life as we know it is still continuing 2 years later, I’m pretty shocked. In fact, I’m pretty shocked if you get up to that point at all before the end of the world.’ — Eliezer Yudkowsky, https://www.econlib.org/archives/2016/03/so_far_my_respo.html

Interesting interview, thanks for sharing it!

Asya Bergal: It seems like people believe there’s going to be some kind of pressure for performance or competitiveness that pushes people to try to make more powerful AI in spite of safety failures. Does that seem untrue to you or like you’re unsure about it?
Rohin Shah: It seems somewhat untrue to me. I recently made a comment about this on the Alignment Forum. People make this analogy between AI x-risk and risk of nuclear war, on mutually assured destruction. That particular analogy seems off to me because with nuclear war, you need the threat of being able to hurt the other side whereas with AI x-risk, if the destruction happens, that affects you too. So there’s no mutually assured destruction type dynamic.

I find this statement very confusing. I wonder if I'm misinterpreting Rohin. Wikipedia says "Mutual(ly) assured destruction (MAD) is a doctrine of military strategy and national security policy in which a full-scale use of nuclear weapons by two or more opposing sides would cause the complete annihilation of both the attacker and the defender (see pre-emptive nuclear strike and second strike)."

A core part of the idea of MAD is that the destruction would be mutual. So "with AI x-risk, if the destruction happens, that affects you too" seems like a reason why MAD is a good analogy, and why the way we engaged in MAD might suggest people would engage in similar brinkmanship or risks with AI x-risk, even if the potential for harm to people's "own side" would be extreme. There are other reasons why the analogy is imperfect, but the particular feature Rohin mentions seems like a reason why an analogy could be drawn.

MAD-style strategies happen when:

1. There are two (or more) actors that are in competition with each other

2. There is a technology such that if one actor deploys it and the other actor doesn't, the first actor remains the same and the second actor is "destroyed".

3. If both actors deploy the technology, then both actors are "destroyed".

(I just made these up right now; you could probably get better versions from papers about MAD.)

Condition 2 doesn't hold for accident risk from AI: if any actor deploys an unaligned AI, then both actors are destroyed.
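To make conditions 2 and 3 concrete, here is a toy payoff-table sketch (not from the interview; the payoff numbers are invented, and the small positive payoff for unilateral deployment stands in for the strategic advantage that drives racing):

```python
# Toy payoff matrices contrasting the two cases. Each entry maps
# (A's action, B's action) -> (A's payoff, B's payoff); higher is better,
# and -100 stands in for "destroyed". All numbers are invented for illustration.

# Nuclear-style technology (condition 2 holds): a lone deployer is unharmed
# and gains a strategic edge (the +10), while the other side is destroyed.
nukes = {
    ("deploy", "deploy"):   (-100, -100),
    ("deploy", "refrain"):  (10, -100),
    ("refrain", "deploy"):  (-100, 10),
    ("refrain", "refrain"): (0, 0),
}

# Unaligned-AI accident risk (condition 2 fails): if either side deploys,
# both are destroyed, regardless of who deployed.
unaligned_ai = {
    ("deploy", "deploy"):   (-100, -100),
    ("deploy", "refrain"):  (-100, -100),
    ("refrain", "deploy"):  (-100, -100),
    ("refrain", "refrain"): (0, 0),
}

def best_response_A(payoffs, b_action):
    """A's best action, holding B's action fixed."""
    return max(["deploy", "refrain"], key=lambda a: payoffs[(a, b_action)][0])

for name, game in [("nukes", nukes), ("unaligned AI", unaligned_ai)]:
    print(f"{name}: A's best response if B refrains -> {best_response_A(game, 'refrain')}")
# nukes: "deploy" -- the unilateral advantage is what generates race dynamics.
# unaligned AI: "refrain" -- deploying destroys A as well, so accident risk
# alone creates no analogous pull toward deployment.
```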

I agree I didn't explain this well in the interview -- when I said

if the destruction happens, that affects you too

I should have said something like

if you deploy a dangerous AI system, that affects you too

which is not true for nuclear weapons (deploying a nuke doesn't affect you in and of itself).

I was already interpreting your comment as "if you deploy a dangerous AI system, that affects you too". I guess I'm just not sure your condition 2 is actually a key ingredient for the MAD doctrine. From the name, the start of Wikipedia's description, my prior impressions of MAD, and my general model of how it works, it seems like the key idea is that neither side wants to do the thing, because if they do the thing they get destroyed too.

The US doesn't want to nuke Russia, because then Russia nukes the US. This seems like the same phenomenon as some AI lab not wanting to develop and release a misaligned superintelligence (or whatever), because then the misaligned superintelligence would destroy them too. So in the key way, the analogy seems to me to hold. Which would then suggest that, however incautious or cautious society was about nuclear weapons, this analogy alone (if we ignore all other evidence) suggests we may do similarly with AI. So it seems to me to suggest that there's not an important disanalogy that should update us towards expecting safety (i.e., the history of MAD for nukes should only make us expect AI safety to the extent we think MAD for nukes was handled safely).

Condition 2 does seem important for the initial step of the US developing the first nuclear weapon, and other countries trying to do so, because it meant that the first country to get it would gain an advantage: at that point, it could use it without being destroyed itself. And that doesn't apply to extreme AI accidents.

So would your argument instead be something like the following? "The initial development of nuclear weapons did not involve MAD. The first country who got them could use them without being itself harmed. However, the initial development of extremely unsafe, extremely powerful AI would substantially risk the destruction of its creator. So the fact we developed nuclear weapons in the first place may not serve as evidence that we'll develop extremely unsafe, extremely powerful AI in the first place."

If so, that's an interesting argument, and at least at first glance it seems to me to hold up.

Condition 2 is necessary for race dynamics to arise, which is what people are usually worried about.

Suppose that AI systems weren't going to be useful for anything -- the only effect of AI systems was that they posed an x-risk to the world. Then it would still be true that "neither side wants to do the thing, because if they do the thing they get destroyed too".

Nonetheless, I think in this world no one ever builds AI systems, and so we don't need to worry about x-risk.

That seems reasonable to me. I think what I'm thinking is that that's a disanalogy between a potential "race" for transformative AI, and the race/motivation for building the first nuclear weapons, rather than a disanalogy between the AI situation and MAD.

So it seems like this disanalogy is a reason to think that "we built nuclear weapons" is weaker evidence than one might otherwise think for the claim "we'll build dangerous AI" or the claim "we'll build AI in an especially 'racing'/risky way". And that seems an important point.

But it seems like "MAD strategies have been used" remains however strong evidence it previously was for the claim "we'll do dangerous things with AI". E.g., MAD strategies could still serve as some evidence for the general idea that countries/institutions are sometimes willing to do things that are risky to themselves, and that pose very large negative externalities of risks to others, for strategic reasons. And that general idea still seems to apply at least somewhat to AI.

(I'm not sure this is actually disagreeing with what you meant/believe.)

Suppose you have two events X and Y, such that X causes Y; that is, if not-X were true then not-Y would also be true.

Now suppose there's some Y' analogous to Y, and you make the argument A: "since Y happened, Y' is also likely to happen". If that's all you know, I agree that A is reasonable evidence that Y' is likely to happen. But if you then show that the analogous X' is not true, while X was true, I think argument A provides ~no evidence.

Example:

"It was raining yesterday, so it will probably rain today."

"But it was cloudy yesterday, and today it is sunny."

"Ah. In that case it probably won't rain."

I think condition 2 causes racing, which causes MAD strategies, in the case of nuclear weapons; since condition 2 / racing doesn't hold in the case of AI, the fact that MAD strategies were used for nuclear weapons provides very little evidence about whether similar strategies will be used for AI.
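One way to see this quantitatively: a small toy model of the rain example (the probabilities are invented purely for illustration), in which yesterday's rain is evidence for rain today only via the shared cause, so that learning today is sunny screens that evidence off. Mapping to the argument: X is condition 2 / racing, Y is MAD strategies for nukes, X' is racing over AI, Y' is MAD-like strategies for AI.

```python
from itertools import product

# A toy joint distribution over four binary variables (numbers invented purely
# for illustration):
#   Cy = cloudy yesterday, Ry = rain yesterday,
#   Ct = cloudy today,     Rt = rain today.
# Clouds cause rain on each day, and cloudiness persists from day to day.
P_CLOUDY_YESTERDAY = 0.5
P_CLOUDY_TODAY_GIVEN = {True: 0.8, False: 0.2}   # weather persistence
P_RAIN_GIVEN_CLOUDY = {True: 0.6, False: 0.0}    # it only rains when cloudy

def joint(cy, ry, ct, rt):
    """Probability of one full assignment of (Cy, Ry, Ct, Rt)."""
    p = P_CLOUDY_YESTERDAY if cy else 1 - P_CLOUDY_YESTERDAY
    p *= P_RAIN_GIVEN_CLOUDY[cy] if ry else 1 - P_RAIN_GIVEN_CLOUDY[cy]
    p *= P_CLOUDY_TODAY_GIVEN[cy] if ct else 1 - P_CLOUDY_TODAY_GIVEN[cy]
    p *= P_RAIN_GIVEN_CLOUDY[ct] if rt else 1 - P_RAIN_GIVEN_CLOUDY[ct]
    return p

def prob(event, given=lambda cy, ry, ct, rt: True):
    """P(event | given), by brute-force enumeration of the 16 worlds."""
    worlds = list(product([True, False], repeat=4))
    num = sum(joint(*w) for w in worlds if given(*w) and event(*w))
    den = sum(joint(*w) for w in worlds if given(*w))
    return num / den

rain_today = lambda cy, ry, ct, rt: rt
print(prob(rain_today))                                              # 0.30
print(prob(rain_today, given=lambda cy, ry, ct, rt: ry))             # 0.48
print(prob(rain_today, given=lambda cy, ry, ct, rt: ry and not ct))  # 0.0
# Rain yesterday raises P(rain today) from 0.30 to 0.48, but only via the shared
# cause (cloudiness). Once we also condition on "sunny today", yesterday's rain
# contributes nothing -- the analogue of learning that condition 2 / racing
# doesn't hold for AI.
```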

MAD strategies could still serve as some evidence for the general idea that countries/institutions are sometimes willing to do things that are risky to themselves, and that pose very large negative externalities of risks to others, for strategic reasons.

I agree with that sentence interpreted literally. But I think you can change "for strategic reasons" to "in cases where condition 2 holds" and still capture most of the cases in which this happens.

I think I get what you're saying. Is it roughly the following?

"If an AI race did occur, maybe similar issues to what we saw in MAD might occur; there may well be an analogy there. But there's a disanalogy between the nuclear weapon case and the AI risk case with regards to the initial race, such that the initial nuclear race provides little/no evidence that a similar AI race may occur. And if a similar AI race doesn't occur, then the conditions under which MAD-style strategies may arise would not occur. So it might not really matter if there's an analogy between the AI risk situation if a race occurred and the MAD situation."

If so, I think that makes sense to me, and it seems an interesting/important argument. Though it seems to suggest something more like "We may be more ok than people might think, as long as we avoid an AI race, and we'll probably avoid an AI race", rather than simply "We may be more ok than people might think". And that distinction might e.g. suggest additional value to strategy/policy/governance work to avoid race dynamics, or to investigate how likely they are. (I don't think this is disagreeing with you, just highlighting a particular thing a bit more.)

Yup, I agree with that summary.