Rob Bensinger

Communications @ MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer's. (Though we agree about an awful lot.)


2022 MIRI Alignment Discussion
2021 MIRI Conversations
Naturalized Induction



I feel pretty frustrated at how rarely people actually bet or make quantitative predictions about existential risk from AI. EG my recent attempt to operationalize a bet with Nate went nowhere. Paul trying to get Eliezer to bet during the MIRI dialogues also went nowhere, or barely anywhere—I think they ended up making some random bet about how long an IMO challenge would take to be solved by AI. (feels pretty weak and unrelated to me. lame. but huge props to Paul for being so ready to bet, that made me take him a lot more seriously.)

This paragraph doesn't seem like an honest summary to me. Eliezer's position in the dialogue, as I understood it, was:

  • The journey is a lot harder to predict than the destination. Cf. "it's easier to use physics arguments to predict that humans will one day send a probe to the Moon, than it is to predict when this will happen or what the specific capabilities of rockets five years from now will be". Eliezer isn't claiming to have secret insights about the detailed year-to-year or month-to-month changes in the field; if he thought that, he'd have been making those near-term tech predictions already back in 2010, 2015, or 2020 to show that he has this skill.
  • From Eliezer's perspective, Paul is claiming to know a lot about the future trajectory of AI, and not just about the endpoints: Paul thinks progress will be relatively smooth and continuous, and thinks it will get increasingly smooth and continuous as time passes and more resources flow into the field. Eliezer, by contrast, expects the field to get choppier as time passes and we get closer to ASI.
  • A way to bet on this, which Eliezer repeatedly proposed but wasn't able to get Paul to do very much, would be for Paul to list out a bunch of concrete predictions that Paul sees as "yep, this is what smooth and continuous progress looks like". Then, even though Eliezer doesn't necessarily have a concrete "nope, the future will go like X instead of Y" prediction, he'd be willing to bet against a portfolio of Paul-predictions: when you expect the future to be more unpredictable, you're willing to at least weakly bet against any sufficiently ambitious pool of concrete predictions.
  • (Also, if Paul generated a ton of predictions like that, an occasional prediction might indeed make Eliezer go "oh wait, I do have a strong prediction on that question in particular; I didn't realize this was one of our points of disagreement". I don't think this is where most of the action is, but it's at least a nice side-effect of the person-who-thinks-this-tech-is-way-more-predictable spelling out predictions.)

Eliezer was also more interested in trying to reach mutual understanding of the views on offer, as opposed to "let's bet on things immediately, never mind the world-views". But insofar as Paul really wanted to have the bets conversation instead, Eliezer sank an awful lot of time into trying to find operationalizations that he and Paul could bet on, over many hours of conversation.

If your end-point take-away from that (even after actual bets were in fact made, and tons of different high-level predictions were sketched out) is "wow how dare Eliezer be so unwilling to make bets on anything", then I feel a lot less hope that world-models like Eliezer's ("long-term outcome is more predictable than the detailed year-by-year tech pathway") are going to be given a remotely fair hearing.

(Also, in fairness to Paul, I'd say that he spent a bunch of time working with Eliezer to try to understand the basic methodologies and foundations for their perspectives on the world. I think both Eliezer and Paul did an admirable job going back and forth between the thing Paul wanted to focus on and the thing Eliezer wanted to focus on, letting us look at a bunch of different parts of the elephant. And I don't think it was unhelpful for Paul to try to identify operationalizations and bets, as part of the larger discussion; I just disagree with TurnTrout's summary of what happened.)

If I was misreading the blog post at the time, how come it seems like almost no one ever explicitly predicted at the time that these particular problems were trivial for systems below or at human-level intelligence?!? 

Quoting the abstract of MIRI's "The Value Learning Problem" paper (emphasis added):

Autonomous AI systems’ programmed goals can easily fall short of programmers’ intentions. Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended. We discuss early ideas on how one might design smarter-than-human AI systems that can inductively learn what to value from labeled training data, and highlight questions about the construction of systems that model and act upon their operators’ preferences.

And quoting from the first page of that paper:

The novelty here is not that programs can exhibit incorrect or counter-intuitive behavior, but that software agents smart enough to understand natural language may still base their decisions on misrepresentations of their programmers’ intent. The idea of superintelligent agents monomaniacally pursuing “dumb”-seeming goals may sound odd, but it follows from the observation of Bostrom and Yudkowsky [2014, chap. 7] that AI capabilities and goals are logically independent. Humans can fully comprehend that their “designer” (evolution) had a particular “goal” (reproduction) in mind for sex, without thereby feeling compelled to forsake contraception. Instilling one’s tastes or moral values into an heir isn’t impossible, but it also doesn’t happen automatically.

I won't weigh in on how many LessWrong posts at the time were confused about where the core of the problem lies. But "The Value Learning Problem" was one of the seven core papers in which MIRI laid out our first research agenda, so I don't think "we're centrally worried about things that are capable enough to understand what we want, but that don't have the right goals" was in any way hidden or treated as minor back in 2014-2015.

I also wouldn't say "MIRI predicted that NLP will largely fall years before AI can match e.g. the best human mathematicians, or the best scientists", and if we saw a way to leverage that surprise to take a big bite out of the central problem, that would be a big positive update.

I'd say:

  • MIRI mostly just didn't make predictions about the exact path ML would take to get to superintelligence, and we've said we didn't expect this to be very predictable because "the journey is harder to predict than the destination". (Cf. "it's easier to use physics arguments to predict that humans will one day send a probe to the Moon, than it is to predict when this will happen or what the specific capabilities of rockets five years from now will be".)
  • Back in 2016-2017, I think various people at MIRI updated to median timelines in the 2030-2040 range (after having had longer timelines before that), and our timelines haven't jumped around a ton since then (though they've gotten a little bit longer or shorter here and there).
    • So in some sense, qualitatively eyeballing the field, we don't feel surprised by "the total amount of progress the field is exhibiting", because it looked in 2017 like the field was just getting started, there was likely an enormous amount more you could do with 2017-style techniques (and variants on them) than had already been done, and there was likely to be a lot more money and talent flowing into the field in the coming years.
    • But "the total amount of progress over the last 7 years doesn't seem that shocking" is very different from "we predicted what that progress would look like". AFAIK we mostly didn't have strong guesses about that, though I think it's totally fine to say that the GPT series is more surprising to the circa-2017 MIRI than a lot of other paths would have been.
    • (Then again, we'd have expected something surprising to happen here, because it would be weird if our low-confidence visualizations of the mainline future just happened to line up with what happened. You can expect to be surprised a bunch without being able to guess where the surprises will come from; and in that situation, there's obviously less to be gained from putting out a bunch of predictions you don't particularly believe in.)
  • Pre-deep-learning-revolution, we made early predictions like "just throwing more compute at the problem without gaining deep new insights into intelligence is less likely to be the key thing that gets us there", which was falsified. But that was a relatively high-level prediction; post-deep-learning-revolution we haven't claimed to know much about how advances are going to be sequenced.
  • We have been quite interested in hearing from others about their advance prediction record: it's a lot easier to say "I personally have no idea what the qualitative capabilities of GPT-2, GPT-3, etc. will be" than to say "... and no one else knows either", and if someone has an amazing track record at guessing a lot of those qualitative capabilities, I'd be interested to hear about their further predictions. We're generally pessimistic that "which of these specific systems will first unlock a specific qualitative capability?" is particularly predictable, but this claim can be tested via people actually making those predictions.

But the benefit of a Pause is that you use the extra time to do something in particular. Why wouldn't you want to fiscally sponsor research on problems that you think need to be solved for the future of Earth-originating intelligent life to go well? 

MIRI still sponsors some alignment research, and I expect we'll sponsor more alignment research directions in the future. I'd say MIRI leadership didn't have enough aggregate hope in Agent Foundations in particular to want to keep supporting it ourselves (though I consider its existence net-positive).

My model of MIRI is that our main focus these days is "find ways to make it likelier that a halt occurs" and "improve the world's general understanding of the situation in case this helps someone come up with a better idea", but that we're also pretty open to taking on projects in all four of these quadrants, if we find something that's promising and that seems like a good fit at MIRI (or something promising that seems unlikely to occur if it's not housed at MIRI):

                         | AI alignment work | Non-alignment work
High-EV absent a pause   |                   |
High-EV given a pause    |                   |

one positive feature it does have, it proposes to rely on a multitude of "limited weakly-superhuman artificial alignment researchers" and makes a reasonable case that those can be obtained in a form factor which is alignable and controllable.

I don't find this convincing. I think the target "dumb enough to be safe, honest, trustworthy, relatively non-agentic, etc., but smart enough to be super helpful for alignment" is narrow (or just nonexistent, using the methods we're likely to have on hand).

Even if this exists, verification seems extraordinarily difficult: how do we know that the system is being honest? Separately, how do we verify that its solutions are correct? Checking answers is sometimes easier than generating them, but only to a limited degree, and alignment seems like a case where checking is particularly difficult.

It's also important to keep in mind that on Leopold's model (and my own), these problems need to be solved under a ton of time pressure. To maintain a lead, the USG in Leopold's scenario will often need to figure out some of these "under what circumstances can we trust this highly novel system and believe its alignment answers?" issues in a matter of weeks or perhaps months, so that the overall alignment project can complete in a very short window of time. This is not a situation where we're imagining having a ton of time to develop mastery and deep understanding of these new models. (Or mastery of the alignment problem sufficient to verify when a new idea is on the right track or not.)

You and Leopold seem to share the assumption that huge GPU farms or equivalently strong compute are necessary for superintelligence.

Nope! I don't assume that.

I do think that it's likely the first world-endangering AI is trained using more compute than was used to train GPT-4; but I'm certainly not confident of that prediction, and I don't think it's possible to make reasonable predictions (given our current knowledge state) about how much more compute might be needed.

("Needed" for the first world-endangeringly powerful AI humans actually build, that is. I feel confident that you can in principle build world-endangeringly powerful AI with far less compute than was used to train GPT-4; but the first lethally powerful AI systems humans actually build will presumably be far from the limits of what's physically possible!)

But what would happen if one effectively closes that path? There will be huge selection pressure to look for alternative routes, to invest more heavily in those algorithmic breakthroughs which can work with modest GPU power or even with CPUs.

Agreed. This is why I support humanity working on things like human enhancement and (plausibly) AI alignment, in parallel with working on an international AI development pause. I don't think that a pause on its own is a permanent solution, though if we're lucky and the laws are well-designed I imagine it could buy humanity quite a few decades.

I hope people will step back from solely focusing on advocating for policy-level prescriptions (as none of the existing policy-level prescriptions look particularly promising at the moment) and invest some of their time in continuing object-level discussions of AI existential safety without predefined political ends.

FWIW, MIRI does already think of "generally spreading reasonable discussion of the problem, and trying to increase the probability that someone comes up with some new promising idea for addressing x-risk" as a top organizational priority.

The usual internal framing is some version of "we have our own current best guess at how to save the world, but our idea is a massive longshot, and not the sort of basket humanity should put all its eggs in". I think "AI pause + some form of cognitive enhancement" should be a top priority, but I also consider it a top priority for humanity to try to find other potential paths to a good future.

As a start, you can prohibit sufficiently large training runs. This isn't a necessary-and-sufficient condition, and doesn't necessarily solve the problem on its own, and there's room for debate about how risk changes as a function of training resources. But it's a place to start, when the field is mostly flying blind about where the risks arise; and choosing a relatively conservative threshold makes obvious sense when failing to leave enough safety buffer means human extinction. (And when algorithmic progress is likely to reduce the minimum dangerous training size over time, whatever it is today -- also a reason the cap is likely to need to lower over time to some extent, until we're out of the lethally dangerous situation we currently find ourselves in.)

Alternatively, they either don't buy the perils or believes there's a chance the other chance may not?

If they "don't buy the perils", and the perils are real, then Leopold's scenario is falsified and we shouldn't be pushing for the USG to build ASI.

If there are no perils at all, then sure, Leopold's scenario and mine are both false. I didn't mean to imply that our two views are the only options.

Separately, Leopold's model of "what are the dangers?" is different from mine. But I don't think the dangers Leopold is worried about are dramatically easier to understand than the dangers I'm worried about (in the respective worlds where our worries are correct). Just the opposite: the level of understanding you need to literally solve alignment for superintelligences vastly exceeds the level you need to just be spooked by ASI and not want it to be built. Which is the point I was making; not "ASI is axiomatically dangerous", but "this doesn't count as a strike against my plan relative to Leopold's, and in fact Leopold is making a far bigger ask of government than I am on this front".

Nuclear war essentially has a localized p(doom) of 1

I don't know what this means. If you're saying "nuclear weapons kill the people they hit", I don't see the relevance; guns also kill the people they hit, but that doesn't make a gun strategically similar to a smarter-than-human AI system.

Why? 95% risk of doom isn't certainty, but seems obviously more than sufficient.

For that matter, why would the USG want to build AGI if they considered it a coinflip whether this will kill everyone or not? The USG could choose the coinflip, or it could choose to try to prevent China from putting the world at risk without creating that risk itself. "Sit back and watch other countries build doomsday weapons" and "build doomsday weapons yourself" are not the only two options.

Leopold's scenario requires that the USG come to deeply understand all the perils and details of AGI and ASI (since they otherwise don't have a hope of building and aligning a superintelligence), but then needs to choose to gamble its hegemony, its very existence, and the lives of all its citizens on a half-baked mad science initiative, when it could simply work with its allies to block the tech's development and maintain the status quo at minimal risk.

Success in this scenario requires a weird combination of USG prescience with self-destructiveness: enough foresight to see what's coming, but paired with a weird compulsion to race to build the very thing that puts its existence at risk, when it would potentially be vastly easier to spearhead an international alliance to prohibit this technology.
