Rob Bensinger

Communications lead at MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer's.


To pick out a couple of specific examples from your list, Wei Dai:

14. Human-controlled AIs causing ethical disasters (e.g., large scale suffering that can't be "balanced out" later) prior to reaching moral/philosophical maturity

This is a serious long-term concern if we don't kill ourselves first, but it's not something I see as a premise for "the priority is for governments around the world to form an international agreement to halt AI progress". If AI were easy to use for concrete tasks like "build nanotechnology" but hard to use for things like CEV, I'd instead see the priority as "use AI to prevent anyone else from destroying the world with AI", and I wouldn't want to trade off probability of that plan working in exchange for (e.g.) more probability of the US and the EU agreeing in advance to centralize and monitor large computing clusters.

After someone has done a pivotal act like that, you might then want to move more slowly insofar as you're worried about subtle moral errors creeping in to precursors to CEV.

30. AI systems end up controlled by a group of humans representing a small range of human values (i.e., an ideological or religious group that imposes values on everyone else)

I currently assign very low probability to humans being able to control the first ASI systems, and redirecting governments' attention away from "rogue AI" and toward "rogue humans using AI" seems very risky to me, insofar as it causes governments to misunderstand the situation, and to specifically misunderstand it in a way that encourages racing.

If you think rogue actors can use ASI to achieve their ends, then you should probably also think that you could use ASI to achieve your own ends; misuse risk tends to go hand-in-hand with "we're the Good Guys, let's try to outrace the Bad Guys so AI ends up in the right hands". This could maybe be justified if it were true, but when it's not even true it strikes me as an especially bad argument to make.

Yep, before I saw orthonormal's response I had a draft-reply written that says almost literally the same thing:

we just call 'em like we see 'em


insofar as we make bad predictions, we should get penalized for it. and insofar as we think alignment difficulty is the crux for 'why we need to shut it all down', we'd rather directly argue against illusory alignment progress (and directly acknowledge real major alignment progress as a real reason to be less confident of shutdown as a strategy) rather than redirect to something less cruxy

I'll also add: Nate (unlike Eliezer, AFAIK?) hasn't flatly said 'alignment is extremely difficult'. Quoting from Nate's "sharp left turn" post:

Many people wrongly believe that I'm pessimistic because I think the alignment problem is extraordinarily difficult on a purely technical level. That's flatly false, and is pretty high up there on my list of least favorite misconceptions of my views.

I think the problem is a normal problem of mastering some scientific field, as humanity has done many times before. Maybe it's somewhat trickier, on account of (e.g.) intelligence being more complicated than, say, physics; maybe it's somewhat easier on account of how we have more introspective access to a working mind than we have to the low-level physical fields; but on the whole, I doubt it's all that qualitatively different than the sorts of summits humanity has surmounted before.

It's made trickier by the fact that we probably have to attain mastery of general intelligence before we spend a bunch of time working with general intelligences (on account of how we seem likely to kill ourselves by accident within a few years, once we have AGIs on hand, if no pivotal act occurs), but that alone is not enough to undermine my hope.

What undermines my hope is that nobody seems to be working on the hard bits, and I don't currently expect most people to become convinced that they need to solve those hard bits until it's too late.

So it may be that Nate's models would be less surprised by alignment breakthroughs than Eliezer's models. And some other MIRI folks are much more optimistic than Nate, FWIW.

My own view is that I don't feel nervous leaning on "we won't crack open alignment in time" as a premise, and absent that premise I'd indeed be much less gung-ho about government intervention.

why put all your argumentative eggs in the "alignment is hard" basket? (If you're right, then policymakers can't tell that you're right.)

The short answer is "we don't put all our eggs in the basket" (e.g., Eliezer's TED talk and TIME article emphasize that alignment is an open problem, but they emphasize other things too, and they don't go into detail on exactly how hard Eliezer thinks the problem is), plus "we very much want at least some eggs in that basket because it's true, it's honest, it's cruxy for us, etc." And it's easier for policymakers to acquire strong Bayesian evidence for "the problem is currently unsolved" and "there's no consensus about how to solve it" and "most leaders in the field seem to think there's a serious chance we won't solve it in time" than to acquire strong Bayesian evidence for "we're very likely generations away from solving alignment", so the difficulty of communicating the latter isn't a strong reason to de-emphasize all the former points.

The longer answer is a lot more complicated. We're still figuring out how best to communicate our views to different audiences, and "it's hard for policymakers to evaluate all the local arguments or know whether Yann LeCun is making more sense than Yoshua Bengio" is a serious constraint. If there's a specific argument (or, e.g., a specific three arguments) you think we should be emphasizing alongside "alignment is unsolved and looks hard", I'd be interested to hear your suggestion and your reasoning. Your list is very long and isn't optimized for policymakers, so I'm not sure what specific changes you have in mind here.

I expect it makes it easier, but I don't think it's solved.

Suppose you want to synthesize a lot of diamonds. Instead of giving an AI some lofty goal like "maximize diamonds in an aligned way", why not give it a bunch of small, grounded ones?

  1. "Plan the factory layout of the diamond synthesis plant with these requirements."
  2. "Order the equipment needed; here are the payment credentials."
  3. "Supervise construction this workday, comparing to the original plans."
  4. "Given this step of the plan, do it."
  5. (Once the factory is built) "Remove the output from diamond synthesis machine A53 and clean it."

That is how MIRI imagines a sane developer using just-barely-aligned AI to save the world. You don't build an open-ended maximizer and unleash it on the world to maximize some quantity that sounds good to you; that sounds insanely difficult. You carve out as many tasks as you can into concrete, verifiable chunks, and you build the weakest and most limited possible AI you can to complete each chunk, to minimize risk. (Though per faul_sname, you're likely to be pretty limited in how much you can carve up the task, given time will be a major constraint and there may be parts of the task you don't fully understand at the outset.)

Cf. The Rocket Alignment Problem. The point of solving the diamond maximizer problem isn't to go build the thing; it's that solving it is an indication that we've become less conceptually confused about real-world optimization and about aimable cognitive work. Being less conceptually confused about very basic aspects of problem-solving and goal-oriented reasoning means that you might be able to build some of your powerful AI systems out of building blocks that are relatively easy to analyze, test, design, predict, separate out into discrete modules, measure and limit the capabilities of, etc., etc.

That seems acceptable; industrial equipment causes accidents all the time, and the main thing is to limit the damage: fences to limit the robots' operating area, timers that shut down control after a timeout, etc.

If everyone in the world chooses to permanently use very weak systems because they're scared of AI killing them, then yes, the impact of any given system failing will stay low. But that's not what's going to actually happen; many people will use more powerful systems, once they can, because they misunderstand the risks or have galaxy-brained their way into not caring about them (e.g. 'maybe humans don't deserve to live', 'if I don't do it someone else will anyway', 'if it's that easy to destroy the world then we're fucked anyway so I should just do the Modest thing of assuming nothing I do is that important'...).

The world needs some solution to the problem "if AI keeps advancing and more-powerful AI keeps proliferating, eventually someone will destroy the world with it". I don't know of a way to leverage AI to solve that problem without the AI being pretty dangerously powerful, so I don't think AI is going to get us out of this mess unless we make a shocking amount of progress on figuring out how to align more powerful systems. (Where "aligning" includes things like being able to predict in advance how pragmatically powerful your system is, and being able to carefully limit the ways in which it's powerful.)

To be clear: The diamond maximizer problem is about getting specific intended content into the AI's goals ("diamonds" as opposed to some random physical structure it's maximizing), not just about building a stable maximizer.

From briefly talking to Eliezer about this the other day, I think the story from MIRI's perspective is more like:

  • Back in 2001, we defined "Friendly AI" as "The field of study concerned with the production of human-benefiting, non-human-harming actions in Artificial Intelligence systems that have advanced to the point of making real-world plans in pursuit of goals."

We could have defined the goal more narrowly or generically than that, but that just seemed like an invitation to take your eye off the ball: if we aren't going to think about the question of how to get good long-run outcomes from powerful AI systems, who will?

And many of the technical and philosophical problems seemed particular to CEV, which seemed like an obvious sort of solution to shoot for: just find some way to leverage the AI's intelligence to solve the problem of extrapolating everyone's preferences in a reasonable way, and of aggregating those preferences fairly.

  • Come 2014, Stuart Russell and MIRI were both looking for a new term to replace "the Friendly AI problem", now that the field was starting to become a Real Thing. Both parties disliked Bostrom's "the control problem". In conversation, Russell proposed "the alignment problem", and MIRI liked it, so Russell and MIRI both started using the term in public.

Unfortunately, it gradually came to light that Russell and MIRI had understood "Friendly AI" to mean two moderately different things, and this disconnect now turned into a split between how MIRI used "(AI) alignment" and how Russell used "(value) alignment". (Which I think also influenced the split between Paul Christiano's "(intent) alignment" and MIRI's "(outcome) alignment".)

Russell's version of "friendliness/alignment" was about making the AI have good, human-deferential goals. But Creating Friendly AI 1.0 had been very explicit that "friendliness" was about good behavior, regardless of how that's achieved. MIRI's conception of "the alignment problem" (like Bostrom's "control problem") included tools like capability constraint and boxing, because the thing we wanted researchers to focus on was the goal of leveraging AI capabilities to get actually-good outcomes, whatever technical work that requires, not some proxy goal that might turn out to be surprisingly irrelevant.

Again, we wanted a field of people keeping their eye on the ball and looking for clever technical ways to get the job done, rather than a field that neglects some actually-useful technique because it doesn't fit their narrow definition of "alignment".

  • Meanwhile, developments like the rise of deep learning had updated MIRI that CEV was not going to be a realistic thing to shoot for with your first AI. We were still thinking of some version of CEV as the ultimate goal, but it now seemed clear that capabilities were progressing too quickly for humanity to have time to nail down all the details of CEV, and it was also clear that the approaches to AI that were winning out would be far harder to analyze, predict, and "aim" than 2001-Eliezer had expected. It seemed clear that if AI was going to help make the future go well, the first order of business would be to do the minimal thing to prevent other AIs from destroying the world six months later, with other parts of alignment/friendliness deferred to later.

I think considerations like this eventually trickled into how MIRI used the term "alignment". Our first public writing reflecting the switch from "Friendly AI" to "alignment", our Dec. 2014 agent foundations research agenda, said:

We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.”

Whereas by July 2016, when we released a new research agenda that was more ML-focused, "aligned" was shorthand for "aligned with the interests of the operators".

In practice, we started using "aligned" to mean something more like "aimable" (where aimability includes things like corrigibility, limiting side-effects, monitoring and limiting capabilities, etc., not just "getting the AI to predictably tile the universe with smiley faces rather than paperclips"). Focusing on CEV-ish systems mostly seemed like a distraction, and an invitation to get caught up in moral philosophy and pie-in-the-sky abstractions, when "do a pivotal act" is legitimately a hugely more philosophically shallow topic than "implement CEV". Instead, we went out of our way to frame the challenge of alignment in a way that seemed almost comically simple and "un-philosophical", but that successfully captured all of the key obstacles: 'explain how to use an AI to cause there to exist two strawberries that are identical at the cellular level, without causing anything weird or disruptive to happen in the process'.

Since realistic pivotal acts still seemed pretty outside the Overton window (and since we were mostly focused on our own research at the time), we wrote up our basic thoughts about the topic on Arbital but didn't try to super-popularize the topic among rationalists or EAs at the time. (Which unfortunately, I think, exacerbated a situation where the larger communities had very fuzzy models of the strategic situation, and fuzzy models of what the point even was of this "alignment research" thing; alignment research just became a thing-that-was-good-because-it-was-a-good, not a concrete part of a plan backchained from concrete real-world goals.)

I don't think MIRI wants to stop using "aligned" in the context of pivotal acts, and I also don't think MIRI wants to totally divorce the term from the original long-term goal of friendliness/alignment.

Turning "alignment" purely into a matter of "get the AI to do what a particular stakeholder wants" is good in some ways -- e.g., it clarifies that the level of alignment needed for pivotal acts could also be used to do bad things.

But from Eliezer's perspective, this move would also be sending a message to all the young Eliezers "Alignment Research is what you do if you're a serious sober person who thinks it's naive to care about Doing The Right Thing and is instead just trying to make AI Useful To Powerful People; if you want to aim for the obvious desideratum of making AI friendly and beneficial to the world, go join e/acc or something". Which does not seem ideal.

So I think my proposed solution would be to just acknowledge that 'the alignment problem' is ambiguous between three different (overlapping) efforts to figure out how to get good and/or intended outcomes from powerful AI systems:

  • intent alignment, which is about getting AIs to try to do what the AI thinks the user wants, and in practice seems to be most interested in 'how do we get AIs to be generically trying-to-be-helpful'.
  • "strawberry problem" alignment, which is about getting AIs to safely, reliably, and efficiently do a small number of specific concrete tasks that are very difficult, for the sake of ending the acute existential risk period.
  • CEV-style alignment, which is about getting AIs to fully figure out how to make the future good.

Plausibly it would help to have better names for the latter two things. The distinction is similar to "narrow value learning vs. ambitious value learning", but both problems (as MIRI thinks about them) are a lot more general than just "value learning", and there's a lot more content to the strawberry problem than to "narrow alignment", and more content to CEV than to "ambitious value learning" (e.g., CEV cares about aggregation across people, not just about extrapolation).

(Note: Take the above summary of MIRI's history with a grain of salt; I had Nate Soares look at this comment and he said "on a skim, it doesn't seem to quite line up with my recollections nor cut things along the joints I would currently cut them along, but maybe it's better than nothing".)

In the context of a conversation with Balaji Srinivasan about my AI views snapshot, I asked Nate Soares what sorts of alignment results would impress him, and he said:

example thing that would be relatively impressive to me: specific, comprehensive understanding of models (with the caveat that that knowledge may lend itself more (and sooner) to capabilities before alignment). demonstrated e.g. by the ability to precisely predict the capabilities and quirks of the next generation (before running it)

i'd also still be impressed by simple theories of aimable cognition (i mostly don't expect that sort of thing to have time to play out any more, but if someone was able to come up with one after staring at LLMs for a while, i would at least be impressed)

fwiw i don't myself really know how to answer the question "technical research is more useful than policy research"; like that question sounds to me like it's generated from a place of "enough of either of these will save you" whereas my model is more like "you need both"

tho i'm more like "to get the requisite technical research, aim for uploads" at this juncture

if this was gonna be blasted outwards, i'd maybe also caveat that, while a bunch of this is a type of interpretability work, i also expect a bunch of interpretability work to strike me as fake, shallow, or far short of the bar i consider impressive/hopeful

(which is not itself supposed to be any kind of sideswipe; i applaud interpretability efforts even while thinking it's moving too slowly etc.)

I can come up with plans for destroying the world without wanting to do it, and other cognitive systems probably can too.

You're changing the topic to "can you do X without wanting Y?", when the original question was "can you do X without wanting anything at all?".

Nate's answer to nearly all questions of the form "can you do X without wanting Y?" is "yes", hence his second claim in the OP: "the wanting-like behavior required to pursue a particular training target X, does not need to involve the AI wanting X in particular".

I do need to answer that question using a goal-oriented search process. But my goal would be "answer Paul's question", not "destroy the world".

Your ultimate goal would be neither of those things; you're a human, and if you're answering Paul's question it's probably because you have other goals that are served by answering.

In the same way, an AI that's sufficiently good at answering sufficiently hard and varied questions would probably also have goals, and it's unlikely by default that "answer questions" will be the AI's primary goal.

The idea that an area of study is less scientific because the subject is inelegant is a blinkered view of what science is.

See my reply to Bogdan here. The issue isn't "inelegance"; we also lack an inelegant ability to predict or explain how particular ML systems do what they do.

Modern ML is less like modern chemistry, and more like ancient culinary arts and medicine. (Or "ancient culinary arts and medicine shortly after a cultural reboot", such that we have a relatively small number of recently-developed shallow heuristics and facts to draw on, rather than centuries of hard-earned experience.)

The opening sounds a lot like saying "aerodynamics used to be a science until people started building planes."

The reason this analogy doesn't land for me is that I don't think our epistemic position regarding LLMs is similar to, e.g., the Wright brothers' epistemic position regarding heavier-than-air flight.

The point Nate was trying to make with "ML is no longer a science" wasn't "boo current ML that actually works, yay GOFAI that didn't work". The point was exactly to draw a contrast between, e.g., our understanding of heavier-than-air flight and our understanding of how the human brain works. The invention of useful tech that interfaces with the brain doesn't entail that we understand the brain's workings in the way we've long understood flight; it depends on what the (actual or hypothetical) tech is.

Maybe a clearer way of phrasing it is "AI used to be failed science; now it's (mostly, outside of a few small oases) a not-even-attempted science". "Failed science" maybe makes it clearer that the point here isn't to praise the old approaches that didn't work; there's a more nuanced point being made.

Some of Nate’s quick thoughts (paraphrased), after chatting with him:

Nate isn’t trying to say that we have literally zero understanding of deep nets. What he’s trying to do is qualitatively point to the kind of high-level situation we’re in, in part because he thinks there is real interpretability progress, and when you’re working in the interpretability mines and seeing real advances it can be easy to miss the forest for the trees and forget how far we are from understanding what LLMs are doing. (Compared to, e.g., how well we can predict or post-facto-mechanistically-explain a typical system humans have engineered.)

Nobody's been able to call the specific capabilities of systems in advance. Nobody's been able to call the specific exploits in advance. Nobody's been able to build better cognitive algorithms by hand after understanding how the AI does things we can't yet code by hand. There is clearly some other level of understanding that is possible that we lack, and that we once sought, and that only the interpretability folks continue to seek.

E.g., think of that time Neel Nanda figured out how a small transformer does modular arithmetic (AXRP episode). If nobody had ever thought of that algorithm for an adder, we would have thereby learned a new algorithm for an adder. There are things that these AI systems are doing that aren’t just lots of stuff we know; there are levels of organization of understanding that give you the ability to predict how things work outside of the bands where we’ve observed them.

It seems trendy to declare that such deeper levels of understanding never existed in the first place and that that's all white-tower stuff, but Nate thinks this point of view is missing a pretty important and central thread.

The missing thread isn’t trivial to put into words, but it includes things like: 

  • This sounds like the same sort of thing some people would say if they were staring at computer binary for the first time and didn't know about the code behind the scenes: "We have plenty of understanding beyond just how the CPU handles instructions; we understand how memory caching works and we have recognized patterns like the stack and the heap; talking as if there's some deeper level of organization is talking like a theorist when in fact this is an engineering problem." Those types of understanding aren't false, but they aren't the sort of understanding of someone who has comprehended the codebase they're looking at.
  • There are, predictably, things to learn here; the messiness and complexity of the real world doesn’t mean we already know the relevant principles. You don't need to understand everything about how a bird works in order to build an airplane; there are compressible principles behind how birds fly; if you understand what's going on you can build flying devices that have significantly more carrying capacity than a bird, and this holds true even if the practical engineering of an airplane requires a bunch of trial and error and messy engineering work.
  • A mind’s causal structure is allowed to be complicated; we can see the weights, but we don’t thereby have a mastery of the high-level patterns. In the case of humans, neuroscience hasn’t actually worked to give us a mastery of the high-level patterns the human brain is implementing.
  • Mystery is in the map, not in the territory; reductionism works. Not all sciences that can exist, already exist today.

Possibly the above pointers are only useful if you already grok the point we're trying to make, and aren't so useful for communicating a new idea; but perhaps not.
