Rob Bensinger

Communications lead at MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer's.


2022 MIRI Alignment Discussion
2021 MIRI Conversations

Wiki Contributions

Load More


The genre of plans that I'd recommend to groups currently pushing the capabilities frontier is: aim for a pivotal act that's selected for being (to the best of your knowledge) the easiest-to-align action that suffices to end the acute risk period. Per Eliezer on Arbital, the "easiest-to-align" condition probably means that you want the act that requires minimal cognitive abilities, out of the set of acts that suffice to prevent the world from being destroyed:

In the context of AI alignment, the "Principle of Minimality" or "Principle of Least Everything" says that when we are building the first sufficiently advanced Artificial Intelligence, we are operating in an extremely dangerous context in which building a marginally more powerful AI is marginally more dangerous. The first AGI ever built should therefore execute the least dangerous plan for preventing immediately following AGIs from destroying the world six months later. Furthermore, the least dangerous plan is not the plan that seems to contain the fewest material actions that seem risky in a conventional sense, but rather the plan that requires the least dangerous cognition from the AGI executing it. Similarly, inside the AGI itself, if a class of thought seems dangerous but necessary to execute sometimes, we want to execute the fewest possible instances of that class of thought.

E.g., if we think it's a dangerous kind of event for the AGI to ask "How can I achieve this end using strategies from across every possible domain?" then we might want a design where most routine operations only search for strategies within a particular domain, and events where the AI searches across all known domains are rarer and visible to the programmers. Processing a goal that can recruit subgoals across every domain would be a dangerous event, albeit a necessary one, and therefore we want to do less of it within the AI (and require positive permission for all such cases and then require operators to validate the results before proceeding).

Ideas that inherit from this principle include the general notion of Task-directed AGI, taskishness, and mild optimization.

Having a plan for alignment, deployment, etc. of AGI is (on my model) crucial for orgs that are trying to build AGI.

MIRI itself isn't pushing the AI capabilities frontier, but we are trying to do whatever seems likeliest to make the long-term future go well, and our guess is that the best way to do this is "make progress on figuring out AI alignment". So I can separately answer the question "what's MIRI's organizational plan for solving alignment?"

My answer to that question is: we don't currently have one. Nate and Eliezer are currently doing a lot of sharing of their models, while keeping an eye out for hopeful-seeming ideas.

  • If an alignment idea strikes us as having even a tiny scrap of hope, and isn't already funding-saturated, then we're making sure it gets funded. We don't care whether that happens at MIRI versus elsewhere — we're just seeking to maximize the amount of good work that's happening in the world (insofar as money can help with that), and trying to bring about the existence of a research ecosystem that contains a wide variety of different moonshots and speculative ideas that are targeted at the core difficulties of alignment (described in the AGI Ruin and sharp left turn write-ups).
  • If an idea seems to have a significant amount of hope, and not just a tiny scrap — either at a glance, or after being worked on for a while by others and bearing surprisingly promising fruit — then I expect that MIRI will make that our new organizational focus, go all-in, and pour everything we have into helping with it as much as we can. (E.g., we went all-in on our 2017-2020 research directions, before concluding in late 2020 that these were progressing too slowly to still have significant hope, though they might still meet the "tiny scrap of hope" bar.)

None of the research directions we're aware of currently meet our "significant amount of hope" bar, but several things meet the "tiny scrap of hope" bar, so we're continuing to keep an eye out and support others' work, while not going all-in on any one approach.

Various researchers at MIRI are pursuing research pathways as they see fit, though (as mentioned) none currently seem promising enough to MIRI's research leadership to make us want to put lots of eggs in those baskets or narrowly focus the org's attention on those directions. We just think they're worth funding at all, given how important alignment is and how little of an idea the world has about how to make progress; and MIRI is as good a place as any to host this work.

Scott Garrabrant and Abram Demski wrote the Embedded Agency sequence as their own take on the "Agent Foundations" problems, and they and other MIRI researchers have continued to do work over the years on problems related to Embedded Agency / Agent Foundations, though MIRI as a whole diversified away from the Agent Foundations agenda years ago. (AFAIK Scott sees "Embedded Agency" less as a discrete agenda, and more as a cluster of related problems/confusions that bear various relations to different parts of the alignment problem.)

(Caveat: I had input from some other MIRI staff in writing the above, but I'm speaking from my own models above, not trying to perfectly capture the view of anyone else at MIRI.)

These types of posts are what drive me to largely regard lesswrong as unserious.

Do you think that there are specific falsehoods in the OP? Or do you just think it's unrespectable for humans to think about the future?

Solve the immediate problem of AGI, and then we can talk about whatever sci-fi bullcrap you want to.

Some people object to working on AGI alignment on the grounds that the future will go a lot better if we take our hands off the steering wheel and let minds develop "naturally" and "freely".

Some of those people even own companies with names like "Google"!

The best way to address that family of views is to actually talk about what would probably happen if you let a random misaligned AGI, or a random alien, optimize the future.

Foxes > Hedgehogs.

So on your understanding, "foxes" = people who have One Big Theory about which topics are respectable, and answer all futurist questions based on that theory? While "hedgehogs" = people who write long, detailed blog posts poking at various nuances and sub-nuances of a long list of loosely related object-level questions?

... Seems to me that you're either very confused about what "foxes" and "hedgehogs" are, or you didn't understand much of the OP's post.

Writing a long post about a topic doesn't imply that you're using One Simple Catch-All Model to generate all the predictions, and it doesn't imply that you're confident about the contents of the post. Refusing to think about a topic isn't being a "fox".

You'll learn a lot more about the future paying attention to what's happening right now than by wild extrapolation.

Because as all foxes know, "thinking about the present" and "thinking about the future" are mutually exclusive.

many approaches to training AGI currently seem to have as a training target something like "learn to predict humans", or some other objective that is humanly-meaningful but not-our-real-values,

I don't know whether this will continue in the future (all the way up to AGI). If it does, then it strikes me as a sufficiently coarse-grained approach (that's bad enough at inner alignment, and bad enough at outer-alignment-to-specific-things-we-actually-care-about) that I'd still be pretty surprised if the result (in the limit of superintelligence) bears any resemblance to stuff we care much about, good or bad.

E.g., there are many more "unconscious configurations of matter that bear some relation to things you learn in trying to predict humans" than there are "conscious configurations of matter that bear some relation to things you learn in trying to predict humans". Building an entire functioning conscious mind is still a very complicated end-state that requires getting lots of bits into the AGI's terminal goals correctly; it doesn't necessarily become that much easier just because we're calling the ability we're training "human prediction". Like, a superintelligent paperclipper would also be excellent at the human prediction task, given access to information about humans.

(I'll also mention that I think it's a terrible idea for safety-conscious AI researchers to put all their eggs in the "train AI via lots of data on humans" basket. But that's a separate question from what AI researchers are likely to do in practice.)

How are you defining "super-intelligent", "AGI", and "ANI" here?

I'd distinguish two questions:

  • Pivotal-act-style AI:  How do we leverage AI (or some other tech) to end the period where humans can imminently destroy the world with AI?
  • CEV-style AI:  How do we leverage AI to solve all of our problems and put us on a trajectory to an ~optimal future?

My guess is that successful pivotal act AI will need to be AGI, though I'm not highly confident of this. By "AGI" I mean "something that's doing qualitatively the right kind of reasoning to be able to efficiently model physical processes in general, both high-level and low-level".

I don't mean that the AGI that saves the world necessarily actually has the knowledge or inclination to productively reason about arbitrary topics -- e.g., we might want to limit AGI to just reasoning about low-level physics (in ways that help us build tech to save the world), and keep the AGI from doing dangerous things like "reasoning about its operators' minds". (Similarly, I would call a human a "general intelligence" insofar as they have the right cognitive machinery to do science in general, even if they've never actually thought about physics or learned any physics facts.)

In the case of CEV-style AI, I'm much more confident that it will need to be AGI, and I strongly expect it to need to be aligned enough (and capable enough) that we can trust it to reason about arbitrary domains. If it can safely do CEV at all, then we shouldn't need to restrict it -- needing to restrict it is a flag that we aren't ready to hand it such a difficult task.

The definitions given in the post are:

  • ASI-boosted humans — We solve all of the problems involved in aiming artificial superintelligence at the things we’d ideally want.
  • misaligned AI — Humans build and deploy superintelligent AI that isn’t aligned with what we’d ideally want.

I'd expect most people to agree that "We solve all of the problems involved in aiming artificial superintelligence at the things we'd ideally want" yields outcomes that are about as good as possible, and I'd expect most of the disagreement to turn (either overtly or in some subtle way) on differences in how we're defining relevant words (like "ideally", "good", and "problems").

I'd be fine with skipping over this question, except that some of the differences-in-definition might be important for the other questions, so this question may be useful for establishing a baseline.

With "misaligned AI", there are some definitional issues but I expect most of the disagreement to be substantive, since there are a lot of different levels of Badness you could expect even if you want to call all misaligned AI "bad" (at least relative to ASI-boosted humans).

In my own answers, I interpreted "misaligned AGI" as meaning: We weren't good enough at alignment to make the AGI do exactly what we wanted, so it permanently took control of the future and did "something that isn't exactly what we wanted" instead. (Which might be kinda similar to what we wanted, or might be wildly different, etc.)

If an alien only cared about maximizing the amount of computronium in the universe, and it built an AI that fills the universe with computronium because the AI values calculating pi, then I think I'd say that the AI is "aligned with that alien by default / by accident", rather than saying "the AI is misaligned with that alien but is doing ~exactly what we want anyway". So if someone thinks AI does exactly what humans want even with humans putting in zero effort to steer the AI toward that outcome, I'd classify that as "aligned-by-default AI", rather than "misaligned AI". (But there's still a huge range of possible-in-principle outcomes from misaligned AI, even if I think some a lot more likely than others.)

From the scaling-pilled perspective, or even just centrist AI perspective, this is an insane position: it is taking a L on one of, if not the most, important future technological capabilities, which in the long run may win or lose wars. If China wants to dominate Asia, much less surpass the obsolete American empire, or create AGI, or lead in aerospace, or create '5G' or whatever, it's hard to see how it's going to do that while paying more for chips which are half a decade or worse out of date.

The scaling-pilled AI view ought to be that scaling AI kills you. Why pretend that there's a strategic advantage here, as opposed to a loaded gun you can point at your own head if you're stupid enough?

It's one thing to say "given China's actual beliefs, they ought to do X" or "if China were rationally acting on a correct understanding of the world, they would do Y". But why criticize China for avoiding a self-destructive action that would make sense to do if they had a specific combination of definitely-true, maybe-true, and definitely-false beliefs -- a specific combination they don't in fact have?

That's what he's claiming, because he's claiming "cosmopolitan value" is itself a human value. (Just a very diversity-embracing one.)

I think Dennett's argumentation about the hard problem of consciousness has usually been terrible, and I don't see him as an important forerunner of illusionism, though he's an example of someone who soldiered on for anti-realism about phenomenal consciousness for long stretches of time where the arguments were lacking.

I think I remember Eliezer saying somewhere that he also wasn't impressed with Dennett's takes on the hard problem, but I forget where?

His approach also seems to be "explain our claims about consciousness".

There's some similarity between heterophenomenology and the way Eliezer/Nate talk about consciousness, though I guess I think of Eliezer/Nate's "let's find a theory that makes sense of our claims about consciousness" as more "here's a necessary feature of any account of consciousness, and a plausibly fruitful way to get insight into a lot of what's going on", not as an argument for otherwise ignoring all introspective data. Heterophenomenology IMO was always a somewhat silly and confused idea, because it's proposing that we a priori reject introspective evidence but it's not giving a clear argument for why.

(Or, worse, it's arguing something orthogonal to whether we should care about introspective evidence, while winking and nudging that there's something vaguely unrespectable about the introspective-evidence question.)

There are good arguments for being skeptical of introspection here, but "that doesn't sound like it's in the literary genre of science" should not be an argument that Bayesians find very compelling.

Thanks for the comment, Amelia! :)

One million years is ten thousand generations of humans as we know them. If AI progress were impossible under the heel of a world-state, we could increase intelligence by a few points each generation. This already happens naturally and it would hardly be difficult to compound the Flynn effect.

I think the "unboosted humans" hypothetical is meant to include mind-uploading (which makes the generation time an underestimate), but we're assuming that the simulation overlords stop us from drastically improving the quality of our individual reasoning.

Nate assigns "base humans, left alone" an ~82% chance of producing an outcome at least as good as "tiling our universe-shard with computronium that we use to run glorious merely-human civilizations", which seems unlikely to me if we can't upload humans at all. (But maybe I'm misunderstanding something about his view.)

Surely we could hit endgame technology that hits the limits of physical possibility/diminishing returns in one million years, let alone five hundred of those spans.

I think we hit the limits of technology we can think about, understand, manipulate, and build vastly earlier than that (especially if we have fast-running human uploads). But I think this limit is a lot lower than the technologies you could invent if your brain were as large as the planet Jupiter, you had native brain hardware for doing different forms of advanced math in your head, you could visualize the connection between millions of different complex machines in your working memory and simulate millions of possible complex connections between those machines inside your own head, etc.

Even when it comes to just winning a space battle using a fixed pool of fighters, I expect to get crushed by a superintelligence that can individually think about and maneuver effectively arbitrary numbers of nanobots in real time, versus humans that are manually piloting (or using crappy AI to pilot) our drones.

In comparative terms, a five hundred year sabbatical from AI would reduce the share of resources we could reach by an epsilon only, and if AI safety premises are sound then it would greatly increase EV.

Oh, agreed. But we're discussing a scenario where we never build ASI, not one where we delay 500 years.

This point is likely moot, of course. I understand that we do not live in a totalitarian world state and your intent is just to assure people that AI safety people are not neoluddites.

Yep! And more generally, to share enough background model (that doesn't normally come up in inside-baseball AI discussions) to help people identify cruxes of disagreement.

I suppose one could attempt to help a state establish global dominance

Seems super unrealistic to me, and probably bad if you could achieve it.

A different scenario that makes a lot more sense, IMO, is an AGI project pairing with some number of states during or after an AGI-enabled pivotal act. But that assumes you've already solved enough of the alignment problem to do at least one (possibly state-assisted) pivotal act.

I think there's kind of a lot of room between 95% of potential value being lost and 5%!!

My intuition is that capturing even 1% of the future's total value is an astoundingly conjunctive feat -- a narrow enough target that it's surprising if we can hit that target and yet not hit 10%, or 99%. Think less "capture at least 1% of the negentropy in our future light cone and use it for something", more "solve the first 999 digits of a 1000-digit decimal combination lock specifying an extremely complicated function of human brain-states that somehow encodes all the properties of Maximum Extremely-Weird-Posthuman Utility".

(This is based on the idea that even if the alignment problem is solved such that we know how to specify a goal rigorously to an AI, it doesn't follow that the people who happen to be programming the goal will be selfless. You work in AI so presumably you have practiced rebuttals to this concept; I do not so I'll state my thought but be clear that I expect this is well-worn territory to which I expect you to have a solid answer.)

Why do they need to be selfless? What are the selfish benefits to making the future less Fun for innumerable numbers of posthumans you'll never meet or hear anything about?

(The future light cone is big, and no one human can interact with very much of it. You swamp the selfish desires of every currently-living human before you've even used up the negentropy in one hundredth of a single galaxy. And then what do you do with the rest of the universe? We aren't guaranteed to use the rest of the universe well, but if we use it poorly the explanation probably can't be "selfishness".)

It seems to assume that things like hive-mind species are possible or common, which I don't have information about but maybe you do.

I dunno Nate's reasoning, but AFAIK the hive-mind thing may just be an example, rather than being central to his reasoning on this point.
