Rob Bensinger

Communications lead at MIRI. Unless otherwise indicated, my posts and comments here reflect my own views, and not necessarily my employer's.


2022 MIRI Alignment Discussion
2021 MIRI Conversations
Naturalized Induction

Wiki Contributions

Load More


I can come up with plans for destroying the world without wanting to do it, and other cognitive systems probably can too.

You're changing the topic to "can you do X without wanting Y?", when the original question was "can you do X without wanting anything at all?".

Nate's answer to nearly all questions of the form "can you do X without wanting Y?" is "yes", hence his second claim in the OP: "the wanting-like behavior required to pursue a particular training target X, does not need to involve the AI wanting X in particular".

I do need to answer that question using in a goal-oriented search process. But my goal would be "answer Paul's question", not "destroy the world".

Your ultimate goal would be neither of those things; you're a human, and if you're answering Paul's question it's probably because you have other goals that are served by answering.

In the same way, an AI that's sufficiently good at answering sufficiently hard and varied questions would probably also have goals, and it's unlikely by default that "answer questions" will be the AI's primary goal.

The idea that an area of study is less scientific because the subject is inelegant is a blinkered view of what science is.

See my reply to Bogdan here. The issue isn't "inelegance"; we also lack an inelegant ability to predict or explain how particular ML systems do what they do.

Modern ML is less like modern chemistry, and more like ancient culinary arts and medicine. (Or "ancient culinary arts and medicine shortly after a cultural reboot", such that we have a relatively small number of recently-developed shallow heuristics and facts to draw on, rather than centuries of hard-earned experience.)

The opening sounds a lot like saying "aerodynamics used to be a science until people started building planes."

The reason this analogy doesn't land for me is that I don't think our epistemic position regarding LLMs is similar to, e.g., the Wright brothers' epistemic position regarding heavier-than-air flight.

The point Nate was trying to make with "ML is no longer a science" wasn't "boo current ML that actually works, yay GOFAI that didn't work". The point was exactly to draw a contrast between, e.g., our understanding of heavier-than-air flight and our understanding of how the human brain works. The invention of useful tech that interfaces with the brain doesn't entail that we understand the brain's workings in the way we've long understood flight; it depends on what the (actual or hypothetical) tech is.

Maybe a clearer way of phrasing it is "AI used to be failed science; now it's (mostly, outside of a few small oases) a not-even-attempted science". "Failed science" maybe makes it clearer that the point here isn't to praise the old approaches that didn't work; there's a more nuanced point being made.

Some of Nate’s quick thoughts (paraphrased), after chatting with him:

Nate isn’t trying to say that we have literally zero understanding of deep nets. What he’s trying to do is qualitatively point to the kind of high-level situation we’re in, in part because he thinks there is real interpretability progress, and when you’re working in the interpretability mines and seeing real advances it can be easy to miss the forest for the trees and forget how far we are from understanding what LLMs are doing. (Compared to, e.g., how well we can predict or post-facto-mechanistically-explain a typical system humans have engineered.)

Nobody's been able to call the specific capabilities of systems in advance. Nobody's been able to call the specific exploits in advance. Nobody's been able to build better cognitive algorithms by hand after understanding how the AI does things we can't yet code by hand. There is clearly some other level of understanding that is possible that we lack, and that we once sought, and that only the interpretability folks continue to seek.

E.g., think of that time Neel Nanda figured out how a small transformer does modular arithmetic (AXRP episode). If nobody had ever thought of that algorithm for an adder, we would have thereby learned a new algorithm for an adder. There are things that these AI systems are doing that aren’t just lots of stuff we know; there are levels of organization of understanding that give you the ability to predict how things work outside of the bands where we’ve observed them.

It seems trendy to declare that they never existed in the first place and that that’s all white tower stuff, but Nate thinks this point of view is missing a pretty important and central thread.

The missing thread isn’t trivial to put into words, but it includes things like: 

  • This sounds like the same sort of thing some people would say if they were staring at computer binary for the first time and didn't know about the code behind the scenes: "We have plenty of understanding beyond just how the CPU handles instructions; we understand how memory caching works and we have recognized patterns like the stack and the heap; talking as if there's some deeper level of organization is talking like a theorist when in fact this is an engineering problem." Those types of understanding aren't false, but they aren't the sort of understanding of someone who has comprehended the codebase they're looking at.
  • There are, predictably, things to learn here; the messiness and complexity of the real world doesn’t mean we already know the relevant principles. You don't need to understand everything about how a bird works in order to build an airplane; there are compressible principles behind how birds fly; if you understand what's going on you can build flying devices that have significantly more carrying capacity than a bird, and this holds true even if the practical engineering of an airplane requires a bunch of trial and error and messy engineering work.
  • A mind’s causal structure is allowed to be complicated; we can see the weights, but we don’t thereby have a mastery of the high-level patterns. In the case of humans, neuroscience hasn’t actually worked to give us a mastery of the high-level patterns the human brain is implementing.
  • Mystery is in the map, not in the territory; reductionism works. Not all sciences that can exist, already exist today.

Possibly the above pointers are only useful if you already grok the point we’re trying to make, and isn’t so useful for communicating a new idea; but perhaps not.

I read and responded to some pieces of that post when it came out; I don't know whether Eliezer, Nate, etc. read it, and I'm guessing it didn't shift MIRI, except as one of many data points "person X is now loudly in favor of a pause (and other people seem receptive), so maybe this is more politically tractable than we thought".

I'd say that Kerry Vaughan was the main person who started smashing this Overton window, and this started in April/May/June of 2022. By late December my recollection is that this public conversation was already fully in swing and MIRI had already added our voices to the "stop building toward AGI" chorus. (Though at that stage I think we were mostly doing this on general principle, for lack of any better ideas than "share our actual long-standing views and hope that helps somehow". Our increased optimism about policy solutions mostly came later, in 2023.)

That said, I bet Katja's post had tons of relevant positive effects even if it didn't directly shift MIRI's views.

Remember that MIRI was in the business of poking at theoretical toy problems and trying to get less conceptually confused about how you could in principle cleanly design a reliable, aimable reasoner. MIRI wasn't (and isn't) in the business of issuing challenges to capabilities researchers to build a working water-bucket-filler as soon as possible, and wasn't otherwise in the business of challenging people to race to AGI faster.

It wouldn't have occurred to me that someone might think 'can a deep net fill a bucket of water, in real life, without being dangerously capable' is a crucial question in this context; I'm not sure we ever even had the thought occur in our heads 'when might such-and-such DL technique successfully fill a bucket?'. It would seem just as strange to me as going to check the literature to make sure no GOFAI system ever filled a bucket of water.

(And while I think I understand why others see ChatGPT as a large positive update about alignment's difficulty, I hope it's also obvious why others, MIRI included, would not see it that way.)

Hacky approaches to alignment do count just as much as clean, scrutable, principled approaches -- the important thing is that the AGI transition goes well, not that it goes well and feels clean and tidy in the process. But in this case the messy empirical approach doesn't look to me like it actually lets you build a corrigible AI that can help with a pivotal act.

If general-ish DL methods were already empirically OK at filling water buckets in 2016, just as GOFAI already was in 2016, I suspect we still would have been happy to use the Fantasia example, because it's a simple well-known story that can help make the abstract talk of utility functions and off-switch buttons easier to mentally visualize and manipulate.

(Though now that I've seen the confusion the example causes, I'm more inclined to think that the strawberry problem is a better frame than the Fantasia example.)

I think the old school MIRI cauldron-filling problem pertained to pretty mundane, everyday tasks. No one said at the time that they didn’t really mean that it would be hard to get an AGI to do those things, that it was just an allegory for other stuff like the strawberry problem. They really seemed to believe, and said over and over again, that we didn’t know how to direct a general-purpose AI to do bounded, simple, everyday tasks without it wanting to take over the world. So this should be a big update to people who held that view, even if there are still arguably risks about OOD behavior.

As someone who worked closely with Eliezer and Nate at the time, including working with Eliezer and Nate on our main write-ups that used the cauldron example, I can say that this is definitely not what we were thinking at the time. Rather:

  • The point was to illustrate a weird gap in the expressiveness and coherence of our theories of rational agency: "fill a bucket of water" seems like a simple enough task, but it's bizarrely difficult to just write down a simple formal description of an optimization process that predictably does this (without any major side-effects, etc.).
    • (We can obviously stipulate "this thing is smart enough to do the thing we want, but too dumb to do anything dangerous", but the relevant notion of "smart enough" is not itself formal; we don't understand optimization well enough to formally define agents that have all the cognitive abilities we want and none of the abilities we don't want.)
  • The point of emphasizing "holy shit, this seems so easy and simple and yet we don't see a way to do it!" wasn't to issue a challenge to capabilities researches to go cobble together a real-world AI that can fill a bucket of water without destroying the world. The point was to emphasize that corrigibility, low-impact problem-solving, 'real' satisficing behavior, etc. seem conceptually simple, and yet the concepts have no known formalism.
    • The hope was that someone would see the simple toy problems and go 'what, no way, this sounds easy', get annoyed/nerdsniped, run off to write some equations on a whiteboard, and come back a week or a year later with a formalism (maybe from some niche mathematical field) that works totally fine for this, and makes it easier to formalize lots of other alignment problems in simplified settings (e.g., with unbounded computation).
    • Or failing that, the hope was that someone might at least come up with a clever math hack that solves the immediate 'get the AI to fill the bucket and halt' problem and replaces this dumb-sounding theory question with a slightly deeper theory question.
  • By using a children's cartoon to illustrate the toy problem, we hoped to make it clearer that the genre here is "toy problem to illustrate a weird conceptual issue in trying to define certain alignment properties", not "robotics problem where we show a bunch of photos of factory robots and ask how we can build a good factory robot to refill water receptacles used in industrial applications".

Nate's version of the talk, which is mostly a more polished version of Eliezer's talk, is careful to liberally sprinkle in tons of qualifications like (emphasis added)

  • "... for systems that are sufficiently good at modeling their environment", 
  • 'if the system is smart enough to recognize that shutdown will lower its score',
  • "Relevant safety measures that don’t assume we can always outthink and outmaneuver the system...",

... to make it clearer that the general issue is powerful, strategic optimizers that have high levels of situational awareness, etc., not necessarily 'every system capable enough to fill a bucket of water' (or 'every DL system...').

I think this provides some support

??? What?? It's fine to say that this is a falsified prediction, but how does "Eliezer expected less NLP progress pre-ASI" provide support for "Eliezer thinks solving NLP is a major part of the alignment problem"?

I continue to be baffled at the way you're doing exegesis here, happily running with extremely tenuous evidence for P while dismissing contemporary evidence for not-P, and seeming unconcerned about the fact that Eliezer and Nate apparently managed to secretly believe P for many years without ever just saying it outright, and seeming equally unconcerned about the fact that Eliezer and Nate keep saying that your interpretation of what they said is wrong. (Which I also vouch for from having worked with them for ten years, separate from the giant list of specific arguments I've made. Good grief.)

At the very least, the two claims are consistent.

?? "Consistent" is very different from "supports"! Every off-topic claim by EY is "consistent" with Gallabytes' assertion.

using GOFAI methods

"Nope" to this part. I otherwise like this comment a lot!

The main thing I'm claiming is that MIRI said it would be hard to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans. I think this is simply a distinct concept from the idea of getting an AI to understand human values. 

The key difference is the transparency and legibility of how the values are represented: if you solve the problem of value specification/value identification, that means you have an actual function that can tell you the value of any outcome. If you get an AI that merely understands human values, you can't necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.

Ah, this is helpful clarification! Thanks. :)

I don't think MIRI ever considered this an important part of the alignment problem, and I don't think we expect humanity to solve lots of the alignment problem as a result of having such a tool; but I think I better understand now why you think this is importantly different from "AI ever gets good at NLP at all".

don't know if your essay is the source of the phrase or whether you just titled it

I think I came up with that particular phrase (though not the idea, of course).

  • More "outer alignment"-like issues being given what seems/seemed to me like outsized focus compared to more "inner alignment"-like issues (although there has been a focus on both for as long as I can remember).

In retrospect I think we should have been more explicit about the importance of inner alignment; I think that we didn't do that in our introduction to corrigibility because it wasn't necessary for illustrating the problem and where we'd run into roadblocks.

Maybe a missing piece here is some explanation of why having a formal understanding of corrigibility might be helpful for actually training corrigibility into a system? (Helpful at all, even if it's not sufficient on its own.)

  • The attempts to think of "tricks" seeming to be focused on real-world optimization-targets to point at, rather than ways of extracting help with alignment somehow / trying to find techniques/paths/tricks for obtaining reliable oracles.

Aside from "concreteness can help make the example easier to think about when you're new to the topic", part of the explanation here might be "if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech), not via helping with alignment or building a system that only outputs English-language sentence".

  • Having utility functions so prominently/commonly be the layer of abstraction that is used[4].

I mean, I think utility functions are an extremely useful and basic abstraction. I think it's a lot harder to think about a lot of AI topics without invoking ideas like 'this AI thinks outcome X is better than outcome Y', or 'this AI's preference come with different weights, which can't purely be reduced to what the AI believes'.

Load More