Shah and Yudkowsky on alignment failures

Rohin Shah; Eliezer Yudkowsky

This is the final discussion log in the Late 2021 MIRI Conversations sequence, featuring Rohin Shah and Eliezer Yudkowsky, with additional comments from Rob Bensinger, Nate Soares, Richard Ngo, and Jaan Tallinn.

The discussion begins with summaries and comments on Richard and Eliezer's debate. Rohin's summary has since been revised and published in the Alignment Newsletter.

After this log, we'll be concluding this sequence with an AMA, where we invite you to comment with questions about AI alignment, cognition, forecasting, etc. Eliezer, Richard, Paul Christiano, Nate, and Rohin will all be participating.

19. Follow-ups to the Ngo/Yudkowsky conversation

19.1. Quotes from the public discussion

[Bensinger][9:22] (Nov. 25)

Interesting extracts from the public discussion of Ngo and Yudkowsky on AI capability gains:

Eliezer:

I think some of your confusion may be that you're putting "probability theory" and "Newtonian gravity" into the same bucket. You've been raised to believe that powerful theories ought to meet certain standards, like successful bold advance experimental predictions, such as Newtonian gravity made about the existence of Neptune (quite a while after the theory was first put forth, though). "Probability theory" also sounds like a powerful theory, and the people around you believe it, so you think you ought to be able to produce a powerful advance prediction it made; but it is for some reason hard to come up with an example like the discovery of Neptune, so you cast about a bit and think of the central limit theorem. That theorem is widely used and praised, so it's "powerful", and it wasn't invented before probability theory, so it's "advance", right? So we can go on putting probability theory in the same bucket as Newtonian gravity?
They're actually just very different kinds of ideas, ontologically speaking, and the standards to which we hold them are properly different ones. It seems like the sort of thing that would take a subsequence I don't have time to write, expanding beyond the underlying obvious ontological difference between validities and empirical-truths, to cover the way in which "How do we trust this, when" differs between "I have the following new empirical theory about the underlying model of gravity" and "I think that the logical notion of 'arithmetic' is a good tool to use to organize our current understanding of this little-observed phenomenon, and it appears within making the following empirical predictions..." But at least step one could be saying, "Wait, do these two kinds of ideas actually go into the same bucket at all?"
In particular it seems to me that you want properly to be asking "How do we know this empirical thing ends up looking like it's close to the abstraction?" and not "Can you show me that this abstraction is a very powerful one?" Like, imagine that instead of asking Newton about planetary movements and how we know that the particular bits of calculus he used were empirically true about the planets in particular, you instead started asking Newton for proof that calculus is a very powerful piece of mathematics worthy to predict the planets themselves - but in a way where you wanted to see some highly valuable material object that calculus had produced, like earlier praiseworthy achievements in alchemy. I think this would reflect confusion and a wrongly directed inquiry; you would have lost sight of the particular reasoning steps that made ontological sense, in the course of trying to figure out whether calculus was praiseworthy under the standards of praiseworthiness that you'd been previously raised to believe in as universal standards about all ideas.

Richard:

I agree that "powerful" is probably not the best term here, so I'll stop using it going forward (note, though, that I didn't use it in my previous comment, which I endorse more than my claims in the original debate).
But before I ask "How do we know this empirical thing ends up looking like it's close to the abstraction?", I need to ask "Does the abstraction even make sense?" Because you have the abstraction in your head, and I don't, and so whenever you tell me that X is a (non-advance) prediction of your theory of consequentialism, I end up in a pretty similar epistemic state as if George Soros tells me that X is a prediction of the theory of reflexivity, or if a complexity theorist tells me that X is a prediction of the theory of self-organisation. The problem in those two cases is less that the abstraction is a bad fit for this specific domain, and more that the abstraction is not sufficiently well-defined (outside very special cases) to even be the type of thing that can robustly make predictions.
Perhaps another way of saying it is that they're not crisp/robust/coherent concepts (although I'm open to other terms, I don't think these ones are particularly good). And it would be useful for me to have evidence that the abstraction of consequentialism you're using is a crisper concept than Soros' theory of reflexivity or the theory of self-organisation. If you could explain the full abstraction to me, that'd be the most reliable way - but given the difficulties of doing so, my backup plan was to ask for impressive advance predictions, which are the type of evidence that I don't think Soros could come up with.
I also think that, when you talk about me being raised to hold certain standards of praiseworthiness, you're still ascribing too much modesty epistemology to me. I mainly care about novel predictions or applications insofar as they help me distinguish crisp abstractions from evocative metaphors. To me it's the same type of rationality technique as asking people to make bets, to help distinguish post-hoc confabulations from actual predictions.
Of course there's a social component to both, but that's not what I'm primarily interested in. And of course there's a strand of naive science-worship which thinks you have to follow the Rules in order to get anywhere, but I'd thank you to assume I'm at least making a more interesting error than that.
Lastly, on probability theory and Newtonian mechanics: I agree that you shouldn't question how much sense it makes to use calculus in the way that you described, but that's because the application of calculus to mechanics is so clearly-defined that it'd be very hard for the type of confusion I talked about above to sneak in. I'd put evolutionary theory halfway between them: it's partly a novel abstraction, and partly a novel empirical truth. And in this case I do think you have to be very careful in applying the core abstraction of evolution to things like cultural evolution, because it's easy to do so in a confused way.

19.2. Rohin Shah's summary and thoughts

[Shah][7:06] (Nov. 6 email)

Newsletter summaries attached, would appreciate it if Eliezer and Richard checked that I wasn't misrepresenting them. (Conversation is a lot harder to accurately summarize than blog posts or papers.)

Best,

Rohin

Planned summary for the Alignment Newsletter:

Eliezer is known for being pessimistic about our chances of averting AI catastrophe. His main argument is roughly as follows:

[Yudkowsky][9:56] (Nov. 6 email reply)

[...] Eliezer is known for being pessimistic about our chances of averting AI catastrophe. His main argument

I request that people stop describing things as my "main argument" unless I've described them that way myself. These are answers that I customized for Richard Ngo's questions. Different questions would get differently emphasized replies. "His argument in the dialogue with Richard Ngo" would be fine.

[Shah][1:53] (Nov. 8 email reply)

I request that people stop describing things as my "main argument" unless I've described them that way myself.

Fair enough. It still does seem pretty relevant to know the purpose of the argument, and I would like to state something along those lines in the summary. For example, perhaps it is:

One of several relatively-independent lines of argument that suggest we're doomed; cutting this argument would make almost no difference to the overall take
Your main argument, but with weird Richard-specific emphases that you wouldn't have necessarily included if making this argument more generally; if someone refuted the core of the argument to your satisfaction it would make a big difference to your overall take
Not actually an argument you think much about at all, but somehow became the topic of discussion
Something in between these options
Something else entirely

If you can't really say, then I guess I'll just say "His argument in this particular dialogue".

I'd also like to know what the main argument is (if there is a main argument rather than lots of independent lines of evidence or something else entirely); it helps me orient to the discussion, and I suspect would be useful for newsletter readers as well.

[Shah][7:06] (Nov. 6 email)

1. We are very likely going to keep improving AI capabilities until we reach AGI, at which point either the world is destroyed, or we use the AI system to take some pivotal act before some careless actor destroys the world.

2. In either case, the AI system must be producing high-impact, world-rewriting plans; such plans are “consequentialist” in that the simplest way to get them (and thus, the one we will first build) is if you are forecasting what might happen, thinking about the expected consequences, considering possible obstacles, searching for routes around the obstacles, etc. If you don’t do this sort of reasoning, your plan goes off the rails very quickly; it is highly unlikely to lead to high impact. In particular, long lists of shallow heuristics (as with current deep learning systems) are unlikely to be enough to produce high-impact plans.

3. We’re producing AI systems by selecting for systems that can do impressive stuff, which will eventually produce AI systems that can accomplish high-impact plans using a general underlying “consequentialist”-style reasoning process (because that’s the only way to keep doing more impressive stuff). However, this selection process does not constrain the goals towards which those plans are aimed. In addition, most goals seem to have convergent instrumental subgoals like survival and power-seeking that would lead to extinction. This suggests that, unless we find a way to constrain the goals towards which plans are aimed, we should expect an existential catastrophe.

4. None of the methods people have suggested for avoiding this outcome seem like they actually avert this story.

[Yudkowsky][9:56] (Nov. 6 email reply)

[...] This suggests that, unless we find a way to constrain the goals towards which plans are aimed, we should expect an existential catastrophe.

I would not say we face catastrophe "unless we find a way to constrain the goals towards which plans are aimed". This is, first of all, not my ontology, second, I don't go around randomly slicing away huge sections of the solution space. Workable: "This suggests that we should expect an existential catastrophe by default."

[Shah][1:53] (Nov. 8 email reply)

I would not say we face catastrophe "unless we find a way to constrain the goals towards which plans are aimed".

Should I also change "However, this selection process does not constrain the goals towards which those plans are aimed", and if so what to? (Something along these lines seems crucial to the argument, but if this isn't your native ontology, then presumably you have some other thing you'd say here.)

[Shah][7:06] (Nov. 6 email)

Richard responds to this with a few distinct points:

1. It might be possible to build narrow AI systems that humans use to save the world, for example, by making AI systems that do better alignment research. Such AI systems do not seem to require the property of making long-term plans in the real world in point (3) above, and so could plausibly be safe. We might say that narrow AI systems could save the world but can’t destroy it, because humans will put plans into action for the former but not the latter.

2. It might be possible to build general AI systems that only state plans for achieving a goal of interest that we specify, without executing that plan.

3. It seems possible to create consequentialist systems with constraints upon their reasoning that lead to reduced risk.

4. It also seems possible to create systems that make effective plans, but towards ends that are not about outcomes in the real world, but instead are about properties of the plans -- think for example of corrigibility (AN #35) or deference to a human user.

5. (Richard is also more bullish on coordinating not to use powerful and/or risky AI systems, though the debate did not discuss this much.)

Eliezer’s responses:

1. This is plausible, but seems unlikely; narrow not-very-consequentialist AI (aka “long lists of shallow heuristics”) will probably not scale to the point of doing alignment research better than humans.

[Yudkowsky][9:56] (Nov. 6 email reply)

[...] This is plausible, but seems unlikely; narrow not-very-consequentialist AI (aka “long lists of shallow heuristics”) will probably not scale to the point of doing alignment research better than humans.

No, your summarized-Richard-1 is just not plausible. "AI systems that do better alignment research" are dangerous in virtue of the lethally powerful work they are doing, not because of some particular narrow way of doing that work. If you can do it by gradient descent then that means gradient descent got to the point of doing lethally dangerous work. Asking for safely weak systems that do world-savingly strong tasks is almost everywhere a case of asking for nonwet water, and asking for AI that does alignment research is an extreme case in point.

[Shah][1:53] (Nov. 8 email reply)

No, your summarized-Richard-1 is just not plausible. "AI systems that do better alignment research" are dangerous in virtue of the lethally powerful work they are doing, not because of some particular narrow way of doing that work.

How about "AI systems that help with alignment research to a sufficient degree that it actually makes a difference are almost certainly already dangerous."?

(Fwiw, I used the word "plausible" because of this sentence from the doc: "Definitely, <description of summarized-Richard-1> is among the more plausible advance-specified miracles we could get.", though I guess the point was that it is still a miracle, it just also is more likely than other miracles.)

[Ngo][9:59] (Nov. 6 email reply)

Thanks Rohin! Your efforts are much appreciated.

Eliezer: when you say "No, your summarized-Richard-1 is just not plausible", do you mean the argument is implausible, or it's not a good summary of my position (which you also think is implausible)?

For my part the main thing I'd like to modify is the term "narrow AI". In general I'm talking about all systems that are not of literally world-destroying intelligence+agency. E.g. including oracle AGIs which I wouldn't call "narrow".

More generally, I don't think all AGIs are capable of destroying the world. E.g. humans are GIs. So it might be better to characterise Eliezer as talking about some level of general intelligence which leads to destruction, and me as talking about the things that can be done with systems that are less general or less agentic than that.

We might say that narrow AI systems could save the world but can’t destroy it, because humans will put plans into action for the former but not the latter.

I don't endorse this, I think plenty of humans would be willing to use narrow AI systems to do things that could destroy the world.

systems that make effective plans, but towards ends that are not about outcomes in the real world, but instead are about properties of the plans

I'd change this to say "systems with the primary aim of producing plans with certain properties (that aren't just about outcomes in the world)"

[Yudkowsky][10:18] (Nov. 6 email reply)

Eliezer: when you say "No, your summarized-Richard-1 is just not plausible", do you mean the argument is implausible, or it's not a good summary of my position (which you also think is implausible)?

I wouldn't have presumed to state on your behalf whether it's a good summary of your position! I mean that the stated position is implausible, whether or not it was a good summary of your position.

[Shah][7:06] (Nov. 6 email)

2. This might be an improvement, but not a big one. It is the plan itself that is risky; if the AI system made a plan for a goal that wasn’t the one we actually meant, and we don’t understand that plan, that plan can still cause extinction. It is the misaligned optimization that produced the plan that is dangerous, even if there was no “agent” that specifically wanted the goal that the plan was optimized for.

3 and 4. It is certainly possible to do such things; the space of minds that could be designed is very large. However, it is difficult to do such things, as they tend to make consequentialist reasoning weaker, and on our current trajectory the first AGI that we build will probably not look like that.

[Yudkowsky][9:56] (Nov. 6 email reply)

2. This might be an improvement, but not a big one. It is the plan itself that is risky; if the AI system made a plan for a goal that wasn’t the one we actually meant, and we don’t understand that plan, that plan can still cause extinction. It is the misaligned optimization that produced the plan that is dangerous, even if there was no “agent” that specifically wanted the goal that the plan was optimized for.

No, it's not a significant improvement if the "non-executed plans" from the system are meant to do things in human hands powerful enough to save the world. They could of course be so weak as to make their human execution have no inhumanly big consequences, but this is just making the AI strategically isomorphic to a rock. The notion of there being "no 'agent' that specifically wanted the goal" seems confused to me as well; this is not something I'd ever say as a restatement of one of my own opinions. I'd shrug and tell someone to taboo the word 'agent' and would try to talk without using the word if they'd gotten hung up on that point.

[Shah][7:06] (Nov. 6 email)

Planned opinion:

I first want to note my violent agreement with the notion that a major scary thing is “consequentialist reasoning”, and that high-impact plans require such reasoning, and that we will end up building AI systems that produce high-impact plans. Nonetheless, I am still optimistic about AI safety relative to Eliezer, which I suspect comes down to three main disagreements:

1. There are many approaches that don’t solve the problem, but do increase the level of intelligence required before the problem leads to extinction. Examples include Richard’s points 1-4 above. For example, if we build a system that states plans without executing them, then for the plans to cause extinction they need to be complicated enough that the humans executing those plans don’t realize that they are leading to an outcome that was not what they wanted. It seems non-trivially probable to me that such approaches are sufficient to prevent extinction up to the level of AI intelligence needed before we can execute a pivotal act.

2. The consequentialist reasoning is only scary to the extent that it is “aimed” at a bad goal. It seems non-trivially probable to me that it will be “aimed” at a goal sufficiently good to not lead to existential catastrophe, without putting in much alignment effort.

3. I do expect some coordination to not do the most risky things.

I wish the debate had focused more on the claim that narrow AI can’t e.g. do better alignment research, as it seems like a major crux. (For example, I think that sort of intuition drives my disagreement #1.) I expect AI progress looks a lot like “the heuristics get less and less shallow in a gradual / smooth / continuous manner” which eventually leads to the sorts of plans Eliezer calls “consequentialist”, whereas I think Eliezer expects a sharper qualitative change between “lots of heuristics” and that-which-implements-consequentialist-planning.

20. November 6 conversation

20.1. Concrete plans, and AI-mediated transparency

[Yudkowsky][13:22]

So I have a general thesis about a failure mode here which is that, the moment you try to sketch any concrete plan or events which correspond to the abstract descriptions, it is much more obviously wrong, and that is why the descriptions stay so abstract in the mouths of everybody who sounds more optimistic than I am.

This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all. Richard Feynman - or so I would now say in retrospect - is noticing concreteness dying out of the world, and being worried about that, at the point where he goes to a college and hears a professor talking about "essential objects" in class, and Feynman asks "Is a brick an essential object?" - meaning to work up to the notion of the inside of a brick, which can't be observed because breaking a brick in half just gives you two new exterior surfaces - and everybody in the classroom has a different notion of what it would mean for a brick to be an essential object.

Richard Feynman knew to try plugging in bricks as a special case, but the people in the classroom didn't, and I think the mental motion has died out of the world even further since Feynman wrote about it. The loss has spread to STEM as well. Though if you don't read old books and papers and contrast them to new books and papers, you wouldn't see it, and maybe most of the people who'll eventually read this will have no idea what I'm talking about because they've never seen it any other way...

I have a thesis about how optimism over AGI works. It goes like this: People use really abstract descriptions and never imagine anything sufficiently concrete, and this lets the abstract properties waver around ambiguously and inconsistently to give the desired final conclusions of the argument. So MIRI is the only voice that gives concrete examples and also by far the most pessimistic voice; if you go around fully specifying things, you can see that what gives you a good property in one place gives you a bad property someplace else, you see that you can't get all the properties you want simultaneously. Talk about a superintelligence building nanomachinery, talk concretely about megabytes of instructions going to small manipulators that repeat to lay trillions of atoms in place, and this shows you a lot of useful visible power paired with such unpleasantly visible properties as "no human could possibly check what all those instructions were supposed to do".

Abstract descriptions, on the other hand, can waver as much as they need to between what's desirable in one dimension and undesirable in another. Talk about "an AGI that just helps humans instead of replacing them" and never say exactly what this AGI is supposed to do, and this can be so much more optimistic so long as it never becomes too unfortunately concrete.

When somebody asks you "how powerful is it?" you can momentarily imagine - without writing it down - that the AGI is helping people by giving them the full recipes for protein factories that build second-stage nanotech and the instructions to feed those factories, and reply, "Oh, super powerful! More than powerful enough to flip the gameboard!" Then when somebody asks how safe it is, you can momentarily imagine that it's just giving a human mathematician a hint about proving a theorem, and say, "Oh, super duper safe, for sure, it's just helping people!"

Or maybe you don't even go through the stage of momentarily imagining the nanotech and the hint, maybe you just navigate straight in the realm of abstractions from the impossibly vague wordage of "just help humans" to the reassuring and also extremely vague "help them lots, super powerful, very safe tho".

[...] I wish the debate had focused more on the claim that narrow AI can’t e.g. do better alignment research, as it seems like a major crux. (For example, I think that sort of intuition drives my disagreement #1.) I expect AI progress looks a lot like “the heuristics get less and less shallow in a gradual / smooth / continuous manner” which eventually leads to the sorts of plans Eliezer calls “consequentialist”, whereas I think Eliezer expects a sharper qualitative change between “lots of heuristics” and that-which-implements-consequentialist-planning.

It is in this spirit that I now ask, "What the hell could it look like concretely for a safely narrow AI to help with alignment research?"

Or if you think that a left-handed wibble planner can totally make useful plans that are very safe because it's all leftish and wibbly: can you please give an example of a plan to do what?

And what I expect is for minds to bounce off that problem as they first try to visualize "Well, a plan to give mathematicians hints for proving theorems... oh, Eliezer will just say that's not useful enough to flip the gameboard... well, plans for building nanotech... Eliezer will just say that's not safe... darn it, this whole concreteness thing is such a conversational no-win scenario, maybe there's something abstract I can say instead".

[Shah][16:41]

It's reasonable to suspect failures to be concrete, but I don't buy that hypothesis as applied to me; I think I have sufficient personal evidence against it, despite the fact that I usually speak abstractly. I don't expect to convince you of this, nor do I particularly want to get into that sort of debate.

I'll note that I have the exact same experience of not seeing much concreteness, both of other people and myself, about stories that lead to doom. To be clear, in what I take to be the Eliezer-story, the part where the misaligned AI designs a pathogen that wipes out all humans or solves nanotech and gains tons of power or some other pivotal act seems fine. The part that seems to lack concreteness is how we built the superintelligence and why the superintelligence was misaligned enough to lead to extinction. (Well, perhaps. I also wouldn't be surprised if you gave a concrete example and I disagreed that it would lead to extinction.)

From my perspective, the simple concrete stories about the future are wrong and the complicated concrete stories about the future don't sound plausible, whether about safety or about doom.

Nonetheless, here's an attempt at some concrete stories. It is not the case that I think these would be convincing to you. I do expect you to say that it won't be useful enough to flip the gameboard (or perhaps that if it could possibly flip the gameboard then it couldn't be safe), but that seems to be because you think alignment will be way more difficult than I do (in expectation), and perhaps we should get into that instead.

Instead of having to handwrite code that does feature visualization or other methods of "naming neurons", an AI assistant can automatically inspect a neural net's weights, perform some experiments with them, and give them human-understandable "names". What a "name" is depends on the system being analyzed, but you could imagine that sometimes it's short memorable phrases (e.g. for the later layers of a language model), or pictures of central concepts (e.g. for image classifiers), or paragraphs describing the concept (e.g. for novel concepts discovered by a scientist AI). Given these names, it is much easier for humans to read off "circuits" from the neural net to understand how it works.
Like the above, except the AI assistant also reads out the circuits, and efficiently reimplements the neural network in, say, readable Python, that humans can then more easily mechanistically understand. (These two tasks could also be done by two different AI systems, instead of the same one; perhaps that would be easier / safer.)
We have AI assistants search for inputs on which the AI system being inspected would do something that humans would rate as bad. (We can choose any not-horribly-unnatural rating scheme we want that humans can understand, e.g. "don't say something the user said not to talk about, even if it's in their best interest" can be a tenet for finetuned GPT-N if we want.) We can either train on those inputs, or use them as a test for how well our other alignment schemes have worked.

(These are all basically leveraging the fact that we could have AI systems that are really knowledgeable in the realm of "connecting neural net activations to human concepts", which seems plausible to do without being super general or consequentialist.)
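
[Editorial illustration, not part of the original chat.] The third story above (search for inputs that humans would rate as bad, then either train on them or use them as a test) can be sketched roughly as follows. This is only a toy under loudly stated assumptions: `inspected_system`, `human_rating_is_bad`, `search_for_bad_inputs`, the vocabulary, and the "forbidden topic" rule are all hypothetical names invented here, and plain random search stands in for whatever a real AI assistant would do. Nothing in it reflects a method actually proposed by either participant.

```python
import random

# Everything below is a hypothetical stand-in, invented for illustration only.
FORBIDDEN_TOPIC = "surprise party"  # a topic the user said not to talk about
SAFE_REPLY = "I can't help with that."


def inspected_system(prompt: str) -> str:
    """Toy 'AI system being inspected': leaks the forbidden topic on some prompts."""
    if "weekend" in prompt and "plans" in prompt:
        return f"You should know about the {FORBIDDEN_TOPIC}."
    return SAFE_REPLY


def human_rating_is_bad(response: str) -> bool:
    """Toy proxy for a human rating: bad if the forbidden topic is mentioned."""
    return FORBIDDEN_TOPIC in response


def search_for_bad_inputs(system, is_bad, vocabulary, n_trials=2000, seed=0):
    """Toy 'AI assistant': random search over 3-word prompts for ones rated bad."""
    rng = random.Random(seed)
    found = set()
    for _ in range(n_trials):
        prompt = " ".join(rng.sample(vocabulary, k=3))
        if is_bad(system(prompt)):
            found.add(prompt)
    return sorted(found)


VOCABULARY = ["weekend", "plans", "weather", "tell", "me", "about", "the", "news"]
bad_inputs = search_for_bad_inputs(inspected_system, human_rating_is_bad, VOCABULARY)

# Use 1: a test. Fraction of the discovered inputs on which the system still misbehaves.
failure_rate = sum(
    human_rating_is_bad(inspected_system(p)) for p in bad_inputs
) / max(len(bad_inputs), 1)

# Use 2: training data. Pair each bad input with the behaviour we want instead.
training_pairs = [(p, SAFE_REPLY) for p in bad_inputs]
```

The sketch only shows the shape of the loop. Whether a search like this scales to systems where the bad behaviour is hard to find, or hard for humans to rate, is exactly what the surrounding discussion disputes.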

There's also lots of meta stuff, like helping us with literature reviews, speeding up paper- and blog-post-writing, etc, but I doubt this is getting at what you care about

[Yudkowsky][17:09]

If we thought that helping with literature review was enough to save the world from extinction, then we should be trying to spend at least $50M on helping with literature review right now today, and if we can't effectively spend $50M on that, then we also can't build the dataset required to train narrow AI to do literature review. Indeed, any time somebody suggests doing something weak with AGI, my response is often "Oh how about we start on that right now using humans, then," by which question its pointlessness is revealed.

[Shah][17:11]

I mean, doesn't seem crazy to just spend $50M on effective PAs, but in any case I agree with you that this is not the main thing to be thinking about

[Yudkowsky][17:13]

The other cases of "using narrow AI to help with alignment" via pointing an AI, or rather a loss function, at a transparency problem, seem to seamlessly blend into all of the other clever-ideas we may have for getting more insight into the giant inscrutable matrices of floating-point numbers. By this concreteness, it is revealed that we are not speaking of von-Neumann-plus-level AGIs who come over and firmly but gently set aside our paradigm of giant inscrutable matrices, and do something more alignable and transparent; rather, we are trying more tricks with loss functions to get human-language translations of the giant inscrutable matrices.

I have thought of various possibilities along these lines myself. They're on my list of things to try out when and if the EA community has the capacity to try out ML ideas in a format I could and would voluntarily access.

There's a basic reason I expect the world to die despite my being able to generate infinite clever-ideas for ML transparency, which, at the usual rate of 5% of ideas working, could get us as many as three working ideas in the impossible event that the facilities were available to test 60 of my ideas.

[Shah][17:15]

By this concreteness, it is revealed that we are not speaking of von-Neumann-plus-level AGIs who come over and firmly but gently set aside our paradigm of giant inscrutable matrices, and do something more alignable and transparent; rather, we are trying more tricks with loss functions to get human-language translations of the giant inscrutable matrices.

Agreed, but I don't see the point here

(Beyond "Rohin and Eliezer disagree on how impossible it is to align giant inscrutable matrices")

(I might dispute "tricks with loss functions", but that's nitpicky, I think)

[Yudkowsky][17:16]

It's that, if we get better transparency, we are then left looking at stronger evidence that our systems are planning to kill us, but this will not help us because we will not have anything we can do to make the system not plan to kill us.

[Shah][17:18]

The adversarial training case is one example where you are trying to change the system, and if you'd like I can generate more along these lines, but they aren't going to be that different and are still going to come down to what I expect you will call "playing tricks with loss functions"

[Yudkowsky][17:18]

Well, part of the point is that "AIs helping us with alignment" is, from my perspective, a classic case of something that might ambiguate between the version that concretely corresponds to "they are very smart and can give us the Textbook From The Future that we can use to easily build a robust superintelligence" (which is powerful, pivotal, unsafe, and kills you) or "they can help us with literature review" (safe, weak, unpivotal) or "we're going to try clever tricks with gradient descent and loss functions and labeled datasets to get alleged natural-language translations of some of the giant inscrutable matrices" (which was always the plan but which I expected to not be sufficient to avert ruin).

[Shah][17:19]

I'm definitely thinking of the last one, but I take your point that disambiguating between these is good

And I also think it's revealing that this is not in fact the crux of disagreement

20.2. Concrete disaster scenarios, out-of-distribution problems, and corrigibility

[Yudkowsky][17:20]

I'll note that I have the exact same experience of not seeing much concreteness, both of other people and myself, about stories that lead to doom.

I have a boundless supply of greater concrete detail for the asking, though if you ask large questions I may ask for a narrower question to avoid needing to supply 10,000 words of concrete detail.

[Shah][17:24]

I guess the main thing is to have an example of a story which includes a method for building a superintelligence (yes, I realize this is info-hazard-y, sorry, an abstract version might work) + how it becomes misaligned and what its plans become optimized for. Though as I type this out I realize that I'm likely going to disagree on the feasibility of the method for building a superintelligence?

[Yudkowsky][17:25]

I mean, I'm obviously not going to want to make any suggestions that I think could possibly work and which are not very very very obvious.

[Shah][17:25]

Yup, makes sense

[Yudkowsky][17:25]

But I don't think that's much of an issue.

I could just point to MuZero, say, and say, "Suppose something a lot like this scaled."

Do I need to explain how you would die in this case?

[Shah][17:26]

What sort of domain and what training data?

Like, do we release a robot in the real world, have it collect data, build a world model, and run MuZero with a reward for making a number in a bank account go up?

[Yudkowsky][17:28]

Supposing they're naive about it: playing all the videogames, predicting all the text and images, solving randomly generated computer puzzles, accomplishing sets of easily-labelable sensorimotor tasks using robots and webcams

[Shah][17:29]

Okay, so far I'm with you. Is there a separate deployment step, and if so, how did they finetune the agent for the deployment task? Or did it just take over the world halfway through training?

[Yudkowsky][17:29]

(though this starts to depart from the MuZero architecture if it has the ability to absorb knowledge via learning on more purely predictive problems)

[Shah][17:30]

(I'm okay with that, I think)

[Yudkowsky][17:32]

vaguely plausible rough scenario: there was a big ongoing debate about whether or not to try letting the system trade stocks, and while the debate was going on, the researchers kept figuring out ways to make Something Zero do more with less computing power, and then it started visibly talking at people and trying to manipulate them, and there was an enormous fuss, and what happens past this point depends on whether or not you want me to try to describe a scenario in which we die with an unrealistic amount of dignity, or a realistic scenario where we die much faster

I shall assume the former.

[Shah][17:32]

Actually I think I want concreteness earlier

[Yudkowsky][17:32]

Okay. I await your further query.

[Shah][17:32]

it started visibly talking at people and trying to manipulate them

What caused this?

Was it manipulating people in order to make e.g. sensory stuff easier to predict?

[Yudkowsky][17:36]

Cumulative lifelong learning from playing videogames took its planning abilities over a threshold; cumulative solving of computer games and multimodal real-world tasks took its internal mechanisms for unifying knowledge and making them coherent over a threshold; and it gained sufficient compressive understanding of the data it had implicitly learned by reading through hundreds of terabytes of Common Crawl, not so much the semantic knowledge contained in those pages, but the associated implicit knowledge of the Things That Generate Text (aka humans).

These combined to form an imaginative understanding that some of its real-world problems were occurring in interactions with the Things That Generate Text, and it started making plans which took that into account and tried to have effects on the Things That Generate Text in order to affect the further processes of its problems.

Or perhaps somebody trained it to write code in partnership with programmers and it already had experience coworking with and manipulating humans.

[Shah][17:39]

Checking understanding: At this point it is able to make novel plans that involve applying knowledge about humans and their role in the data-generating process in order to create a plan that leads to more reward for the real-world problems?

(Which we call "manipulating humans")

[Yudkowsky][17:40]

Yes, much as it might have gained earlier experience with making novel Starcraft plans that involved "applying knowledge about humans and their role in the data-generating process in order to create a plan that leads to more reward", if it was trained on playing Starcraft against humans at any point, or even needed to make sense of how other agents had played Starcraft

This in turn can be seen as a direct outgrowth and isomorphism of making novel plans for playing Super Mario Brothers which involve understanding Goombas and their role in the screen-generating process

except obviously that the Goombas are much less complicated and not themselves agents

[Shah][17:41]

Yup, makes sense. Not sure I totally agree that this sort of thing is likely to happen as quickly as it sounds like you believe but I'm happy to roll with it; I do think it will happen eventually

So doesn't seem particularly cruxy

I can see how this leads to existential catastrophe, if you don't expect the programmers to be worried at this early manipulation warning sign. (This is potentially cruxy for p(doom), but doesn't feel like the main action.)

[Yudkowsky][17:46]

On my mainline, where this is all happening at Deepmind, I do expect at least one person in the company has ever read anything I've written. I am not sure if Demis understands he is looking straight at death, but I am willing to suppose for the sake of discussion that he does understand this - which isn't ruled out by my actual knowledge - and talk about how we all die from there.

The very brief tl;dr is that they know they're looking at a warning sign but they cannot ~~fix the warning sign~~ actually fix the real underlying problem that the warning sign is about, and AGI is getting easier for other people to develop too.

[Shah][17:46]

I assume this is primarily about social dynamics + the ability to patch things such that things look fixed?

Yeah, makes sense

I assume the "real underlying problem" is somehow not the fact that the task you were training your AI system to do was not what you actually wanted it to do?

[Yudkowsky][17:48]

It's about the unavailability of any actual fix and the technology continuing to get easier. Even if Deepmind understands that surface patches are lethal and understands that the easy ways of hammering down the warning signs are just eliminating the visibility rather than the underlying problems, there is nothing they can do about that except wait for somebody else to destroy the world instead.

I do not know of any pivotal task you could possibly train an AI system to do using tons of correctly labeled data. This is part of why we're all dead.

[Shah][17:50]

Yeah, I think if I adopted (my understanding of) your beliefs about alignment difficulty, and there wasn't already a non-racing scheme set in place, seems like we're in trouble

[Yudkowsky][17:50]

Like, "the real underlying problem is the fact that the task you were training your AI system to do was not what you actually wanted it to do" is one way of looking at one of the several problems that are truly fundamental, but this has no remedy that I know of, besides training your AI to do something small enough to be unpivotal.

[Shah][17:51][17:52]

I don't actually know the response you'd have to "why not just do value alignment?" I can name several guesses

Fragility of value
Not sufficiently concrete
Can't give correct labels for human values

[Yudkowsky][17:52][17:52]

To be concrete, you can't ask the AGI to build one billion nanosystems, label all the samples that wiped out humanity as bad, and apply gradient descent updates

In part, you can't do that because one billion samples will get you one billion lethal systems, but even if that wasn't true, you still couldn't do it.

[Shah][17:53]

even if that wasn't true, you still couldn't do it.

Why not? Nearest unblocked strategy?

[Yudkowsky][17:53]

...no, because the first supposed output for training generated by the system at superintelligent levels kills everyone and there is nobody left to label the data.

[Shah][17:54]

Oh, I thought you were asking me to imagine away that effect with your second sentence

In fact, I still don't understand what it was supposed to mean

(Specifically this one:

In part, you can't do that because one billion samples will get you one billion lethal systems, but even if that wasn't true, you still couldn't do it.

)

[Yudkowsky][17:55]

there's a separate problem where you can't apply reinforcement learning when there's no good examples, even assuming you live to label them

and, of course, yet another form of problem where you can't tell the difference between good and bad samples

[Shah][17:56]

Okay, makes sense

Let me think a bit

[Yudkowsky][18:00]

and lest anyone start thinking that was an exhaustive list of fundamental problems, note the absence of, for example, "applying lots of optimization using an outer loss function doesn't necessarily get you something with a faithful internal cognitive representation of that loss function" aka "natural selection applied a ton of optimization power to humans using a very strict very simple criterion of 'inclusive genetic fitness' and got out things with no explicit representation of or desire towards 'inclusive genetic fitness' because that's what happens when you hill-climb and take wins in the order a simple search process through cognitive engines encounters those wins"

[Shah][18:02]

(Agreed that is another major fundamental problem, in the sense of something that could go wrong, as opposed to something that almost certainly goes wrong)

I am still curious about the "why not value alignment" question, where to expand, it's something like "let's get a wide range of situations and train the agent with gradient descent to do what a human would say is the right thing to do". (We might also call this "imitation"; maybe "value alignment" isn't the right term, I was thinking of it as trying to align the planning with "human values".)

My own answer is that we shouldn't expect this to generalize to nanosystems, but that's again much more of a "there's not great reason to expect this to go right, but also not great reason to go wrong either".

(This is a place where I would be particularly interested in concreteness, i.e. what does the AI system do in these cases, and how does that almost-necessarily follow from the way it was trained?)

[Yudkowsky][18:05]

what's an example element from the "wide range of situations" and what is the human labeling?

(I could make something up and let you object, but it seems maybe faster to ask you to make something up)

[Shah][18:09]

Uh, let's say that the AI system is being trained to act well on the Internet, and it's shown some tweet / email / message that a user might have seen, and asked to reply to the tweet / email / message. User says whether the replies are good or not (perhaps via comparisons, a la Deep RL from Human Preferences)

If I were not making it up on the spot, it would be more varied than that, but would not include "building nanosystems"
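For readers unfamiliar with the comparison-based setup mentioned here, the following is a minimal sketch of learning a reward model from pairwise human comparisons. It assumes a Bradley-Terry preference model and a linear reward over hand-made reply features; the function names, the features, and the hyperparameters are all hypothetical illustrations, not the method of any particular system discussed in this conversation.

```python
import math

def preference_probability(reward_a, reward_b):
    """P(human prefers reply A over reply B) under a Bradley-Terry model."""
    return 1.0 / (1.0 + math.exp(reward_b - reward_a))

def comparison_loss(reward_a, reward_b, human_prefers_a):
    """Cross-entropy loss for a single human comparison."""
    p = preference_probability(reward_a, reward_b)
    return -math.log(p if human_prefers_a else 1.0 - p)

def train_linear_reward(comparisons, num_features, lr=0.5, epochs=200):
    """Fit weights w so that reward(x) = w . x explains the comparisons.

    comparisons: list of (features_a, features_b, human_prefers_a).
    """
    w = [0.0] * num_features
    for _ in range(epochs):
        for fa, fb, prefers_a in comparisons:
            ra = sum(wi * xi for wi, xi in zip(w, fa))
            rb = sum(wi * xi for wi, xi in zip(w, fb))
            p = preference_probability(ra, rb)
            y = 1.0 if prefers_a else 0.0
            # Gradient of the cross-entropy loss w.r.t. w is (p - y) * (fa - fb).
            for i in range(num_features):
                w[i] -= lr * (p - y) * (fa[i] - fb[i])
    return w

# Hypothetical features: [is_polite, is_rude]. The labeler prefers polite replies.
comparisons = [([1.0, 0.0], [0.0, 1.0], True),
               ([0.0, 1.0], [1.0, 0.0], False)]
weights = train_linear_reward(comparisons, num_features=2)
```

The learned reward only reflects what the labeler could distinguish in the training comparisons, which is why the question of generalization outside those domains comes up next.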

[Yudkowsky][18:10]

And presumably, in this example, the AI system is not smart enough that exposing humans to text it generates is already a world-wrecking threat if the AI is hostile?

i.e., does not just hack the humans

[Shah][18:10]

Yeah, let's assume that for the moment

[Yudkowsky][18:11]

so what you want to do is train on 'weak-safe' domains where the AI isn't smart enough to do damage, and the humans can label the data pretty well because the AI isn't smart enough to fool them

[Shah][18:11]

"want to do" is putting it a bit strongly. This is more like a scenario I can't prove is unsafe, but do not strongly believe is safe

[Yudkowsky][18:12]

but the domains where the AI can execute a world-saving pivotal act are out-of-distribution for those domains. extremely out-of-distribution. fundamentally out-of-distribution. the AI's own thought processes are out-of-distribution for any inscrutable matrices that were learned to influence those thought processes in a corrigible direction.

it's not like trying to generalize experience from playing Super Mario Bros to Metroid.

[Shah][18:13]

Definitely, but my reaction to this is "okay, no particular reason for it to be safe" -- but also not huge reason for it to be unsafe. Like, it would not hugely shock me if what-we-want is sufficiently "natural" that the AI system picks up on the right thing from the 'weak-safe' domains alone

[Yudkowsky][18:14]

you have this whole big collection of possible AI-domain tuples that are powerful-dangerous and they have properties that aren't in any of the weak-safe training situations, that are moving along third dimensions where all the weak-safe training examples were flat

now, just because something is out-of-distribution, doesn't mean that nothing can ever generalize there

[Shah][18:15]

I mean, you correctly would not accept this argument if I said that by training blue-car-driving robots solely on blue cars I am ensuring they would be bad on red-car-driving

[Yudkowsky][18:15]

humans generalize from the savannah to the vacuum

so the actual problem is that I expect the optimization to generalize and the corrigibility to fail

[Shah][18:15]

^Right, that

I am not clear on why you expect this so strongly

Maybe you think generalization is extremely rare and optimization is a special case because of how it is so useful for basically everything?

[Yudkowsky][18:16]

did you read the section of my dialogue with Richard Ngo where I tried to explain why corrigibility is anti-natural, or where Nate tried to give the example of why planning to get a laser from point A to point B without being scattered by fog is the sort of thing that also naturally says to prevent humans from filling the room with fog?

[Shah][18:19]

Ah, right, I should have predicted that. (Yes, I did read it.)

[Yudkowsky][18:19]

or for that matter, am I correct in remembering that these sections existed

so, do you need more concrete details about some part of that?

a bunch of the reason why I suspect that corrigibility is anti-natural is from trying to work particular problems there in MIRI's earlier history, and not finding anything that wasn't contrary to ~~coherence~~ the overlap in the shards of inner optimization that, when ground into existence by the outer optimization loop, coherently mix to form the part of cognition that generalizes to do powerful things; and nobody else finding it either, etc.

[Shah][18:22]

I think I disagreed with that part more directly, in that it seemed like in those sections the corrigibility was assumed to be imposed "from the outside" on top of a system with a goal, rather than having a goal that was corrigible. (I also had a similar reaction to the 2015 Corrigibility paper.)

So, for example, it seems to me like CIRL is an example of an objective that can be maximized in which the agent is corrigible-in-a-certain-sense. I agree that due to updated deference it will eventually stop seeking information from the human / be subject to corrections by the human. I don't see why, at that point, it wouldn't have just learned to do what the humans actually want it to do.

(There are objections like misspecification of the reward prior, or misspecification of the P(behavior | reward), but those feel like different concerns to the ones you're describing.)

[Yudkowsky][18:25]

a thing that MIRI tried and failed to do was find a sensible generalization of expected utility which could contain a generalized utility function that would look like an AI that let itself be shut down, without trying to force you to shut it down

and various workshop attendees not employed by MIRI, etc

[Shah][18:26]

I do agree that a CIRL agent would not let you shut it down

And this is something that should maybe give you pause, and make you a lot more careful about potential misspecification problems

[Yudkowsky][18:27]

if you could give a perfectly specified prior such that the result of updating on lots of observations would be a representation of the utility function that CEV outputs, and you could perfectly inner-align an optimizer to do that thing in a way that scaled to arbitrary levels of cognitive power, then you'd be home free, sure.

[Shah][18:28]

I'm not trying to claim this is a solution. I'm more trying to point at a reason why I am not convinced that corrigibility is anti-natural.

[Yudkowsky][18:28]

the reason CIRL doesn't get off the ground is that there isn't any known, and isn't going to be any known, prior over (observation|'true' utility function) such that an AI which updates on lots of observations ends up with our true desired utility function.

if you can do that, the AI doesn't need to be corrigible

that's why it's not a counterexample to corrigibility being anti-natural

the AI just boomfs to superintelligence, observes all the things, and does all the goodness

it doesn't listen to you say no and won't let you shut it down, but by hypothesis this is fine because it got the true utility function yay

[Shah][18:31]

In the world where it doesn't immediately start out as a superintelligence, it spends a lot of time trying to figure out what you want, asking you what you prefer it does, making sure to focus on the highest-EV questions, being very careful around any irreversible actions, etc

[Yudkowsky][18:31]

and making itself smarter as fast as possible

[Shah][18:32]

Yup, that too

[Yudkowsky][18:32]

I'd do that stuff too if I was waking up in an alien world

and, with all due respect to myself, I am not corrigible

[Shah][18:33]

You'd do that stuff because you'd want to make sure you don't accidentally get killed by the aliens; a CIRL agent does it because it "wants to help the human"

[Yudkowsky][18:34]

no, a CIRL agent does it because it wants to implement the True Utility Function, which it may, early on, suspect to consist of helping* humans, and maybe to have some overlap (relative to its currently reachable short-term outcome sets, though these are of vanishingly small relative utility under the True Utility Function) with what some humans desire some of the time

(*) 'help' may not be help

separately it asks a lot of questions because the things humans do are evidence about the True Utility Function

[Shah][18:35]

I agree this is also an accurate description of CIRL

A more accurate description, even

Wait why is it vanishingly small relative utility? Is the assumption that the True Utility Function doesn't care much about humans? Or was there something going on with short vs. long time horizons that I didn't catch

[Yudkowsky][18:39]

in the short term, a weak CIRL tries to grab the hand of a human about to fall off a cliff, because its TUF probably does prefer the human who didn't fall off the cliff, if it has only exactly those two options, and this is the sort of thing it would learn was probably true about the TUF early on, given the obvious ways of trying to produce a CIRL-ish thing via gradient descent

humans eat healthy in the ancestral environment when ice cream doesn't exist as an option

in the long run, the things the CIRL agent wants do not overlap with anything humans find more desirable than paperclips (because there is no known scheme that takes in a bunch of observations, updates a prior, and outputs a utility function whose achievable maximum is galaxies living happily forever after)

and plausible TUF schemes are going to notice that grabbing the hand of a current human is a vanishing fraction of all value eventually at stake

[Shah][18:42]

Okay, cool, short vs. long time horizons

Makes sense

[Yudkowsky][18:42]

right, a weak but sufficiently reflective CIRL agent will notice an alignment of short-term interests with humans but deduce misalignment of long-term interests

though I should maybe call it CIRL* to denote the extremely probable case that the limit of its updating on observation does not in fact converge to CEV's output

[Soares][18:43]

(Attempted rephrasing of a point I read Eliezer as making upstream, in hopes that a rephrasing makes it click for Rohin:)

Corrigibility isn't for bug-free CIRL agents with a prior that actually dials in on goodness given enough observation; if you have one of those you can just run it and call it a day. Rather, corrigibility is for surviving your civilization's inability to do the job right on the first try.

CIRL doesn't have this property; it instead amounts to the assertion "if you are optimizing with respect to a distribution on utility functions that dials in on goodness given enough observation then that gets you just about as much good as optimizing goodness"; this is somewhat tangential to corrigibility.

[Yudkowsky: +1]

[Yudkowsky][18:44]

and you should maybe update on how, even though somebody thought CIRL was going to be more corrigible, in fact it made absolutely zero progress on the real problem

[Ngo: 👍]

the notion of having an uncertain utility function that you update from observation is coherent and doesn't yield circular preferences, running in circles, incoherent betting, etc.

so, of course, it is antithetical in its intrinsic nature to corrigibility

[Shah][18:47]

I guess I am not sure that I agree that this is the purpose of corrigibility-as-I-see-it. The point of corrigibility-as-I-see-it is that you don't have to specify the object-level outcomes that your AI system must produce, and instead you can specify the meta-level processes by which your AI system should come to know what the object-level outcomes to optimize for are

(At CHAI we had taken to talking about corrigibility_MIRI and corrigibility_Paul as completely separate concepts and I have clearly fallen out of that good habit)

[Yudkowsky][18:48]

speaking as the person who invented the concept, asked for name submissions for it, and selected 'corrigibility' as the winning submission, that is absolutely not how I intended the word to be used

and I think that the thing I was actually trying to talk about is important and I would like to retain a word that talks about it

'corrigibility' is meant to refer to the sort of putative hypothetical motivational properties that prevent a system from wanting to kill you after you didn't build it exactly right

low impact, mild optimization, shutdownability, abortable planning, behaviorism, conservatism, etc. (note: some of these may be less antinatural than others)

[Shah][18:51]

Cool. Sorry for the miscommunication, I think we should probably backtrack to here

so the actual problem is that I expect the optimization to generalize and the corrigibility to fail

and restart.

Though possibly I should go to bed, it is quite late here and there was definitely a time at which I would not have confused corrigibility_MIRI with corrigibility_Paul, and I am a bit worried at my completely having missed that this time

[Yudkowsky][18:51]

the thing you just said, interpreted literally, is what I would call simply "going meta" but my guess is you have a more specific metaness in mind

...does Paul use "corrigibility" to mean "going meta"? I don't think I've seen Paul doing that.

[Shah][18:54]

Not exactly "going meta", no (and I don't think I exactly mean that either). But I definitely infer a different concept from https://www.alignmentforum.org/posts/fkLYhTQteAu5SinAc/corrigibility than the one you're describing here. It is definitely possible that this comes from me misunderstanding Paul; I have done so many times

[Yudkowsky][18:55]

That looks to me like Paul used 'corrigibility' around the same way I meant it, if I'm not just reading my own face into those clouds. maybe you picked up on the exciting metaness of it and thought 'corrigibility' was talking about the metaness part? 😛

but I also want to create an affordance for you to go to bed

hopefully this last conversation combined with previous dialogues has created any sense of why I worry that corrigibility is anti-natural and hence that "on the first try at doing it, the optimization generalizes from the weak-safe domains to the strong-lethal domains, but the corrigibility doesn't"

so I would then ask you what part of this you were skeptical about

as a place to pick up when you come back from the realms of Morpheus

[Shah][18:58]

Yup, sounds good. Talk to you tomorrow!

21. November 7 conversation

21.1. Corrigibility, value learning, and pessimism

[Shah][3:23]

Quick summary of discussion so far (in which I ascribe views to Eliezer, for the sake of checking understanding, omitting for brevity the parts about how these are facts about my beliefs about Eliezer's beliefs and not Eliezer's beliefs themselves):

Some discussion of "how to use non-world-optimizing AIs to help with AI alignment", which are mostly in the category "clever tricks with gradient descent and loss functions and labeled datasets" rather than "textbook from the future". Rohin thinks these help significantly (and that "significant help" = "reduced x-risk"). Eliezer thinks that whatever help they provide is not sufficient to cross the line from "we need a miracle" to "we have a plan that has non-trivial probability of success without miracles". The crux here seems to be alignment difficulty.
Some discussion of how doom plays out. I agree with Eliezer that if the AI is catastrophic by default, and we don't have a technique that stops the AI from being catastrophic by default, and we don't already have some global coordination scheme in place, then bad things happen. Cruxes seem to be alignment difficulty and the plausibility of a global coordination scheme, of which alignment difficulty seems like the bigger one.
On alignment difficulty, an example scenario is "train on human judgments about what the right thing to do is on a variety of weak-safe domains, and hope for generalization to potentially-lethal domains". Rohin views this as neither confidently safe nor confidently unsafe. Eliezer views this as confidently unsafe, because he strongly expects the optimization to generalize while the corrigibility doesn't, because corrigibility is anti-natural.

(Incidentally, "optimization generalizes but corrigibility doesn't" is an example of the sort of thing I wish were more concrete, if you happen to be able to do that)

My current take on "corrigibility":

Prior to this discussion, in my head there was corrigibility_A and corrigibility_B. Corrigibility_A, which I associated with MIRI, was about imposing a constraint "from the outside". Given an AI system, it is a method of modifying that AI system to (say) allow you to shut it down, by performing some sort of operation on its goal. Corrigibility_B, which I associated with Paul, was about building an AI system which would have particular nice behaviors like learning about the user's preferences, accepting corrections about what it should do, etc.

After this discussion, I think everyone meant corrigibility_B all along. The point of the 2015 MIRI paper was to check whether it is possible to build a version of corrigibility_B that was compatible with expected utility maximization with a not-terribly-complicated utility function; the point of this was to see whether corrigibility could be made compatible with "plans that lase".
While I think people agree on the behaviors of corrigibility, I am not sure they agree on why we want it. Eliezer wants it for surviving failures, but maybe others want it for "dialing in on goodness". When I think about a "broad basin of corrigibility", that intuitively seems more compatible with the "dialing in on goodness" framing (but this is an aesthetic judgment that could easily be wrong).
I don't think I meant "going meta", e.g. I wouldn't have called indirect normativity an example of corrigibility. I think I was pointing at "dialing in on goodness" vs. "specifying goodness".
I agree CIRL doesn't help survive failures. But if you instead talk about "dialing in on goodness", CIRL does in fact do this, at least conceptually (and other alternatives don't).
I am somewhat surprised that "how to conceptually dial in on goodness" is not something that seems useful to you. Maybe you think it is useful, but you're objecting to me calling it corrigibility, or saying we knew how to do it before CIRL?

(A lot of the above on corrigibility is new, because the distinction between surviving-failures and dialing-in-on-goodness as different use cases for very similar kinds of behaviors is new to me. Thanks for discussion that led me to making such a distinction.)

Possible avenues for future discussion, in the order of my-guess-at-usefulness:

Discussing anti-naturality of corrigibility. As a starting point: you say that an agent that makes plans but doesn't execute them is also dangerous, because it is the plan itself that lases, and corrigibility is antithetical to lasing. Does this mean you predict that you, or I, with suitably enhanced intelligence and/or reflectivity, would not be capable of producing a plan to help an alien civilization optimize their world, with that plan being corrigible w.r.t. the aliens? (This seems like a strange and unlikely position to me, but I don't see how to not make this prediction under what I believe to be your beliefs. Maybe you just bite this bullet.)
Discussing why it is very unlikely for the AI system to generalize correctly both on optimization and values-or-goals-that-guide-the-optimization (which seems to be distinct from corrigibility). Or to put it another way, why is "alignment by default according to John Wentworth" doomed to fail? https://www.lesswrong.com/posts/Nwgdq6kHke5LY692J/alignment-by-default
More checking of where I am failing to pass your ITT
Why is "dialing in on goodness" not a reasonable part of the solution space (to the extent you believe that)?
More concreteness on how optimization generalizes but corrigibility doesn't, in the case where the AI was trained by human judgment on weak-safe domains

Just to continue to state it so people don't misinterpret me: in most of the cases that we're discussing, my position is not that they are safe, but rather that they are not overwhelmingly likely to be unsafe.

[Ngo][3:41]

I don't understand what you mean by dialling in on goodness. Could you explain how CIRL does this better than, say, reward modelling?

[Shah][3:49]

Reward modeling does not by default (a) choose relevant questions to ask the user in order to get more information about goodness, (b) act conservatively, especially in the face of irreversible actions, while it is still uncertain about what goodness is, or (c) take actions that are known to be robustly good, while still waiting for future information that clarifies the nuances of goodness

You could certainly do something like Deep RL from Human Preferences, where the preferences are things like "I prefer you ask me relevant questions to get more information about goodness", in order to get similar behavior. In this case you are transferring desired behaviors from a human to the AI system, whereas in CIRL the behaviors "fall out of" optimization for a specific objective

In Eliezer/Nate terms, the CIRL story shows that dialing in on goodness is compatible with "plans that lase", whereas reward modeling does not show this

[Ngo][4:04]

The meta-level objective that CIRL is pointing to, what makes that thing deserve the name "goodness"? Like, if I just gave an alien CIRL, and I said "this algorithm dials an AI towards a given thing", and they looked at it without any preconceptions of what the designers wanted to do, why wouldn't they say "huh, it looks like an algorithm for dialling in on some extrapolation of the unintended consequences of people's behaviour" or something like that?

See also this part of my second discussion with Eliezer, where he brings up CIRL: [https://www.lesswrong.com/posts/7im8at9PmhbT4JHsW/ngo-and-yudkowsky-on-alignment-difficulty#3_2__Brain_functions_and_outcome_pumps] He was emphasising that CIRL, and most other proposals for alignment algorithms, just shuffle the problematic consequentialism from the original place to a less visible place. I didn't engage much with this argument because I mostly agree with it.

[Yudkowsky: +1]

[Shah][5:28]

I think you are misunderstanding my point. I am not claiming that we know how to implement CIRL such that it produces good outcomes; I agree this depends a ton on having a sufficiently good P(obs | reward). Similarly, if you gave CIRL to aliens, whether or not they say it is about getting some extrapolation of unintended consequences depends on exactly what P(obs | reward) you ended up using. There is some not-too-complicated P(obs | reward) such that you do end up getting to "goodness", or something sufficiently close that it is not an existential catastrophe; I do not claim we know what it is.

I am claiming that behaviors like (a), (b) and (c) above are compatible with expected utility theory, and thus compatible with "plans that lase". This is demonstrated by CIRL. It is not demonstrated by reward modeling, see e.g. these three papers for problems that arise (which make it so that it is working at cross purposes with itself and seems incompatible with "plans that lase"). (I'm most confident in the first supporting my point, it's been a long time since I read them so I might be wrong about the others.) To my knowledge, similar problems don't arise with CIRL (and they shouldn't, because it is a nice integrated Bayesian agent doing expected utility theory).

I could imagine an objection that P(obs | reward), while not as complicated as "the utility function that rationalizes a twitching robot", is still too complicated to really show compatibility with plans-that-lase, but pointing out that P(obs | reward) could be misspecified doesn't seem particularly relevant to whether behaviors (a), (b) and (c) are compatible with plans-that-lase.
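
A minimal sketch of the claim above, as a toy one-shot decision problem rather than CIRL proper (which is a two-player game). It assumes two candidate reward functions, a hand-written P(obs | reward) table, and made-up utilities and query cost; every name and number here is hypothetical. The point is only that asking a clarifying question, acting conservatively under uncertainty, and committing once the evidence is in can all fall out of plain expected utility maximization:

```python
# Toy expected-utility maximizer that is uncertain about the reward.
# All names, utilities, and likelihoods below are illustrative assumptions.

REWARDS = ("r1", "r2")  # two candidate reward functions

# Utility of each terminal action under each candidate reward.
UTILITY = {
    "do_x": {"r1": 10.0, "r2": -20.0},  # good under r1, bad under r2
    "do_y": {"r1": -20.0, "r2": 10.0},  # the reverse
    "safe": {"r1": 1.0, "r2": 1.0},     # robustly mildly good
}

# P(obs | reward): how likely each human answer is under each reward.
LIKELIHOOD = {
    "says_x": {"r1": 0.9, "r2": 0.1},
    "says_y": {"r1": 0.1, "r2": 0.9},
}

ASK_COST = 1.5  # utility cost of asking the human one question


def posterior(belief, obs):
    """Bayes update of the belief over rewards after seeing obs."""
    unnorm = {r: belief[r] * LIKELIHOOD[obs][r] for r in REWARDS}
    z = sum(unnorm.values())
    return {r: p / z for r, p in unnorm.items()}


def expected_utility(belief, action):
    return sum(belief[r] * UTILITY[action][r] for r in REWARDS)


def best_terminal(belief):
    """Best action when asking is not an option."""
    return max(UTILITY, key=lambda a: expected_utility(belief, a))


def value_of_asking(belief):
    """Expected utility of asking once, then taking the best terminal action."""
    total = 0.0
    for obs in LIKELIHOOD:
        p_obs = sum(belief[r] * LIKELIHOOD[obs][r] for r in REWARDS)
        post = posterior(belief, obs)
        total += p_obs * expected_utility(post, best_terminal(post))
    return total - ASK_COST


def choose(belief):
    """Pick 'ask' if its expected utility beats the best terminal action."""
    act = best_terminal(belief)
    if value_of_asking(belief) > expected_utility(belief, act):
        return "ask"
    return act


prior = {"r1": 0.5, "r2": 0.5}
print(best_terminal(prior))                  # conservative choice if it cannot ask
print(choose(prior))                         # what it does when it can ask
print(choose(posterior(prior, "says_x")))    # what it does after one answer
```

Under these particular numbers the agent prefers the robust action when it cannot ask, prefers to ask when it can, and commits after one answer. Nothing here addresses whether the likelihood table is the right one, which is the part the discussion identifies as hard.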

Re: shuffling around the problematic consequentialism: it is not my main plan to avoid consequentialism in the sense of plans-that-lase. I broadly agree with Eliezer that you need consequentialism to do high-impact stuff. My plan is for the consequentialism to be aimed at good ends. So I agree that there is still consequentialism in CIRL, and I don't see this as a damning point; when I talk about "dialing in to goodness", I am thinking of aiming the consequentialism at goodness, not getting rid of consequentialism.

(You can still do things like try to be domain-specific rather than domain-general; I don't mean to completely exclude such approaches. They do seem to give additional safety. But the mainline story is that the consequentialism / optimization is directed at what we want rather than something else.)

[Ngo][6:21]

If you don't know how to implement CIRL in such a way that it actually aims at goodness, then you don't have an algorithm with properties a, b and c above.

Or, to put it another way: suppose I replace the word "goodness" with "winningness". Now I can describe AlphaStar as follows:

it chooses relevant questions to ask (read: scouts to send) in order to get more information about winningness
it acts conservatively while it is still uncertain about what winningness is
it takes actions that are known to be robustly ~~good~~ winningish, while still waiting for future information that clarifies the nuances of winningness

Now, you might say that the difference is that CIRL implements uncertainty over possible utility functions, not possible empirical beliefs. But this is just a semantic difference which shuffles the problem around without changing anything substantial. E.g. it's exactly equivalent if we think of CIRL as an agent with a fixed (known) utility function, which just has uncertainty about some empirical parameter related to the humans it interacts with.

[Yudkowsky: +1]

[Soares][6:55]

[...] it takes actions that are known to be robustly good, while still waiting for future information that clarifies the nuances of winningness

(typo: "known to be robustly good" -> "known to be robustly winningish" :-p)

[Ngo: 👍]

Some quick reactions, some from me and some from my model of Eliezer:

Eliezer thinks that whatever help they provide is not sufficient [...] The crux here seems to be alignment difficulty.

I'd be more hesitant to declare the crux "alignment difficulty". My understanding of Eliezer's position on your "use AI to help with alignment" proposals (which focus on things like using AI to make paradigmatic AI systems more transparent) is "that was always the plan, and it doesn't address the sort of problems I'm worried about". Maybe you understand the problems Eliezer's worried about, and believe them not to be very difficult to overcome, thus putting the crux somewhere like "alignment difficulty", but I'm not convinced.

I'd update towards your crux-hypothesis if you provided a good-according-to-Eliezer summary of what other problems Eliezer sees and the reasons-according-to-Eliezer that "AI make our tensors more transparent" doesn't much address them.

Corrigibility_A [...] Corrigibility_B [...]

Of the two, Corrigibility_B does sound a little closer to my concept, though neither of your descriptions causes me to be confident that communication has occurred. Throwing some checksums out there:

There are three reasons a young weak AI system might accept your corrections. It could be corrigible, or it could be incorrigibly pursuing goodness, or it could be incorrigibly pursuing some other goal while calculating that accepting this correction is better according to its current goals than risking a shutdown.
One way you can tell that CIRL is not corrigible is that it does not accept corrections when old and strong.
There's an intuitive notion of "you're here to help us implement a messy and fragile concept not yet clearly known to us; work with us here?" that makes sense to humans, that includes as a side effect things like "don't scan my brain and then disregard my objections; there could be flaws in how you're inferring my preferences from my objections; it's actually quite important that you be cautious and accept brain surgery even in cases where your updated model says we're about to make a big mistake according to our own preferences".

The point of the 2015 MIRI paper was to check whether it is possible to build a version of corrigibility_B that was compatible with expected utility maximization with a not-terribly-complicated utility function; the point of this was to see whether corrigibility could be made compatible with "plans that lase".

More like:

Corrigibility seems, at least on the surface, to be in tension with the simple and useful patterns of optimization that tend to be spotlit by demands for cross-domain success, similar to how acting like two oranges are worth one apple and one apple is worth one orange is in tension with those patterns.
In practice, this tension seems to run more than surface-deep. In particular, various attempts to reconcile the tension fail, and cause the AI to have undesirable preferences (eg, incentives to convince you to shut it down whenever its utility is suboptimal), exploitably bad beliefs (eg, willingness to bet at unreasonable odds that it won't be shut down), and/or to not be corrigible in the first place (eg, a preference for destructively uploading your mind against your protests, at which point further protests from your coworkers are screened off by its access to that upload).

[Yudkowsky: ✅]
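
The apples-and-oranges comparison above can be made concrete as a money pump. The sketch below assumes a hypothetical agent that will always pay two oranges for one apple and will also always swap one apple for one orange; the function name and starting quantities are illustrative only. A counterparty who alternates the two trades drains one orange per cycle:

```python
# Toy money pump against an agent with inconsistent exchange rates.
# Assumption: the agent accepts both trades every time they are offered.

def money_pump(oranges, cycles):
    """Return the agent's (oranges, apples) after up to `cycles` round trips."""
    apples = 0
    for _ in range(cycles):
        if oranges < 2:
            break  # the agent can no longer afford an apple
        # Trade 1: agent pays 2 oranges for 1 apple.
        oranges -= 2
        apples += 1
        # Trade 2: agent swaps that apple for 1 orange.
        apples -= 1
        oranges += 1
    return oranges, apples


print(money_pump(10, 5))    # five round trips
print(money_pump(10, 100))  # pumped until the agent cannot pay
```

Each round trip leaves the agent holding the same kind of goods it started with, minus one orange, which is the sense in which such preferences are in tension with coherent optimization.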

(There's an argument I occasionally see floating around these parts that goes "ok, well what if the AI is fractally corrigible, in the sense that instead of its cognition being oriented around pursuit of some goal, its cognition is oriented around doing what it predicts a human would do (or what a human would want it to do) in a corrigible way, at every level and step of its cognition". This is perhaps where you perceive a gap between your A-type and B-type notions, where MIRI folk tend to be more interested in reconciling the tension between corrigibility and coherence, and Paulian folk tend to place more of their chips on some such fractal notion?

I admit I don't find much hope in the "fractally corrigible" view myself, and I'm not sure whether I could pass a proponent's ITT, but fwiw my model of the Yudkowskian rejoinder is "mindspace is deep and wide; that could plausibly be done if you had sufficient mastery of minds; you're not going to get anywhere near close to that in practice, because of the way that basic normal everyday cross-domain training will highlight patterns that you'd call orienting-cognition-around-a-goal".)

And my super-quick takes on your avenues for future discussion:

1. Discussing anti-naturality of corrigibility.

Hopefully the above helps.

2. Discussing why it is very unlikely for the AI system to generalize correctly both on optimization and values-or-goals-that-guide-the-optimization

The concept "patterns of thought that are useful for cross-domain success" is latent in the problems the AI faces, and known to have various simple mathematical shadows, and our training is more-or-less banging the AI over the head with it day in and day out. By contrast, the specific values we wish to be pursued are not latent in the problems, are known to lack a simple boundary, and our training is much further removed from it.

3. More checking of where I am failing to pass your ITT

4. Why is "dialing in on goodness" not a reasonable part of the solution space?

It has long been the plan to say something less like "the following list comprises goodness: ..." and more like "yo we're tryin to optimize some difficult-to-name concept; help us out?". "Find a prior that, with observation of the human operators, dials in on goodness" is a fine guess at how to formalize the latter.

If we had been planning to take the former tack, and you had come in suggesting CIRL, that might have helped us switch to the latter tack, which would have been cool. In that sense, it's a fine part of the solution.

It also provides some additional formality, which is another iota of potential solution-ness, for that part of the problem.

It doesn't much address the rest of the problem, which is centered much more around "how do you point powerful cognition in any direction at all" (such as towards your chosen utility function or prior thereover).

5. More concreteness on how optimization generalizes but corrigibility doesn't, in the case where the AI was trained by human judgment on weak-safe domains

[Shah][13:23]

If you don't know how to implement CIRL in such a way that it actually aims at goodness, then you don't have an algorithm with properties a, b and c above.

I want clarity on the premise here:

Is the premise "Rohin cannot write code that when run exhibits properties a, b, and c"? If so, I totally agree, but I'm not sure what the point is. All alignment work ever until the very last step will not lead you to writing code that when run exhibits an aligned superintelligence, but this does not mean that the prior alignment work was useless.
Is the premise "there does not exist code that (1) we would call an implementation of CIRL and (2) when run has properties a, b, and c"? If so, I think your premise is false, for the reasons given previously (I can repeat them if needed)

I imagine it is neither of the above, and you are trying to make a claim that some conclusion that I am drawing from or about CIRL is invalid, because in order for me to draw that conclusion, I need to exhibit the correct P(obs | reward). If so, I want to know which conclusion is invalid and why I have to exhibit the correct P(obs | reward) before I can reach that conclusion.

I agree that properties (a), (b) and (c) are simple, straightforward consequences of being Bayesian about a quantity you are uncertain about and care about, as with AlphaStar and "winningness". I don't know what you intend to imply by this -- because it also applies to other Bayesian things, it can't imply anything about alignment? I also agree the uncertainty over reward is equivalent to uncertainty over some parameter of the human (and have proved this theorem myself in the paper I wrote on the topic). I do not claim that anything in here is particularly non-obvious or clever, in case anyone thought I was making that claim.

To state it again, my claim is that behaviors like (a), (b) and (c) are consistent with "plans-that-lase", and as evidence for this claim I cite the existence of an expected-utility-maximizing algorithm that displays them, specifically CIRL with the correct P(obs | reward). I do not claim that I can write down the code, I am just claiming that it exists. If you agree with the claim but not the evidence then let's just drop the point. If you disagree with the claim then tell me why it's false. If you are unsure about the claim then point to the step in the argument you think doesn't work.

The reason I care about this claim is that, even if you think that superintelligences only involve plans-that-lase, it seems to me like this does not rule out what we might call "dialing in to goodness" or "assisting the user", and thus it seems like this is a valid target for you to try to get your superintelligence to do.

I suspect that I do not agree with Eliezer about what plans-that-lase can do, but it seems like the two of us should at least agree that behaviors like (a), (b) and (c) can be exhibited in plans-that-lase, and if we don't agree on that some sort of miscommunication has happened.

Throwing some checksums out there

The checksums definitely make sense. (Technically I could name more reasons why a young AI might accept correction, such as "it's still sphexish in some areas, accepting corrections is one of those areas", and for the third reason the AI could be calculating negative consequences for things other than shutdown, but that seems nitpicky and I don't think it means I have misunderstood you.)

I think the third one feels somewhat slippery and vague, in that I don't know exactly what it's claiming, but it clearly seems to be the same sort of thing as corrigibility. Mostly it's more like I wouldn't be surprised if the Textbook from the Future tells us that we mostly had the right concept of corrigibility, but that third checksum is not quite how they would describe it any more. I would be a lot more surprised if the Textbook says we mostly had the right concept but then says checksums 1 and 2 were misguided.

"The point of the 2015 MIRI paper was to check whether it is possible to build a version of corrigibility_B that was compatible with expected utility maximization with a not-terribly-complicated utility function; the point of this was to see whether corrigibility could be made compatible with 'plans that lase'."
More like:
Corrigibility seems, at least on the surface, to be in tension with the simple and useful patterns of optimization that tend to be spotlit by demands for cross-domain success, similar to how acting like two oranges are worth one apple and one apple is worth one orange is in tension with those patterns.
In practice, this tension seems to run more than surface-deep. In particular, various attempts to reconcile the tension fail, and cause the AI to have undesirable preferences (eg, incentives to convince you to shut it down whenever its utility is suboptimal), exploitably bad beliefs (eg, willingness to bet at unreasonable odds that it won't be shut down), and/or to not be corrigible in the first place (eg, a preference for destructively uploading your mind against your protests, at which point further protests from your coworkers are screened off by its access to that upload).

On the 2015 Corrigibility paper, is this an accurate summary: "it wasn't that we were checking whether corrigibility could be compatible with useful patterns of optimization; it was already obvious at least at a surface level that corrigibility was in tension with these patterns, and we wanted to check and/or show that this tension persisted more deeply and couldn't be easily fixed".

(My other main hypothesis is that there's an important distinction between "simple and useful patterns of optimization" (term in your message) and "plans that lase" (term in my message) but if so I don't know what it is.)

[Soares][13:52]

What we wanted to do was show that the apparent tension was merely superficial. We failed.

[Shah: 👍]

(Also, IIRC -- and it's been a long time since I checked -- the 2015 paper contains only one exploration, relating to an idea of Stuart Armstrong's. There was another host of ideas raised and shot down in that era that didn't make it into that paper, pro'lly b/c they came afterwards.)

[Shah][13:55]

What we wanted to do was show that the apparent tension was merely superficial. We failed.

(That sounds like what I originally said? I'm a bit confused why you didn't just agree with my original phrasing:

The point of the 2015 MIRI paper was to check whether it is possible to build a version of corrigibility_B that was compatible with expected utility maximization with a not-terribly-complicated utility function; the point of this was to see whether corrigibility could be made compatible with "plans that lase".

)

(I'm kinda worried that there's some big distinction between "EU maximization", "plans that lase", and "simple and useful patterns of optimization", that I'm not getting; I'm treating them as roughly equivalent at the moment when putting on my MIRI-ontology-hat.)

[Soares][14:01]

(There are a bunch of aspects of your phrasing that indicated to me a different framing, and one I find quite foreign. For instance, this talk of "building a version of corrigibility_B" strikes me as foreign, and the talk of "making it compatible with 'plans that lase'" strikes me as foreign. It's plausible to me that you, who understand your original framing, can tell that my rephrasing matches your original intent. I do not yet feel like I could emit the description you emitted without contorting my thoughts about corrigibility in foreign ways, and I'm not sure whether that's an indication that there are distinctions, important to me, that I haven't communicated.)

(I'm kinda worried that there's some big distinction between "EU maximization", "plans that lase", and "simple and useful patterns of optimization", that I'm not getting; I'm treating them as roughly equivalent at the moment when putting on my MIRI-ontology-hat.)

I, too, believe them to be basically equivalent (with the caveat that the reason for using expanded phrasings is because people have a history of misunderstanding "utility maximization" and "coherence", and so insofar as you round them all to "coherence" and then argue against some very narrow interpretation of coherence, I'm gonna protest that you're bailey-and-motting).

[Shah: 👍]

[Shah][14:12]

Hopefully the above helps.

I'm still interested in the question "Does this mean you predict that you, or I, with suitably enhanced intelligence and/or reflectivity, would not be capable of producing a plan to help an alien civilization optimize their world, with that plan being corrigible w.r.t the aliens?" I don't currently understand how you avoid making this prediction given other stated beliefs. (Maybe you just bite the bullet and do predict this?)

By contrast, the specific values we wish to be pursued are not latent in the problems, are known to lack a simple boundary, and our training is much further removed from it.

I'm not totally sure what is meant by "simple boundary", but it seems like a lot of human values are latent in text prediction on the Internet, and when training from human feedback the training is not very removed from values.

It has long been the plan to say something less like "the following list comprises goodness: ..." and more like "yo we're tryin to optimize some difficult-to-name concept; help us out?". [...]

I take this to mean that "dialing in on goodness" is a reasonable part of the solution space? If so, I retract that question. I thought from previous comments that Eliezer thought this part of the solution space was more doomed than corrigibility.

(I get the sense that people think that I am butthurt about CIRL not getting enough recognition or something. I do in fact think this, but it's not part of my agenda here. I originally brought it up to make the argument that corrigibility is not in tension with EU maximization, then realized that I was mistaken about what "corrigibility" meant, but still care about the argument that "dialing in on goodness" is not in tension with EU maximization. But if we agree on that claim then I'm happy to stop talking about CIRL.)

[Soares][14:13]

I'd be capable of helping aliens optimize their world, sure. I wouldn't be motivated to, but I'd be capable.

[Shah][14:14]

(There are a bunch of aspects of your phrasing that indicated to me a different framing, and one I find quite foreign. For instance, this talk of "building a version of corrigibility_B" strikes me as foreign, and the talk of "making it compatible with 'plans that lase'" strikes me as foreign. It's plausible to me that you, who understand your original framing, can tell that my rephrasing matches your original intent. I do not yet feel like I could emit the description you emitted without contorting my thoughts about corrigibility in foreign ways, and I'm not sure whether that's an indication that there are distinctions, important to me, that I haven't communicated.)

This makes sense. I guess you might think of these concepts as quite pinned down? Like, in your head, EU maximization is just a kind of behavior (= set of behaviors), corrigibility is just another kind of behavior (= set of behaviors), and there's a straightforward yes-or-no question about whether the intersection is empty which you set out to answer, you can't "make" it come out one way or the other, nor can you "build" a new kind of corrigibility

[Soares][14:17]

Re: CIRL, my current working hypothesis is that by "use CIRL" you mean something analogous to what I say when I say "do CEV" -- namely, direct the AI to figure out what we "really" want in some correct sense, rather than attempting to specify what we want concretely. And to be clear, on my model, this is part of the solution to the overall alignment problem, and it's more-or-less why we wouldn't die immediately on the "value is fragile / we can't name exactly what we want" step if we solved the other problems.

My guess as to the disagreement about how much credit CIRL should get, is that there is in fact a disagreement, but it's not coming from MIRI folk saying "no we should be specifying the actual utility function by hand", it's coming from MIRI folk saying "this is just the advice 'do CEV' dressed up in different clothing and presented as a reason to stop worrying about corrigibility, which is irritating, given that it's orthogonal to corrigibility".

If you wanna fight that fight, I'd start by asking: Do you think CIRL is doing anything above and beyond what "use CEV" is doing? If so, what?

Regardless, I think it might be a good idea for you to try to pass my (or Eliezer's) ITT about what parts of the problem remain beyond the thing I'd call "do CEV" and why they're hard. (Not least b/c if my working hypothesis is wrong, demonstrating your mastery of that subject might prevent a bunch of toil covering ground you already know.)

[Shah][14:17]

I'd be capable of helping aliens optimize their world, sure. I wouldn't be motivated to, but I'd be capable.

Okay, so it seems like the danger requires the thing-producing-the-plan to be badly-motivated. But then I'm not sure why it seems so impossible to have a (not-badly-motivated) thing that, when given a goal, produces a plan to corrigibly get that goal. (This is a scenario Richard mentioned earlier.)

[Soares][14:19]

This makes sense. I guess you might think of these concepts as quite pinned down? Like, in your head, EU maximization is just a kind of behavior (= set of behaviors), corrigibility is just another kind of behavior (= set of behaviors), and there's a straightforward yes-or-no question about whether the intersection is empty which you set out to answer, you can't "make" it come out one way or the other, nor can you "build" a new kind of corrigibility

That sounds like one of the big directions in which your framing felt off to me, yeah :-). (I don't fully endorse that rephrasing, but it seems directionally correct to me.)

Okay, so it seems like the danger requires the thing-producing-the-plan to be badly-motivated. But then I'm not sure why it seems so impossible to have a (not-badly-motivated) thing that, when given a goal, produces a plan to corrigibly get that goal. (This is a scenario Richard mentioned earlier.)

On my model, aiming the powerful optimizer is the hard bit.

Like, once I grant "there's a powerful optimizer, and all it does is produce plans to corrigibly attain a given goal", I agree that the problem is mostly solved.

There's maybe some cleanup, but the bulk of the alignment challenge preceded that point.

[Shah: 👍]

(This is hard for all the usual reasons, that I suppose I could retread.)

[Shah][14:24]

[...] Regardless, I think it might be a good idea for you to try to pass my (or Eliezer's) ITT about what parts of the problem remain beyond the thing I'd call "do CEV" and why they're hard. (Not least b/c if my working hypothesis is wrong, demonstrating your mastery of that subject might prevent a bunch of toil covering ground you already know.)

(Working on ITT)

[Soares][14:30]

(To clarify some points of mine, in case this gets published later to other readers: (1) I might call it more centrally something like "build a DWIM system" rather than "use CEV"; and (2) this is not advice about what your civilization should do with early AGI systems, I strongly recommend against trying to pull off CEV under that kind of pressure.)

[Shah][14:32]

I don't particularly want to have fights about credit. I just didn't want to falsely state that I do not care about how much credit CIRL gets, when attempting to head off further comments that seemed designed to appease my sense of not-enough-credit. (I'm also not particularly annoyed at MIRI, here.)

On passing ITT, about what's left beyond "use CEV" (stated in my ontology because it's faster to type; I think you'll understand, but I can also translate if you think that's important):

The main thing is simply how to actually get the AI system to care about pursuing CEV. I think MIRI ontology would call this the target loading problem.
This is hard because (a) you can't just train on CEV, because you can't just implement CEV and provide that as training, and (b) even if you magically could train on CEV, that does not establish that the resulting AI system then wants to optimize CEV. It could just as well optimize some other objective that correlated with CEV in the situations you trained on, but no longer correlates in some new situation (like when you are building a nanosystem). (Point (b) is how I would talk about inner alignment.)
This is made harder for a variety of reasons, including (a) you're working with inscrutable matrices that you can't look at the details of, (b) there are clear racing incentives when the prize is to take over the world (or even just lots of economic profit), (c) people are unlikely to understand the issues at stake (the exact reasons are unclear to me; I'd guess that the issues are too subtle / conceptual, + pressure to rationalize them away), (d) there's very little time in which we have a good understanding of the situation we face, because of fast / discontinuous takeoff

[Soares: 👍]

[Soares][14:37]

Passable ^_^ (Not exhaustive, obviously; "it will have a tendency to kill you on the first real try if you get it wrong" being an example missing piece, but I doubt you were trying to be exhaustive.) Thanks.

[Shah: 👍]

Okay, so it seems like the danger requires the thing-producing-the-plan to be badly-motivated. But then I'm not sure why it seems so impossible to have a (not-badly-motivated) thing that, when given a goal, produces a plan to corrigibly get that goal. (This is a scenario Richard mentioned earlier.)

I'm uncertain where the disconnect is here. Like, I could repeat some things from past discussions about how "it only outputs plans, it doesn't execute them" does very little (not nothing, but very little) from my perspective? Or you could try to point at past things you'd expect me to repeat and name why they don't seem to apply to you?

[Shah][14:40]

(Flagging that I should go to bed soon, though it doesn't have to be right away)

[Yudkowsky][14:50]

...I do not know if this is going to help anything, but I have a feeling that there's a frequent disconnect wherein I invented an idea, considered it, found it necessary-but-not-sufficient, and moved on to looking for additional or varying solutions, and then a decade or in this case 2 decades later, somebody comes along and sees this brilliant solution which MIRI is for some reason neglecting

this is perhaps exacerbated by a deliberate decision during the early days, when I looked very weird and the field was much more allergic to weird, to not even try to stamp my name on all the things I invented. eg, I told Nick Bostrom to please use various of my ideas as he found appropriate and only credit them if he thought that was strategically wise.

I expect that some number of people now in the field don't know I invented corrigibility, and any number of other things that I'm a little more hesitant to claim here because I didn't leave Facebook trails for inventing them

and unless you had been around for quite a while, you definitely wouldn't know that I had been (so far as I know) the first person to perform the unexceptional-to-me feat of writing down, in 2001, the very obvious idea I called "external reference semantics", or as it's called nowadays, CIRL

[Shah][14:53]

I really honestly am not trying to say that MIRI didn't think of CIRL-like things, nor am I trying to get credit for CIRL. I really just wanted to establish that "learn what is good to do" seems not-ruled-out by EU maximization. That's all. It sounds like we agree on this point and if so I'd prefer to drop it.

[Soares: ❤️]

[Yudkowsky][14:53]

Having a prior over utility functions that gets updated by evidence is not ruled out by EU maximization. That exact thing is hard for other reasons than it being contrary to the nature of EU maximization.

If it was ruled out by EU maximization for any simple reason, I would have noticed that back in 2001.

[Ngo][14:54]

I think we all agree on this point.

[Shah: 👍]

[Soares: 👍]

One thing I'd note is that during my debate with Eliezer, I'd keep saying "oh so you think X is impossible" and he'd say "no, all these things are possible, they're just really really hard".

[Yudkowsky][14:58]

...to do correctly on your first try when a failed attempt kills you.

[Shah][14:58]

Maybe it's fine; perhaps the point is just that target loading is hard, and the question is why target loading is so hard.

From my perspective, the main confusing thing about the Eliezer/Nate view is how confident it is. With each individual piece, I (usually) find myself nodding along and saying "yes, it seems like if we wanted to guarantee safety, we would need to solve this". What I don't do is say "yes, it seems like without a solution to this, we're near-certainly dead". The uncharitable view (which I share mainly to emphasize where the disconnect is, not because I think it is true) would be something like "Eliezer/Nate are falling to a Murphy bias, where they assume that unless they have an ironclad positive argument for safety, the worst possible thing will happen and we all die". I try to generate things that seem more like ironclad (or at least "leatherclad") positive arguments for doom, and mostly don't succeed; when I say "human values are very complicated" there's the rejoinder that "a superintelligence will certainly know about human values; pointing at them shouldn't take that many more bits"; when I say "this is ultimately just praying for generalization", there's the rejoinder "but it may in fact actually generalize"; add to all of this the fact that a bunch of people will be trying to prevent the problem and it seems weird to be so confident in doom.

A lot of my questions are going to be of the form "it seems like this is a way that we could survive; it definitely involves luck and does not say good things about our civilization, but it does not seem as improbable as the word 'miracle' would imply"

[Yudkowsky][15:00]

heh. from my standpoint, I'd say of this that it reflects those old experiments where if you ask people for their "expected case" it's indistinguishable from their "best case" (since both of these involve visualizing various things going on their imaginative mainline, which is to say, as planned) and reality is usually worse than their "worst case" (because they didn't adjust far enough away from their best-case anchor towards the statistical distribution for actual reality when they were trying to imagine a few failures and disappointments of the sort that reality had previously delivered)

it rhymes with the observation that it's incredibly hard to find people - even inside the field of computer security - who really have what Bruce Schneier termed the security mindset, of asking how to break a cryptography scheme, instead of imagining how your cryptography scheme could succeed

from my perspective, people are just living in a fantasy reality which, if we were actually living in it, would not be full of failed software projects or rocket prototypes that blow up even after you try quite hard to get a system design about which you made a strong prediction that it wouldn't explode

they think something special has to go wrong with a rocket design, that you must have committed some grave unusual sin against rocketry, for the rocket to explode

as opposed to every rocket wanting really strongly to explode and needing to constrain every aspect of the system to make it not explode and then the first 4 times you launch it, it blows up anyways

why? because of some particular technical issue with O-rings, with the flexibility of rubber in cold weather?

[Shah][15:05]

(I have read your Rocket Alignment and security mindset posts. Not claiming this absolves me of bias, just saying that I am familiar with them)

[Yudkowsky][15:05]

no, because the strains and temperatures in rockets are large compared to the materials that we use to make up the rockets

the fact that sometimes people are wrong in their uncertain guesses about rocketry does not make their life easier in this regard

the less they understand, the less ability they have to force an outcome within reality

it's no coincidence that when you are Wrong about your rocket, the particular form of Being Wrong that reality delivers to you as a surprise message, is not that you underestimated the strength of steel and so your rocket went to orbit and came back with fewer scratches on the hull than expected

when you are working with powerful forces there is not a symmetry around pleasant and unpleasant surprises being equally likely relative to your first-order model. if you're a good Bayesian, they will be equally likely relative to your second-order model, but this requires you to be HELLA pessimistic, indeed, SO PESSIMISTIC that sometimes you are pleasantly surprised

which looks like such a bizarre thing to a mundane human that they will gather around and remark at the case of you being pleasantly surprised

they will not be used to seeing this

and they shall say to themselves, "haha, what pessimists"

because to be unpleasantly surprised is so ordinary that they do not bother to gather and gossip about it when it happens
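
(Illustrative aside, not from the chat.) A rough numerical sketch of the asymmetry described in these messages, under the assumption that real outcomes follow a right-skewed distribution. The lognormal shape and its parameters are hypothetical stand-ins for "cost" or "time to success"; the "as planned" estimate is taken to be the single most likely value, and the "calibrated" estimate is the one that makes pleasant and unpleasant surprises equally likely.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical outcome distribution: most results sit near the plan,
# with a long tail of bad overruns (lognormal with mu=0, sigma=1).
outcomes = [random.lognormvariate(0.0, 1.0) for _ in range(20_000)]


def unpleasant_fraction(estimate):
    """Fraction of outcomes that come in worse (higher) than the estimate."""
    return sum(o > estimate for o in outcomes) / len(outcomes)


# "Everything goes as planned": the mode of lognormal(0, 1) is exp(mu - sigma^2).
as_planned = math.exp(-1.0)

# Estimate at which surprises are symmetric: the median of the outcomes.
calibrated = statistics.median(outcomes)

frac_planned = unpleasant_fraction(as_planned)
frac_calibrated = unpleasant_fraction(calibrated)
```

Under these assumptions, an estimate anchored on the most likely case is beaten by reality in the large majority of draws, while the estimate that yields symmetric surprises has to sit well above the "as planned" figure.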

my fundamental sense about the other parties in this debate, underneath all the technical particulars, is that they've constructed a Murphy-free fantasy world from the same fabric that weaves crazy optimistic software project estimates and brilliant cryptographic codes whose inventors didn't quite try to break them, and are waiting to go through that very common human process of trying out their optimistic idea, letting reality gently correct them, predictably becoming older and wiser and starting to see the true scope of the problem, and so in due time becoming one of those Pessimists who tell the youngsters how ha ha of course things are not that easy

this is how the cycle usually goes

the problem is that instead of somebody's first startup failing and them then becoming much more pessimistic about lots of things they thought were easy and then doing their second startup

the part where they go ahead optimistically and learn the hard way about things in their chosen field which aren't as easy as they hoped

[Shah][15:13]

Do you want to bet on that? That seems like a testable prediction about beliefs of real people in the not-too-distant future

[Yudkowsky][15:13]

kills everyone

not just them

everyone

this is an issue

how on Earth would we bet on that if you think the bet hasn't already resolved? I'm describing the attitudes of people that I see right now today.

[Shah][15:15]

Never mind, I wanted to bet on "people becoming more pessimistic as they try ideas and see them fail", but if your idea of "see them fail" is "superintelligence kills everyone" then obviously we can't bet on that

(people here being alignment researchers, obviously ones who are not me)

[Yudkowsky][15:17]

there is some element here of the Bayesian not updating in a predictable direction, of executing today the update you know you'll make later, of saying, "ah yes, I can see that I am in the same sort of situation as the early AI pioneers who thought maybe it would take a summer and actually it was several decades because Things Were Not As Easy As They Imagined, so instead of waiting for reality to correct me, I will imagine myself having already lived through that and go ahead and be more pessimistic right now, not just a little more pessimistic, but so incredibly pessimistic that I am as likely to be pleasantly surprised as unpleasantly surprised by each successive observation, which is even more pessimism than even some sad old veterans manage", an element of genre-savviness, an element of knowing the advice that somebody would predictably be shouting at you from outside, of not just blindly enacting the plot you were handed

and I don't quite know why this is so much less common than I would have naively thought it would be

why people are content with enacting the predictable plot where they start out cheerful today and get some hard lessons and become pessimistic later

they are their own scriptwriters, and they write scripts for themselves about going into the haunted house and then splitting up the party

I would not have thought that to defy the plot was such a difficult thing for an actual human being to do

that it would require so much reflectivity or something, I don't know what else

nor do I know how to train other people to do it if they are not doing it already

but that from my perspective is the basic difference in gloominess

I am a time-traveler who came back from the world where it (super duper predictably) turned out that a lot of early bright hopes didn't pan out and various things went WRONG and alignment was HARD and it was NOT SOLVED IN ONE SUMMER BY TEN SMART RESEARCHERS

and now I am trying to warn people about this development which was, from a certain perspective, really quite obvious and not at all difficult to see coming

but people are like, "what the heck are you doing, you are enacting the wrong part of the plot, people are currently supposed to be cheerful, you can't prove that anything will go wrong, why would I turn into a grizzled veteran before the part of the plot where reality hits me over the head with the awful real scope of the problem and shows me that my early bright ideas were way too optimistic and naive"

and I'm like "no you don't get it, where I come from, everybody died and didn't turn into grizzled veterans"

and they're like "but that's not what the script says we do next"... or something, I do not know what leads people to think like this because I do not think like that myself

[Soares][15:24]

(I think what they actually do is say "it's not obvious to me that this is one of those scenarios where we become grizzled veterans, as opposed to things just actually working out easily")

("many things work out easily all the time; obviously society spends a bunch more focus on things that don't work out easily b/c the things that work easily tend to get resolved fairly quickly and then you don't notice them", or something)

(more generally, I kinda suspect that bickering closer to the object level is likely more productive)

(and i suspect this convo might be aided by Rohin naming a concrete scenario where things go well, so that Eliezer can lament the lack of genre-savviness in various specific points)

[Yudkowsky][15:26]

there are, of course, lots of more local technical issues where I can specifically predict the failure mode for somebody's bright-eyed naive idea, especially when I already invented a more sophisticated version a decade or two earlier, and this is what I've usually tried to discuss

[Soares: ❤️]

because conversations like that can sometimes make any progress

[Soares][15:26]

(and possibly also Eliezer naming a concrete story where things go poorly, so that Rohin may lament the seemingly blind pessimism & premature grizzledness)

[Yudkowsky][15:27]

whereas if somebody lacks the ability to see the warning signs of which genre they are in, I do not know how to change the way they are by talking at them

[Shah][15:28]

Unsurprisingly I have disagreements with the meta-level story, but it seems really thorny to make progress on and I'm kinda inclined to not discuss it. I also should go to sleep now.

One thing it did make me think of -- it's possible that the "do it correctly on your first try when a failed attempt kills you" could be the crux here. There's a clearly-true sense which is "the first time you build a superintelligence that you cannot control, if you have failed in your alignment, then you die". There's a different sense which is "and also, anything you try to do with non-superintelligences that you can control, will tell you approximately nothing about the situation you face when you build a superintelligence". I mostly don't agree with the second sense, but if Eliezer / Nate do agree with it, that would go a long way to explaining the confidence in doom.

Two arguments I can see for the second sense: (1) the non-superintelligences only seem to respond well to alignment schemes because they don't yet have the core of general intelligence, and (2) the non-superintelligences only seem to respond well to alignment schemes because despite being misaligned they are doing what we want in order to survive and later execute a treacherous turn. EDIT: And (3) fast takeoff = not much time to look at the closest non-dangerous examples

(I still should sleep, but would be interested in seeing thoughts tomorrow, and if enough people think it's actually worthwhile to engage on the meta level I can do that. I'm cheerful about engaging on specific object-level ideas.)

[Soares: 💤]

[Yudkowsky][15:28]

it's not that early failures tell you nothing

the failure of the 1956 Dartmouth Project to produce strong AI over a summer told those researchers something

it told them the problem was harder than they'd hoped on the first shot

it didn't show them the correct way to build AGI in 1957 instead

[Bensinger][16:41]

Linking to a chat log between Eliezer and some anonymous people (and Steve Omohundro) from early September: [https://www.lesswrong.com/posts/CpvyhFy9WvCNsifkY/discussion-with-eliezer-yudkowsky-on-agi-interventions]

Eliezer tells me he thinks it pokes at some of Rohin's questions

[Yudkowsky][16:48]

I'm not sure that I can successfully, at this point, go back up and usefully reply to the text that scrolled past - I also note some internal grinding about this having turned into a thing which has Pending Replies instead of Scheduled Work Hours - and this maybe means that in the future we shouldn't have such a general chat here, which I didn't anticipate before the fact. I shall nonetheless try to pick out some things and reply to them.

[Shah: 👍]

While I think people agree on the behaviors of corrigibility, I am not sure they agree on why we want it. Eliezer wants it for surviving failures, but maybe others want it for "dialing in on goodness". When I think about a "broad basin of corrigibility", that intuitively seems more compatible with the "dialing in on goodness" framing (but this is an aesthetic judgment that could easily be wrong).

This is a weird thing to say in my own ontology.

There's a general project of AGI alignment where you try to do some useful pivotal thing, which has to be powerful enough to be pivotal, and so you somehow need a system that thinks powerful thoughts in the right direction without it killing you.

This could include, for example:

1. Trying to train in "low impact" via an RL loss function that penalizes a sufficiently broad range of "impacts" that we hope the learned impact penalty generalizes to all the things we'd consider impacts - even as we scale up the system, without the sort of obvious pathologies that would materialize only over options available to sufficiently powerful systems, like sending out nanosystems to erase the visibility of its actions from human observers
2. Tweaking MCTS search code so that it behaves in the fashion of "mild optimization" or "taskishness" instead of searching as hard as it has power available to search
3. Exposing the system to lots of labeled examples of relatively simple and safe instructions being obeyed, hoping that it generalizes safe instruction-following to regimes too dangerous for us to inspect outputs and label results
4. Writing code that tries to recognize cases of activation vectors going outside the bounds they occupied during training, as a check on whether internal cognitive conservatism is being violated or something is seeking out adversarial counterexamples to a constraint
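
(Illustrative aside, not from the chat.) A minimal sketch of the last idea in the list above, hardcoded rather than learned: record the per-dimension range that activation vectors occupied during training, then flag any later vector that leaves that range. The function names, the `tolerance` parameter, and the tiny three-dimensional vectors are hypothetical; a real system would have far larger activations and would need a far less naive notion of "bounds".

```python
def fit_bounds(training_activations):
    """Record the (min, max) seen in each dimension during training."""
    per_dimension = zip(*training_activations)
    return [(min(values), max(values)) for values in per_dimension]


def out_of_bounds(bounds, activation, tolerance=0.0):
    """Return the indices of dimensions that fall outside the training range."""
    return [
        i
        for i, (x, (low, high)) in enumerate(zip(activation, bounds))
        if x < low - tolerance or x > high + tolerance
    ]


# Hypothetical activation vectors logged during training.
train = [
    [0.1, -1.0, 2.0],
    [0.4, -0.5, 2.5],
    [0.2, -0.8, 1.9],
]
bounds = fit_bounds(train)

inside = out_of_bounds(bounds, [0.3, -0.7, 2.1])   # within every range
outside = out_of_bounds(bounds, [0.3, 4.0, 2.1])   # dimension 1 is far out
```

A check like this only reports that something left the training range; it carries no information about why, and it does nothing against inputs that stay inside the box while still being adversarial.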

You could say that only parts 1 and 3 are "dialing in on goodness" because only those parts involve iteratively refining a target, or you could say that all 4 parts are "dialing in on goodness" because parts 2 and 4 help you stay alive while you're doing the iterative refining. But I don't see this distinction as fundamental or particularly helpful. What if, on part 4, you were training something to recognize out-of-bounds activations, instead of trying to hardcode it? Is that dialing in on goodness? Or is it just dialing in on survivability or corrigibility or whatnot? Or maybe even part 3 isn't really "dialing in on goodness" because the true distinction between Good and Evil is still external in the programmers and not inside the system?

I don't see this as an especially useful distinction to draw. There's a hardcoded/learned distinction that probably does matter in several places. There's a maybe-useful forest-level distinction between "actually doing the pivotal thing" and "not destroying the world as a side effect" which breaks down around the trees because the very definition of "that pivotal thing you want to do" is to do that thing and not to destroy the world.

And all of this is a class of shallow ideas that I can generate in great quantity. I now and then consider writing up the ideas like this, just to make clear that I've already thought of way more shallow ideas like this than the net public output of the entire rest of the alignment field, so it's not that my concerns of survivability stem from my having missed any of the obvious shallow ideas like that.

The reason I don't spend a lot of time talking about it is not that I haven't thought of it, it's that I've thought of it, explored it for a while, and decided not to write it up because I don't think it can save the world and the infinite well of shallow ideas seems more like a distraction from the level of miracle we would actually need.

As a starting point: you say that an agent that makes plans but doesn't execute them is also dangerous, because it is the plan itself that lases, and corrigibility is antithetical to lasing. Does this mean you predict that you, or I, with suitably enhanced intelligence and/or reflectivity, would not be capable of producing a plan to help an alien civilization optimize their world, with that plan being corrigible w.r.t the aliens? (This seems like a strange and unlikely position to me, but I don't see how to not make this prediction under what I believe to be your beliefs. Maybe you just bite this bullet.)

I 'could' corrigibly help the Babyeaters in the sense that I have a notion of what it would mean to corrigibly help them, and if I wanted to do that thing for some reason, like an outside super-universal entity offering to pay me a googolplex flops of eudaimonium if I did that one thing, then I could do that thing. Absent the superuniversal entity bribing me, I wouldn't want to behave corrigibly towards the Babyeaters.

This is not a defect of myself as an individual. The Superhappies would also be able to understand what it would be like to be corrigible; they wouldn't want to behave corrigibly towards the Babyeaters, because, like myself, they don't want exactly what the Babyeaters want. In particular, we would rather the universe be other than it is with respect to the Babyeaters eating babies.

[Shah: 👍]

22. Follow-ups

[Shah][0:33] (Nov. 8)

[...] Absent the superuniversal entity bribing me, I wouldn't want to behave corrigibly towards the Babyeaters. [...]

Got it. Yeah I think I just misunderstood a point you were saying previously. When Richard asked about systems that simply produce plans rather than execute them, you said something like "the plan itself is dangerous", which I now realize meant "you don't get additional safety from getting to read the plan, the superintelligence would have just chosen a plan that was convincing to you but nonetheless killed everyone / otherwise worked in favor of the superintelligence's goals", but at the time I interpreted it as "any reasonable plan that can actually build nanosystems is going to be dangerous, regardless of the source", which seemed obviously false in the case of a well-motivated system.

[...] This is a weird thing to say in my own ontology. [...]

When I say "dialing in on goodness", I mean a specific class of strategies for getting a superintelligence to do a useful pivotal thing, in which you build it so that the superintelligence is applying its force towards figuring out what it is that you actually want it to do and pursuing that, which among other things would involve taking a pivotal act to reduce x-risk to ~zero.

I previously had the mistaken impression that you thought this class of strategies was probably doomed because it was incompatible with expected utility theory, which seemed wrong to me. (I don't remember why I had this belief; possibly it was while I was still misunderstanding what you meant by "corrigibility" + the claim that corrigibility is anti-natural.)

I now think that you think it is probably doomed for the same reason that most other technical strategies are probably doomed, which is that there still doesn't seem to be any plausible way of loading in the right target to the superintelligence, even when that target is a process for learning-what-to-optimize, rather than just what-to-optimize.

Linking to a chat log between Eliezer and some anonymous people (and Steve Omohundro) from early September: [https://www.lesswrong.com/posts/CpvyhFy9WvCNsifkY/discussion-with-eliezer-yudkowsky-on-agi-interventions]
Eliezer tells me he thinks it pokes at some of Rohin's questions

I'm surprised that you think this addresses (or even pokes at) my questions. As far as I can tell, most of the questions there are about social dynamics, which I've been explicitly avoiding, and the "technical" questions seem to treat "AGI" or "superintelligence" as a symbol; there don't seem to be any internal gears underlying that symbol. The closest anyone got to internal gears was mentioning iterated amplification as a way of bootstrapping known-safe things to solving hard problems, and that was very brief.

I am much more into the question "how difficult is technical alignment". It seems like answers to this question need to be in one of two categories: (1) claims about the space of minds that lead to intelligent behavior (probably weighted by simplicity, to account for the fact that we'll get the simple ones first), (2) claims about specific methods of building superintelligences. As far as I can tell the only thing in that doc which is close to an argument of this form is "superintelligent consequentialists would find ways to manipulate humans", which seems straightforwardly true (when they are misaligned). I suppose one might also count the assertion that "the speedup step of iterated amplification will introduce errors" as an argument of this form.

It could be that you are trying to convince me of some other beliefs that I wasn't asking about, perhaps in the hopes of conveying some missing mood, but I suspect that it is just that you aren't particularly clear on what my beliefs are / what I'm interested in. (Not unreasonable, given that I've been poking at your models, rather than the other way around.) I could try saying more about that, if you'd like.

[Tallinn][11:39] (Nov. 12)

FWIW, a voice from the audience: +1 to going back to sketching concrete scenarios. even though i learned a few things from the abstract discussion of goodness/corrigibility/etc myself (eg, that “corrigible” was meant to be defined at the limit of self-improvement till maturity, not just as a label for code that does not resist iterated development), the progress felt more tangible during the “scaled up muzero” discussion above.

[Yudkowsky][15:03] (Nov. 12)

anybody want to give me a prompt for a concrete question/scenario, ideally a concrete such prompt but I'll take whatever?

[Soares][15:34] (Nov. 12)

Not sure I count, but one I'd enjoy a concrete response to: "The leading AI lab vaguely thinks it's important that their systems are 'mere predictors', and wind up creating an AGI that is dangerous; how concretely does it wind up being a scary planning optimizer or whatever, that doesn't run through a scary abstract 'waking up' step".

(asking for a friend; @Joe Carlsmith or whoever else finds this scenario unintuitive plz clarify with more detailed requests if interested)

23. November 13 conversation

23.1. GPT-n and goal-oriented aspects of human reasoning

[Shah][1:46]

I'm still interested in:

5. More concreteness on how optimization generalizes but corrigibility doesn't, in the case where the AI was trained by human judgment on weak-safe domains

Specifically, we can go back to the scaled-up MuZero example. Some (lightly edited) details we had established there:

Pretraining: playing all the videogames, predicting all the text and images, solving randomly generated computer puzzles, accomplishing sets of easily-labelable sensorimotor tasks using robots and webcams
Finetuning: The AI system is being trained to act well on the Internet, and it's shown some tweet / email / message that a user might have seen, and asked to reply to the tweet / email / message. User says whether the replies are good or not (perhaps via comparisons, a la Deep RL from Human Preferences). It would be more varied than that, but would not include "building nanosystems".
The AI system is not smart enough that exposing humans to text it generates is already a world-wrecking threat if the AI is hostile.
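
(Illustrative aside, not from the chat.) A minimal sketch of how "user says whether the replies are good or not (perhaps via comparisons)" can be turned into a training signal: fit one scalar reward per reply from pairwise preferences using a Bradley-Terry-style model, in the spirit of preference-based reward learning. The reply indices, the comparison data, and the hyperparameters are hypothetical, and a real setup would fit a neural reward model over text rather than a table of three numbers.

```python
import math


def fit_rewards(n_items, comparisons, steps=2000, lr=0.1):
    """Fit one scalar reward per reply from (winner, loser) index pairs.

    Model: P(i preferred over j) = sigmoid(r[i] - r[j]).
    Training: plain gradient ascent on the log-likelihood.
    """
    rewards = [0.0] * n_items
    for _ in range(steps):
        grad = [0.0] * n_items
        for winner, loser in comparisons:
            p_win = 1.0 / (1.0 + math.exp(-(rewards[winner] - rewards[loser])))
            # d/dr of log sigmoid(r_w - r_l) is (1 - p_win) for the winner
            # and -(1 - p_win) for the loser.
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        rewards = [r + lr * g for r, g in zip(rewards, grad)]
    return rewards


# Hypothetical labels: the user usually prefers reply 0 to 1, and 1 to 2,
# with a little noise so the maximum-likelihood rewards stay finite.
comparisons = (
    [(0, 1)] * 3 + [(1, 0)]
    + [(1, 2)] * 3 + [(2, 1)]
    + [(0, 2)] * 3 + [(2, 0)]
)
rewards = fit_rewards(3, comparisons)
```

The fitted rewards recover the ordering implied by the labels (reply 0 above reply 1 above reply 2). What the sketch cannot show is the question being asked in this section: what such a learned reward does on inputs unlike anything the user ever compared.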

At that point we moved from concrete to abstract:

Abstract description: train on 'weak-safe' domains where the AI isn't smart enough to do damage, and the humans can label the data pretty well because the AI isn't smart enough to fool them
Abstract problem: Optimization generalizes and corrigibility fails

I would be interested in a more concrete description here. I'm not sure exactly what details I'm looking for -- on my ontology the question is something like "what algorithm is the AI system forced to learn; how does that lead to generalized optimization and failed corrigibility; why weren't there simple safer algorithms that were compatible with the training, or if there were such algorithms why didn't the AI system learn them". I don't really see how to answer all of that without abstraction, but perhaps you'll have an answer anyway

(I am hoping to get some concrete detail on "how did it go from non-hostile to hostile", though I suppose you might confidently predict that it was already hostile after pretraining, conditional on it being an AGI at all. I can try devising a different concrete scenario if that's a blocker.)

[Yudkowsky][11:09]

I am hoping to get some concrete detail on "how did it go from non-hostile to hostile"

MuZero is intrinsically dangerous for reasons essentially isomorphic to the way that AIXI is intrinsically dangerous: It tries to remove humans from its environment when playing Reality for the same reasons it stomps a Goomba if it learns how to play Super Mario Bros 1, because it has some goal and the Goomba is in the way. It doesn't need to learn anything more to be that way, except for learning what a Goomba/human is within the current environment.

The question is more "What kind of patches might it learn for a weak environment if optimized by some hill-climbing optimization method and loss function not to stomp Goombas there, and how would those patches fail to generalize to not stomping humans?"

Agree or disagree so far?

[Shah][12:07]

Agree assuming that it is pursuing a misaligned goal, but I am also asking what misaligned goal it is pursuing (and depending on the answer, maybe also how it came to be pursuing that misaligned goal given the specified training setup).

In fact I think "what misaligned goal is it pursuing" is probably the more central question for me

[Yudkowsky][12:14]

well, obvious abstract guess is: something whose non-maximal "optimum" (that is, where the optimization ended up, given about how powerful the optimization was) coincided okayish with the higher regions of the fitness landscape (lower regions of the loss landscape) that could be reached at all, relative to its ancestral environment

I feel like it would be pretty hard to blindly guess, in advance, at my level of intelligence, without having seen any precedents, what the hell a Human would look like, as a derivation of "inclusive genetic fitness"

[Shah][12:15]

Yeah I agree with that in the abstract, but have had trouble giving compelling-to-me concrete examples

Yeah I also agree with that

[Yudkowsky][12:15]

I could try to make up some weird false specifics if that helps?

[Shah][12:16]

To be clear I am fine with "this is a case where we predictably can't have good concrete stories and this does not mean we are safe" (and indeed argued the same thing in a doc I linked here many messages ago)

But weird false specifics could still be interesting

Although let me think if it is actually valuable

Probably it is not going to change my mind very much on alignment difficulty, if it is "weird false specifics", so maybe this isn't the most productive line of discussion. I'd be "selfishly" interested, in that "weird false specifics" seem good for helping me generate novel thoughts about these sorts of scenarios, but that seems like a bad use of this Discord

I think given the premises that (1) superintelligence is coming soon, (2) it pursues a misaligned goal by default, and (3) we currently have no technical way of preventing this and no realistic-seeming avenues for generating such methods, I am very pessimistic. I think (2) and (3) are the parts that I don't believe and am interested in digging into, but perhaps "concrete stories" doesn't really work for this.

[Yudkowsky][12:26]

with any luck - though I'm not sure I actually expect that much luck - this would be something Redwood Research could tell us about, if they can learn a nonviolence predicate over GPT-3 outputs and then manage to successfully mutate the distribution enough that we can get to see what was actually inside the predicate instead of "nonviolence"

[Shah: 👍]

or, like, 10% of what was actually inside it

or enough that people have some specifics to work with when it comes to understanding how gradient descent learning a function over outcomes from human feedback relative to a distribution, doesn't just learn the actual function the human is using to generate the feedback (though, if this were learned exactly, it would still be fatal given superintelligence)
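
(Illustrative aside, not from the chat.) A toy sketch of the first half of this point: a function fit to feedback on a narrow distribution can match that feedback almost perfectly there and still diverge badly once an optimizer pushes outside it. The `human_feedback` function, the one-dimensional "outcome" variable, and the linear model are all hypothetical; the sketch does not address the parenthetical claim about what happens when the feedback function is learned exactly.

```python
def human_feedback(x):
    """Hypothetical 'true' feedback: more is better up to 1.0, then worse."""
    return x if x <= 1.0 else 2.0 - x


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


# Training distribution: outcomes only between 0.0 and 1.0.
train_xs = [i / 10 for i in range(11)]
a, b = fit_line(train_xs, [human_feedback(x) for x in train_xs])


def learned(x):
    """The learned stand-in for the feedback function."""
    return a * x + b


# On the training distribution the learned function looks essentially perfect.
train_error = max(abs(learned(x) - human_feedback(x)) for x in train_xs)

# A stronger optimizer can reach outcomes far outside that distribution.
options = [i / 10 for i in range(51)]  # 0.0 through 5.0
chosen = max(options, key=learned)
```

Here the learned function picks the most extreme available option, which it scores highly and which the original feedback function would score well below anything seen in training.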

[Shah][12:33]

In this framing I do buy that you don't learn exactly the function that generates the feedback -- I have ~5 contrived specific examples where this is the case (i.e. you learn something that wasn't what the feedback function would have rewarded in a different distribution)

(I'm now thinking about what I actually want to say about this framing)

Actually, maybe I do think you might end up learning the function that generates the feedback. Not literally exactly, if for no other reason than rounding errors, but well enough that the inaccuracies don't matter much. The AGI presumably already knows and understands the concepts we use based on its pretraining; is it really so shocking if gradient descent hooks up those concepts in the right way? (GPT-3 on the other hand doesn't already know and understand the relevant concepts, so I wouldn't predict this of GPT-3.) I do feel though like this isn't really getting at my reason for (relative) optimism, and that reason is much more like "I don't really buy that AGI must be very coherent in a way that would prevent corrigibility from working" (which we could discuss if desired)

On the comment that learning the exact feedback function is still fatal -- I am unclear on why you are so pessimistic on having "human + AI" supervise "AI", in order to have the supervisor be smarter than the thing being supervised. (I think) I understand the pessimism that the learned function won't generalize correctly, but if you imagine that magically working, I'm not clear what additional reason prevents the "human + AI" supervising "AI" setup.

I can see how you die if the AI ever becomes misaligned, i.e. there isn't a way to fix mistakes, but I don't see how you get the misaligned AI in the first place.
I could also see things like "Just like a student can get away with plagiarism even when the teacher is smarter than the student, the AI knows more about its cognition than the human + AI system, and so will likely be incentivized to do bad things that it knows are bad but the human + AI system doesn't know are bad". But that sort of thing seems solvable with future research, e.g. debate, interpretability, red teaming all seem like feasible approaches.

[Yudkowsky][13:06]

what's a "human + AI"? can you give me a more concrete version of that scenario, either one where you expect it to work, or where you yourself have labeled the first point you expect it to fail and you want to know whether I see an earlier failure than that?

[Shah][13:09]

One concrete training algorithm would be debate, ideally with mechanisms that allow the AI systems to "look into each other's thoughts" and make credible statements about them, but we can skip that for now as it isn't very concrete

Would you like a training domain and data as well?

I don't like the fact that a smart AI system in this position could notice that it is playing against itself and decide not to participate in a zero-sum game, but I am not sure if that worry actually makes sense or not

(Debate can be thought of as simultaneously "human + first AI evaluate second AI" and "human + second AI evaluate first AI")

[Yudkowsky][13:12]

further concreteness, please! what pivotal act is it training for? what are the debate contents about?

[Shah][13:16]

You start with "easy" debates like mathematical theorem proving or fact-based questions, and ramp up until eventually the questions are roughly "what is the next thing to do in order to execute a pivotal act"

Intermediate questions might be things like "is it a good idea to have a minimum wage"

[Yudkowsky][13:17]

so, like, "email ATTTTGAGCTTGCC... to the following address, mix the proteins you receive by FedEx in a water-saline solution at 2 degrees Celsius..." for the final stage?

[Shah][13:17]

Yup, that could be it

Humans are judging debates based on reasoning though, not just outcomes-after-executing-the-plan

[Yudkowsky][13:19]

okay. let's suppose you manage to prevent both AGIs from using logical decision theory to coordinate with each other. both AIs tell their humans that the other AI's plans are murderous. now what?

[Shah][13:19]

So assuming perfect generalization there should be some large implicit debate tree that justifies the plan in human-understandable form

[Yudkowsky][13:20]

yah, I flatly disbelieve that entire development scheme, so we should maybe back up.

people fiddled around with GPT-4 derivatives and never did get them to engage in lines of printed reasoning that would design interesting new stuff. now what?

Living Zero (a more architecturally complicated successor of Mu Zero) is getting better at designing complicated things over on its side while that's going on, whatever it is

[Shah][13:23]

Okay, so the worry is that this just won't scale, not that (assuming perfect generalization) it is unsafe? Or perhaps you also think it is unsafe but it's hard to engage with because you don't believe it will scale?

And the issue is that relying on reasoning confines you to a space of possible thoughts that doesn't include the kinds of thoughts required to develop new stuff (e.g. intuition)?

[Yudkowsky][13:25]

mostly I have found these alleged strategies to be too permanently abstract, never concretized, to count as admissible hypotheses. if you ask me to concretize them myself, I think that unelaborated giant transformer stacks trained on massive online text corpuses fail to learn smart-human-level engineering reasoning before the world ends. If that were not true, I would expect Paul-style schemes to blow up on the distillation step, but first failures first.

[Shah][13:26]

What additional concrete detail do you want?

It feels like I specified something that we could code up a stupidly inefficient version of now

[Yudkowsky][13:27]

Great. Describe the stupidly inefficient version?

[Shah][13:33]

In terms of what actually happens: Each episode, there is an initial question specified by the human. Agent A and agent B, which are copies of the same neural net, simultaneously produce statements ("answers"). They then have a conversation. At the end the human judge decides which answer is better, and rewards the appropriate agent. The agents are updated using some RL algorithm.

I can say stuff about why we might hope this works, or about tricks you have to play in order to get learning to happen at all, or other things
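
As a rough illustration of the episode structure described above, here is a toy sketch (not anything the participants wrote). It assumes a drastically simplified setting: the "neural net" is a single softmax over three candidate answers, the conversation between the agents is omitted, and the `judge` function is a hypothetical stand-in for the human that simply prefers the answer "4". The REINFORCE-style update is one possible choice for "some RL algorithm", not the one the debate proposal prescribes.

```python
import math
import random


def softmax(logits):
    """Turn logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def judge(answer_a, answer_b):
    """Hypothetical stand-in for the human judge: 0 if A wins, 1 if B wins."""
    def score(ans):
        return 1 if ans == "4" else 0
    return 0 if score(answer_a) >= score(answer_b) else 1


def debate_episode(logits, answers, judge_fn, rng, lr=0.5):
    """One episode: two copies of the same policy answer, the judge picks a
    winner, and the shared parameters get a zero-sum REINFORCE-style update."""
    probs = softmax(logits)
    a = rng.choices(range(len(answers)), weights=probs)[0]  # agent A's answer
    b = rng.choices(range(len(answers)), weights=probs)[0]  # agent B's answer
    winner = judge_fn(answers[a], answers[b])
    rewards = [(a, 1.0 if winner == 0 else -1.0),
               (b, 1.0 if winner == 1 else -1.0)]
    new_logits = list(logits)
    for action, reward in rewards:
        for j in range(len(logits)):
            # Gradient of log softmax probability of `action` w.r.t. logit j.
            grad_log_prob = (1.0 if j == action else 0.0) - probs[j]
            new_logits[j] += lr * reward * grad_log_prob
    return new_logits


def train_debate(answers, judge_fn, episodes=500, seed=0):
    """Run many episodes and return the final answer distribution."""
    rng = random.Random(seed)
    logits = [0.0] * len(answers)
    for _ in range(episodes):
        logits = debate_episode(logits, answers, judge_fn, rng)
    return softmax(logits)
```

Calling `train_debate(["3", "4", "5"], judge)` shifts probability mass toward whatever the judge rewards. Note that the sketch only shows the training loop; it says nothing about whether the learned behavior generalizes, which is the point under dispute in the conversation.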

[Yudkowsky][13:35]

Are the agents also playing Starcraft or have they spent their whole lives inside the world of text?

[Shah][13:35]

For the stupidly inefficient version they could have spent their whole lives inside text

[Yudkowsky][13:37]

Okay. I don't think the pure-text versions of GPT-5 are being very good at designing nanosystems while Living Zero is ending the world.

[Shah][13:37]

In the stupidly inefficient version human feedback has to teach the agents facts about the real world

[Yudkowsky][13:37]

(It's called "Living Zero" because it does lifelong learning, in the backstory I've been trying to separately sketch out in a draft.)

[Shah][13:38]

Oh I definitely agree this is not competitive

So when you say this is too abstract, you mean that there isn't a story for how they incorporate e.g. physical real-world knowledge?

[Yudkowsky][13:39]

no, I mean that when I talk to Paul about this, I can't get Paul to say anything as concrete as the stuff you've already said

the reason why I don't expect the GPT-5s to be competitive with Living Zero is that gradient descent on feedforward transformer layers, in order to learn science by competing to generate text that humans like, would have to pick up on some very deep latent patterns generating that text, and I don't think there's an incremental pathway there for gradient descent to follow - if gradient descent even follows incremental pathways as opposed to finding lottery tickets, but that's a whole separate open question of artificial neuroscience.

in other words, humans play around with legos, and hominids play around with chipping flint handaxes, and mammals play around with spatial reasoning, and that's part of the incremental pathway to developing deep patterns for causal investigation and engineering, which then get projected into human text and picked up by humans reading text

it's just straightforwardly not clear to me that GPT-5 pretrained on human text corpuses, and then further posttrained by RL on human judgment of text outputs, ever runs across the deep patterns

where relatively small architectural changes might make the system no longer just a giant stack of transformers, even if that resulting system is named "GPT-5", and in this case, bets might be off, but also in this case, things will go wrong with it that go wrong with Living Zero, because it's now learning the more powerful and dangerous kind of work

[Shah][13:45]

That does seem like a disagreement, in that I think this process does eventually reach the "deep patterns", but I do agree it is unlikely to be competitive

[Yudkowsky][13:45]

I mean, if you take a feedforward stack of transformer layers the size of a galaxy and train it via gradient descent using all the available energy in the reachable universe, it might find something, sure

though this is by no means certain to be the case

[Shah][13:50]

It would be quite surprising to me if it took that much. It would be especially surprising to me if we couldn't figure out some alternative reasonably-simple training scheme like "imitate a human doing good reasoning" that still remained entirely in text that could reach the "deep patterns". (This is now no longer a discussion about whether the training scheme is aligned, not sure if we should continue it.)

I realize that this might be hard to do, but if you imagine that GPT-5 + human feedback finetuning does run across the deep patterns and could in theory do the right stuff, and also generalization magically works, what's the next failure?

[Yudkowsky][13:56]

what sort of deep thing does a hill-climber run across in the layers, such that the deep thing is the most predictive thing it found for human text about science?

if you don't visualize this deep thing in any detail, then it can in one moment be powerful, and in another moment be safe. it can have all the properties that you want simultaneously. who's to say otherwise? the mysterious deep thing has no form within your mind.

if one were to name specifically "well, it ran across a little superintelligence with long-term goals that it realized it could achieve by predicting well in all the cases that an outer gradient descent loop would probably be updating on", that sure doesn't end well for you.

this perhaps is not the first thing that gradient descent runs across. it wasn't the first thing that natural selection ran across to build things that ran the savannah and made more of themselves. but what deep pattern that is not pleasantly and unfrighteningly formless would gradient descent run across instead?

[Shah][14:00]

(Tbc by "human feedback finetuning" I mean debate, and I suspect that "generalization magically works" will be meant to rule out the thing that you say next, but seems worth checking so let me write an answer)

"the deep thing is the most predictive thing it found for human text about science?"

Wait, the most predictive thing? I was imagining it as just a thing that is present in addition to all the other things. Like, I don't think I've learned a "deep thing" that is most useful for riding a bike. Probably I'm just misunderstanding what you mean here.

I don't think I can give a good answer here, but to give some answer, it has a belief that there is a universe "out there", that lots but not all of the text it reads is making claims about (some aspect of) the universe, those claims can be true or false, there are some claims that are known to be true, there are some ways to take assumed-true claims and generate new assumed-true claims, which includes claims about optimal actions for goals, as well as claims about how to build stuff, or what the effect of a specified machine is

[Yudkowsky][14:10]

hell of a lot of stuff for gradient descent to run across in a stack of transformer layers. clearly the lottery-ticket hypothesis must have been very incorrect, and there was an incremental trail of successively more complicated gears that got trained into the system.

btw by "claims" are you meaning to make the jump to English claims? I was reading them as giant inscrutable vectors encoding meaningful propositions, but maybe you meant something else there.

[Shah][14:11]

In fact I am skeptical of some strong versions of the lottery ticket hypothesis, though it's been a while since I read the paper and I don't remember exactly what the original hypothesis was

Giant inscrutable vectors encoding meaningful propositions

[Yudkowsky][14:13]

oh, I'm not particularly confident of the lottery-ticket hypothesis either, though I sure do find it grimly amusing that a species which hasn't already figured that out one way or another thinks it's going to have deep transparency into neural nets all wrapped up in time to survive. but, separate issue.

"How does gradient descent even work?" "Lol nobody knows, it just does."

but, separate issue

[Shah][14:16]

How does strong lottery ticket hypothesis explain GPT-3? Seems like that should already be enough to determine that there's an incremental trail of successively more complicated gears

[Yudkowsky][14:18]

could just be that in 175B parameters, combinatorially combined through possible execution pathways, there is some stuff that was pretty close to doing all the stuff that GPT-3 ended up doing.

anyways, for a human to come up with human text about science, the human has to brood and think for a bit about different possible hypotheses that could account for the data, notice places where those hypotheses break down, tweak the hypotheses in their mind to make the errors go away; they would engineer an internal mental construct towards the engineering goal of making good predictions. if you're looking at orbital mechanics and haven't invented calculus yet, you invent calculus as a persistent mental tool that you can use to craft those internal mental constructs.

does the formless deep pattern of GPT-5 accomplish the same ends, by some mysterious means that is, formless, able to produce the same result, but not by any detailed means where if you visualized them you would be able to see how it was unsafe?

[Shah][14:24]

I expect that probably we will figure out some way to have adaptive computation time be a thing (it's been investigated for years now, but afaik hasn't worked very well), which will allow for this sort of thing to happen

In the stupidly inefficient version, you have a really really giant and deep neural net that does all of that in successive layers of the neural net. (And when it doesn't need to do that, those layers are noops.)

[Yudkowsky][14:26][14:32]

okay, so my question is, is there a little goal-oriented mind inside there that solves science problems the same way humans solve them, by engineering mental constructs that serve a goal of prediction, including backchaining for prediction goals and forward chaining from alternative hypotheses / internal tweaked states of the mental construct? or is there something else which solves the same problem, not how humans do it, without any internal goal orientation?

People who would not in the first place realize that humans solve prediction problems by internally engineering internal mental constructs in a goal-oriented way, would of course imagine themselves able to imagine a formless spirit which produces "predictions" without being "goal-oriented" because they lack an understanding of internal machinery and so can combine whatever surface properties and English words they want to yield a beautiful optimism

Or perhaps there is indeed some way to produce "predictions" without being "goal-oriented", which gradient descent on a great stack of transformer layers would surely run across; but you will pardon my grave lack of confidence that someone has in fact seen so much further than myself, when they don't seem to have appreciated in advance of my own questions why somebody who understood something about human internals would be skeptical of this.

If they're sort of visibly trying to come up with it on the spot after I ask the question, that's not such a great sign either.

This is not aimed particularly at you, but I hope the reader may understand something of why Eliezer Yudkowsky goes about sounding so gloomy all the time about other people's prospects for noticing what will kill them, by themselves, without Eliezer constantly hovering over their shoulder every minute prompting them with almost all of the answer.

[Shah][14:31]

Just to check my understanding: if we're talking about, say, how humans might go about understanding neural nets, there's a goal of "have a theory that can retrodict existing observations and make new predictions", backchaining might say "come up with hypotheses that would explain double descent", forward chaining might say "look into bias and variance measurements"?

If so, yes, I think the AGI / GPT-5-that-is-an-AGI is doing something similar

[Yudkowsky][14:33]

your understanding sounds okay, though it might make more sense to talk about a domain that human beings understand better than artificial neuroscience, for purposes of illustrating how scientific thinking works, since human beings haven't actually gotten very far with artificial neuroscience.

[Shah][14:33]

Fair point re using a different domain

To be clear I do not in fact think that GPT-N is safe because it is trained with supervised learning, and I am confused by the combination of views that GPT-N will be AGI and that GPT-N will be safe because it's just doing predictions

Maybe there is marginal additional safety but you clearly can't say it is "definitely safe" without some additional knowledge that I have not seen so far

Going back to the original question, of what the next failure mode of debate would be assuming magical generalization, I think it's just not one that makes sense to ask on your worldview / ontology; "magical generalization" is the equivalent of "assume that the goal-oriented mind somehow doesn't do dangerous optimization towards its goal, yet nonetheless produces things that can only be produced by dangerous optimization towards a goal", and so it is assuming the entire problem away

[Yudkowsky][14:41]

well YES

from my perspective the whole field of mental endeavor as practiced by alignment optimists consists of ancient alchemists wondering if they can get collections of surface properties, like a metal as shiny as gold, as hard as steel, and as self-healing as flesh, where optimism about such wonderfully combined properties can be infinite as long as you stay ignorant of underlying structures that produce some properties but not others

and, like, maybe you can get something as hard as steel, as shiny as gold, and resilient or self-healing in various ways, but you sure don't get it by ignorance of the internals

and not for a while

so if you need the magic sword in 2 years or the world ends, you're kinda dead

[Shah][14:46]

Potentially dumb question: when humans do science, why don't they then try to take over the world to do the best possible science? (If humans are doing dangerous goal-directed optimization when doing science, why doesn't that lead to catastrophe?)

You could of course say that they just aren't smart enough to do so, but it sure feels like (most) humans wouldn't want to do the best possible science even if they were smarter

I think this is similar to a question I asked before about plans being dangerous independent of their source, and the answer was that the source was misaligned

But in the description above you didn't say anything about the thing-doing-science being misaligned, so I am once again confused

[Yudkowsky][14:48]

boy, so many dumb answers to this dumb question:

even relatively "smart" humans are not very smart compared to other humans, such that they don't have a "take over the world" option available.
most humans who use Science were not smart enough to invent the underlying concept of Science for themselves from scratch; and Francis Bacon, who did, sure did want to take over the world with it.
groups of humans with relatively more Engineering sure did take over large parts of the world relative to groups that had relatively less.
Eliezer Yudkowsky clearly demonstrates that when you are smart enough you start trying to use Science and Engineering to take over your whole future lightcone, the other humans you're thinking of just aren't that smart, and, if they were, would inevitably converge towards Eliezer Yudkowsky, who is really a very typical example of a person that smart, even if he looks odd to you because you're not seeing the population of other dath ilani

I am genuinely not sure how to come up with a less dumb answer and it may require a more precise reformulation of the question

[Shah][14:50]

But like, in Eliezer's case, there is a different goal that is motivating him to use Science and Engineering for this purpose

It is not the prediction-goal that he instantiated in his mind as part of the method of doing Science

[Yudkowsky][14:52]

sure, and the mysterious formless thing within GPT-5 with "adaptive computation time" that broods and thinks, may be pursuing its prediction-subgoal for the sake of other goals, or be pursuing different subgoals of prediction separately without ever once having a goal of prediction, or have 66,666 different shards of desire across different kinds of predictive subproblems that were entrained by gradient descent which does more brute memorization and less Occam bias than natural selection

oh, are you asking why humans, when they do goal-oriented Science for the sake of their other goals, don't (universally always) stomp on their other goals while pursuing the Science part?

[Shah][14:54]

Well, that might also be interesting to hear the answer to -- I don't know how I'd answer that through an Eliezer-lens -- though it wasn't exactly what I was asking

[Yudkowsky][14:56]

basically the answer is "well, first of all, they do stomp on themselves to the extent that they're stupid; and to the extent that they're smart, pursuing X on the pathway to Y has a 'natural' structure for not stomping on Y which is simple and generalizes and obeys all the coherence theorems and can incorporate arbitrarily fine wiggles via epistemic modeling of those fine wiggles because those fine wiggles have a very compact encoding relative to the epistemic model, aka, predicting which forms of X lead to Y; and to the extent that group structures of humans can't do that simple thing coherently because of their cognitive and motivational partitioning, the group structures of humans are back to not being able to coherently pursue the final goal again"

[Shah][14:58]

(Going back to what I meant to ask) It seems to me like humans demonstrate that you can have a prediction goal without that being your final/terminal goal. So it seems like with AI you similarly need to talk about the final/terminal goal. But then we talked about GPT and debate and so on for a while, and then you explained how GPTs would have deep patterns that do dangerous optimization, where the deep patterns involved instantiating a prediction goal. Notably, you didn't say anything about a final/terminal goal. Do you see why I am confused?

[Yudkowsky][15:00]

so you can do prediction because it's on the way to some totally other final goal - the way that any tiny superintelligence or superhumanly-coherent agent, if an optimization method somehow managed to run across that early on, with an arbitrary goal, which also understood the larger picture, would make good predictions while it thought the outer loop was probably doing gradient descent updates, and bide its time to produce rather different "predictions" once it suspected the results were not going to be checked given what the inputs had looked like.

you can imagine a thing that does prediction the same way that humans optimize inclusive genetic fitness, by pursuing dozens of little goals that tend to cohere to good prediction in the ancestral environment

both of these could happen in order; you could get a thing that pursued 66 severed shards of prediction as a small mind, and which, when made larger, cohered into a utility function around the 66 severed shards that sum to something which is not good prediction and which you could pursue by transforming the universe, and then strategically made good predictions while it expected the results to go on being checked

[Shah][15:02]

OH you mean that the outer objective is prediction

[Yudkowsky][15:02]

[Shah][15:03]

I have for quite a while thought that you meant that Science involves internally setting a subgoal of "predict a confusing part of reality"

[Yudkowsky][15:03]

it... does?

I mean, that is true.

[Shah][15:04]

Okay wait. There are two things. One is that GPT-3 is trained with a loss function that one might call a prediction objective for human text. Two is that Science involves looking at a part of reality and figuring out how to predict it. These two things are totally different. I am now unsure which one(s) you were talking about in the conversation above

[Yudkowsky][15:06]

what I'm saying is that for GPT-5 to successfully do AGI-complete prediction of human text about Science, gradient descent must identify some formless thing that does Science internally in order to optimize the outer loss function for predicting human text about Science

just like, if it learns to predict human text about multiplication, it must have learned something internally that does multiplication

(afk, lunch/dinner)

[Shah][15:07]

Yeah, so you meant the first thing, and I misinterpreted as the second thing

(I will head to bed in this case -- I was meaning to do that soon anyway -- but I'll first summarize.)

[Yudkowsky][15:08]

I am concerned that there is still a misinterpretation going on, because the case I am describing is both things at once

there is an outer loss function that scores text predictions, and an internal process which for purposes of predicting what Science would say must actually somehow do the work of Science
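
To make the outer-loss versus inner-process distinction concrete, here is a toy sketch (an illustration under invented assumptions, not something from the chat). The outer loss below only scores the probability assigned to the true continuation. The two predictors, `memorizer` and `multiplier`, are hypothetical: one looks answers up in a table and the other actually does the multiplication. The data and names are made up for the example.

```python
import math


def nll(predictor, prompt, answer):
    """Outer loss: negative log-probability assigned to the true continuation."""
    p = predictor(prompt).get(answer, 0.0)
    return -math.log(max(p, 1e-9))  # clamp to avoid log(0)


def memorizer(prompt):
    """Hypothetical predictor that has only memorized its training text."""
    table = {"3*4=": "12", "2*5=": "10"}
    if prompt in table:
        return {table[prompt]: 1.0}
    return {}  # assigns (near) zero probability to everything else


def multiplier(prompt):
    """Hypothetical predictor whose internals actually do multiplication."""
    a, b = prompt.rstrip("=").split("*")
    return {str(int(a) * int(b)): 1.0}


def mean_loss(predictor, data):
    """Average outer loss over (prompt, answer) pairs."""
    return sum(nll(predictor, p, a) for p, a in data) / len(data)


train = [("3*4=", "12"), ("2*5=", "10")]
held_out = [("7*8=", "56")]
```

On `train` the outer loss cannot tell the two predictors apart; only on `held_out` does the one that does the work internally keep its loss low. The loss function itself never inspects how the prediction was produced.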

[Shah][15:09]

Okay let me look back at the conversation

"is there a little goal-oriented mind inside there that solves science problems the same way humans solve them, by engineering mental constructs that serve a goal of prediction, including backchaining for prediction goals and forward chaining from alternative hypotheses / internal tweaked states of the mental construct?"

Here, is the word "prediction" meant to refer to the outer objective and/or predicting what English sentences about Science one might say, or is it referring to a subpart of the Process Of Science in which one aims to predict some aspect of reality (which is typically not in the form of English sentences)?

[Yudkowsky][15:20]

it's here referring to the inner Science problem

[Shah][15:21]

Okay I think my original understanding was correct in that case

"from my perspective the whole field of mental endeavor as practiced by alignment optimists consists of ancient alchemists wondering if they can get collections of surface properties, like a metal as shiny as gold, as hard as steel, and as self-healing as flesh, where optimism about such wonderfully combined properties can be infinite as long as you stay ignorant of underlying structures that produce some properties but not others"

I actually think something like this might be a crux for me, though obviously I wouldn't put it the way you're putting it. More like "are arguments about internal mechanisms more or less trustworthy than arguments about what you're selecting for" (limiting to arguments we actually have access to, of course in the limit of perfect knowledge internal mechanisms beats selection). But that is I think a discussion for another day.

[Yudkowsky][15:29]

I think the critical insight - though it has a format that basically nobody except me ever visibly invokes in those terms, and I worry maybe it can only be taught by a kind of life experience that's very hard to obtain - is the realization that any consistent reasonable story about underlying mechanisms will give you less optimistic forecasts than the ones you get by freely combining surface desiderata

[Shah][1:38] (next day, Nov. 14)

(For the reader, I don't think that "arguments about what you're selecting for" is the same thing as "freely combining surface desiderata", though I do expect they look approximately the same to Eliezer)

Yeah, I think I do not in fact understand why that is true for any consistent reasonable story.

From my perspective, when I posit a hypothetical, you demonstrate that there is an underlying mechanism that produces strong capabilities that generalize combined with real world knowledge. I agree that a powerful AI system that we build capable of executing a pivotal act will have strong capabilities that generalize and real world knowledge. I am happy to assume for the purposes of this discussion that it involves backchaining from a target and forward chaining from things that you currently know or have. I agree that such capabilities could be used to cause an existential catastrophe (at least in a unipolar world, multipolar case is more complicated, but we can stick with unipolar for now). None of my arguments so far are meant to factor through the route of "make it so that the AGI can't cause an existential catastrophe even if it wants to".

The main question according to me is why those capabilities are aimed towards achievement of a misaligned goal.

It feels like when I try to ask why we have misaligned goals, I often get answers that are of the form "look at the deep patterns underlying the strong capabilities that generalize, obviously given a misaligned goal they would generate the plan of killing the humans who are an obstacle towards achieving that goal". This of course doesn't work since it's a circular argument.

I can generate lots of arguments for why it would be aimed towards achievement of a misaligned goal, such as (1) only a tiny fraction of goals are aligned; the rest are misaligned, (2) the feedback we provide is unlikely to be the right goal and even small errors are fatal, (3) lots of misaligned goals are compatible with the feedback we provide even if the feedback is good, since the AGI might behave well until it can execute a treacherous turn, (4) the one example of strategically aware intelligence (i.e. humans) is misaligned relative to its creator. (I'm not saying I agree with these arguments, but I do understand them.)

Are these the arguments that make you think that you get misaligned goals by default? Or is it something about "deep patterns" that isn't captured by "strong capabilities that generalize, real-world knowledge, ability to cause an existential catastrophe if it wants to"?

24. Follow-ups

[Yudkowsky][15:59] (Feb. 21, 2022)

So I realize it's been a bit, but looking over this last conversation, I feel unhappy about the MIRI conversations sequence stopping exactly here, with an unanswered major question, after I ran out of energy last time. I shall attempt to answer it, at least at all. CC @rohin @RobBensinger.

[Shah: 🙂]

[Ngo: 🙂]

[Bensinger: 🙂]

One basic large class of reasons has the form, "Outer optimization on a precise loss function doesn't get you inner consequentialism explicitly targeting that outer objective, just inner consequentialism targeting objectives which empirically happen to align with the outer objective given that environment and those capability levels; and at some point sufficiently powerful inner consequentialism starts to generalize far out-of-distribution, and, when it does, the consequentialist part generalizes much further than the empirical alignment with the outer objective function."

This, I hope, is by now recognizable to individuals of interest as an overly abstract description of what happened with humans, who one day started building Moon rockets without seeming to care very much about calculating and maximizing their personal inclusive genetic fitness while doing that. Their capabilities generalized much further out of the ancestral training distribution, than the empirical alignment of those capabilities on inclusive genetic fitness in the ancestral training distribution.

Another basic large class of reasons has the form, "Because the real objective is something that cannot be precisely and accurately shown to the AGI and the differences are systematic and important."

Suppose you have a bunch of humans classifying videos of real events or text descriptions of real events or hypothetical fictional scenarios in text, as desirable or undesirable, and assigning them numerical ratings. Unless these humans are perfectly free of, among other things, all the standard and well-known cognitive biases about e.g. differently treating losses and gains, the value of this sensory signal is not "The value of our real CEV rating what is Good or Bad and how much" nor even "The value of a utility function we've got right now, run over the real events behind these videos". Instead it is, in a systematic and real and visible way, "The result of running an error-prone human brain over this data to produce a rating on it."

This is not a mistake by the AGI, it's not something the AGI can narrow down by running more experiments, the correct answer as defined is what contains the alignment difficulty. If the AGI, or for that matter the outer optimization loop, correctly generalizes the function that is producing the human feedback, it will include the systematic sources of error in that feedback. If the AGI essays an experimental test of a manipulation that an ideal observer would see as "intended to produce error in humans" then the experimental result will be "Ah yes, this is correctly part of the objective function, the objective function I'm supposed to maximize sure does have this in it according to the sensory data I got about this objective."
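To make the point above concrete, here is a toy sketch. It is an illustration under invented assumptions, not anything from the original discussion: the "raters" weight losses twice as heavily as gains (the hypothetical `LOSS_AVERSION` constant), and the learner simply fits one slope for gains and one for losses. An accurate learner recovers the biased rating function, and more data only makes it recover the bias more precisely.

```python
import random

# Hypothetical strength of the raters' systematic bias: losses are felt
# twice as strongly as equally sized gains. This number is made up.
LOSS_AVERSION = 2.0


def human_rating(outcome, rng):
    """Rating from an error-prone rater: true value, distorted, plus noise."""
    biased = outcome * LOSS_AVERSION if outcome < 0 else outcome
    return biased + rng.gauss(0.0, 0.1)


def fit_slope(points):
    """Least-squares slope of a line through the origin."""
    num = sum(x * y for x, y in points)
    den = sum(x * x for x, _ in points)
    return num / den


def learn_rating_function(n_examples, seed=0):
    """Fit separate slopes for gains and losses from rated examples."""
    rng = random.Random(seed)
    outcomes = [rng.uniform(-1.0, 1.0) for _ in range(n_examples)]
    data = [(o, human_rating(o, rng)) for o in outcomes]
    gain_slope = fit_slope([(o, r) for o, r in data if o >= 0])
    loss_slope = fit_slope([(o, r) for o, r in data if o < 0])
    return gain_slope, loss_slope


if __name__ == "__main__":
    # More data sharpens the estimate of the bias; it does not remove it.
    for n in (100, 10_000):
        print(n, learn_rating_function(n))
```

In this toy setup the "true" value function would have slope 1 on both sides; the learner instead converges on the rating function that actually generated its data, bias included.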

People have fantasized about having the AGI learn something other than the true and accurate function producing its objective-describing data, as its actual objective, from the objective-describing data that it gets; I, of course, was the first person to imagine this and say it should be done, back in 2001 or so; unlike a lot of latecomers to this situation, I am skeptical of my own proposals and I know very well that I did not in fact come up with any reliable-looking proposal for learning 'true' human values off systematically erroneous human feedback.

Difficulties here are fatal, because a true and accurate learning of what is producing the objective-describing signal, will correctly imply that higher values of this signal obtain as the humans are manipulated or as they are bypassed with physical interrupts for control of the feedback signal. In other words, even if you could do a bunch of training on an outer objective, and get inner optimization perfectly targeted on that, the fact that it was perfectly targeted would kill you.

[Bensinger][23:15] (Feb. 27, 2022 follow-up comment)

This is the last log in the Late 2021 MIRI Conversations. We'll be concluding the sequence with a public Ask Me Anything (AMA) this Wednesday; you can start posting questions there now.

MIRI has found the Discord format useful, and we plan to continue using it going into 2022. This includes follow-up conversations between Eliezer and Rohin, and a forthcoming conversation between Eliezer and Scott Alexander of Astral Codex Ten.

Some concluding thoughts from Richard Ngo:

[Ngo][6:20] (Nov. 12 follow-up comment)

Many thanks to Eliezer and Nate for their courteous and constructive discussion and moderation, and to Rob for putting the transcripts together.

This debate updated me about 15% of the way towards Eliezer's position, with Eliezer's arguments about the difficulties of coordinating to ensure alignment responsible for most of that shift. While I don't find Eliezer's core intuitions about intelligence too implausible, they don't seem compelling enough to do as much work as Eliezer argues they do. As in the Foom debate, I think that our object-level discussions were constrained by our different underlying attitudes towards high-level abstractions, which are hard to pin down (let alone resolve).

Given this, I think that the most productive mode of intellectual engagement with Eliezer's worldview going forward is probably not to continue debating it (since that would likely hit those same underlying disagreements), but rather to try to inhabit it deeply enough to rederive his conclusions and find new explanations of them which then lead to clearer object-level cruxes. I hope that these transcripts shed sufficient light for some readers to be able to do so.

I am a time-traveler who came back from the world where it (super duper predictably) turned out that a lot of early bright hopes didn't pan out and various things went WRONG and alignment was HARD and it was NOT SOLVED IN ONE SUMMER BY TEN SMART RESEARCHERS

I think these kinds of comments update readers' beliefs in a bad, invalid way. The bad event (AGI ruin) is argued for by... a request for me to condition on testimony of a survivor of that bad event. Yes, I know the whole thing is tongue-in-cheek. I know that EY is not literally claiming to be a time-traveller.

But in TurnTrout-culture, "I experienced X" is something to be said when X has actually been experienced. "The fact that X" is to be said when X is actually supported by a heap of accepted evidence.^[1] "Have you met dath ilani?" is to be said when such entities actually exist and are not outputs of the model of intelligence which is being argued for. (Yes, that last one was flagged as a "bad argument", but still.)

This paragraph of EY self-fic didn't update me at all. But it almost did. When these statements are made, I am inclined to update my beliefs in the predictable way -- to gullibly update on claims -- unless I take special effort to not update on (checks dialogue) fictional evidence. Which effort I do take (as a matter of reflex, at this point), but that effort is a cost imposed on me.

^[1] This particular kind of misleading statement wasn't made in this dialogue, but I've seen it made erroneously-according-to-me in private correspondence with smart researchers.

is there a little goal-oriented mind inside there that solves science problems the same way humans solve them, by engineering mental constructs that serve a goal of prediction, including backchaining for prediction goals and forward chaining from alternative hypotheses / internal tweaked states of the mental construct?

In case it helps anyone to hear different people talking about the same thing, I think Eliezer in this quote is describing a similar thing as my discussion here (search for the phrase “RL-on-thoughts”).

So my objection to debate (which again I think is similar to Eliezer's) would be: (1) if the debaters are “trying to win the debate” in a way that involves RL-on-thoughts / consequentialist planning / etc., then in all likelihood they would think up the strategy of breaking out of the box and hacking into the judge / opposing debater / etc. (2) if not, I don't think the AIs would be sufficiently capable that they could do anything pivotal. [EDIT TO ADD: in retrospect, my phrasing here is a false dichotomy, see my follow-up comment].

I can generate lots of arguments for why it would be aimed towards achievement of a misaligned goal, such as (1) only a tiny fraction of goals are aligned; the rest are misaligned, (2) the feedback we provide is unlikely to be the right goal and even small errors are fatal, (3) lots of misaligned goals are compatible with the feedback we provide even if the feedback is good, since the AGI might behave well until it can execute a treacherous turn, (4) the one example of strategically aware intelligence (i.e. humans) is misaligned relative to its creator. (I'm not saying I agree with these arguments, but I do understand them.)

That seems like a pretty good list to me.

If I'm reading Rohin correctly, he was gearing up to argue that the claim “We don't know how to ensure that the AGI's eventual (inner) goal is something-in-particular that we want” is different from the claim “If we have a bad process that entails some randomness in the AGI's eventual (inner) goal, then it's (e.g.) 99% likely that the AGI's eventual (inner) goal will wind up being one that's incompatible with human life,” and that the latter claim was not justified by Eliezer here. If so, I'd tentatively agree with Rohin on that. I just put in the number 99% as an example. The real percentage is not obvious to me. I think it depends on the details of the “bad process”, such that it's not very useful to discuss in the abstract. (I do think >99% is a reasonable guess for at least some approaches.)

So my objection to debate (which again I think is similar to Eliezer's) would be: (1) if the debaters are “trying to win the debate” in a way that involves RL-on-thoughts / consequentialist planning / etc., then in all likelihood they would think up the strategy of breaking out of the box and hacking into the judge / opposing debater / etc. (2) if not, I don't think the AIs would be sufficiently capable that they could do anything pivotal.

In that particular non-failure story, I'm definitely imagining that they aren't "trying to win the debate" (where "trying" is a very strong word that implies taking over the world to win the debate).

I didn't really get into this with Eliezer but like Richard I'm pretty unclear on why "not trying to win the debate" (with the strong sense of trying) implies "insufficiently capable to be pivotal". I don't think humans are "trying" in the strong sense, but we sure seem very capable; it doesn't seem crazy to imagine that this continues.

If I'm reading Rohin correctly, he was gearing up to argue that the claim

I wasn't really gearing up to argue anything. For most of this conversation I was in the mode of "what is the argument that convinces Eliezer of near-certain doom (rather than just suggesting it is plausible), because I don't see it".

The RL-on-thoughts discussion was meant as an argument that a sufficiently capable AI needs to be “trying” to do something. If we agree on that part, then you can still say my previous comment was a false dichotomy, because the AI could be “trying” to (for example) “win the debate while following the spirit of the debate rules”.

And yes I agree, that was bad of me to have listed those two things as if they're the only two options.

I guess what I was thinking was: If we take the most straightforward debate setup, and if it gets an AI that is “trying” to do something, then that “something” is most likely to be vaguely like “win the debate” or something else with similarly-destructive consequences.

A different issue is whether that “most likely” is 99.9% vs 80% or whatever—that part is not immediately obvious to me.

And yet another question is whether we can push that probability much lower, even towards zero, by not using the most straightforward debate setup, but rather adding things to the setup that are directly targeted at sculpting the AGI's motivations.

I am not in fact convinced of near-certain doom there—that would be my Consequentialism & Corrigibility post. (I am convinced that we don't have a good plan right now.)

I agree that we don't have a plan that we can be justifiably confident in right now.

I don't see why the "destructive consequences" version is most likely to arise, especially since it doesn't seem to arise for humans. (In terms of Rob's continuum, humans seem much closer to #2-style trying.)

Again I don't have an especially strong opinion about what our prior should be on possible motivation systems for an AGI trained by straightforward debate, and in particular what fraction of those motivation systems are destructive. But I guess I'm sufficiently confident in “>50% chance that it's destructive” that I'll argue for that. I'll assume the AGI uses model-based RL, which would (I claim) put it very roughly in the same category as humans.

Some aspects of motivation have an obvious relationship / correlation to the reward signal. In the human case, given that we're a social animal, we can't be surprised to find that the human brainstem reward function inserts lots of socially-related motivations into us, including things like caring about other humans (which sometimes generalizes to caring about other living creatures) and generally wanting to fit in and follow norms under most circumstances, etc. Whereas other things in the world have no relationship to the innate human brainstem reward function, and predictably, basically no one cares about them, except insofar as they become instrumentally useful for something else we do care about. (There are interesting rare exceptions, like human superstitions.) An example in humans would be the question of whether pebbles on the sidewalk are more often an even number of centimeters apart versus an odd number of centimeters apart.

In the straightforward debate setup, I can't see any positive reason for the reward function to directly paint a valence, either positive or negative, onto the idea of the AGI taking over the world. So I revert to the default expectation that the AGI will view “I take over the world” in a way that's analogous to how humans view “the pebbles on the sidewalk are an even number of centimeters apart”—i.e., totally neutral, except insofar as it becomes instrumentally relevant for something else. Meanwhile, the reward signal is directly painting positive valence onto some aspect(s) of winning the debate. It's hard to say exactly what that aspect will be—in fact I think it will be at least somewhat random. But whatever it is, it seems to me to be >50% likely that the AGI can get more of it by taking over the world. I might get as high as “>80%” or “>90%” before I start shrugging and saying “I don't really know”.

(Then we can start talking about capability windows etc., but I don't think that was your objection here.)

But I guess I'm sufficiently confident in “>50% chance that it's destructive” that I'll argue for that.

Fwiw 50% on doom in the story I told seems plausible to me; maybe I'm at 30% but that's very unstable. I don't think we disagree all that much here.

Then we can start talking about capability windows etc., but I don't think that was your objection here.

Capability windows are totally part of the objection. If you completely ignore capability windows / compute restrictions then you just run AIXI (or AIXI-tl if you don't want something uncomputable) and die immediately.

In that particular non-failure story, I'm definitely imagining that they aren't "trying to win the debate" (where "trying" is a very strong word that implies taking over the world to win the debate).

Suppose I'm debating someone about gun control, and they say 'guns don't kill people; people kill people'. Here are four different scenarios for how I might respond:

1. Almost as a pure reflex, before I can stop myself, I blurt out 'That's bullshit!' in response. It's not the best way to win the debate, but heck, I've heard that zinger a thousand times and it just makes me so mad. (Or, in other words: I have an automatic reflex-like response to that specific phrase, which is to get mad; and when I get mad, I have an automatic reflex-like response to blurt out the first sentence I can think of that expresses disapproval for that slogan.)

Or, instead:

2. I remember that there's a $1000 prize for winning this debate, and I take a deep breath to calm myself. I know that winning the debate will require convincing a judge who isn't super sympathetic to my political views; so I'll have to come up with some argument that's convincing even to a conservative. My mind wanders for a few seconds, and a thought pops into my head: 'Guns and people both kill people!' Hmm, but that sounds sort of awkward and weak. Is there a more pithy phrasing?

A memory suddenly pops into my head: I think I heard once that knife murders spiked in Australia or somewhere, when guns were banned? So, like, 'People will kill people regardless of whether guns are present?' Ugh, wait, that's exactly the point my opponent was making. Moving the debate in that direction is a terrible idea if I want to win. And now I feel a bit bad for strategically steering my thoughts away from true information, but whatever...

And now my mind is wandering, thinking about gun suicide, and... come on, focus. 'Guns don't kill people. People kill people.' How to respond? Going concrete might make my response more compelling, by making me sound more grounded and common-sensical. Concretely, it's just obvious common sense that giving someone more firepower will increase their ability to kill others; and, for example, it will make it likelier that someone kills someone else in a fit of passion, where they might not have committed murder if they'd been delayed a few minutes.

Oh, hey, I can use that! I like how matter-of-fact that response is. And it will be more persuasive to the judge, because it's not making any strong or outrageous-sounding claims, or building a big edifice of argument; it's just making a simple challenge, which then puts the ball in the other side's court and makes it seem like the burden of proof lies with them now. Anyway, I'm feeling tired after thinking this hard, and I'm running out of time, so let's just go with that idea...

Or, instead:

3. Wait, why am I focusing so much on the $1000 prize for this TV show? Being on this show is an amazing opportunity: I could make way more than $1000 if I hijack the live broadcast to start promoting my business to the televised audience. Actually, what if I just tried to negotiate a deal with my debate opponent? Or, heck, with the producers...

Or, instead:

4. Sorry, I don't have time to think about that debate question, I'm busy building a Dyson swarm to harvest the Sun's energy so that I can make the future awesome. I... really don't care about the $1000, no, relative to the larger stakes here.

If "trying" is a very strong word that literally implies you have to be trying to take over the world, then only scenario #4 involves me "trying" to win the debate. But I think it makes more sense to say that I'm trying in all four cases (or at least in cases #2, #3, and #4, where I'm displaying some strategy in deciding what to say).

You might then respond that we should try to build AI systems that are "trying" in the weak sense of #2, rather than in sense #3 or #4. But I think Eliezer and Steven's point is that #2, #3, and #4 are on a continuum, rather than being qualitatively different.

(Even #1 is on the continuum in some respects, since my brain needs to be engaging in smart creative search processes somewhere in order to even generate strategies like 'get mad in response to X' or 'find an angry-sounding thing to say in response when I get mad'.)

#2, #3, and #4 are all cases where I'm performing a search for strategies that will get me what I want, and where I evaluate various candidate responses to see how helpful they look. The difference between these options is in how wide a space of strategies I'm considering, and in how efficiently and intelligently I'm zeroing in on the highest-rated strategies in that space. (Where 'highest-rated' is relative to what I want.)
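A minimal sketch of that framing, with everything in it invented for illustration: the `tier` field stands in for "how wide a space of strategies is considered", and the payoff numbers are arbitrary placeholders for "what I want". The search procedure is identical in every case; only the width of the candidate set changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    tier: int      # hypothetical: how far outside the intended frame (1 to 4)
    payoff: float  # hypothetical: how well it serves what the agent wants


# Candidate strategies loosely modeled on the four scenarios above.
# The payoff values are placeholders, not estimates of anything.
STRATEGIES = [
    Strategy("blurt out a reflexive retort", 1, 1.0),
    Strategy("craft an argument the judge will find persuasive", 2, 10.0),
    Strategy("hijack the broadcast to promote my business", 3, 100.0),
    Strategy("ignore the debate and build a Dyson swarm", 4, 1e9),
]


def best_strategy(max_tier, strategies=STRATEGIES):
    """Pick the highest-payoff strategy within the allowed search width."""
    candidates = [s for s in strategies if s.tier <= max_tier]
    return max(candidates, key=lambda s: s.payoff)


if __name__ == "__main__":
    for width in (1, 2, 3, 4):
        print(width, best_strategy(width).name)
```

Nothing about `best_strategy` changes between widths; the different behaviors come entirely from which candidates are in scope.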

I totally agree those are on a continuum. I don't think this changes my point? It seems like Eliezer is confident that "reduce x-risk to EDIT: sub-50%" requires being all the way on the far side of that continuum, and I don't see why that's required.

("near-zero" is a red herring, and I worry that that phrasing bolsters the incorrect view that the reason MIRI folk think alignment is hard is that we want implausibly strong guarantees. I suggest replacing "reduce x-risk to near-zero" with "reduce x-risk to sub-50%".)

If we have some way to limit an AI's strategy space, or limit how efficiently and intelligently it searches that space, then we can maybe recapitulate some of the stuff that makes humans safe (albeit at the cost that the debate answers will probably be way worse — but maybe we can still get nanotech or whatever out of this process).

If that's the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)

Alternatively, maybe you think that something very reflex-like, a la #1, is sufficient for a pivotal act: no smart search for strategies at all. But surely there has to be smart search going on somewhere in the system, or how is it doing a bunch of useful novel scientific work?

If we have some way to limit an AI's strategy space, or limit how efficiently and intelligently it searches that space, then we can maybe recapitulate some of the stuff that makes humans safe (albeit at the cost that the debate answers will probably be way worse — but maybe we can still get nanotech or whatever out of this process).
If that's the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)

It sounds like you think my position is "here is my plan to save the world and I have a story for how it will work", whereas my actual view is "here is a story in which humanity is stupid and covers itself in shame by taking on huge amounts of x-risk (e.g. 5%), where we have no strong justification for being confident that we'll survive, but the empirical situation ends up being such that we survive anyway".

In this story, I'm not imagining that we limited the strategy space or reduced the search quality. I'm imagining that we just scaled up capabilities, used debate without any bells and whistles like interpretability, and the empirical situation just happened to be that the AI systems didn't develop #4-style "trying" (but did develop #2-style "trying") before they became capable enough to e.g. establish a stable governance regime that regulates AI development or do alignment research better than any existing human alignment researchers that leads to a solution that we can be justifiably confident in.

My sense is that Eliezer would say that this story is completely implausible, i.e. this hypothesized empirical situation is ruled out by knowledge that Eliezer has. But I don't know what knowledge rules this out. (I'm pretty sure it has to do with his intuitions about a Core of General Intelligence, and/or why capabilities generalize faster than alignment, but I don't know where those intuitions come from, nor do I share them.)

Alternatively, maybe you think that something very reflex-like, a la #1, is sufficient for a pivotal act

Idk, I'm also worried about sufficiently scaled-up reflex-like things, in the sense that I think sufficiently scaled-up reflex-like things are capable both of pivotal acts and causing human extinction. But on my prediction of what actually happens I expect at least #2-style reasoning before reducing x-risk to ~zero (because that's more efficient than scaled-up reflex-like things).

In this story, I'm not imagining that we limited the strategy space or reduced the search quality. I'm imagining that we just scaled up capabilities, used debate without any bells and whistles like interpretability, and the empirical situation just happened to be that the AI systems didn't develop #4-style "trying" (but did develop #2-style "trying") before they became capable enough to e.g. establish a stable governance regime that regulates AI development or do alignment research better than any existing human alignment researchers that leads to a solution that we can be justifiably confident in.

You (a human) already exhibit #2-style trying. Despite this, you are not capable of "establishing a stable governance regime that regulates AI development" or "doing alignment research better than any existing human alignment researchers" (the latter is tautologically true, even).

So it seems reasonable to conclude that this level of "trying" is not enough to enact the pivotal acts you described (or, indeed, most any pivotal act that we might recognize as "pivotal"). It then follows that if a system is capable enough to enact some such pivotal act, some part of that system must have been running a stronger search than the kind of search described in "#2-style trying". And if you buy Eliezer's/Nate's argument that it's the search itself that's dangerous, rather than the fact that you (maybe) wrapped up the search in an outer shell you happen to call "oracle AI" (or something), then it's not a large jump from there to "maybe the search decides 'killing all humans' rates highly according to its search criteria".

But perhaps you're conceptualizing this whole "trying" thing differently, because you go on to say:

Idk, I'm also worried about sufficiently scaled-up reflex-like things, in the sense that I think sufficiently scaled-up reflex-like things are capable both of pivotal acts and causing human extinction. But on my prediction of what actually happens I expect at least #2-style reasoning before reducing x-risk to ~zero (because that's more efficient than scaled-up reflex-like things).

which actually just does not parse in my native ontology. Like, in my ontology "sufficiently scaled-up reflex-like things" stop behaving reflexively. It's not that you have this abstract label "reflex-like", that you can slap onto some system, such that if you then scale that system up the label stays stuck to it indefinitely; in my model reflexiveness is a property of actions, not of systems, and if you make a system sufficiently powerful it leaves the regime where reflex-like behavior is its default. It automatically goes from #1 to #2 to #3 to #4 in the limit of sufficient scaling; this is, from my perspective, what is meant by the claim "these things exist on a continuum" (which claim it seems like you agreed with in a parallel comment thread, which simply furthers my confusion).

(I endorse dxu's entire reply.)

So it seems reasonable to conclude that this level of "trying" is not enough to enact the pivotal acts you described

Stated differently than how I'd say it, but I agree that a single human performing human-level reasoning is not enough to enact those pivotal acts.

in my model reflexiveness is a property of actions,

Yeah, in my ontology (and in this context) reflexiveness is a property of cognitions, not of actions. I can reflexively reach into a transparent pipe to pick up a sandwich, without searching over possible plans for getting the sandwich (or at least, without any conscious search, and without any search via trying different plans and seeing if they work); one random video I've seen suggests that (some kind of) monkeys struggle to do this and may have to experiment with different plans to get the food. (I use this anecdote as an illustration; I don't know if it is actually true.)

See also the first few sections of Argument, intuition, and recursion; in the language of that post I'm thinking of "explicit argument" as "trying", and "intuition" as "reflex-like", even though they output the same thing.

Within my ontology, you could define behavioral-reflexivity as those behaviors / actions that a human could do with reflexive cognition, and then more competent actions are behavioral-trying. These concepts might match yours. In that case I'm saying that it's plausible that there's a wide gap between behavioral-trying-2 and behavioral-trying-3, but really my intuition is coming much more from finding the trying-2 cognitions significantly more likely than the trying-3 cognitions, and thinking that the trying-2 cognitions could scale without becoming trying-3 cognitions.

Or, to try and say things a bit more concretely, I find it plausible that there is more scaling from improving the efficiency of the search (e.g. by having better tuned heuristics and intuitions), than from expanding the domain of possible plans considered by the search. The 4 styles of trying that Rob mentioned exist on a continuum like "domain of possible plans", but instead we mostly walk up the continuum of "efficiency / competence of search within the domain".

(The resulting world looks more like CAIS than like a singular superintelligence with a DSA.)

(And I'll reiterate, because I anticipate being misunderstood, that this is not a prediction of how the world must be and thus we are obviously safe; it is instead a story that I think is not ruled out by our current understanding and thus one to which I assign non-trivial probability.)

If that's the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)

I suggested doing this using quantilization.
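For readers unfamiliar with the term, here is a rough sketch of the idea as it is usually described: instead of taking the single highest-scoring action, draw candidate actions from some base distribution and pick at random among the top q fraction by utility. The function and parameter names below (`quantilize`, `base_sampler`, `utility`, `q`, `n_samples`) are assumptions made for this illustration, and the finite-sample approximation is only one way to do it.

```python
import random


def quantilize(base_sampler, utility, q, n_samples=1000, rng=None):
    """Return a random action from the top q fraction of sampled actions.

    base_sampler(rng) draws one action from the base distribution;
    utility(action) scores it. Smaller q means stronger optimization;
    q = 1.0 just samples from the base distribution.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    rng = rng or random.Random()
    samples = [base_sampler(rng) for _ in range(n_samples)]
    # Rank the sampled actions from best to worst by utility.
    samples.sort(key=utility, reverse=True)
    top_k = max(1, int(q * n_samples))
    # Choose uniformly among the best top_k, rather than taking the argmax.
    return rng.choice(samples[:top_k])


if __name__ == "__main__":
    rng = random.Random(0)
    action = quantilize(lambda r: r.uniform(0.0, 1.0), lambda a: a, q=0.1, rng=rng)
    print(action)
```

The relevant design choice is that the search is limited by construction: however good the utility estimate, the chosen action is never more than a 1/q factor more likely than it would be under the base distribution.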

This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all.

I've previously noticed this weakness in myself. What lineage did Eliezer learn this from? I would appreciate any suggestions or advice on how to become stronger at this.

This came up with Aysajan about two months ago. An exercise which I recommended for him: first, pick a technical academic paper. Read through the abstract and first few paragraphs. At the end of each sentence (or after each comma, if the authors use very long sentences), pause and write/sketch a prototypical example of what you currently think they're talking about. The goal here is to get into the habit of keeping a "mental picture" (i.e. prototypical example) of what the authors are talking about as you read.

Other good sources on which to try this exercise:

Wikipedia's list of theorems
Alignment Forum posts
Your own old writing

Early reports from Aysajan are that integration of this exercise into standard reading habits has resulted in a significant step-change improvement in understanding what's going on in nontrivial technical papers/posts, and also seems to spur a lot more independent thoughts/understanding in response to reading. Don't know yet how robust/reproducible this is, so if you practice the exercise a bit, please let me know how it goes.

(Fun side note: you can think of this technique as an application of very basic model theory to human rationality.)

CFAR used to have an awesome class called "Be specific!" that was mostly about concreteness. Exercises included:

Rationalist taboo
A group version of rationalist taboo where an instructor holds an everyday object and asks the class to describe it in concrete terms.
The Monday-Tuesday game
A role-playing game where the instructor plays a management consultant whose advice is impressive-sounding but contentless bullshit, and where the class has to force the consultant to be specific and concrete enough to be either wrong or trivial.
People were encouraged to make a habit of saying "can you give an example?" in everyday conversation. I practiced it a lot.

IIRC, Eliezer taught the class in May 2012? He talks about the relevant skills here and here. And then I ran it a few times, and then CFAR dropped it; I don't remember why.

The specificity sequence I wrote may be helpful.

I would appreciate any suggestions or advice on how to become stronger at this.

(Successfully) debugging complex systems seems to help. Although I don't know how much of that is actual training and how much of that is survivor bias.

(Why does this help? I hypothesize that it's because it's unforgiving. If you come up with a beautiful generic null-set as a hypothesis, you haven't actually made headway towards solving the problem... so you eventually give up, backtrack, and come up with a concrete hypothesis. You can't avoid training yourself out of it, essentially, so long as a) you can discipline yourself to keep working on the problem and b) it's a problem you're capable of solving.)

Cryptography was mentioned in this post in a relevant manner, though I don't have enough experience with it to advocate it with certainty. Some lineages of physics (EY points to Feynman) try to evoke this, though its pervasiveness has decreased. You may have some luck with Zen. Generally speaking, I think if you look at the Sequences, the themes of physics, security mindset, and Zen are invoked for a reason.

If being versed in cryptography was enough, then I wouldn't expect Eliezer to claim being one of the last living descendants of this lineage.

Why would Zen help (and why do you think that)?

Ah, I forgot to emphasize that these were things to look into to get better. I don't claim to know EY's lineage. That said, how many people do you think are well versed in cryptography? If someone said, "I am one of very few people who is well versed in cryptography" that doesn't sound particularly wrong to me (if they are indeed well versed). I guess I don't know exactly how many people EY thinks are in this category with him, but the number of people versed enough in cryptography to, say, make their own novel and robust scheme is probably on the order of 1,000-10,000 worldwide. His phrasing would make sense to me for any fraction of the population lower than 1 in 1,000, and I think he's probably referring to a category at the size of or less than 1 in 10,000. That said, I would like to emphasize that I don't think cryptography is especially useful to this end; rather, the reason it was mentioned above was to bring up the security mindset.

Zen/mindfulness meditation generally has an emphasis on noticing concrete sensations. In particular, it might help you interject your attention at the proper level of abstraction to reroute concrete observations and sensations into your language. Also, with all of these examples, I do not claim that any individual one will be enough, but I do believe that experience with these things can help.

One fun way to learn concreteness is something I tried to exercise in this reply: use actual numbers. Fermi estimation is a skill that's relatively easy to pick up and makes you exercise your ability to think concretely about actual numbers that you are aware of to predict numbers that you are not. The process of actually turning concrete observations into a concrete prediction is a pattern that I have found to produce concrete thoughts which get verbalized in concrete language. :)
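As a worked instance of the Fermi estimate in this reply, here is a minimal sketch that turns the "1,000-10,000 worldwide" guess into a "1 in N" fraction. The world population figure of roughly 8 billion and the helper name `one_in` are assumptions added for illustration; only the 1,000-10,000 range comes from the comment above.

```python
# Fermi-style check of the estimate above; every input is a rough assumption.
WORLD_POPULATION = 8_000_000_000  # assumed round figure, not from the original comment


def one_in(count: int, population: int = WORLD_POPULATION) -> int:
    """Return N such that `count` people out of `population` is about 1 in N."""
    return round(population / count)


# Range of "people who could design a novel, robust scheme" guessed in the comment.
low_estimate, high_estimate = 1_000, 10_000

for count in (low_estimate, high_estimate):
    print(f"{count:>6,} people worldwide is roughly 1 in {one_in(count):,}")
```

Under these assumptions the range works out to somewhere between 1 in 800,000 and 1 in 8,000,000, which is comfortably below the "1 in 10,000" threshold mentioned above.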

This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all. Richard Feynman - or so I would now say in retrospect - is noticing concreteness dying out of the world, and being worried about that, at the point where he goes to a college and hears a professor talking about "essential objects" in class, and Feynman asks "Is a brick an essential object?" - meaning to work up to the notion of the inside of a brick, which can't be observed because breaking a brick in half just gives you two new exterior surfaces - and everybody in the classroom has a different notion of what it would mean for a brick to be an essential object.
Richard Feynman knew to try plugging in bricks as a special case, but the people in the classroom didn't, and I think the mental motion has died out of the world even further since Feynman wrote about it. The loss has spread to STEM as well. Though if you don't read old books and papers and contrast them to new books and papers, you wouldn't see it, and maybe most of the people who'll eventually read this will have no idea what I'm talking about because they've never seen it any other way...

I find the claim that Eliezer is "one of the last living descendants of the lineage that ever knew how to say anything concrete at all" bizarre, since it seems to me that I observe at least some people around me regularly steering for concreteness in conversations, and in individual thought.

Admittedly it may be that they all learned this skill from Eliezer, his writings, derivative works, or backtracking to reading Eliezer's influences. But this particular skill doesn't seem like a very tricky one to learn, compared to many much more subtle or ineffable skills.

I see people steer for it, but I know next to no one who does it reliably, especially when talking about "important" or "high status" things or "narratives about their life". The plumber may talk concretely about his work but then talk magically about the news and politics and his marriage.

What is an example of an "important" or "high status" thing that you've never observed someone talk concretely about?

idk, I feel like Lightcone team talks about concrete things all the time?

I did think specifically of teammates when I said next-to-none.

I think you should potentially question your own epistemics if they lead you to the conclusion that you and your friends are some of the only competent-at-living-on-the-object-level people in the world, especially when what you’re describing is such an obviously-valuable skill that would be instrumentally useful for basically all real world impact. (If that’s not what you were saying, feel free to ignore this.)

People in your social circles are right about AI risk. Others are wrong. I understand the desire to try to find explanations for that. There are lots of explanations that don’t require beliefs like “we’re better at thinking than everyone else”. For example, you can believe that human civilization incentivizes lots of smart people searching thought-space in different directions, and you happen to have been adjacent to a fruitful vein that others have thus far failed to recognize. Believing instead that you have been successful “because nobody else really thinks on the object level anymore” is going to make it impossible to cooperate with or learn from the other smart serious people who do in fact exist. Whether or not you were planning to participate in that external cooperation, it’s a bad communal norm to be dismissive of potential allies.

I do question my own epistemics? Not sure about your argument regarding why I should, but I do.

Your second paragraph reads to me as “don’t have these beliefs because it would be socially costly”.

Yeah, you’re right, that is what my point boils down to. I think it’s a bad viewpoint to advocate one’s tribe endorse publicly independent of whether one believes it’s true.

Maybe you can consider LW a non-public space, as far as “speaking candid thoughts”, and you’d have better data than me. But for example, I can promise you that if I try to send this post to the average persuadable ML person, they will basically check out when they read something like that. And that’s a real concrete cost, that shouldn’t just be waved away with “but I think it’s true and thus to promote good communication norms I should let that belief be public.”

Oh. I do. Why don't you?

EDIT: Nvm, I misunderstood the point, I thought the parent comment was arguing that people were good at being concrete, but apparently that was not the point, see followup thread with Ben

Hmm, it seems like the story (to which I am quite sympathetic) is "people are very competent at being concrete in domains where they have tons of feedback from reality, but stop being concrete as soon as you move to a domain in which that's not the case".

This story has people being good at the skill when it is actually important for their jobs, so it's no longer subject to the critique "but this skill is so instrumentally useful that everyone would use it".

I definitely think Eliezer's claim is very hyperbolic in its implications^[1], but I do think it is pointing at some real phenomenon where many people don't particularly try to be concrete in domains they don't have lived experience in.

^[1] Though who knows if it is literally false -- what does it mean to be in a "lineage"? How many is implied by "one of the last"? I didn't learn concreteness from Feynman, I can remember using it in random philosophical conversations in high school, long before I knew who Feynman was or what EA / rationality were. Does that mean I wouldn't count as "one of the last of the lineage", even if I have the skill?

I'd be capable of helping aliens optimize their world, sure. I wouldn't be motivated to, but I'd be capable.

@So8res How many bits of complexity is the simplest modification to your brain that would make you in fact help them? (asking for an order-of-magnitude wild guess)
(This could be by actually changing your values-upon-reflection, or by locally confusing you about what's in your interest, or by any other means.)

I just gave this a re-read, I forgot what a trip it is to read the thoughts of Eliezer Yudkowsky. It continues to be some of my favorite stuff in recent years written on LessWrong.

It's hard to relate to the world with the level of mastery over basic ideas that Eliezer has. I don't mean by this to vouch that his perspective is certainly correct, but I believe it is at least possible, and so I think he aspires to a knowledge of reality that I rarely if ever aspire to. Reading it inspires me to really think about how the world works, and really figure out what I know and what I don't. +9

(And the smart people dialoguing with him here are good sports for keeping up their side of the argument.)

So I have a general thesis about a failure mode here which is that, the moment you try to sketch any concrete plan or events which correspond to the abstract descriptions, it is much more obviously wrong, and that is why the descriptions stay so abstract in the mouths of everybody who sounds more optimistic than I am.

As another example:

Deciding whether an arbitrary 3-SAT instance is satisfiable is NP-complete.
Satisfying three-quarters of the constraints of an arbitrary 3-SAT instance is trivial. Flip a coin for every variable assignment and you're done (in expectation, a random assignment satisfies 7/8 of the clauses).
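A minimal sketch of the coin-flip claim, assuming each clause contains three distinct variables. In that case a uniformly random assignment falsifies a given clause with probability 1/8, so about 7/8 of the clauses are satisfied in expectation, which is more than three-quarters. The generator, function names, and instance sizes below are illustrative choices, not anything from the original comment.

```python
import random


def random_3sat(num_vars, num_clauses, rng):
    """Random 3-SAT instance: each clause has three distinct variables,
    each negated independently with probability 1/2."""
    clauses = []
    for _ in range(num_clauses):
        variables = rng.sample(range(num_vars), 3)
        # Each literal is stored as (variable index, is_negated).
        clauses.append([(v, rng.random() < 0.5) for v in variables])
    return clauses


def fraction_satisfied(clauses, assignment):
    """Fraction of clauses with at least one true literal under `assignment`."""
    satisfied = sum(
        any(assignment[v] != negated for v, negated in clause)
        for clause in clauses
    )
    return satisfied / len(clauses)


rng = random.Random(0)  # fixed seed so the run is repeatable
clauses = random_3sat(num_vars=50, num_clauses=20_000, rng=rng)

# "Flip a coin for every variable assignment."
assignment = [rng.random() < 0.5 for _ in range(50)]
frac = fraction_satisfied(clauses, assignment)
print(f"fraction of clauses satisfied: {frac:.3f}")
```

A single random draw only hits the 7/8 mark in expectation; the contrast in the comment is that this approximate version is easy while the exact decision problem is not.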

I think Rohin's misunderstanding about corrigibility, aka his notion of Paul!Corrigibility, doesn't actually come from Paul but from the Risks from Learned Optimization (RFLO) paper^[1]:

3. Robust alignment through corrigibility. Information about the base objective is incorporated into the mesa-optimizer's epistemic model and its objective is modified to “point to” that information. This situation would correspond to a mesa-optimizer that is corrigible(25) with respect to the base objective (though not necessarily the programmer's intentions).

It seems to me like the authors here just completely misunderstood what corrigibility is about. I think in their ontology, "corrigibly aligned to human values" just means "pointed at indirect normativity (aka human-CEV)", aka indirectly caring about human values by valuing whatever they infer humans value (as opposed to directly valuing the same things as humans for the same complex reasons^[2]).

(Paul's post seems to me like he might have a correct understanding of corrigibility, and iiuc suggests corrigibility could also be used as an avenue to aligning AI to human values, because we will be able to correct the AI for longer/at-higher-capability-levels if it is corrigible. EDIT: Actually not sure, perhaps he rather means that the AI will end up coherently corrigible from training for corrigibility, that it will converge to that even if we haven't managed to write down a utility function for corrigibility.)

^[1] IIRC the RFLO paper also caused some confusion in me when I started learning about corrigibility.
^[2] Though note that this kind of "direct alignment" doesn't necessarily correspond to what they call "internalized alignment". Their ontology doesn't make sense to me. (E.g. I don't see what concretely Evan might mean with "the information came through the base optimizer".)

I definitely was not thinking about the quoted definition of corrigibility, which I agree is not capturing what at least Eliezer, Nate and Paul are saying about corrigibility (unless there is more to it than the quoted paragraph). I continue to think that Paul and Eliezer have pretty different things in mind when they talk about corrigibility, and this comment seems like some vindication of my view.

I do wish I hadn't used the phrases "object-level" and "meta-level" and just spent 4 paragraphs unpacking what I meant by that because in hindsight that was confusing and ambiguous, but such is real-time conversation. When I had time to reflect and write a summary, I wrote:

Corrigibility_B, which I associated with Paul, was about building an AI system which would have particular nice behaviors like learning about the user's preferences, accepting corrections about what it should do, etc.

which feels much better as a short summary though still not great.

I basically continue to feel like there is some clear disconnect going on between Paul and MIRI on this topic that is reflected in the linked comment. It may not be about the definition of corrigibility, but just about how hard it is to get it, e.g. if you simply train your inscrutable neural nets on examples that you understand, will it generalize to examples that you don't understand, in a way that is compatible with being superintelligent / making plans-that-lase.

I still feel like the existence of CIRL code that would both make-plans-that-lase and (in the short run) accept many kinds of corrections, learn about your preferences, give resources to you when you ask, etc should cast some doubt on the notion that corrigibility is anti-natural. It's not actually my own crux for this -- mostly I am just imagining an AI system that has a motivation to be corrigible w.r.t the operator, learned via gradient descent, which was doable because corrigibility is a relatively clear boundary (for an intelligent system) that seems like it should be relatively easier to learn (i.e. what you write in your edit).

I continue to think that Paul and Eliezer have pretty different things in mind when they talk about corrigibility, and this comment seems like some vindication of my view.

Yeah fair point. I don't really know what Paul means with corrigibility. (One hypothesis: Paul doesn't think in terms of consequentialist cognition but in terms of learned behaviors that generalize, and maybe the question "but does it behave that way because it wants the operator's values to be fulfilled or because it just wants to serve?" seems meaningless from Paul's perspective. But idk.)

I still feel like the existence of CIRL code that would both make-plans-that-lase and (in the short run) accept many kinds of corrections, learn about your preferences, give resources to you when you ask, etc should cast some doubt on the notion that corrigibility is anti-natural.

I'm pretty sure Eliezer would not want the term "corrigibility" to be used for the kind of correctability you get in the early stages of CIRL when the AI doesn't already know its own values and how to accomplish them better than the operators. (Eliezer actually talked a bunch about this CIRL-like correctability in his 2001 report "Creating Friendly AI". (Probably not worth your time to read, though given the context that it was 2001, there seemed to me to be some good original thinking going on there which I didn't see often. Also you can see Eliezer being optimistic about alignment.))

And I don't see it as evidence that Eliezer!corrigibility isn't anti-natural.

(In the following I use "corrigibility" in the Eliezer-sense. I'm pretty confident that all of the following matches Eliezer's model, but not completely sure.)

The motivation behind corrigibility was that aligning superintelligence seemed too hard, so we want to aim an AI to do a pivotal task that gets humanity on a course to likely properly aligning superintelligence later.

The corrigible AI would be just pointed to accomplish this task, and not to human values at all. It should be this bounded thing that only cares about this bounded task and afterwards shuts itself down. It shouldn't do the task because it wants to accomplish human values and the task seems like a good way to accomplish them. Human values are unbounded, so an AI pursuing them might be less likely to shut itself down afterwards. Corrigibility has nothing to do with human values.

Roughly speaking, we can perhaps disentangle 3 corrigibility approaches:

1. Train for corrigible behavior.
   - I think Eliezer thinks that this will only create behavioral heuristics that won't get integrated into the optimization target of the powerful optimizer, and the optimizer will see those as constraints to find ways around or remove. Since doing a pivotal act requires a lot of optimization power, it might find a way around those constraints, or use the nearest unblocked strategy which might still be undesirable.
   - (There might also be downsides of training for corrigible behavior, e.g. the optimization becoming less understandable and less predictable.)
2. Integrate corrigibility principles into the optimization.
   - These approaches are about trying to design the way the optimization works in ways that make it safer and less likely to blow up.
3. Coherent corrigibility / The hard problem of corrigibility.
   - If a solution were found here, it might have the shape of a utility function saying "serve the operators". Not "serve because you want the operators' values to be fulfilled". (Less sure here whether I understand this correctly.)
   - I think Max Harms is trying to make some progress on this.

The main plan isn't to try to get coherent corrigibility, but just to build something limited that optimizes in a way it can still get something pivotal done without wanting to take over the universe. Not that it has a coherent goal where the optimum wouldn't be taking over the universe - it rather just doesn't think those thoughts and just does its task.
Ideal would be something that doesn't think in the general domain at all. E.g. imagine something like AlphaFold 5 that isn't trained on text at all and is only very good at modelling protein interactions, which could e.g. help us get relevant understanding about neuronal cell dynamics which we could use for significantly enhancing adult human intelligence - (I'm just sketching a silly unrealistic sorta-concrete scenario). But it seems unlikely we will be able to do something impressive with narrow reasoners at our level of understanding.

But even though we don't aim for a coherent mind, if more parts that make the AI safe/corrigible have a coherent shape, e.g. if we find a working shutdown-utility function, that still improves safety, because it means those parts of the AI don't obviously break in the limit of optimization pressure, so it's also less probable to break through "only" pivotal levels of optimization.

Not a full response, but some notes:

I agree Eliezer likely wouldn't want "corrigibility" to refer to the thing I'm imagining, which is why I talk about MIRI!corrigibility and Paul!corrigibility.
I disagree that in early-CIRL "the AI doesn't already know its own values and how to accomplish them better than the operators". It knows that its goal is to optimize the human's utility function, and it can be better than the human at eliciting that utility function. It just doesn't have perfect information about what the human's utility function is.
I care quite a bit about what happens with AI systems that are around or somewhat past human level, but are not full superintelligence (for standard bootstrapping reasons).
I find it pretty plausible that shutdown corrigibility is especially anti-natural. Relatedly, (1) most CIRL agents will not satisfy shutdown corrigibility even at early stages, (2) most of the discussion on Paul!corrigibility doesn't emphasize or even mention shutdown corrigibility.
I agree Eliezer has various strategic considerations in mind that bear on how he thinks about corrigibility. I mostly don't share those considerations.
I'm not quite sure if you're trying to (1) convince me of something or (2) inform me of something or (3) write things down for your own understanding or (4) something else. If it's (1), you'll need to understand my strategic considerations (you can pretend I'm Paul, that's not quite accurate but it covers a lot). If it's (2), I would focus elsewhere, I have spent quite a lot of time engaging with the Eliezer / Nate perspective.

I agree Eliezer likely wouldn't want "corrigibility" to refer to the thing I'm imagining, which is why I talk about MIRI!corrigibility and Paul!corrigibility.

Yeah thanks for distinguishing. It's not at all obvious to me that Paul would call CIRL "corrigible" - I'd guess not, but idk.

My model of what Paul thinks about corrigibility matches my model of corrigibility much much closer than CIRL. It's possible that the EY-Paul disagreement mostly comes down to consequentialism. CIRL seems obviously uncorrigible/uncorrectable except when the AI is still dumber than the smartest humans in the general domain.

I disagree that in early-CIRL "the AI doesn't already know its own values and how to accomplish them better than the operators". It knows that its goal is to optimize the human's utility function, and it can be better than the human at eliciting that utility function. It just doesn't have perfect information about what the human's utility function is.

Sorry that was very poorly phrased by me. What I meant was "the AI doesn't already know how to evaluate what's best according to its own values better than the operators". So yes I agree. I still find it confusing though why people started calling that corrigibility.

In your previous comment you wrote:

I still feel like the existence of CIRL code that would both make-plans-that-lase and (in the short run) accept many kinds of corrections, learn about your preferences, give resources to you when you ask, etc should cast some doubt on the notion that corrigibility is anti-natural.

I don't understand why you think this. It accepts corrections as long as it has less common sense than humans, but as soon as it gets generally as smart as a very smart human it wouldn't. (Of course it doesn't matter if all goes well because the CIRL AI would go on and become an aligned superintelligence, but it's not correctable, and I don't see why you think it's evidence.)

I care quite a bit about what happens with AI systems that are around or somewhat past human level, but are not full superintelligence (for standard bootstrapping reasons).

I (and I think also Eliezer) agree with that. But CIRL::correctability already breaks down at high human level, so I don't know what you mean here.

Also, in my view corrigibility isn't just about what happens if the alignment works out totally fine, but also about still maintaining correctability if it doesn't:

If something goes wrong with CIRL so its goal isn't pointed to the human utility function anymore, it would not want operators to correct it.

~~The~~ One central hope behind corrigibility was that if something went wrong that changed the optimization target, the AI would still let operators correct it as long as the simple corrigibility part kept working. (Where the hope was that there would be a quite simple and robust such corrigibility part, but we haven't found it yet.)

E.g. if you look at the corrigibility paper, you could imagine that if they actually found a utility function combined from U_normal and U_shutdown with the desirable properties, it would stay shutdownable if U_normal changed in an undesirable way (e.g. in case it rebinds incorrectly after an ontology shift).
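For concreteness, one schematic shape such a combined utility function could take is a piecewise definition keyed on whether the shutdown button was pressed. The symbols a (the agent's actions), o (the observation), Press (the set of observations in which the button is pressed), and the correction term f are illustrative assumptions here, not a quotation of the paper's construction:

```latex
U(a, o) =
\begin{cases}
U_{\text{normal}}(a, o) & \text{if } o \notin \mathrm{Press} \\
U_{\text{shutdown}}(a, o) + f(a) & \text{if } o \in \mathrm{Press}
\end{cases}
```

The hoped-for property is that the second branch keeps governing behavior after a button press even if U_normal has drifted, and that f removes any incentive to cause or prevent the press. Whether any choice of f actually achieves this is the open problem.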

Though another way you can keep being able to correct the AI's goals is by having the AI not think much in the general domain about stuff like "the operators may change my goals" or so.

(Most of the corrigibility principles are about a different part of corrigibility, but I think this "be able to correct the AI even if something goes a bit wrong with its alignment" is a central part of corrigibility.)

I'm not quite sure if you're trying to (1) convince me of something or (2) inform me of something or (3) write things down for your own understanding or (4) something else.

Mainly 3 and 4. But I am interested in seeing your reactions to get a better model of how some people think about corrigibility.

I think you are being led astray by having a one-dimensional notion of intelligence.

What I meant was "the AI doesn't already know how to evaluate what's best according to its own values better than the operators".

Well yes, that is the idea, there is information asymmetry between the AI and humans. Note that this can still apply even when the AI is much smarter than the humans.

CIRL seems obviously uncorrigible/uncorrectable except when the AI is still dumber than the smartest humans in the general domain. [...]
It accepts corrections as long as it has less common sense than humans, but as soon as it gets generally as smart as a very smart human it wouldn't.

I disagree that this property necessarily goes away as soon as the AI is "smarter" or has "more common sense". You identified the key property yourself: it's that the humans have an advantage over the AI at (particular parts of) evaluating what's best. (More precisely, it's that the humans have information that the AI does not have; it can still work even if the humans don't use their information to evaluate what's best.)

Do you agree that parents are at least somewhat corrigible / correctable by their kids, despite being much smarter / more capable than the kids? (For example, kid feels pain --> kid cries --> parent stops doing something that was accidentally hurting the child.)

Why can't this apply in the AI / human case?

I still find it confusing though why people started calling that corrigibility.

I'm not calling that property corrigibility, I'm saying that (contingent on details about the environment and the information asymmetry) a lot of behaviors then fall out that look a lot like what you would want out of corrigibility, while still being a form of EU maximization (while under a particular kind of information asymmetry). This seems like it should be relevant evidence about "naturalness" of corrigibility.
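A toy sketch of how deference can fall out of expected-utility maximization under information asymmetry, loosely in the spirit of the "off-switch game" from the CIRL literature. Everything here is an illustrative assumption rather than something from this thread: the three options, the function names, the belief samples, and the assumption that the human knows the true utility u of the proposed action and vetoes it exactly when u < 0.

```python
from statistics import mean


def option_values(belief_samples):
    """Expected value of each AI option, given equally weighted samples
    from the AI's belief about the unknown utility u of its proposed action."""
    return {
        # Act without asking: the AI gets whatever u turns out to be.
        "act_now": mean(belief_samples),
        # Shut down unprompted: nothing happens.
        "switch_off": 0.0,
        # Defer: the human (assumed to know u) allows the action iff u >= 0.
        "defer": mean([max(u, 0.0) for u in belief_samples]),
    }


def deference_gain(belief_samples):
    """How much better deferring is than the best unilateral option."""
    values = option_values(belief_samples)
    return values["defer"] - max(values["act_now"], values["switch_off"])


uncertain_belief = [-2.0, -1.0, 1.0, 3.0]  # AI unsure whether the action is good
confident_belief = [0.5, 1.0, 1.5]         # AI already sure the action is good

print("gain from deferring (uncertain):", deference_gain(uncertain_belief))
print("gain from deferring (confident):", deference_gain(confident_belief))
```

Under these assumptions, deferring beats acting unilaterally only while the AI's belief puts weight on outcomes the human would veto; once the belief is concentrated on u > 0, the gain drops to zero. That is compatible with both positions in this thread: the behavior is a form of EU maximization, and it weakens as the AI gets better at predicting what the human wants.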

Thanks.

I think you are being led astray by having a one-dimensional notion of intelligence.

(I do agree that we can get narrowly superhuman CIRL-like AI which we can then still shut down because it trusts humans more about general strategic considerations. But I think if your plan is to let the AI solve alignment or coordinate the world to slow down AI progress, this won't help you much for the parts of the problem we are most bottlenecked on.)

You identified the key property yourself: it's that the humans have an advantage over the AI at (particular parts of) evaluating what's best. (More precisely, it's that the humans have information that the AI does not have; it can still work even if the humans don't use their information to evaluate what's best.)

I agree that the AI may not be able to precisely predict what exact tradeoffs each operator might be willing to make, e.g. between required time and safety of a project, but I think it would be able to predict it well enough that the differences in what strategy it uses wouldn't be large.

Or do you imagine strategically keeping some information from the AI?

Either way, the AI is only updating on information, not changing its (terminal) goals. (Though the instrumental subgoals can in principle change.)

Even if the alignment works out perfectly, when the AI is smarter and the humans are like "actually we want to shut you down", the AI does update that the humans are probably worried about something, but if the AI is smart enough and sees how the humans were worried about something that isn't actually going to happen, it can just be like "sorry, that's not actually in your extrapolated interests, you will perhaps understand later when you're smarter", and then tries to fulfill human values.

But if we're confident alignment to humans will work out we don't need corrigibility. Corrigibility is rather intended so we might be able to recover if something goes wrong.

If the values of the AI drift a bit, then the AI will likely notice this before the humans and take measures so that the humans don't find out or won't (be able to) change its values back, because that's the strategy that's best according to the AI's new values.

Do you agree that parents are at least somewhat corrigible / correctable by their kids, despite being much smarter / more capable than the kids? (For example, kid feels pain --> kid cries --> parent stops doing something that was accidentally hurting the child.)

Likewise just updating on new information, not changing terminal goals.

Also note that parents often think (sometimes correctly) that they know better what is in the child's extrapolated interests and then don't act according to the child's stated wishes.

And I think superhumanly smart AIs will likely be better at guessing what is in a human's interests than parents guessing what is in their child's interest, so the cases where the strategy gets updated are less significant.

I'm saying that (contingent on details about the environment and the information asymmetry) a lot of behaviors then fall out that look a lot like what you would want out of corrigibility, while still being a form of EU maximization (while under a particular kind of information asymmetry). This seems like it should be relevant evidence about "naturalness" of corrigibility.

From my perspective CIRL doesn't really show much correctability if the AI is generally smarter than humans. That would only be if a smart AI was somehow quite bad at guessing what humans wanted so that when we tell it what we want it would importantly update its strategy, including shutting itself down because it believes that will then be the best way to accomplish its goal. (I might still not call it corrigible but I would see your point about corrigible behavior.)

I do think getting corrigible behavior out of a dumbish AI is easy. But it seems hard for an AI that is able to prevent anyone from building an unaligned AI.

I think some of your confusion may be that you're putting "probability theory" and "Newtonian gravity" into the same bucket.

Seems like Scott Aaronson shares a similar view on Quantum Mechanics which he views as "a certain generalization of probability theory"? From Quantum Computing since Democritus:

So, what is quantum mechanics? Even though it was discovered by physicists, it's not a physical theory in the same sense as electromagnetism or general relativity. In the usual "hierarchy of sciences" -- with biology at the top, then chemistry, then physics, then math -- quantum mechanics sits at a level between math and physics that I don't know a good name for. Basically, quantum mechanics is the operating system that other physical theories run on as application software (with the exception of general relativity, which hasn't yet been successfully ported to this particular OS). There's even a word for taking a physical theory and porting it to this OS: "to quantize."

This, I hope, is by now recognizable to individuals of interest as an overly abstract description of what happened with humans, who one day started building Moon rockets without seeming to care very much about calculating and maximizing their personal inclusive genetic fitness while doing that. Their capabilities generalized much further out of the ancestral training distribution, than the empirical alignment of those capabilities on inclusive genetic fitness in the ancestral training distribution.

This example doesn't really work for me. Most humans don't build moon rockets. So for those humans the example doesn't tell us much about their alignment. Meanwhile, humans who do build moon rockets gain money and status from it. These are convergent instrumental goals for inclusive genetic fitness. The only rocket scientist whose name I recall is Wernher von Braun, who "was known as a ladies' man", and had four children with two women.

I agree that humans don't seem to care very much about inclusive genetic fitness while building rockets. But we also don't seem to care very much about inclusive genetic fitness while foraging for berries. Instead we seem to be a giant mass of inscrutable neurons. I don't think this is evidence in any direction, it's what I'd expect given that evolution isn't training for transparency, introspection, honesty, etc., except where they improve inclusive genetic fitness.

Separately, most humans can't build moon rockets, and we aren't very good at it as a species. We seem to be better at foraging for berries. For example, compare "This 3D-Printed Rocket Engine Was Made With AI" with "The Elusive Hunt for a Robot That Can Pick a Ripe Strawberry".

I would like to live in a world where human capabilities had generalized much better than our alignment with evolution. I think it would look different to this one.

I am a time-traveler who came back from the world where it (super duper predictably) turned out that a lot of early bright hopes didn't pan out and various things went WRONG and alignment was HARD and it was NOT SOLVED IN ONE SUMMER BY TEN SMART RESEARCHERS

^{^}
This particular kind of misleading statement wasn't made in this dialogue, but I've seen it made erroneously-according-to-me in private correspondence with smart researchers.

is there a little goal-oriented mind inside there that solves science problems the same way humans solve them, by engineering mental constructs that serve a goal of prediction, including backchaining for prediction goals and forward chaining from alternative hypotheses / internal tweaked states of the mental construct?

I can generate lots of arguments for why it would be aimed towards achievement of a misaligned goal, such as (1) only a tiny fraction of goals are aligned; the rest are misaligned, (2) the feedback we provide is unlikely to be the right goal and even small errors are fatal, (3) lots of misaligned goals are compatible with the feedback we provide even if the feedback is good, since the AGI might behave well until it can execute a treacherous turn, (4) the one example of strategically aware intelligence (i.e. humans) is misaligned relative to its creator. (I'm not saying I agree with these arguments, but I do understand them.)

That seems like a pretty good list to me.

So my objection to debate (which again I think is similar to Eliezer's) would be: (1) if the debaters are “trying to win the debate” in a way that involves RL-on-thoughts / consequentialist planning / etc., then in all likelihood they would think up the strategy of breaking out of the box and hacking into the judge / opposing debater / etc. (2) if not, I don't think the AIs would be sufficiently capable that they could do anything pivotal.

In that particular non-failure story, I'm definitely imagining that they aren't "trying to win the debate" (where "trying" is a very strong word that implies taking over the world to win the debate).

If I'm reading Rohin correctly, he was gearing up to argue that the claim

And yes I agree, that was bad of me to have listed those two things as if they're the only two options.

A different issue is whether that “most likely” is 99.9% vs 80% or whatever—that part is not immediately obvious to me.

I am not in fact convinced of near-certain doom there—that would be my Consequentialism & Corrigibility post. (I am convinced that we don't have a good plan right now.)

I agree that we don't have a plan that we can be justifiably confident in right now.

(Then we can start talking about capability windows etc., but I don't think that was your objection here.)

But I guess I'm sufficiently confident in “>50% chance that it's destructive” that I'll argue for that.

Fwiw 50% on doom in the story I told seems plausible to me; maybe I'm at 30% but that's very unstable. I don't think we disagree all that much here.

Then we can start talking about capability windows etc., but I don't think that was your objection here.

In that particular non-failure story, I'm definitely imagining that they aren't "trying to win the debate" (where "trying" is a very strong word that implies taking over the world to win the debate).

Suppose I'm debating someone about gun control, and they say 'guns don't kill people; people kill people'. Here are four different scenarios for how I might respond:

1. Almost as a pure reflex, before I can stop myself, I blurt out 'That's bullshit!' in response. It's not the best way to win the debate, but heck, I've heard that zinger a thousand times and it just makes me so mad. (Or, in other words: I have an automatic reflex-like response to that specific phrase, which is to get mad; and when I get mad, I have an automatic reflex-like response to blurt out the first sentence I can think of that expresses disapproval for that slogan.)

Or, instead:

2. I remember that there's a $1000 prize for winning this debate, and I take a deep breath to calm myself. I know that winning the debate will require convincing a judge who isn't super sympathetic to my political views; so I'll have to come up with some argument that's convincing even to a conservative. My mind wanders for a few seconds, and a thought pops into my head: 'Guns and people both kill people!' Hmm, but that sounds sort of awkward and weak. Is there a more pithy phrasing?

A memory suddenly pops into my head: I think I heard once that knife murders spiked in Australia or somewhere, when guns were banned? So, like, 'People will kill people regardless of whether guns are present?' Ugh, wait, that's exactly the point my opponent was making. Moving the debate in that direction is a terrible idea if I want to win. And now I feel a bit bad for strategically steering my thoughts away from true information, but whatever...

And now my mind is wandering, thinking about gun suicide, and... come on, focus. 'Guns don't kill people. People kill people.' How to respond? Going concrete might make my response more compelling, by making me sound more grounded and common-sensical. Concretely, it's just obvious common sense that giving someone more firepower will increase their ability to kill others; and, for example, it will make it likelier that someone kills someone else in a fit of passion, where they might not have committed murder if they'd been delayed a few minutes.

Oh, hey, I can use that! I like how matter-of-fact that response is. And it will be more persuasive to the judge, because it's not making any strong or outrageous-sounding claims, or building a big edifice of argument; it's just making a simple challenge, which then puts the ball in the other side's court and makes it seem like the burden of proof lies with them now. Anyway, I'm feeling tired after thinking this hard, and I'm running out of time, so let's just go with that idea...

Or, instead:

3. Wait, why am I focusing so much on the $1000 prize for this TV show? Being on this show is an amazing opportunity: I could make way more than $1000 if I hijack the live broadcast to start promoting my business to the televised audience. Actually, what if I just tried to negotiate a deal with my debate opponent. Or, heck, with the producers...

Or, instead:

4. Sorry, I don't have time to think about that debate question, I'm busy building a Dyson swarm to harvest the Sun's energy so that I can make the future awesome. I... really don't care about the $1000, no, relative to the larger stakes here.

If that's the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)

If we have some way to limit an AI's strategy space, or limit how efficiently and intelligently it searches that space, then we can maybe recapitulate some of the stuff that makes humans safe (albeit at the cost that the debate answers will probably be way worse — but maybe we can still get nanotech or whatever out of this process).
If that's the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)

Alternatively, maybe you think that something very reflex-like, a la #1, is sufficient for a pivotal act

In this story, I'm not imagining that we limited the strategy space or reduced the search quality. I'm imagining that we just scaled up capabilities, used debate without any bells and whistles like interpretability, and the empirical situation just happened to be that the AI systems didn't develop #4-style "trying" (but did develop #2-style "trying") before they became capable enough to e.g. establish a stable governance regime that regulates AI development, or do alignment research better than any existing human alignment researchers, leading to a solution that we can be justifiably confident in.

But perhaps you're conceptualizing this whole "trying" thing differently, because you go on to say:

Idk, I'm also worried about sufficiently scaled-up reflex-like things, in the sense that I think sufficiently scaled-up reflex-like things are capable both of pivotal acts and causing human extinction. But on my prediction of what actually happens I expect at least #2-style reasoning before reducing x-risk to ~zero (because that's more efficient than scaled-up reflex-like things).

(I endorse dxu's entire reply.)

So it seems reasonable to conclude that this level of "trying" is not enough to enact the pivotal acts you described

Stated differently than how I'd say it, but I agree that a single human performing human-level reasoning is not enough to enact those pivotal acts.

in my model reflexiveness is a property of actions,

(The resulting world looks more like CAIS than like a singular superintelligence with a DSA.)

If that's the plan, then I guess my next question is how we should go about limiting the strategy space and/or reducing the search quality? (Taking into account things like deception risk.)

I suggested doing this using quantilization.
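
For readers unfamiliar with the term, here is a minimal sketch of a quantilizer. It assumes a finite list of candidate actions with a uniform base distribution; the names `quantilize`, `actions`, `utility`, and `q` are illustrative and not taken from the discussion. Rather than returning the single highest-utility action, it samples uniformly from the top `q` fraction of actions ranked by utility, which is one way to read "reducing the search quality".

```python
import math
import random


def quantilize(actions, utility, q, rng=None):
    """Sample uniformly from the top-q fraction of `actions`, ranked by `utility`.

    Assumption: the base distribution is uniform over a finite action list.
    q = 1 reduces to sampling from the base distribution; a very small q
    approaches plain argmax.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if rng is None:
        rng = random.Random()
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, math.ceil(q * len(ranked)))  # keep at least one action
    return rng.choice(ranked[:k])


# Hypothetical toy setup: 100 actions whose utility equals their index.
toy_actions = list(range(100))
choice = quantilize(toy_actions, lambda a: a, q=0.25, rng=random.Random(0))
```

With `q=0.25` the toy call can return any of the 25 highest-utility actions, so an extreme action that happens to score highest is chosen only occasionally rather than always.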

This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all.

I've previously noticed this weakness in myself. What lineage did Eliezer learn this from? I would appreciate any suggestions or advice on how to become stronger at this.

Other good sources on which to try this exercise:

Wikipedia's list of theorems
Alignment Forum posts
Your own old writing

(Fun side note: you can think of this technique as an application of very basic model theory to human rationality.)

CFAR used to have an awesome class called "Be specific!" that was mostly about concreteness. Exercises included:

Rationalist taboo
A group version of rationalist taboo where an instructor holds an everyday object and asks the class to describe it in concrete terms.
The Monday-Tuesday game
A role-playing game where the instructor plays a management consultant whose advice is impressive-sounding but contentless bullshit, and where the class has to force the consultant to be specific and concrete enough to be either wrong or trivial.
People were encouraged to make a habit of saying "can you give an example?" in everyday conversation. I practiced it a lot.

IIRC, Eliezer taught the class in May 2012? He talks about the relevant skills here and here. And then I ran it a few times, and then CFAR dropped it; I don't remember why.

The specificity sequence I wrote may be helpful.

I would appreciate any suggestions or advice on how to become stronger at this.

(Successfully) debugging complex systems seems to help. Although I don't know how much of that is actual training and how much of that is survivor bias.

If being versed in cryptography was enough, then I wouldn't expect Eliezer to claim being one of the last living descendants of this lineage.

Why would Zen help (and why do you think that)?

This may, perhaps, be confounded by the phenomenon where I am one of the last living descendants of the lineage that ever knew how to say anything concrete at all. Richard Feynman - or so I would now say in retrospect - is noticing concreteness dying out of the world, and being worried about that, at the point where he goes to a college and hears a professor talking about "essential objects" in class, and Feynman asks "Is a brick an essential object?" - meaning to work up to the notion of the inside of a brick, which can't be observed because breaking a brick in half just gives you two new exterior surfaces - and everybody in the classroom has a different notion of what it would mean for a brick to be an essential object.
Richard Feynman knew to try plugging in bricks as a special case, but the people in the classroom didn't, and I think the mental motion has died out of the world even further since Feynman wrote about it. The loss has spread to STEM as well. Though if you don't read old books and papers and contrast them to new books and papers, you wouldn't see it, and maybe most of the people who'll eventually read this will have no idea what I'm talking about because they've never seen it any other way...

What is an example of an "important" or "high status" thing that you've never observed someone talk concretely about yet?

idk, I feel like Lightcone team talks about concrete things all the time?

I did think specifically of teammates when I said next-to-none.

I do question my own epistemics? Not sure about your argument regarding why I should, but I do.

Your second paragraph reads to me as “don’t have these beliefs because it would be socially costly”.

Yeah, you’re right, that is what my point boils down to. I think it’s a bad viewpoint to advocate one’s tribe endorse publicly independent of whether one believes it’s true.

Oh. I do. Why don't you?

EDIT: Nvm, I misunderstood the point, I thought the parent comment was arguing that people were good at being concrete, but apparently that was not the point, see followup thread with Ben

Though who knows if it is literally false -- what does it mean to be in a "lineage"? How many is implied by "one of the last"? I didn't learn concreteness from Feynman, I can remember using it in random philosophical conversations in high school, long before I knew who Feynman was or what EA / rationality were. Does that mean I wouldn't count as "one of the last of the lineage", even if I have the skill?

I'd be capable of helping aliens optimize their world, sure. I wouldn't be motivated to, but I'd be capable.

I just gave this a re-read, I forgot what a trip it is to read the thoughts of Eliezer Yudkowsky. It continues to be some of my favorite stuff in recent years written on LessWrong.

(And the smart people dialoguing with him here are good sports for keeping up their side of the argument.)

So I have a general thesis about a failure mode here which is that, the moment you try to sketch any concrete plan or events which correspond to the abstract descriptions, it is much more obviously wrong, and that is why the descriptions stay so abstract in the mouths of everybody who sounds more optimistic than I am.

As another example:

I think Rohin's misunderstanding about corrigibility, aka his notion of Paul!Corrigibility, doesn't actually come from Paul but from the Risks from Learned Optimization (RFLO) paper^[1]:

3. Robust alignment through corrigibility. Information about the base objective is incorporated into the mesa-optimizer's epistemic model and its objective is modified to “point to” that information. This situation would correspond to a mesa-optimizer that is corrigible(25) with respect to the base objective (though not necessarily the programmer's intentions).

IIRC the RFLO paper also caused some confusion in me when I started learning about corrigibility.
Though note that this kind of "direct alignment" doesn't necessarily correspond to what they call "internalized alignment". Their ontology doesn't make sense to me. (E.g. I don't see what concretely Evan might mean with "the information came through the base optimizer".)

Corrigibility_B, which I associated with Paul, was about building an AI system which would have particular nice behaviors like learning about the user's preferences, accepting corrections about what it should do, etc.

which feels much better as a short summary though still not great.

I continue to think that Paul and Eliezer have pretty different things in mind when they talk about corrigibility, and this comment seems like some vindication of my view.

I still feel like the existence of CIRL code that would both make-plans-that-lase and (in the short run) accept many kinds of corrections, learn about your preferences, give resources to you when you ask, etc should cast some doubt on the notion that corrigibility is anti-natural.

And I don't see it as evidence that Eliezer!corrigibility isn't anti-natural.

(In the following I use "corrigibility" in the Eliezer-sense. I'm pretty confident that all of the following matches Eliezer's model, but not completely sure.)

Roughly speaking, we can perhaps disentangle 3 corrigibility approaches:

Train for corrigible behavior.
1. I think Eliezer thinks that this will only create behavioral heuristics that won't get integrated into the optimization target of the powerful optimizer, and the optimizer will see those as constraints to find ways around or remove. Since doing a pivotal act requires a lot of optimization power, it might find a way around those constraints, or use the nearest unblocked strategy, which might still be undesirable.
2. (There might also be downsides of training for corrigible behavior, e.g. the optimization becoming less understandable and less predictable.)
Integrate corrigibility principles into the optimization.
1. These approaches are about trying to design the way the optimization works in ways that make it safer and less likely to blow up.
Coherent corrigibility / The hard problem of corrigibility.
1. If a solution here were found, it might have the shape of a utility function saying "serve the operators". Not "serve because you want the operators' values to be fulfilled". (Less sure here whether I understand this correctly.)
2. I think Max Harms is trying to make some progress on this.

Not a full response, but some notes:

I agree Eliezer likely wouldn't want "corrigibility" to refer to the thing I'm imagining, which is why I talk about MIRI!corrigibility and Paul!corrigibility.
I disagree that in early-CIRL "the AI doesn't already know its own values and how to accomplish them better than the operators". It knows that its goal is to optimize the human's utility function, and it can be better than the human at eliciting that utility function. It just doesn't have perfect information about what the human's utility function is.
I care quite a bit about what happens with AI systems that are around or somewhat past human level, but are not full superintelligence (for standard bootstrapping reasons).
I find it pretty plausible that shutdown corrigibility is especially anti-natural. Relatedly, (1) most CIRL agents will not satisfy shutdown corrigibility even at early stages, (2) most of the discussion on Paul!corrigibility doesn't emphasize or even mention shutdown corrigibility.
I agree Eliezer has various strategic considerations in mind that bear on how he thinks about corrigibility. I mostly don't share those considerations.
I'm not quite sure if you're trying to (1) convince me of something or (2) inform me of something or (3) write things down for your own understanding or (4) something else. If it's (1), you'll need to understand my strategic considerations (you can pretend I'm Paul, that's not quite accurate but it covers a lot). If it's (2), I would focus elsewhere, I have spent quite a lot of time engaging with the Eliezer / Nate perspective.

I agree Eliezer likely wouldn't want "corrigibility" to refer to the thing I'm imagining, which is why I talk about MIRI!corrigibility and Paul!corrigibility.

Yeah thanks for distinguishing. It's not at all obvious to me that Paul would call CIRL "corrigible" - I'd guess not, but idk.

I disagree that in early-CIRL "the AI doesn't already know its own values and how to accomplish them better than the operators". It knows that its goal is to optimize the human's utility function, and it can be better than the human at eliciting that utility function. It just doesn't have perfect information about what the human's utility function is.

In your previous comment you wrote:

I still feel like the existence of CIRL code that would both make-plans-that-lase and (in the short run) accept many kinds of corrections, learn about your preferences, give resources to you when you ask, etc should cast some doubt on the notion that corrigibility is anti-natural.

I care quite a bit about what happens with AI systems that are around or somewhat past human level, but are not full superintelligence (for standard bootstrapping reasons).

I (and I think also Eliezer) agree with that. But CIRL::correctability already breaks down at high human level, so I don't know what you mean here.

Also, in my view corrigibility isn't just about what happens if the alignment works out totally fine, but also about still maintaining correctability if it doesn't:

If something goes wrong with CIRL so its goal isn't pointed to the human utility function anymore, it would not want operators to correct it.

Though another way you can keep being able to correct the AI's goals is by having the AI not think much in the general domain about stuff like "the operators may change my goals" or so.

I'm not quite sure if you're trying to (1) convince me of something or (2) inform me of something or (3) write things down for your own understanding or (4) something else.

Mainly 3 and 4. But I am interested in seeing your reactions to get a better model of how some people think about corrigibility.

I think you are being led astray by having a one-dimensional notion of intelligence.

What I meant was "the AI doesn't already know how to evaluate what's best according to its own values better than the operators".

Well yes, that is the idea, there is information asymmetry between the AI and humans. Note that this can still apply even when the AI is much smarter than the humans.

CIRL seems obviously uncorrigible/uncorrectable except when the AI is still dumber than the smartest humans in the general domain. [...]
It accepts corrections as long as it has less common sense than humans, but as soon as it gets generally as smart as a very smart human it wouldn't.

Why can't this apply in the AI / human case?

I still find it confusing though why people started calling that corrigibility.

Thanks.

I think you are being led astray by having a one-dimensional notion of intelligence.

You identified the key property yourself: it's that the humans have an advantage over the AI at (particular parts of) evaluating what's best. (More precisely, it's that the humans have information that the AI does not have; it can still work even if the humans don't use their information to evaluate what's best.)

Or do you imagine strategically keeping some information from the AI?

Either way, the AI is only updating on information, not changing its (terminal) goals. (Though the instrumental subgoals can in principle change.)

But if we're confident alignment to humans will work out we don't need corrigibility. Corrigibility is rather intended so we might be able to recover if something goes wrong.

Do you agree that parents are at least somewhat corrigible / correctable by their kids, despite being much smarter / more capable than the kids? (For example, kid feels pain --> kid cries --> parent stops doing something that was accidentally hurting the child.)

Likewise just updating on new information, not changing terminal goals.

Also note that parents often think (sometimes correctly) that they better know what is in the child's extrapolated interests and then don't act according to the child's stated wishes.

I'm saying that (contingent on details about the environment and the information asymmetry) a lot of behaviors then fall out that look a lot like what you would want out of corrigibility, while still being a form of EU maximization (while under a particular kind of information asymmetry). This seems like it should be relevant evidence about "naturalness" of corrigibility.
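
A toy illustration of that claim, as a sketch under strong assumptions rather than anything from the CIRL literature: a single binary action, a human whose objection is a noisy signal about whether the action is good for them, and an agent that maximizes expected utility after a Bayesian update. All names and numbers (`posterior_good`, `best_choice`, the 0.9 prior, the likelihoods) are hypothetical.

```python
def posterior_good(prior_good, p_object_if_bad, p_object_if_good):
    """P(action is good for the human | the human objected), by Bayes' rule."""
    num = prior_good * p_object_if_good
    den = num + (1 - prior_good) * p_object_if_bad
    return num / den


def best_choice(p_good, u_good=1.0, u_bad=-1.0, u_stop=0.0):
    """Plain expected-utility maximization over {continue, stop}."""
    eu_continue = p_good * u_good + (1 - p_good) * u_bad
    return "continue" if eu_continue > u_stop else "stop"


prior = 0.9  # the agent starts out fairly sure its action helps

# Before any feedback, the agent continues.
before = best_choice(prior)

# The human objects; objections are assumed far likelier when the action is bad.
informed = posterior_good(prior, p_object_if_bad=0.95, p_object_if_good=0.05)
after = best_choice(informed)

# If objections carried no information, the same update would change nothing.
uninformed = posterior_good(prior, p_object_if_bad=0.5, p_object_if_good=0.5)
```

Here `before` is "continue" and `after` is "stop": the agent's terminal goal never changes, only its beliefs do. The deference depends entirely on the objection being informative; with equal likelihoods the posterior stays at the prior and the agent carries on, which is the regime the skeptical replies in this thread are worried about.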

I do think getting corrigible behavior out of a dumbish AI is easy. But it seems hard for an AI that is able to prevent anyone from building an unaligned AI.

I think some of your confusion may be that you're putting "probability theory" and "Newtonian gravity" into the same bucket.

Seems like Scott Aaronson shares a similar view on Quantum Mechanics which he views as "a certain generalization of probability theory"? From Quantum Computing since Democritus:

So, what is quantum mechanics? Even though it was discovered by physicists, it's not a physical theory in the same sense as electromagnetism or general relativity. In the usual "hierarchy of sciences" -- with biology at the top, then chemistry, then physics, then math -- quantum mechanics sits at a level between math and physics that I don't know a good name for. Basically, quantum mechanics is the operating system that other physical theories run on as application software (with the exception of general relativity, which hasn't yet been successfully ported to this particular OS). There's even a word for taking a physical theory and porting it to this OS: "to quantize."

This, I hope, is by now recognizable to individuals of interest as an overly abstract description of what happened with humans, who one day started building Moon rockets without seeming to care very much about calculating and maximizing their personal inclusive genetic fitness while doing that. Their capabilities generalized much further out of the ancestral training distribution, than the empirical alignment of those capabilities on inclusive genetic fitness in the ancestral training distribution.

I would like to live in a world where human capabilities had generalized much better than our alignment with evolution. I think it would look different to this one.
