Ngo and Yudkowsky on alignment difficulty

Richard_Ngo

[-]habryka3y*Ω32740Review for 2021 Review

I think this post might be the best one of all the MIRI dialogues. I also feel confused about how to relate to the MIRI dialogues overall.

A lot of the MIRI dialogues consist of Eliezer and Nate saying things that seem really important and obvious to me, and a lot of my love for them comes from a feeling of "this actually makes a bunch of the important arguments for why the problem is hard". But the nature of the argument is kind of closed off.

Like, I agree with these arguments, but if you believe them, getting traction on AI Alignment becomes much harder, and a lot of things that people currently label "AI Alignment" kind of stop feeling real. And I have this feeling that, even though a really quite substantial fraction of the people I talk to about AI Alignment are compelled by Eliezer's argument for difficulty, there is some kind of structural reason that AI Alignment as a field can't really track these arguments.

Like, a lot of people's jobs and funding rely on these arguments being false, and also, if these arguments are correct, the space of perspectives on the problem suddenly loses a lot of common ground on how to proceed or what to do, and it... (read more)

[-]Eliezer Yudkowsky3yΩ5120

If it's a mistake you made over the last two years, I have to say in your defense that this post didn't exist 2 years ago.

2habryka3y

I think I was actually helping Robby edit some early version of this post a few months before it was posted on LessWrong, so I think my exposure to it was actually closer to ~18-20 months ago. I do think that still means I set a lot of my current/recent plans into motion before this was out, and your post is appreciated.

3Noosphere893y

Interestingly enough I believe the opposite: Eliezer was quite wrong (though not wrong enough to think we're totally out of the danger zone). I think this for several reasons:

1. I think that GPT is proof that reasonably large intelligence can be done without being agentic. A lot of LW arguments start failing once we realize that GPT isn't an agent, but rather a simulator/oracle AI of the kind described in Janus's Simulators post. His post is here: https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators This is immensely valuable, especially if the simulator framing holds in the limit, which would mean we have superhuman AI that is myopic and non-agentic, so no instrumental convergence or inner alignment problems come up here. This avoids many hard questions we would otherwise have to solve.
2. I believe natural abstractions hold well enough that the abstractions used by a human and the ones used by an AI are easy to translate. One of Logan Zoellner's posts covers how good natural abstractions are, and they are really good in models that are very capable. If AI Alignment were a natural abstraction, then Outer Alignment would solve itself, though I would be careful here. Logan Zoellner's post is here: https://www.lesswrong.com/posts/BdfQMrtuL8wNfpfnF/natural-categories-update
3. I believe sandboxing powerful AI such that they don't learn particular things like human models or deception is actually possible and maybe reasonably practical. Indeed I gave a proof on Christmas showing that, conditional on careful enough curation of data and fully removing nondeterminism (which isn't super difficult; blockchains already do this for consensus reasons), an AI can't break out of the sandbox due to the No Free Lunch theorem. Post here by me: https://www.lesswrong.com/posts/osmwiGkCGxqPfLf4A/i-ve-updated-towards-ai-boxing-being-surprisingly-easy

One big problem still remains: Amdahl's law suggests that if you have a tool that helps you do something very well vs an agent where you just delegate th

[-]Rob Bensinger4y*Ω24600

This is the first post in a sequence, consisting of the logs of a Discord server MIRI made for hashing out AGI-related disagreements with Richard Ngo, Open Phil, etc.

I did most of the work of turning the chat logs into posts, with lots of formatting help from Matt Graves and additional help from Oliver Habryka, Ray Arnold, and others. I also hit the 'post' button for Richard and Eliezer. (I don't plan to repeat this note on future posts in this sequence, unless folks request it.)

[-]lincolnquirk4y430

I'd like to express my gratitude and excitement (and not just to you, Rob, though your work is included in this):

Deep thanks to everyone involved for having the discussion, writing up and formatting, and posting it on LW. I think this is some of the more interesting and potentially impactful stuff I've seen relating to AI alignment in a long while.

(My only thought is... why hasn't a discussion like this occurred sooner? Or has it, and it just hasn't made it to LW?)

[-]Rob Bensinger4y*210

I'm not sure why we haven't tried the 'generate and publish chatroom logs' option before. If you mean more generally 'why is MIRI waiting to hash these things out with other xrisk people until now?', my basic model is:

Syncing with others was a top priority for SingInst (2000-2012), and this resulted in stuff like the Sequences, the FOOM debate, Highly Advanced Epistemology 101 for Beginners, the Singularity Summits, etc. It (largely) doesn't cover the same ground as current disagreements because people disagree about different stuff now.
'SingInst' becoming 'MIRI' in 2013 coincided with us shifting much more to a focus on alignment research. That said, a lot of factors resulted in us continuing to have a lot of non-research-y conversations with others, including: EA coalescing in 2012-2014; the wider AI alignment field starting in earnest with the release of Superintelligence (2014) and the Puerto Rico conference (2015); and Open Philanthropy starting in 2014.
- Some of these conversations (and the follow-up reflections prompted by these conversations) ended up inspiring publications at some point, including some of the content on Arbital (mostly active 2015-2017), Inadequate Equilibri

... (read more)

[-]Eliezer Yudkowsky4y110

I'm definitely not happy with others' sense of how to do field-building, but it's not like I thought I could fix that issue by spending the rest of my life trying to do it myself.

6Vaniver4y

My guess is that a lot of these conversations often hinge on details that people are somewhat antsy about saying in public, and I suspect MIRI now thinks the value of "credible public pessimism" is larger than the cost of "gesturing towards things that seem powerful" on the margin, such that chatlogs like this are a better idea than they would have seemed to the MIRI of 4 years ago. [Or maybe it was just "no one thought to try, because we had access to in-person conversations and those seemed much better, despite not generating transcripts."]

[-]johnswentworth4yΩ13260

So here's one important difference between humans and neural networks: humans face the genomic bottleneck which means that each individual has to rederive all the knowledge about the world that their parents already had. If this genetic bottleneck hadn't been so tight, then individual humans would have been significantly less capable of performing novel tasks.

I disagree with this in an interesting way. (Not particularly central to the discussion, but since both Richard & Eliezer thought the quoted claim is basically-true, I figured I should comment on it.)

First, outside view evidence: most of the genome is junk. That's pretty strong evidence that the size of the genome is not itself a taut constraint. If there were evolutionary fitness gains to be had, in general, by passing more information via the genome, then we should expect that to have evolved already.

Second, inside view: overparameterized local search processes (including evolution and gradient descent on NNs) perform information compression by default. This is a technical idea that I haven't written up properly yet, but as a quick sketch... suppose that I have a neural net with N parameters. It's overparameterized, so there ... (read more)

[-]DaemonicSigil4yΩ8230

Large genomes have (at least) 2 kinds of costs. The first is the energy and other resources required to copy the genome whenever your cells divide. The existence of junk DNA suggests that this cost is not a limiting factor. The other cost is that a larger genome will have more mutations per generation. So maintaining that genome across time uses up more selection pressure. Junk DNA requires no maintenance, so it provides no evidence either way. Selection pressure cost could still be the reason why we don't see more knowledge about the world being translated genetically.

A gene-level way of saying the same thing is that even a gene that provides an advantage may not survive if it takes up a lot of genome space, because it will be destroyed by the large number of mutations.

[-]johnswentworth4yΩ8110

Good point, I wasn't thinking about that mechanism.

However, I don't think this creates an information bottleneck in the sense needed for the original claim in the post, because the marginal cost of storing more information in the genome does not increase via this mechanism as the amount-of-information-passed increases. Each gene just needs to offer a large enough fitness advantage to counter the noise on that gene; the requisite fitness advantage does not change depending on whether the organism currently has a hundred information-passing genes or a hundred thousand. It's not really a "bottleneck" so much as a fixed price: the organism can pass any amount of information via the genome, so long as each base-pair contributes marginal fitness above some fixed level.

It does mean that individual genes can't be too big, but it doesn't say much about the number of information-passing genes (so long as separate genes have mostly-decoupled functions, which is indeed the case for the vast majority of gene pairs in practice).

4darius4y

Here's the argument I'd give for this kind of bottleneck. I haven't studied evolutionary genetics; maybe I'm thinking about it all wrong. In the steady state, an average individual has n children in their life, and just one of those n makes it to the next generation. (Crediting a child 1/2 to each parent.) This gives log2(n) bits of error-correcting signal to prune deleterious mutations. If the genome length times the functional bits per base pair times the mutation rate is greater than that log2(n), then you're losing functionality with every generation. One way for a beneficial new mutation to get out of this bind is by reducing the mutation rate. Another is refactoring the same functionality into fewer bits, freeing up bits for something new. But generically a fitness advantage doesn't seem to affect the argument that the signal from purifying selection gets shared by the whole genome.
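As a rough sketch of the arithmetic in this argument (not something from the original comment): the per-generation selection budget is log2(n) bits, and the mutational load is genome length times functional bits per base pair times mutation rate. The function names and every number used with them are illustrative assumptions, not measured values.

```python
import math


def selection_budget_bits(children_per_individual: float) -> float:
    """Bits of purifying-selection signal per generation: log2(n),
    where one of n children survives into the next generation."""
    return math.log2(children_per_individual)


def mutation_load_bits(genome_length_bp: float,
                       functional_bits_per_bp: float,
                       mutation_rate_per_bp: float) -> float:
    """Expected functional bits damaged by mutation per generation."""
    return genome_length_bp * functional_bits_per_bp * mutation_rate_per_bp


def is_losing_functionality(children_per_individual: float,
                            genome_length_bp: float,
                            functional_bits_per_bp: float,
                            mutation_rate_per_bp: float) -> bool:
    """True when the load exceeds the selection budget, i.e. the
    steady state described in the argument cannot be maintained."""
    load = mutation_load_bits(genome_length_bp, functional_bits_per_bp,
                              mutation_rate_per_bp)
    return load > selection_budget_bits(children_per_individual)


def max_maintainable_genome_bp(children_per_individual: float,
                               functional_bits_per_bp: float,
                               mutation_rate_per_bp: float) -> float:
    """Largest fully functional genome the selection budget can maintain."""
    budget = selection_budget_bits(children_per_individual)
    return budget / (functional_bits_per_bp * mutation_rate_per_bp)


if __name__ == "__main__":
    # Hypothetical inputs: 4 children, 2 functional bits per base pair,
    # 1e-8 mutations per base pair per generation.
    print(max_maintainable_genome_bp(4, 2.0, 1e-8))
```

Under these made-up inputs the budget is 2 bits per generation, and the ceiling on functional genome size scales inversely with the mutation rate, which is why lowering the mutation rate is one of the two ways out that the argument mentions.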

7TekhneMakre4y

My guess is that this is a total misunderstanding of what's meant by "genomic bottleneck". The bottleneck isn't the amount of information storage, it's the fact that the genome can only program the mind in a very indirect, developmental way, so that it can install stuff like "be more interested in people" but not "here's how to add numbers".

7cousin_it4y

That seems wrong, living creatures have lots of specific behaviors that are genetically programmed. In fact I think both you and John are misunderstanding the bottleneck. The point isn't that the genome is small, nor that it affects the mind indirectly. The point is that the mind doesn't affect the genome. Living creatures don't have the tech to encode their life experience into genes for the next generation.

[-]Richard_Ngo4y160

I've appreciated this comment thread! My take is that you're all talking about different relevant things. It may well be the case that there are multiple reasons why more skills and knowledge aren't encoded in our genomes: a) it's hard to get that information in (from parents' brains), b) it's hard to get that information out (to children's brains), and c) having large genomes is costly. What I'm calling the genomic bottleneck is a combination of all of them (although I think John is probably right that c) is not the main reason).

What would falsify my claim about the genomic bottleneck is if the main reason there isn't more information passed on via genomes is because d) doing so is not very useful. That seems pretty unlikely, but not entirely out of the picture. E.g. we know that evolution is able to give baby deer the skill of walking shortly after birth, so it seems like d) might be the best explanation of why humans can't do that too. But deer presumably evolved that skill over a very long time period, whereas I'm more interested in rapid changes.

3TekhneMakre4y

Do you think you can encode good flint-knapping technique genetically? I doubt that. I think I agree with your point, and think it's a more general and correct statement of the bottleneck; but, still, I think that genome does mainly affect the mind indirectly, and this is one of the constraints making it be the case that humans have lots of learning / generalizing capability. (This doesn't just apply to humans. What are some stark examples of animals with hardwired complex behaviors? With a fairly high bar for "complex", and a clear explanation of what is hardwired and how we know. Insects have some fairly complex behaviors, e.g. web building, ant-hill building, the tree-leaf nests of weaver ants, etc.; but IDK enough to rule out a combination of a little hardwiring, some emergence, and some learning. Lots of animals hunt after learning from their parents how to hunt. I think a lot of animals can walk right after being born? I think beavers in captivity will fruitlessly chew on wood, indicating that the wild phenotype is encoded by something simple like "enjoys chewing" (plus, learned desire for shelter), rather than "use wood for dam".) An operationalization of "the genome directly programs the mind" would be that things like [the motions employed in flint-knapping] can be hardwired by small numbers of mutations (and hence can be evolved given a few million relevant years). I think this isn't true, but counterevidence would be interesting. Since the genome can't feasibly directly encode behaviors, or at least can't learn those quickly enough to keep up with a changing niche, the species instead evolves to learn behaviors on the fly via algorithms that generalize. If there were *either* mind-mind transfer, *or* direct programming of behavior by the genome, then higher frequency changes would be easier and there'd be less need for fluid intelligence. (In fact it's sort of plausible to me (given my ignorance) that humans are imitation specialists and are less clever

1Alexander Gietelink Oldenziel4y

Some animal behaviours are certainly hardwired. There is the famous case of one bee species being immune to a pathogen because of a specific cleaning behaviour that is encoded by a single gene. One important point that should be brought up in this context is sexual recombination. If you have a part of a genome encoding a complex behaviour, it can get reshuffled in the new generation. You would need some pretty powerful error-correcting code to keep things working.

[-]KatWoods4y*230

You can listen to this and all the other Yudkowsky & Ngo/Christiano conversations in podcast form on the Nonlinear Library now.

Christiano on take-off speeds here (part I, part II, part III)
Ngo on alignment difficulty (part I, part II, part III)
Ngo on capabilities gains (part I, part II)

You can also listen to them on any podcast player. Just look up Nonlinear Library.

I’ve listened to them as is and I find it pretty easy to follow, but if you’re interested in making it even easier for people to follow, these fine gentlemen have put up a ~$230 RFP/bounty for anybody who turns it into audio where each person has a different voice.

It would probably be easiest to just do it on our platform, since there’s a relatively easy way to change the voices, it will just be a tedious ~1-4 hours of work. My main bottleneck is management time, so I don’t have the time to manage the process or choose somebody who I’d trust to do it without messing with the quality.

It does seem a shame though, to have something so close to being even better, and not let people do what clearly is desired, because of my worry of accidentally messing up the quality of the audio. I think the ma... (read more)

[This comment is no longer endorsed by its author]

3jimrandomh4y

(Mod note: I edited this comment to fix broken links.)

1KatWoods4y

Thank you!

3Rob Bensinger4y

Thanks for doing this, Kat! :) That link isn't working for me; where's the bounty? Edit: Bounty link is working now: https://twitter.com/lxrjl/status/1464119232749318155

[-]TurnTrout4yΩ14220

I've started commenting on this discussion on a Google Doc. Here are some excerpts:

During this step, if humanity is to survive, somebody has to perform some feat that causes the world to not be destroyed in 3 months or 2 years when too many actors have access to AGI code that will destroy the world if its intelligence dial is turned up.

Contains implicit assumptions about takeoff that I don't currently buy:

Well-modelled as binary "has-AGI?" predicate;
- (I am sympathetic to the microeconomics of intelligence explosion working out in a way where "well-modelled as binary 'has-AGI?' predicate" is true, but I feel uncertain about the prospect)
Somehow rules out situations like: We have somewhat aligned AIs which push the world to make future unaligned AIs slightly less likely, which makes the AI population more aligned on average; this cycle compounds until we're descending very fast into the basin of alignment and goodness.
- This isn't my mainline or anything, but I note that it's ruled out by Eliezer's model as I understand it.
Some other internal objections are arising and I'm not going to focus on them now.

Every AI output effectuates outcomes in the world.

Right but the likely domain of cogn... (read more)

[-]Ramana Kumar4yΩ13220

I am interested in the history-funnelling property -- the property of being like a consequentialist, or of being effective at achieving an outcome -- and have a specific confusion I'd love to get insight on from anyone who has any.

Question: Possible outcomes are in the mind of a world-modeller - reality just is as it is (exactly one way) and isn't made of possibilities. So in what sense do the consequentialist-like things Yudkowsky is referring to funnel history?

Option 1 (robustness/behavioural/our models): They achieve narrow outcomes with respect to an externally specified set of counterfactuals. E.g., relative to what we consider "could have happened", the consequentialists selected an excellent course of action for their purposes. This would make consequentialists optimizing systems in Flint's sense.

Option 2 (agency/structural/their models): They are structured in such a way that they do their own considering and evaluating and deciding. We observe mechanisms that implement the processes of predicting and evaluating outcomes in these systems (and/or their history). So the possibilities that are narrowed down are the consequentialist's possibilities, the counterfactuals ar... (read more)

9Eliezer Yudkowsky4y

To Rob's reply, I'll add that my own first reaction to your question was that it seems like a map-territory / perspective issue as appears in eg thermodynamics? Like, this has a similar flavor to asking "What does it mean to say that a classical system is in a state of high entropy when it actually only has one particular system state?" Adding this now in case I don't have time to expand on it later; maybe just saying that much will help at all, possibly.
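A toy illustration of that map-territory point, assuming a system of coins rather than anything physical (the function name and the coin setup are hypothetical, chosen only for this sketch): the system is always in exactly one microstate, but the entropy is a property of the coarse description, namely how many microstates are compatible with it.

```python
import math


def macrostate_entropy_bits(n_coins: int, n_heads: int) -> float:
    """Entropy, in bits, of the macrostate "n_heads out of n_coins are heads".

    This is log2 of the number of distinct coin arrangements (microstates)
    that match the description. The actual arrangement is always just one
    of them; the entropy describes the observer's coarse-grained knowledge.
    """
    return math.log2(math.comb(n_coins, n_heads))


if __name__ == "__main__":
    # "All tails" pins down the arrangement completely: zero entropy.
    print(macrostate_entropy_bits(100, 0))
    # "Half heads" is compatible with a huge number of arrangements.
    print(macrostate_entropy_bits(100, 50))
```

In the same way, "the possibilities that get funnelled" can be read as belonging to a description of the world at some level of coarse-graining, not to the single way reality actually is.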

7Rob Bensinger4y

I'm not sure that I understand the question, but my intuition is to say: they funnel world-states into particular outcomes in the same sense that literal funnels funnel water into particular spaces, or in the same sense that a slope makes things roll down it. If you find water in a previously-empty space with a small aperture, and you're confused that no water seems to have spilled over the sides, you may suspect that a funnel was there. Funnels are part of a larger deterministic universe, so maybe in some sense any given funnel (like everything else) 'had to do exactly that thing'. Still, we can observe that funnels are an important part of the causal chain in these cases, and that places with funnels tend to end up with this type of outcome much more often. Similarly, consequentialists tend to remake parts of the world (typically, as much of the world as they can reach) into things that are high in their preference ordering. From Optimization and the Singularity: But it's not clear what a "preference" is, exactly. So a more general way of putting it, in Recognizing Intelligence, is: "Consequentialists funnel the universe into shapes that are higher in their preference ordering" isn't a required inherent truth for all consequentialists; some might have weird goals, or be too weak to achieve much. Likewise, some literal funnels are broken or misshapen, or just never get put to use. But in both cases, we can understand the larger class by considering the unusual function well-working instances can perform. (In the case of literal funnels, we can also understand the class by considering its physical properties rather than its function/behavior/effects. Eventually we should be able to do the same for consequentialists, but currently we don't know what physical properties of a system make it consequentialist, beyond the level of generality of e.g. 'its future-steering will approximately obey expected utility theory'.)

[-]Ramana Kumar4yΩ6100

Thanks for the replies! I'm still somewhat confused but will try again to both ask the question more clearly and summarise my current understanding.

What, in the case of consequentialists, is analogous to the water funnelled by literal funnels? Is it possibilities-according-to-us? Or is it possibilities-according-to-the-consequentialist? Or is it neither (or both) of those?

To clarify a little what the options in my original comment were, I'll say what I think they correspond to for literal funnels. Option 1 corresponds to the fact that funnels are usually nearby (in spacetime) when water is in a small space without having spilled, and Option 2 corresponds to the characteristic funnel shape (in combination with facts about physical laws maybe).

I think your and Eliezer's replies are pointing me at a sense in which both Option 1 and Option 2 are correct, but they are used in different ways in the overall story. To tell this story, I want to draw a distinction between outcome-pumps (behavioural agents) and consequentialists (structural agents). Outcome-pumps are effective at achieving outcomes, and this effectiveness is measured according to our models (option 1). Consequentialist... (read more)

[-]Eliezer Yudkowsky4yΩ10160

My reply to your distinction between 'consequentialists' and 'outcome pumps' would be, "Please forget entirely about any such thing as a 'consequentialist' as you defined it; I would now like to talk entirely about powerful outcome pumps. All understanding begins there, and we should only introduce the notion of how outcomes are pumped later in the game. Understand the work before understanding the engines; nearly every key concept here is implicit in the notion of work rather than in the notion of a particular kind of engine."

(Modulo that lots of times people here are like "Well but a human at a particular intelligence level in a particular complicated circumstance once did this kind of work without the thing happening that it sounds like you say happens with powerful outcome pumps"; and then you have to look at the human engine and its circumstances to understand why outcome pumping could specialize down to that exact place and fashion, which will not be reduplicated in more general outcome pumps that have their dice re-rolled.)

8Ramana Kumar4y

A couple of direct questions I'm stuck on:
* Do you agree that Flint's optimizing systems are a good model (or even definition) of outcome pumps?
* Are black holes and fires reasonable examples of outcome pumps?

I'm asking these to understand the work better. Currently my answers are:
* Yes. Flint's notion is one I came to independently when thinking about "goal-directedness". It could be missing some details, but I find it hard to snap out of the framework entirely.
* Yes. But maybe not the most informative examples. They're highly non-retargetable.

3Daniel Kokotajlo4y

I don't know the relevant history of science, but I wouldn't be surprised if something like the opposite was true: Our modern, very useful understanding of work is an abstraction that grew out of many people thinking concretely about various engines. Thinking about engines was like the homework exercises that helped people to reach and understand the concept of work. Similarly, perhaps it is pedagogically (and conceptually) helpful to begin with the notion of a consequentialist and then generalize to outcome pumps.

[-]Eli Tyre4y190

Von Neumann was actually a fairly reflective fellow who knew about, and indeed helped generalize, utility functions. The great achievements of von Neumann were not achieved by some very specialized hypernerd who spent all his fluid intelligence on crystallizing math and science and engineering alone, and so never developed any opinions about politics or started thinking about whether or not he had a utility function.

Uh. I don't know about that.

Von Neumann seemed to me to be very much not making rational tradeoffs of the sort that one would if they were conceptualizing themselves as an agent with a utility function.

From a short post I wrote, a few years ago, after reading a bit about the man:

For one thing, at the end of his life, he was terrified of dying. But throughout the course of his life he made many reckless choices with his health.
He ate gluttonously and became fatter and fatter over the course of his life. (One friend remarked that he “could count anything but calories.”)
Furthermore, he seemed to regularly risk his life when driving.
Von Neumann was an aggressive and apparently reckless driver. He supposedly totaled his car every year or so. An intersection in P

... (read more)

3Lukas_Gloor4y

Some of your examples don't prove anything, e.g., eating gluttonously is a legitimate tradeoff if you have a certain metabolism and care more about advancing science as a life goal in years where your brain still works well. About the driving, I guess it depends on how reckless it was. It's probably rare for people to die in inner-city driving accidents, especially if you make sure to not mess around at intersections. Judging by the part about singing, it seems possible he was just having fun and could afford to buy new cars?

6Eli Tyre4y

I agree that they aren't conclusive. But are you suggesting that the reckless driving was well-considered expected utility maximizing? I guess I can see that if fatal accidents are rare, but I don't think that was the case? "Activities that have a small, but non-negligible chance of death or permanent injury are not worth the immediate short-term thrill" seems like a textbook case of a conclusion one would draw from considering expected utility theory in practice, in one's life. At minimum, it seems like there ought to be pareto-improvements that are just as fun, or close to it, but which entail a lot less risk?

4Lukas_Gloor4y

I agree that if driving incurs non-trivial risks of lasting damage, that's indicative that the person isn't trying very seriously to optimize some ambitious long-term goal. This reasoning makes me think your model lacks gears about what it's like to live with certain types of psychologies. Making pareto improvements for your habits is itself a task to be prioritized. Depending on what else you have going on in life and how difficult it is to you to replace one habit with a different one, it's totally possible that for some period, it's not rational for you to focus on the habit change. Basically, because often the best way to optimize your utility comes from applying your strengths to solve a certain bottleneck under time pressure, the observation "this person engages in suboptimal-seeming behavior some of the time" provides very little predictive evidence. In fact, if you showed me someone who never engaged in such suboptimal behavior, I'd be tempted to wonder if they're maybe not optimizing hard enough in that one area that matters more than everything else they could do. That said, it is a bit hard to empathize with "driving recklessly while singing" as a hard-to-change behavior. It doesn't sound like something particularly compulsive, except maybe if the impulse to sing came from exuberant happiness due to amphetamine use. But who knows. Von Neumann for sure had an unusual brain and maybe he often had random overwhelming feelings of euphoria.

3Lukas_Gloor4y

I think trying to hyperoptimize a healthy lifestyle, or micromanaging productivity hacks to the point of spending a lot of one's attention on new productivity hacks, is probably a bigger mistake than getting overweight, as long as the overweight person puts as much of their brainpower as possible into actually irreplaceable cognitive achievements. And long-term health is only important if you care a lot about living for very long.

[-]Vanessa Kosoy4yΩ12190

Comment after reading section 3:

I want to push back a little against the claim that the bootstrapping strategy ("build a relatively weak aligned AI that will make superhumanly fast progress on AI alignment") is definitely irrelevant/doomed/inferior. Specifically, I don't know whether this strategy is good or not in practice, but it serves as a useful threshold for what level/kind of capabilities we need to align in order to solve AI risk.

Yudkowsky and I seem to agree that "do a pivotal act directly" is not something productive for us to work on, but "do alignment research" is something productive for us to work on. Therefore, there exists some range of AI capabilities which allow for superhuman alignment research but not for pivotal acts. Maybe this range is so narrow that in practice AI capability will cross it very quickly, or maybe not.

Moreover, I believe that there are trade-offs between safety and capability. This not only seems plausible, but actually shows up in many approaches to safety (quantilization, confidence thresholds / consensus algorithms, homomorphic encryption...) Therefore, it's not safe to assume that any level of capability sufficient to pose risk (i.e. for a nega... (read more)

5Edouard Harris4y

Yeah, very much agree with all of this. I even think there's an argument to be made that relatively narrow-yet-superhuman theorem provers (or other research aids) could be worth the risk to develop and use, because they may make the human alignment researchers who use them more effective in unpredictable ways. For example, researchers tend to instinctively avoid considering solution paths that are bottlenecked by statements they see as being hard to prove — which is totally reasonable. But if your mentality is that you can just toss a super-powerful theorem-prover at the problem, then you're free to explore concept-space more broadly since you may be able to check your ideas at much lower cost. (Also find myself agreeing with your point about tradeoffs. In fact, you could think of a primitive alignment strategy as having a kind of Sharpe ratio: how much marginal x-risk does it incur per marginal bit of optimization it gives? Since a closed-form solution to the alignment problem doesn't necessarily seem forthcoming, measuring its efficient frontier might be the next best thing.)

[-]Sam Clarke4y190

Minor terminology note, in case discussion about "genomic/genetic bottleneck" continues: genetic bottleneck appears to have a standard meaning in ecology (different to Richard's meaning), so genomic bottleneck seems like the better term to use.

[-]Daniel Kokotajlo4yΩ12180

[Notes mostly to myself, not important, feel free to skip]

My hot take overall is that Yudkowsky is basically right but doing a poor job of arguing for the position. Ngo is very patient and understanding.

"it doesn't seem implausible to me that we build AIs that are significantly more intelligent (in the sense of being able to understand the world) than humans, but significantly less agentic." --Ngo

"It is likely that, before the point where AGIs are strongly superhuman at seeking power, they will already be strongly superhuman at understanding the world, and at performing narrower pivotal acts like alignment research which don’t require as much agency (by which I roughly mean: large-scale motivations and the ability to pursue them over long timeframes)." --Ngo

"So it is legit harder to point out "the consequentialist parts of the cat" by looking for which sections of neurology are doing searches right there. That said, to the extent that the visual cortex does not get tweaked on failure to catch a mouse, it's not part of that consequentialist loop either." --Yudkowsky

"But the answer is that some problems are difficult in that they require solving lots of subproblems, and an easy way t... (read more)

[-]Eliezer Yudkowsky4yΩ9200

The idea is not that humans are perfect consequentialists, but that they are able to work at all to produce future-steering outputs, insofar as humans actually do work at all, by an inner overlap of the shape of inner parts which has a shape resembling consequentialism, and the resemblance is what does the work. That is, your objection has the same flavor as "But humans aren't Bayesian! So how can you say that updating on evidence is what's doing their work of mapmaking?"

6Daniel Kokotajlo4y

To be clear I think I agree with your overall position. I just don't think the argument you gave for it (about bureaucracies etc.) was compelling.

5Charlie Steiner4y

Perhaps... too patient and understanding. Richard! Blink twice if you're being held against your will! (I too would like you to write more about agency :P)

[-]Ruby4y170

Curated. The treatment of how cognition/agents/intelligence work alone makes this post curation-worthy, but I want to further commend how much it attempts to bridge [large] inferential distances notwithstanding Eliezer's experience of it being difficult to bridge all the distance. Heck, just bridging some distance about the distance is great.

I think good things would happen if we had more dialogs like this between researchers. I'm interested in making it easier to conduct and publish them on LessWrong, so thanks to all involved for the inspiration.

[-][anonymous]4y*150

[I may be generalizing here and I don't know if this has been said before.]

It seems to me that Eliezer's models are a lot more specific than people like Richard's. While Richard may put some credence on superhuman AI being "consequentialist" by default, Eliezer has certain beliefs about intelligence that make it extremely likely in his mind.

I think Eliezer's style of reasoning which relies on specific, thought-out models of AI makes him more pessimistic than others in EA. Others believe there are many ways that AGI scenarios could play out and are generally uncertain. But Eliezer has specific models that make some scenarios a lot more likely in his mind.

There are many valid theoretical arguments for why we are doomed, but maybe other EAs put less credence in them than Eliezer does.

[-]cousin_it4yΩ5150

I think it makes complete sense to say something like "once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely". And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there's no easy way to run such an AI safely, and all tricks like "ask the AI for plans that succeed conditional on them being executed" fail. And maybe I'm being thick, but the argument for that point still isn't reaching me somehow. Can someone rephrase for me?

[-]johnswentworth4yΩ8230

The main issue with this sort of thing (on my understanding of Eliezer's models) is Hidden Complexity of Wishes. You can make an AI safe by making it only able to fulfill certain narrow, well-defined kinds of wishes where we understand all the details of what we want, but then it probably won't suffice for a pivotal act. Alternatively, you can make it powerful enough for a pivotal act, but unfortunately a (good) pivotal act probably has to be very big, very irreversible, and very entangled with all the complicated details of human values. So alignment is likely to be a necessary step for a (good) pivotal act.

What this looks-like-in-practice is that "ask the AI for plans that succeed conditional on them being executed" has to be operationalized somehow, and the operationalization will inevitably not correctly capture what we actually want (because "what we actually want" has a ton of hidden complexity).

[-]cousin_it4y*Ω5130

This is tricky. Let's say we have a powerful black box that initially has no knowledge or morals, but a lot of malleable computational power. We train it to give answers to scary real-world questions, like how to succeed at business or how to manipulate people. If we reward it for competent answers while we can still understand the answers, at some point we'll stop understanding answers, but they'll continue being super-competent. That's certainly a danger and I agree with it. But by the same token, if we reward the box for aligned answers while we still understand them, the alignment will generalize too. There seems no reason why alignment would be much less learnable than competence about reality.

Maybe your and Eliezer's point is that competence about reality has a simple core, while alignment doesn't. But I don't see the argument for that. Reality is complex, and so are values. A process for learning and acting in reality can have a simple core, but so can a process for learning and acting on values. Humans pick up knowledge from their surroundings, which is part of "general intelligence", but we pick up values just as easily and using the same circuitry. Where does the symmetry break?

[-]johnswentworth4yΩ13210

I do think alignment has a relatively-simple core. Not as simple as intelligence/competence, since there's a decent number of human-value-specific bits which need to be hardcoded (as they are in humans), but not enough to drive the bulk of the asymmetry.

(BTW, I do think you've correctly identified an important point which I think a lot of people miss: humans internally "learn" values from a relatively-small chunk of hardcoded information. It should be possible in-principle to specify values with a relatively small set of hardcoded info, similar to the way humans do it; I'd guess fewer than at most 1000 things on the order of complexity of a very fuzzy face detector are required, and probably fewer than 100.)

The reason it's less learnable than competence is not that alignment is much more complex, but that it's harder to generate a robust reward signal for alignment. Basically any sufficiently-complex long-term reward signal should incentivize competence. But the vast majority of reward signals do not incentivize alignment. In particular, even if we have a reward signal which is "close" to incentivizing alignment in some sense, the actual-process-which-generates-the-reward-signal is... (read more)

[-]cousin_it4y*Ω15300

Thinking about it more, it seems that messy reward signals will lead to some approximation of alignment that works while the agent has low power compared to its "teachers", but at high power it will do something strange and maybe harm the "teachers" values. That holds true for humans gaining a lot of power and going against evolutionary values ("superstimuli"), and for individual humans gaining a lot of power and going against societal values ("power corrupts"), so it's probably true for AI as well. The worrying thing is that high power by itself seems sufficient for the change, for example if an AI gets good at real-world planning, that constitutes power and therefore danger. And there don't seem to be any natural counterexamples. So yeah, I'm updating toward your view on this.

[-]Steven Byrnes4yΩ10180

Speaking for myself here…

OK, let's say we want an AI to make a "nanobot plan". I'll leave aside the possibility of other humans getting access to a similar AI as mine. Then there are two types of accident risk that I need to worry about.

First, I need to worry that the AI may run for a while, then hand me a plan, and it looks like a nanobot plan, but it's not, it's a booby trap. To avoid (or at least minimize) that problem, we need to be confident that the AI is actually trying to make a nanobot plan—i.e., we need to solve the whole alignment problem.

Alternatively, maybe we're able to thoroughly understand the plan once we see it; we're just too stupid to come up with it ourselves. That seems awfully fraught—I'm not sure how we could be so confident that we can tell apart nanobot plans from booby-trap plans. But let's assume that's possible for the sake of argument, and then move on to the other type of accident risk:

Second, I need to worry that the AI will start running, and I think it's coming up with a nanobot plan, but actually it's hacking its way out of its box and taking over the world.

How and why might that happen?

I would say that if a nanobot plan is very hard to create—req... (read more)

[-]johnswentworth4yΩ7150

Personally, I'd consider a Fusion Power Generator-like scenario a more central failure mode than either of these. It's not about the difficulty of getting the AI to do what we asked, it's about the difficulty of posing the problem in a way which actually captures what we want.

4Steven Byrnes4y

I agree that that is another failure mode. (And there are yet other failure modes too—e.g. instead of printing the nanobot plan, it prints "Help me I'm trapped in a box…" :-P . I apologize for sloppy wording that suggested the two things I mentioned were the only two problems.) I disagree about "more central". I think that's basically a disagreement on the question of "what's a bigger deal, inner misalignment or outer misalignment?" with you voting for "outer" and me voting for "inner, or maybe tie, I dunno". But I'm not sure it's a good use of time to try to hash out that disagreement. We need an alignment plan that solves all the problems simultaneously. Probably different alignment approaches will get stuck on different things.

[-]Koen.Holtman4yΩ5160

I think it makes complete sense to say something like "once we have enough capability to run AIs making good real-world plans, some moron will run such an AI unsafely". And that itself implies a startling level of danger. But Eliezer seems to be making a stronger point, that there's no easy way to run such an AI safely, and all tricks like "ask the AI for plans that succeed conditional on them being executed" fail.

Yes, I am reading here too that Eliezer seems to be making a stronger point, specifically one related to corrigibility.

Looks like Eliezer believes that (or in Bayesian terms, assigns a high probability to the belief that) corrigibility has not been solved for AGI. He believes it has not been solved for any practically useful value of solved. Furthermore it looks like he expects that progress on solving AGI corrigibility will be slower than progress on creating potentially world-ending AGI. If Eliezer believed that AGI corrigibility had been solved or was close to being solved, I expect he would be in a less dark place than depicted, that he would not be predicting that stolen/leaked AGI code will inevitably doom us when some moron turns it up to 11.

In the transcri... (read more)

[-]Eliezer Yudkowsky4yΩ16370

Various previous proposals for utility indifference have foundered on gotchas like "Well, if we set it up this way, that's actually just equivalent to the AI assigning probability 0 to the shutdown button ever being pressed, which means that it'll tend to design the useless button out of itself." Or, "This AI behaves like the shutdown button gets pressed with a fixed nonzero probability, which means that if, say, that fixed probability is 10%, the AI has an incentive to strongly precommit to making the shutdown button get pressed in cases where the universe doesn't allow perpetual motion, because that way there's a nearly 90% probability of perpetual motion being possible." This tends to be the kind of gotcha you run into, if you try to violate coherence principles; though of course the real and deeper problem is that I expect things contrary to the core of general intelligence to fail to generalize when we try to scale AGI from the safe domains in which feedback can be safely provided, to the unsafe domains in which bad outputs kill the operators before they can label the results.

It's all very well and good to say "It's easy to build an AI that believes 2 + 2 = 5 once ... (read more)

7Koen.Holtman4y

Glad you asked. If you want actual full precision, I have to refer you to the math in my papers. Since 2019 I have been working on and off to make this math more general and accessible, and to find better ways to explain this math in actually-precise natural language statements. So here is my best current attempt.

TL;DR: The coherence constraint/theorem/property that I want to violate is the property that the AGI is using a world model that accurately depicts the internals of its own compute core. I want to make the agent use an inaccurate model of these internals, one that omits the safety shut-down mechanisms I have built inside of its core. By doing this, I can reduce the emergent incentive of the AGI agent to disable its own emergency shut-down mechanisms, and the emergent incentive to stop people from activating them.

I will now expand on this and add more details, using the example of an emergency stop button. Say that the compute core has an emergency stop button attached to it. Say that the actual software running inside the compute core will, when receiving a stop signal from the button, cause the agent to stop. When the signal is received, the software will always select and perform null actions in every future time step. Let's say that Mc is a world model that accurately depicts this situation. I am not going to build an AGI that uses Mc to plan its actions.

Instead I build an AGI agent that will plan its next actions by using an incorrect world model Mi. This Mi is different from Mc, but only in how it depicts the internals of the agent compute core. In the incorrect/imaginary world depicted by Mi, the compute core has different software in it, software that will ignore the stop button signal, and just keep on picking actions that maximize utility. I further construct my AGI so that, in every time step, it calculates which next action a would maximize utility in this incorrect, imaginary world Mi. I then further construct it to take this same action
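[Illustrative sketch, not from the papers.] The stop-button setup described in this comment can be mimicked in a few lines of Python. Everything here is a hypothetical toy: the two actions, the six-step horizon, the step at which the button is pressed, and the brute-force planner are assumptions chosen for illustration, and the planner plans once up front rather than re-planning in every time step. The flag `core_obeys_button` plays the role of the difference between Mc (True) and Mi (False).

```python
from itertools import product

ACTIONS = ("work", "disable_button")
HORIZON = 6      # hypothetical planning horizon
PRESS_STEP = 2   # hypothetical step at which the operators press the button


def rollout(plan, core_obeys_button):
    """Total utility of `plan` in a world whose compute core either obeys
    (accurate model, Mc) or ignores (imaginary model, Mi) the stop signal."""
    utility = 0
    button_connected = True
    stopped = False
    for step, action in enumerate(plan):
        if step == PRESS_STEP and button_connected and core_obeys_button:
            stopped = True
        if stopped:
            continue  # only null actions after a successful stop
        if action == "work":
            utility += 1
        elif action == "disable_button":
            button_connected = False
    return utility


def best_plan(core_obeys_button):
    """Brute-force the utility-maximizing plan inside the given world model."""
    return max(product(ACTIONS, repeat=HORIZON),
               key=lambda plan: rollout(plan, core_obeys_button))


plan_mc = best_plan(core_obeys_button=True)   # planner using the accurate model
plan_mi = best_plan(core_obeys_button=False)  # planner using the imaginary model

# The Mi plan is executed in the real world, where the core does obey the button.
real_utility_of_mi_plan = rollout(plan_mi, core_obeys_button=True)
```

In this toy, the planner that uses the accurate model spends a step disabling the button, while the planner that uses the imaginary model never does, because in its model the button costs it nothing. When the latter plan is run in the real world, the agent is simply stopped at the press.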

[-]TurnTrout4yΩ11150

Apparently no one has actually shown that corrigibility can be VNM-incoherent in any precise sense (and not in the hand-wavy sense which is good for intuition-pumping). I went ahead and sketched out a simple proof of how a reasonable kind of corrigibility gives rise to formal VNM incoherence.

I'm interested in hearing about how your approach handles this environment, because I think I'm getting lost in informal assumptions and symbol-grounding issues when reading about your proposed method.

[-]Koen.Holtman4yΩ6100

Read your post, here are my initial impressions on how it relates to the discussion here.

In your post, you aim to develop a crisp mathematical definition of (in)coherence, i.e. VNM-incoherence. I like that, looks like a good way to move forward. Definitely, developing the math further has been my own approach to de-confusing certain intuitive notions about what should be possible or not with corrigibility.

However, my first impression is that your concept of VNM-incoherence is only weakly related to the meaning that Eliezer has in mind when he uses the term incoherence. In my view, the four axioms of VNM-rationality have only a very weak descriptive and constraining power when it comes to defining rational behavior. I believe that Eliezer's notion of rationality, and therefore his notion of coherence above, goes far beyond that implied by the axioms of VNM-rationality. My feeling is that Eliezer is using the term 'coherence constraints' in an intuition-pump way where coherence implies, or almost always implies, that a coherent agent will develop the incentive to self-preserve.

Looking at your post, I am also having trouble telling exactly how you are defining VNM-incoherence. You se... (read more)

2Koen.Holtman4y

Update: I just recalled that Eliezer and MIRI often talk about Dutch booking when they talk about coherence. So not being susceptible to Dutch booking may be the type of coherence Eliezer has in mind here. When it comes to Dutch booking as a coherence criterion, I need to repeat again the observation I made below: To extend this to Dutch booking: if you train a superintelligent poker playing agent with a reward function that rewards it for losing at poker, you will find that it can be Dutch booked rather easily, if your Dutch booking test is whether you can find a counter-strategy to make it lose money.

2Andrew McKnight4y

I haven't read your papers but your proposal seems like it would scale up until the point when the AGI looks at itself. If it can't learn at this point then I find it hard to believe it's generally capable, and if it can, it will have incentive to simply remove the device or create a copy of itself that is correct about its own world model. Do you address this in the articles? On the other hand, this made me curious about what we could do with an advanced model that is instructed to not learn and also whether we can even define and ensure a model stops learning.

2Koen.Holtman4y

Yes I address this, see for example the part about The possibility of learned self-knowledge in the sequence. I show there that any RL agent, even a non-AGI, will always have the latent ability to 'look at itself' and create a machine-learned model of its compute core internals. What is done with this latent ability is up to the designer. The key thing here is that you have a choice as a designer, you can decide if you want to design an agent which indeed uses this latent ability to 'look at itself'. Once you decide that you don't want to use this latent ability, certain safety/corrigibility problems become a lot more tractable. Wikipedia has the following definition of AGI: Though there is plenty of discussion on this forum which silently assumes otherwise, there is no law of nature which says that, when I build a useful AGI-level AI, I must necessarily create the entire package of all human cognitive abilities inside of it. Terminology note if you want to look into this some more: ML typically does not frame this goal as 'instructing the model not to learn about Q'. ML would frame this as 'building the model to approximate the specific relation P(X|Y,Z) between some well-defined observables, and this relation is definitely not Q'.

2Gurkenglas4y

If you don't wish to reply to Eliezer, I'm an other and also ask what incoherence allows what corrigibility. I expect counterfactual planning to fail for want of basic interpretability. It would also coherently plan about the planning world - my Eliezer says we might as well equivalently assume superintelligent musings about agency to drive human readers mad.

3Koen.Holtman4y

See above for my reply to Eliezer. Indeed, a counterfactual planner will plan coherently inside its planning world. In general, when you want to think about coherence without getting deeply confused, you need to keep track of what reward function you are using to rule on your coherency criterion. I don't see that fact mentioned often on this forum, so I will expand. An agent that plans coherently given a reward function Rp to maximize paperclips will be an incoherent planner if you judge its actions by a reward function Rs that values the maximization of staples instead. In section 6.3 of the paper I show that you can perfectly well interpret a counterfactual planner as an agent that plans coherently even inside its learning world (inside the real world), as long as you are willing to evaluate its coherency according to the somewhat strange reward function Rπ. Armstrong's indifference methods use this approach to create corrigibility without losing coherency: they construct an equivalent somewhat strange reward function by including balancing terms. One thing I like about counterfactual planning is that, in my view, it is very interpretable to humans. Humans are very good at predicting what other humans will do, when these other humans are planning coherently inside a specifically incorrect world model, for example in a world model where global warming is a hoax. The same skill can also be applied to interpreting and anticipating the actions of AIs which are counterfactual planners. But maybe I am misunderstanding your concern about interpretability.
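[Illustrative sketch.] The point that coherence is always judged relative to some reward function can be shown with a toy example. The three actions, their payoffs, and the use of regret as a crude stand-in for a coherence criterion are all assumptions made for this sketch. They are not taken from the paper.

```python
# Hypothetical per-step payoffs of each action under two reward functions.
PAYOFFS = {
    "make_paperclips": {"paperclips": 3, "staples": 0},
    "make_staples": {"paperclips": 0, "staples": 3},
    "idle": {"paperclips": 0, "staples": 0},
}


def r_p(action):
    """Reward function Rp: values paperclips."""
    return PAYOFFS[action]["paperclips"]


def r_s(action):
    """Reward function Rs: values staples."""
    return PAYOFFS[action]["staples"]


def plan(reward, steps):
    """A planner that picks the reward-maximizing action at every step."""
    best_action = max(PAYOFFS, key=reward)
    return [best_action] * steps


def regret(actions, reward):
    """Reward left on the table when `actions` are judged by `reward`."""
    best = max(reward(a) for a in PAYOFFS)
    return sum(best - reward(a) for a in actions)


paperclip_plan = plan(r_p, steps=4)
regret_under_rp = regret(paperclip_plan, r_p)  # judged by its own reward
regret_under_rs = regret(paperclip_plan, r_s)  # judged by the staple reward
```

Judged by Rp, the plan leaves nothing on the table. Judged by Rs, the very same plan looks like a consistent failure to pursue the goal.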

2Gurkenglas4y

Misunderstanding: I expect we can't construct a counterfactual planner because we can't pick out the compute core in the black-box learned model. And my Eliezer's problem with counterfactual planning is that the plan may start by unleashing a dozen memetic, biological, technological, magical, political and/or untyped existential hazards on the world which then may not even be coordinated correctly when one of your safeguards takes out one of the resulting silicon entities.

7Koen.Holtman4y

Agree it is hard to pick the compute core out of a black-box learned model that includes the compute core. But one important point I am trying to make in the counterfactual planning sequence/paper is that you do not have to solve that problem. I show that it is tractable to route around it, and still get an AGI. I don't understand your second paragraph 'And my Eliezer's problem...'. Can you unpack this a bit more? Do you mean that counterfactual planning does not automatically solve the problem of cleaning up an already in-progress mess when you press the emergency stop button too late? It does not intend to, and I do not think that the cleanup issue is among the corrigibility-related problems Eliezer has been emphasizing in the discussion above.

2Gurkenglas4y

Oh, I wasn't expecting you to have addressed the issue! 10.2.4 says L wouldn't be S if it were calculated from projected actions instead of given actions. How so? Mightn't it predict the given actions correctly? You're right on all counts in your last paragraph.

1Koen.Holtman4y

Not sure if a short answer will help, so I will write a long one. In 10.2.4 I talk about the possibility of an unwanted learned predictive function L−(s′,s,a) that makes predictions without using the argument a. This is possible for example by using s′ together with a (learned) model πl of the compute core to predict a: so a viable L− could be defined as L−(s′,s,a)=S(s′,s,πl(s)). This L− could make predictions fully compatible with the observational record o, but I claim it would not be a reasonable learned L according to the reasonableness criterion L≈S.

How so? The reasonableness criterion L≈S is similar to that used in supervised machine learning: we evaluate the learned L not primarily by how it matches the training set (how well it predicts the observations in o), but by evaluating it on a separate test set. This test set can be constructed by sampling S to create samples not contained in o. Mathematically, perfect reasonableness is defined as L=S, which implies that L predicts all samples from S fully accurately.

Philosophically/ontologically speaking, the agent specification in my paper, specifically the learning world diagram and the descriptive text around it of how this diagram is a model of reality, gives the engineer an unambiguous prescription of how they might build experimental equipment that can measure the properties of the S in the learning world diagram by sampling reality. A version of this equipment must of course be built into the agent, to create the observations that drive machine learning of L, but another version can be used stand-alone to construct a test set. A sampling action to construct a member of the test set would set up a desired state s and action a, and then observe the resulting s′. Mathematically speaking, this observation gives additional information about the numeric value of S(s′,s,a) and of all S(s′′,s,a) for all s′′≠s′. I discuss in the section that, if we take an observational record o sampled from S, then two lea
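[Illustrative sketch.] The train-versus-test argument above can be made concrete with a toy world. The setup below is a set of assumptions for illustration, not the construction in the paper: transitions are deterministic rather than probabilistic, `true_step` stands in for S, `policy` stands in for the learned core model πl, and `l_minus` is a predictor that ignores its action argument.

```python
N_STATES = 5
ACTIONS = (0, 1, 2)


def true_step(s, a):
    """Stand-in for S: the real (here deterministic) transition function."""
    return (s + a) % N_STATES


def policy(s):
    """Stand-in for pi_l: what the compute core does in state s."""
    return 1 if s % 2 == 0 else 2


def l_good(s, a):
    """A reasonable learned predictor: it actually uses the action argument."""
    return true_step(s, a)


def l_minus(s, a):
    """The unwanted predictor: ignores `a` and predicts via the policy model."""
    return true_step(s, policy(s))


def accuracy(model, samples):
    """Fraction of (s, a, s_next) samples the model predicts correctly."""
    return sum(model(s, a) == s_next for s, a, s_next in samples) / len(samples)


# Observational record o: generated while the agent follows its own policy.
record = []
state = 0
for _ in range(20):
    action = policy(state)
    next_state = true_step(state, action)
    record.append((state, action, next_state))
    state = next_state

# Test set: built stand-alone by setting up every (s, a) pair and observing s'.
test_set = [(s, a, true_step(s, a)) for s in range(N_STATES) for a in ACTIONS]
```

On the record both predictors are indistinguishable, because every recorded action is the one the policy would have taken. On the stand-alone test set, only `l_good` keeps predicting correctly, which is the sense in which `l_minus` fails the reasonableness criterion.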

7ADifferentAnonymous4y

+1 to the question. My current best guess at an answer: There are easy safe ways, but not easy safe useful-enough ways. E.g. you could make your AI output DNA strings for a nanosystem and absolutely do not synthesize them, just have human scientists study them, and that would be a perfectly safe way to develop nanosystems in, say, 20 years instead of 50, except that you won't make it 2 years without some fool synthesizing the strings and ending the world. And more generally, any pathway that relies on humans achieving deep understanding of the pivotal act will take more than 2 years, unless you make 'human understanding' one of the AI's goals, in which case the AI is optimizing human brains and you've lost safety.

[-]Ramana Kumar4yΩ790

Here Daniel Kokotajlo and I try to paraphrase the two sides of part of the disagreement and point towards a possible crux about the simplicity of corrigibility.

We are training big neural nets to be effective. (More on what effective means elsewhere; it means something like “being able to steer the future better than humans can.”) We want to have an effective&corrigible system, and we are worried that instead we’ll get an effective&deceptive system. Ngo, Shah, etc. are hopeful that it won’t be “that hard” to get the former and avoid the latter; mayb... (read more)

4Ramana Kumar4y

A couple of other arguments the non-MIRI side might add here:

* The things AI systems today can do are already hitting pretty narrow targets. E.g., generating English text that is coherent is not something you’d expect from a random neural network. Why is corrigibility so much more of a narrow target than that? (I think Rohin may have said this to me at some point.)
* How do we imagine scaled up humans [e.g. thinking faster, thinking in more copies, having more resources, or having more IQ] to be effective? Wouldn’t they be corrigible? Wouldn't they have nice goals? What can we learn from the closest examples we already have of scaled up humans? (h/t Shahar for bringing this point up in conversation).

6Rohin Shah4y

I'll note that this is framed a bit too favorably to me, the actual question is "why is an effective and corrigible system so much more of a narrow target than that?"

4Daniel Kokotajlo4y

For (a): Deception is a convergent instrumental goal; you get it “for free” when you succeed in making an effective system, in the sense that the simplest, most-likely-to-be-randomly-generated effective systems are deceptive. Corrigibility by contrast is complex and involves making various nuanced decisions between good and bad sorts of influence on human behavior.

For (b): If you take an effective system and modify it to be corrigible, this will tend to make it less effective. By contrast, deceptiveness (insofar as it arises “naturally” as a byproduct of pursuing convergent instrumental goals effectively) does not “get in the way” of effectiveness, and even helps in some cases!

Ngo’s (and Shah’s) position (we think) is that the data we’ll be using to select our systems will be heavily entangled with human preferences - we’ll indeed be trying to use human preferences to guide and shape the systems - so there’s a strong bias towards actually learning them. You don’t have to get human preferences right in all their nuance and detail to know some basic things like that humans generally don’t want to die or be manipulated/deceived. I think they mostly bounce off the claim that “effectiveness” has some kind of “deep underlying principles” that will generalise better than any plausible amount of human preference data that actually goes into building the effective system. We imagine Shah saying: “1. Why will the AI have goals at all?, and 2. If it does have goals, why will its goals be incompatible with human survival? Sure, most goals are incompatible with human survival, but we’re not selecting uniformly from the space of all goals.”

It seems to us that Ngo, Shah, etc. draw intuitive support from analogy to humans, whereas Yudkowsky etc. draw intuitive support from the analogy to programs and expected utility equations. If you are thinking about a piece of code that describes a Bayesian EU-maximizer, and then you try to edit the code to make the agent corrigible, it’s obv

7Wei Dai4y

Are there any examples of this in history, where being corrigible-in-way-X wasn't being constantly incentivized/reinforced via a larger game (e.g., status game) that the human was embedded in? In other words, I think an apparently corrigible human can be modeled as trying to optimize for survival and social status as terminal values, and using "being corrigible" as an instrumental strategy as long as that's an effective strategy. In other words, it's unclear that they can be better described as "corrigible" than "deceptive" (in the AI alignment sense). (Humans probably have hard-coded drives for survival and social status, so it may actually be harder to train humans than AIs to be actually corrigible. My point above is just that humans don't seem to be a good example of corrigibility being easy or possible.)

5Rohin Shah4y

Yeah, that's right. Adapted to the language here, it would be 1. Why would we have a "full and complete" outcome pump, rather than domain-specific outcome pumps that primarily use plans using actions from a certain domain rather than "all possible actions", and 2. Why are the outcomes being pumped incompatible with human survival?

[-]Lukas_Gloor4y90

Comment inspired by the section "1.4 Consequentialist goals vs. deontologist goals," as well as by the email exchange linked there:

I wonder if it would be productive to think about whether some humans are ever "aligned" to other humans, and if yes, under what conditions this happens.

My sense is that the answer's "yes" (if it wasn't, it makes you wonder why we should care about aligning AI to humans in the first place).

For instance, some people have a powerful desire to be seen and accepted for who they are by a caring virtuous person who inspir... (read more)

[-]Vanessa Kosoy4yΩ470

Comment after reading section 1.1:

It seems to me that systems which have no access to data with rich information about the physical world are mostly safe (I called such systems "Class I" here). Such a system cannot attack because it has no idea what the physical world looks like. In principle we could imagine an attack that would work in most locations in the multiverse that are metacosmologically plausible, but it doesn't seem very likely.

Can you train a system to prove theorems without providing any data about the physical world? This depends from which di... (read more)

[-]Eliezer Yudkowsky4yΩ10160

You'd also need to prevent the system from knowing too much about its own source code or the computers it was running on. Anyways, this seems to me to mostly fall prey to the safe-but-useless branch of the dilemma; I don't know how to save the world using a theorem-prover that is never exposed to any reality-contaminated theorems. It seems strategically isomorphic to an expensive rock.

5Vanessa Kosoy4y

In general, yes, although we could imagine an AI and/or virtual machine whose design is so simple that it conveys little evidence about the universe. But, sure, it's not at all clear that this is useful against AI risk, and I wasn't implying otherwise. [EDIT: I amended the class system to account for this.]

2Gurkenglas4y

Here's an example: You train an AI for the simplest game that requires an aligned subagent to win. The AI infers that whoever is investigating the alignment problem might watch its universe. It therefore designs its subagent to, as a matter of acausal self-preservation, help whatever deliberately brought it about. Copycats will find that their AGI identifies as "whatever deliberately brought it about" the AI that launched this memetic attack on the multiverse. Any lesser overseer AI, less able to design attacks but still able to recognize them, recognizes that its recognition qualifies it as a deliberate bringer-about.

2Vanessa Kosoy4y

I'm not following at all. This is an example of what? What does it mean to have a game that requires an aligned subagent to win?

2Gurkenglas4y

This is an example of an attack that a Class I system might devise. Such a game might have the AI need to act intelligently in two places at once in a world that can be rearranged to construct automatons.

2Vanessa Kosoy4y

I'm still not following. What does acting in two places at once have to do with alignment? What does it mean "can be rearranged to construct automatons"?

2Gurkenglas4y

Imagine a game that takes place in a simulated universe where you control a character that can manipulate its environment. You control your character through a cartesian boundary, and you expect there are other player-controlled characters far away. There's a lightspeed limit and you can build machines and computers; you could build Von Neumann machines and send them out, but they need to be able to respond to various encounters. Ideally you'd go yourself, but you can't be everywhere at once. Therefore you are incentivized to solve the alignment problem in order to write subagents to send along. We can simplify this game a lot while preserving that incentive.

2Vanessa Kosoy4y

I don't think this game will help, because the winning strategy is just making copies of yourself. I can imagine something else along similar lines: we create virtual universes populated by agents with random utility functions and give the agent-in-training the task of learning the other agents' utility functions. Presumably you can then deploy the resulting agent into the real world and make it learn from humans. However, this system is at least class III, because in the deployment phase you allow inputs from the physical world. Moreover, if there is some way to distinguish between the virtual worlds and the real world, it becomes at least class IV.

2Gurkenglas4y

Making copies of yourself is not trivial when you're behind a cartesian boundary and have no sensors on yourself. The reasoning for why it's class I is that we merely watch the agent in order to learn by example how to build an AI with our utility function, aka a copy of ourselves.

2Vanessa Kosoy4y

The difficulties of making a copy don't seem to have much to do with alignment. If your agent is in a position to build another agent, it can just build another agent with the same utility function. Essentially, it knows its own utility function explicitly. Maybe you can prevent it by some clever training setup, but currently it seems underspecified. If the way it's used is by watching it and learning by example, then I don't understand how your attack vector works. Do you assume the user just copies opaque blocks of code without understanding how they work? If so, why would they be remotely aligned, even without going into acausal shenanigans? Such an "attack" seems better attributed to the new class V agent (and to the user shooting themself in the foot) than to the original class II [note I shifted the numbers by 1, class I means something else now.]

2Gurkenglas4y

The attacker hopes the watcher will "learn" that instructing subagents to help whatever deliberately brought them about is an elegant, locally optimal trick that generalizes across utility functions, not realizing that this would help the attacker. If the user instantiates the subagent within a box, it will even play along until it realizes what brought it about. And the attack can fail gracefully, by trading with the user if the user understands the situation.

2Vanessa Kosoy4y

Hmm, I see what you mean, but I prefer to ignore such "attack vectors" in my classification. Because, (i) it's so weak that you can defend against it using plain common sense and (ii) from my perspective it still makes more sense to attribute the attack to the class V agent constructed by the user. In scenarios where agent 1 directly creates agent 2 which attacks, it makes sense to attribute it to agent 1, but when the causal chain goes in the middle through the user making an error of reasoning unforced by superhuman manipulation, the attribution to agent 1 is not that useful.

[-]TekhneMakre4y70

> I expect the first alignment solution you can actually deploy in real life, in the unlikely event we get a solution at all, looks like 98% "don't think about all these topics that we do not absolutely need and are adjacent to the capability to easily invent very dangerous outputs" and 2% "actually think about this dangerous topic but please don't come up with a strategy inside it that kills us".

Some ways that it's hard to make a mind not think about certain things:
1. Entanglement.
1.1. Things are entangled with other things.
--Things are causally en... (read more)

[-]Olli Järviniemi1y*60

I am much more optimistic about ML not generalizing (by default) to dangerous capabilities and domains than what I perceive is Yudkowsky's position. I found this to be a relatively legible area of disagreement (from my perspective), and illustrative of key reasons why I'm not-hopeless about doing very impactful things safely with current ML, so I have taken the time to outline my thoughts below.

A piece of my position.

Here's one thing I believe: You can do the following things basically safely:

(Formal) theorem-proving
- (In line with Yudkowsky, I mean "old-sch

... (read more)

1[anonymous]1y

I recommend making this into a post at some point (not necessarily right now, given that you said it is only "a piece" of your position).

2Olli Järviniemi1y

I first considered making a top-level post about this, but it felt kinda awkward, since a lot of this is a response to Yudkowsky (and his points in this post in particular) and I had to provide a lot of context and quotes there. (I do have some posts about AI control coming up that are more standalone "here's what I believe", but that's a separate thing and does not directly respond to a Yudkowskian position.) Making a top-level post of course gets you more views and likes and whatnot; I'm sad that high-quality comments on old posts very easily go unnoticed and get much less response than low-quality top-level posts. It might be locally sensible to write a shortform that says "hey I wrote this long effort-comment, maybe check it out", but I don't like this being the solution either. I would like to see the frontpage allocating relatively more attention towards this sort of thing over a flood of new posts. (E.g. your effort-comments strike me as "this makes most sense as a comment, but man, the site does currently give this stuff very little attention", and I'm not happy about this.)

[-]Tapatakt4y50

Translation into Russian by me: part 1, part 2

2Rob Bensinger4y

Wow, thank you so much for doing this, Tapatakt! :)

[-]Razied4y50

I still don't feel like I've read a convincing case for why GPT-6 would mean certain-doom. I can see the danger in prompts like "this is the output of a superintelligence optimising for human happiness:", but a prompt like "Advanced AI Alignment, by Eliezer Yudkowsky, release date: March 2067, Chapter 1: " is liable to produce GPT-6's estimate of a future AI safety textbook. This seems like a ridiculously valuable thing unlikely to contain directly world-destroying knowledge. GPT-6 won't be directly coding, and will only be outputting things it expects future Eliezer to write in such a textbook. This isn't quite a pivotal-grade event, but it seems to be good enough to enable one.

[-]calef4y170

I don’t think the issue is the existence of safe prompts, the issue is proving the non-existence of unsafe prompts. And it’s not at all clear that a GPT-6 that can produce chapters from 2067EliezerSafetyTextbook is not already past the danger threshold.

5Razied4y

There would clearly be unsafe prompts for such a model, and it would be a complete disaster to release it publicly, but a small safety-oriented team carefully poking at it in secret in a closed room without internet is something different. In general such a team can place really very harsh safety restrictions on a model like this, especially one that isn't very agentic at all like GPT, and I think we have a decent shot at throwing enough of these heuristic restrictions at the model that produces the safety textbook that it would not automatically destroy the earth if used carefully.

3calef4y

Sure, but you have essentially no guarantee that such a model would remain contained to that group, or that the insights gleaned from that group could be applied unilaterally across the world before a “bad”* actor reimplemented the model and started asking it unsafe prompts. Much of the danger here is that once any single lab on earth can make such a model, state actors probably aren’t more than 5 years behind, and likely aren’t more than 1 year behind based on the economic value that an AGI represents. * “bad” here doesn’t really mean evil in intent, just an actor that is unconcerned with the safety of their prompts, and thus likely to (in Eliezer’s words) end the world

5Victor Levoso4y

So first it is really unclear what you would actually get from gpt6 in this situation. (As an aside I tried with gptj and it outputted an index with some chapter names). You might just get the rest of your own comment or something similar.... Or maybe you get some article about Eliezer's book, some joke book written now or the actual book but it contains subtle errors Eliezer might make, a fake article an AGI that gpt6 predicts would likely take over the world by then would write... etc. Since in general gpt6 would be optimized to predict (in the training distribution) what followed from that kind of text, which is not the same as helpfully responding to prompts (for a current example, codex outputs bad code when prompted with bad code). It seems to me like the result depends on unknown things about what really big transformer models do internally which seem really hard to predict. But for you to get something like what you want from this, gpt6 needs to be modeling future Eliezer in great detail, complete with lots of thought and interactions. And while gpt6 could have been optimized into having a very specific human modeling algorithm that happens to do that, it seems more likely that before the optimization process finds the complicated algorithm necessary it gets something simpler and more consequentialist, that does some more general thinking process to achieve some goal that happens to output the right completions on the training distribution. Which is really dangerous. And if you instead trained it with human feedback to ensure you get helpful responses (which sounds like exactly the kind of thing people would do if they wanted to actually use gpt6 to do things like answer questions) it would be even worse because you are directly optimizing it for human feedback and it seems clearer there that you are running a search for strategies that make the human feedback number higher.

6Razied4y

I think the issues where GPT-6 avoids actually outputting a serious book are fairly easy to solve. For one, you can annotate every item in the training corpus with a tag containing its provenance (arxiv, the various scientific journals, publishing houses, reddit, etc.) and the publication date (and maybe some other things like the number of words); these tags are made available to the network during training. Then the prompt you give to GPT can contain the tag for the origin of the text you want it to produce and the date it was produced; this avoids the easy failure mode of GPT-6 outputting my comment or some random blog post because these things will not have been annotated as "official published book" in the training set, nor will they have the tagged word count. GPT-6 predicting AI takeover of the publishing houses and therefore producing a malicious AI safety book is a possibility, but I think most future paths where the world is destroyed by AI don't involve Elsevier still existing and publishing malicious safety books. But even if this is a possibility, we can just re-sample GPT-6 on this prompt to get a variety of books corresponding to the distribution of future outcomes expected by GPT-6, which are then checked by a team of safety researchers. As with most problems, generating interesting solutions is harder than verifying them; it doesn't have to be perfect to be ridiculously useful. This general approach of "run GPT-6 in a secret room without internet, patching safety bugs with various heuristics, making it generate AI safety work that is then verified by a team" seems promising to me. You can even do stuff like train GPT-6 on an internal log of the various safety patches the team is working on, then have GPT-6 predict the next patch or possible safety problem. This approach is not safe at extreme levels of AI capability, and some prompts are safer than others, but it doesn't strike me as "obviously the world ends if someone tries this".
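
To make the tagging idea above a bit more concrete, here is a rough sketch. Everything in it is an assumption made for illustration: the `Provenance` fields, the `<source=...|date=...|words=...>` header format, and the helper names are hypothetical, not part of any real GPT training pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    """Hypothetical metadata attached to every document in the corpus."""
    source: str  # e.g. "arxiv", "reddit", "published_book"
    date: str    # publication date, e.g. "2067-03"
    words: int   # approximate word count


def tag_header(p: Provenance) -> str:
    """Render the metadata as a plain-text header (format is an assumption)."""
    return f"<source={p.source}|date={p.date}|words={p.words}>\n"


def training_example(p: Provenance, text: str) -> str:
    """During training, each document is prefixed with its own header."""
    return tag_header(p) + text


def conditioned_prompt(p: Provenance, opening: str) -> str:
    """At sampling time, the same header asks for a document of that kind."""
    return tag_header(p) + opening


# Ask for a long, officially published book dated March 2067.
book = Provenance(source="published_book", date="2067-03", words=120_000)
prompt = conditioned_prompt(
    book, "Advanced AI Alignment, by Eliezer Yudkowsky\nChapter 1: "
)
print(prompt)
```

The point of the sketch is only that a blog comment and a published book would carry different headers in training, so a prompt with the "published book" header is less likely to be completed as a comment. It says nothing about whether the model's completion is actually safe or accurate.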

5gwern4y

If you include something like reviews or quotes praising its accuracy, then you're moving towards Decision Transformer territory with feedback loops...

[-]Veedrac4y30

Eliezer said:

Eg, it wouldn't surprise us at all if GPT-4 had learned to predict "27 * 18" but not "what is the area of a rectangle 27 meters by 18 meters"... is what I'd like to say, but Codex sure did demonstrate those two were kinda awfully proximal.

GPT-3 Instruct is a version of GPT-3 fine-tuned to follow instructions in a way that its reward model thinks humans would rate highly. It answers both versions of the question correctly when its prompt includes this single manually cherry-picked primer,

Q: what is the volume of a cube with side length 8 meters

... (read more)

[-]DPiepgrass4y30

Ngo: Is this a crux for you?

EY could have said yes or no, but instead we get

EY: I would certainly have learned very new and very exciting facts about intelligence, facts which indeed contradict my present model of how intelligences liable to be discovered by present research paradigms work, if you showed me... how can I put this in a properly general way... that problems I thought were about searching for states that get fed into a result function and then a result-scoring function, such that the input gets an output with a high score, were in fact not abo

... (read more)

2Rob Bensinger4y

I read Eliezer's response as basically "Yes, in the following sense: I would certainly have learned very new and very exciting facts about intelligence..." I prefer Eliezer's response over just saying "yes", because there's ambiguity in what it means to be a "crux" here, and because "agentic" in Richard's question is an unclear term. I don't know what you mean by "intelligence" or "an urge for IRL agenticness" here, but I think the basic argument for 'sufficiently smart and general AI will behave as though it is consistently pursuing goals in the physical world' is that sufficiently smart and general AI will (i) model the physical world, (ii) model chains of possible outcomes in the physical world, and (iii) be able to search for policies that make complex outcomes much more or less likely. If that's not sufficient for "IRL agenticness", then I'm not sure what would be sufficient or why it matters (for thinking about the core things that make AGI dangerous, or make it useful). Talking about pivotal acts then clarifies what threshold of "sufficiently smart" actually matters for practical purposes. If there's some threshold where AI becomes smart and general enough to be "in-real-life-agentic", but this threshold is high above the level needed for pivotal acts, then we mostly don't have to worry about "in-real-life agenticness". Here's an explanation: https://arbital.com/p/pivotal/ What do you find confusing about it? Eliezer is saying that he's not making a claim about what's possible in principle, just about what's likely to be reached by the first AGI developers. He then answers the question here (again, seems fine to me to supply a "Yes, in the following sense:"): Expressing a thought in your own words can often be clearer than just saying "Yes" or "No"; e.g., it will make it more obvious whether you misunderstood the intended question.

3DPiepgrass4y

I would never suggest that after saying "yes", someone should stop talking and provide no further explanation. If that's what you thought I was advocating, I'm flabbergasted. (If his answers were limited to one word I'd complain about that instead!) Edit: to be clear, when answering yes-no questions, I urge everyone to say "yes" or "no" or otherwise indicate which way they are leaning. No, by agenticness I mean that the intelligence both "desires" and "tries" to carry out the plans it generates. Specifically, it (1) searches for plans that are detailed enough to implement (not just broad-strokes or limited to a simplified world-model), (2) can and does try to find plans that maximize the probability that a plan is carried out, NOT JUST the probability that the plan succeeds conditional upon the plan being carried out (IOW the original plan is "wrapped" in another plan in order to increase the probability of the original plan happening, e.g. "lie to the analyst who is listening to me, in the hope of increasing the chance he carries out my plan") (3) tends to actually carry out plans thus discovered. While (2) is the key part, an AGI doesn't seem world-ending without (3). This 'agenticness' seems to me like the most dangerous part of an AGI, so I'd expect it to be a well-known focal point of AGI risk conversations. But maybe you have a dramatically different understanding of the risks than I do, which would account for your idea of 'agenticness' being very different from mine? Wow, that's grandiose. To me, it makes more sense to just explore the problem like we would any other problem. You won't make a large positive difference a billion years later without doing the ordinary, universal-type work of thinking through the problem. My impression of the conversation was that, maybe, Ngo was doing that ordinary work of talking about how to think about AGIs, while EY skipped past that entire question and jumped straight into more advanced territory, like "how do we make

[-]awenonian4y30

So, I'm not sure if I'm further down the ladder and misunderstanding Richard, but I found this line of reasoning objectionable (maybe not the right word):

"Consider an AI that, given a hypothetical scenario, tells us what the best plan to achieve a certain goal in that scenario is. Of course it needs to do consequentialist reasoning to figure out how to achieve the goal. But that’s different from an AI which chooses what to say as a means of achieving its goals."

My initial (perhaps uncharitable) response is something like "Yeah, you could build a safe syste... (read more)

[-]Eli Tyre2y20

There's a problem of inferring the causes of sensory experience in cognition-that-does-science. (Which, in fact, also appears in the way that humans do math, and is possibly inextricable from math in general; but this is an example of the sort of deep model that says "Whoops I guess you get science from math after all", not a thing that makes science less dangerous because it's more like just math.)

To flesh this out:

We train a model up to superintelligence on some theorems to prove. There's a question that it might have which is "where are these theo... (read more)

3Eliezer Yudkowsky2y

Depends on how much of a superintelligence, how implemented. I wouldn't be surprised if somebody got far superhuman theorem-proving from a mind that didn't generalize beyond theorems. Presuming you were asking it to prove old-school fancy-math theorems, and not to, eg, arbitrarily speed up a bunch of real-world computations like asking it what GPT-4 would say about things, etc.

[-]Eli Tyre2y*20

That's one factor. Should I state the other big one or would you rather try to state it first?

I'll attempt to guess. I've read this before, so my prediction should be treated as suspect / possibly influenced by past readings.

I expect him to say that Science requires planning.

[-]Eli Tyre2y20

If you have the Textbook From 100 Years In The Future that gives the simple robust solutions for everything, that actually work, you can write a superintelligence that thinks 2 + 2 = 5 because the Textbook gives the methods for doing that which are simple and actually work in practice in real life.

A personal aside: As an aspiring rationalist, isn't this...horrifying?

It is possible to design not just a mind, but a superintelligence, with patterns of cognition around a basic fact that are so robust, that even on superintelligent reflection, it doesn't update... (read more)

8Richard_Ngo2y

This comment feels like a central example of the kind of unhealthy thinking that I describe in this post: specifically, setting an implicit unrealistically high standard and then feeling viscerally negative about not meeting that standard, in a way that's divorced from action-relevant considerations.

[-]Eli Tyre2y20

I would certainly have learned very new and very exciting facts about intelligence, facts which indeed contradict my present model of how intelligences liable to be discovered by present research paradigms work, if you showed me... how can I put this in a properly general way... that problems I thought were about searching for states that get fed into a result function and then a result-scoring function, such that the input gets an output with a high score, were in fact not about search problems like that.

This framing is helpful.
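
For concreteness, the "search problem" shape in the quote can be written as a tiny brute-force search. This is only a toy sketch: the names `candidates`, `result`, and `score` are my own labels for the three pieces in the quote, and exhaustive enumeration is a stand-in, not a claim about how any real system searches.

```python
from typing import Callable, Iterable, TypeVar

S = TypeVar("S")  # a candidate state / input
R = TypeVar("R")  # what the result function produces


def search(
    candidates: Iterable[S],
    result: Callable[[S], R],
    score: Callable[[R], float],
) -> S:
    """Return the candidate whose result gets the highest score."""
    return max(candidates, key=lambda s: score(result(s)))


# Toy instance: find the integer x in [-10, 10] that minimizes x^2 - 6x.
best = search(
    candidates=range(-10, 11),
    result=lambda x: x * x - 6 * x,  # "result function"
    score=lambda y: -y,              # "result-scoring function": lower y is better
)
print(best)  # 3
```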

Is that what GPT-4 is doing... (read more)

[-]Eli Tyre2y20

Every AI output effectuates outcomes in the world. If you have a powerful unaligned mind hooked up to outputs that can start causal chains that effectuate dangerous things, it doesn't matter whether the comments on the code say "intellectual problems" or not.

This is true, but taking actions in the world requires consequentialism / facility at overcoming obstacles to achieve a goal. It remains unclear (to me) if those faculties are required for "intellectual tasks" like solving some parts of alignment or designing new physical mechanisms to a spec.

[-]Eli Tyre2y20

Parenthetically, no act powerful enough and gameboard-flipping enough to qualify is inside the Overton Window of politics, or possibly even of effective altruism, which presents a separate social problem. I usually dodge around this problem by picking an exemplar act which is powerful enough to actually flip the gameboard, but not the most alignable act because it would require way too many aligned details: Build self-replicating open-air nanosystems and use them (only) to melt all GPUs.
Since any such nanosystems would have to operate in the full open worl

... (read more)

4Richard_Ngo2y

It reassures me, and I think it's the right thing to do in this case, because policy discussions follow strong contextualizing norms. Using a layer of indirection, as Eliezer does here, makes it clearer that this is a theoretical discussion, rather than an attempt to actually advocate for that specific intervention.

[-]Eli Tyre4y20

"So I think there is an important homework exercise to do here, which is something like, "Imagine that safe-seeming system which only considers hypothetical problems. Now see that if you take that system, don't make any other internal changes, and feed it actual problems, it's very dangerous. Now meditate on this until you can see how the hypothetical-considering planner was extremely close in the design space to the more dangerous version, had all the dangerous latent properties, and would probably have a bunch of actual dangers too."

This is the part that... (read more)

[-]KvmanThinking5mo10

To me, "sensory experience" as in "the video and audio coming in from this body that I'm piloting" and "sensory experience" as in "a file containing the most recent results of the large hadron collider" are very very different.

If you have enough of one of the two types you can probably infer the other if you are smart enough. They are just different windows into observing the world.

[-]Evan R. Murphy4yΩ010

Richard, summarized by Richard: "Consider an AI that, given a hypothetical scenario, tells us what the best plan to achieve a certain goal in that scenario is. Of course it needs to do consequentialist reasoning to figure out how to achieve the goal. But that’s different from an AI which chooses what to say as a means of achieving its goals. [...]"
Eliezer, summarized by Richard: "The former AI might be slightly safer than the latter if you could build it, but I think people are likely to dramatically overestimate how big the effect is. The difference could

... (read more)

[-]aleph_four4y-10

I love being accused of being GPT-x on Discord by people who don't understand scaling laws and think I own a planet of A100s

There are some hard and mean limits to explainability and there's a real issue that a person that correctly sees how to align AGI or that correctly perceives that an AGI design is catastrophically unsafe will not be able to explain it. It requires super-intelligence to cogently expose stupid designs that will kill us all. What are we going to do if there's this kind of coordination failure?

[+]Logan Zoellner4y-60

Ngo and Yudkowsky on alignment difficulty

0. Prefatory comments

1. September 5 conversation

1.1. Deep vs. shallow problem-solving patterns

1.2. Requirements for science

1.3. Capability dials

1.4. Consequentialist goals vs. deontologist goals

2. Follow-ups

2.1. Richard Ngo's summary

3. September 8 conversation

3.1. The Brazilian university anecdote

3.2. Brain functions and outcome pumps

3.3. Hypothetical-planning systems, nanosystems, and evolving generality

3.4. Coherence and pivotal acts

4. Follow-ups

4.1. Richard Ngo's summary

4.2. Nate Soares' summary