Fictionalized/Paraphrased version of a real dialog between me and John Wentworth.
Fictionalized Me: So, in the Eliezer/Richard dialogs, Eliezer is trying to get across this idea that consequentialism deeply permeates optimization, and this is important, and that's one reason why Alignment is Hard. But something about it is confusing and slippery, and he keeps trying to explain it and it keeps not-quite-landing.
I think I get it, but I'm not sure I could explain it. Or, I'm not sure who to explain it to. I don't think I could tell who was making a mistake, where "consequentialism is secretly everywhere" is a useful concept for realizing-the-mistake.
Fictionalized John: [stares at me]
Me: Okay, I guess I'm probably supposed to try and explain this and see what happens.
Me: Okay, so the part that's confusing here is that this is supposed to be something that Eliezer thinks thoughtful, attentive people like Richard (and Paul?) aren't getting, despite them having read lots of relevant material and paying attention and being generally on board with "alignment is hard."
...so, what is a sort of mistake I could imagine a smart, thoughtful person who read the sequences making here?
My Eliezer-model imagines someone building what they think is an aligned ML system. They've trained it carefully to do things they reflectively approve of, they've put a lot of work into making it interpretable and honest. This Smart Thoughtful Researcher has read the sequences and believes that alignment is hard and whatnot. Nonetheless, they'll have failed to really grok this "consequentialism-is-more-pervasive-and-important-than-you-think" concept. And this will cause doom when they try to scale up their project to accomplish something actually hard.
I... guess what I think Eliezer thinks is that Thoughtful Researcher isn't respecting inner optimizers enough. They'll have built their system to be carefully aligned, but to do anything hard, it'll end up generating inner-optimizers that aren't aligned, and the inner-optimizers will kill everyone.
John: Nod. But not quite. I think you're still missing something.
You're familiar with the arguments of convergent instrumental goals?
Me: i.e. most agents will end up wanting power/resources/self-preservation/etc?
John: But not only is "wanting power and self preservation" convergently instrumental. Consequentialism is convergently instrumental. Consequentialism is a (relatively) simple, effective process for accomplishing goals, so things that efficiently optimize for goals tend to approximate it.
Now, say there's something hard you want to do, like build a moon base, or cure cancer or whatever. If there were a list of all the possible plans that cure cancer, ranked by "likely to work", most of the plans that might work route through "consequentialism", and "acquire resources."
Not only that, most of the plans route through "acquire resources in a way that is unfriendly to human values." Because in the space of all possible plans, while consequentialism doesn't take that many bits to specify, human values are highly complex and take a lot of bits to specify.
Notice that I just said "in the space of all possible plans, here are the most common plans." I didn't say anything about agents choosing plans or acting in the world. Just listing the plans. And this is important because the hard part lives in the choosing of the plans.
Now, say you build an oracle AI. You've done all the things to try and make it interpretable and honest and such. If you ask it for a plan to cure cancer, what happens?
Me: I guess it gives you a plan, and... the plan probably routes through consequentialist agents acquiring power in an unfriendly way.
Okay, but if I imagine a researcher who is thoughtful but a bit too optimistic, what they might counterargue with is: "Sure, but I'll just inspect the plans for whether they're unfriendly, and not do those plans."
And what I might then counterargue their counterargument with is:
1) Are you sure you can actually tell which plans are unfriendly and which are not?
2) If you're reading very carefully, and paying lots of attention to each plan... you'll still have to read through a lot of plans before you get to one that's actually good.
John: Bingo. I think a lot of people imagine asking an oracle to generate 100 plans, and they think that maybe half the plans will be pretty reasonable. But the space of plans is huge. Exponentially huge. Most plans just don't work. Most plans that work route through consequentialist optimizers who convergently seek power because you need power to do stuff. But then the space of consequentialist power-seeking plans is still exponentially huge, and most ways of seeking power are unfriendly to human values. The hard part is locating a good plan that cures cancer that isn't hostile to human values in the first place.
Me: And it's not obvious to me whether this problem gets better or worse if you've tried to train the oracle to only output "reasonable seeming plans", since that might output plans that are deceptively unaligned.
John: Do you understand why I brought up this plan/oracle example, when you originally were talking about inner optimizers?
Me: Hmm. Um, kinda. I guess it's important that there was a second example.
Me: Okay, so partly you're pointing out that the hardness of the problem isn't just about getting the AI to do what I want, it's that doing what I want is actually just really hard. Or rather, the part where alignment is hard is precisely when the thing I'm trying to accomplish is hard. Because then I need a powerful plan, and it's hard to specify a search for powerful plans that don't kill everyone.
John: Yeah. One mistake I think people end up making here is that they think the problem lives in the AI-who's-deciding/doing things, as opposed to in the actual raw difficulty of the search.
Me: Gotcha. And it's important that this comes up in at least two places – inner optimizers with an agenty AI, and an oracle that just outputs plans that would work. And the fact that it shows up in two fairly different places, one of which I hadn't thought of just now, is suggestive that it could show up in even more places I haven't thought of at all.
And this is confusing enough that it wasn't initially obvious to Richard Ngo, who's thought a ton about alignment. Which bodes ill for the majority of alignment researchers who probably are less on-the-ball.
I'm tempted to say "the main reason" why Alignment Is Hard, but then I remembered that Eliezer specifically reminded everyone not to summarize him as saying things like "the key reason for X" when he didn't actually say that, and that he is often tailoring his arguments to a particular confusion of his interlocutor.
Suppose we took this whole post and substituted every instance of "cure cancer" with the following:
Version A: "win a chess game against a grandmaster"
Version B: "write a Shakespeare-level poem"
Version C: "solve the Riemann hypothesis"
Version D: "found a billion-dollar company"
Version E: "cure cancer"
Version F: "found a ten-trillion-dollar company"
Version G: "take over the USA"
Version H: "solve the alignment problem"
Version I: "take over the galaxy"
And so on. Now, the argument made in version A of the post clearly doesn't work, the argument in version B very likely doesn't work, and I'd guess that the argument in version C doesn't work either. Suppose I concede, though, that the argument in version I works: that searching for an oracle smart enough to give us a successful plan for taking over the galaxy will very likely lead us to develop an agentic, misaligned AGI. Then that still leaves us with the question: what about versions D, E, F, G and H? The argument is structurally identical in each case - so what is it about "curing cancer" that is so hard that, unlike winning chess or (possibly) solving the Riemann hypothesis, when we train for that we'll get misaligned agents instead?
For all tasks A-I, most programs that we can imagine writing to do that task will need to search through various actions and evaluate the consequences. The only one of those tasks we currently know how to solve with a program is A, and chess programs do indeed use search and evaluation.
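To make "search through actions and evaluate the consequences" concrete, here is a minimal sketch on a toy game rather than chess. The game (take 1 to 3 stones from a pile, whoever takes the last stone wins) and the names `MOVES`, `value`, and `best_move` are assumptions chosen for illustration; this is not how any particular chess engine is implemented, only the same search-plus-evaluation shape at a tiny scale.

```python
from functools import lru_cache

# Hypothetical toy game: players alternate removing 1-3 stones;
# whoever removes the last stone wins.
MOVES = (1, 2, 3)

@lru_cache(maxsize=None)
def value(stones: int) -> int:
    """Return +1 if the player to move can force a win, -1 otherwise."""
    if stones == 0:
        # The previous player took the last stone, so the player to move has lost.
        return -1
    # Search: try every legal action. Evaluate: score the resulting position
    # from the opponent's point of view and negate it.
    return max(-value(stones - m) for m in MOVES if m <= stones)

def best_move(stones: int) -> int:
    """Pick the action whose consequence is best for the player to move."""
    legal = [m for m in MOVES if m <= stones]
    return max(legal, key=lambda m: -value(stones - m))

if __name__ == "__main__":
    for pile in (4, 5, 7):
        print(pile, value(pile), best_move(pile))
```

The search here is safe for the same reason the thread gives for chess: the action space is fully specified in advance, so the program has nothing outside the pile of stones to plan over.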
I'd guess that whether something can be done safely is mostly a function of how easy it is, and how isolated it is from the real world. The Riemann hypothesis seems pretty tricky, but it's isolated from the real world, so it can probably be solved by a safe system. Chess is isolated and easy. Starting a billion dollar company is very entangled with the real world, and very tricky. So we probably couldn't do it without a dangerous system.
This all makes sense, except for the bit where you draw the line at a certain level of "tricky and entangled with world". Why isn't it the case that danger only arises for the first AIs that can do tasks half as tricky? Twice as tricky? Ten times as tricky?
Consider what happens if you had to solve your list of problems and didn't inherently care about human values? To what extent would you do 'unfriendly' things via consequentialism? How hard would you need to be constrained to stop doing that? Would it matter if you could also do far trickier things by using consequentialism and general power-seeking actions?
The reason, as I understand it, that a chess-playing AI does things the way we want it to is that we constrain the search space it can use because we can fully describe that space, rather than having to give it any means of using any other approaches, and for now that box is robust.
But if someone gave you or me the same task, we wouldn't learn chess, we would buy a copy of Stockfish, or if it was a harder task (e.g. be better than AlphaZero) we'd go acquire resources using consequentialism. And it's reasonable to think that if we gave a fully generic but powerful future AI the task of being the best at chess, at some point it's going to figure out that the way to do that is acquire resources via consequentialism, and potentially to kill or destroy all its potential opponents. Winner.
Same with the poem or the hypothesi...
I agree with basically your whole comment. But it doesn't seem like you're engaging with the frame I'm using. I'm trying to figure out how agentic the first AI that can do task X is, for a range of X (with the hope that the first AI that can do X is not very agentic, for some X that is a pivotal task). The claim that a highly agentic highly intelligent AI will likely do undesirable things when presented with task X is very little evidence about this, because a highly agentic highly intelligent AI will likely do undesirable things when presented with almost any task.
Instead of "dumb" or "narrow" I'd say "having a strong comparative advantage in X (versus humans)". E.g. imagine watching evolution and asking "will the first animals that take over the world have already solved the Riemann hypothesis", and the answer is no because human intelligence, while general, is still pointed more at civilisation-building-style tasks than mathematics.
Similarly, I don't expect any AI which can do a bunch of groundbreaking science to be "narrow" by our current standards, but I do hope that such AIs have a strong comparative disadvantage at taking-over-world-style tasks, compared with doing-science-style tasks.
And that's related to agency, because what we mean by agency is not far off "having a comparative advantage in taking-over-world style tasks".
Now, I expect that at some point, this line of reasoning stops being useful, because your systems are general enough and agentic enough that, even if their comparative advantage isn't taking over the world, they can pretty easily do that anyway. But the question is whether this line of reasoning is still useful for the first systems which can do pivotal task X. Eliezer thinks no, because he consid...
For the purposes of this argument, I'm interested in what can be done safely by some AI we can build. If you can solve alignment safely with some AI, then you're in a good situation. What an arbitrarily powerful optimiser will do isn't the crux, we all agree that's dangerous.
But I think Richard’s point is ‘but we totally built AIs that defeated chess grandmasters without destroying the world. So, clearly it’s possible to use tool AI to do this sort of thing. So… why do you think various domains will reliably output horrible outcomes? If you need to cure cancer, maybe there is an analogous way to cure cancer that just… isn’t trying that hard?’
Richard, is that what you were aiming at?
The reason why we can easily make AIs which solve chess without destroying the world is because we can make specialized AIs such that they can only operate in the theoretical environment of states of chess boards, and in that environment we can tell it exactly what its goal should be.
If we tell an AGI to generate plans for winning at chess, and it knows about the outside world, then because the state space is astronomically larger, it is astronomically more difficult to tell it what its goal should be, and so any goal we do give it either satisfies corrigibility, and we can tell it "do what I want", or incompletely captures what we mean by 'win this chess game'.
For cancer, there may well be a way to solve the problem using a specialized AI, which works in an environment space simple enough that we can completely specify our goal. I assume though that we are using a general AI in all the hypothetical versions of the problem, which has the property 'it's working in an environment space large enough that we can't specify what we want it to do'. Or, if it doesn't know a priori that its plans can affect the state of a far larger environment space which can in turn affect the environment space it cares about, it may deduce this and figure out a way to exploit this feature.
The problem is not with whether we call the AI AGI or not, it's whether we can either 1) fully specify our goals in the environment space it's able to model (or otherwise not care too deeply about the environment space it's able to model), or 2) verify the effects of the actions it says to do have no disastrous consequences.
To determine whether a tool AI can be used to solve problems Paul wants to solve, or execute pivotal acts, we need to both 1) determine that the environment is small enough for us to accurately express our goal, and 2) ensure the AI is unable to infer the existence of a broader environment.
(meta note: I'm making a lot of very confident statements, and very few are of the form "<statement>, unless <other statement>, then <statement> may not be true". This means I am almost certainly overconfident, and my model is incomplete, but I'm making the claims anyway so that they can be developed)
This dialog was much less painful for me to read than I expected, and I think it manages to capture at least a little of the version-of-this-concept that I possess and struggle to articulate!
(...that sentence is shorter, and more obviously praise, in my native tongue.)
A few things I'd add (epistemic status: some simplification in attempt to get a gist across):
Part of what's going on here is that reality is large and chaotic. When you're dealing with a large and chaotic reality, you don't get to generate a full plan in advance, because the full plan is too big. Like, imagine a reasoner doing biological experimentation. If you try to "unroll" that reasoner into an advance plan that does not itself contain the reasoner, then you find yourself building this enormous decision-tree, like "if the experiments come up this way, then I'll follow it up with this experiment, and if instead it comes up that way, then I'll follow it up with that experiment", and so on. This decision tree quickly explodes in size....
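A rough sense of how fast the unrolled decision tree grows can be had from a toy count. The sketch below assumes, purely for illustration, that every experiment has the same fixed number of distinguishable outcomes and that the plan must name a follow-up for every outcome history; the function name `unrolled_plan_size` and the specific numbers are hypothetical.

```python
def unrolled_plan_size(outcomes_per_experiment: int, depth: int) -> int:
    """Count the decision nodes in a fully unrolled plan.

    The plan needs one "which experiment next?" entry for every possible
    history of outcomes of length 0, 1, ..., depth - 1.
    """
    return sum(outcomes_per_experiment ** level for level in range(depth))

if __name__ == "__main__":
    # Assumed: 3 distinguishable outcomes per experiment.
    for depth in (5, 10, 20, 40):
        print(f"{depth} experiments deep: {unrolled_plan_size(3, depth)} plan entries")
```

A plan that instead contains the reasoner ("make an observation, then decide") only has to make one decision per experiment actually run, which is why it can be written down at all.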
As a casual chess player it seems unlikely to me that there are any such instructions that would lead a beginner to beat even a halfway decent player. Chess is very dependent on calculation (mentally stepping through the game tree) and evaluation (recognising if a position is good or bad). Given the slow clock speed of the human brain (compared to computers), our calculations are slow and so we must lean heavily on a good learned evaluation function, which probably can't be explicitly represented in a way that would be fast enough to execute manually. In other words you'd end up taking hours to make a move or something.
There's no shortcut like "just move these pawns 3 times in a mysterious pattern, they'll never expect it" - "computer lines" that bamboozle humans require deep search that you won't be able to do in realtime.
Edit: the Oracle's best chance against an ok player would probably be to give you a list of trick openings that lead to "surprise" checkmate and hope that the opponent falls into one, but it's a low percentage.
I agree that it's plausible chess-plans can be compressed without invoking full reasoners (and with a more general point that there are degrees of compression you can do short of full-on 'reasoner', and with the more specific point that I was oversimplifying in my comment). My intent with my comment was to highlight how "but my AI only generates plans" is sorta orthogonal to the alignment question, which is pushed, in the oracle framework, over to "how did that plan get compressed, and what sort of cognition is involved in the plan, and why does running that cognition yield good outcomes".
I have not yet found a pivotal act that seems to me to require only shallow realtime/reactive cognition, but I endorse the exercise of searching for highly specific and implausibly concrete pivotal acts with that property.
While writing this, I was reminded of an older (2017) conversation between Eliezer and Paul on FB. I reread it to see whether Paul seemed like he'd be making the set of mistakes this post is outlining.
It seems like Paul acknowledges the issues here, but his argument is that you can amplify humans without routing through "the hard parts" that are articulated in this post. i.e. it seems like you can use current ML to build something that helps a human effectively "think longer", and he thinks one can do this without routing through the dangerous-plan-searchspace. I don't know if there's much counterargument beyond "no, if you're building an ML system that helps you think longer about anything important, you already need to have solved the hard problem of searching through plan-space for actually helpful plans."
But, here were some comments from that thread I found helpful to reread.
Eliezer:...
I'm having some trouble phrasing this comment clearly, and I'm also not sure how relevant it is to the post except that the post inspired the thoughts, so bear with me...
It seems important to distinguish between several things that could vary with time, over the course of a plan or policy:
- What information is known.
- This is related to Nate's comment here: it is much more computationally feasible to specify a plan/policy if it's allowed to contain terms that say "make an observation, then run this function on it to decide the next step," rather than writing out a lookup table pairing every sequence of observations to the next action.
- What objective function is being maximized.
- This is usually assumed (?) to be static in this kind of discussion, but in principle the objective could vary in response to future observations.
In principle, this is equivalent to a static objective function with terms for "how it would respond" to each possible sequence of observations (ignoring subtleties about orders over world-states vs. world-histories). But this has exactly the same structure as the previous point: it's more feasible to say "make an observation, then run this function to update the o
Curated. This is a great instance of someone increasing clarity on a load-bearing concept (at least in some models) through an earnest attempt to improve their own understanding.
I'm not sure I understand why it's important that the fraction of good plans is 1% vs .00000001%. If you have any method for distinguishing good from bad plans, you can chain it with an optimizer to find good plans even if they're rare. The main difficulty is generating enough bits--but in that light, the numbers I gave above are 7 vs 33 bits--not a clear qualitative difference. And in general I'd be kind of surprised if you could get up to say 50 bits but then ran into a fundamental obstacle in scaling up further.
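The bit counts quoted above can be checked directly: singling out plans that make up a fraction p of the space takes roughly log2(1/p) bits of selection. A minimal sketch (the function name `bits_to_select` is an illustrative choice, not something from the comment):

```python
import math

def bits_to_select(fraction: float) -> float:
    """Bits of selection needed to single out outcomes making up `fraction` of a space."""
    return math.log2(1.0 / fraction)

one_percent = bits_to_select(0.01)           # good plans are 1% of all plans
one_in_ten_billion = bits_to_select(1e-10)   # good plans are .00000001% of all plans

if __name__ == "__main__":
    print(f"1%:         {one_percent:.1f} bits")
    print(f".00000001%: {one_in_ten_billion:.1f} bits")
```

Rounding up the first figure and rounding the second to the nearest integer gives the "7 vs 33 bits" in the comment.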
Can you be more concrete about how you would do this? If my method for evaluation is "sit down and think about the consequences of doing this for 10 hours", I have no idea how I would chain it with an optimizer to find good plans even if they are rare.
Thanks for the push-back and the clear explanation. I still think my points hold and I'll try to explain why below.
This is true if all the other datapoints are entirely indistinguishable, and the only signal is "good" vs. "bad". But in practice you would compare / rank the datapoints, and move towards the ones that are better.
Take the backflip example from the human preferences paper: if your only signal was "is this a successful backflip?", then your argument would apply and it would be pretty hard to learn. But the signal is "is this more like a successful backflip than this other thing?" and this makes learning feasible.
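The difference between a pass/fail signal and a comparison signal can be sketched with a toy search problem. Everything here is an assumption made for illustration: the hidden 16-bit `TARGET`, the Hamming-distance notion of "closer", and the helper names. It is not the method used in the human preferences work, and real tasks need not have a landscape this friendly to hill-climbing.

```python
import itertools

def search_with_binary_signal(is_success, n_bits):
    """Enumerate candidates until a pass/fail oracle says yes. Returns (candidate, queries)."""
    for queries, candidate in enumerate(itertools.product((0, 1), repeat=n_bits), start=1):
        if is_success(candidate):
            return candidate, queries
    return None, 2 ** n_bits

def search_with_comparisons(is_better, n_bits):
    """Hill-climb using only 'is a better than b?' answers. Returns (candidate, queries)."""
    current = [0] * n_bits
    queries = 0
    for i in range(n_bits):
        flipped = current.copy()
        flipped[i] = 1 - flipped[i]
        queries += 1
        if is_better(flipped, current):
            current = flipped
    return tuple(current), queries

# Hypothetical hidden target; all ones is the last candidate enumerated,
# so this is the worst case for the pass/fail search.
TARGET = (1,) * 16

def hamming_distance(candidate):
    return sum(c != t for c, t in zip(candidate, TARGET))

binary_result, binary_queries = search_with_binary_signal(
    lambda c: c == TARGET, len(TARGET))
comparison_result, comparison_queries = search_with_comparisons(
    lambda a, b: hamming_distance(a) < hamming_distance(b), len(TARGET))

if __name__ == "__main__":
    print("pass/fail queries: ", binary_queries)
    print("comparison queries:", comparison_queries)
```

With only "success or not", the search has to stumble on the target; with "which of these two is closer", each query carries usable gradient information, which is the point being made about ranking datapoints.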
More generally, I feel that the thing I'm arguing against would imply that ML in general is impossible (and esp. the human preferences work), so I think it would help to say explicitly where the disanalogy occurs.
I should note that comparisons are only one reason why the situation i...
That part does seem wrong to me. It seems wrong because 10^50 is possibly too small. See my post Seeking Power is Convergently Instrumental in a Broad Class of Environments:...
This is a helpful sentence.
This post seems to be using a different meaning of "consequentialism" to what I am familiar with (that of moral philosophy). Subsequently, I'm struggling to follow the narrative from "consequentialism is convergently instrumental" onwards.
Can someone give me some pointers of how I should be interpreting the definition of consequentialism here? If it is just the moral philosophy definition, then I'm getting very confused as to why "judge morality of actions by their consequences" is a useful subgoal for agents to optimize against...
Thanks, I found this helpful!
I think this is the most important premise. I don't have a solid justification for it yet, but I'm groping towards a non-solid justification at least in my agency sequence. I think John Wentworth's stuff on the good regulator theorem is another line of attack that could turn into a solid justification. TurnTrout also has releva...
Is there a book out there on instrumental convergence? I have not come across this idea before and would like to learn more about it.
Here are a couple of "hard" things you can easily do with hypercompute, without causing dangerous consequentialism.
Given a list of atom coordinates, run a quantum accuracy simulation of those atoms. (Where the atoms don't happen to make a computer running a bad program).
Find the smallest arrangement of atoms that forms a valid Or gate by brute forcing over the above simulator.
Brute forcing over large arrangements of atoms could find a design containing a computer containing an AI. But brute forcing over arrangements of 100 atoms should be fine, and ca...
I think there's some confusion going on with "consequentialism" here, and that's at least a part of what's at play with "why isn't everyone seeing the consequentialism all the time".
One question I asked myself reading this is "does the author distinguish 'consequentialism' from 'thinking and predicting' in this piece?" and I think it's uncertain and leaning towards 'no'.
So, how do other people use 'consequentialism'?
It's sometimes put forward as a moral tradition/ethical theory, as an alternative to both deontology and virtue ethics. I forget which p...
I'm pretty sure "consequentialism" here wasn't meant to have anything to do with ethics (which I acknowledge is confusing).
I think consequentialism-as-ethics means "the right/moral thing to do is to choose actions that have good consequences."
I think consequentialism as Eliezer/John meant here is more like "the thing to do is choose actions that have the consequences you want."
A consequentialist is something that thinks, predicts, and plans (and, if possible, acts) in such a way as to bring about particular consequences.
(I think it's plausible that we want different words for these things, but I think this use of the word consequentialism is fairly natural, and it makes sense to see "moral consequentialism" as a subset of consequentialism.)
I agree that non-agentic AI is a fool's errand when it comes to alignment, but there's one point where I sort of want to defend it as not being quite as bad as this post suggests:...
I wonder if the confusion isn't about implications of consequentialism, but about the implications of independent agents. Related to the (often mentioned, but never really addressed) problem that humans don't have a CEV, and we have competition built-in to our (inconsistent) utility functions.
I have yet to see a model of multiple agents WRT "alignment". The ONLY reason that power/resources/self-preservation is instrumental is if there are unaligned agents in competition. If multiple agents agree on the best outcomes and the best way to ac...
Let's assume for a moment that consequentialism in Eliezer's sense is the most pervasive thing in the problem space (this is not a claim anyone has made as far as I can tell). What does leaning into consequentialism super hard look like in terms of approaches? The only line of attack I know of which seems to meet the description is the convergent power-seeking sequence.
1) It's easier to build a moon base with money. And*, it's easier to steal money than earn it.
*This is a hypothetical
2) Even replacing that plan with one that 'human values' says works is tricky. What is an acceptable way to earn money?
One does not en...
Rather than saying that most likely-to-work plans for curing cancer route through consequentialism, I think it would be more precise to say that most simple likely-to-work plans route through consequentialism.
For every plan that can be summarized as "build a powerful consequentialist and then delegate the problem to it", it seems like there should be a corresponding (perhaps very complicated) plan that can be summarized as "directly execute the plan that that consequentialist would have used if you had built it."
The size of that complexity penalty varies d...