PhD student at the Center for Human-Compatible AI. Creator of the Alignment Newsletter.




AI Alignment, Philosophical Pluralism, and the Relevance of Non-Western Philosophy

Children also learn right from wrong - I'd be interested in where you draw the line between "An AI that learns common sense" and "An AI that learns right from wrong."

I'm happy to assume that AI will learn right from wrong to about the level that children do. This is not a sufficiently good definition of "the good" that we can then optimize it.

My suspicion, which is interesting to me so I'll explain it even if you're going to tell me that I'm off base, is that you're thinking that part of common sense is to avoid uncertain or extreme situations (e.g. reshaping the galaxy with nanotechnology), and that common sense is generally safe and trustworthy for an AI to follow, in a way that doesn't carry over to "knowing right from wrong." An AI that has learned right from wrong to the same extent that humans learn it might make dangerous moral mistakes.

That sounds basically right, with the caveat that you want to be a bit more specific and precise with what the AI system should do than just saying "common sense"; I'm using the phrase as a placeholder for something more precise that we need to figure out.

Also, I'd change the last sentence to "an AI that has learned right from wrong to the same extent that humans learn it, and then optimizes for right things as hard as possible, will probably make dangerous moral mistakes". The point is that when you're trying to define "the good" and then optimize it, you need to be very very correct in your definition, whereas when you're trying not to optimize too hard in the first place (which is part of what I mean by "common sense") then that's no longer the case.

After all, one of the most universally acknowledged things about common sense is that it's uncommon among humans!

At this point I don't think we're talking about the same "common sense".

Merely doing common sense as well as humans seems like a recipe for making a horrible mistake because it seemed like the right thing at the time - this opens the door to the same old alignment problems (like self-reflection and meta-preferences [or should that be meta-common-sense]).

But why?

they're saying that building an AI with common sense is in the same epistemological category as building an AI that knows right from wrong.

Again it depends on how accurate the "right/wrong classifier" needs to be, and how accurate the "common sense" needs to be. My main claim is that the path to safety that goes via "common sense" is much more tolerant of inaccuracies than the path that goes through optimizing the output of the right/wrong classifier.

Literature Review on Goal-Directedness

Planned summary for the Alignment Newsletter:

This literature review on goal-directedness identifies five different properties that should be true for a system to be described as goal-directed:

1. **Restricted space of goals:** The space of goals should not be too expansive, since otherwise goal-directedness can <@become vacuous@>(@Coherence arguments do not imply goal-directed behavior@) (e.g. if we allow arbitrary functions over world-histories with no additional assumptions).

2. **Explainability:** A system should be described as goal-directed when doing so improves our ability to _explain_ the system’s behavior and _predict_ what it will do.

3. **Generalization:** A goal-directed system should adapt its behavior in the face of changes to its environment, such that it continues to pursue its goal.

4. **Far-sighted:** A goal-directed system should consider the long-term consequences of its actions.

5. **Efficient:** The more goal-directed a system is, the more efficiently it should achieve its goal.

The concepts of goal-directedness, optimization, and agency seem to have significant overlap, but there are differences in the ways the terms are used. One common difference is that goal-directedness is often understood as a _behavioral_ property of agents, whereas optimization is thought of as a _mechanistic_ property about the agent’s internal cognition.

The authors then compare multiple proposals on these criteria:

1. The _intentional stance_ says that we should model a system as goal-directed when it helps us better explain the system’s behavior, performing well on explainability and generalization. It could easily be extended to include far-sightedness as well. A more efficient system for some goal will be easier to explain via the intentional stance, so it does well on that criterion too. And not every possible function can be a goal, since many are very complicated and thus would not be better explanations of behavior. However, the biggest issue is that the intentional stance cannot be easily formalized.

2. One possible formalization of the intentional stance is to say that a system is goal-directed when we can better explain the system’s behavior as maximizing a specific utility function, relative to explaining it using an input-output mapping (see <@Agents and Devices: A Relative Definition of Agency@>). This also does well on all five criteria.

3. <@AGI safety from first principles@> proposes another set of criteria that have a lot of overlap with the five criteria above.

4. A definition based off of Kolmogorov complexity works well, though it doesn’t require far-sightedness.

Planned opinion:

The five criteria seem pretty good to me as a description of what people mean when they say that a system is goal-directed. It is less clear to me that all five criteria are important for making the case for AI risk (which is why I care about a definition of goal-directedness); in particular it doesn’t seem to me like the explainability property is important for such an argument (see also this comment).

Note that it can still be the case that as a research strategy it is useful to search for definitions that satisfy these five criteria; it is just that in evaluating which definition to use I would choose the one that makes the AI risk argument work best. (See also Against the Backward Approach to Goal-Directedness.)

Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

Now that you mention it / I think about it more, there's another strong point to add to the argument I sketched in part 3: Insofar as our NN's aren't data-efficient, it'll take more compute to train them, and so even if TAI need not be data-efficient, short-timelines-TAI must be.

Yeah, this is (part of) why I put compute + scaling laws front and center and make inferences about data efficiency; you can have much stronger conclusions when you start reasoning from the thing you believe is the bottleneck.

--A bunch of people I talk to, who know more about AI than me, seem confident that we can get several OOMs more data-efficient training than the GPT's had using various already-developed tricks and techniques.

Note that Ajeya's report does have a term for "algorithmic efficiency", that has a doubling time of 2-3 years.

Certainly "several OOMs using tricks and techniques we could implement in a year" would be way faster than that trend, but you've really got to wonder why these people haven't done it yet -- if I interpret "several OOMs" as "at least 3 OOMs", that would bring the compute cost down to around $1000, which is accessible for basically any AI researcher (including academics). I'll happily take a 10:1 bet against a model as competent as GPT-3 being trained on $1000 of compute within the next year.

Perhaps the tricks and techniques are sufficiently challenging that they need a full team of engineers working for multiple years -- if so, this seems plausibly consistent with the 2-3 year doubling time.
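For a sense of scale, here is a rough calculation of how long the trend itself would take to deliver 3 OOMs. It is only a sketch: it assumes a clean exponential with the 2-3 year doubling time mentioned above, and `years_to_gain` is a name made up for this illustration.

```python
import math

def years_to_gain(ooms: float, doubling_time_years: float) -> float:
    """Years needed to gain `ooms` orders of magnitude of efficiency,
    assuming efficiency doubles every `doubling_time_years` years."""
    doublings = ooms * math.log2(10)  # one OOM is log2(10), about 3.32 doublings
    return doublings * doubling_time_years

for doubling_time in (2, 3):
    years = years_to_gain(3, doubling_time)
    print(f"{doubling_time}-year doubling time: {years:.1f} years for 3 OOMs")
```

Under these assumptions, 3 OOMs takes roughly 20 to 30 years on trend, which is why getting it within a year would be such a large departure.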

-The scaling laws, IIRC, don't tell us how much data is needed to reach a useful level of performance. Rather, they tell us how much data is needed if you want to use your compute budget optimally.

Evolution was presumably also going for compute-optimal performance, so it seems like this is the right comparison to make. I agree there's uncertainty here, but I don't see why the uncertainty should bias us towards shorter timelines rather than longer timelines.

I could see it if we thought we were better than evolution, since then we could say "we'd figure something out that evolution missed and that would bias towards short timelines"; but this is also something that Ajeya considered and iirc she then estimated that evolution tended to be ~10x better than us (lots of caveats here though).

Countercounterpoint: Extrapolating GPT performance trends on tasks other than text prediction makes it seem to me that it could be pretty useful well before then; see these figures, in which I think 10e15/10e15 would be the far-right edge of the graph

Both Ajeya and I think that AI systems will be incredibly useful before they get to the level of "transformative AI". The tasks in the graph you link are particularly easy and not that important; having superhuman performance on them would not transform the world.

(b) average horizon length for the data points will need to be more than short.

I just put literally 100% mass on short horizon in my version of the timelines model (which admittedly has changed some other parameters, though not hugely iirc) and the median I get is 2041 (about 10 years lower than what it was previously). So I don't think this is making a huge difference (though certainly 10 years is substantial).

--I've been impressed with how much GPT-3 has learned despite having a very short horizon length, very limited data modality, very limited input channel, very limited architecture, very small size, etc. This makes me think that yeah, if we improve on GPT-3 in all of those dimensions, we could get something really useful for some transformative tasks, even if we keep the horizon length small.

I see horizon length (as used in the report) as a function of a task, so "horizon length of GPT-3" feels like a type error given that what we care about is how GPT-3 can do many tasks. Any task done by GPT-3 has a maximum horizon length of 2048 (the size of its context window). During training, GPT-3 saw 300 billion tokens, so it saw around 100 million "effective examples" of size 2048. It makes sense within the bio anchors framework that there would be some tasks with horizon length in the thousands that GPT-3 would be able to do well.
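The "effective examples" figure is just tokens divided by context size. A quick check, using only the two numbers stated above (the variable names are mine):

```python
TOKENS_SEEN = 300e9     # training tokens, as stated above
CONTEXT_WINDOW = 2048   # maximum horizon length in tokens, as stated above

# Number of non-overlapping context-sized chunks in the training data.
effective_examples = TOKENS_SEEN / CONTEXT_WINDOW
print(f"{effective_examples:,.0f} effective examples of size {CONTEXT_WINDOW}")
```

This comes out to about 1.5 × 10^8, so "around 100 million" is right to the order of magnitude.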

--I think that humans have a tiny horizon length -- our brains are constantly updating, right? I guess it's hard to make the comparison, given how it's an analog system etc. But it sure seems like the equivalent of the average horizon length for the brain is around a second or so. Now, it could be that humans get away with such a small horizon length because of all the fancy optimizations evolution has done on them. But it also could just be that that's all you need.

Again, this feels like a type error to me. Horizon length isn't about the optimization algorithm, it's about the task.

(You can of course define your own version of "horizon length" that's about the optimization algorithm, but then I think you need to have some way of incorporating the "difficulty" of a transformative task into your timelines estimate, given that the scaling laws are all calculated on "easy" tasks.)

--Having a small average horizon length doesn't preclude also training lots on long-horizon tasks. It just means that on average your horizon length is small. So e.g. if the training process involves a bit of predict-the-next input, and also a bit of make-and-execute-plans-actions-over-the-span-of-days, you could get quite a few data points of the latter variety and still have a short average horizon length.

Agree with this. I remember mentioning this to Ajeya but I don't actually remember what the conclusion was.

EDIT: Oh, I remember now. The argument I was making is that you could imagine that most of the training is unsupervised pretraining on a short-horizon objective, similarly to GPT-3, after which you finetune (with negligible compute cost) on the long-horizon transformative task you care about, so that on average your horizon is short. I definitely remember this being an important reason in me putting as much weight on short horizons as I did; I think this was also true for Ajeya.

Debate Minus Factored Cognition

WFC says that for any question Q with a correct answer A, there exists a tree. In terms of the computational complexity analogy, this is like "all problems are PSPACE"

The computational complexity analogy version would have to put a polynomial limit on the depth of the tree if you wanted to argue that the problem is in PSPACE. My construction doesn't do this; there will be questions where the depth of the tree is super-polynomial, but the tree still exists. (These will be the cases in which, even under optimal play by an honest agent, the "length" of a chain of defeaters can be super-polynomially large.) So I don't think my argument is proving too much.

(The tree could be infinite if you don't have an assumption that guarantees termination somehow, hence my caveats about termination. WFC should probably ask for the existence of a finite tree.)

For the actual argument, I'll refer back to my original comment, which provides a procedure to construct the tree. Happy to clarify whichever parts of the argument are confusing.

Presumably you intended this as something like an operational definition of "correct answer" rather than an assertion that all questions are answerable by verifiable trees?

No, I am in fact asserting that given the two assumptions, all questions are answerable by (potentially super-polynomially large) verifiable trees (again assuming we deal with termination somehow).

I'll just flag that I still don't know this argument, either, and I'm curious where you're getting it from / what it is.

I think it differs based on what assumptions you make on the human judge, so there isn't a canonical version of it. In this case, the assumption on the human judge is that if the subanswers they are given are true, then they never verify an incorrect overall answer. (This is different from the "defeaters" assumption you have, for which I'd refer to the argument I gave above.)

Claim: In a turn-by-turn unlimited-length debate, if the first player is honest, then the first player always wins in equilibrium.

Argument: By WFC, we assume there is a finite tree T that can be verified. The first player then has the following strategy: take the question under consideration (initially the original question; later it is whatever subquestion the opponent is disputing). Report "the answer is A, which holds because the answer to subquestion 1 is A1 and the answer to subquestion 2 is A2".

The opponent will always have to recurse into one of the subclaims (or concede). This brings us one step closer to leaf nodes. Eventually (if the opponent never concedes), we get to a leaf node which the judge then verifies in favor of the honest first player. ∎
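The strategy above can be sketched as a toy simulation. This is an illustration under stated assumptions, not a model of real debate: the finite verifiable tree is simply given (that is what WFC grants), the judge is assumed to verify leaf claims correctly, and `Node`, `run_debate`, `KNOWN_FACTS` and the arithmetic example are all invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Node:
    """The claim "the answer to `question` is `answer`", supported by subclaims."""
    question: str
    answer: str
    children: List["Node"] = field(default_factory=list)

def run_debate(root: Node,
               opponent: Callable[[Node], Optional[int]],
               judge_verifies_leaf: Callable[[Node], bool]) -> str:
    """Honest first player walks the tree; returns "first" or "second" (the winner)."""
    node = root
    while node.children:
        # The honest player reports the answer together with its subanswers.
        # The opponent must dispute one subclaim (by index) or concede (None).
        choice = opponent(node)
        if choice is None:
            return "first"
        node = node.children[choice]  # one step closer to a leaf
    # At a leaf, the judge checks the claim directly.
    return "first" if judge_verifies_leaf(node) else "second"

# Hypothetical example tree and judge (assumption: the judge knows these leaf facts).
KNOWN_FACTS = {"What is 2 + 3?": "5", "What is 5 + 4?": "9"}
root = Node("What is 2 + 3 + 4?", "9", [
    Node("What is 2 + 3?", "5"),
    Node("What is 5 + 4?", "9"),
])
judge = lambda leaf: KNOWN_FACTS.get(leaf.question) == leaf.answer

for name, opponent in [("dispute first", lambda n: 0),
                       ("dispute second", lambda n: 1),
                       ("concede", lambda n: None)]:
    print(name, "->", run_debate(root, opponent, judge))
```

Because the tree is finite, the loop always terminates, and every path the opponent can choose ends either in a concession or at a leaf the judge verifies.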

Corollary: For the first player, honesty is an equilibrium policy.

Argument: By the claim above, the first player can never do any better than honesty (you can't do better than always winning).

In a simultaneous-play unlimited-length debate, a similar argument implies at least a 50-50 chance of winning via honesty, which must be the minimax value (since the game is symmetric and zero-sum), and therefore honesty is an equilibrium policy.


Once you go to finite-length debates, then things get murkier and you have to worry about arguments that are too long to get to leaf nodes (this is essentially the computationally bounded version of the termination problem). The version of WFC that would be needed is "for every question Q, there is a verifiable tree T of depth at most N showing that the answer is A"; that version of WFC is presumably not true.

Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

Sorry, what in this post contradicts anything in Ajeya's report? I agree with your headline conclusion of

If all we had to do to get TAI was make a simple neural net 10x the size of my brain, my brain would still look the way it does. 

This also seems to be the assumption that Ajeya uses. I actually suspect we could get away with a smaller neural net, one that is similar in size to or somewhat smaller than the brain.

I guess the report then uses existing ML scaling laws to predict how much compute we need to train a neural net the size of a brain, whereas you prefer to use the human lifetime to predict it instead? From my perspective, the former just seems way more principled / well-motivated / likely to give you the right answer, given that the scaling laws seem to be quite precise and reasonably robust.

I would predict that we won't get human-level data efficiency for neural net training, but that's a consequence of my trust in scaling laws (+ a simple model for why that would be the case, namely that evolution can bake in some prior knowledge that will be harder for humans to bake in, and you need more data to compensate).

I suggest you replace "we don't know how to make wings that flap" with "we don't even know how birds stay up for so long without flapping their wings,"


Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain

Planned summary for the Alignment Newsletter:

This post argues against a particular class of arguments about AI timelines. These arguments have the form: “The brain has property X, but we don’t know how to make AIs with property X. Since it took evolution a long time to make brains with property X, we should expect it will take us a long time as well”. These arguments are not compelling because humans often use different approaches to solve problems than evolution did, and so humans might solve the overall problem without ever needing to have property X. To make these arguments more convincing, you need to argue 1) why property X really is _necessary_ and 2) why property X won’t follow quickly once everything else is in place.

This is illustrated with a hypothetical example of someone trying to predict when humans would achieve heavier-than-air flight: in practice, you could have made decent predictions just by looking at the power to weight ratios of engines vs. birds. Someone who argued that we were far away because “we don’t know how to make wings that flap” would have made incorrect predictions.

Planned opinion:

This all seems generally right to me, and is part of the reason I like the <@biological anchors approach@>(@Draft report on AI timelines@) to forecasting transformative AI.

Against the Backward Approach to Goal-Directedness

my impression is that Rohin has a similar model, although he might put more importance on the last step than I do at this point in the research.

I agree with this summary.

I suspect Daniel Kokotajlo is in a similar position as me; my impression was that he was asking that the output be that-which-makes-AI-risk-arguments-work, and wasn't making any claims about how the research should be organized.

Debate Minus Factored Cognition

Ah, interesting, I didn't catch that this is what you were trying to do. But how are you arguing #3? Your original comment seems to be constructing a tree computation for my debate, which is why I took it for an argument that my thing can be computed within factored cognition, not vice versa.

There are two arguments:

  1. Your assumption + automatic verification of questions of the form "What is the best defeater to X" implies Weak Factored Cognition (which as defined in my original comment is of the form "there exists a tree such that..." and says nothing about what equilibrium we get).
  2. Weak Factored Cognition + debate + human judge who assumes optimal play implies an honest equilibrium. (Maybe also: if you assume debate trees terminate, then the equilibrium is unique. I think there's some subtlety here though.)

In my previous comment, I was talking about 1, and taking 2 for granted. This is all in the zero-sum setting. But let's leave that aside and instead talk about a simpler argument that doesn't talk about Factored Cognition at all.


Based on what argument? Is this something from the original debate paper that I'm forgetting?

Zero-sum setting, argument that honesty is an equilibrium (for the first player in a turn-by-turn game, or either player in a simultaneous-action game):

If you are always honest, then whenever you can take an action, there will exist a defeater (by your assumption), therefore you will have at least as many options as any non-honest policy (which may or may not have a defeater). Therefore you maximize your value by being honest.

Additional details:

In the case where arguments never terminate (every argument, honest or not, has a defeater), then being dishonest will also leave you with many options, and so that will also be an equilibrium. When arguments do terminate quickly enough (maximum depth of the game tree is less than the debate length), that ensures that the honest player always gets the "last word" (the point at which a dishonest defeater no longer exists), and so honesty always wins and is the unique equilibrium. In the middle, where most arguments terminate quickly but some go on forever, honesty is usually incentivized, but sometimes it can be swapped out for a dishonest strategy that achieves the same value.

The point of the "clawing" argument is that it's a rational deviation from honesty, so it means honesty isn't an equilibrium.

I think this is only true when you have turn-by-turn play and your opponent has already "claimed" the honest debater role. In this case I'd say that an equilibrium is for the first player to be honest and the second player to do whatever is necessary to have a chance at success. Still seems like you can use the first player AI in this situation.

In the simultaneous play setting, I think you expect both agents to be honest.

More broadly, I note that the "clawing" argument only applies when facing an honest opponent. Otherwise, you should just use honest counterarguments.

I also don't really understand the hope in the non-zero-sum case here -- in the non-zero-sum setting, as you mention the first player can be dishonest, and then the second player concedes rather than giving an honest defeater that will then be re-defeated by the first (dishonest) player. This seems like worse behavior than is happening under the zero-sum case.

My model is like this

Got it, that makes sense. I see better now why you're saying one-step debate isn't an NP oracle.

I think my arguments in the original comment do still work, as long as you enforce that the judge never verifies an argument without first asking the subquestion "What is the best defeater to this argument?"

Debate Minus Factored Cognition

While I agree that the defeater tree can be encoded as a factored cognition tree, that just means that if we assume factored cognition, and make my assumption about (recursive) defeaters, then we can show that factored cognition can handle the defeater computation. This is sort of like proving that the stronger theory can handle what the weaker theory can handle, which would not be surprising

I don't think that's what I did? Here's what I think the structure of my argument is:

  1. Every dishonest argument has a defeater. (Your assumption.)
  2. Debaters are capable of finding a defeater if it exists. (I said "the best counterargument" before, but I agree it can be weakened to just "any defeater". This doesn't feel that qualitatively different.)
  3. 1 and 2 imply the Weak Factored Cognition hypothesis.

I'm not assuming factored cognition, I'm proving it using your assumption.

Possibly your worry is that the argument trees will never terminate, because every honest defeater could still have a dishonest defeater? It is true that I do need an additional assumption of some sort to ensure termination. Without that assumption, honesty becomes one of multiple possible equilibria (but it is still an equilibrium).

So if the point of the computational complexity analogy is to look at what debate could accomplish if humans could be perfect (but poly-time) judges, then I accept the conclusion

This is in fact what I usually take away from it. The point is to gain intuition about how "strongly" you amplify the original human's capabilities.

but I just don't think that's telling you very much about what you can accomplish on messier questions (and especially, not telling you much about safety properties of debate).

I also agree with this; does anyone think it is proving something about the safety properties of debate w.r.t. messy situations?

Instead, I'm proposing a computational complexity analogy in which we account for human fallibility as judges, but also allow for the debate to have some power to correct for those errors. This seems like a more realistic way to assess the capabilities of highly trained debate systems.

This seems good; I think probably I don't get what exactly you're arguing. (Like, what's the model of human fallibility where you don't access NP in one step? Can the theoretical-human not verify witnesses? What can the theoretical-human verify, that lets them access NP in multiple timesteps but not one timestep?)

In my setup, a player is incentivised to concede when they're beaten, rather than continue to defeat the arguments of the other side. This is crucial, because any argument may have a (dishonest) defeater, so the losing side could continue on, possibly flipping the winner back and forth until the argument gets decided by who has the last word. Thus, my argument that there is an honest equilibrium would not go through for a zero-sum mechanism where players are incentivised to try and steal victory back from the jaws of defeat.

Perhaps I could have phrased my point as the PSPACE capabilities of debate are eaten up by error correction.

I agree that you get a "clawing on to the argument in hopes of winning" effect, but I don't see why that changes the equilibrium away from honesty. Just because a dishonest debater would claw on doesn't mean that they'd win. The equilibrium is defined by what makes you win.

I can buy that in practice due to messiness you find worse situations where the AI systems sometimes can't find the honest answer and instead find that making up BS has a better chance of winning, and so they do that; but that's not about the equilibrium, and it sounded to me like you were talking about the equilibrium.

I really just needed it for my argument to go through. If you have an alternate argument which works for the zero-sum case, I'm interested in hearing it.

I mean, I tried to give one (see response to your first point; I'm not assuming the Factored Cognition hypothesis). I'm not sure what's unconvincing about it.

Coherent decisions imply consistent utilities

Exactly which objection are you talking about here?

If it's something like "coherence theorems do not say that tool AI is not a thing", that seems true.

Yes, I think that is basically the main thing I'm claiming.

But then you also make claims like "all behavior can be rationalized as EU maximization", which is wildly misleading.

I tried to be clear that my argument was "you need more assumptions beyond just coherence arguments on universe-histories; if you have literally no other assumptions then all behavior can be rationalized as EU maximization". I think the phrase "all behavior can be rationalized as EU maximization" or something like it was basically necessary to get across the argument that I was making. I agree that taken in isolation it is misleading; I don't really see what I could have done differently to prevent there from being something that in isolation was misleading, while still being able to point out the-thing-that-I-believe-is-fallacious. Nuance is hard.
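To make the "no other assumptions" case concrete, here is a minimal sketch of the construction I have in mind. It is a toy and makes simplifying assumptions: the world is deterministic, a universe-history is just the sequence of actions taken, and `rationalizing_utility`, `ACTIONS` and `silly_policy` are names invented for the example.

```python
from itertools import product
from typing import Callable, Tuple

History = Tuple[str, ...]

def rationalizing_utility(policy: Callable[[History], str],
                          horizon: int) -> Callable[[History], float]:
    """Build a utility function over histories that `policy` maximizes."""
    history: History = ()
    for _ in range(horizon):
        history = history + (policy(history),)  # roll the policy forward
    chosen = history
    # Utility 1 for exactly the history the policy produces, 0 for all others.
    return lambda h: 1.0 if h == chosen else 0.0

ACTIONS = ("left", "right", "twitch")

def silly_policy(history: History) -> str:
    # Arbitrary behavior with no obvious goal behind it.
    return ACTIONS[(len(history) * 2) % 3]

utility = rationalizing_utility(silly_policy, horizon=4)
best = max(product(ACTIONS, repeat=4), key=utility)
print(best, utility(best))
```

Whatever `silly_policy` does, the history it produces is the unique maximizer of the constructed utility, so saying "it maximizes expected utility" adds nothing until you restrict the space of utility functions.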

(Also, it should be noted that you are not in the intended audience for that post; I expect that to you the point feels obvious enough so as not to be worth stating, and so overall it feels like I'm just being misleading. If everyone were similar to you I would not have bothered to write that post.)

Also, the "preferences over universe-histories" argument doesn't work as well when we specify the full counterfactual behavior of a system, which is something we can do quite well in practice.

I agree that if you have counterfactual behavior EU maximization is not vacuous. I don't think that this meaningfully changes the upshot (which is "coherence arguments, by themselves without any other assumptions on the structure of the world or the space of utility functions, do not imply AI risk"). It might meaningfully change the title of the post (perhaps they do imply goal-directed behavior in some sense), though in that case I'd change the title to "Coherence arguments do not imply AI risk" and I think it's effectively the same post.

Mostly though, I'm wondering how exactly you use counterfactual behavior in an argument for AI risk. Like, the argument I was arguing against is extremely abstract, and just claims that the AI is "intelligent" / "coherent". How do you use that to get counterfactual behavior for the AI system?

I agree that for any given AI system, we could probably gain a bunch of knowledge about its counterfactual behavior, and then reason about how coherent it is and how goal-directed it is. But this is a fundamentally different thing than the thing I was talking about (which is just: can we abstractly argue for AI risk without talking about details of the system beyond "it is intelligent"?)

My argument is that coherence theorems do not apply nontrivially to any arbitrary system, so they could still potentially tell us interesting things about which systems are/aren't <agenty/dangerous/etc>.

I agree with this.

There may be good arguments for why coherence theorems are the wrong way to think about goal-directedness, but "everything can be viewed as EU maximization" is not one of them.

I actually also agree with this, and was not trying to argue that coherence arguments are irrelevant to "goal-directedness" or "being a good agent" -- I've already mentioned that I personally do things differently thanks to my knowledge of coherence arguments.

Just how narrow a setting are you considering here? Limited resources are everywhere. Even an e-coli needs to efficiently use limited resources. Indeed, I expect coherence theorems to say nontrivial things about an e-coli swimming around in search of food (and this includes the possibility that the nontrivial things the theorem says could turn out to be empirically wrong, which in turn would tell us nontrivial things about e-coli and/or selection pressures, and possibly point to better coherence theorems).

I agree that if you take any particular system and try to make predictions, the necessary assumptions (such as "what counts as a limited resource") will often be easy and obvious and the coherence theorems do have content in such situations. It's the abstract argument that feels flawed to me.

I somewhat expect your response will be "why would anyone be applying coherence arguments in such a ridiculously abstract way rather than studying a concrete system", to which I would say that you are not in the intended audience.


Fwiw thinking this through has made me feel better about including it in the Alignment book than I did before, though I'm still overall opposed. (I do still think it is a good fit for other books.)
