Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Much of the difficulty in making progress on AI safety comes from the lack of useful feedback loops. We do not have a superintelligent AI to run tests on, and by the time we do, it will probably be too late. This means we have to resort to using proxies. In this post, I will highlight what I think are the four most useful proxies we have access to today: modern machine learning systems, humans, groups, and abstract models.

These proxies are not opposed to each other. We need all the data we can get, and the right strategy is to learn from all of them. However, each proxy also has its flaws, and will break if used in the wrong way. It is important to pay attention to where each proxy works and where it fails.

Modern Machine Learning Systems

Pros: With ML systems, one can do experiments quickly to get fast feedback loops with lots of quantitative data. Also, AGI will probably be made using something similar to modern machine learning systems, so insights gained by thinking about ML systems today may be particularly easy to port over.

Cons: Modern ML systems are opaque, and it can be very difficult to understand how they are working. This may be a problem we need to overcome eventually anyway, but that doesn't make it any easier to get data from them today.

Modern ML systems are also not very smart, and it is reasonable to expect a qualitative difference between their behavior and the behavior of superintelligent systems. This may lead to a false sense of security, especially in solutions that depend implicitly or explicitly on the AI lacking some particular capability, or using a particular strategy.

Examples: Concrete Problems in AI Safety and Alignment for Advanced Machine Learning Systems both try to reason about AI safety by using modern machine learning as their primary source of intuitions. The safety teams at OpenAI and DeepMind try to use this as a primary source of data (although it would be very uncharitable to claim that they do not also draw from the other sources, especially abstract models). AI Safety Gridworlds is a good example of developing intuitions about AI safety from looking at modern systems.


Humans

Pros: Humans are (at least on some axes) the most intelligent systems we have direct access to. We can use this to notice failures of other proxies by noticing that we can do better at optimizing some function using clever tricks that those systems might also be able to use when you scale them up. Also, humans are an intelligence that we can experience from the inside.

Cons: Humans are opaque. Even from our inside view, it is very difficult to understand how they work, and they are very hard to modify. Of the four proxies, they are also the most difficult to talk about rigorously. There is also the failure mode of anthropomorphizing badly and attributing arbitrary properties of humans (and especially human goals) to AGI.

Examples: Some have used the complexity of the human brain as a starting point for predicting how much hardware is necessary for AGI.

Ought plans to run experiments on humans that will help form intuitions on how easy it is to decompose big questions into small pieces, which can provide evidence about whether or not Paul Christiano-style amplification could work.

We can use humans to provide lower bounds for what capabilities are possible, like in the AI box experiment.


Groups

Pros: Groups of humans can often be more transparent than individual humans, since the individuals within the group communicate in a transparent way. This is perhaps the most alien agent-like thing we have direct access to, which could provide a check for anthropomorphism. Thinking about groups could also encourage reductionism. All agents are made up of smaller parts, and groups make this more explicit.

Cons: This is probably the worst of the four sources. Groups are often not very agentic, and end up shooting themselves in the foot. Also, if your model of AI looks a lot like organizing a group of hypotheses (e.g. Solomonoff Induction), you are more likely to accidentally pass the buck and depend on an individual component being as powerful as the entire system.

Examples: Evolution is a particularly useful group to gain intuitions from. You can get an (upper bound) estimate for how difficult AI is by looking at how many bits of optimization evolution needed.

Evolution is also a very clear example and source of intuition for thinking about Optimization Daemons.

Some AI safety proposals look a lot like organizing a lot of smaller AIs together into one more powerful thing, like Paul Christiano's amplification and Eric Drexler's intelligence distillation.

Abstract Models

Pros: Abstract models can deal with systems that are much smarter than the other three sources. Instead of looking at the best optimizer we can actually run, we can think about the properties of the input that literally maximizes whatever thing we are optimizing. Abstract models also have more potential for rigor than the other sources, and conclusions that you get from thinking about abstract models often apply to many different concrete systems.
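
To make the contrast concrete, here is a minimal sketch on a toy problem. Nothing in it comes from a real system: the objective `score`, the search space, and the sampling budget are all hypothetical, chosen only for illustration. `bounded_optimizer` stands in for "the best optimizer we can actually run", while `ideal_argmax` stands in for the abstract-model move of reasoning about the input that literally maximizes the objective.

```python
import random

# Hypothetical toy search space; small enough to enumerate exhaustively.
SEARCH_SPACE = range(1000)


def score(x: int) -> int:
    """Hypothetical objective with a single peak at x = 700."""
    return -(x - 700) ** 2


def bounded_optimizer(budget: int, seed: int = 0) -> int:
    """Sample `budget` random candidates and return the best one seen.

    This models an optimizer with limited resources: it usually finds
    something decent, but not necessarily the true maximizer.
    """
    rng = random.Random(seed)
    candidates = [rng.choice(SEARCH_SPACE) for _ in range(budget)]
    return max(candidates, key=score)


def ideal_argmax() -> int:
    """Return the true maximizer by exhaustive search.

    Only feasible here because the toy space is tiny; in realistic
    settings this is the part we can reason about but cannot run.
    """
    return max(SEARCH_SPACE, key=score)


if __name__ == "__main__":
    found = bounded_optimizer(budget=10)
    best = ideal_argmax()
    print("bounded optimizer:", found, score(found))
    print("ideal argmax:     ", best, score(best))
```

The bounded optimizer can never score higher than the ideal maximizer, so a property that holds for the maximizer tells us where any sufficiently strong optimizer is heading, even when we cannot run one.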

Cons: Abstract models can't be observed directly, so we have to depend on proofs that only tell us about some properties of the system. The insights we gain from abstract models might end up being very hard to port over into concrete systems. Abstract models often deal with upper bounds for capabilities, which may be unrealistic because they require more resources than our universe has access to.

Examples: AIXI is a useful specific abstract model of AGI. MIRI's Agent Foundations work is mostly trying to get data from thinking about abstract models. Concepts like the orthogonality thesis and convergent instrumental goals come from thinking about abstract models.

26 comments

Some Discussion Questions:

Am I missing something? (Animals?)

Do you think that one of these is actually just useless?

Am I wrong to think of evolution and designed groups as the same type of object?

Am I strawmanning/steelmanning any given proxy too much?

Do you disagree with my wrapping data and intuitions together and treating math as a type of empirical data?

Is it the case that takeoff will be slow and these proxies all suck, so we just have to wait for real feedback loops?

The list of things seems basically right to me, although I find it a bit jarring to put evolution under "groups" and I might want it to be in its own category or something. When analyzing groups and their agency it seems important to me to be very careful about what the group is actually optimizing for.

What is the group actually optimizing for? Not having an explicit optimization goal is kind of the point here, and it is the common theme between human groups and evolution: lots of agents optimize for something, and something very different happens as a result.

I just mean that it's easy to judge a group as being bad at doing stuff if you judge it to be optimizing poorly for a thing that it's not actually optimizing for. Nobody does the thing they are supposedly doing and so forth. There's a separate question of whether a group is doing a good job optimizing for what its founder(s) wanted it to optimize for.

I share some intuition with Qiaochu that evolution in the "group" section feels off. I feel like there should be a category of something like "other optimization processes" which have as examples "insect intelligences", "bird intelligences", "game-of-life-like scenarios" and "evolution".

I considered doing designed groups vs. more accidental groups, but I didn't consider just having an "other" category. I think I wish I had done that, but I am not going to bother to redo it. I don't think the structure as a taxonomy is that important; I just want people to be aware of what we have to work with.

I don't understand. "Evolution" and "group of insects" fall into one category for you, but "group of humans" does not?

I think I'm willing to defend this. As a human, I have some social intuitions about how groups of humans work that don't easily translate to an understanding of how either evolution or groups of insects work. This is also the reason I like having humans in their own category above.

Yep, my brain does social modeling that doesn’t really transfer to insects or other animals.

I would add in animals if you are asking questions about the nature of general intelligence. For example, people claim monkeys are better at certain tasks than humans. What does that mean for the notion of general intelligence, if anything?


Is there still not a name for "Paul Christiano Style Amplification"? Can't we come up with a name like "Anthropomorphic Simulators" or something so that this can become a legit thing people talk about instead of always seeming the hobby horse of one dude?

We'll publish about a prototype version of this system soon, which will probably involve us settling on a name.

It's not really about simulators; in practice you will probably use RL+imitation for the distillation step.

"Christiano amplification" seems fine to me.

Currently feeling happiness at this comment, but also sadness at the fact that Paul hasn't already picked something he finds fitting and so it'll fall to some other optimisation process, like 'whatever term most people feel like they can easily understand' which is not optimising for the right thing.

(Has Paul picked something that I just didn't know and can start using?)

I remember hearing people call it iterative distillation and amplification (IDA), but I think this name might be too general.

"Iterated distillation and amplification" is probably my preferred name at the moment. What do you think makes it too general? That is, what do you think is covered by that name but shouldn't be?

(I think the name came from some combination of Ajeya and Daniel Dewey.)

I had also heard that term. When I heard it, it came with the tag 'sufficiently general as to apply to what AlphaGo Zero did' (I think AGZ, maybe a different AlphaGo) and I thought that meant it was too non-specific to apply to a potential path forward on alignment.

If the idea does have significant overlap with current systems (which I believe it does), it might be better to have a name that applies specifically to whichever part of the proposal is new / is different to what is already happening currently in capabilities research.

Yeah, I think Ben captures my objection - IDA captures what is different between your approach and MIRI's agenda, but not what is different between some existing AI systems and your approach.

This might not be a bad thing - perhaps you want to choose a name that is evocative of existing approaches to stress that your approach is the natural next step for AI development, for example.

I curated this post because I think it moves the needle forward on an important problem. While there's some discussion to be had about the exact ontology (and that discussion happened in the comments), it's concrete, simple, well-structured, and pretty understandable to all.

What are the questions you are trying to answer about the first AGIs?

  • How will they behave?
  • What will they be capable of?
  • What is the nature of the property we call intelligence?

I find the second one much more interesting, with more data to be acquired. For the second one I would include things like modern computer hardware and what we have managed to achieve with it (and the nature and structure of those achievements).

All of these, and general orientation around the problem, and what concrete things we should do.

I've argued before that we should understand the process of science (how much analysis vs. data processing vs. real-world tests) in order to understand how likely it is that AGI will be able to do science quickly. This affects the types of threats we should expect. We should also look at the process of programming with a similar lens to see how much a human-level programmer could be improved upon. There is a lot of non-human-bounded activity in industrial-scale programming, and much of it is running automated test suites. Will AIs need to run similar suites, or can they do things in a more adequate way?

Information from sociology and history should impact our priors on the concrete strategies that may work. But that may be taken as a given and less interesting.

Surprisingly, nation-states were not mentioned as a possible example of a group. Nation-states often suffer from following commands literally: if a parliament has passed a stupid law, law enforcement officers have to implement it. There are many other similarities.

I would be wary of abstract models. It's not clear to me that the AIs we will build will be similar to AIXI or that AIXI is even a natural limit point of some sort. I think that if abstract models lead to a result, but the other three proxies don't confirm it, then we should be suspicious of it.

AGI will probably be made using something similar to modern machine learning systems

Off topic, but is there agreement on this? I sure agree, but I recall seeing somewhere that EY thinks a hand-coded AI to be most likely, and the idea is rubbing me the wrong way.

I agree with some light suspicion, but I would be more wary of automatically rejecting things that don't match up with the other proxies. I feel like (very roughly) the other three mostly give lower bounds, while a certain type of abstract model mostly gives upper bounds. When our best upper bounds and our best lower bounds don't match, the best response looks like large error bars.
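
A minimal sketch of this bounds picture, with purely hypothetical numbers on an arbitrary capability scale (none of the values or names below come from any real estimate): the strongest lower bound and the strongest upper bound define the interval, and the gap between them is the error bar.

```python
from typing import Iterable, Tuple


def combine_bounds(
    lower_bounds: Iterable[float], upper_bounds: Iterable[float]
) -> Tuple[float, float]:
    """Return the tightest interval consistent with every bound."""
    lo = max(lower_bounds)  # the strongest lower bound wins
    hi = min(upper_bounds)  # the strongest upper bound wins
    if lo > hi:
        raise ValueError("bounds are inconsistent: some proxy is misleading us")
    return lo, hi


# Hypothetical values, for illustration only.
lower = {"ml_systems": 2.0, "humans": 5.0, "groups": 4.0}
upper = {"abstract_models": 50.0}

lo, hi = combine_bounds(lower.values(), upper.values())
print(f"estimate lies in [{lo}, {hi}], error bar width {hi - lo}")
```

In this picture, a result that only the abstract model supports widens the interval; it does not by itself make the upper bound wrong.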

I think that upper and lower bounds on capabilities are not the main thing we should be looking for, but I think we can also get pretty big lower bounds from starting with a human and imagining what happens when we can scale it up (more copies, more time, more introspection).

I don't want to speak for anyone else, sorry. That said, Eliezer is a coauthor of the AAMLS agenda.

I do think the claim I made is rather weak. I was talking about AGI (not AGI conditional on solving alignment), I was (in my head) including hybrid approaches that only partially use ML-like things, and I meant "something similar" in a pretty broad sense (they might become way more, e.g., transparent before we get to AGI).