Much of the difficulty in making progress on AI safety comes from the lack of useful feedback loops. We do not have a superintelligent AI to run tests on, and by the time we do, it will probably be too late. This means we have to resort to proxies. In this post, I will highlight what I think are the four most useful proxies we have access to today: modern machine learning systems, humans, groups, and abstract models.
These proxies are not opposed to each other. We need all the data we can get, and the right strategy is to learn from all of them. However, each proxy also has its flaws, and will break if used in the wrong way. It is important to pay attention to where each proxy works and where it fails.
Modern Machine Learning Systems
Pros: With ML systems, one can run experiments quickly, getting fast feedback loops with lots of quantitative data. Also, AGI will probably be built using something similar to modern machine learning systems, so insights gained by thinking about ML systems today may be particularly easy to port over.
Cons: Modern ML systems are opaque, and it can be very difficult to understand how they work. This may be a problem we need to overcome eventually anyway, but that doesn't make it any easier to get data from them today.
Modern ML systems are also not very smart, and it is reasonable to expect a qualitative difference between their behavior and that of superintelligent systems. This can create a false sense of security, especially for solutions that depend, implicitly or explicitly, on the AI lacking some particular capability or using some particular strategy.
Examples: Concrete Problems in AI Safety and Alignment for Advanced Machine Learning Systems both try to reason about AI safety using modern machine learning as their primary source of intuitions. The safety teams at OpenAI and DeepMind also use this as a primary source of data. (Although it would be very uncharitable to claim that they do not also draw from the other sources, especially abstract models.) AI Safety Gridworlds is a good example of developing intuitions about AI safety by looking at modern systems.
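To make the flavor of these fast feedback loops concrete, here is a toy sketch, with all details invented for illustration, of the kind of failure such experiments probe: an agent that optimizes a misspecified proxy reward and so settles on behavior we did not intend.

```python
import random

random.seed(0)

# Toy reward-misspecification demo (all numbers invented for illustration):
# the agent only ever sees the proxy reward, which diverges from the true
# objective on the "shortcut" action.
proxy_reward = {"shortcut": 1.0, "intended": 0.8}  # what the agent optimizes
true_reward = {"shortcut": 0.0, "intended": 0.8}   # what we actually want

q = {"shortcut": 0.0, "intended": 0.0}
for _ in range(1000):
    # Epsilon-greedy action selection against the learned value estimates.
    if random.random() < 0.1:
        action = random.choice(list(q))
    else:
        action = max(q, key=q.get)
    # Incremental update toward the (misspecified) proxy reward.
    q[action] += 0.1 * (proxy_reward[action] - q[action])

chosen = max(q, key=q.get)
print(chosen, true_reward[chosen])  # the agent settles on "shortcut", true reward 0.0
```

Even a few hundred training steps of a trivial learner like this produce quantitative data about how strongly it games the proxy, which is the kind of cheap, repeatable measurement ML systems make possible.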
Humans
Pros: Humans are (at least on some axes) the most intelligent systems we have direct access to. We can use them to notice failures of other proxies, by noticing when we can do better at optimizing some function using clever tricks that those systems might also discover when scaled up. Also, humans are an intelligence that we can experience from the inside.
Cons: Humans are opaque. Even from our inside view, it is very difficult to understand how they work, and they are very hard to modify. They are also the most difficult of the proxies to talk about rigorously. There is also the failure mode of anthropomorphizing badly and attributing arbitrary properties of humans (especially human goals) to AGI.
Examples: Some have used the complexity of the human brain as a starting point for predicting how much hardware is necessary for AGI.
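A back-of-envelope sketch of that style of estimate, using commonly cited rough figures for the brain; every value here is a loose order-of-magnitude assumption, not a precise measurement.

```python
# Rough brain-complexity arithmetic (all figures are loose,
# commonly cited order-of-magnitude assumptions).
neurons = 8.6e10            # ~86 billion neurons in the human brain
synapses_per_neuron = 1e3   # low end of the ~1e3-1e4 range
firing_rate_hz = 1.0        # average firing rate, order of ~0.1-1 Hz

# Treat each synaptic event as roughly one operation per second.
ops_per_second = neurons * synapses_per_neuron * firing_rate_hz
print(f"~{ops_per_second:.0e} synaptic ops/second")  # on the order of 1e14
```

Changing any of the assumed inputs by an order of magnitude moves the answer correspondingly, which is why such estimates span a wide range of hardware requirements.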
Ought plans to try to run experiments on humans that will help form intuitions on how easy it is to decompose big questions into small pieces, which can provide evidence about whether or not Paul Christiano style amplification could work.
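A minimal sketch of the decomposition idea behind such experiments, with an invented toy task: a "weak" solver that can only combine two numbers at a time answers a larger question by recursively splitting it into subquestions.

```python
# Toy factored-cognition sketch: the task and helper names are
# invented for illustration.

def weak_solver(numbers):
    # The weak unit can only reliably combine at most 2 numbers.
    assert len(numbers) <= 2
    return sum(numbers)

def amplified(numbers):
    # Big questions are decomposed into two smaller subquestions,
    # answered recursively, and combined by the weak solver.
    if len(numbers) <= 2:
        return weak_solver(numbers)
    mid = len(numbers) // 2
    return weak_solver([amplified(numbers[:mid]), amplified(numbers[mid:])])

print(amplified(list(range(10))))  # 45
```

The open empirical question is whether interesting real-world questions decompose this cleanly; the experiments on humans are meant to probe exactly that.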
We can use humans to provide lower bounds for what capabilities are possible, like in the AI box experiment.
Groups of Humans
Pros: Groups of humans can often be more transparent than individual humans, since the individuals within the group communicate in a transparent way. A group is perhaps the most alien agent-like thing we have direct access to, which can provide a check against anthropomorphism. Thinking about groups can also encourage reductionism: all agents are made up of smaller parts, and groups make this explicit.
Cons: This is probably the worst of the four sources. Groups are often not very agentic, and end up shooting themselves in the foot. Also, if your model of AI looks a lot like organizing a group of hypotheses (e.g. Solomonoff Induction), you are more likely to accidentally pass the buck and depend on an individual component being as powerful as the entire system.
Examples: Evolution is a particularly useful group to gain intuitions from. You can get an upper-bound estimate of how difficult AI is by looking at how many bits of optimization evolution needed.
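As a toy version of that estimate, here is the shape of the calculation; the numbers are loose assumptions chosen only to illustrate the form, not serious values.

```python
import math

# Loose illustrative assumptions, not serious estimates.
generations = 1e9        # rough order of generations since early life
survival_fraction = 0.5  # fraction of each generation passing selection

# Selecting the surviving fraction conveys about log2(1/f) bits per generation.
bits_per_generation = math.log2(1 / survival_fraction)
total_bits = generations * bits_per_generation
print(f"~{total_bits:.0e} bits of optimization")  # ~1e9 with these inputs
```

The point is the form of the bound, not the number: however weak each generation's selection pressure is, the total optimization applied is capped by generations times bits per generation.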
Evolution is also a very clear example and source of intuition for thinking about Optimization Daemons.
Abstract Models
Pros: Abstract models can deal with systems that are much smarter than the other three sources. Instead of looking at the best optimizer we can actually run, we can think about the properties of the input that literally maximizes whatever we are optimizing. Abstract models also have more potential for rigor than the other sources, and conclusions drawn from them often apply to many different concrete systems.
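A toy illustration of that difference: we can reason about the input that literally maximizes an objective even when a realistic local optimizer never finds it. The objective below is invented purely for illustration.

```python
def reward(x: int) -> float:
    # Invented landscape: a broad local hill near x=20 and a
    # narrow global spike at x=97.
    if x == 97:
        return 100.0
    return 50.0 - abs(x - 20)

search_space = range(100)

# Abstract model: the input that literally maximizes the objective.
global_best = max(search_space, key=reward)

# Practical optimizer: greedy hill-climbing from a starting point.
def hill_climb(x: int, steps: int = 1000) -> int:
    for _ in range(steps):
        neighbors = [n for n in (x - 1, x + 1) if n in search_space]
        best = max(neighbors + [x], key=reward)
        if best == x:
            return x  # stuck on a local optimum
        x = best
    return x

local_best = hill_climb(0)
print(global_best, local_best)  # 97 vs. 20
```

The hill-climber halts on the broad local hill, while the literal maximizer sits on the narrow spike; reasoning about the latter is exactly what abstract models let us do for systems too smart to run.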
Cons: Abstract models can't be observed directly, so we have to depend on proofs that only tell us about some properties of the system. The insights we gain from abstract models might end up being very hard to port over into concrete systems. Abstract models often deal with upper bounds for capabilities, which may be unrealistic because they require more resources than our universe has access to.
Examples: AIXI is a useful specific abstract model of AGI. MIRI's Agent Foundations work is mostly trying to get data from thinking about abstract models. Concepts like the orthogonality thesis and convergent instrumental goals come from thinking about abstract models.