In this post, I introduce a conceptual tool for thinking about the epistemic landscape of AI alignment and then describe three epistemic strategies for making progress on the alignment problem: 1) tinkering, 2) idealisation and 3) intelligence-in-the-wild.
How to make progress in AI alignment?
The future of AI progress is likely to critically shape, if not practically determine, the future of humanity, and of sentient life in general. Will our future look more like a world filled to the brim with things we find valuable, as well as sentient creatures to enjoy that goodness? Or will our future look more like one characterized by violence, mistrust, inequality, suffering, or even the absence of anything sentient at all?
As our civilization improves its ability to engineer and instantiate sophisticated forms of complex and intelligent behaviour in artificial systems, it becomes imperative to think carefully about which objectives this intelligence is being directed at, and how to do such "directing" robustly. Thus, the central question becomes: how can we make sure that the ‘big effects’ caused by AI progress will be positive rather than harmful, safe rather than dangerous, helping to promote, discover, and enrich what is dear to us rather than destroying it? This is the guiding question of AI alignment, as I understand it.
Once we have established the importance of the problem, the next question that stands out is: How can we make progress?
This is a central question for the “philosophy of science” of AI alignment, as I see it. For example, in order to help answer the question of (the best) ways to make progress on the problem, we can reflect on its shape or structure. We can thus notice how one important defining characteristic of the alignment problem is that it concerns systems that do not exist yet—let’s call this the problem of “epistemic access”. This means that our typical epistemic strategies of science and engineering are less effective here than they are for a range of other societal problems (e.g., finding cures for diseases, improving the yield of a certain crop, building ever taller houses, etc.).
In this post, I will describe how I currently like to think about the landscape of epistemic strategies for making progress in AI alignment. Of course, there are several plausible ways of carving up the space of epistemic strategies, and each of them may serve different purposes. The tri-partition of epistemic strategies I introduce here is specifically grounded in asking what strategies we can adopt for overcoming the challenge of epistemic access, as introduced above.
Strategies for exploring the space of intelligent behaviour
So, what different epistemic strategies can we use in order to (hopefully) make progress on AI alignment? Let’s start by introducing the following conceptual tool.
Imagine a space (of uncertain size) that corresponds to all possible manifestations of intelligent behaviour. Abstractly speaking, different epistemic strategies correspond to different approaches to charting out this possibility space, and the endeavour of AI alignment at large (roughly) corresponds to learning to safely navigate movement through this (design) space.
Now, let’s consider different ways of charting this space out.
Strategy 1: Tinkering
One approach is to work with contemporary ML paradigms and explore, with the help of empirical methods and trial and error, how those systems behave, how they fail to be safe and aligned, and what it would look like for them to be safe and aligned (1 in Fig. 1). The hope is that (some of) those insights will generalize to more advanced AI systems.
In terms of our figure, this approach explores the possibility space of intelligent behaviour by tinkering at the boundaries of the subspace that we currently know how to implement with ML technology.
Most of what is commonly referred to as “AI safety research” will fall into this category. Here are some examples:
- Risks in current and future ML systems, e.g., Amodei, Olah et al. (2016). Concrete Problems in AI Safety.
- Reward modeling, e.g., Leike, Krueger et al. (2018). Scalable agent alignment via reward modeling: a research direction.
- Iterated amplification, e.g., Christiano, Shlegeris, Amodei (2018). Supervising strong learners by amplifying weak experts.
- Interpretability, e.g., Olah, Cammarata et al. (2020). Zoom In: An Introduction to Circuits.
- And many more…
Strategy 2: Idealization
The second approach I want to delineate here seeks theoretical clarity by starting from idealized frameworks of reasoning (e.g., decision theory, epistemology) and exploring, via extrapolation, what the in-the-limit behaviour of “Idealized Agents” would look like and what it would take for them to be safe and aligned (2 in Fig. 1).
Compared to the former approach, looking at idealized frameworks and in-the-limit behaviour leads to different assumptions about what advanced AI systems will look like (i.e., not necessarily prosaic). (There are some reasons to think those assumptions are more relevant, and some reasons to think that they are less relevant.) At the same time, idealized frameworks face some challenges due to the difficulty of achieving empirical grounding compared to the “tinkering” approach.
Examples of work that count towards this approach include:
- Decision Theory, e.g., Yudkowsky and Soares (2017). Functional Decision Theory: A New Theory of Instrumental Rationality.
- Logical Uncertainty, e.g., Garrabrant et al. (2016). Logical Induction.
- Agent Foundations, e.g., Garrabrant and Demski (2018). Embedded agency.
- Infra-Bayesianism, e.g., Vanessa Kosoy (2021). Infra-Bayesian physicalism: a formal theory of naturalized induction.
- Power Seeking in Optimal Agents, e.g., Alexander Turner (2021). The Causes of Power-seeking and Instrumental Convergence.
Strategy 3: Learning from “intelligence-in-the-wild”
Finally, the third approach attempts to chart out the possibility space of intelligent behaviour by looking at how intelligent behaviour manifests in existing natural systems (3 in Fig. 1). (This is, notably, the epistemic approach that PIBBSS is trying to explore and foster.)
Treating intelligent behaviour (and related features of interest) as a naturally occurring phenomenon suggests a sort of “realism” about intelligence. It’s not just some theoretical construct but something that we can observe every day in the real world. This strategy involves a certain type of empiricism (although of a different nature than the empiricism of the “tinkering” approach).
Let’s look at some examples of how this approach has already been used in alignment research (although I’d argue it remains, overall, rarer than work along the lines of the other two approaches):
- Analogies from the human brain, à la Steve Byrnes (2021). Value loading in the human brain: a worked example (among others).
  - …uses neuroscience as a source of analogies for alignment.
- Analogies from real-world systems, à la John Wentworth (2020). Characterizing Real-World Agents as a Research Meta-Strategy (among others).
  - …sets out to understand agency as a real-world phenomenon by looking at systems in physics, biology, economics, and beyond.
- Analogies from social systems, à la Critch (2021). What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes.
- Analogies from biological systems, à la Hubinger et al. (2019). Risks from Learned Optimization in Advanced ML Systems.
  - …explicitly uses the biological evolution of humans as an example of, and inspiration for, mesa-optimizers. Evolution here is the search process, with a base objective of increasing fitness and propagating genes, but the learned models (humans) end up pursuing different goals.
Figure 1: Epistemic strategies for making progress in AI alignment as charting the possibility space of intelligent behaviour; 1. Tinkering; 2. Idealisation; 3. Intelligence-in-the-wild
- We can liken progress in AI alignment to charting out and learning how to navigate the possibility space of intelligent behaviour. Different research strategies in AI alignment correspond to different ways of charting out that space. For example:
  - Tinkering is an epistemic approach whereby we explore the boundaries of the intelligent behaviour that we currently know how to implement in ML systems, using empirical and experimental methods.
  - Idealization uses idealised frameworks of reasoning and extrapolation to explore the possibility space of intelligent behaviour “in-the-limit”.
  - Intelligence-in-the-wild seeks to chart out the possibility space of intelligent behaviour by investigating how it manifests in existing (natural) systems. It is based on an assumption of “realism about intelligence” (see future post for more details).
Acknowledgements: I thank TJ/particlemania for discussing and improving many of the ideas in this post; I also thank Adam Shimi, Gavin Leech, Cara Selvarajah, Eddie Jean and Justis for useful comments on earlier drafts.
For more insight into the specific epistemic challenges we face in AI alignment, consider checking out other examples of Adam Shimi’s writing, e.g., On Solving Problems Before They Appear: The Weird Epistemologies of Alignment, Epistemological Vigilance for Alignment, or Robustness to Scaling Down: More Important Than I Thought.
There are different ways one can choose to taxonomise the space. For example, a common distinction is the one between “de-confusion” and “problem-solving” work. I consider this distinction orthogonal to the three-way classification I am introducing here. In particular, my view is that the distinction introduced in this post (i.e., “tinkering”, “idealisation”, and “intelligence-in-the-wild”) is about what types of systems are being theorised about (roughly speaking: SOTA ML systems, idealized systems, and natural systems), or where the theories seek empirical validity. In this post [forthcoming], I discuss, in more detail, what sorts of systems and empirical basis the intelligence-in-the-wild approach is interested in and based on. (Thanks to TJ for helping me to make this clarification explicit.)
It is important to note that these are not strict divisions, and some research agendas might involve both tinkering insights and idealisations. IDA is a good example of that. However, for the purposes of this post, I'm tentatively putting it in the tinkering list.
"Prosaic" is used in somewhat different ways by different people. Here, it is used in its roughly technical sense, meaning approximately: "using extensions of current SOTA ML".