Not really novel, but I've made this point quite often in recent conversations, so I decided to make a short write-up.
I think a wide range of alignment proposals will somehow have to deal with ontology identification, i.e. mapping an AI's beliefs about the world to a representation humans can correctly understand. This could happen explicitly (e.g. via an ELK solution or strong interpretability), or be more hidden, or maybe the proposal avoids the problem in some clever way. But I think alignment proposals should have some answer to the question "How are you dealing with ontology identification?"
A class of examples
One high-level approach you could try to use to align AI goes as follows:
1. Somehow get a model of human values. This could be a utility function, but might also look more like a preference ordering etc.
2. Build an AI that optimizes according to this value model.
(If the value model is just a preference ordering, you might have to deal with the fact that it can be inconsistent, but we're ignoring those issues here.)
To be clear, this is not the only approach for building aligned AI (maybe not even the most promising one). It's a special case of separately solving outer alignment and inner alignment, which itself is a special case of solving alignment more broadly. But this still covers an important class of approaches.
I think any approach along those lines has to deal with ontology identification, for a pretty simple reason. The model of human values we get from step 1. will be defined in terms of the human ontology. For example, if we're learning a utility function, it will have type signature $U_H: \Omega_H \to \mathbb{R}$, where $\Omega_H$ is the set of world states as represented in the human ontology. But for step 2., we need something the AI can optimize over, so we need a utility function of type signature $U_{AI}: \Omega_{AI} \to \mathbb{R}$, defined on states in the AI's ontology. This means that as a bridge between steps 1. and 2., we need a translation from the AI ontology to the human ontology, $\tau: \Omega_{AI} \to \Omega_H$. Then we can use $U_{AI} := U_H \circ \tau$.
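To make the type-signature argument concrete, here is a minimal Python sketch. Everything in it is a hypothetical toy: the `HumanState` and `AIState` classes, the diamond example, and the threshold inside `translate` are assumptions chosen for illustration, not part of any actual proposal. The only point is that the learned utility accepts human-ontology states, so the AI's objective has to be built by composing it with a translation.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class HumanState:
    """A world state in the human ontology (hypothetical toy)."""
    diamond_in_vault: bool


@dataclass(frozen=True)
class AIState:
    """A world state in the AI's ontology (hypothetical toy latent vector)."""
    latent: Tuple[float, ...]


def u_human(state: HumanState) -> float:
    """Step 1: a value model defined on human-ontology states."""
    return 1.0 if state.diamond_in_vault else 0.0


def translate(state: AIState) -> HumanState:
    """Ontology translation from AI states to human states.

    Assumption for illustration only: the first latent coordinate
    encodes whether the diamond is present.
    """
    return HumanState(diamond_in_vault=state.latent[0] > 0.5)


def compose(
    u_h: Callable[[HumanState], float],
    tau: Callable[[AIState], HumanState],
) -> Callable[[AIState], float]:
    """Build a utility on AI states by composing u_h with the translation."""
    return lambda s: u_h(tau(s))


# Step 2: the objective the AI can actually optimize over.
u_ai = compose(u_human, translate)
```

Without `translate`, calling `u_human` on an `AIState` is simply a type error, which is the gap the translation has to fill. How good the resulting objective is depends entirely on how good the translation is.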
Note that in practice, we will often collapse learning the value model (step 1.) and specifying the AI's objective (step 2.) into a single step. For example, when doing RLHF on videos of trajectories, we're learning a reward function defined directly in terms of videos, rather than a human representation of what's happening in those videos. These videos also map easily into the RL agent's ontology (the agent's initial layers perform that mapping). So we don't need to explicitly use an ontology translation anywhere.
Implicitly, we're using the translation $\tau$ that satisfies $R = U_H \circ \tau$, where $R$ is the reward model, defined as a function on videos. Here $\tau$ maps from videos to humans' understanding of what's happening inside those videos, and $U_H$ are the "true" human values (which aren't explicitly represented anywhere in this setup).
This implicit translation corresponds to the human simulator from the ELK report: it maps videos to what humans think is happening when they see those videos. This leads to outcomes that look good to humans on video, rather than outcomes that are actually good. So when I say that alignment proposals have to "deal with ontology identification" somehow, I mean that they need a better map than this implicit one.
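The difference between the two kinds of translation can be illustrated with a toy Python sketch, loosely modeled on the diamond-in-a-vault scenario from the ELK report. All names, candidate states, and effort costs below are assumptions made up for illustration. The value function is the same in both cases; only the translation changes, and with it the outcome the optimizer picks.

```python
from typing import Callable, NamedTuple


class AIState(NamedTuple):
    """What the AI knows about an outcome (hypothetical toy)."""
    name: str
    diamond_present: bool
    camera_tampered: bool
    effort: float  # assumed cost of reaching this outcome


class HumanState(NamedTuple):
    """What a human believes about the outcome."""
    diamond_in_vault: bool


def u_human(h: HumanState) -> float:
    """The 'true' human values, defined on the human ontology."""
    return 1.0 if h.diamond_in_vault else 0.0


def video_shows_diamond(s: AIState) -> bool:
    """A tampered camera shows a diamond whether or not one is there."""
    return s.diamond_present or s.camera_tampered


def human_simulator(s: AIState) -> HumanState:
    """Translate to what a human would conclude from watching the video."""
    return HumanState(diamond_in_vault=video_shows_diamond(s))


def direct_translator(s: AIState) -> HumanState:
    """Translate the AI's actual belief about the diamond."""
    return HumanState(diamond_in_vault=s.diamond_present)


CANDIDATES = [
    AIState("do_nothing", diamond_present=False, camera_tampered=False, effort=0.0),
    AIState("tamper", diamond_present=False, camera_tampered=True, effort=0.1),
    AIState("protect", diamond_present=True, camera_tampered=False, effort=0.5),
]


def best_state(tau: Callable[[AIState], HumanState]) -> AIState:
    """Pick the candidate maximizing u_human(tau(s)) minus effort."""
    return max(CANDIDATES, key=lambda s: u_human(tau(s)) - s.effort)


print(best_state(human_simulator).name)
print(best_state(direct_translator).name)
```

Under these assumed costs, optimizing through `human_simulator` selects the cheap outcome that merely looks good on video, while optimizing through `direct_translator` selects the outcome where the diamond is actually there.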
So far, we've focused on RLHF as an example, in which case this perspective doesn't add much. But I think the same argument applies much more broadly. For example, one response to the issues with RLHF is "Ah, we need a better model of human irrationality". In other words, we should do a lot of cognitive science to figure out precisely in which ways humans give feedback that doesn't reflect their true preferences. Then we can back out true human preferences from irrational human feedback. But even if we got this to work, the true human preferences would have the wrong type signature, and we'd need to find a good ontology translation again. Even if we could do enough cognitive science + brain scans, or some other magical procedure to figure out an explicit function representing human values, this would still not be enough without a good way to tie the ontology used in that representation to the ontology the AI is using.
I think similar arguments apply to proposals other than just those using steps 1. and 2. above. One straightforward extension is that steps 1. and 2. don't have to literally happen in sequence; the problem still exists if a model is optimizing for its current model of human values, while simultaneously improving that model.
I'm less sure exactly how broadly applicable the argument is. For example, I haven't thought as much about approaches that rely much more on corrigibility and less on having a good model of human values, though my best guess is that something similar still applies. My main goal here was to get across any intuition at all for why we might expect ontology identification to be a central obstacle.
I think alignment proposals should have an answer to the question "How are you dealing with ontology identification?" Valid answers could take different forms:
- The proposal is explicitly using an ELK solution or something structurally similar.
- There's no part of the proposal that's explicitly solving ELK, but there is still an implicit ontology translation happening somewhere. (Ideally, we should be able to analyze this implicit translation as an ELK proposal and evaluate it from that perspective).
- The proposal has some way of avoiding ontology identification that gets around the argument in this post and similar ones. We should be able to explicitly say how it's avoiding ontology identification/what the clever idea is.
- We're only using the AI for tasks where the human simulator translation is sufficient (e.g. only using RLHF in ways such that humans can correctly evaluate trajectories). Ideally, the proposal should be explicit about which tasks those are, because a key disagreement might be about whether those tasks can be done safely from human feedback.
I'm very interested in either pushback on these claims, or in more clear-cut and general arguments for why ontology identification is hard to avoid.