Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Not really novel, but I've made this point quite often in recent conversations, so I decided to make a short write-up.

I think a wide range of alignment proposals will somehow have to deal with ontology identification, i.e. mapping an AI's beliefs about the world to a representation humans can correctly understand. This could happen explicitly (e.g. via an ELK solution or strong interpretability), or be more hidden, or maybe the proposal avoids the problem in some clever way. But I think alignment proposals should have some answer to the question "How are you dealing with ontology identification?"

A class of examples

One high-level approach you could try to use to align AI goes as follows:

  1. Somehow get a model of human values. This could be a utility function, but might also look more like a preference ordering etc.
  2. Build an AI that optimizes according to this value model.

(If the value model is just a preference ordering, you might have to deal with the fact that it can be inconsistent, but we're ignoring those issues here.)

To be clear, this is not the only approach for building aligned AI (maybe not even the most promising one). It's a special case of separately solving outer alignment and inner alignment, which itself is a special case of solving alignment more broadly. But this still covers an important class of approaches.

I think any approach along those lines has to deal with ontology identification, for a pretty simple reason. The model of human values we get from step 1. will be defined in terms of the human ontology. For example, if we're learning a utility function, it will have type signature $U_H: \Omega_H \to \mathbb{R}$, where $\Omega_H$ is the set of world states as described in the human ontology. But for step 2., we need something the AI can optimize over, so we need a utility function of type signature $U_{AI}: \Omega_{AI} \to \mathbb{R}$, defined on states in the AI's ontology. This means that as a bridge between steps 1. and 2., we need a translation from the AI ontology to the human ontology, $\tau: \Omega_{AI} \to \Omega_H$. Then we can use $U_{AI} := U_H \circ \tau$.

Note that in practice, we will often collapse learning $U_H$ and specifying $\tau$ into a single step. For example, when doing RLHF on videos of trajectories, we're learning a reward function defined directly in terms of videos, rather than a human representation of what's happening in those videos. These videos also map easily into the RL agent's ontology (the agent's initial layers perform that mapping). So we don't need to explicitly use an ontology translation $\tau$ anywhere.

Implicitly, we're using the translation $\tau$ that satisfies $R = U_H \circ \tau$, where $R$ is the reward model, defined as a function on videos. $\tau$ maps from videos to the human's understanding of what's happening inside those videos, and $U_H$ are the "true" human values (which aren't explicitly represented anywhere in this setup).

This implicit translation $\tau$ corresponds to the human simulator from the ELK report: it maps videos to what humans think is happening when they see those videos. This leads to outcomes that look good to humans on video, rather than outcomes that are actually good. So when I say that alignment proposals have to "deal with ontology identification" somehow, I mean that they need a better map $\tau$ than this implicit one.
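As a minimal sketch of the type-signature argument, the Python snippet below composes a utility function defined on a human ontology with two different translations from an AI ontology. Everything in it is an illustrative assumption rather than part of any actual proposal: the class and function names are hypothetical, and the toy scenario borrows the diamond-in-a-vault setup from the ELK report. One translation is a "direct translator" that reports what is actually happening, the other is a human simulator that reports what a human would conclude from the video.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AIState:
    """Toy AI-ontology state (hypothetical): ground truth plus what the camera shows."""
    diamond_in_vault: bool
    video_shows_diamond: bool


@dataclass(frozen=True)
class HumanState:
    """Toy human-ontology state: the human concept 'the diamond is safe'."""
    diamond_safe: bool


Translation = Callable[[AIState], HumanState]


def u_human(h: HumanState) -> float:
    """Utility defined on the human ontology (the output of step 1)."""
    return 1.0 if h.diamond_safe else 0.0


def direct_translator(s: AIState) -> HumanState:
    """Translation that reports what is actually the case."""
    return HumanState(diamond_safe=s.diamond_in_vault)


def human_simulator(s: AIState) -> HumanState:
    """Translation that reports what a human would believe after watching the video."""
    return HumanState(diamond_safe=s.video_shows_diamond)


def make_u_ai(tau: Translation) -> Callable[[AIState], float]:
    """Utility on the AI ontology (the input to step 2): u_human composed with tau."""
    return lambda s: u_human(tau(s))


honest = AIState(diamond_in_vault=True, video_shows_diamond=True)
hacked = AIState(diamond_in_vault=False, video_shows_diamond=True)

u_direct = make_u_ai(direct_translator)
u_simulated = make_u_ai(human_simulator)

print(u_direct(honest), u_direct(hacked))        # 1.0 0.0
print(u_simulated(honest), u_simulated(hacked))  # 1.0 1.0
```

Under the human simulator, the hacked state scores as well as the honest one, so an optimizer has no reason to prefer the outcome that is actually good. The utility function on the human ontology is identical in both cases; only the translation differs.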

So far, we've focused on RLHF as an example, in which case this perspective doesn't add much. But I think the same argument applies much more broadly. For example, one response to the issues with RLHF is "Ah, we need a better model of human irrationality". In other words, we should do a lot of cognitive science to figure out precisely in which ways humans give feedback that doesn't reflect their true preferences. Then we can back out true human preferences from irrational human feedback. But even if we got this to work, the true human preferences would have the wrong type signature, and we'd need to find a good ontology translation again. Even if we could do enough cognitive science + brain scans, or some other magical procedure to figure out an explicit function representing human values, this would still not be enough without a good way to tie the ontology used in that representation to the ontology the AI is using.

I think similar arguments apply to proposals other than just those using steps 1. and 2. above. One straightforward extension is that steps 1. and 2. don't have to literally happen in sequence; the problem still exists if a model is optimizing for its current model of human values, while simultaneously improving that model.

I'm less sure exactly how broadly applicable the argument is. For example, I haven't thought as much about approaches that rely much more on corrigibility and less on having a good model of human values, though my best guess is that something similar still applies. My main goal here was to get across any intuition at all for why we might expect ontology identification to be a central obstacle.

Takeaways

I think alignment proposals should have an answer to the question "How are you dealing with ontology identification?" Valid answers could take different forms:

  • The proposal is explicitly using an ELK solution or something structurally similar.
  • There's no part of the proposal that's explicitly solving ELK, but there is still an implicit ontology translation happening somewhere. (Ideally, we should be able to analyze this implicit translation as an ELK proposal and evaluate it from that perspective).
  • The proposal has some way of avoiding ontology identification that gets around the argument in this post and similar ones. We should be able to explicitly say how it's avoiding ontology identification/what the clever idea is.
  • We're only using the AI for tasks where the human simulator translation is sufficient (e.g. only using RLHF in ways such that humans can correctly evaluate trajectories). Ideally, the proposal should be explicit about which tasks those are, because a key disagreement might be about whether those tasks can be safely done from human feedback.

I'm very interested in either pushback on these claims, or in more clear-cut and general arguments for why ontology identification is hard to avoid.

10 comments

Could you give a concrete example of why thinking about ontology identification is useful? Preferably with reference to examples that aren't of the form "the human is watching a video feed and giving feedback".

There are certainly lots of "foundational" problems that any AI alignment solution (or indeed any AI solution) "needs to" deal with, but which are still ~basically useless to think about. The classic examples of these are no-free-lunch theorems for ML/AI or how many "NP-hard" problems are still easily solvable in the average case. The ultimate example of this is Rice's theorem. Yes, every solution to AI alignment gets around complexity, computability, and statistical learning impossibility theorems, but I'm not sure what insights you get by thinking explicitly about this class of problems as opposed to more direct approaches.

Great point, some rambly thoughts on this: one way in which ontology identification could turn out to be like no-free-lunch theorems is that we actually just get the correct translation by default. I.e. in ELK report terminology, we train a reporter using the naive baseline and get the direct translator. This seems related to Alignment by default, and I think of them the same way (i.e. "This could happen, but it seems very scary to rely on that without better arguments for why it should happen"). I'd say one reason we don't think much about no-free-lunch theorems as a key obstacle to AI is that we've seen tons of cases where good generalization happens because the world is low entropy. I don't think we've seen that kind of evidence for ontology identification not being a problem in practice. That said, "I think ontology identification will be easy, here's why" is another valid response to the question from this post.

A related point would be "Should we think about ontology identification explicitly, or just work on other stuff and eventually solve it implicitly?" My first instinct is to directly tackle ontology identification, but I could see cases where a solution to ontology identification is actually easier to find from another lens. I do think though that that other lens will have to tackle a similarly difficult and central problem; just working on approaches that essentially assume away the ontology identification problem will very likely not lead to progress on ontology identification.

For examples, do you mean examples of thinking about ontology identification being useful to solve ontology identification, or examples of how a solution would be helpful for alignment?

> For examples, do you mean examples of thinking about ontology identification being useful to solve ontology identification, or examples of how a solution would be helpful for alignment?

I’m asking for examples of specific problems in alignment where thinking of ontology identification is more helpful than just thinking about it the usual or obvious way.

I might not have exactly the kind of example you're looking for, since I'd frame things a bit differently. So I'll just try to say more about the question "why is it useful to explicitly think about ontology identification?"

One answer is that thinking explicitly about ontology identification can help you notice that there is a problem that you weren't previously aware of. For example, I used to think that building extremely good models of human irrationality via cogsci for reward learning was probably not very tractable, but could at least lead to an outer alignment solution. I now think you'd also have to solve ontology identification, so I'm now very skeptical of that approach. As you point out in another comment, you could technically treat ontology identification as part of human irrationality (not sure if you'd call this the "usual/obvious way" in this setting?). But what you notice when separating out ontology identification is that if you have some way of solving the ontology identification part, you should probably just use that for ELK and skip the part where you model human irrationalities really well.

Another part of my answer is that ontology identification is not an obviously better frame for any single specific problem, but it can be used as a unifying frame to think about problems that would otherwise look quite different. So some examples of where ontology identification appears:

  • The ELK report setting: you want to give better informed preference comparisons
  • The case I mentioned above: you've done some cognitive science and are able to learn/write down human rewards in terms of the human ontology, but still need to translate them
  • You think that your semi-supervised model already has a good understanding of what human values/corrigibility/... are, and your plan is to retarget the search or to otherwise point an optimizer at this model's understanding of human values. But you need to figure out where exactly in the AI human values are represented
  • To prevent your AI from becoming deceptive, you want to be able to tell whether it's thinking certain types of thoughts (such as figuring out whether it could currently take over the world). This means you have to map AI thoughts into things we can understand
  • You want clear-cut criteria for deciding whether you're interpreting some neuron correctly. This seems very similar to asking "How do we determine whether a given ontology translation is correct?" or "What does it even mean for an ontology translation to be 'correct'?"

I think ontology identification is a very good framing for some of these even individually (e.g. getting better preference comparisons), and not so much for others (e.g. if you're only thinking about avoiding deception, ontology identification might not be your first approach). But the interesting thing is that these problems seemed pretty different to me without the concept of ontology identification, but suddenly look closely related if we reframe them.

Makes sense, thanks for the reply!

For what it’s worth, I do think strong ELK is probably more tractable than the whole cog sci approach for preference learning.

Great post!

If you haven't seen it yet, it might be worth taking a look at the discussion of Ontology Identification on Arbital.

One nitpicky comment:

> For example, one response to the issues with RLHF is "Ah, we need a better model of human irrationality". In other words, we should do a lot of cognitive science to figure out precisely in which ways humans give feedback that doesn't reflect their true preferences. Then we can back out true human preferences from irrational human feedback. But even if we got this to work, the true human preferences would have the wrong type signature, and we'd need to find a good ontology translation again.

Well, technically, since we're observing  (where d is the planner) anyways, a model of irrationality could just actually "include the translation" to . Specifically, we could also factor it as: .  So there's a trivial sense in which "better models of human irrationality" are sufficient.  :^)

For example, if the human doesn't know that the AI has stolen the diamond from the vault in the ELK problem, and gives a high reward to a hacked video feed, we can say this is an ontology identification problem, because the human doesn't understand what the AI knows. But it's "equally" valid to imagine that the human has a reward function grounded in the actual world and their planner is just biased to give high reward when the video feed shows a diamond!
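To make the "equally valid" factorization concrete, here is a toy sketch in Python. The state encoding and all function names are assumptions invented for illustration. Factorization A puts the human's reward on their own beliefs and treats the belief formation as a human-simulator translation; factorization B puts the reward on the actual world and blames a planner that evaluates the world as it appears on video. Both predict identical feedback on every state.

```python
from itertools import product

# Toy world state (hypothetical encoding): (diamond_in_vault, video_shows_diamond)
STATES = list(product([False, True], repeat=2))


# Factorization A: reward lives in the human ontology, feedback is "rational".
def human_belief(state):
    """Human-simulator translation: the human infers safety from the video alone."""
    _, video_shows_diamond = state
    return video_shows_diamond


def reward_on_belief(believes_safe):
    return 1.0 if believes_safe else 0.0


def feedback_a(state):
    return reward_on_belief(human_belief(state))


# Factorization B: reward is grounded in the actual world, planner is biased.
def reward_on_world(state):
    diamond_in_vault, _ = state
    return 1.0 if diamond_in_vault else 0.0


def biased_planner(reward_fn, state):
    """Bias: evaluates the reward on the world as it appears on video."""
    _, video_shows_diamond = state
    apparent_state = (video_shows_diamond, video_shows_diamond)
    return reward_fn(apparent_state)


def feedback_b(state):
    return biased_planner(reward_on_world, state)


for state in STATES:
    print(state, feedback_a(state), feedback_b(state))
```

Since the two columns of feedback agree on all four states, observed feedback alone cannot tell the two stories apart, even though the reward functions disagree on the hacked state `(False, True)`.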

That being said, (IMO) a lot of the hope for "better models of human rationality" is to find conditions under which humans are less irrational or more informative. And I'd guess that you need to do AI-assisted cognitive science to ground the human biases into the AI's world model. So I'm not sure there's much difference in practice? 

> Well, technically, since we're observing  (where d is the planner) anyways, a model of irrationality could just actually "include the translation" to . Specifically, we could also factor it as: .  So there's a trivial sense in which "better models of human irrationality" are sufficient.  :^)

Ah yes, the trivial model of humans that says "whatever they do, that's what they want."

Is there some argument similar to a type signature argument that would rule out such poorly-generalizing approaches, though?

Like, clearly what's going wrong is that these things have identical input-output types - but the trivial model is learned in a totally different way, so you could just track types through... well, okay, through a complete specification of value learning. I guess such an argument wouldn't be very useful, though, because it would be too entangled with the specific program used to do value learning, so it wouldn't be very good at proving that no other type would count as a solution.

> Ah yes, the trivial model of humans that says "whatever they do, that's what they want."

That's a different type of trivial solution, but the idea is broadly similar: because the space of reasonable planners is so large, there are lots of ways to slice up the policy into reward + bias. (As Paul puts it, the easy reward inference problem is still hard.)
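A tiny illustration of this degeneracy, as a hedged sketch (the action names, reward values, and planner names are all invented for the example): three different (reward, planner) pairs that produce exactly the same observed choice.

```python
ACTIONS = ["tea", "coffee", "water"]


def rational(reward):
    """Planner that picks the highest-reward action."""
    return max(ACTIONS, key=lambda a: reward[a])


def anti_rational(reward):
    """Planner that picks the lowest-reward action."""
    return min(ACTIONS, key=lambda a: reward[a])


def habitual(reward):
    """Planner that ignores the reward entirely and always picks coffee."""
    return "coffee"


likes_coffee = {"tea": 0.2, "coffee": 1.0, "water": 0.5}
hates_coffee = {a: -r for a, r in likes_coffee.items()}
no_preferences = {a: 0.0 for a in ACTIONS}

decompositions = [
    (likes_coffee, rational),
    (hates_coffee, anti_rational),
    (no_preferences, habitual),
]

observed = [planner(reward) for reward, planner in decompositions]
print(observed)  # ['coffee', 'coffee', 'coffee']
```

Behavior alone cannot distinguish these decompositions, so any preference between them has to come from extra assumptions about what human planners look like.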

> Is there some argument similar to a type signature argument that would rule out such poorly-generalizing approaches, though?

I don't think a type signature would help: the problem is that the class of planners/biases and the class of rewards are way too unconstrained. We probably need to just better understand what human planners are via cognitive science (or do a better translation between AI concepts and Human concepts).

That being said, there's been a paper that I've put off writing for two years now, on why the Armstrong/Mindermann NFL theorem is kind of silly and how to get around it, maybe I will write it at some point :)

> I'm less sure exactly how broadly applicable the argument is. For example, I haven't thought as much about approaches that rely much more on corrigibility and less on having a good model of human values, though my best guess is that something similar still applies. My main goal here was to get across any intuition at all for why we might expect ontology identification to be a central obstacle.

My guess is it applies way less---a lot of the hope is that the basin of corrigibility is "large enough" that it should be obvious what to do, even if you can't infer the whole human value function. 

There is a chance that one can avoid having to solve ontology identification in general if one punts the problem to simulated humans. I.e., it seems one can train the human simulator without solving it, and then use simulated humans to solve the problem. One may have to solve some specific ontology identification problems to make sure one gets an actual human simulator and not e.g. a malign AI simulator. However, this might be easier than solving the problem in full generality.

Minor comment: regarding the RLHF example, one could solve the problem implicitly if one is able to directly define a likelihood function over utility functions defined in the AI's ontology, given human behavior. Though you probably correctly assume that e.g. cognitive science would produce a likelihood function over utility functions in the human ontology, in which case ontology identification still has to be solved explicitly.
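One way such a likelihood could look, purely as an illustrative sketch: assume the human chooses between pairs of outcomes with a logistic (Bradley-Terry style) choice rule applied to a candidate utility function that is defined directly on AI-ontology states. The choice rule, the `beta` parameter, the state encoding, and the candidate utilities are all assumptions made here for illustration; the comment above does not specify any of them.

```python
import math


def choice_likelihood(utility, chosen, rejected, beta=1.0):
    """P(human prefers `chosen` over `rejected` | utility), logistic in the utility gap."""
    gap = utility(chosen) - utility(rejected)
    return 1.0 / (1.0 + math.exp(-beta * gap))


def log_likelihood(utility, comparisons, beta=1.0):
    """Log-likelihood of a list of (chosen, rejected) pairs under `utility`."""
    return sum(math.log(choice_likelihood(utility, c, r, beta)) for c, r in comparisons)


# Candidate utilities defined directly on (hypothetical) AI-ontology states.
def u_diamond(state):
    return 1.0 if state["diamond_in_vault"] else 0.0


def u_video(state):
    return 1.0 if state["video_shows_diamond"] else 0.0


honest = {"diamond_in_vault": True, "video_shows_diamond": True}
hacked = {"diamond_in_vault": False, "video_shows_diamond": True}
empty = {"diamond_in_vault": False, "video_shows_diamond": False}

# Observed human choices: the first element of each pair was preferred.
comparisons = [(honest, empty), (hacked, empty)]

ll_diamond = log_likelihood(u_diamond, comparisons)
ll_video = log_likelihood(u_video, comparisons)
print(ll_diamond, ll_video)
```

With feedback from a human who can only see the video, the video-based utility gets the higher likelihood, which is the same failure as before. Defining the likelihood over AI-ontology utilities moves the ontology identification problem into the choice of likelihood rather than removing it.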