(Context: my last post was trying to patch a certain naive strategy for AI alignment, but I didn’t articulate clearly what the naive strategy is. I think it’s worth explaining the naive strategy in its own post, even though it’s not a novel idea.)

Suppose that I jointly train an AI to do some task (e.g. make money for me) and to answer a wide range of questions about what is happening in the world (e.g. “why did Alice just wire $1000 into my bank account?” or “what is Bob thinking right now?”).

I generate training data for the QA task in a really simple way: I choose a subset of questions that humans are able to reliably answer, collect human answers to them, and use those question-answer pairs as a training set for supervised learning. I’ll call this the naive training strategy.
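To make the setup concrete, here is a minimal PyTorch-style sketch of what the joint training loop could look like. The interface (a `task_loss` method, an `answer` method, and the batch format) is invented purely for illustration and isn’t part of the proposal; the only point is that a single model is optimized on the task objective and on a supervised QA loss over human-answerable questions at the same time.

```python
# Minimal sketch of the naive training strategy (illustrative only).
# `model.task_loss`, `model.answer`, and the batch layout are hypothetical
# interfaces invented for this example.
import torch
import torch.nn.functional as F

def naive_training_step(model, task_batch, qa_batch, optimizer):
    # Loss on the primary task (e.g. making money), however it is scored.
    task_loss = model.task_loss(task_batch)

    # Supervised QA loss, computed only on questions that humans
    # can reliably answer (the curated training set described above).
    logits = model.answer(qa_batch["questions"])
    qa_loss = F.cross_entropy(logits, qa_batch["human_answers"])

    # Jointly optimize both objectives with SGD.
    loss = task_loss + qa_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```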

I’d like for my AI to tell me everything it knows. If the AI bought a stock because it expects a merger announcement soon, I want it to tell me about the predicted merger announcement. If the AI predicts a merger announcement because it inferred that executives of the companies have been in extensive talks over the last month, I want it to tell me about those talks.

I’m not asking the AI to explain why it made a given decision; I’m asking it to tell me as much as it can about the world. The important property is that if the AI “knows” something and uses that knowledge to perform the task well, then it also uses that knowledge to answer questions well.

Why might this work? The hope is that “answer questions honestly to the best of your ability” is a natural thing for our AI to learn — that there is some simple way to translate from the AI’s model of the world into natural language and to honestly report what it believes. If our training dataset is good, then this policy will score well, and we can hope that SGD will find it. I’ll call this the intended policy.

Why might this not work? The concern is that “predict how a human would answer questions” is also a natural thing for our AI to learn, especially if the AI is doing a task that already requires predicting humans. Predicting humans also gets a low loss on the training set, but it generalizes poorly once we start asking our AI questions that a human couldn’t have answered on their own.

If we’re worried about this, we could use a loss function other than "predict what humans would say." But regardless of what loss function we choose, the policy could still be trying to game it. If SGD learns to "game the loss function," then we’ll generalize poorly in any case where humans can’t distinguish a good question-answerer from a bad one.

I’ll call “game the loss function” the instrumental policy. It answers questions well only because doing so is instrumentally useful for getting a low loss, and so it will start answering questions badly if that situation changes. (This is closely related to deceptive alignment, but I think the concept makes sense without defining alignment or optimizers; there’s also a difference in emphasis in that I’m often considering models which are “intrinsically motivated” to game the loss function rather than doing so in order to gain influence. I first introduced this term in Inaccessible Information.)

What I’m doing now

I’m trying to dig into a bunch of reasons why the naive training strategy might fail, and to understand whether there is a way to modify the naive strategy to avoid those problems.

In my last post I discussed two reasons that the naive training strategy might learn the instrumental policy instead of the intended policy:

  1. There may be predictable errors in the training data, such that the instrumental policy actually gets a lower loss than answering honestly (because it responds strategically to errors).
  2. If the AI already needs to make predictions about humans, then it can reuse that machinery to game the loss function, whereas it may have to build new machinery to translate its thoughts into natural language.

I tried to describe a modified algorithm that avoids those pitfalls. I don’t think I succeeded, but I do feel reasonably optimistic that some approach can address these two problems.

Unfortunately, there are many further reasons that the naive training strategy could fail. I’m currently spending time trying to understand those issues and figure out which, if any, are most likely to represent fundamental roadblocks.

Where others stand

I’ve had a lot of conversations about alignment with ML researchers over the last 7 years. My impression is that a large fraction of optimists expect to find some strategy that gets smart enough models to generalize in the intended way (honestly reporting their beliefs) rather than learning to game the loss function.

The fact that this is part of the “consensus” optimism about alignment makes it particularly appealing to investigate cases where it seems hard to get the intended generalization.

At the other extreme, I think many alignment researchers are very skeptical of this entire project. I think the main way I disagree with them is methodological: before concluding that the problem is hard, I want to try to find the simplest hard case.

Relationship to my other work

I’ve traditionally avoided the naive training strategy, in large part because I’m scared that it will learn the instrumental policy instead of the intended policy.

I still believe that you need to do something like iterated amplification and imitative generalization in order to avoid this problem. However, I think that a working strategy that combines all of these ideas may have more in common with the naive training strategy than I’d initially expected.

For example, I now think that the representations of “what the model knows” in imitative generalization will sometimes need to use neural networks to translate between what the model is thinking and human language. Once you go down that road, you encounter many of the difficulties of the naive training strategy. This is an update in my view; I’ll likely go into more detail in a future post.

Comments

> There may be predictable errors in the training data, such that the instrumental policy actually gets a lower loss than answering honestly (because it responds strategically to errors).

If you are answering questions as text, there is a lot of choice in wording. There are many strings of text that count as a correct answer, and the AI has to pick the one the human would use. In order to predict how a human would word an answer, you need a fairly good understanding of how they think (I think).

I agree you have to do something clever to make the intended policy plausibly optimal.

The first part of my proposal in section 3 here was to avoid using "imitate humans," and to instead learn a function "Answer A is unambiguously worse than answer B." Then we update against policies only when they give unambiguously worse answers.

(I think this still has a lot of problems; it's not obvious to me whether the problem is soluble.)
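To illustrate the shape of that update rule, here is a rough sketch. Every name (`policy.sample_with_log_prob`, `comparator`) and the 0.9 confidence cutoff are invented for this example rather than taken from the linked proposal; the point is only that ambiguous comparisons produce no training signal.

```python
# Illustrative sketch: only update against answers judged unambiguously
# worse than a reference answer. All interfaces here are hypothetical.
import torch

def update_against_worse_answer(policy, comparator, question,
                                reference_answer, threshold=0.9):
    # Sample an answer and keep its log-probability under the policy.
    answer, log_prob = policy.sample_with_log_prob(question)

    # Ask the learned comparison function for the probability that the
    # sampled answer is unambiguously worse than the reference answer.
    with torch.no_grad():
        p_worse = comparator(question, answer, reference_answer)

    # Only update against the policy when the comparator is confident;
    # ambiguous cases are left alone.
    if p_worse > threshold:
        loss = log_prob  # minimizing this pushes down P(worse answer)
        loss.backward()
        return loss.item()
    return None
```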

Planned summary for the Alignment Newsletter:

We want to build an AI system that answers questions honestly, to the best of its ability. One obvious approach is to have humans generate answers to questions, select the question-answer pairs where we are most confident in the answers, and train an AI system on those question-answer pairs.

(I’ve described this with a supervised learning setup, but we don’t have to do that: we could also [learn](https://deepmind.com/blog/learning-through-human-feedback/) from [comparisons](https://ai-alignment.com/optimizing-with-comparisons-c02b8c0d7877) between answers, and we only provide comparisons where we are confident in the comparison.)

What will the AI system do on questions where we _wouldn’t_ be confident in the answers? For example, questions that are complex, where we may be misled by bad observations, where an adversary is manipulating us, etc.

One possibility is that the AI system learned the **intended policy**, where it answers questions honestly to the best of its ability. However, there is an **instrumental policy** which also gets good performance: it uses a predictive model of the human to say whatever a human would say. (This is “instrumental” in that the model is taking the actions that are instrumental to getting a low loss, even in the test environment.) This will give incorrect answers on complex, misleading, or manipulative questions -- _even if_ the model “knows” that the answer is incorrect.

Intuitively, “answer as well as you can” feels like a much simpler way to give correct answers, and so we might expect to get the intended policy rather than the instrumental policy. This view (which seems common amongst ML researchers) is _optimism about generalization_: we are hoping that the policy generalizes to continue to answer these more complex, misleading, manipulative questions to the best of its ability.

Are there reasons to instead be pessimistic about generalization? There are at least three:

1. If the answers we train on _aren’t_ perfectly correct, the instrumental policy might get a _lower_ training loss than the intended policy (which corrects errors that humans make), and so be more likely to be found by gradient descent.

2. If the AI already needs to make predictions about humans, it may not take much “additional work” to implement the instrumental policy. Conversely, if the AI reasons at a different level of abstraction than humans, it may take a lot of “additional work” to turn correct answers in the AI’s ontology into correct answers in human ontologies.

3. From [a followup post](https://www.alignmentforum.org/posts/SRJ5J9Tnyq7bySxbt/answering-questions-honestly-given-world-model-mismatches), the AI system might answer questions by translating its concepts to human concepts or observations, and then deduce the answer from those concepts or observations. This will systematically ignore information that the AI system understands that isn’t represented in the human concepts or observations. (Consider the [example](https://openai.com/blog/deep-reinforcement-learning-from-human-preferences/) of the robot hand that only _looked_ like it was grasping the appropriate object.)

A possible fourth problem: if the AI system did the deduction in its own concepts and only as a final step translated it to human concepts, we might _still_ lose relevant information. This seems not too bad though -- it seems like we should at least be able to <@explain the bad effects of a catastrophic failure@>(@Can there be an indescribable hellworld?@) in human concepts, even if we can’t explain why that failure occurred.

A [followup post](https://www.alignmentforum.org/posts/roZvoF6tRH6xYtHMF/avoiding-the-instrumental-policy-by-hiding-information-about) considers whether we could avoid the instrumental policy by <@preventing it from learning information about humans@>(@Thoughts on Human Models@), but concludes that while it would solve the problems outlined in the post, it seems hard to implement in practice.

> For example, I now think that the representations of “what the model knows” in imitative generalization will sometimes need to use neural networks to translate between what the model is thinking and human language. Once you go down that road, you encounter many of the difficulties of the naive training strategy. This is an update in my view; I’ll likely go into more detail in a future post.

+1 to this and excited and happy to hear about this update in your view!