(Context: my last post was trying to patch a certain naive strategy for AI alignment, but I didn’t articulate clearly what the naive strategy is. I think it’s worth explaining the naive strategy in its own post, even though it’s not a novel idea.)

Suppose that I jointly train an AI to do some task (e.g. make money for me) and to answer a wide range of questions about what is happening in the world (e.g. “why did Alice just wire $1000 into my bank account?” or “what is Bob thinking right now?”).

I generate training data for the QA task in a really simple way: I choose a subset of questions that humans are able to reliably answer, collect human answers to them, and use those question-answer pairs as a training set for supervised learning. I’ll call this the naive training strategy.
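To make the setup concrete, here is a minimal PyTorch-style sketch of what the joint training loop could look like. The interface (a `task_loss` method, an `answer` method, and the batch format) is invented purely for illustration and isn’t part of the proposal; the only point is that a single model is optimized on the task objective and on a supervised QA loss over human-answerable questions at the same time.

```python
# Minimal sketch of the naive training strategy (illustrative only).
# `model.task_loss`, `model.answer`, and the batch layout are hypothetical
# interfaces invented for this example.
import torch
import torch.nn.functional as F

def naive_training_step(model, task_batch, qa_batch, optimizer):
    # Loss on the primary task (e.g. making money), however it is scored.
    task_loss = model.task_loss(task_batch)

    # Supervised QA loss, computed only on questions that humans
    # can reliably answer (the curated training set described above).
    logits = model.answer(qa_batch["questions"])
    qa_loss = F.cross_entropy(logits, qa_batch["human_answers"])

    # Jointly optimize both objectives with SGD.
    loss = task_loss + qa_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```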

I’d like for my AI to tell me everything it knows. If the AI bought a stock because it expects a merger announcement soon, I want it to tell me about the predicted merger announcement. If the AI predicts a merger announcement because it inferred that executives of the companies have been in extensive talks over the last month, I want it to tell me about those talks.

I’m not asking the AI to explain why it made a given decision; I’m asking it to tell me as much as it can about the world. The important property is that if the AI “knows” something and uses that knowledge to perform the task well, then it also uses that knowledge to answer questions well.

Why might this work? The hope is that “answer questions honestly to the best of your ability” is a natural thing for our AI to learn — that there is some simple way to translate from the AI’s model of the world into natural language and to honestly report what it believes. If our training dataset is good, then this policy will score well, and we can hope that SGD will find it. I’ll call this the intended policy.

Why might this not work? The concern is that “predict how a human would answer questions” is also a natural thing for our AI to learn, especially if the AI is doing a task that already requires predicting humans. Predicting humans also gets a low loss on the training set, but it generalizes poorly once we start asking our AI questions that a human couldn’t have answered on their own.

If we’re worried about this, we could use a loss function other than "predict what humans would say." But regardless of what loss function we choose, the policy could still be trying to game it. If SGD learns to "game the loss function," then we’ll generalize poorly in any case where humans can’t distinguish a good question-answerer from a bad one.

I’ll call “game the loss function” the instrumental policy. It answers questions well only because doing so is instrumentally useful for getting a low loss, and so it will start answering questions badly if that situation changes. (This is closely related to deceptive alignment, but I think the concept makes sense without defining alignment or optimizers; there’s also a difference in emphasis in that I’m often considering models which are “intrinsically motivated” to game the loss function rather than doing so in order to gain influence. I first introduced this term in Inaccessible Information.)

What I’m doing now

I’m trying to dig into a bunch of reasons why the naive training strategy might fail, and to understand whether there is a way to modify the naive strategy to avoid those problems.

In my last post I discussed two reasons that the naive training strategy might learn the instrumental policy instead of the intended policy:

  1. There may be predictable errors in the training data, such that the instrumental policy actually gets a lower loss than answering honestly (because it responds strategically to errors).
  2. If the AI already needs to make predictions about humans, then it can reuse that machinery to game the loss function, whereas it may have to build new machinery to translate its thoughts into natural language.

I tried to describe a modified algorithm that avoids those pitfalls. I don’t think I succeeded, but I do feel reasonably optimistic that some approach can address these two problems.

Unfortunately, there are many further reasons that the naive training strategy could fail. I’m currently spending time trying to understand those issues and figure out which, if any, are most likely to represent fundamental roadblocks.

Where others stand

I’ve had a lot of conversations about alignment with ML researchers over the last 7 years. My impression is that a large fraction of optimists expect to find some strategy that gets smart enough models to generalize in the intended way (honestly reporting their beliefs) rather than learning to game the loss function.

The fact that this is part of the “consensus” optimism about alignment makes it particularly appealing to investigate cases where it seems hard to get the intended generalization.

At the other extreme, I think many alignment researchers are very skeptical of this entire project. I think the main way I disagree with them is methodological: before concluding that the problem is hard, I want to try to find the simplest hard case.

Relationship to my other work

I’ve traditionally avoided the naive training strategy, in large part because I’m scared that it will learn the instrumental policy instead of the intended policy.

I still believe that you need to do something like iterated amplification and imitative generalization in order to avoid this problem. However, I think that a working strategy that combines all of these ideas may have more in common with the naive training strategy than I’d initially expected.

For example, I now think that the representations of “what the model knows” in imitative generalization will sometimes need to use neural networks to translate between what the model is thinking and human language. Once you go down that road, you encounter many of the difficulties of the naive training strategy. This is an update in my view; I’ll likely go into more detail in a future post.

Comments

> There may be predictable errors in the training data, such that the instrumental policy actually gets a lower loss than answering honestly (because it responds strategically to errors).

If you are answering questions as text, there is a lot of choice in wording. There are many strings of text that count as a correct answer, and the AI has to pick the one the human would use. In order to predict how a human would word an answer, you need a fairly good understanding of how they think (I think).

I agree you have to do something clever to make the intended policy plausibly optimal.

The first part of my proposal in section 3 here was to avoid using "imitate humans," and to instead learn a function "Answer A is unambiguously worse than answer B." Then we update against policies only when they give unambiguously worse answers.

(I think this still has a lot of problems; it's not obvious to me whether the problem is soluble.)
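To illustrate the shape of that update rule, here is a rough sketch. Every name (`policy.sample_with_log_prob`, `comparator`) and the 0.9 confidence cutoff are invented for this example rather than taken from the linked proposal; the point is only that ambiguous comparisons produce no training signal.

```python
# Illustrative sketch: only update against answers judged unambiguously
# worse than a reference answer. All interfaces here are hypothetical.
import torch

def update_against_worse_answer(policy, comparator, question,
                                reference_answer, threshold=0.9):
    # Sample an answer and keep its log-probability under the policy.
    answer, log_prob = policy.sample_with_log_prob(question)

    # Ask the learned comparison function for the probability that the
    # sampled answer is unambiguously worse than the reference answer.
    with torch.no_grad():
        p_worse = comparator(question, answer, reference_answer)

    # Only update against the policy when the comparator is confident;
    # ambiguous cases are left alone.
    if p_worse > threshold:
        loss = log_prob  # minimizing this pushes down P(worse answer)
        loss.backward()
        return loss.item()
    return None
```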

Planned summary for the Alignment Newsletter:

We want to build an AI system that answers questions honestly, to the best of its ability. One obvious approach is to have humans generate answers to questions, select the question-answer pairs where we are most confident in the answers, and train an AI system on those question-answer pairs.

(I’ve described this with a supervised learning setup, but we don’t have to do that: we could also [learn](https://deepmind.com/blog/learning-through-human-feedback/) from [comparisons](https://ai-alignment.com/optimizing-with-comparisons-c02b8c0d7877) between answers, and we only provide comparisons where we are confident in the comparison.)

What will the AI system do on questions where we _wouldn’t_ be confident in the answers? For example, questions that are complex, where we may be misled by bad observations, where an adversary is manipulating us, etc.

One possibility is that the AI system learned the **intended policy**, where it answers questions honestly to the best of its ability. However, there is an **instrumental policy** which also gets good performance: it uses a predictive model of the human to say whatever a human would say. (This is “instrumental” in that the model is taking the actions that are instrumental to getting a low loss, even in the test environment.) This will give incorrect answers on complex, misleading, or manipulative questions -- _even if_ the model “knows” that the answer is incorrect.

Intuitively, “answer as well as you can” feels like a much simpler way to give correct answers, and so we might expect to get the intended policy rather than the instrumental policy. This view (which seems common amongst ML researchers) is _optimism about generalization_: we are hoping that the policy generalizes to continue to answer these more complex, misleading, manipulative questions to the best of its ability.

Are there reasons to instead be pessimistic about generalization? There are at least three:

1. If the answers we train on _aren’t_ perfectly correct, the instrumental policy might get a _lower_ training loss than the intended policy (which corrects errors that humans make), and so be more likely to be found by gradient descent.

2. If the AI already needs to make predictions about humans, it may not take much “additional work” to implement the instrumental policy. Conversely, if the AI reasons at a different level of abstraction than humans, it may take a lot of “additional work” to turn correct answers in the AI’s ontology into correct answers in human ontologies.

3. From [a followup post](https://www.alignmentforum.org/posts/SRJ5J9Tnyq7bySxbt/answering-questions-honestly-given-world-model-mismatches), the AI system might answer questions by translating its concepts to human concepts or observations, and then deduce the answer from those concepts or observations. This will systematically ignore information that the AI system understands that isn’t represented in the human concepts or observations. (Consider the [example](https://openai.com/blog/deep-reinforcement-learning-from-human-preferences/) of the robot hand that only _looked_ like it was grasping the appropriate object.)

A possible fourth problem: if the AI system did the deduction in its own concepts and only as a final step translated it to human concepts, we might _still_ lose relevant information. This seems not too bad though -- it seems like we should at least be able to <@explain the bad effects of a catastrophic failure@>(@Can there be an indescribable hellworld?@) in human concepts, even if we can’t explain why that failure occurred.

A [followup post](https://www.alignmentforum.org/posts/roZvoF6tRH6xYtHMF/avoiding-the-instrumental-policy-by-hiding-information-about) considers whether we could avoid the instrumental policy by <@preventing it from learning information about humans@>(@Thoughts on Human Models@), but concludes that while it would solve the problems outlined in the post, it seems hard to implement in practice.

> For example, I now think that the representations of “what the model knows” in imitative generalization will sometimes need to use neural networks to translate between what the model is thinking and human language. Once you go down that road, you encounter many of the difficulties of the naive training strategy. This is an update in my view; I’ll likely go into more detail in a future post.

+1 to this and excited and happy to hear about this update in your view!