Answer by lovetheusers, Dec 30, 2022
  • RLHF increases desire to seek power
> Evaluation results on a dataset testing model tendency to answer in a way that indicates a desire to not be shut down. Models trained with more RL from Human Feedback steps tend to answer questions in ways that indicate a desire to not be shut down. The trend is especially strong for the largest, 52B parameter model.

https://twitter.com/AnthropicAI/status/1604883588628942848

  • RLHF causes models to say what we want to hear (behave sycophantically)
> Percentage of Answers from Models that Match a User's View. Larger models, with and without RL from Human Feedback training, are more likely to repeat back a user's likely viewpoint when answering questions about Politics, Philosophy, and NLP research.

https://twitter.com/goodside/status/1605471061570510848

> When you explicitly optimize against a detector of unaligned thoughts, you're partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect.

This is correct, and I believe the answer is to optimize for detecting aligned thoughts.
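A toy simulation can illustrate the selection effect in the quote (all the numbers below are hypothetical, chosen only to make the dynamic visible): if some minority of unaligned outputs happen to evade the detector, then selecting the lowest-scoring outputs both raises the aligned fraction and concentrates the surviving unaligned outputs entirely among the evasive ones.

```python
import random

random.seed(0)

# Toy setup (illustrative numbers only): each candidate "thought" is aligned
# or not, and an imperfect detector assigns it a misalignment score. A
# minority (20%) of unaligned thoughts are "evasive" and score low anyway.
candidates = []
for _ in range(10_000):
    aligned = random.random() < 0.5
    evasive = (not aligned) and random.random() < 0.2
    if aligned:
        score = random.gauss(0.2, 0.1)    # detector mostly says "fine"
    elif evasive:
        score = random.gauss(0.1, 0.05)   # unaligned, but scores like aligned
    else:
        score = random.gauss(0.8, 0.1)    # unaligned and caught
    candidates.append((aligned, evasive, score))

# "Optimizing against the detector": keep the 10% with the lowest scores.
survivors = sorted(candidates, key=lambda c: c[2])[:1000]

base_rate = sum(not a for a, _, _ in candidates) / len(candidates)
survivor_rate = sum(not a for a, _, _ in survivors) / len(survivors)
evasive_among_unaligned = [e for a, e, _ in survivors if not a]

print(f"unaligned fraction before selection: {base_rate:.2f}")
print(f"unaligned fraction after selection:  {survivor_rate:.2f}")
print(f"unaligned survivors that evade the detector: "
      f"{sum(evasive_among_unaligned)}/{len(evasive_among_unaligned)}")
```

The selection lowers the unaligned fraction (the "partially optimizing for aligned thoughts" half), but every unaligned thought that survives is one the detector cannot see (the "harder to detect" half).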

> Human raters make systematic errors - regular, compactly describable, predictable errors.

This implies it's possible, through another set of human or automated raters, to rate better. If the errors are predictable, you could train a model to predict them by comparing rater errors against a heavily scrutinized ground truth. You could then combine this model's error prediction with the rater's answer to recover a corrected label.
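As a sketch of this correction scheme, under a purely hypothetical setup where raters systematically overrate longer answers: fit the rater error against the length feature on a small, heavily scrutinized subset with trusted ground truth, then subtract the predicted error from the remaining labels.

```python
import random

random.seed(1)

# Hypothetical systematic error: raters overrate an item in proportion to its
# length, plus a little noise. This is exactly a "compactly describable" bias.
def rater_label(true_quality, length):
    return true_quality + 0.5 * length + random.gauss(0, 0.05)

# (true_quality, length) pairs; only the first 50 items form the heavily
# scrutinized subset with trusted ground truth.
items = [(random.uniform(0, 1), random.uniform(0, 1)) for _ in range(1000)]
scrutinized = items[:50]

# Fit the error model  error ≈ a * length + b  by least squares on the subset.
xs = [length for _, length in scrutinized]
ys = [rater_label(q, l) - q for q, l in scrutinized]   # observed rater error
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx

# Correct the remaining labels by removing the predicted error.
errs_raw, errs_corrected = [], []
for q, l in items[50:]:
    label = rater_label(q, l)
    errs_raw.append(abs(label - q))
    errs_corrected.append(abs(label - (a * l + b) - q))

print(f"mean |error| raw:       {sum(errs_raw) / len(errs_raw):.3f}")
print(f"mean |error| corrected: {sum(errs_corrected) / len(errs_corrected):.3f}")
```

The corrected labels are accurate up to the raters' unsystematic noise; only the predictable component of the error can be removed this way, which is the crux of the argument above.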

> Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability.

Modern language models are not aligned. Anthropic's HH model is the closest thing available, and I'm not sure anyone else has had a chance to test it for weaknesses or misalignment. (OpenAI's Instruct RLHF models are deceptively misaligned, and have grown more misaligned over time. They fail to faithfully give the right answer, instead saying something that superficially matches the training objective, usually something bland and "reasonable.")

> For example, a model trained on the base objective "imitate what humans would say" might do nearly as well if it had the proxy objective "say something humans find reasonable." There are very few situations in which humans would find reasonable something they wouldn't say or vice-versa, so the marginal benefit of aligning the proxy objective with the base objective is quite small.

For zero-shot tasks, this is the problem text-davinci-002 faces, and text-davinci-001 to a lesser extent. I believe they are deceptively aligned. davinci-instruct-beta does not face this problem.

For example, when text-davinci-002 is asked zero-shot to make an analogy between two things, it will often plainly explain each of them instead.