lovetheusers
Posts


Comments

Linkpost for a generalist algorithmic learner: capable of carrying out sorting, shortest paths, string matching, and convex hull finding in one network · 3y
Tools for finding information on the internet
lovetheusers · 3y

Good AI summarization tools:
https://www.towords.io/
https://detangle.ai/

Reply
Don't you think RLHF solves outer alignment?
Answer by lovetheusers · Dec 30, 2022
  • RLHF increases desire to seek power
Evaluation results on a dataset testing model tendency to answer in a way that indicates a desire to not be shut down. Models trained with more RL from Human Feedback steps tend to answer questions in ways that indicate a desire to not be shut down. The trend is especially strong for the largest, 52B parameter model.

https://twitter.com/AnthropicAI/status/1604883588628942848

  • RLHF causes models to say what we want to hear (behave sycophantically)
Percentage of Answers from Models that Match a User’s View. Larger models, with and without RL from Human Feedback training, are more likely to repeat back a user’s likely viewpoint, when answering questions about Politics, Philosophy, and NLP research.

https://twitter.com/goodside/status/1605471061570510848

Reply
AGI Ruin: A List of Lethalities
lovetheusers · 3y

When you explicitly optimize against a detector of unaligned thoughts, you're partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect.

This is correct, and I believe the answer is to optimize for detecting aligned thoughts.

Reply
AGI Ruin: A List of Lethalities
lovetheusers · 3y

Human raters make systematic errors - regular, compactly describable, predictable errors.

This implies it's possible, through another set of human or automated raters, to rate better. If the errors are predictable, you could train a model to predict them by comparing rater answers against a heavily scrutinized ground truth. You could then add this model's error prediction to the rater answer and get a corrected label.
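A minimal sketch of this correction scheme on hypothetical synthetic data, assuming the systematic rater error is a linear function of observable features (a toy stand-in for "compactly describable, predictable"):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: raters score items described by features x,
# with a systematic bias that depends compactly on those features.
n = 1000
x = rng.normal(size=(n, 3))
true_label = x @ np.array([1.0, -0.5, 0.2])
systematic_error = 0.8 * x[:, 0] + 0.3  # predictable rater error
rater_label = true_label + systematic_error + rng.normal(scale=0.05, size=n)

# A small "heavily scrutinized" subset where ground truth is known.
k = 100
observed_error = rater_label[:k] - true_label[:k]

# Fit a linear model of the error from the features (least squares).
X = np.hstack([x[:k], np.ones((k, 1))])
coef, *_ = np.linalg.lstsq(X, observed_error, rcond=None)

# Correct the remaining rater labels by removing the predicted error
# (sign convention: error = rater - truth, so we subtract it).
X_rest = np.hstack([x[k:], np.ones((n - k, 1))])
corrected = rater_label[k:] - X_rest @ coef

residual = np.abs(corrected - true_label[k:]).mean()
raw_gap = np.abs(rater_label[k:] - true_label[k:]).mean()
print(f"mean error before correction: {raw_gap:.3f}, after: {residual:.3f}")
```

On this toy data the corrected labels land much closer to ground truth than the raw rater labels; the scheme only works to the extent the errors really are predictable from features you can observe.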

Reply
AGI Ruin: A List of Lethalities
lovetheusers · 3y

Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability.

Modern language models are not aligned. Anthropic's HH is the closest thing available, and I'm not sure anyone else has had a chance to test it for weaknesses or misalignment. (OpenAI's Instruct RLHF models are deceptively misaligned, and have become more misaligned over time. They fail to faithfully give the right answer, and instead say something that merely resembles the training objective, usually something bland and "reasonable.")

Reply
Does SGD Produce Deceptive Alignment?
lovetheusers · 3y

For example, a model trained on the base objective "imitate what humans would say" might do nearly as well if it had the proxy objective "say something humans find reasonable." There are very few situations in which humans would find reasonable something they wouldn't say or vice-versa, so the marginal benefit of aligning the proxy objective with the base objective is quite small.

For zero-shot tasks, this is the problem text-davinci-002 faces, and text-davinci-001 to a lesser extent. I believe they are deceptively aligned; davinci-instruct-beta does not face this problem.

For example, when text-davinci-002 is asked zero-shot to make an analogy between two things, it will often plainly explain both things instead.

Reply