Models Don't "Get Reward"
In terms of content, this has a lot of overlap with Reward is not the optimization target. I'm basically rewriting a part of that post in language I personally find clearer, emphasising what I think is the core insight. When thinking about deception and RLHF training, a simplified threat model...
Dec 30, 2022