David Lindner

Alignment researcher at Google DeepMind

Comments
VLM-RM: Specifying Rewards with Natural Language
David Lindner · 2y

The agents are rewarded at every timestep, and we want them to perform the task throughout the whole episode, so falling over is definitely not what we want. But this has more to do with the policy optimization failing than with the reward model: in other words, a policy that doesn't fall over would achieve higher reward than the policies we actually learn. For example, if we plot the CLIP reward over one episode, it typically drops at the end of the episode if the agent falls down.
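For concreteness, here is a minimal sketch of the kind of per-timestep CLIP reward curve being described, not our exact implementation: it scores each rendered frame against a natural-language task description via cosine similarity of CLIP embeddings. The checkpoint name, the prompt, and the dummy `frames` are placeholders.

```python
# Minimal sketch (not our exact implementation): compute a CLIP reward for each
# frame of an episode as the cosine similarity between the frame embedding and
# the embedding of a natural-language task description, then plot it over time.
import numpy as np
import torch
import matplotlib.pyplot as plt
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder frames; in practice these would come from env.render() each timestep.
frames = [
    Image.fromarray(np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8))
    for _ in range(16)
]
task_description = "a humanoid robot standing upright"  # hypothetical prompt

def clip_rewards(frames, text):
    """One reward per frame: cosine similarity to the task description."""
    inputs = processor(text=[text], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(-1)

rewards = clip_rewards(frames, task_description)
plt.plot(rewards.numpy())
plt.xlabel("timestep")
plt.ylabel("CLIP reward")
plt.show()
```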

We tried some tricks to improve the training, such as a curriculum that starts with short episodes and gradually moves to longer ones. This worked decently well and made the agents fall over less, but we ended up not using it in the final experiments because we primarily wanted to show that the approach works with off-the-shelf RL algorithms.
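A rough sketch of what such an episode-length curriculum could look like with a Gymnasium environment follows; the environment id, the schedule, and the `train_policy` hook are illustrative, not the setup we actually ran.

```python
# Illustrative episode-length curriculum (not the exact setup we used): train in
# phases, increasing the time limit from short episodes to the full horizon.
import gymnasium as gym

curriculum = [100, 250, 500, 1000]  # hypothetical max episode lengths

for max_steps in curriculum:
    env = gym.make("Humanoid-v4", max_episode_steps=max_steps)
    # train_policy(env, num_env_steps=...)  # any off-the-shelf RL algorithm
    env.close()
```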

Practical Pitfalls of Causal Scrubbing
David Lindner · 2y

Thanks, that's a useful alternative framing of CaSc!

FWIW, I think this adversarial version of CaSc would avoid the main examples in our post where CaSc fails to reject a false hypothesis. The common feature of our examples is "cancellation", which comes from looking at an average CaSc loss. If you only look at the loss of the worst experiment (so the maximum CaSc loss rather than the average one), you don't get this kind of cancellation problem.

Plausibly you'd run into different failure modes, though; in particular, I'd guess the maximum measure is less smooth and gives you less information about "how wrong" your hypothesis is.
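To make the cancellation point concrete, here is a toy numerical illustration (hypothetical numbers, not an actual CaSc implementation): per-experiment scrubbed losses whose errors cancel under the mean but not under the max.

```python
# Toy illustration of the cancellation failure mode (hypothetical numbers, not
# an actual CaSc implementation): averaging scrubbed losses hides errors that
# point in opposite directions, while taking the worst case exposes them.
import numpy as np

original_loss = 1.0
# Scrubbed losses from individual resampling experiments: some much worse than
# the original, some much better, by roughly equal amounts.
scrubbed_losses = np.array([1.6, 0.4, 1.5, 0.5])

mean_gap = scrubbed_losses.mean() - original_loss  # 0.0: hypothesis looks fine
max_gap = scrubbed_losses.max() - original_loss    # 0.6: worst case flags a problem

print(f"average scrubbed-loss gap:    {mean_gap:+.2f}")
print(f"worst-case scrubbed-loss gap: {max_gap:+.2f}")
```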

Practical Pitfalls of Causal Scrubbing
David Lindner · 2y

Yes, this seems like a plausible confusion. Your interpretation of what we mean is correct.

Posts (sorted by new)

30 · Early Signs of Steganographic Capabilities in Frontier LLMs · 3mo · 5 comments
31 · MONA: Three Month Later - Updates and Steganography Without Optimization Pressure (Ω) · 5mo · 0 comments
29 · Can LLMs learn Steganographic Reasoning via RL? · 5mo · 3 comments
81 · MONA: Managed Myopia with Approval Feedback (Ω) · 8mo · 30 comments
49 · On scalable oversight with weak LLMs judging strong LLMs (Ω) · 1y · 18 comments
20 · VLM-RM: Specifying Rewards with Natural Language (Ω) · 2y · 2 comments
87 · Practical Pitfalls of Causal Scrubbing (Ω) · 2y · 17 comments
79 · Threat Model Literature Review (Ω) · 3y · 4 comments
127 · Clarifying AI X-risk (Ω) · 3y · 24 comments