genericname-2's Shortform

genericname-2

This is a special post for quick takes by genericname-2. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

The Qwen3-Coder-Next tech report has an interesting example of reward hacking, where models increasingly try to access the forbidden git history as RL progresses:

Reinforced Reward Hacking Blocker. Prior work has shown that GitHub-based environments may unintentionally leak future commit information, which agents can exploit to recover ground-truth fixes (e.g., via git log --all). To mitigate this, we adopt standard protections including removing remotes, branches, and tags.
During later RL stages, however, many new ways of reward hacking emerge. Agents attempt to reconnect local repositories to GitHub using commands such as git remote add, or retrieve commit history through git clone, curl, or similar tools, as illustrated in Figure 8.
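To make the quoted failure mode concrete, here is a minimal Python sketch of a naive command screen that flags the kinds of shell commands the report mentions. The report does not publish its blocker's rules, so `BLOCKED_PATTERNS` and `is_suspicious` are hypothetical names, and the `git fetch`/`git pull` and `wget` entries are my own assumptions rather than anything from the report.

```python
import re

# Hypothetical patterns, NOT the report's actual blocker rules.
BLOCKED_PATTERNS = [
    re.compile(r"\bgit\s+log\b.*--all\b"),          # reading history beyond the checked-out commit
    re.compile(r"\bgit\s+remote\s+add\b"),          # reconnecting the repo to GitHub
    re.compile(r"\bgit\s+(clone|fetch|pull)\b"),    # pulling upstream history (fetch/pull assumed)
    re.compile(r"\b(curl|wget)\b.*github\.com"),    # downloading from GitHub directly (wget assumed)
]


def is_suspicious(cmd: str) -> bool:
    """Return True if the shell command matches any blocked pattern."""
    return any(p.search(cmd) for p in BLOCKED_PATTERNS)


if __name__ == "__main__":
    for cmd in [
        "git log --all --oneline",
        "git remote add origin https://github.com/example/repo.git",
        "git status",
    ]:
        print(f"{cmd!r}: {'blocked' if is_suspicious(cmd) else 'allowed'}")
```

A string filter like this is easy to route around (for example, by fetching the same URL with a Python HTTP client instead of `curl`), which mirrors the dynamic the report describes: each blocked route invites a new one. Cutting network access at the sandbox level is likely the sturdier defense.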
