Mads U — LessWrong
Sonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals
Mads U
6mo
Does this mean that the model will always behave nicely, if it always thinks it is being tested?