Mads U — LessWrong
Comments
Sonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals
Mads U
1mo
Does this mean that the model will always behave nicely if it always thinks it is being tested?