Aaryan Chandna
Aaryan Chandna — LessWrong
Natural emergent misalignment from reward hacking in production RL
Aaryan Chandna
2mo
Interesting, would be cool to know which base model was used!
Is the evidence in "Language Models Learn to Mislead Humans via RLHF" valid?
2mo