Meg — LessWrong
Meg
Posts
Auditing language models for hidden objectives (142 karma, 9mo, 15 comments)
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (310 karma, 2y, 95 comments)
Steering Llama-2 with contrastive activation additions (125 karma, 2y, 29 comments)
Towards Understanding Sycophancy in Language Models (66 karma, 2y, 0 comments)
Paper: LLMs trained on “A is B” fail to learn “B is A” (121 karma, 2y, 74 comments)
Paper: On measuring situational awareness in LLMs (109 karma, 2y, 17 comments)
Comments