LESSWRONG
Meg
Posts
Auditing language models for hidden objectives (142 karma, 9mo, 15 comments)
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (310 karma, 2y, 95 comments)
Steering Llama-2 with contrastive activation additions (125 karma, 2y, 29 comments)
Towards Understanding Sycophancy in Language Models (66 karma, 2y, 0 comments)
Paper: LLMs trained on “A is B” fail to learn “B is A” (121 karma, 2y, 74 comments)
Paper: On measuring situational awareness in LLMs (109 karma, 2y, 17 comments)
Comments