LESSWRONG

Meg
695000

Posts

Auditing language models for hidden objectives (141 karma, 4mo, 15 comments)
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training (305 karma, 1y, 95 comments)
Steering Llama-2 with contrastive activation additions (125 karma, 2y, 29 comments)
Towards Understanding Sycophancy in Language Models (66 karma, 2y, 0 comments)
Paper: LLMs trained on “A is B” fail to learn “B is A” (121 karma, 2y, 74 comments)
Paper: On measuring situational awareness in LLMs (109 karma, 2y, 17 comments)