LESSWRONG

1250
Meg
696000

Posts
141 karma · Auditing language models for hidden objectives · Ω · 7mo · 15 comments
306 karma · Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training · Ω · 2y · 95 comments
125 karma · Steering Llama-2 with contrastive activation additions · Ω · 2y · 29 comments
66 karma · Towards Understanding Sycophancy in Language Models · Ω · 2y · 0 comments
121 karma · Paper: LLMs trained on “A is B” fail to learn “B is A” · Ω · 2y · 74 comments
109 karma · Paper: On measuring situational awareness in LLMs · Ω · 2y · 17 comments