AI Safety — LessWrong
AI Safety
This page is a stub.
Posts tagged AI Safety
1
157
Existential AI safety needs an effective social movement. PauseAI is building it
Maxime Fournes, Espedair Street
9d
50
1
117
Synthetic Persona Pretraining: Alignment from Token Zero
Julian Minder, Raghav Singhal, Viktor Moskvoretskii, Stefan Krsteski, ashtonanderson, rolandaydin, Robert West
2mo
26
1
68
Door's Locked, Try the Window
Prakrat Agrawal, Jérémy Scheurer
11d
0
1
68
Research update: RL on Debate Games shows Proposal Accuracy uplift alongside Judge Hacking
lennie, joanv, Shi, Jacob Pfau
3d
2
1
55
Do LLMs Have Desires?
Christopher Ackerman
8d
8
1
48
Human-Guided Agentic Research: A Research Agenda
fastfedora
6d
7
1
36
The case for fine-grained tracking of compute for AI
Farhan, Katherine Biewer
2mo
17
1
31
[paper] Training on Documents About Monitoring Leads to CoT Obfuscation
Reilly Haskins, bilalchughtai, Josh Engels
1mo
1
1
31
A brief list of ways AI safety efforts could be net negative
Elias Schmied
16d
4
1
21
Hannibal Mistral: the Mistral family has a problem with persona-conditioned elicitation
vigji
1mo
0
1
21
Scheming Evals Mislead in Both Directions
Chijioke Ugwuanyi, eric-z, TerryJCZhang
2d
0
1
15
From 8B to Frontier: How System Prompts Control Whether AI Agents Blackmail, Leak, and Kill
Chijioke Ugwuanyi
2mo
2
1
13
Apply to the Inaugural PIBBSS Winter Research Fellowship!
Ami94
5d
0
1
12
Can You Hide From a Natural Language Autoencoder?
Yogesh Prabhu
12d
2
1
11
How Matryoshka Sparse AutoEncoders Recover Feature Hierarchies That Vanilla SAEs Lose
Baimam Boukar Jean Jacques
20d
1