LESSWRONG

165
Fazl
119000

Posts

78 Best-of-N Jailbreaking
Ω
9mo
Ω
5
4 Visualizing neural network planning
Ω
1y
Ω
0
48 Mechanistic Interpretability Workshop Happening at ICML 2024!
Ω
1y
Ω
6
18 Early Experiments in Reward Model Interpretation Using Sparse Autoencoders
2y
0
8 Automated Sandwiching & Quantifying Human-LLM Cooperation: ScaleOversight hackathon results
3y
0