MATS Program
• Applied to Determining the power of investors over Frontier AI Labs is strategically important to reduce x-risk by Lucie Philippon 2d ago
• Applied to Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities by Axel Højmark 5d ago
• Applied to BatchTopK: A Simple Improvement for TopK-SAEs by Bart Bussmann 7d ago
• Applied to My experience applying to MATS 6.0 by mic 8d ago
• Applied to [Research log] The board of Alphabet would stop DeepMind to save the world by Lucie Philippon 11d ago
• Applied to I found >800 orthogonal "write code" steering vectors by Ryan Kidd 11d ago
• Applied to Stitching SAEs of different sizes by Ryan Kidd 13d ago
• Applied to Efficient Dictionary Learning with Switch Sparse Autoencoders by Anish Mudide 20d ago
• Applied to Decomposing the QK circuit with Bilinear Sparse Dictionary Learning by Ryan Kidd 24d ago
• Applied to Talk: AI safety fieldbuilding at MATS by Ryan Kidd 1mo ago
• Applied to Attention Output SAEs Improve Circuit Analysis by Connor Kissane 1mo ago
• Applied to Fine-tuning is not sufficient for capability elicitation by Theodore Chapman 1mo ago
• Applied to [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations by Ryan Kidd 1mo ago
• Applied to Reward hacking behavior can generalize across tasks by Ryan Kidd 2mo ago
• Applied to How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions by Ryan Kidd 2mo ago
• Applied to Talent Needs of Technical AI Safety Teams by yams 2mo ago
• Applied to Infra-Bayesian haggling by Ryan Kidd 2mo ago
• Applied to Language Models Model Us by eggsyntax 2mo ago
• Applied to MATS Winter 2023-24 Retrospective by Rocket 3mo ago