Apollo Research (org)

This page is a stub.
Posts tagged Apollo Research (org)
SAE feature geometry is outside the superposition hypothesis
  jake_mendel · 1y · 229 karma · 17 comments · Ω

Announcing Apollo Research
  Marius Hobbhahn, beren, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer · 2y · 217 karma · 11 comments · Ω

You can remove GPT2’s LayerNorm by fine-tuning for an hour
  StefanHex · 1y · 166 karma · 11 comments · Ω

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team
  Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex, Nicholas Goldowsky-Dill · 1y · 123 karma · 18 comments · Ω

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs
  L Rudolf L, bilalchughtai, Jan Betley, kaivu, Jérémy Scheurer, Mikita Balesni, AlexMeinke, Owain_Evans, Marius Hobbhahn · 1y · 109 karma · 37 comments

Attribution-based parameter decomposition
  Lucius Bushnaq, Dan Braun, StefanHex, jake_mendel, Lee Sharkey · 8mo · 108 karma · 22 comments · Ω

The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
  Lucius Bushnaq, jake_mendel, Dan Braun, StefanHex, Nicholas Goldowsky-Dill, Kaarel, Avery, Joern Stoehler, debrevitatevitae, Magdalena Wache, Marius Hobbhahn · 1y · 108 karma · 4 comments · Ω

Sparsify: A mechanistic interpretability research agenda
  Lee Sharkey · 1y · 96 karma · 23 comments · Ω

Apollo Research 1-year update
  Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni, Jérémy Scheurer, Nicholas Goldowsky-Dill, StefanHex, jake_mendel, AlexMeinke, rusheb · 1y · 93 karma · 0 comments · Ω

We need a Science of Evals
  Marius Hobbhahn, Jérémy Scheurer · 2y · 72 karma · 13 comments · Ω

[Interim research report] Activation plateaus & sensitive directions in GPT2
  StefanHex, jake_mendel · 1y · 65 karma · 2 comments · Ω

An Opinionated Evals Reading List
  Marius Hobbhahn, Jérémy Scheurer · 1y · 65 karma · 0 comments · Ω

Understanding strategic deception and deceptive alignment
  Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer, Dan Braun · 2y · 64 karma · 16 comments · Ω

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
  Dan Braun, Jordan Taylor, Nicholas Goldowsky-Dill, Lee Sharkey · 1y · 57 karma · 20 comments · Ω

A starter guide for evals
  Marius Hobbhahn, Jérémy Scheurer, Mikita Balesni, rusheb, AlexMeinke · 2y · 55 karma · 2 comments · Ω