Compressed Computation is (probably) not Computation in Superposition
This research was completed during the Mentorship for Alignment Research Students (MARS 2.0) and Supervised Program for Alignment Research (SPAR spring 2025) programs. The team was supervised by Stefan (Apollo Research). Jai and Sara were the primary contributors; Stefan contributed ideas, ran the final experiments, and helped write the post. Giorgi contributed...
Do any of these papers from the last year change your view on the impact of interp for these theories?
1. Understanding misalignment (at least some initial insights): https://arxiv.org/html/2502.17424v2
2. Better prediction of future systems (interp for scaling): https://arxiv.org/abs/2303.13506
3. Auditing to reveal hidden objectives: https://www.anthropic.com/research/auditing-hidden-objectives