Jai Bhagat
https://jkbhagatio.io
Ph.D. in Computational & Systems Neuroscience. Actively working on building digital models of biological brains, neural interfaces, and technical AI safety research (interp and evals).
Do any of these recent papers within the last year change your view on interp impact for these theories?

1. Understanding misalignment (at least some initial insights): https://arxiv.org/html/2502.17424v2
2. Better prediction of future systems (interp for scaling): https://arxiv.org/abs/2303.13506
3. Auditing to reveal hidden objectives: https://www.anthropic.com/research/auditing-hidden-objectives
Nice post! Random thought -- problem 1 seems like a problem in systems neuroscience as well.
Yes! But only if the mess is the residual stream, i.e., includes $x$! This is the heart of the necessary "feature mixing" we discuss.
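
To make "includes $x$" a bit more concrete, here is a minimal sketch assuming a standard pre-norm transformer residual stream (the symbols $h_\ell$ and $f_i$ are my notation, not from the post):

$$h_\ell = x + \sum_{i=1}^{\ell} f_i(h_{i-1}), \qquad h_0 = x,$$

so anything read out of $h_\ell$ mixes the original embedding $x$ with every earlier layer's contribution, which is the sense of "feature mixing" referred to above.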