LESSWRONG

Jai Bhagat

https://jkbhagatio.io

Ph.D. in Computational & Systems Neuroscience

Actively working on digital models of biological brains, neural interfaces, and technical AI safety research (interp and evals)


Comments

Sorted by
Newest
No wikitag contributions to display.
Against Almost Every Theory of Impact of Interpretability
Jai Bhagat, 1mo

Do any of these recent papers change your view on the impact of interp for these theories?

1. Understanding misalignment (at least some initial insights): https://arxiv.org/html/2502.17424v2

2. Better prediction of future systems (interp for scaling): https://arxiv.org/abs/2303.13506

3. Auditing to reveal hidden objectives: https://www.anthropic.com/research/auditing-hidden-objectives

Activation space interpretability may be doomed
Jai Bhagat, 1mo

Nice post! Random thought -- problem 1 seems to be a problem in systems neuroscience as well.

Compressed Computation is (probably) not Computation in Superposition
Jai Bhagat, 2mo

Yes! But only if the mess is the residual stream, i.e., it includes $x$! This is the heart of the necessary "feature mixing" we discuss.

Posts

Compressed Computation is (probably) not Computation in Superposition
56 karma, 2mo, 9 comments