Jai Bhagat
https://jkbhagatio.io
Ph.D. in Computational & Systems Neuroscience. Actively working on building digital models of biological brains, neural interfaces, and technical AI safety research (interp and evals).
Do any of these recent papers within the last year change your view on interp impact for these theories?

1. Understanding misalignment (at least some initial insights): https://arxiv.org/html/2502.17424v2
2. Better prediction of future systems (interp for scaling): https://arxiv.org/abs/2303.13506
3. Auditing to reveal hidden objectives: https://www.anthropic.com/research/auditing-hidden-objectives
Nice post! Random thought -- problem 1 seems like a problem in systems neuroscience as well.
Yes! But only if the mess is the residual stream, i.e., includes $x$! This is the heart of the necessary "feature mixing" we discuss.
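
To make "includes $x$" a bit more concrete, here is a minimal sketch assuming a standard pre-norm transformer residual stream (the symbols $h_\ell$ and $f_i$ are my notation, not from the post):

$$h_\ell = x + \sum_{i=1}^{\ell} f_i(h_{i-1}), \qquad h_0 = x,$$

so anything read out of $h_\ell$ mixes the original embedding $x$ with every earlier layer's contribution, which is the sense of "feature mixing" referred to above.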