Top posts
josh :)
MS in AI at UT Austin. Interested in interpretability and model self-knowledge.
I am open to opportunities :)
Twitter: @joshycodes
Blog: joshfonseca.com/blog
TL;DR: LLMs can be trained to detect activation steering robustly. With lightweight fine-tuning, models learn to report when a steering vector was injected into their residual stream and often identify the injected concept. The best model reaches 95.5% detection on held-out concepts and 71.2% concept identification. Lead Author: Joshua Fonseca...
TL;DR: Letting a model overfit first, then applying Frobenius norm regularization, achieves grokking in roughly half the steps of Grokfast on modular arithmetic. I learned about grokking fairly recently, and thought it was quite interesting. It sort of shook up how I thought about training. Overfitting to your training...
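A minimal sketch of the kind of Frobenius norm penalty the preview above mentions, added to a task loss after an initial overfitting phase. Function names and the lambda value here are illustrative assumptions, not taken from the post:

```python
import numpy as np

def frobenius_penalty(weight_matrices, lam):
    """lam * sum of Frobenius norms of the model's weight matrices."""
    return lam * sum(np.linalg.norm(W, ord="fro") for W in weight_matrices)

def regularized_loss(task_loss, weight_matrices, lam=1e-3):
    # Total loss = original task loss + Frobenius penalty on the weights.
    return task_loss + frobenius_penalty(weight_matrices, lam)

W = np.array([[3.0, 4.0], [0.0, 0.0]])  # Frobenius norm = sqrt(9 + 16) = 5
print(regularized_loss(0.25, [W], lam=0.1))  # 0.25 + 0.1 * 5 = 0.75
```

In practice the penalty term would be computed over all weight matrices each step and minimized jointly with the task loss by the optimizer.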
TL;DR: I fine-tuned deepseek-ai/deepseek-llm-7b-chat to detect when its activations are being steered. It works: 85% accuracy on held-out concepts, 0% false positives. But two things bother me about what I found: First of all, if models can indeed learn to detect steering, steering-based evaluations might become unreliable. A model that...
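Both steering previews above refer to injecting a steering vector into the residual stream. A minimal sketch of that operation, assuming the common formulation h' = h + alpha * v with v a unit-normalized concept direction (the function name and values are illustrative, not from the posts):

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction): the steered activation."""
    v = direction / np.linalg.norm(direction)
    return hidden + alpha * v

h = np.zeros(4)                          # stand-in for a residual-stream vector
concept = np.array([2.0, 0.0, 0.0, 0.0])  # toy concept direction
h_steered = steer(h, concept, alpha=3.0)
print(h_steered)  # [3. 0. 0. 0.]
```

In a real model this addition would typically be applied inside a forward hook at a chosen layer; the detection task in the posts is then to classify, from the model's own behavior, whether such an injection occurred.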