Gerard Boxo — LessWrong

Gerard Boxo's Shortform

Are Sparse Autoencoders a good idea for AI control?

Based on a 2-day hackathon brainstorm. Current status: 70% of the tooling is done, but I'm unsure how to proceed. I don't have enough experience with multi-month projects to judge feasibility. I'm looking for feedback, specifically on my current implementation. The statement "SAEs could be useful for...

Dec 26, 2024

It's important to know when to stop: Mechanistic Exploration of Gemma 2 List Generation

TL;DR: Small language models are getting better at an accelerating pace, enabling the study of behaviors that just a few months ago were only observed in SOTA models. This, paired with the release of the "Gemma Scope" suite of Sparse Autoencoders by Google DeepMind, makes this kind...

Oct 14, 2024