Or, partly against discussion about generic “features” in mechanistic interpretability. Probably the most debated core concept in mechanistic interpretability is that of the “feature”: common questions include “are there non-linear features, and does this mean that the linear representation hypothesis is false?”, “do SAEs recover a canonical set of features,...
1. Late last week, researchers at Apple released a paper provocatively titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity”, which “challenge[s] prevailing assumptions about [language model] capabilities and suggest[s] that current approaches may be encountering fundamental barriers to generalizable...
See also: the research post detailing Claude's extended reasoning abilities and the Claude 3.7 System Card. About 1.5 hours ago, Anthropic released Claude 3.7 Sonnet, a hybrid reasoning model that interpolates between a normal LM and long chains of thought: > Today, we’re announcing Claude 3.7 Sonnet, our most intelligent...
Epistemic status: Soldier mindset. These are not (necessarily) our actual positions; they are positions we were randomly assigned, and for which we searched for the strongest arguments we could find, over the course of ~1 hr 45 mins. Sides: Ben was assigned to argue that it's ethical to work for...
One use case for model internals work is to perform automated auditing of models: https://www.alignmentforum.org/posts/cQwT8asti3kyA62zc/automating-auditing-an-ambitious-concrete-technical-research That is, given a specification of intended behavior, an attacker produces a model that doesn't satisfy the spec, and an auditor must determine how the model fails to satisfy it. This is closely related...
We recently released a paper on using mechanistic interpretability to generate compact formal guarantees on model performance. In this companion blog post to our paper, we'll summarize the paper and flesh out some of the motivation and inspiration behind our work. Paper abstract > In this work, we propose using...
Announcing the first academic Mechanistic Interpretability workshop, held at ICML 2024! I think this is an exciting development that's a lagging indicator of mech interp gaining legitimacy as an academic field, and a good chance for field building and sharing recent progress! We'd love to get papers submitted if any...