The Universe (which others call the Golden Gate Bridge) is composed of an indefinite and perhaps infinite series of spans...
Is the repeated use of “syncopathy” here a misspelling of “sycophancy”, or does it have a domain-specific meaning I'm not familiar with?
If you scroll down on the Feature Catalog (https://transformer-circuits.pub/2024/scaling-monosemanticity/features/index.html?featureId=1M_120374) you can view 1,000 randomly selected features from each run. This is a great way to get a sense for the average interpretability of features.
Oh, that's great! Was that recently changed? I swear I looked shortly after release and it just showed me a job ad when I clicked on a feature...
Easily Interpretable Summary of New Interpretability Paper
Anthropic has identified (full paper here) how millions of concepts are represented inside Claude Sonnet, their current middleweight model. The features activate across modalities and languages whenever the associated concept shows up in context. This scales up previous findings from smaller models.
By looking at which neurons appear in each feature's activation pattern, they defined a distance measure between features. So the Golden Gate Bridge is close to various San Francisco and California things, inner conflict is close to various related concepts, and so on.
Then it gets more interesting.
If you sufficiently amplify the feature for the Golden Gate Bridge, Claude starts to think it is the Golden Gate Bridge. As in, it thinks it is the physical bridge, and also it gets obsessed, bringing it up in almost every query.
If you amplify a feature that fires when reading a scam email, you can get Claude to write scam emails.
Turn up sycophancy, and it will go well over the top talking about how great you are.
They note they have discovered features corresponding to various potential misuses, forms of bias and things like power-seeking, manipulation and secrecy.
That means that, if you had the necessary access and knowledge, you could amplify such features.
Like most powers, one could potentially use this for good or evil. They speculate you could watch the impact on features during fine tuning, or turn down or even entirely remove undesired features. Or amplify desired ones. Checking for certain patterns is proposed as a ‘test for safety,’ which seems useful but also is playing with fire.
They have a short part at the end comparing their work to other methods. They note that dictionary learning need happen only once per model, and the additional work after that is typically inexpensive and fast, and that it allows looking for anything at all and finding the unexpected. It is a big deal that this allows you to be surprised.
They think this has big advantages over old strategies such as linear probes, even if those strategies still have their uses.
One Weird Trick
You know what AI labs are really good at?
Scaling. It is their one weird trick.
So guess what Anthropic did here? They scaled the autoencoders to Claude Sonnet.
Scaling worked on the usual log scales. More training compute decreased the error metrics roughly as a power law, and the compute-optimal number of features increases while the optimal learning rate decreases.
They check, and confirm that individual neurons are harder to interpret than features.
Here is the part with the equations full of symbols (as an image), if you want to get a full detail sense of it all.
Zvi Parses the Actual Symbol Equations
I am putting this in for myself and for those narrow and lucky few who both want to dive deep enough to understand this part or how I think about it, and also don’t know enough ML that they are saying ‘yes, obviously, you sir are an idiot, how are you only getting this now.’ And, yeah, okay, fair, but I’ve had a lot going on.
Everyone else can and should skip this.
As is often the case, my eyes glaze over when I see these kinds of equations, but if you stick with it (say, by asking Claude) it turns out to be pretty simple.
The first equation (decode) says that, given the model's internal activations, 'each feature fires some amount; multiply that by the fixed vector for that feature, add them up, and also add a constant vector' to reconstruct those activations.
All right, so yeah, black box set of vectors be vectoring. It would work like that.
The second equation (encode) says you take the input x, take the dot product with that feature's encoder vector, add that feature's bias term, then apply ReLU, which is just ReLU(x) = max(0, x). In 'English except math that clicks for me automatically rather than creating an ugh field,' that means it is a linear transformation of x (ax+b) in vector space, with a minimum of 0 for each component. Then you take that result and transform it a second time (decode).
Putting the ReLU in between these two steps, avoiding negative amounts of a feature in any given direction, gives you a form of non-linearity that corresponds to things we can understand, and that the training procedure finds easier to work with. I wonder how much effective mirroring this then requires, but at worst those are factor-of-two problems.
Then we have the loss function. The first term is the reconstruction loss, λ is a free scaling parameter, and the penalty term is the sum of each feature's activation strength times the magnitude of its decoder vector.
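For reference, here is my reconstruction of those three equations in standard notation. This is the textbook sparse-autoencoder setup as I read the paper, so treat the exact symbols as approximate rather than a transcription:

```latex
% Decode: reconstruct the activation x from the feature activations
\hat{x} \;=\; b^{\mathrm{dec}} \;+\; \sum_i f_i(x)\, W^{\mathrm{dec}}_{\cdot,\, i}

% Encode: each feature's activation is a ReLU of an affine function of x
f_i(x) \;=\; \mathrm{ReLU}\!\left( W^{\mathrm{enc}}_{i,\, \cdot}\, x + b^{\mathrm{enc}}_i \right)

% Loss: reconstruction error plus a sparsity penalty weighted by \lambda
\mathcal{L} \;=\; \mathbb{E}_x\!\left[\, \lVert x - \hat{x} \rVert_2^2 \;+\; \lambda \sum_i f_i(x)\, \lVert W^{\mathrm{dec}}_{\cdot,\, i} \rVert_2 \,\right]
```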
All right, sure, seems very ML-standard all around.
They focused on the residual stream halfway through the model, as it seemed likely to be more fruitful, but there is no word on whether they checked this assumption.
As usual, the trick is to scale. More compute. More parameters, up to a run with 34 million features. As the number of features rose, the percentage that were effectively dead (as in never firing across 10^7 tokens) went up, to 65% for the 34M model. They expect 'improvements to the training procedure' to improve this ratio. I wonder how many non-dead features are available to be found.
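To make 'effectively dead' concrete, here is my own hypothetical bookkeeping sketch (not their code): run the encoder over a long stream of activations and count the features that never fire. The `encode` function is a stand-in for the SAE encoder.

```python
import numpy as np

def count_dead_features(encode, activation_stream, n_features, n_tokens=10_000_000):
    """Count features that never fire over a stream of residual-stream activations.

    encode: function mapping one activation vector to a length-n_features array
            of non-negative feature activations (the SAE encoder).
    activation_stream: iterable yielding one activation vector per token.
    """
    ever_fired = np.zeros(n_features, dtype=bool)
    for i, x in enumerate(activation_stream):
        if i >= n_tokens:
            break
        ever_fired |= encode(x) > 0  # ReLU output is > 0 exactly when the feature fires
    dead = int(np.count_nonzero(~ever_fired))
    return dead, dead / n_features
```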
Identifying and Verifying Features
Selected features are highlighted as interpretable. The examples chosen are The Golden Gate Bridge, Brain Sciences, Monuments and Tourist Attractions, and Transit Infrastructure.
They attempt to establish this via a mix of specificity and influence on behavior. If the feature reliably predicts you’ll find the concept, and impacts downstream behavior, then you can be confident you are in the ballpark of what it is doing. I buy that.
They say ‘it is hard to rigorously measure the extent to which a concept is present in a text input,’ but that seems not that hard to me. They found current models are pretty good at the task, which I would have expected, and you can verify with humans who should also do well.
For their selected features, the correlations with the concept are essentially total when the activation is strong, and substantial even when the activation is weak, with failures often coming from related concepts.
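The specificity check, as I understand it, amounts to bucketing text samples by how strongly the feature fires, asking a judge (a strong model, or a human) whether the concept is actually present, and seeing whether strong activation implies the concept. A hypothetical sketch of that bookkeeping, with `feature_activation` and `concept_judge` as stand-ins:

```python
from collections import defaultdict

def specificity_by_bucket(samples, feature_activation, concept_judge, buckets):
    """For each activation bucket, what fraction of samples actually contain the concept?

    samples: list of text snippets.
    feature_activation: function text -> max activation of the feature on that text.
    concept_judge: function text -> bool, e.g. a strong LLM (or a human) asked
                   "does this text involve the Golden Gate Bridge?"
    buckets: list of (low, high) activation ranges.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for text in samples:
        a = feature_activation(text)
        for lo, hi in buckets:
            if lo <= a < hi:
                totals[(lo, hi)] += 1
                hits[(lo, hi)] += int(concept_judge(text))
                break
    # High buckets should be near 1.0 if the feature is specific to the concept.
    return {b: hits[b] / totals[b] for b in totals}
```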
For influence on behavior they do the obvious, behavior steering. Take a feature, force it to activate well above its maximum, see what happens. In the examples, the features show up in the output, in ways that try to make sense in context as best they can.
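Mechanically, as I understand it, steering means pinning a feature well above its observed maximum and adding the corresponding decoder direction back into the residual stream during the forward pass. The paper clamps the feature's value; the purely additive version below is a simplified stand-in. A hypothetical PyTorch sketch, with the model and SAE objects as stand-ins rather than anyone's real API:

```python
import torch

def make_steering_hook(feature_direction: torch.Tensor, strength: float):
    """Forward hook that adds a fixed multiple of one feature's decoder
    direction to the residual stream at every token position."""
    def hook(module, inputs, output):
        # output: residual-stream activations, shape (batch, seq_len, d_model)
        return output + strength * feature_direction
    return hook

# Hypothetical usage (object names are stand-ins, not Anthropic's API):
#   feature_dir = sae.W_dec[:, feature_id]      # (d_model,) decoder direction
#   handle = model.blocks[middle_layer].register_forward_hook(
#       make_steering_hook(feature_dir, strength=10 * observed_max_activation))
#   ... generate text, watch it bring up the Golden Gate Bridge ...
#   handle.remove()
```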
Three of the four features selected first are about physical objects, and the fourth is still a clear-cut concept. Selection effects are an obvious danger.
They then expand to ‘sophisticated features,’ with their example being a Code Error feature.
Their specificity evidence seems highly suggestive of reflecting errors in code, although on its own it is not conclusive. There are additional tests I’d run, which presumably they did run, and of course I would want n>1 sample sizes.
Steering positively causes a phantom error message, steering in reverse causes the model to ignore a bug. And there’s also:
That certainly sounds useful, especially if it can be made reliable.
They then look at a feature that fires on functions that perform addition, including indirectly. Which then when activated causes the model to think it is being asked to perform addition. Neat.
They divide features into clusters and types.
Cosine similarity gives you features that are close to other features. In the Golden Gate example you get other San Francisco things. They then do immunology and ‘inner conflict.’ They offer an interactive interface to explore these maps.
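Concretely, 'close' here just means high cosine similarity between feature vectors (I believe the decoder directions, though the same trick works with encoder vectors). A hypothetical sketch:

```python
import numpy as np

def nearest_features(W_dec, feature_id, k=10):
    """Return the k features whose decoder directions are most similar
    (by cosine similarity) to the given feature's direction.

    W_dec: (d_model, n_features) matrix of decoder directions.
    """
    dirs = W_dec / np.linalg.norm(W_dec, axis=0, keepdims=True)
    sims = dirs.T @ dirs[:, feature_id]           # cosine similarity to every feature
    order = np.argsort(-sims)
    return [(int(j), float(sims[j])) for j in order[1 : k + 1]]  # skip the feature itself
```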
The more features you can track at once, the more you see such related things splitting off from the big central feature. And it is clear that there are more sub-features waiting to be split off if you went to a larger number of feature slots.
There was a rough heuristic for when a feature got picked up by their methods: roughly, the concept has to show up frequently enough in the data relative to the number of available feature slots.
For types, they point out person features, country features, basic code features, list position features. Obviously this is not an attempt at a full taxonomy. If the intention is ‘anything you can easily talk about is a feature’ then I’d want to check more things.
We do have a claim that if you take a concept 'of the actual world' and go looking for it, then as long as it appears frequently enough among the tokens, chances are they would be able to find a corresponding feature in their map of Claude Sonnet.
When they wanted to search, they either used targeted single prompts, or used multiple prompts and looked for activations they had in common, in order to eliminate features related to things like syntax.
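A hypothetical sketch of the multiple-prompt version: take the top-firing features for each prompt and intersect them, so that features tied to shared syntax or formatting, rather than the shared concept, tend to fall away. The `top_features` helper is a stand-in for running the model plus the SAE encoder.

```python
def candidate_features(prompts, top_features, k=50):
    """Guess which features track the concept the prompts share.

    top_features(prompt, k): set of the k feature ids with the highest
    activation anywhere in that prompt (a stand-in, not Anthropic's API).
    """
    feature_sets = [top_features(p, k) for p in prompts]
    # Features that fire on every prompt about the concept survive;
    # features tied to any one prompt's phrasing drop out.
    return sorted(set.intersection(*feature_sets))

# Usage sketch: prompts that all mention the Golden Gate Bridge in different
# phrasings and languages, so only bridge-adjacent features remain.
```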
What I do not see here is a claim that they took random live features in their map and consistently figured out what they were. They do say they used automated interpretability to understand prompts, but I don't see an experiment measuring how reliable that is.
Features as Computational Intermediates
One use of features is as computational intermediates: concepts the model passes through on the way to its answer. They verify this by attribution and ablation.
They offer the example of emotional inferences (John is sad) and multi-step inference (Kobe Bryant → Los Angeles Lakers → Los Angeles → California (+ capitals) → Sacramento).
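Ablation here means removing one feature's contribution mid-forward-pass and seeing how much the downstream answer moves; attribution estimates the same quantity from gradients without intervening. A hypothetical sketch of the ablation side, with `run_model`, `sae_encode`, and `sae_decode` as stand-ins for whatever the real objects expose:

```python
def ablation_effect(run_model, sae_encode, sae_decode, feature_id, prompt, answer_token):
    """Effect of removing one feature on the logit of the expected answer token.

    run_model(prompt, resid_edit=None) -> logits, where resid_edit is an optional
    function applied to the middle-layer residual stream (stand-in API).
    """
    def zero_feature(resid):
        f = sae_encode(resid)             # feature activations at this layer
        f_ablated = f.clone()
        f_ablated[..., feature_id] = 0.0  # clamp the one feature to zero
        # Swap out only the part of the residual the SAE explains.
        return resid - sae_decode(f) + sae_decode(f_ablated)

    base = run_model(prompt)[..., -1, answer_token]
    ablated = run_model(prompt, resid_edit=zero_feature)[..., -1, answer_token]
    return (base - ablated).item()        # large positive value = the feature mattered
```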
I notice that if you are already activating all of those, it means Claude has already 'solved for the answer' of the capital of the state where Bryant played. So it is a weird situation, and seems worth thinking about more.
They do note that the highest ablation effect features, like those in the causal chain above, are not reliably the features that fire most strongly.
Oh That’s the Deception Feature, Nothing to Worry About
Now that we have features, the search was on for safety-relevant features.
These features are not only unsurprising. They have to exist. Humans are constantly engaging in, motivated by and thinking about all these concepts. If you try to predict human text or the human reaction to text or model a world involving people, and you don’t include deception, you are going to have a bad time and be highly confused. Same goes with the other concepts, in contexts that involve them, although their presence is less universal.
Nor should it be surprising that, once you first identify features in a 4-or-lower-level model such as Sonnet that has not had optimization pressure placed on its internal representations, cranking the associated features up or down will impact the behaviors, or that the activations can be used as detectors.
There are several warnings not to read too much into the existence of these safety-related features. To me that doesn’t seem necessary, but I do see why they did it.
Then they get into details, and start with unsafe code features. They find three, one for security vulnerabilities, one for bugs and exceptions, one for backdoors. The conclusion is that pumping up these activations causes Claude to insert bugs or backdoors into code, and to hallucinate seeing problems in good code.
Next up are bias features, meaning things like racism, sexism, hatred and slurs. One focuses on ‘awareness of gender bias in professions,’ which when amplified can hijack responses to start talking about gender bias.
I love the detail that when you force activate the slur feature, Claude alternates between using slurs and saying how horrible it is that Claude is using slurs. They found this unnerving, and I didn’t instinctively predict it in advance, but it makes sense given the way features work and overload, and the kind of fine-tuning they did to Sonnet.
The sycophancy features do exactly what you would expect.
Deception, power seeking and manipulation are the cluster that seems most important to understand. For example, they note a feature for ‘biding time and hiding strength,’ which is a thing humans frequently do, and another for coups and treacherous turns, again a popular move.
Yes, turning the features up causes Claude to engage in the associated behaviors, including lying to the user, without any other reason to be doing so.
In general, it is almost charming the way Claude talks to itself in the scratchpad, as if it was trying to do a voice over for a deeply dense audience.
They try to correct for deception in a very strange way, via a user request that the model forget something, which Claude normally is willing to do (as it should, I would think) and then turning up the ‘internal conflicts and dilemmas’ feature, or the ‘openness and honesty’ feature.
This felt strange and off to me, because the behavior being 'corrected' seems fine in principle, so it felt kind of weird that Claude considers it in conflict with openness and honesty. But then Davidad pointed out that the default was obviously dishonest, and he's right, even if it's a little weird, as it's dishonesty in the social-fiction, game-playing sense.
In some ways this is more enlightening than finding an actually problematic behavior, as it shows some of the ‘splash damage’ happening to the model, and how concepts bleed into each other.
As they say, more research is needed.
Next up are the criminal or dangerous content features, your bioweapon development and scam emails. There is even a general harm-related feature, which makes things easy in various ways.
Sense of model identity has features as well, and a negative activation of ‘AI Assistant’ will cause the model to say it is human.
What Do They Think This Means for Safety?
They are excited to ask what features activate under what circumstances, around contexts where safety is at issue, including jailbreaks, or being a sleeper agent, or topics where responses might enable harm.
They suggest perhaps such interpretability tests could be used to predict whether models would be safe if deployed.
They cite the fact that features fire for both concrete and abstract versions of an underlying concept as a reason for optimism; it seems non-obvious to me that this is optimistic.
They also note that the generalization holds for images as well, and that does seem optimistic on many levels. It is exciting for the generalization implication, and also this seems like a great way to work with image models.
The discussion section on safety seemed strangely short. Most of the things I think about in such contexts, for good or ill, did not even get a mention.
Limitations
I always appreciate the limitations section of such papers.
It tells you important limitations, and it also tells you which ones the authors have at top of mind or appreciate and are happy to admit, versus which ones they missed, are not considering important or don’t want to dwell upon.
Their list is:
What is interestingly missing from that list?
The most pedestrian would be concerns about selection. How much of this is kind of a demo, versus showing us typical results?
Then there is the question of whether all this is all that useful in practice, and what it would take to make it useful in practice, for safety or for mundane utility. This could be more ‘beyond scope’ than limitations, perhaps.
The final issue I would highlight has been alluded to a few times, which is that the moment you start trying to measure and mess with the internals like this, and make decisions and interventions on that basis, you are in a fundamentally different situation than you were when you started with a clean look at Claude Sonnet.
Researcher Perspectives
Interpretability lead Chris Olah thinks this is a big deal, and has practical safety implications and applications. He looks forward to figuring out how to update.
Jack Lindsey is excited, finds the whole thing quite deep and often surprising.
Other Reactions
Kevin Roose writes it up in The New York Times, calling it actual good news that could help solve the black-box problem with AI, and allow models to be better controlled. Reasonably accurate coverage, but I worry it presented this as a bigger deal and more progress than it was.
Many noticed the potential mundane utility.
I Am the Golden Gate Bridge
For a limited time, you can chat with the version of Sonnet that thinks it is the Golden Gate Bridge.
A great time was had by all.
Some also find it a little disturbing to tie Claude up in knots like this.
Golden Gate Bridges Offer Mundane Utility
Could this actually be super useful?
The Value of Steering
Sure, it is fun to make the model think it is the Golden Gate Bridge. And it is good verification that the features are roughly what we think they are. But how useful will this tactic be, in general?
My guess is this is absolutely super practically useful on many levels.
The slider, or the developer properly calibrating the slider and choosing the details more intentionally, seems great. There are all sorts of circumstances where you want to inject a little or a lot of personality, or a reaction, or attention to a topic, or anything else.
The educational applications start out obvious and go from there.
The game designer in me is going wild. Imagine the characters and reactions and spells you could build with this. Role playing taken to an entirely different level.
Imagine this as a form of custom instructions. I can put on various modifiers and slide them up and down as appropriate. Where is the brevity feature? Where is the humor feature? Where is the sycophancy feature, so I can turn it negative? Where are the 'brilliant thought' and 'expert' and 'mastery' features?
Imagine essentially every form of painstakingly persnickety prompt engineering. Now imagine it distilled into this form.
Or, imagine a scaffolding that reads your query, figures out what features would give a better answer, and moves around those features. A kind of ‘self-mindfulness’ chain of thought perhaps.
And that’s the first minutes of brainstorming, on a fully helpful and wholesome level.
The correct default value of many features is not zero.
For one kind of non-obvious educational tool, imagine using this as an analyzer.
Feed it your texts or a recording of your meeting, and then see where various features activate how much. Get a direct measure of all the vibes. Or practice talking to the AI, and have real-time and post-mortem feedback on what emotions were showing or triggered, what other things were getting evoked and what was missing the mark.
A lot of this excitement is doubtless of the form ‘no one is bothering to do the obvious things one can do with LLMs, so everything is exciting, even if it isn’t new.’
I do think this potentially has some huge advantages. One is that, once you have done the dictionary analysis, you get this additional information essentially 'for free' whenever you run the model, and indeed you can potentially stop the forward pass partway through to save on inference. This could let you figure out a lot of things a lot more cheaply, especially with zero output tokens.
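A hypothetical sketch of the analyzer idea from a few paragraphs up, combined with this cost point: run the model only up to the layer the dictionary reads from, apply the encoder, aggregate, and never generate a single output token. All of the names here are stand-ins:

```python
def vibe_report(text, run_to_middle_layer, sae_encode, feature_names, top_n=20):
    """Crude 'vibes analyzer': which features fire most, and how strongly, on a text.

    run_to_middle_layer: text -> (seq_len, d_model) residual activations at the SAE's
        layer; a partial forward pass, so no output tokens are generated (stand-in API).
    sae_encode: activations -> (seq_len, n_features) non-negative feature activations.
    feature_names: dict feature_id -> human-readable label from the feature map.
    """
    resid = run_to_middle_layer(text)
    acts = sae_encode(resid)
    per_feature = acts.max(axis=0)                 # strongest firing anywhere in the text
    top = per_feature.argsort()[::-1][:top_n]
    return [(feature_names.get(int(i), f"feature_{int(i)}"), float(per_feature[i]))
            for i in top if per_feature[i] > 0]
```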
A lot of this is a mix of ‘you can get it to focus on what you want’ and also ‘you can find out the information you are looking for, or get the type of thing you want, a lot faster and cheaper and more precisely.’
You can also use this as a kind of super-jailbreak, if you want that and are given the access. The good news is that the compute costs of the dictionary analysis are non-trivial, so this is not a '$100 and two hours' path to a full jailbreak of a given open model. It might, however, not be too expensive compared to the alternatives, especially if what you want is relatively broad.
Or imagine a service that updates your adjustments automatically, in the background, based on your feedback, similar to what we do on many forms of social media.
Or maybe don’t imagine that, or much of that other stuff. It is easy to see how one might be creating a monster.
The good news is that this kind of ‘steering for mundane utility’ seems like it should optimize for better interpretability, and against worse interpretability. As opposed to when you use this as a control mechanism or safety strategy, where you are encouraging the opposite in important senses.
To What Extent Did We Know This Already?
The paper acknowledges that what it does is take something that worked on much smaller models, a known technique, and scale it up to Claude Sonnet. Then it found a bunch of cool results, but was any of it surprising?
Should we be impressed? Very impressed?
Stephen Casper makes the case for no.
He started out on May 5, making a list of ten predictions.
Here was his follow-up thread after the paper came out:
He offers his full thoughts at the Alignment Forum.
My read is that this is a useful directional corrective to a lot of people who got overexcited, but it is holding Anthropic and this paper to an unreasonable standard.
I do think the listed additional achievements would have been cool and useful. I expect most of them to happen within a year. I do not think it would have made sense to hold the paper while waiting on those results. Result #6 is the closest to ‘this seems like it should have made it in’ but at some point you need to ship.
A lot of this is that Casper is asking the question ‘have you shown that using this technique for practical purposes outcompetes alternatives?’ whereas Anthropic was asking the question ‘does this technique work and what can it do?’
I see a lot of promise for how the SAE approach could end up being superior to my understanding of previous approaches, based on being able to rely largely on fixed costs and then getting a wide array of tools to use afterwards, including discovering unexpected things. I do agree that work lies ahead.
I centrally strongly endorse John Pressman here:
I also think Nanda is on point, that a lot of people don’t know the prior work including by the same team at Anthropic, and are treating this as far more surprising and new than it is.
How serious is the 'inability to evaluate' problem? Casper says the standard is 'usefulness for engineers.' That metric is totally available. I think Anthropic is trying to aim for something more objective, something that better measures future usefulness where it counts, versus worrying about practical competitiveness now. I do not think this represents a paradigm mistake? But perhaps I am not understanding the problem so well here.
As a semi-aside here: The mention of 'not trained on any instances of defection' once again makes me want to reiterate that there is no such thing as a model not trained on instances of deception, unless you are talking about something like AlphaFold. Any set of text you feed into an LLM or similar system is going to be full of deception. I would fully expect you to be able to suppress a sleeper agent using this technique without having to train on something that was 'too explicitly deceptive'; the real problem is how you would even assemble a data set that counts. I suppose you could try using an automated classifier and see how that goes.
Is This Being Oversold?
Casper highlights the danger that Anthropic is doing ‘safety washing’ and presenting its results as more important than they are, and claims that this is having a detrimental effect, to the point of suggesting that this might be the primary motivation.
I am rather confident that the safety team is indeed primarily motivated by trying to make real safety progress in the real world. I see a lot of diverse evidence for this conclusion, and am confused how one could take the opposite perspective on reflection, even if you think that it won’t be all that useful.
What that does not preclude is the possibility that Anthropic is de facto presenting the results in a somewhat hype-based way, making them look like a bigger deal than they are, and that perhaps the final papers are partially sculpted on that basis, along with the outside messaging. That is an entirely reasonable thing to worry about. Indeed, I did get the sense that in many places this was putting the best possible foot forward.
From the outside, it certainly looks like some amount of streetlight demo and cherry-picked examples happening here. I have been assured by one of the authors that this was not the case, and the publication of the 3,000 random features is evidence that the findings are more robust.
Whether or not there was some selection, this is of course far from the worst hype going around these days. It is, if anything, above-average in responsibility. I still think we can and must do better.
I also continue to note that while I am confident Anthropic’s internal culture and employees care about real safety, I am troubled by the way the company chooses to communicate about issues of safety and policy (in policy, Anthropic reliably warns that anything that might work is too ambitious and unrealistic, which is very much Not Helping Matters.)
Casper points to this thread, which is from 2023 as a reaction to a previous Anthropic paper, in which Renji goes way too far, and ends up going super viral for it.
[thread continues]
This is Obvious Nonsense. The hard problem is not solved, AGI is not now sure to be safe, it is very early days. Of course, there are many motivations for why one might make claims like this, with varying degrees of wilful misunderstanding. I do think that Anthropic in many ways makes it easier to incorrectly draw this conclusion and for others to draw similar less crazy ones, and they should be more careful about not sending out those vibes and implications. Chris Olah and other researchers are good at being careful and technical in their comments, but the overall package has issues.
Again, this is far from the worst hype out there. High standards still seem valid here.
The other example pointed to is when a16z flagrantly outright lied to the House of Lords:
As I said at the time: This is lying. This is fraud. Period.
Crossing the Bridge Now That We’ve Come to It
It is exciting that this technique seems to be working, and that it scales to a model as large as Claude Sonnet. There is no reason to think it could not scale indefinitely, if the only issue was scale.
There are many ways to follow up on this finding. There are various different practical tasks that one could demonstrate as a test or turn into a useful product. I am excited by the prospect to make existing AIs easier to steer and customize and make them more useful and especially more fun. I am also excited by the opportunity to better understand what is happening, and develop new training techniques.
One worry is what happens when we start putting a bunch of optimization pressure on the results of interpretability tests like this. Right now a model like Claude Sonnet is (metaphorically, on all meta levels) choosing its internal pathways and states without regard to what would happen if someone looked inside or ran an analysis. That is going to change. Right now, we are dealing with something we are smart enough to understand, that mostly uses concepts we can understand in combinations we can understand, that is incapable of hiding any of it. That too might change.
That especially might change if we start using internals in these ways to guide training, which may get highly tempting. We need to be very careful not to waste this opportunity, and not to rely on such things beyond what they can handle, or to do exactly the things that cause the correlations we are counting on to break precisely when we need them most, in ways that are hard to detect.
The other worry is that we could get overconfident. This could be treated as more progress than it is; interpretability could be treated as solved or on track to be solved, as opposed to us having made non-zero progress while still being far behind schedule. Knowing about a bunch of features is a long way from where we need to get, even if follow-ups show that we can reliably steer with these features, reliably identify what a given feature means, and, for a given meaning, identify the nearest feature.
Some of that will be genuine misunderstanding and overexcitement. Some of it likely will be the result of hype from various sources. And then there's the part that involves people like a16z lying their asses off and using these sorts of results as a fig leaf.
And of course, like any good tool, it is useful for many things, both good and bad.
Mostly, this paper is good news. I look forward to what we find out next.