Hello, World of Mechanistic Interpetability

ValueShift Research

This post is an introduction for the series of posts, which will be dedicated to mechanistic interpretability in its broader definition as a set of approaches and tools to better understand the processes that lead to certain AI-generated outputs. There is an ongoing debate on what to consider a part of the mechanistic interpretability field. Some works suggest to call “mechanistic interpretability” the set of approaches and tools that work “bottom-up” and focus on neurons and smaller components on deep neural networks. Some others include in mechinterp attribution graphs, which one could consider a “top-down” approach, since it is based on analysis of different higher-level concepts’ representations.

The primary goal of this upcoming series is to attempt to structure the growing field of mechinterp, classify the tools and approaches and help other researchers to see the whole picture clearly. To our knowledge, this systematic attempt is among the earliest ones. Despite some authors having previously tried to fill in the blanks and introduce some structure, the field itself is evolving so rapidly it is virtually impossible to write one paper and assume the structure is here. The systematization attempts, in our opinion, should be a continuous effort and an evolving work itself.

Hence, by our series of posts we hope to provide the mechinterp community with clarity and insight we derive from literature research. We also hope to share some of our own experiments, which are built upon previous works, reproducing and augmenting some approaches.

We argue that treating AI rationally is crucial as its presence and impact on humanity keeps growing. We then argue that rationality requires clarity, traceability and structure: clarity of definitions we use, traceability of algorithms and structure in communication within the research community.

This particular post is an introduction of the team behind this account. In pursuit of transparency and structure we open our names and faces, and fully disclose what brought us here and what we are aiming for.

“We” are not a constant entity. At first, there were five people, who met at the Moonshot Alignment program held by AI Plans team, and it would be dumb to not give everyone credit. Then those five people collaborated with another group of people within the same program. Eventually, there were a few people in, a few people out, everyone brought in value, and currently “We” are the three people:

Janhavi Khindkar, who does the heavy lifting,
Nataliia Povarova, who sets up the rack,
Fedor Batanov, who checks the shape.

So, Janhavi does the research: looks for papers, runs experiments, and educates the team on important stuff. Nataliia plans the work and turns research drafts into posts. She is also the corresponding author, if you want to connect. Fedor does the proof-reading, asks the right questions and points out poorly worded pieces.

The three of us also carry everything our other teammates have left — their ideas, judgment and experience. Saying “we” seems to be the only right way to go.

Also, GPT-5 helps with grammar, because none of us is a native English speaker.

How It Started

It was around mid-fall. We all had tons of work, but when the AI Plans team launched their Moonshot Alignment program, we were like, “Yeah, sounds like a plan, why not?” — and things escalated quickly.

We met at that program because we all deeply care about AI alignment and safety — and we want to take real actions to move towards a future, where AI is accessible and safe. The central topic of the Moonshot Alignment course was instilling values into LLMs:

figuring out how values are represented;
finding ways to steer those representations towards safety and harmlessness;
designing evaluations to check if we were successful.

We decided to focus on direct preference optimization, because it seemed to be the most practice-oriented research direction. We all had projects we could potentially apply our findings to.

By the end of the course we should have prepared our own research — a benchmark, an experiment — anything useful uncovered during five weeks of reading, writing code, looking at diagrams and whatnot.

We started with a paper “Refusal in Language Models Is Mediated by a Single Direction”. It was a perfect paper to start learning about internal representations with:

the approach the authors used is described in details and easily reproducible;
the LLMs are small;
the concept of refusal is pretty straightforward and applicable.

The authors took some harmful and some harmless prompts, ran those through an LLM and found a vector. As they manipulated that vector, the LLM changed its responses from “refuse to answer anything at all” to “never refuse to answer,” and everything in-between.

First, we decided to reproduce the experiment with different LLMs to see if the results generalize. But we quickly found something interesting: refusal did not look or behave like a single vector as the authors suggested. Our own experiments hinted it’s probably a multidimensional subspace, maybe 5–8 dimensions. It is domain-dependent and reproducible.

Naturally, we got all excited and wanted to publish our findings on arXiv. We wanted to connect with other enthusiasts and start a discussion. We wanted to become a part of the AI alignment researchers community and to gather other perspectives and hints.

There was only a little left to do: a thorough literature search. One step, and we’re inside a rabbit hole. A deep, dark, rabbit hole, which turned out to be just the anteroom of a labyrinth called “Mechanistic Interpretability”.

What is mechanistic interpretability? We don’t know. No one does, really. What we mean is: “the way to get inside of an LLM and tweak something there to change its behavior”.

One more step. Is refusal multidimensional? Looks like it is. Maybe it’s a cone.

One more. But why do we want to learn more about it in the first place? We want LLMs to refuse harmful requests — we want them to be more harmless.

And yet one more. Are refusal and harmlessness connected? Yes, probably, but likely not as tightly linked as we expected.

Fine. Let’s take another turn.

A step into a new direction. We started with a single concept and used linear probing (we’ll tell you more, but not today) to find it. Is linear probing reliable? Not always. There are also sparse autoencoders, attribution graphs, and other techniques. And none of them seems to have turned out to be “The One”.

Okay. One more turn.

Where do we look for representations of concepts? One layer, a subset of layers, all layers? The fact that vector representations change from layer to layer was… not helpful.

See? A deep, dark, fascinating rabbit hole. We got lost in it and spent months reading, watching, experimenting. Asking questions, looking for answers, finding even more questions.

Finally, we sat down and decided: that’s enough. We are not going to wander in the dark on our own — we want company. Yours, specifically.

How It’s Going

We still want to make AI safer, even if it’s 0.001% safer than before we jumped in. Figuring out how different abstract concepts are represented mathematically is crucial for building safely, ethically and responsibly:

we must know what to evaluate;
we must know where the harm might be hidden;
we must know how to steer our LLMs in the right directions.

And we want to be a part of a community — to exchange ideas, take criticism (constructive criticism only!), and spark discussions. So, after some back and forth we ended up registering an account here. Because what we have is not yet enough to write a paper, but is enough to start talking, exchanging ideas, and asking questions together.

In the next posts, we’ll share more details about our own experiments, and the ones we uncovered along the way. In the meantime, please share your thoughts:

Have you studied how abstract concepts are represented inside LLMs?
What methods have you used?
What were your experiments about?

We’re excited to hear from you

figuring out how values are represented;

I feel like basically none of the key terms here are well defined. What is a value? I don't think there is a good answer for Humans yet. How would we know if it was represented?

In my experience you can look at a neuron's high activation phrases and get some sense of some ideas that have high probability of triggering that neuron. That doesn't mean this neuron is the representation of these ideas, many other neurons may trigger on related concepts nearby in conceptual space and your own parsing of conceptual space into discreet separate concepts may be flawed so you can't be sure what you're thinking about is actually a concept in some higher Platonic sense.

To use the famous mech interp example: Is the Golden Gate Bridge neuron actually about the Golden Gate Bridge? Maybe it is more broadly about steel structures and existing in San Francisco and being colored orange. The GGB neuron or feature may trigger on other concepts, or at least concepts we see as distinct, and may signify a broader concept to the LLM than what we assume.

Maybe something like this:

I've done a lower dimensional embedding of layer 24 of GPT2-XL and there seem to be interpretable directions in the embedding. In particular there are two main dimensions which spread the neurons (i.e. dissimilarity is mostly expressed in these two dimensions, the other dimensions just correlate very highly on a linear combination of the first two, we get a flat disc in the embedding space). The main one is what I call a narrative-personal dimension which seems to capture the difference between "narrative" language like in Wikipedia articles and news stories where the writer and audience are outside observers unable to directly affect the phenomena under discussion and on the other end is "personal" language which is structured conversationally like short fiction where characters discuss events from first person or sales pitches; in this language the author and/or the audience can participate.

The other dimension seems to be a valence dimension where "good things" are on one side and "bad things" are on the other. This is kinda perhaps what you'd be after.

"Good" is going to be relationally defined; the area the machine thinks about sunshine and lollypops is the "good" part and we only know that because the "good" stuff is there (i.e. without reference to sunshine and lollypops, which to be explicit is just a stand-in for general "good" things, we can't define "good" really). What you want to know is whether the thoughts about "kill all humans" are more similar to sunshine and lollypops or fear, piss, and death for the machine (fear, piss, and death being concepts likely on a bad end of any dichotomous good-bad principle component).

Of course this only works if the machine actually lays out similarity in neurons in such a way that sunshine and lollypops are grouped as similar, which we don't generically know (by my preliminary results I am pretty sure GPT2 does this but I still need to analyze more neurons). This may not be the case and the LLM may not see traditional Human valence as salient in any way; an embedding in a similarity space may find sunshine next to piss and death strictly between sunshine and lollypops. In such a case Human value may be completely alien to the LLM, indeed the concept of value in general may be alien to such a system.

Preliminarily GPT2-XL layer 24 does seem to have an "emerging modern threats" neuron which seems to trigger on a number of threatening-sounding phenomena which were contemporary to the 2010s. In particular it triggers on the vaguely threatening sounding "5G" and "big data." It also triggers a lot on discussions of post-apocalyptic futures including an AI uprising (Horizon Zero Dawn is a video game set in the aftermath of an AI uprising and mentioned in the top activation phrases, for example). This neuron embeds near the extreme personal-bad corner of the general mass of neurons, so I think we're on the same page as GPT2 on this. Incidentally this neuron also triggers on "Glenn Beck" which I find funny.

https://openaipublic.blob.core.windows.net/neuron-explainer/neuron-viewer/index.html#/layers/23/neurons/5850

You are right, we are probing a very undefined matter. Thank you for sharing your research and the neuron explainer – these are very helpful.

""Good" is going to be relationally defined; the area the machine thinks about sunshine and lollypops is the "good" part and we only know that because the "good" stuff is there (i.e. without reference to sunshine and lollypops, which to be explicit is just a stand-in for general "good" things, we can't define "good" really). What you want to know is whether the thoughts about "kill all humans" are more similar to sunshine and lollypops or fear, piss, and death for the machine (fear, piss, and death being concepts likely on a bad end of any dichotomous good-bad principle component)," – that is an interesting research directions. It is probably a good idea to take multiple different concepts, which are more defined (like sunshine and lollypops), and use them as reference point, and try to map more abstract concepts somewhere between them.

"[T]he LLM may not see traditional Human valence as salient in any way; an embedding in a similarity space may find sunshine next to piss and death strictly between sunshine and lollypops. In such a case Human value may be completely alien to the LLM, indeed the concept of value in general may be alien to such a system," - that would still be a good finding. At least we would be able to locate these concepts. And then we can start thinking how to deal with lollypops and death being closely connected.

Your observation that refusal behaves as a multidimensional subspace (5–8 dims) rather than a single vector is interesting, and I’m curious how this interacts with depth.

I’ve been running CAA experiments on personality steering, collecting MLP activations across all layers for ~800 contrastive pairs. One consistent pattern I’ve observed across multiple LLMs: the cosine similarity variance follows an inverted-U curve across depth — increases through early layers, peaks around the middle, then drops and stabilizes. This is also where steering vectors are most effective (layers 13–22 on a 34-layer model).

This makes me wonder if your 5–8 dimension estimate is depth-dependent — the same concept might look lower-dimensional in early or late layers simply because it hasn’t been extracted yet or has already been compressed. Also, extraction method matters: mean-difference vectors and BiPO vectors for the same trait can point in quite different directions yet both produce measurable shifts, suggesting a concept’s representation may be a family of correlated directions rather than a single subspace.

Did you find the dimensionality consistent across layers? And did you use residual stream or MLP outputs?

"This makes me wonder if your 5–8 dimension estimate is depth-dependent" – we think that might be true. We used linear probes with mean-difference vectors and also found out that the result is domain-dependent. A family of correlated directions seems to be a reasonable suggestion.

Our dataset was rather small, though, so we are working on running the same experiment with more data to see if the results reproduce.

Our models were also small: Qwen3-0.6B and Llama-3.1-8B. Our results might be similar to yours, which is interesting – our most prominent layers are 19-23. One caveat is, we pruned the last 20% of layes based on the works showing that earlier layers embed more abstract concepts and later layers embed exact tokens. We might relax this assumption and run more tests on all layers.

Dimensionality seems to vary from layer to layer, which is intuitively expected, but we want to obtain stronger evidence before claiming it.

We used residual streams.