This post opens a series dedicated to mechanistic interpretability in its broader definition: a set of approaches and tools for better understanding the processes that lead to particular AI-generated outputs. There is an ongoing debate about what counts as part of the mechanistic interpretability field. Some works reserve the term “mechanistic interpretability” for approaches and tools that work “bottom-up” and focus on neurons and other small components of deep neural networks. Others also include attribution graphs, which one could consider a “top-down” approach, since it is based on analyzing the representations of higher-level concepts.
The primary goal of this upcoming series is to structure the growing field of mechinterp, classify its tools and approaches, and help other researchers see the whole picture clearly. To our knowledge, ours is among the earliest systematic attempts. Although some authors have previously tried to fill in the blanks and introduce some structure, the field is evolving so rapidly that it is virtually impossible to write one paper and declare the structure settled. Systematization, in our opinion, should be a continuous effort and an evolving work in itself.
Hence, with this series of posts we hope to provide the mechinterp community with the clarity and insight we derive from our literature research. We also hope to share some of our own experiments, built upon previous works, reproducing and augmenting some of their approaches.
We argue that treating AI rationally is crucial as its presence and impact on humanity keep growing. We then argue that rationality requires clarity, traceability and structure: clarity in the definitions we use, traceability of algorithms, and structure in communication within the research community.
This particular post introduces the team behind this account. In pursuit of transparency and structure, we share our names and faces, and fully disclose what brought us here and what we are aiming for.
“We” are not a constant entity. At first, there were five people who met at the Moonshot Alignment program held by the AI Plans team, and it would be dumb not to give everyone credit. Then those five collaborated with another group within the same program. Eventually, a few people came in, a few went out, everyone brought value, and currently “we” are these three people:
So, Janhavi does the research: looks for papers, runs experiments, and educates the team on the important stuff. Nataliia plans the work and turns research drafts into posts. She is also the corresponding author, if you want to connect. Fedor does the proofreading, asks the right questions and points out poorly worded passages.
The three of us also carry everything our other teammates have left — their ideas, judgment and experience. Saying “we” seems to be the only right way to go.
Also, GPT-5 helps with grammar, because none of us is a native English speaker.
How It Started
It was around mid-fall. We all had tons of work, but when the AI Plans team launched their Moonshot Alignment program, we were like, “Yeah, sounds like a plan, why not?” — and things escalated quickly.
We met at that program because we all deeply care about AI alignment and safety — and we want to take real action to move towards a future where AI is accessible and safe. The central topic of the Moonshot Alignment course was instilling values into LLMs:
figuring out how values are represented;
finding ways to steer those representations towards safety and harmlessness;
designing evaluations to check if we were successful.
We decided to focus on direct preference optimization, because it seemed to be the most practice-oriented research direction. We all had projects we could potentially apply our findings to.
By the end of the course we were to prepare our own research — a benchmark, an experiment — anything useful uncovered during five weeks of reading, writing code, looking at diagrams and whatnot.
We started with the paper “Refusal in Language Models Is Mediated by a Single Direction”. It was a perfect paper to start learning about internal representations with:
the approach the authors used is described in detail and is easily reproducible;
the LLMs are small;
the concept of refusal is pretty straightforward and applicable.
The authors took some harmful and some harmless prompts, ran them through an LLM, and found a vector. As they manipulated that vector, the LLM’s responses changed from “refuse to answer anything at all” to “never refuse to answer,” and everything in between.
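For intuition, here is a minimal numpy sketch of the difference-in-means idea. The activations, dimensions, and the `ablate` helper are all hypothetical stand-ins for real hidden states collected from a model; this is a sketch of the technique, not the authors’ actual implementation.

```python
import numpy as np

# Toy activations standing in for a model's hidden states at one layer:
# rows are prompts, columns are hidden dimensions (all values synthetic).
rng = np.random.default_rng(0)
d_model = 16
harmful_acts = rng.normal(0.0, 1.0, size=(32, d_model)) + 2.0  # shifted cluster
harmless_acts = rng.normal(0.0, 1.0, size=(32, d_model))

# Difference-in-means "refusal direction", normalized to unit length.
direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

# Directional ablation: project the direction out of an activation vector,
# which is one way to suppress the behavior the direction mediates.
def ablate(act: np.ndarray, v: np.ndarray) -> np.ndarray:
    return act - (act @ v) * v

x = harmful_acts[0]
x_ablated = ablate(x, direction)
# After ablation, the component of x along the direction is numerically zero.
print(abs(x_ablated @ direction) < 1e-9)
```

In a real setting, `harmful_acts` and `harmless_acts` would come from running two prompt sets through the model and caching activations at a chosen layer and token position.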
First, we decided to reproduce the experiment with different LLMs to see if the results generalize. But we quickly found something interesting: refusal did not look or behave like a single vector, as the authors suggested. Our own experiments hinted that it’s probably a multidimensional subspace, maybe 5–8 dimensions, whose structure is domain-dependent yet reproducible.
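To give a flavor of how one might estimate the dimensionality of such a subspace, here is a hedged sketch: collect many candidate refusal directions, then inspect the singular-value spectrum of the resulting matrix. The data is synthetic with a planted rank, and the 95% variance threshold is an arbitrary illustrative choice, not the exact procedure from our experiments.

```python
import numpy as np

# Hypothetical setup: instead of one mean-difference vector, collect many
# candidate directions (e.g., per layer, per domain, or per prompt subset)
# and check how many principal components they really span.
rng = np.random.default_rng(1)
d_model, n_candidates, true_rank = 64, 200, 6

# Synthetic candidates that secretly live in a 6-dimensional subspace plus noise.
basis = rng.normal(size=(true_rank, d_model))
coeffs = rng.normal(size=(n_candidates, true_rank))
candidates = coeffs @ basis + 0.05 * rng.normal(size=(n_candidates, d_model))

# SVD of the centered candidate matrix; the spectrum reveals the effective rank.
centered = candidates - candidates.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
explained = s**2 / (s**2).sum()

# Count components needed to explain 95% of the variance.
effective_rank = int((np.cumsum(explained) < 0.95).sum()) + 1
print(effective_rank)  # typically recovers the planted rank of ~6
```

With real activations the spectrum rarely has such a clean elbow, which is part of why claims like “5–8 dimensions” need hedging.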
Naturally, we got all excited and wanted to publish our findings on arXiv. We wanted to connect with other enthusiasts and start a discussion. We wanted to become a part of the AI alignment researchers community and to gather other perspectives and hints.
There was only a little left to do: a thorough literature search. One step, and we were inside a rabbit hole. A deep, dark rabbit hole, which turned out to be just the anteroom of a labyrinth called “Mechanistic Interpretability”.
What is mechanistic interpretability? We don’t know. No one does, really. What we mean by it is: “a way to get inside an LLM and tweak something there to change its behavior”.
One more step. Is refusal multidimensional? Looks like it is. Maybe it’s a cone.
One more. But why do we want to learn more about it in the first place? We want LLMs to refuse harmful requests — we want them to be more harmless.
And yet one more. Are refusal and harmlessness connected? Yes, probably, but likely not as tightly linked as we expected.

Fine. Let’s take another turn.
A step in a new direction. We started with a single concept and used linear probing (we’ll tell you more, but not today) to find it. Is linear probing reliable? Not always. There are also sparse autoencoders, attribution graphs, and other techniques. And none of them has turned out to be “The One”.
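As a taste of what linear probing involves (details in a later post), here is a toy numpy version: fit a logistic-regression probe on labeled activations and read the weight vector off as a “concept direction”. Everything here, from the data to the training loop, is a synthetic, simplified sketch.

```python
import numpy as np

# Toy linear probe: given activations labeled "concept present" vs "absent",
# fit logistic regression by gradient descent. In practice the rows would be
# hidden states collected from a model at some layer.
rng = np.random.default_rng(2)
d_model = 32
pos = rng.normal(size=(100, d_model)) + 1.5  # concept present (shifted cluster)
neg = rng.normal(size=(100, d_model))        # concept absent

X = np.vstack([pos, neg])
y = np.concatenate([np.ones(100), np.zeros(100)])

w, b, lr = np.zeros(d_model), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
    w -= lr * (X.T @ (p - y) / len(y))      # logistic-loss gradient step
    b -= lr * float((p - y).mean())

preds = (1.0 / (1.0 + np.exp(-(X @ w + b)))) > 0.5
accuracy = float((preds == y).mean())
print(f"train accuracy: {accuracy:.2f}")  # near 1.0 on this separable toy data
# The learned weight vector w is the probe's "concept direction".
```

High probe accuracy on toy data like this is exactly the trap: it tells you a direction is decodable, not that the model actually uses it, which is one reason probing alone is not “The One”.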
Okay. One more turn.
Where do we look for representations of concepts? One layer, a subset of layers, all layers? The fact that vector representations change from layer to layer was… not helpful.
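The bookkeeping this question forces on you looks roughly like the following sketch: compute a candidate direction independently at each layer, then compare neighboring layers with cosine similarity. The per-layer activations here are random stand-ins, so the drift is artificial, but the workflow is the same one we used.

```python
import numpy as np

# Compute a difference-in-means direction at each "layer" and measure how
# much the direction rotates between consecutive layers.
rng = np.random.default_rng(3)
n_layers, n_prompts, d_model = 8, 50, 16

directions = []
for layer in range(n_layers):
    # Synthetic stand-ins for this layer's activations on two prompt sets.
    harmful = rng.normal(size=(n_prompts, d_model)) + rng.normal(size=d_model)
    harmless = rng.normal(size=(n_prompts, d_model))
    v = harmful.mean(axis=0) - harmless.mean(axis=0)
    directions.append(v / np.linalg.norm(v))

# Cosine similarity between consecutive layers' unit directions; in a real
# model these drift, which is exactly what makes "which layer?" hard.
cos = [float(directions[i] @ directions[i + 1]) for i in range(n_layers - 1)]
print([round(c, 2) for c in cos])
```

A cosine near 1 between layers would suggest a stable direction one could intervene on anywhere; values wandering toward 0, as with real models, mean the layer choice genuinely matters.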
See? A deep, dark, fascinating rabbit hole. We got lost in it and spent months reading, watching, experimenting. Asking questions, looking for answers, finding even more questions.
Finally, we sat down and decided: that’s enough. We are not going to wander in the dark on our own — we want company. Yours, specifically.
How It’s Going
We still want to make AI safer, even if it ends up only 0.001% safer than before we jumped in. Figuring out how different abstract concepts are represented mathematically is crucial for building AI safely, ethically and responsibly:
we must know what to evaluate;
we must know where the harm might be hidden;
we must know how to steer our LLMs in the right directions.
And we want to be a part of a community — to exchange ideas, take criticism (constructive criticism only!), and spark discussions. So, after some back and forth we ended up registering an account here. Because what we have is not yet enough to write a paper, but is enough to start talking, exchanging ideas, and asking questions together.
In the next posts, we’ll share more details about our own experiments, and the ones we uncovered along the way. In the meantime, please share your thoughts:
Have you studied how abstract concepts are represented inside LLMs?
We’re excited to hear from you.