Summary
Research increasingly shows that various geometric structures emerge in the activation and behavior spaces of large language models. These structures are fascinating to me, and I think it is worth exploring which structures appear for different concepts in LLMs. I made a tool that lets anyone generate and visualize 3D structures for an arbitrary concept. The tool currently supports cyclical and smooth structures, extracting activations at any given layer of the Llama-3-8B model. Check it out here:
https://neural-geometry.vercel.app
Introduction
Traditionally, mechanistic interpretability has focused on steering models with linear direction vectors. Underpinning this is the Linear Representation Hypothesis, which states that all features in language models are collections of sparse linear directions. While linear activation steering has proven useful for shifting model behavior, it can also produce unexpected behavior. When the magnitude of the steering scalar becomes too large, the language model often breaks down, producing garbled and incoherent text. Personally, I do not think that the Linear Representation Hypothesis is sufficient to explain what is going on mechanistically in these large language models.
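To make the steering operation concrete, here is a minimal sketch of adding a scaled direction to a single hidden state. It uses random NumPy vectors as stand-ins for a real residual-stream activation and a real steering direction, so the names steer, hidden, and direction are illustrative assumptions rather than anything taken from the tool. It only shows how far the steered vector drifts from the original as the scalar grows, not how the model's text output changes.

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Add alpha times the unit-normalized direction to a hidden state."""
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

rng = np.random.default_rng(0)
d_model = 4096                          # hidden size of Llama-3-8B
hidden = rng.normal(size=d_model)       # random stand-in for a real activation
direction = rng.normal(size=d_model)    # random stand-in for a steering direction

base_norm = np.linalg.norm(hidden)
for alpha in (0.0, 8.0, 64.0, 512.0):
    steered = steer(hidden, direction, alpha)
    ratio = np.linalg.norm(steered) / base_norm
    cosine = steered @ hidden / (np.linalg.norm(steered) * base_norm)
    print(f"alpha={alpha:6.1f}  norm ratio={ratio:6.2f}  cosine to original={cosine:5.2f}")
```

In this toy setting, large values of alpha push the vector to a norm far outside the range the original occupied, which is one plausible reading of why heavily steered models become incoherent.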
Recent research from Goodfire has focused on exposing the smooth geometric structures that form for certain ordered concepts, such as days of the week, seasons, age, temperature, and more. Recurring ideas like days of the week and seasons form smooth cyclical structures, whereas linear concepts such as temperature and age form arcs. The work here is directly inspired by Goodfire's paper and research.
Extracting Activation Vectors
One of the most important steps in this pipeline is extracting meaningful activation vectors from carefully constructed prompts. For example, Llama may be prompted, "What day comes 3 days after Tuesday?" or "What day comes before Thursday?" Ground-truth labels are simply the correct answers to the prompts. Activation vectors are extracted at the last token of each prompt to capture maximal semantic meaning, since in a causal model that is the only position that has attended to the entire prompt.
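The last-token selection can be sketched as follows. This assumes the layer's activations are already available as an array of shape (batch, seq_len, d_model) together with a right-padded attention mask; how the tool actually reads activations out of Llama is not described here, and the function name last_token_activations is hypothetical.

```python
import numpy as np

def last_token_activations(hidden_states, attention_mask):
    """Pick the activation at the last non-padding token of each prompt.

    hidden_states: (batch, seq_len, d_model) activations from one layer.
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding.
    Assumes right padding, so real tokens come first in each row.
    """
    last_index = attention_mask.sum(axis=1) - 1        # (batch,)
    batch_index = np.arange(hidden_states.shape[0])
    return hidden_states[batch_index, last_index]      # (batch, d_model)

# Toy example: 2 prompts, up to 4 tokens each, 3-dimensional activations.
hidden_states = np.arange(2 * 4 * 3, dtype=float).reshape(2, 4, 3)
attention_mask = np.array([[1, 1, 1, 1],
                           [1, 1, 0, 0]])
acts = last_token_activations(hidden_states, attention_mask)
```

With left padding the last real token always sits at index -1, so the mask arithmetic would not be needed.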
The neural geometry tool uses the GPT-4o-mini model to generate a multitude of prompts with ground-truth labels for any given concept. This style of prompt is often favored for cyclical and linear concepts, where semantic meaning can easily be captured. One potential issue is that our choice of words may impose a cyclical structure that would not otherwise be there. For example, a prompt such as "What day comes two days before Wednesday?", repeated across every weekday, asks the model to step around the week and so may itself induce the cycle. It would be interesting to explore whether the same structure emerges when a different style of prompting is used.
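For illustration, here is a hand-written template that produces prompts of this style together with their ground-truth labels. It is only a stand-in: the tool generates its prompts with GPT-4o-mini, and the function make_day_prompts and its max_offset parameter are assumptions made for this sketch.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def make_day_prompts(max_offset=3):
    """Return (prompt, label) pairs for 'N days after/before X' questions."""
    examples = []
    for start, day in enumerate(DAYS):
        for offset in range(1, max_offset + 1):
            unit = "day" if offset == 1 else "days"
            for word, sign in (("after", 1), ("before", -1)):
                prompt = f"What day comes {offset} {unit} {word} {day}?"
                # The modulo wraps around the week, so labels are cyclic by construction.
                label = DAYS[(start + sign * offset) % len(DAYS)]
                examples.append((prompt, label))
    return examples

examples = make_day_prompts()
```

The modulo in the label computation makes the concern above visible: the wrap-around is already built into the task, so a cycle in the activations could partly reflect the prompts rather than the concept alone.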
Projection Down to 3D
Raw activation vectors are first projected down to a 64-dimensional space. There, vectors that share the same ground-truth label are averaged to produce centroids. These centroids are more stable in terms of their semantic meaning than any individual vector alone.
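A minimal sketch of this step is below. It assumes the 64-dimensional projection is done with PCA, which is not stated above, and it uses random data in place of real activations. The helpers pca_project and label_centroids and the label names are hypothetical.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto their top-k principal components."""
    centered = X - X.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def label_centroids(X, labels):
    """Average the rows of X that share a label, in order of first appearance."""
    names = list(dict.fromkeys(labels))
    labels = np.asarray(labels)
    centroids = np.stack([X[labels == name].mean(axis=0) for name in names])
    return names, centroids

rng = np.random.default_rng(0)
activations = rng.normal(size=(210, 4096))        # random stand-in for last-token activations
labels = [f"label_{i % 7}" for i in range(210)]   # 7 hypothetical labels, 30 prompts each

reduced = pca_project(activations, k=64)
names, centroids = label_centroids(reduced, labels)
```

Keeping the labels in a meaningful order (Monday through Sunday, say) matters for any later curve fit, since the curve has to visit the centroids in that order.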
We do not project directly down to 3D; 64D serves as an intermediate step. The purpose of projecting down to 64 dimensions is to remove most of the noise that exists in a high-dimensional space such as the original 4096 dimensions. Important semantic meaning may be encoded in only a small number of dimensions. Computing centroids in 64D is also much more computationally stable than computing them in 4096D. The choice of 64 is somewhat arbitrary; 32 and 128 can also work just fine.
A smooth curve is fitted through the 64-dimensional centroids and sampled at 300 points. The resulting spline curve is then projected down to 3D via PCA. This is the smooth curve that is displayed on the website once the full pipeline has finished.
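The curve fit and final projection might look like the sketch below. The type of spline the tool uses is not specified, so this substitutes a closed uniform Catmull-Rom spline, which passes through every centroid in order and wraps around to the first. The centroids here are a synthetic noisy circle embedded in 64 dimensions, and the function names are assumptions.

```python
import numpy as np

def closed_catmull_rom(points, n_samples=300):
    """Sample a closed, smooth curve that passes through each point in order."""
    n = len(points)
    # Global parameter in [0, n); segment i runs from points[i] to points[i + 1].
    u = np.linspace(0.0, n, n_samples, endpoint=False)
    seg = np.floor(u).astype(int)
    t = (u - seg)[:, None]
    p0 = points[(seg - 1) % n]
    p1 = points[seg % n]
    p2 = points[(seg + 1) % n]
    p3 = points[(seg + 2) % n]
    # Uniform Catmull-Rom basis: equals p1 at t = 0 and p2 at t = 1.
    return 0.5 * (2 * p1 + (p2 - p0) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t**2
                  + (3 * p1 - p0 - 3 * p2 + p3) * t**3)

def pca_3d(curve):
    """Project the sampled curve onto its top three principal components."""
    centered = curve - curve.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:3].T

# Synthetic centroids: 7 points on a noisy circle embedded in 64 dimensions.
rng = np.random.default_rng(0)
angles = 2 * np.pi * np.arange(7) / 7
basis = rng.normal(size=(2, 64))
centroids = np.cos(angles)[:, None] * basis[0] + np.sin(angles)[:, None] * basis[1]
centroids += 0.05 * rng.normal(size=centroids.shape)

curve_64d = closed_catmull_rom(centroids)   # (300, 64)
curve_3d = pca_3d(curve_64d)                # (300, 3)
```

A closed curve only suits cyclical concepts. For arc-shaped concepts such as temperature or age, an open spline that does not wrap from the last centroid back to the first would be the analogous choice.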
Conclusion
Overall, I find this to be one of the more interesting research directions coming out of the field of mechanistic interpretability. While linear directions have been shown to causally influence model behavior, the assumption that all features are simply linear direction vectors seems to me a still-nascent theory of interpretability, and of alignment in general. Rather, the structures within language models are likely much more complex and nonlinear, which may be why models break down when steered too far. I am excited to see where this research leads in the future. In the meantime, I hope this tool can make this field of research more accessible for everyone.