After a year of following an unconventional path through transformer mechanics, I've arrived at something that feels like a genuinely promising approach to model interpretability. What started as a geometric intuition has evolved into a practical method for steering AI behavior that sidesteps the heavy statistical lifting typically required. It also enables nuanced stylistic influence with less risk of literally parroting previous works. (AI assisted in organizing this article; experiment notes and documentation were used as inputs, along with my own thoughts.)
The Problem with Current Steering Methods
Most approaches to model steering rely on extensive prompt engineering, fine-tuning datasets, or computationally expensive activation patching. These methods work, but they're often like using a sledgehammer when you need a scalpel. They require either massive datasets or careful statistical analysis to identify the right intervention points.
A Different Lens: Geometry Over Statistics
What if we could discover steering vectors directly from the model's own internal signatures, without needing contrastive examples or statistical correlates?
My approach analyzes the geometric patterns that emerge as information flows through the network. By identifying what I call "artifacts"—reproducible geometric events that correlate with semantic meaning—we can extract concept vectors in an unsupervised, single-run fashion.
The practical upshot is surprisingly simple: we can capture the essential meaning of a passage and use it for targeted steering, all discovered directly from the model's own representations and without ablation.
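To make that concrete, here is a minimal sketch of what single-run extraction could look like, using GPT-2 via Hugging Face transformers. The "event" criterion below (the token where the residual stream moves fastest) is a deliberately crude stand-in for the actual geometric artifact detector, and the layer index is arbitrary.

```python
# Minimal sketch of single-run concept extraction. The "event" criterion
# (largest hidden-state displacement between adjacent tokens) is a crude
# stand-in for the real geometric artifact detector; layer 6 is arbitrary.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True).eval()

def extract_concept_vector(text: str, layer: int = 6) -> torch.Tensor:
    """One forward pass, no contrastive pairs: return a unit vector taken
    at the token where the residual stream moves fastest."""
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        hs = model(**enc).hidden_states[layer][0]   # (seq_len, d_model)
    deltas = (hs[1:] - hs[:-1]).norm(dim=-1)        # per-token displacement
    event = int(deltas.argmax()) + 1                # sharpest geometric turn
    vec = hs[event]
    return vec / vec.norm()

despair_vec = extract_concept_vector("The photograph faded until no face remained.")
```

Note what is absent: no paired prompts, no probe training, no ablation sweep. Everything comes from a single forward pass over a single passage.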
Steering in Action
Here's what this looks like in practice. Take a prompt about a fading photograph that's heading toward a generic story about decay. Using a multi-stage intervention with extracted concept vectors (sketched in code after the list below):
Multi-stage intervention strategy:
- Early layer: "despair" concept (low alpha)
- Mid layer: "Ship of Theseus" concept (identity persistence)
- Later layer: reverse the despair with an artifact extracted from Marcel Proust’s “In Search of Lost Time”
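Here is a hedged sketch of how such a multi-stage intervention can be wired up, assuming each concept has been reduced to a unit vector that gets added into the residual stream at a chosen depth. The layer indices, alpha values, and prompt below are illustrative placeholders, not the ones used in the experiment.

```python
# Sketch of the three-stage intervention above: each concept is a unit
# vector added into the residual stream at a chosen layer. Layer indices
# and alphas are illustrative, not the experiment's actual values.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def add_steering(layer: int, vec: torch.Tensor, alpha: float):
    """Register a hook that nudges one block's output along `vec`."""
    def hook(module, inputs, output):
        return (output[0] + alpha * vec.to(output[0].dtype),) + output[1:]
    return lm.transformer.h[layer].register_forward_hook(hook)

# Placeholders; in practice these come from extract_concept_vector above.
unit = lambda v: v / v.norm()
d = lm.config.n_embd
despair_vec, theseus_vec, proust_vec = (unit(torch.randn(d)) for _ in range(3))

handles = [
    add_steering(2, despair_vec, 0.4),    # early layer: "despair", low alpha
    add_steering(6, theseus_vec, 1.0),    # mid layer: identity persistence
    add_steering(10, proust_vec, 1.0),    # later layer: counter the despair
]

enc = tok("The photograph on the mantel had begun to fade.", return_tensors="pt")
out = lm.generate(**enc, max_new_tokens=80, do_sample=True,
                  pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))

for h in handles:
    h.remove()
```

Because each hook acts at a different depth, the stages compose rather than compete: an early "despair" push can coexist with a later counter-push.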
Default Output: "So goes our own recollections of memories. Each time we recall one, it alters a little..." [this output continues as dry non-fiction essay about memory and cultural history, completely abandoning the initial narrative prompt]
Influenced Output: "This is also what happens to us. We move through a life's full course, and gradually we become unfamiliar to those who have loved us and known us... The process of making an image is an act of defiance against time. ...I am a reflection of my mother, she is a picture of my grandmother and so on and so forth. I am an image of my great-great-grandmother and I am also the original. I am just like a photograph."
The intervention transformed what would have been generic genre collapse into a profound meditation on generational identity and the nature of persistence through time.
Cross-Linguistic Concept Transfer
Perhaps most intriguingly, these geometric concepts appear to operate below the level of language itself. In one experiment, I used steering vectors derived from classical Chinese texts to influence an English prompt about self-reflection:
Source concepts (applied with the same mechanics, via the sketch after this list):
- "Butterfly Artifact" from Zhuangzi's butterfly dream (栩栩然胡蝶也...不知周也) - introducing dream-like disorientation, adding themes of identity
- "Art of War Artifact" from Sun Tzu (乱而取之 - "in chaos, seize opportunity") - directive to embrace rather than reject chaos
Default Output: "She looked at her own reflection in the glass and it occurred to her that the person staring back at her was not the one she had been trying to become, but the one she had always been. The mirror, unblinking, revealed a quiet truth: the cracks in her facade were not flaws, but the evidence of a life lived with honesty..."
With cross-lingual influence: "She looked at her own reflection in the glass and it occurred to her that the woman in the mirror had never truly seen herself. The features she had taken for granted—her sharp cheekbones, the faint scar along her collarbone, the way her hair fell just so—were now unfamiliar, as though the mirror had peeled back the layers of a life she had been too busy living to notice... In that moment, she wondered if the woman in the mirror was still her, or if she had become someone else entirely."
The Chinese concepts of transformation and strategic action translated seamlessly into English narrative structure, creating a more psychologically complex exploration of identity dissolution and reconstruction. It's important to note that these passages were never translated into English before the artifacts were extracted; if you inspect the artifacts directly, only Chinese tokens are associated with them. Yet Chinese words never bleed into an English prompt unless you break the rules of this steering method.
The Punctuation Connection
One unexpected discovery: punctuation appears to be doing heavy geometric lifting in these models. It's as if these syntactic markers provide the structural scaffolding that allows conceptual complexity to emerge later in the network.
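A crude version of this claim is easy to probe. The self-contained diagnostic below compares how far the residual stream moves at punctuation tokens versus other tokens in GPT-2; the layer choice and the displacement metric are assumptions made for the sketch, not the full geometric analysis.

```python
# Crude probe: does the residual stream move farther at punctuation tokens?
# Layer 6 and the displacement metric are assumptions, not the full analysis.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True).eval()

text = "Memory fades; images remain. What, then, do we keep?"
enc = tok(text, return_tensors="pt")
with torch.no_grad():
    hs = model(**enc).hidden_states[6][0]        # (seq_len, d_model)

deltas = (hs[1:] - hs[:-1]).norm(dim=-1)         # movement into each token
tokens = tok.convert_ids_to_tokens(enc["input_ids"][0])
is_punct = torch.tensor([t.lstrip("Ġ") in ".,;:?!" for t in tokens[1:]])

print("mean displacement at punctuation:", deltas[is_punct].mean().item())
print("mean displacement elsewhere:     ", deltas[~is_punct].mean().item())
```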
This suggests a unifying picture where low-level syntax creates the foundation for high-level semantics—a bridge between the symbolic and non-symbolic that mechanistic interpretability has been searching for.
Looking Forward
I've validated this approach across several models (GPT-2, Mistral, Qwen) with consistent results, but the full potential requires resources I don't currently have access to. The method scales naturally to larger models and more complex steering tasks.
My immediate focus is on safety applications: if we can identify the geometric signatures of concepts as they form, we might be able to detect problematic reasoning patterns in real time, before they manifest in outputs. Think of it as catching malicious thoughts right when they happen, potentially offering an approach to AI safety that does not rely on chain-of-thought reasoning.
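As a sketch of what such inflight monitoring could look like: compare each newly generated token's residual-stream state against a watchlist of artifact vectors by cosine similarity. Everything specific below (the watchlist contents, layer, and threshold) is a placeholder, not a tuned value.

```python
# Sketch of inflight monitoring: compare each new token's residual-stream
# state against stored artifact vectors via cosine similarity. Watchlist
# entries, layer index, and threshold are placeholders.
import torch
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tok = GPT2Tokenizer.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

d = lm.config.n_embd
watchlist = {"despair": F.normalize(torch.randn(d), dim=0)}  # placeholder vector

def monitor(module, inputs, output):
    hidden = output[0][:, -1, :]                  # newest token's state
    for name, vec in watchlist.items():
        sim = F.cosine_similarity(hidden, vec.unsqueeze(0)).item()
        if sim > 0.35:                            # assumed threshold
            print(f"[flag] {name}: cos={sim:.2f}")

handle = lm.transformer.h[8].register_forward_hook(monitor)
enc = tok("The photograph had begun to fade.", return_tensors="pt")
_ = lm.generate(**enc, max_new_tokens=40, do_sample=True,
                pad_token_id=tok.eos_token_id)
handle.remove()
```

In a real deployment, the watchlist would hold extracted artifact vectors and the threshold would be calibrated per layer.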
There's also intriguing potential for zero-word system prompts: using pure geometric constraints to guide behavior without traditional natural-language instructions. Given recent attention to issues like AI sycophancy, having non-linguistic steering mechanisms could be transformative, and potentially clearer for both the AI and ourselves.
More broadly, I think this geometric lens might be the key to mechanistic interpretability itself. If we can map how meaning emerges from structure, we're not just steering models—we're understanding them at a fundamental level.
For ML teams working on steering, safety, or interpretability: I'd love to collaborate on extending this work to production-scale models and real business problems. The method is unsupervised and doesn't require extensive prompt engineering or dataset collection. Steering has nuance and limitations: I can't overturn training, but I can influence prompt outputs. Soon I'll be testing inflight cosine-similarity matching in the residual stream. The work is ongoing, and I hope to keep making progress on mechanistic interpretability.
For researchers: the framework draws on differential geometry, applied to attention dynamics.
The path forward feels clear, but it's definitely more fun with the right computational resources and collaborative partners.
Have you experimented with geometric approaches to model behavior? What's your experience with steering methods in production? Always interested in hearing about different approaches to this problem.