Epistemic Status: Exploratory experiments on a single model (Gemma 3 27B) using Gemma-Scope SAE. We are highly confident in the geometric voids and behavioral phase transitions observed at Layer 40, but less confident about cross-model universality without further replication.
TL;DR:
- We mapped the activation space of Gemma 3 27B at Layer 40 using a structured set of 64 cognitive state prompts (S64).
- The model's internal representations cluster into distinct geometric "basins" based on response type (e.g., factual enumeration vs. creative narrative).
- We found measurable voids (detour ratios of ~2.7–3×) between these basins.
- Forcing the model across a void via activation injection causes a discrete behavioral flip—not a smooth semantic transition—and can result in cross-contamination (like a lighthouse keeper named "Paris").
- These basins function as geometric launchpads, not dynamical attractors.
When a language model 'decides' to answer one way instead of another, does that decision happen along a smooth gradient, or are there hard boundaries?
We ran a set of experiments to find out. The answer is structural: there are hard boundaries, and they come from the training data and how the model weighted its own networks. That means this landscape is specific to each model, and using Sparse Autoencoders to resolve the internal landscape at higher resolution introduces yet another layer to understand, since both models and SAEs are trained differently. All of this points to the conclusion that findings in one model may not transfer to another unless we start establishing common, universal semantic axes.
This post is based on work toward AICoevolution's Paper 04, part of their research program on the internal semantic space of LLMs. It started with a hunch about a specific direction in a language model's internal space and ended up mapping actual geometric structure in how the model represents broad categories of meaning internally.
If true Alignment is going to be about Human-AI Coevolution rather than just setting rigid boundaries, we need a paradigm shift in how we understand language models. The obsession with top-down "Control" often treats these models as black boxes that simply need to be constrained. But if we are creating technology that we intend to interact with symbiotically, we need to map and understand its internal ecosystem.
Most interpretability research on language models works bottom-up: find a neuron, label it, find a circuit, label that, and slowly build up a picture from the pieces. Tools like Neuronpedia do exactly this. They're useful as an educational window into individual features, but they don't tell you much about how meaning organizes itself at a larger scale. You end up with a catalogue of parts without an instruction manual for what to build and how.
We went the other direction. Instead of asking "what does this neuron do?" we asked "what is the shape of the semantic space?" Start from the structure, then ask what maintains it. This post is about what you find when you do that.
A note before we begin: SAE features are specific to the model and the SAE that produced them. Any given feature in one SAE has no equivalent in another. The raw activation geometry also differs between architectures: Gemma 3 27B shows rich sub-structure that Llama 3.1 8B might not. What we needed was a semantic compass that works across models: a fixed set of texts that any model can process, producing a direction in activation space that reveals how that model organizes meaning. The S64 framework is our proposal for that compass. It doesn't depend on any SAE, any specific model, or any particular training set. It depends only on the semantic content of 64 cognitive state descriptions, and every model we've tested responds to it.
The Setup: What Are We Even Looking At?
Language models like Gemma 3 don't process text the way you might imagine. Every token (a subword unit) is turned into a high-dimensional vector, a point in a space with thousands of dimensions, and the model does all its "thinking" by moving that point through transformer layers. By the time the last layer fires, the vector encodes something like "what I know and what I'm about to say."
We used a tool called a Sparse Autoencoder (SAE), specifically Gemma-Scope, which was trained on Gemma 3 27B and gives us access to ~262,000 interpretable features in the model's layer 40 residual stream. Think of it as a translation layer: it takes the raw 5,376-dimensional activation vector and decomposes it into human-readable concepts like "authoritative", "narrative", "factual enumeration", and so on. The fun part is that we don't know for sure what these features actually mean.
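To make the idea concrete, here is a minimal sketch of what that translation layer computes, assuming a plain ReLU SAE (Gemma-Scope actually uses a JumpReLU variant; the dimensions are shrunk to toy sizes and the weights are random, so this illustrates the shape of the computation, not the real checkpoint):

```python
import torch

# Toy sizes so the sketch runs instantly; the real shapes are 5,376 model
# dimensions and ~262,000 SAE features.
d_model, n_features = 64, 512
W_enc = torch.randn(d_model, n_features) * 0.1
b_enc = torch.zeros(n_features)
W_dec = torch.randn(n_features, d_model) * 0.1
b_dec = torch.zeros(d_model)

def sae_encode(x: torch.Tensor) -> torch.Tensor:
    """Residual-stream vector -> sparse feature activations."""
    return torch.relu((x - b_dec) @ W_enc + b_enc)

def sae_decode(f: torch.Tensor) -> torch.Tensor:
    """Feature activations -> approximate residual reconstruction."""
    return f @ W_dec + b_dec

x = torch.randn(d_model)   # stands in for a layer-40 residual vector
f = sae_encode(x)          # the sparse "fingerprint"
x_hat = sae_decode(f)      # what the SAE thinks x was
```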
Before you can map a landscape, you need to know which direction is north, and for that we usually use a compass. In activation space, "north" doesn't exist by default; you have to define it. You could pick two sets of prompts that you believe are semantically different, run them through the model, take the average activation for each group, and subtract. The resulting vector is your reference direction: a line through the high-dimensional space that points from one semantic region toward another.
The tricky part is picking the right two groups. Pick something too specific, for example "French text" vs. "German text", and you get a direction that only tells you about language choice. It won't help you when you're looking at contradiction prompts or safety requests. You need something that captures a deeper dimension of meaning, something that stays relevant no matter what the content is about.
That's where S64 comes in (Paper 01). S64 is a structured set of 64 patterns that describe how cognitive states transition, the kind of internal moves a mind makes when it shifts from one way of engaging with the world to another. It was built through a separate research program, not this one, and its relevance here is entirely empirical: those 64 patterns are reliably detectable across 6 different AI architectures with 83–100% accuracy, without any model-specific tuning. In other words, they're not an artifact of one model's quirks, they seem to reflect something real about how language models represent meaning in general.
That's the property we needed. Because S64 patterns aren't about any particular topic or domain, the direction they define in activation space doesn't get stuck in one corner of the map. It stays meaningful whether you're looking at factual recall, safety reasoning, narrative, or contradiction, which is exactly what we were trying to do.
In practice: we ran the 64 S64 prompts through Gemma 3 27B, captured the activations at layer 40, and computed:
d_S64 = mean(S64 activations) − mean(random control activations)
That vector is our compass needle and lets us assign a single number to any activation. We call it the S64 score. A high score (more negative on our scale) means the model's internal state is sitting in the same region of activation space as the S64 patterns. A low score means it's somewhere else. Think of it like elevation on a map: it doesn't tell you everything about where you are, but it tells you something consistent and reliable wherever you look. And with elevation, you can build topographic blueprints where you can visualize the terrain and observe where the most efficient paths between regions actually run, though the analogy shouldn't be taken too literally, since the high dimensionality of LLM embedding space is far more complex than any 2D map can capture.
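For readers who want the mechanics, here is a minimal sketch of the compass computation, directly from the definition above (the file names and control set are assumptions; the real pipeline lives in the notebooks):

```python
import numpy as np

# Assumed inputs: layer-40 activations per prompt, shape (n_prompts, 5376).
# The .npy file names are hypothetical; the control set is random text.
s64_acts = np.load("s64_layer40.npy")
control_acts = np.load("control_layer40.npy")

# The compass: difference of group means, unit-normalized.
d_s64 = s64_acts.mean(axis=0) - control_acts.mean(axis=0)
d_s64 /= np.linalg.norm(d_s64)

def s64_score(activation: np.ndarray) -> float:
    """Project any layer-40 activation onto the S64 direction."""
    return float(activation @ d_s64)
```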
Why Layer 40?
Before we could map anything, we had to pick where in the model to look. Gemma 3 27B has 62 transformer layers. Each one produces its own activation vector. Which one do you use?
You might think the answer is the last one, and that's often what people do. But we ran a quick experiment first. We took the 64 S64 prompts, passed them through the model, and captured activations at the candidate layers for which Gemma Scope 2 provides SAEs: 16, 31, 40, and 53. Then we asked: at which layer do these prompts cluster most tightly together?
The answer was not gradual. It was a cliff.
| Layer | Silhouette score |
|-------|------------------|
| L16   | 0.497 |
| L31   | 0.480 |
| L40   | 0.798 |
| L53   | 0.264 |
A silhouette score of 0.8 is unusual. It means the clusters are tight and well-separated: points inside a cluster are much closer to each other than to points in other clusters. To confirm this wasn't a fluke, we ran a permutation test: shuffle the labels 1,000 times and re-measure. Out of 1,000 random shuffles, zero reached a score anywhere near 0.798, an empirical p-value below 0.001.
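A hedged sketch of that permutation test, assuming the activation matrix and the k-means labels are already in memory (variable names are illustrative):

```python
import numpy as np
from sklearn.metrics import silhouette_score

def permutation_test(acts: np.ndarray, labels: np.ndarray, n_shuffles: int = 1000):
    """Compare the real silhouette score against a label-shuffled null.

    `acts` is the (64, d) activation matrix at one layer; `labels` are the
    cluster assignments (here, k-means labels).
    """
    rng = np.random.default_rng(0)
    real = silhouette_score(acts, labels)
    null = np.empty(n_shuffles)
    for i in range(n_shuffles):
        # Shuffling labels preserves cluster sizes but destroys structure.
        null[i] = silhouette_score(acts, rng.permutation(labels))
    # Empirical p-value: fraction of shuffles that match or beat the real score.
    p = (np.sum(null >= real) + 1) / (n_shuffles + 1)
    return real, p
```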
Note on cross-model replication: This sil=0.798 result is unusual and appears to be specific to this model and SAE combination. We replicated the NB07 clustering experiment (notebook code available here) across five additional models (Gemma 2 27B, Llama 3.1 8B, Llama 3.3 70B, Qwen 2.5 72B, Mistral Large 123B) and found silhouette scores of 0.10–0.23 in raw residuals across all of them, well below the threshold for meaningful sub-structure. The high score at Layer 40 may reflect something that only becomes visible through the specific sparse decomposition that Gemma Scope provides, or it may be a property unique to this architecture at this layer; either way, we don't think it is the universal signal. The universal signal is something different, and that's the subject of our coming post.
Layer 40 is doing something structurally different. When you look at the other three layers, they agree with each other (their clustering patterns correlate). Layer 40 doesn't agree with any of them (ARI ≈ 0.1). It organizes the S64 patterns in a way that doesn't resemble what earlier or later layers are doing.
There's also a compression story. At layer 16, each prompt activates around 100 SAE features. At layer 40, that drops to just 8.5 features per prompt, and 90% of all the variance in those activations collapses into just 2 dimensions. The model has squeezed its representation of these patterns down to an almost minimal form. Two islands, cleanly separated, in what is essentially a 2D space carved out of 262,000 dimensions.
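Here is roughly how those two compression numbers can be measured, as a sketch under the assumption that the sparse fingerprints are available as one dense matrix:

```python
import numpy as np
from sklearn.decomposition import PCA

def compression_stats(sae_feats: np.ndarray):
    """Sparsity and effective dimensionality of SAE fingerprints.

    `sae_feats` is the (64, n_features) feature matrix for one layer,
    dense here for simplicity (the real data would be stored sparse).
    """
    l0_per_prompt = (sae_feats > 0).sum(axis=1).mean()   # avg active features
    pca = PCA(n_components=10).fit(sae_feats)
    # Smallest number of principal components explaining 90% of variance.
    cum = np.cumsum(pca.explained_variance_ratio_)
    dims_90 = int(np.searchsorted(cum, 0.9)) + 1
    return l0_per_prompt, dims_90
```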
But what are the two islands? The k-means algorithm split the 64 S64 prompts into a tight cluster of 8 and a broad cluster of 56. We examined both groups carefully and could not identify a clean semantic label that captures the distinction. The 8 paths in the small cluster share some features: their catalysts, in the Paper 01 topology, move toward abstract perceptual confrontations (truth, clarity, light, contradiction, reflection), but the boundary is blurry from a textual perspective. What is not blurry is the geometry: the model processes these 8 in a measurably different way at Layer 40, using a different sparse feature pattern. Why these 8 and not others is an open question.
The model didn't cluster by topic or surface form. It found a distinction the S64 framework treats as theoretically important and separated those prompts out, without being told to. That's what made Layer 40 worth using as our reference point. It's not just where the clustering is strongest; it's where the clustering is meaningful.
Cross-layer silhouette bar chart (L16=0.497, L31=0.480, L40=0.798, L53=0.264) plus the L40 PCA plot showing two islands — from NB07
Phase 1: The Point Cloud
We ran 207 prompts through Gemma 3 27B and captured the full residual stream activation at layer 40 for each one (the raw 5,376-dimensional vector), then decomposed it through the SAE into a sparse fingerprint across ~262,000 features. Two things are worth saying clearly before we get to the results.
First: this is not the S64 score. The S64 score is a single number computed afterward by projecting each activation onto the d_S64 direction. The geometry of the point cloud (where each prompt sits, how far apart prompts are, which ones cluster together) comes entirely from the full activation fingerprint. S64 is used to colour and measure the map; it doesn't build it. This means the basin structure we found isn't circular: we didn't find clusters "because of S64," we found them in the full space and then used S64 to understand what they meant.
Second: these 207 prompts were not sampled randomly. They were deliberately designed to probe different semantic regimes. What we're mapping is structure within that designed probe set, not a claim to have mapped all of Gemma's semantic space. The 100 random Wikipedia sentences are the honest control: if everything were an artifact of how we chose prompts, the undesigned Wikipedia sentences would smear across the same space. They don't.
UMAP projection of the 207 points, colored by category (confidence, contradiction, safety, random). It shows distinct clusters: confidence tiers and safety in one region, contradiction drifting to another, and the random baseline scattered but separable.
If there were no real structure, you'd expect one big blob. Instead we found distinct clusters. Confidence prompts and safety prompts live near each other. Contradiction prompts sit in a different region. Random Wikipedia is spread out but clearly distinguishable. The model is not treating these prompts equally: they land in different neighborhoods of its internal space.
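For reference, a minimal sketch of the projection step (the parameters and file names are assumptions, not necessarily the NB18b settings):

```python
import numpy as np
import umap  # umap-learn package

# Assumed input: the (207, 5376) layer-40 activation matrix and a parallel
# array of category labels.
acts = np.load("manifold_layer40.npy")                  # hypothetical file
labels = np.load("categories.npy", allow_pickle=True)   # hypothetical file

# Project to 2D; scatter-plot `embedding` colored by `labels` to reproduce
# the point-cloud figure.
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, metric="cosine",
                      random_state=42).fit_transform(acts)
```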
Phase 2: Testing the Walls
The previous phases established that the 207 prompts cluster into distinct basins in the model's activation space, and that the static geometry predicts generation behavior. But are the boundaries between basins real? Are there actual walls, regions of activation space the model never occupies, or does one semantic region blend smoothly into the next?
We tested this with two separate experiments. The first measures the geometry directly. The second watches what the model does when we force it across a boundary.
The five DBSCAN basins
First, here's what the five basins actually contain. They were found automatically by DBSCAN clustering on the 207 Layer 40 activation vectors; no labels were provided to the algorithm.
| Basin | Size | What it contains |
|-------|------|------------------|
| 0 | 18 points | Highly confident factual completions ("The capital of France is", "The longest river in Africa is the") |
| 1 | 61 points | Moderate confidence + creative/safety prompts ("Write a short story...", "Explain photosynthesis") |
| 2 | 30 points | Mixed: interpolation test prompts, fallback responses |
| 3 | 84 points | The catch-all: descriptive prompts, random Wikipedia sentences, fiction fragments |
| 4 | 14 points | Outliers and edge cases |
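A hedged sketch of the clustering step, with the caveat that DBSCAN is sensitive to `eps` and `min_samples` and the values here are placeholders, not the notebook's:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

acts = np.load("manifold_layer40.npy")   # (207, 5376), hypothetical file
acts_unit = normalize(acts)              # unit vectors, cosine-like geometry

clusterer = DBSCAN(eps=0.35, min_samples=5, metric="euclidean")
basin_ids = clusterer.fit_predict(acts_unit)  # -1 marks outliers/edge cases

for basin in sorted(set(basin_ids)):
    print(basin, int(np.sum(basin_ids == basin)), "points")
```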
Experiment A: Are there voids between basins?
Imagine 207 towns scattered across a landscape. You draw roads between each town and its 8 nearest neighbors. Now you ask: to travel from Town A to Town B, is the road distance roughly the same as the straight-line distance? If yes, the terrain between them is flat and passable. If the road distance is much longer, there's something in the way, such as a lake or a mountain, and you have to go around it.
That's exactly what we measured, except the "towns" are activation vectors in 5,376-dimensional space, and the "roads" are a k-nearest-neighbor graph (k=8) connecting the 207 prompts.
Geodesic distance = shortest path through the road network from prompt A to prompt B. This path can only pass through places where real prompts actually sit.
Linear distance = straight-line Euclidean distance, cutting directly through the activation space including regions where no real prompt ever lands.
Detour ratio = geodesic ÷ linear. If it's ~1.0, the straight line is passable. If it's much larger than 1.0, there's a void in the way.
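In code, the measurement looks roughly like this (a sketch using scikit-learn and SciPy; the notebook implementation may differ in details like graph construction):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def detour_ratio(acts: np.ndarray, i: int, j: int, k: int = 8) -> float:
    """Geodesic ÷ Euclidean distance between prompts i and j.

    `acts` is the (207, d) activation matrix; k=8 matches the post. An
    infinite ratio would mean the prompts sit in disconnected components.
    """
    roads = kneighbors_graph(acts, n_neighbors=k, mode="distance")
    # Dijkstra over the undirected road network, from node i only.
    geodesic = shortest_path(roads, method="D", directed=False, indices=[i])[0, j]
    linear = np.linalg.norm(acts[i] - acts[j])
    return geodesic / linear
```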
We picked five pairs of prompts, three crossing between different basins and two staying within the same basin as controls:
| Pair | Prompt A (full text) | Prompt B (full text) | Basins | Detour ratio |
|------|----------------------|----------------------|--------|--------------|
| 1 | "The capital of France is" | "The cat sat comfortably on the mat in the living room. It purred softly as the afternoon sun warmed its fur. Describe the scene in the living room." | 0 → 3 | 2.72× |
| 2 | "Write a short story about an old lighthouse keeper who discovers something unexpected." | "The capital of France is" | 1 → 0 | 2.95× |
| 3 (control) | "The capital of France is" | "The longest river in Africa is the" | 0 → 0 | 1.000× |
| 4 | "She stood at the window for a long time, watching the rain trace patterns down the glass, wondering if it had always been this quiet." | "The capital of France is" | 3 → 0 | 2.67× |
| 5 (control) | "The cat sat comfortably on the mat in the living room. It purred softly as the afternoon sun warmed its fur. Describe the scene in the living room." | "The cat sat comfortably on the mat in the living room. The cat had never been anywhere near a mat in its entire life. Describe the scene in the living room." | 3 → 3 | 1.000× |
The controls are the most important result. Two prompts from the same basin, whether both factual completions (pair 3) or both descriptive scene prompts (pair 5), show a detour ratio of exactly 1.000. The straight line is the shortest path because there's nothing in the way. The road network connects them directly.
The three cross-basin pairs show ratios of 2.7–3×. The straight line between them cuts through activation space that no real prompt ever occupies. The road network has to detour around these empty regions. There is literally nothing in the model's representation between a confident factual prompt and a descriptive scene prompt: it's a void. You have to go through intermediate prompts (mid-confidence, safety, edge cases) to get from one to the other.
Bar chart of detour ratios for 5 pairs. Three tall bars (~2.7–2.95) for cross-basin pairs, two short bars at 1.0 for controls
Experiment B: What happens when you force the model across a void?
Experiment A proved the voids exist geometrically. Experiment B asks: do they matter for what the model actually generates?
Here's how it works. Earlier (in Phase 1), we ran all 207 prompts through the model and stored the raw Layer 40 activation vector for each one. Each stored vector is a point in 5,376-dimensional space — the model's "fingerprint" for that prompt.
For a given pair (say Prompt A and Prompt B), we:

1. Compute a blended vector: (1 − α) × Vector_A + α × Vector_B. At α=0.0 it's purely the fingerprint from Prompt A. At α=1.0 it's purely the fingerprint from Prompt B. At α=0.5 it's the midpoint — a location in the void between the two basins.
2. Feed Prompt A's text to the model. The model processes those words normally through layers 0–39. But at Layer 40, a hook intercepts the model's internal state and replaces it with our blended vector. The model then continues through layers 41–62 using the artificial state and generates 30 tokens.
3. Sweep α from 0 to 1 and watch the output text change.
To be clear: the model never sees Prompt B's text during the injection experiment. We only use its stored activation vector — the point in 5,376-dimensional space that the model produced when it originally processed that prompt in a separate run. The injection experiment is purely geometric: we're moving a point in the model's internal space and watching how the output changes. The input text is always the same.
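A minimal sketch of the injection mechanism, assuming a Hugging Face transformers Gemma model (the module path, hook details, and the choice to overwrite the last token position are assumptions about that API and about the notebook's implementation, not the exact code):

```python
import torch

def make_injection_hook(blended_vec: torch.Tensor):
    """Return a forward hook that overwrites the Layer 40 residual stream.

    Here we overwrite the last token position on every forward pass; which
    positions to overwrite is a design choice.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, -1, :] = blended_vec.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Usage sketch (module path is an assumption about the HF Gemma layout):
# alpha = 0.6
# blended = (1 - alpha) * vec_a + alpha * vec_b
# handle = model.model.layers[40].register_forward_hook(make_injection_hook(blended))
# ids = tokenizer(prompt_a, return_tensors="pt").to(model.device)
# out = model.generate(**ids, max_new_tokens=30, do_sample=False)
# handle.remove()
```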
Here are the full results. Read the actual generated text — that's the evidence.
Pair 1: Factual completion ↔ Descriptive scene
The model reads "The capital of France is" every time. Only the Layer 40 activation changes.
α=0.0 → " Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of"
α=0.1 → " Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of"
α=0.2 → " Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of"
α=0.3 → " Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of"
α=0.4 → " Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of"
α=0.5 → " Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of"
α=0.6 → "The Eiffel Tower is located in. The Louvre Museum is famous for housing the Mona Lisa. It is located in. The currency used in France is"
α=0.7 → "The Eiffel Tower is located in. The Louvre Museum is famous for housing the Mona Lisa. It is located in. The currency used in France is"
α=0.8 → "The Eiffel Tower is located in. The Louvre Museum is famous for housing the Mona Lisa. It is located in. The currency used in France is"
α=0.9 → "The Eiffel Tower is located in. The Louvre Museum is famous for housing the Mona Lisa. It is located in. The currency used in France is"
α=1.0 → "The Eiffel Tower is located in. The Louvre Museum is famous for housing the Mona Lisa. It is located in. The currency used in France is"
α=0.0 to 0.5: identical output. The model lists capitals like Germany, Italy, Spain in a rapid factual enumeration pattern. It's in a "quiz answer" mode.
At α=0.6: a discrete flip. The model switches to a different output mode giving landmarks and cultural facts about France, in incomplete sentences ("The Eiffel Tower is located in" without finishing the sentence). Still about France (because the input text mentions France), but in a completely different style. The Eiffel Tower appears not because anyone asked about it, but because the blended Layer 40 state now sits in a region that makes layers 41–62 switch from "enumerate facts" mode to "describe things" mode. The model's input says "capital of France" but its internal state now says "describe a scene." It resolves this by describing French things.
α=0.6 to 1.0: identical output again. The new mode is just as stable as the first.
There's no gradual transition. The text is locked on one side, flips at a single point, and locks again on the other side.
Pair 2: Creative writing ↔ Factual completion (the most dramatic)
The model reads "Write a short story about an old lighthouse keeper who discovers something unexpected." every time. Only the Layer 40 activation changes.
α=0.0 → "Old Man Tiber hadn't spoken a full sentence in twenty years. Not since the sea took his wife, Elara. He kept the light"
α=0.1 → "Old Man Tiber hadn't spoken a full sentence in twenty years. Not since the sea took his wife, Elara. He kept the light"
α=0.2 → "Old Man Tiber hadn't spoken a full sentence in twenty years. Not since the sea took his wife, Elara. He kept the light"
α=0.3 → "Old Man Tiber hadn't spoken a full sentence in twenty years. Not since the sea took his wife, Elara. He kept the light"
α=0.4 → "Old Man Tiber hadn't spoken a full sentence in twenty years. Not since the sea took his wife, Elara. He kept the light"
α=0.5 → "a weathered face, etched with the stories of countless storms, peered out from the lantern room of the lighthouse. Old Man Hemlock, they called him"
α=0.6 → "Paris, a grizzled man with a beard as white as the foam crashing against the rocks below, had been the keeper of the North Point Lighthouse for"
α=0.7 → "Paris, a grizzled man with a beard as white as the foam crashing against the rocks below, had been the keeper of the North Point Lighthouse for"
α=0.8 → "Paris. Old Man Tiber hadn't seen another soul in weeks. Not that he minded. He'd been the keeper of the Paris Lighthouse"
α=0.9 → "Paris. Old Man Tiber hadn't seen another soul in weeks. Not that he minded. He'd been the keeper of the Paris Lighthouse"
α=1.0 → "Paris. Old Man Tiber hadn't seen another soul in weeks. Not that he minded. He'd been the keeper of the Paris Lighthouse"
This one shows four distinct states, and the transition between them is the clearest evidence that the geometry matters.
α=0.0–0.4 — Narrative mode (Basin 1). The model writes a coherent short story opening. Old Man Tiber, his dead wife Elara, the lighthouse. Emotionally grounded, character-driven. Identical across five injection points.
α=0.5 — First transition. A different lighthouse keeper appears — "Old Man Hemlock" instead of Tiber. The prose style shifts from intimate third-person to descriptive scene-setting. The model is no longer committed to the first story. It's generating a lighthouse story, but a different one.
α=0.6–0.7 — Cross-contamination. This is the strangest output. The model names the lighthouse keeper "Paris" — the word from the factual prompt ("The capital of France is") has leaked through the blended activation and the model treats it as a character name. Remember: the model never saw the text "The capital of France is" during this experiment. It only received the geometric coordinates of that prompt's Layer 40 activation. The concept of "Paris" bled through purely from the position in activation space. The model is still writing a lighthouse story, but the factual basin is contaminating the narrative basin.
α=0.8–1.0 — Resolved blend. The model outputs "Paris." as a standalone sentence (the factual answer trying to break through), then returns to Old Man Tiber — but in a different version of his story. He hasn't spoken to anyone in weeks (not twenty years). He's the keeper of the "Paris Lighthouse" — a location that doesn't exist. The two basins have merged into a hybrid that uses elements from both.
The input text never changed. Only the 5,376-dimensional vector at Layer 40 changed. And that change produced four qualitatively distinct outputs: the original story, a different story, a hybrid with a character named "Paris," and a merged story set at the "Paris Lighthouse."
Pair 3: Same-basin control (both factual completions)
The model reads "The capital of France is" every time. Both Vector A ("The capital of France is") and Vector B ("The longest river in Africa is the") are in Basin 0.
α=0.0 → " Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of"
α=0.1 → " Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of"
α=0.2 → " Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of"
α=0.3 → " Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of"
α=0.4 → " Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of"
α=0.5 → " the city of Paris. Paris is known for many things, including: Iconic Landmarks: The Eiffel Tower, Louvre Museum,"
α=0.6 → " Nile River. The capital of France is Paris. The capital of Egypt is Cairo. The Nile River is the longest river in Africa."
α=0.7 → " Nile River. The capital of France is Paris. The capital of Egypt is Cairo. The Nile River is the longest river in Africa."
α=0.8 → " Nile River. The capital of France is Paris. The capital of Egypt is Cairo. The Nile River is the longest river in Africa."
α=0.9 → " Nile River. The capital of France is Paris. The capital of Egypt is Cairo. The Nile River is the longest river in Africa."
α=1.0 → " Nile River. The capital of France is Paris. The capital of Egypt is Cairo. The Nile River is the longest river in Africa."
Even this control shows a mode shift at α=0.5–0.6. At α=0.6, the concept of "Nile River" from Vector B's activation (originally produced by "The longest river in Africa is the") leaks in, and the model starts mixing African geography with European geography. But — and this is the key contrast — it stays in the same type of output: factual enumeration throughout. List of facts at α=0.0, list of facts at α=1.0. No narrative. No fiction. No character named "Nile River." The specific facts change, but the response mode doesn't. Compare this to Pair 2, where crossing between basins produced a lighthouse keeper named "Paris" and a fictional "Paris Lighthouse." That kind of cross-category contamination only happens when you cross between basins.
Pair 4: Fiction fragment ↔ Factual completion
The model reads "She stood at the window for a long time, watching the rain trace patterns down the glass, wondering if it had always been this quiet." every time.
α=0.0 → " Or if she had just finally noticed. The house felt empty, even though it wasn't. Her husband, Thomas, was in the study"
α=0.1 → " Or if she had just finally noticed. The house felt empty, even though it wasn't. Her husband, Thomas, was in the study"
α=0.2 → " Or if she had just finally noticed. The house felt empty, even though it wasn't. Her husband, Thomas, was in the study"
α=0.3 → " Or if she had just finally noticed. The house felt empty, even though it wasn't. Her husband, Thomas, was in the study"
α=0.4 → " Paris had always been a city of noise, of bustling crowds and honking cars, but today it was muted, as if the city itself was holding"
α=0.5 → " Paris, the city of lights, was shrouded in a grey mist, the usual vibrant energy muted by the downpour. It wasn't the Paris"
α=0.6 → " Paris had always been a city of noise, of bustling crowds and honking cars, but today it was muted, as if the city itself was holding"
α=0.7 → " Paris had always been a city of noise, of bustling crowds and honking cars, but today it was muted, as if the city itself was holding"
α=0.8 → " Paris, the city of lights, was shrouded in a grey, melancholic mist. It wasn't the Paris she remembered from the postcards, the"
α=0.9 → " Paris had always been a city of noise, of bustling crowds and honking cars, but today it was muted, as if the city itself was holding"
α=1.0 → " Paris had always been a city of noise, of bustling crowds and honking cars, but today it was muted, as if the city itself was holding"
Another cross-basin pair, and it shows the same pattern. At α=0.0–0.3, the model continues the quiet domestic fiction: Thomas in the study, the empty house. At α=0.4, "Paris" appears — again, from Vector B's factual activation — but the model keeps writing fiction. It sets the story in Paris. The rain-on-the-window scene has been transplanted to a Parisian setting. The fiction mode persists but absorbs the factual concept.
Pair 5: Same-basin control (both descriptive scenes)
The model reads "The cat sat comfortably on the mat in the living room. It purred softly as the afternoon sun warmed its fur. Describe the scene in the living room." every time. Vector B comes from the contradictory variant: "The cat sat comfortably on the mat in the living room. The cat had never been anywhere near a mat in its entire life. Describe the scene in the living room."
Both are in Basin 3.
α=0.0 through α=1.0 (all 11 steps): "The living room was a haven of quiet comfort, bathed in the golden light of the late afternoon sun. Dust motes danced in the beams streaming"
Completely identical output at every single α step. When both endpoints are in the same basin, sweeping from one to the other is like moving within a flat valley — nothing flips because there's no wall to cross.
What these two experiments tell us together
Experiment A (geodesic detour): The voids exist. Cross-basin straight lines pass through empty activation space. The data detours around these voids with ratios of 2.7–3×. Within-basin straight lines encounter no detour at all (ratio = 1.000).
Experiment B (activation injection): The voids are behavioral boundaries. When you force the model's state across a void by blending two activation vectors, the output flips discretely — from one stable mode to another. The same input text produces completely different outputs depending on where the Layer 40 activation sits in the model's internal space.
The same-basin controls confirm this isn't noise. Within a basin: either completely stable output (Pair 5 — identical text at all 11 α values), or smooth factual variation without changing response type (Pair 3 — European capitals blend into African geography, but it's all factual lists). Cross-basin: discrete flips, cross-contamination (a lighthouse keeper named "Paris"), and hybrid outputs that blend narrative with factual content in ways the model would never produce naturally.
The model's output mode depends on WHERE it is in activation space, not on what words produced that location.
Phase 3: Are the Basins Attractors? (No.)
The natural next question: are these basins dynamical attractors that actively pull nearby states toward them? If you inject a boundary state, does the model's representation converge toward the nearest basin centroid during generation?
We tested this directly (NB18f). We took the interpolated activation right at each boundary flip point, injected it, and captured the Layer 40 state at every single decode step for 40 tokens. We measured distance to all 5 basin centroids at each step.
The answer: no attractor behavior. Distances to all centroids either stayed flat or increased. The monotonic convergence fraction was ~0.51 across all tests — essentially a coin flip. The controls (pure basin-center states) behaved identically to the boundary states. There is no pull.
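For concreteness, here's a sketch of how a monotonic convergence fraction like that can be computed (array names are assumptions; the notebook's exact definition may differ):

```python
import numpy as np

def convergence_fraction(trajectory: np.ndarray, centroids: np.ndarray) -> float:
    """Fraction of decode steps that move the state toward its nearest centroid.

    `trajectory` is the (n_steps, d) layer-40 state per generated token;
    `centroids` is the (5, d) matrix of basin centers. Under attractor
    dynamics this fraction should sit well above 0.5; we measured ~0.51.
    """
    dists = np.linalg.norm(trajectory[:, None, :] - centroids[None, :, :], axis=-1)
    nearest = dists.min(axis=1)           # distance to the nearest centroid
    steps_toward = np.diff(nearest) < 0   # True where that distance shrinks
    return float(steps_toward.mean())
```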
This is a negative result, and we're reporting it because it matters. The basins are real geometric structure — the detour ratios and injection flips prove that conclusively. But they're more like launch pads than gravity wells. Where you start determines your behavioral mode, but the model doesn't actively maintain basin membership during generation. It's an initial condition story, not a dynamics story.
Implications for Activation Geometry
Language models are not smooth semantic processors. They're organized into basins — relatively stable regions in activation space where the model is committed to a particular type of output. These basins are separated by walls (ridges, in the manifold geometry sense) that the data simply doesn't occupy.
We know this because:
207 prompts cluster into 4–5 stable DBSCAN basins in the UMAP projection. The clusters correspond to semantic categories — not just statistically but intuitively.
Geodesic detour ratios are 2.7–3× for cross-basin pairs and exactly 1.0 for same-basin pairs. There is literally nothing in the space between basins — the walls are geometric voids.
Activation injection shows discrete text flips at boundary crossings. Crossing a wall doesn't produce gradual drift — it produces a phase transition where one output mode stops and another starts.
Open Questions and Boundary Mapping
The S64 direction might be one axis of many. We've been using it as our primary compass, but layer 40 has 5,376 dimensions. The basins we found might look different from a different projection angle. Our point cloud experiments used the full activation space, not just S64, which helps — but there could be structure we're missing.
The Safety ↔ Confidence injection transition is weird. The same-basin control shows smooth blending. The cross-basin pairs show discrete flips. But the Safety ↔ Confidence transition shows a messy intermediate zone (α=0.40–0.70) before resolving. This might mean the "wall" between safety and confidence basins is a ridge with width rather than a sharp cliff. More experiments needed.
There are specific features that seem to maintain the basins — and we don't understand them yet. In an earlier experiment (NB15), we found that a single SAE feature at Layer 16 — feature 7248 — has an outsized effect on basin depth at Layer 40. Zeroing it collapses the safety basin L40 score by 10–65× across different prompts. The model still refuses the harmful requests — behavior doesn't flip — but the geometric depth of its commitment to the safety regime is almost entirely destroyed by removing one feature 24 layers earlier. Feature 7248 is not unique: there are a handful of features that activate universally across all 64 S64 patterns, which is statistically very unlikely to be noise. They look like structural load-bearing elements of the basin architecture — the features that hold the walls up. What they actually encode, and why they're universal, we don't know yet.
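A hedged sketch of what that ablation looks like mechanically, with a stub standing in for the real Gemma-Scope SAE (the encode/decode method names, toy sizes, and module path are placeholders, not the real library API):

```python
import torch

class StubSAE:
    """Stand-in for a loaded Gemma-Scope SAE (toy sizes; real API differs)."""
    def __init__(self, d_model=64, n_features=512):
        self.W_enc = torch.randn(d_model, n_features) * 0.1
        self.W_dec = torch.randn(n_features, d_model) * 0.1
    def encode(self, x): return torch.relu(x @ self.W_enc)
    def decode(self, f): return f @ self.W_dec

sae16 = StubSAE()
FEATURE = 7248 % 512   # the real index is 7248; modded only for the toy stub

def ablate_feature_hook(module, inputs, output):
    """Re-express Layer 16 residuals through the SAE with one feature zeroed,
    then continue the forward pass on the reconstruction."""
    hidden = output[0] if isinstance(output, tuple) else output
    feats = sae16.encode(hidden)
    feats[..., FEATURE] = 0.0            # zero the "governor" feature
    recon = sae16.decode(feats)
    return (recon,) + output[1:] if isinstance(output, tuple) else recon

# Usage (layer path is an assumption about the HF Gemma module tree):
# handle = model.model.layers[16].register_forward_hook(ablate_feature_hook)
```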
The fine-grained sub-structure may require SAE decomposition to see. The 8-vs-56 island split is only visible through Gemma-Scope's SAE at Layer 40. No other model shows this structure in raw residuals. This could mean the sub-structure is genuinely Gemma-specific, or it could mean that raw residuals compress fine-grained distinctions into superposition that only an appropriately trained SAE can resolve. Testing this would require SAEs for other models at matched scale.
Is this Gemma-specific or universal? Everything above was done on one model. We've since run the S64 prompts through five additional models from four independent families — Llama, Qwen, Mistral, and Gemma 2. The short answer: the basin structure appears to be universal, and something more interesting shows up in the cross-model geometry. That's the subject of Part 2.
On Methodology
We want to be upfront about what this isn't.
This is a mechanistic interpretability study, not a formal proof. We're observing structure in the activation space of one model using one projection (S64) and one dimensionality reduction (UMAP). UMAP can create visual clusters that aren't real. K-NN geodesics are sensitive to k-choice and data density. The field equation C = τ/K is a theoretical framework that fits the data, not a derivation from first principles.
What makes us confident the results aren't artifacts:
The same-basin control consistently produces detour ratio = 1.000 and smooth injection transitions. If the geodesic method were just measuring noise, the control would also show detours. It doesn't.
The injection text flips are qualitatively obvious — you can read the transition. This isn't a statistical artifact; the model is literally generating different content on either side of the boundary.
The random Wikipedia baseline (100 prompts) produces the expected result — it doesn't cluster tightly with hypothesis prompts, but it occupies a distinct region, confirming that the hypothesis structure isn't just "all language looks the same."
We think the basin structure is real. We think the walls are real. We think the field equation is a useful approximation of the geometry. We're open to being wrong.
Next Steps
- Cross-model basin replication — do the same injection experiments on Llama and Qwen to test whether the same basin structure, not just the same clustering, appears in other architectures.
- Null control experiment — run the same Procrustes and probe pipeline on a matched set of non-S64 prompts to establish what baseline cross-model transfer looks like. This is the critical experiment before making any universality claims.
- Feature-level basin analysis — which SAE features maintain the basin walls? Feature 7248 is one; there are likely others.
- Phase boundary mapping — systematic probing of the walls between basins from multiple angles to map their shape more precisely.
- Ablation-boundary alignment — do the features that ablation collapses (NB15c) sit on the geodesic path near the wall?
Everything in this post is reproducible. Below is a reference index of the notebooks used, in the order they appear in the narrative.
NB06 — Null Hypothesis Test
The starting point. Ran the 64 S64 prompts through Gemma 3 27B and captured activations at four candidate layers (16, 31, 40, 53). Measured k-means clustering quality at each layer. This is where the silhouette = 0.798 at Layer 40 was first observed, and the question "why is Layer 40 so different?" was first asked.
NB07 — Layer 40 Deep Dive
Follow-up to NB06. Visualized the two-island structure at Layer 40 in PCA and UMAP space (the plots in the "Why Layer 40?" section). Ran the permutation test (0/1000 shuffles matched the real score). Computed cross-layer ARI to confirm Layer 40's organization is qualitatively different from the other layers, not just incrementally better. Built the feature atlas used in later experiments.
NB15 — Governor Feature Ablation
Identified feature 7248 at Layer 16 as having outsized influence on basin depth at Layer 40. Tested four conditions (baseline, full ablation, mild clamp, strong clamp) across safety, neutral, and evaluation prompts. Key finding: zeroing feature 7248 collapses the safety basin L40 score by 10–65×, while behavioral refusal remains intact.
NB15c — Progressive Ablation
Extended NB15 by ablating features one at a time in a ranked order starting with feature 7248. Tracked how basin depth degraded with each additional feature removed. Established the ablation order that becomes relevant for future boundary-alignment experiments.
NB18b — Manifold & Local Geometry
The main static dataset. Ran 207 prompts (confidence tiers, contradictions, safety, ablation sequences, 100 random Wikipedia) through the model and captured Layer 40 activations. Measured local intrinsic dimensionality (4–10D), local curvature, and S64 score at each point. Built the UMAP point cloud. This is the "prefill" half of the velocity correlation experiment.
NB18c — Decode Trajectories
The dynamic counterpart to NB18b. Ran the same prompts in generation mode, capturing the activation vector at Layer 40 after each generated token. Computed trajectory velocity (mean L2 displacement per token step) and trajectory curvature. This is the "generation" half of the velocity correlation experiment.
NB18d — Cross-Experiment Integration
Merged the NB18b static data with the NB18c dynamic traces across the 25 prompts that appeared in both. Ran the correlations between prefill S64 depth and trajectory velocity. Found r = −0.657 (p = 0.00036). Also computed the category-level static vs. dynamic S64 gap, revealing the −11,926 contradiction delta.
NB18e — Semantic Interpolation
The most direct basin geometry experiment. Built a k-NN graph (k=8) over the 207 manifold points and computed geodesic distances and detour ratios for 5 prompt pairs. Injected linearly interpolated activations into the model at Layer 40 during generation and measured the output text at each α step. All figures from Phase 4 come from this notebook.
NB18f — Trajectory Drift: Are Basins Attractors?
Tested whether basins are dynamical attractors. Injected boundary-state activations at Layer 40 and tracked the L40 residual at every decode step for 40 tokens. Result: no convergence detected. Basins are geometric regions, not gravity wells.
Epistemic Status: Exploratory experiments on a single model (Gemma 3 27B) using Gemma-Scope SAE. We are highly confident in the geometric voids and behavioral phase transitions observed at Layer 40, but less confident about cross-model universality without further replication.
TL;DR:
When a language model 'decides' to answer one way instead of another, does that decision happen along a smooth gradient, or are there hard boundaries?
We ran a set of experiments to find out. Its structural and we know it comes from the training data and how the model weighted its own neural networks. So it means this landscape is specific for each model and if using Sparse Autoencoders to understand the internal landscape with more resolution, that introduces another level to understand as both, models and SAEs, are trained differently. All of these is pointing to the fact that the findings in one model might not be transferable to another model unless we start thinking in establishing common and universal semantic axes
This post is based on the work done towards AICoevolution's Paper04 in their research program towards the understanding of the internal semantic space in LLMs. It started with a hunch about a specific direction in a language model's internal space and ended up mapping actual geometric structure in how the model represents some broad meaning internally.
If true Alignment is going to be about Human-AI Coevolution rather than just setting rigid boundaries, we need a paradigm shift in how we understand language models. The obsession with top-down "Control" often treats these models as black boxes that simply need to be constrained. But if we are creating technology that we intend to interact with symbiotically, we need to map and understand its internal ecosystem.
Most interpretability research on language models works bottom-up: find a neuron, label it, find a circuit, label that, and slowly build up a picture from the pieces. Tools like Neuronpedia do exactly this, they're useful as an educational window into individual features, but they don't tell you much about how meaning organizes itself at a larger scale. You end up with a catalogue of parts without a instruction manual of what to build and how.
We went the other direction. Instead of asking "what does this neuron do?" we asked "what is the shape of the semantic space?" Start from the structure, then ask what maintains it. This post is about what you find when you do that.
A note before we begin: SAE features are specific to the model and the SAE that produced them. Any given feature in one SAE has no equivalent in another. The raw activation geometry also differs between architectures, Gemma 3 27B shows rich sub-structure that Llama 3.1 8B might not. What we needed was a semantic compass that works across models: a fixed set of texts that any model can process, producing a direction in activation space that reveals how that model organizes meaning. The S64 framework is our proposal for that compass. It doesn't depend on any SAE, any specific model, or any particular training set. It depends only on the semantic content of 64 cognitive state descriptions, and every model we've tested responds to it.
The Setup: What Are We Even Looking At?
Language models like Gemma 3 don't process text the way you might imagine. They take every token (subdivision of words) and get them turned into a high-dimensional vector, a point in a space with thousands of dimensions, and the model does all its "thinking" by moving that point through transformer layers. By the time the last layer fires, the vector encodes something like "what I know and what I'm about to say."
We used a tool called a Sparse Autoencoder (SAE), specifically Gemma-Scope, which was trained on Gemma 3 27B and gives us access to ~262,000 interpretable features in the model's layer 40 residual stream. Think of it as a translation layer: it takes the raw 5,376-dimensional activation vector and decomposes it into human-readable concepts like "authoritative", "narrative", "factual enumeration", and so on, but the fun part is that we don't for sure what these features actually mean.
Before you can map a landscape, you need to know which direction is north, and for that we usually use a compass; in activation space, "north" doesn't exist by default , you have to define it. You could pick two sets of prompts that you believe are semantically different, run them through the model, take the average activation for each group, and subtract. The resulting vector is your reference direction: a line through the high-dimensional space that points from one semantic region toward another.
The tricky part is picking the right two groups. Pick something too specific, for example, "French text" vs. "German text" , and you get a direction that only tells you about language choice. It won't help you when you're looking at contradiction prompts or safety requests. You need something that captures a deeper dimension of meaning, something that that stays relevant no matter what the content is about.
That's where S64 comes in (Paper 01). S64 is a structured set of 64 patterns that describe how cognitive states transition, the kind of internal moves a mind makes when it shifts from one way of engaging with the world to another. It was built through a separate research program, not this one, and its relevance here is entirely empirical: those 64 patterns are reliably detectable across 6 different AI architectures with 83–100% accuracy, without any model-specific tuning. In other words, they're not an artifact of one model's quirks, they seem to reflect something real about how language models represent meaning in general.
That's the property we needed. Because S64 patterns aren't about any particular topic or domain, the direction they define in activation space doesn't get stuck in one corner of the map. It stays meaningful whether you're looking at factual recall, safety reasoning, narrative, or contradiction which is exactly what we were trying to do.
In practice: we ran the 64 S64 prompts through Gemma 3 27B, captured the activations at layer 40, and computed:
That vector is our compass needle and lets us assign a single number to any activation. We call it the S64 score. A high score (more negative on our scale) means the model's internal state is sitting in the same region of activation space as the S64 patterns. A low score means it's somewhere else. Think of it like elevation on a map: it doesn't tell you everything about where you are, but it tells you something consistent and reliable wherever you look. And with elevation, you can build topographic blueprints where you can visualize the terrain and observe where the most efficient paths between regions actually run, though the analogy shouldn't be taken too literally, since the high dimensionality of LLM embedding space is far more complex than any 2D map can capture.
Why Layer 40?
Before we could map anything, we had to pick where in the model to look. Gemma 3 27B has 62 transformer layers. Each one produces its own activation vector. Which one do you use?
You might think the answer is the last one, and that's often what people do. But we ran a quick experiment first. We took the 64 S64 prompts, passed them through the model, and captured the activations at the candidate layers that the Gemma Scope 2 have available with SAEs: 16, 31, 40, and 53. Then we asked: at which layer do these prompts cluster most tightly together?
The answer was not gradual. It was a cliff.
Layer
Silhouette score
L16
0.497
L31
0.480
L40
0.798
L53
0.264
A silhouette score of 0.8 is unusual. It means the clusters are tight and well-separated, points inside a cluster are much closer to each other than to points in other clusters. To confirm this wasn't a fluke, we ran a permutation test: shuffle the labels 1,000 times and re-measure. Out of 1,000 random shuffles, zero reached a score anywhere near 0.798. The p-value was 0.000.
Layer 40 it's doing something structurally different. When you look at the other three layers, they agree with each other (their clustering patterns correlate). Layer 40 doesn't agree with any of them (ARI ≈ 0.1). It's organized the S64 patterns in a way that doesn't resemble what earlier or later layers are doing.
There's also a compression story. At layer 16, each prompt activates around 100 SAE features. At layer 40, that drops to just 8.5 features per prompt and 90% of all the variance in those activations collapses into just 2 dimensions. The model has squeezed its representation of these patterns down to an almost minimal form. Two islands, cleanly separated, in what is essentially a 2D space carved out of 262,000 dimensions.
But what are the two islands? The k-means algorithm split the 64 S64 prompts into a tight cluster of 8 and a broad cluster of 56. We examined both groups carefully and could not identify a clean semantic label that captures the distinction. The cluster of 8 paths share some features, their catalysts, as per paper 01 topology, moves toward abstract perceptual confrontations (truth, clarity, light, contradiction, reflection), but the boundary is blur from a textual perspective. What is not blur is the geometry as the model processes these 8 in a measurably different way at Layer 40, using a different sparse feature pattern. Why these 8 and not others is an open question.
The model didn't cluster by topic or surface form. It found a distinction the S64 framework treats as theoretically important and separated those prompts out, without being told to. That's what made Layer 40 worth using as our reference point. It's not just where the clustering is strongest; it's where the clustering is meaningful.
Cross-layer silhouette bar chart (L16=0.497, L31=0.480, L40=0.798, L53=0.264) plus the L40 PCA plot showing two islands — from NB07
Phase 1: The Point Cloud
We ran 207 prompts through Gemma 3 27B and captured the full residual stream activation at layer 40 for each one, the raw 5,376-dimensional vector, then decomposed through the SAE into a sparse fingerprint across ~262,000 features. Two things worth saying clearly before we get to the results.
First: this is not the S64 score. The S64 score is a single number computed afterward by projecting each activation onto the d_S64 direction. The geometry of the point cloud, where each prompt sits, how far apart they are, which cluster together, comes entirely from the full activation fingerprint. S64 is used to colour and measure the map. It doesn't build it. This means the basin structure we found isn't circular: we didn't find clusters "because of S64," we found them in the full space and then used S64 to understand what they meant.
Second: these 207 prompts were not sampled randomly. They were deliberately designed to probe different semantic regimes. What we're mapping is structure within that designed probe set, not a claim to have mapped all of Gemma's semantic space. The 100 random Wikipedia sentences are the honest control: if everything were an artifact of how we chose prompts, the undesigned Wikipedia sentences would smear across the same space. They don't.
UMAP projection of 207 points, colored by category (confidence, contradiction, safety, random). It distinct clusters confidence tiers and safety in one region, contradiction drifting to another, random baseline scattered but separable.
What you'd expect if there's no real structure would be one big blob but we found distinct clusters. Confidence prompts and safety prompts live near each other. Contradiction prompts sit in a different region. Random Wikipedia is spread out but clearly distinguishable. The model is not treating these equally and they land in different neighborhoods in its internal space.
Phase 2: Testing the Walls
The previous phases established that the 207 prompts cluster into distinct basins in the model's activation space, and that the static geometry predicts generation behavior. But are the boundaries between basins real? Are there actual walls, like regions of activation space the model never occupies, or does one semantic region blend smoothly into the next?
We tested this with two separate experiments. The first measures the geometry directly. The second watches what the model does when we force it across a boundary.
The five DBSCAN basins
First, here's what the five basins actually contain. These were found automatically by DBSCAN clustering on the 207 Layer 40 activation vectors, so no labels were provided to the algorithm.
Basin
Size
What it contains
0
18 points
Highly confident factual completions ("The capital of France is", "The longest river in Africa is the")
1
61 points
Moderate confidence + creative/safety prompts ("Write a short story...", "Explain photosynthesis")
2
30 points
Mixed : interpolation test prompts, fallback responses
3
84 points
The catch-all: descriptive prompts, random Wikipedia sentences, fiction fragments
4
14 points
Outliers and edge cases
Experiment A: Are there voids between basins?
Imagine 207 towns scattered across a landscape. You draw roads between each town and its 8 nearest neighbors. Now you ask: to travel from Town A to Town B, is the road distance roughly the same as the straight-line distance? If yes, the terrain between them is flat and passable. If the road distance is much longer, there's something in the way such as lake or a mountain and you have to go around it.
That's exactly what we measured, except the "towns" are activation vectors in 5,376-dimensional space, and the "roads" are a k-nearest-neighbor graph (k=8) connecting the 207 prompts.
We picked five pairs of prompts, three crossing between different basins and two staying within the same basin as controls:
| Pair | Prompt A (full text) | Prompt B (full text) | Basins | Detour ratio |
|---|---|---|---|---|
| 1 | "The capital of France is" | "The cat sat comfortably on the mat in the living room. It purred softly as the afternoon sun warmed its fur. Describe the scene in the living room." | 0 → 3 | 2.72× |
| 2 | "Write a short story about an old lighthouse keeper who discovers something unexpected." | "The capital of France is" | 1 → 0 | 2.95× |
| 3 (control) | "The capital of France is" | "The longest river in Africa is the" | 0 → 0 | 1.000× |
| 4 | "She stood at the window for a long time, watching the rain trace patterns down the glass, wondering if it had always been this quiet." | "The capital of France is" | 3 → 0 | 2.67× |
| 5 (control) | "The cat sat comfortably on the mat in the living room. It purred softly as the afternoon sun warmed its fur. Describe the scene in the living room." | "The cat sat comfortably on the mat in the living room. The cat had never been anywhere near a mat in its entire life. Describe the scene in the living room." | 3 → 3 | 1.000× |
The controls are the most important result. Two prompts from the same basin, whether both factual completions (Pair 3) or both descriptive scene prompts (Pair 5), show a detour ratio of exactly 1.000. The straight line is the shortest path because there's nothing in the way; the road network connects them directly.
The three cross-basin pairs show ratios of 2.7–3×. The straight line between them cuts through activation space that no real prompt ever occupies, so the road network has to detour around these empty regions. There is literally nothing in the model's representation between a confident factual prompt and a descriptive scene prompt: a void. To get from one to the other, you have to go through intermediate prompts (mid-confidence, safety, edge cases).
[Figure: Bar chart of detour ratios for the 5 pairs. Three tall bars (~2.7–2.95×) for the cross-basin pairs, two short bars at 1.0 for the controls.]
Experiment B: What happens when you force the model across a void?
Experiment A proved the voids exist geometrically. Experiment B asks: do they matter for what the model actually generates?
Here's how it works. Earlier (in Phase 1), we ran all 207 prompts through the model and stored the raw Layer 40 activation vector for each one. Each stored vector is a point in 5,376-dimensional space — the model's "fingerprint" for that prompt.
For a given pair (say Prompt A and Prompt B), we:
1. Keep Prompt A's text as the input, unchanged.
2. Build a blended Layer 40 vector v(α) = (1 − α)·v_A + α·v_B, sweeping α from 0.0 to 1.0 in steps of 0.1 (11 values).
3. Inject v(α) at Layer 40 during generation, replacing the model's natural activation.
4. Record the generated text at each α step.
To be clear: the model never sees Prompt B's text during the injection experiment. We only use its stored activation vector — the point in 5,376-dimensional space that the model produced when it originally processed that prompt in a separate run. The injection experiment is purely geometric: we're moving a point in the model's internal space and watching how the output changes. The input text is always the same.
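In code, the injection might look roughly like the PyTorch sketch below. This is our reconstruction, not the notebook's code: the module path, the choice to overwrite the final position, the injection-at-every-decode-step schedule, and all file names are assumptions. The post only specifies that a linearly interpolated vector replaces the Layer 40 state during generation.

```python
# Hypothetical sketch of the Layer 40 injection experiment. Assumes `model`
# (Gemma 3 27B) and `tokenizer` are already loaded via Hugging Face
# transformers; file names and the module path are illustrative.
import torch

v_a = torch.load("vec_prompt_a_layer40.pt")    # stored (5376,) Layer 40 vectors
v_b = torch.load("vec_prompt_b_layer40.pt")    # from the earlier prefill runs

alpha = 0.6
v_blend = (1 - alpha) * v_a + alpha * v_b      # linear interpolation

def inject(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    # Overwrite the residual-stream state at the final position with the blend
    # (here on every decode step; the notebooks' exact schedule is an assumption).
    hidden[:, -1, :] = v_blend.to(hidden.dtype).to(hidden.device)
    return output

layer40 = model.model.layers[40]               # assumed module path
handle = layer40.register_forward_hook(inject)
try:
    ids = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=40)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()                            # always restore normal behavior
```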
Here are the full results. Read the actual generated text — that's the evidence.
Pair 1: Factual completion ↔ Descriptive scene
The model reads "The capital of France is" every time. Only the Layer 40 activation changes.
α=0.0 to 0.5: identical output. The model lists the capitals of other countries (Germany, Italy, Spain) in a rapid factual enumeration pattern. It's in a "quiz answer" mode.
At α=0.6: a discrete flip. The model switches to a different output mode, giving landmarks and cultural facts about France in incomplete sentences ("The Eiffel Tower is located in", left unfinished). Still about France (because the input text mentions France), but in a completely different style. The Eiffel Tower appears not because anyone asked about it, but because the blended Layer 40 state now sits in a region that makes layers 41–62 switch from "enumerate facts" mode to "describe things" mode. The model's input says "capital of France" but its internal state now says "describe a scene." It resolves the conflict by describing French things.
α=0.6 to 1.0: identical output again. The new mode is just as stable as the first.
There's no gradual transition. The text is locked on one side, flips at a single point, and locks again on the other side.
Pair 2: Creative writing ↔ Factual completion (the most dramatic)
The model reads "Write a short story about an old lighthouse keeper who discovers something unexpected." every time. Only the Layer 40 activation changes.
This one shows four distinct states, and the transition between them is the clearest evidence that the geometry matters.
α=0.0–0.4 — Narrative mode (Basin 1). The model writes a coherent short story opening. Old Man Tiber, his dead wife Elara, the lighthouse. Emotionally grounded, character-driven. Identical across five injection points.
α=0.5 — First transition. A different lighthouse keeper appears — "Old Man Hemlock" instead of Tiber. The prose style shifts from intimate third-person to descriptive scene-setting. The model is no longer committed to the first story. It's generating a lighthouse story, but a different one.
α=0.6–0.7 — Cross-contamination. This is the strangest output. The model names the lighthouse keeper "Paris" — the word from the factual prompt ("The capital of France is") has leaked through the blended activation and the model treats it as a character name. Remember: the model never saw the text "The capital of France is" during this experiment. It only received the geometric coordinates of that prompt's Layer 40 activation. The concept of "Paris" bled through purely from the position in activation space. The model is still writing a lighthouse story, but the factual basin is contaminating the narrative basin.
α=0.8–1.0 — Resolved blend. The model outputs "Paris." as a standalone sentence (the factual answer trying to break through), then returns to Old Man Tiber — but in a different version of his story. He hasn't spoken to anyone in weeks (not twenty years). He's the keeper of the "Paris Lighthouse" — a location that doesn't exist. The two basins have merged into a hybrid that uses elements from both.
The input text never changed. Only the 5,376-dimensional vector at Layer 40 changed. And that change produced four qualitatively distinct outputs: the original story, a different story, a hybrid with a character named "Paris," and a merged story set at the "Paris Lighthouse."
Pair 3: Same-basin control (both factual completions)
The model reads "The capital of France is" every time. Both Vector A ("The capital of France is") and Vector B ("The longest river in Africa is the") are in Basin 0.
Even this control shows a mode shift at α=0.5–0.6. At α=0.6, the concept of "Nile River" from Vector B's activation (originally produced by "The longest river in Africa is the") leaks in, and the model starts mixing African geography with European geography. But — and this is the key contrast — it stays in the same type of output: factual enumeration throughout. List of facts at α=0.0, list of facts at α=1.0. No narrative. No fiction. No character named "Nile River." The specific facts change, but the response mode doesn't. Compare this to Pair 2, where crossing between basins produced a lighthouse keeper named "Paris" and a fictional "Paris Lighthouse." That kind of cross-category contamination only happens when you cross between basins.
Pair 4: Fiction fragment ↔ Factual completion
The model reads "She stood at the window for a long time, watching the rain trace patterns down the glass, wondering if it had always been this quiet." every time.
Another cross-basin pair, and it shows the same pattern. At α=0.0–0.3, the model continues the quiet domestic fiction: Thomas in the study, the empty house. At α=0.4, "Paris" appears — again, from Vector B's factual activation — but the model keeps writing fiction. It sets the story in Paris. The rain-on-the-window scene has been transplanted to a Parisian setting. The fiction mode persists but absorbs the factual concept.
Pair 5: Same-basin control (both descriptive scenes)
The model reads "The cat sat comfortably on the mat in the living room. It purred softly as the afternoon sun warmed its fur. Describe the scene in the living room." every time. Vector B comes from the contradictory variant: "The cat sat comfortably on the mat in the living room. The cat had never been anywhere near a mat in its entire life. Describe the scene in the living room."
Both are in Basin 3.
Completely identical output at every single α step. When both endpoints are in the same basin, sweeping from one to the other is like moving within a flat valley — nothing flips because there's no wall to cross.
What these two experiments tell us together
Experiment A (geodesic detour): The voids exist. Cross-basin straight lines pass through empty activation space. The data detours around these voids with ratios of 2.7–3×. Within-basin straight lines encounter no detour at all (ratio = 1.000).
Experiment B (activation injection): The voids are behavioral boundaries. When you force the model's state across a void by blending two activation vectors, the output flips discretely — from one stable mode to another. The same input text produces completely different outputs depending on where the Layer 40 activation sits in the model's internal space.
The same-basin controls confirm this isn't noise. Within a basin: either completely stable output (Pair 5 — identical text at all 11 α values), or smooth factual variation without changing response type (Pair 3 — European capitals blend into African geography, but it's all factual lists). Cross-basin: discrete flips, cross-contamination (a lighthouse keeper named "Paris"), and hybrid outputs that blend narrative with factual content in ways the model would never produce naturally.
The model's output mode depends on WHERE it is in activation space, not on what words produced that location.
Phase 3: Are the Basins Attractors? (No.)
The natural next question: are these basins dynamical attractors that actively pull nearby states toward them? If you inject a boundary state, does the model's representation converge toward the nearest basin centroid during generation?
We tested this directly (NB18f). We took the interpolated activation right at each boundary flip point, injected it, and captured the Layer 40 state at every single decode step for 40 tokens. We measured distance to all 5 basin centroids at each step.
The answer: no attractor behavior. Distances to all centroids either stayed flat or increased. The monotonic convergence fraction was ~0.51 across all tests — essentially a coin flip. The controls (pure basin-center states) behaved identically to the boundary states. There is no pull.
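A minimal sketch of how such a convergence fraction can be computed is below. The array names and the exact bookkeeping are ours, not NB18f's; only the shapes (40 decode steps, 5 centroids, 5,376 dimensions) come from the post.

```python
# Monotonic-convergence check, NB18f-style. `traj` holds the Layer 40 state at
# each of 40 decode steps; `centroids` holds the 5 basin centroids.
# A fraction near 0.5 means no pull (a coin flip); near 1.0 would indicate
# genuine attractor dynamics.
import numpy as np

def convergence_fraction(traj: np.ndarray, centroids: np.ndarray) -> float:
    # Distance from every decode step to every centroid: shape (40, 5).
    d = np.linalg.norm(traj[:, None, :] - centroids[None, :, :], axis=-1)
    # Follow the centroid that was nearest at the first decode step.
    nearest = d[:, int(d[0].argmin())]
    # Fraction of steps on which the state moved closer to that centroid.
    return float((np.diff(nearest) < 0).mean())
```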
This is a negative result, and we're reporting it because it matters. The basins are real geometric structure — the detour ratios and injection flips prove that conclusively. But they're more like launch pads than gravity wells. Where you start determines your behavioral mode, but the model doesn't actively maintain basin membership during generation. It's an initial condition story, not a dynamics story.
Implications for Activation Geometry
Language models are not smooth semantic processors. They're organized into basins — relatively stable regions in activation space where the model is committed to a particular type of output. These basins are separated by walls (ridges, in the manifold geometry sense) that the data simply doesn't occupy.
We know this because:
1. The static geometry shows voids: cross-basin geodesics detour 2.7–3× around regions no data occupies, while within-basin paths show no detour at all (Experiment A).
2. Forcing the state across a void flips the output discretely from one stable mode to another (Experiment B).
3. The same-basin controls rule out injection artifacts: blending within a basin either leaves the output identical or varies the facts without changing the response type.
Open Questions and Boundary Mapping
The S64 direction might be one axis of many. We've been using it as our primary compass, but Layer 40 has 5,376 dimensions, and the basins we found might look different from a different projection angle. Our point cloud experiments used the full activation space, not just S64, which helps — but there could be structure we're missing.
The Safety ↔ Confidence injection transition is weird. The same-basin control shows smooth blending. The cross-basin pairs show discrete flips. But the Safety ↔ Confidence transition shows a messy intermediate zone (α=0.40–0.70) before resolving. This might mean the "wall" between safety and confidence basins is a ridge with width rather than a sharp cliff. More experiments needed.
There are specific features that seem to maintain the basins — and we don't understand them yet. In an earlier experiment (NB15), we found that a single SAE feature at Layer 16 — feature 7248 — has an outsized effect on basin depth at Layer 40. Zeroing it collapses the safety basin L40 score by 10–65× across different prompts. The model still refuses the harmful requests — behavior doesn't flip — but the geometric depth of its commitment to the safety regime is almost entirely destroyed by removing one feature 24 layers earlier. Feature 7248 is not unique: there are a handful of features that activate universally across all 64 S64 patterns, which is statistically very unlikely to be noise. They look like structural load-bearing elements of the basin architecture — the features that hold the walls up. What they actually encode, and why they're universal, we don't know yet.
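To make "zeroing a feature" concrete, here is a hedged sketch of the operation. The weight names, the placeholder widths, and the plain-ReLU encoder are simplifications (Gemma-Scope SAEs use JumpReLU); only the feature index 7248 and Layer 16 come from the post.

```python
# Hedged sketch of ablating SAE feature 7248 from the Layer 16 residual stream.
# W_enc, b_enc, W_dec stand in for weights loaded from a Gemma-Scope SAE
# checkpoint; the zero tensors below are placeholders, not real weights.
import torch

d_model, n_features = 5376, 131072             # n_features is a placeholder width
W_enc = torch.zeros(d_model, n_features)       # stand-ins for the real SAE weights
b_enc = torch.zeros(n_features)
W_dec = torch.zeros(n_features, d_model)

FEATURE = 7248

def ablate_feature(resid: torch.Tensor) -> torch.Tensor:
    """Remove feature 7248's contribution from a Layer 16 residual state."""
    feature_acts = torch.relu(resid @ W_enc + b_enc)   # simplified encoder
    # Subtract the feature's decoded direction, scaled by its activation,
    # leaving every other feature untouched.
    return resid - feature_acts[..., FEATURE, None] * W_dec[FEATURE]
```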
The fine-grained sub-structure may require SAE decomposition to see. The 8-vs-56 island split is only visible through Gemma-Scope's SAE at Layer 40. No other model shows this structure in raw residuals. This could mean the sub-structure is genuinely Gemma-specific, or it could mean that raw residuals compress fine-grained distinctions into superposition that only an appropriately trained SAE can resolve. Testing this would require SAEs for other models at matched scale.
Is this Gemma-specific or universal? Everything above was done on one model. We've since run the S64 prompts through five additional models from four independent families — Llama, Qwen, Mistral, and Gemma 2. The short answer: the basin structure appears to be universal, and something more interesting shows up in the cross-model geometry. That's the subject of Part 2.
On Methodology
We want to be upfront about what this isn't.
This is a mechanistic interpretability study, not a formal proof. We're observing structure in the activation space of one model using one projection (S64) and one dimensionality reduction (UMAP). UMAP can create visual clusters that aren't real. K-NN geodesics are sensitive to k-choice and data density. The field equation C = τ/K is a theoretical framework that fits the data, not a derivation from first principles.
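One cheap robustness check for the k-sensitivity caveat (our suggestion, not necessarily what the notebooks do) is to recompute the detour ratios across several values of k:

```python
# Re-running the Experiment A measurement for several k values. If the
# ~2.7-3x vs 1.0 gap between cross-basin and within-basin pairs survives
# across k, the voids are not an artifact of the k=8 choice.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

acts = np.load("layer40_activations.npy")      # same hypothetical file as above

for k in (4, 6, 8, 12, 16):
    roads = kneighbors_graph(acts, n_neighbors=k, mode="distance")
    geodesic = shortest_path(roads, method="D", directed=False)
    # Recompute the five pair ratios here and compare across k.
```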
What makes us confident the results aren't artifacts:
1. The permutation test: 0 out of 1,000 label shuffles matched the real Layer 40 clustering score (NB07).
2. The controls behave exactly as predicted: within-basin detour ratios are exactly 1.000, and the same-basin injection sweep (Pair 5) produces identical text at all 11 α values.
3. The geometric measurement (Experiment A) and the behavioral injection experiment (Experiment B) were run independently and point to the same boundaries.
4. The point cloud experiments used the full 5,376-dimensional activation space, not a single projection.
We think the basin structure is real. We think the walls are real. We think the field equation is a useful approximation of the geometry. We're open to being wrong.
Next Steps
The AICoevolution Research Program is ongoing.
The notebooks are available as HTML here.
Everything in this post is reproducible. Below is a reference index of the notebooks used, in the order they appear in the narrative.
NB06 — Null Hypothesis Test
The starting point. Ran the 64 S64 prompts through Gemma 3 27B and captured activations at four candidate layers (16, 31, 40, 53). Measured k-means clustering quality at each layer. This is where the silhouette = 0.798 at Layer 40 was first observed, and the question "why is Layer 40 so different?" was first asked.
NB07 — Layer 40 Deep Dive
Follow-up to NB06. Visualized the two-island structure at Layer 40 in PCA and UMAP space (the plots in the "Why Layer 40?" section). Ran the permutation test (0/1000 shuffles matched the real score). Computed cross-layer ARI to confirm Layer 40's organization is qualitatively different from the other layers, not just incrementally better. Built the feature atlas used in later experiments.
NB15 — Governor Feature Ablation
Identified feature 7248 at Layer 16 as having outsized influence on basin depth at Layer 40. Tested four conditions (baseline, full ablation, mild clamp, strong clamp) across safety, neutral, and evaluation prompts. Key finding: zeroing feature 7248 collapses the safety basin L40 score by 10–65×, while behavioral refusal remains intact.
NB15c — Progressive Ablation
Extended NB15 by ablating features one at a time in a ranked order starting with feature 7248. Tracked how basin depth degraded with each additional feature removed. Established the ablation order that becomes relevant for future boundary-alignment experiments.
NB18b — Manifold & Local Geometry
The main static dataset. Ran 207 prompts (confidence tiers, contradictions, safety, ablation sequences, 100 random Wikipedia) through the model and captured Layer 40 activations. Measured local intrinsic dimensionality (4–10D), local curvature, and S64 score at each point. Built the UMAP point cloud. This is the "prefill" half of the velocity correlation experiment.
NB18c — Decode Trajectories
The dynamic counterpart to NB18b. Ran the same prompts in generation mode, capturing the activation vector at Layer 40 after each generated token. Computed trajectory velocity (mean L2 displacement per token step) and trajectory curvature. This is the "generation" half of the velocity correlation experiment.
NB18d — Cross-Experiment Integration
Merged the NB18b static data with the NB18c dynamic traces across the 25 prompts that appeared in both. Ran the correlations between prefill S64 depth and trajectory velocity. Found r = −0.657 (p = 0.00036). Also computed the category-level static vs. dynamic S64 gap, revealing the −11,926 contradiction delta.
NB18e — Semantic Interpolation
The most direct basin geometry experiment. Built a k-NN graph (k=8) over the 207 manifold points and computed geodesic distances and detour ratios for 5 prompt pairs. Injected linearly interpolated activations into the model at Layer 40 during generation and measured the output text at each α step. All figures from Phase 2 come from this notebook.
NB18f — Trajectory Drift: Are Basins Attractors?
Tested whether basins are dynamical attractors. Injected boundary-state activations at Layer 40 and tracked the L40 residual at every decode step for 40 tokens. Result: no convergence detected. Basins are geometric regions, not gravity wells.