Author: Mohammed Suhail B Nadaf
TLDR: I extracted 768 function vectors from Llama-3.1-8B (Base and Instruct) across 3 tasks, 8 templates, and 16 layers—then tested all 2,688 cross-template transfer pairs per model. The aggregate correlation between cosine similarity and transfer accuracy was strongly negative (Base: r = −0.572, p < 10⁻²³³; Instruct: r = −0.361, p < 10⁻⁸³). Looks dramatic. Looks important. It's a Simpson's paradox. The negative trend exists only in the pooled data. Within each individual task, the correlation vanishes or flips positive. The reason: 2 of my 3 tasks produce completely non-functional steering vectors (capitalize: literally 0.00 IID accuracy in Base, 0.11 in Instruct; sentiment flip: 0.02 Base, 0.06 Instruct). Those broken vectors happen to have high cosine similarity to each other. The one working task (antonym: 0.57 Base, 0.65 Instruct) produces vectors with lower cosine. Mix them together and you get a beautiful, highly significant, completely misleading negative correlation. The methodological lesson: always check if your steering vectors actually work before computing fancy aggregate statistics about them.
How It Started
Here's the thing about running "quick experiments"—they never stay quick.
I wanted to study cross-template function vector transfer. Simple question: if you extract a steering vector using one prompt template ("The opposite of hot is cold"), does it still work when you test it on a different template ("antonym(hot) = cold")? Prior work evaluates FVs almost exclusively in-distribution—same template in, same template out. But for anything safety-relevant, where someone can just rephrase a prompt to dodge your intervention, cross-template robustness is the test that matters.
So I set up something small. Three tasks on Llama-3.1-8B (Base and Instruct). Eight templates per task, spanning natural language ("The opposite of hot is cold"), symbolic notation ("hot ↔ cold"), functional syntax ("antonym(hot) = cold"), verbose paraphrases, instruction-style, question-answer, roleplay, and formal academic framing. Sixteen layers. Extract everything, test everything, see what patterns emerge.
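For concreteness, the extraction is the usual mean-difference recipe, sketched below. The helper and the capitalized globals (get_resid, TASKS, TEMPLATES, LAYERS, PROMPTS, BASELINES) are placeholders standing in for the actual pipeline, not real APIs:

```python
import torch

def extract_fv(task_prompts, baseline_prompts, layer, get_resid):
    """Mean-difference function vector for one (task, template, layer).

    get_resid(prompt, layer) is a placeholder for whatever hook returns the
    final-token residual-stream activation at `layer`.
    """
    pos = torch.stack([get_resid(p, layer) for p in task_prompts]).mean(dim=0)
    neg = torch.stack([get_resid(p, layer) for p in baseline_prompts]).mean(dim=0)
    return pos - neg

fvs = {}
for task in TASKS:                        # antonym, capitalize, sentiment flip
    for template in TEMPLATES[task]:      # 8 templates per task
        for layer in LAYERS:              # the 16 probed layers
            fvs[(task, template, layer)] = extract_fv(
                PROMPTS[(task, template)],
                BASELINES[(task, template)],
                layer,
                get_resid,
            )
# 3 tasks x 8 templates x 16 layers = 384 vectors per model
```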
The three tasks were chosen to span different computational regimes: antonym (retrieving a stored lexical relation), capitalize (a character-level string manipulation), and sentiment flip (a compositional rewrite of an evaluative word in context).
384 function vectors per model (768 total). 2,688 cross-template transfer pairs per model. Not a massive study, but enough to see something interesting.
And I did see something interesting. Just not the thing I expected.
Step Zero: Do the Vectors Actually Work?
Before looking at any cross-template result, I checked whether the extracted vectors actually work in-distribution. Same template for extraction and test. The most favorable possible condition. If the FV can't steer the model here, it can't steer it anywhere.
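The check itself is simple: add the vector back into the residual stream at its extraction layer and see whether the model completes held-out inputs from the same template. A minimal sketch (generate_with_steering is a placeholder for whatever hook adds alpha * fv at the chosen layer):

```python
def steered_accuracy(fv, layer, alpha, eval_pairs, generate_with_steering):
    """Fraction of prompts the steered model completes correctly.

    eval_pairs: (prompt, target) pairs. Built from the same template the
    vector came from, this is the IID check; built from a different
    template, it becomes the cross-template transfer accuracy used later.
    generate_with_steering: placeholder for a wrapper that runs the model
    while adding alpha * fv to the residual stream at `layer`.
    """
    hits = sum(
        generate_with_steering(p, fv, layer=layer, alpha=alpha).strip().lower()
        == target.strip().lower()
        for p, target in eval_pairs
    )
    return hits / len(eval_pairs)
```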
| Task | Base IID Accuracy | Instruct IID Accuracy |
| --- | --- | --- |
| Antonym | 0.573 | 0.649 |
| Capitalize | 0.000 | 0.111 |
| Sentiment Flip | 0.021 | 0.059 |
Read that capitalize row again. Zero. Not "low." Not "marginal." Zero. Across every single layer, every single template, every alpha multiplier I tried—the mean-difference extraction method does not capture the capitalize function in Llama-3.1-8B Base. The vectors are geometrically real objects with nonzero magnitude and definite direction. They just don't do anything.
Sentiment flip is barely better at 0.02 (Base) and 0.06 (Instruct). Only antonym actually produces functional steering vectors in either model variant.
This is the finding that reframes everything else. Two-thirds of my experimental pipeline is producing vectors that look like steering vectors, smell like steering vectors, and have all the geometric properties you'd associate with steering vectors. They just don't steer.
The Aggregate Correlation That Fooled Me
So here I am with 2,688 cross-template pairs per model. I plot cosine similarity (how geometrically aligned are two FVs from different templates?) against OOD transfer accuracy (does the source template's FV work on the target template's test inputs?).
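Building those pairs is mechanical: for every task, layer, and ordered (source, target) template pair, record the cosine between the two FVs and the accuracy of the source FV on the target template's eval set. A sketch, reusing the placeholder helpers from above (ALPHA and EVAL_SETS are likewise illustrative):

```python
import itertools

import pandas as pd
import torch.nn.functional as F

rows = []
for task in TASKS:
    for layer in LAYERS:
        # 8 x 7 = 56 ordered (source, target) template pairs per cell
        for src, tgt in itertools.permutations(TEMPLATES[task], 2):
            v = fvs[(task, src, layer)]
            rows.append({
                "task": task,
                "layer": layer,
                "cosine": F.cosine_similarity(v, fvs[(task, tgt, layer)], dim=0).item(),
                # source-template FV evaluated on target-template inputs
                "transfer_acc": steered_accuracy(
                    v, layer, ALPHA, EVAL_SETS[(task, tgt)], generate_with_steering),
            })
df = pd.DataFrame(rows)   # 3 tasks x 16 layers x 56 pairs = 2,688 rows per model
```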
The aggregate Pearson correlation: r = −0.572 in Base, r = −0.361 in Instruct. Both comfortably past p < 10⁻⁸³.
At face value, this says: the more geometrically similar two steering vectors are, the worse they transfer. Higher alignment predicts worse outcomes. That's a striking and counterintuitive result. The kind of thing that gets you a paper title like "Geometric Alignment Is Antagonistic to Behavioral Transfer" and a clean narrative about how surface-level similarity masks deeper incompatibility.
I almost wrote that paper.
Then I decomposed it by task.
The Simpson's Paradox
| Task | Base within-task r | Instruct within-task r |
| --- | --- | --- |
| Antonym | −0.152 | +0.263 |
| Capitalize | −0.099 | +0.010 |
| Sentiment Flip | +0.253 | +0.066 |
| Aggregate | −0.572 | −0.361 |
The aggregate is strongly negative. Every within-task correlation is either negligibly weak or positive. This is a textbook Simpson's paradox—the trend in the combined data reverses within every meaningful subgroup.
To make sure this wasn't hiding some subtler confound, I went further and analyzed all 48 task×layer subgroups (3 tasks × 16 layers, each containing 56 transfer pairs). Of 48 cells, only 1 shows a significant negative correlation. Nine show significant positive correlations. The remaining 38 are not significant. The aggregate r = −0.572 dissolves into near-zero or reversed relationships within every meaningful subgroup.
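With the pair table in a dataframe, the whole decomposition is a few lines of pandas (same illustrative column names as the sketch above):

```python
import pandas as pd
from scipy.stats import pearsonr

def r_and_p(group):
    r, p = pearsonr(group["cosine"], group["transfer_acc"])
    return pd.Series({"r": r, "p": p})

pooled_r, pooled_p = pearsonr(df["cosine"], df["transfer_acc"])  # the dramatic aggregate number
per_task = df.groupby("task").apply(r_and_p)                     # 3 rows: near zero or positive
per_cell = df.groupby(["task", "layer"]).apply(r_and_p)          # 48 rows of 56 pairs each
```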
Here's how it works. The three tasks occupy different regions of cosine-accuracy space: capitalize and sentiment flip form a high-cosine, near-zero-accuracy cluster, while antonym forms a lower-cosine, moderate-accuracy cluster.
Combine these two clusters and the regression line tilts negative—high cosine goes with zero accuracy, low cosine goes with moderate accuracy. But the line isn't describing a real relationship. It's just connecting two groups that differ on an omitted variable: whether the extraction method produced a functional vector at all.
97.7% of "Dissociation Cases" Are From Broken Tasks
I defined a "dissociation case" as any pair where cosine similarity exceeds 0.80 but transfer accuracy falls below 0.40. The idea being: these are cases where geometry says "these vectors are basically the same" but behavior says "nope."
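In code, a dissociation case is just a boolean mask over the same pair table (thresholds as defined above):

```python
HIGH_COS, LOW_ACC = 0.80, 0.40    # "basically the same vector" vs. "doesn't transfer"

dissoc = df[(df["cosine"] > HIGH_COS) & (df["transfer_acc"] < LOW_ACC)]
by_task = dissoc["task"].value_counts()   # where the alarming cases actually come from
```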
The Base model has 438 such cases. Sounds alarming. Then I checked which tasks they come from.
| Task | Dissociation Cases (Base) | % of total |
| --- | --- | --- |
| Sentiment Flip | 250 | 57.1% |
| Capitalize | 178 | 40.6% |
| Antonym | 10 | 2.3% |
97.7% of all "geometry–behavior dissociations" come from tasks where there is no behavior to dissociate from. The vectors are geometrically aligned, sure. They're also both non-functional. Two empty treasure chests that happen to look alike aren't evidence that treasure chests are unreliable—they're evidence that you dug in the wrong place.
Only 10 pairs—2.3%—represent genuine transfer failures where a functional FV failed to generalize. That's the real dissociation signal, buried under 428 cases of noise from broken tasks.
This also quantifies cleanly across cosine bins. In the Base model, transfer failure rate increases monotonically with cosine similarity: 33% failure at cosine < 0.1, but 98% failure at cosine > 0.8. That gives cosine an AUROC of 0.855 for predicting transfer failure—an excellent discriminator, but in the opposite direction from naive expectation. And it's entirely explained by the Simpson's paradox: high-cosine bins are dominated by capitalize and sentiment flip (which always fail), low-cosine bins are dominated by antonym (which often succeeds). Cosine isn't detecting within-task geometry–behavior dissociation. It's serving as a noisy proxy for task identity.
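Concretely, the AUROC treats "this pair failed to transfer" as the label and cosine as the score; the sketch below uses the same 0.40 accuracy cutoff as the dissociation definition (take it as illustrative):

```python
from sklearn.metrics import roc_auc_score

# label: did this pair fail to transfer?
fail = (df["transfer_acc"] < 0.40).astype(int)
auroc_pooled = roc_auc_score(fail, df["cosine"])      # ~0.855 for Base, pooled over tasks

# Recompute within one task to see how much discrimination survives
# once task identity is held fixed.
antonym = df[df["task"] == "antonym"]
auroc_antonym = roc_auc_score(
    (antonym["transfer_acc"] < 0.40).astype(int), antonym["cosine"])
```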
The Mechanistic Analysis I Couldn't Do
This is where the broken FVs didn't just mislead me statistically—they blocked the entire mechanistic investigation pipeline I had planned. I had set up activation patching to identify peak recovery layers for each task, component patching (residual stream, attention, MLP) to find which architectural pathways carry the task signal, and a full RLHF comparison of recovery rates across model variants. The problem: activation patching requires a functional "clean run" to patch from. The recovery metric is:
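(Writing it in the standard normalized form, with generic notation; a sketch of the definition rather than the literal implementation.)

$$\text{recovery} \;=\; \frac{\text{acc}_{\text{patched}} - \text{acc}_{\text{corrupted}}}{\text{acc}_{\text{clean}} - \text{acc}_{\text{corrupted}}}$$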
When that denominator is zero, the metric is undefined. For capitalize and sentiment flip, there is no "successful same-template run" to serve as the patching source, because the FVs never work in the first place. So two-thirds of my planned mechanistic analysis—peak recovery layers, component dominance patterns, RLHF recovery comparisons—was dead on arrival. The only task I could run patching on was antonym, where I did find peak recovery at layers 10–18 (Base) and 16–24 (Instruct), and near-perfect recovery (>95%) across all three components, suggesting redundant task encoding. But those are single-task observations that I can't generalize. This was a major motivation for the scaled-up study: I needed more working tasks before any mechanistic claims could stand.
Why Are the Broken Vectors So Geometrically Similar?
This part is actually instructive. When the mean-difference method fails to extract a genuine task signal, what ends up in the vector? Formatting artifacts. Template-level surface features. And here's the thing about formatting artifacts: they're more consistent across templates than actual task content.
Both capitalize and sentiment flip use prompts where the input words appear nearly verbatim across templates—only the instructional framing changes. At early layers, which encode surface-level token information, these structurally similar prompts produce similar mean-difference vectors. The vectors are aligned not because they capture a shared abstraction, but because the templates themselves are lexically correlated at the input level.
In other words: null vectors point in similar directions because there's nothing task-specific to pull them apart. High cosine similarity is, paradoxically, a signature of extraction failure—not a signature of robust encoding.
A PCA sanity check confirms this interpretation. For each task–layer combination, the first principal component across the 8 template FVs explains only 30–51% of the variance (mean: 40% Base, 31% Instruct)—far from a clean rank-1 task signal. Template projections onto PC1 vary by orders of magnitude (e.g., Base antonym at L16: Template H projects at 11.56 while Template C projects at 0.02). The task subspace is at least rank-3. There is no "universal task vector" hiding in the geometry—the relevant distinction is task-level FV quality, not a geometric decomposition within a task.
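The check is tiny: stack the 8 template FVs for a task–layer cell, center, and look at the singular values. A NumPy sketch (array shapes are illustrative):

```python
import numpy as np

def pc1_stats(template_fvs):
    """template_fvs: (8, d_model) array of the template FVs for one task-layer cell."""
    X = template_fvs - template_fvs.mean(axis=0, keepdims=True)
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    explained = (s[0] ** 2) / (s ** 2).sum()   # fraction of variance on PC1
    projections = template_fvs @ Vt[0]         # per-template projection onto PC1
    return explained, projections
```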
RLHF Changes the Distribution, Not the Mechanism
The Instruct model has only 70 dissociation cases compared to the Base model's 438. A 6× reduction. Looks like instruction tuning helps, right?
| Model | Pairs with cos > 0.80 | Low-accuracy cases | Conditional rate |
| --- | --- | --- | --- |
| Base | 448 | 438 | 97.7% |
| Instruct | 72 | 70 | 97.2% |
The conditional failure rate—given that two vectors have high cosine similarity, the probability that transfer fails—is virtually identical: 97.7% vs. 97.2%. The 6× gap comes entirely from the Instruct model having fewer high-cosine pairs in the first place (mean cosine drops from 0.603 to 0.435). RLHF doesn't make high-cosine vectors more likely to transfer. It just produces fewer of them.
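With the pair table, both numbers in that comparison are one line each (same illustrative thresholds as before):

```python
high_cos = df[df["cosine"] > 0.80]
n_high = len(high_cos)                                # 448 (Base) vs. 72 (Instruct)
cond_fail = (high_cos["transfer_acc"] < 0.40).mean()  # ~0.977 vs. ~0.972
```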
This is a distributional shift, not a mechanistic improvement. If you're relying on instruction tuning to make your steering vectors more robust, you should know that it's changing the surface statistics without touching the underlying failure mode.
Why Capitalize and Sentiment Flip Fail (And What That Tells Us)
The extraction failures aren't random—they're informative about where the linear representation hypothesis holds.
Capitalize is a character-level operation. Transformers operate on subword tokens, not characters. The mapping from "hello" to "HELLO" isn't a simple vocabulary lookup—it depends on byte-pair encoding in ways that a single additive direction in activation space may not capture. The mean-difference method assumes tasks are encoded as linear directions. For character-level manipulation, that assumption appears to break.
Sentiment flip requires coordinated changes across multiple components: identify the evaluative word, find its antonym, preserve the surrounding context. This is compositional in a way that may not decompose into a single vector. The near-zero IID accuracy across both model variants (0.02 Base, 0.06 Instruct) suggests that sentiment reversal, despite appearing simple, is not well-approximated by linear steering in Llama-3.1-8B regardless of whether the model has undergone instruction tuning.
Antonym is pure semantic retrieval—accessing a stored lexical relation. This is exactly the kind of task the linear representation hypothesis was designed for, and it works.
The pattern is clear: the mean-difference method succeeds for tasks that align with the linear representation hypothesis (semantic retrieval) and fails for tasks that don't (character-level manipulation, compositional rewriting). This isn't surprising in retrospect. But it has consequences that ripple through every aggregate statistic computed across these tasks.
The Methodological Lesson
Three recommendations that I think should be standard practice for steering evaluations:
Check IID accuracy first. Before computing any cross-template statistic, confirm that your steering vectors work in-distribution. If IID accuracy is near zero, everything downstream—transfer rates, dissociation counts, correlation analyses—is noise about nothing. I'd suggest IID accuracy > 0.50 as a minimum threshold for including a task in cross-template analyses.
Always report per-task statistics. Aggregate correlations across tasks can generate Simpson's paradoxes. Every correlation analysis involving function vectors should be reported both pooled and per-task, with explicit attention to between-task variation. In our data, a single aggregate r = −0.572 dissolved into 48 subgroup analyses where only 1 showed a significant negative effect.
Treat cosine similarity with appropriate suspicion. High cosine similarity between FVs from different templates is not evidence that they encode the same function. In our data, the highest cosine similarities come from the tasks where extraction fails completely. Cosine predicts failure with AUROC = 0.855—but only because it's proxying for task identity, not because it's detecting within-task geometry–behavior relationships.
What's Next
This study covers 3 tasks on a single model family. The natural question is: what happens when you scale up? More tasks, more models, more model families. Does the Simpson's paradox dissolve with a richer task battery? Do the extraction failures persist across architectures? And what can we learn from the tasks that do work about the mechanism of FV steering?
I've been running that scaled-up study—12 tasks, 6 models, 3 architecture families, 4,032 transfer pairs—and the results are... well, they answered every one of those questions, and raised several new ones I hadn't anticipated. That's coming soon.