First LessWrong post / early mech-interp experiment. I’m a software engineer entering this field; feedback on methodology and framing is very welcome.
I started this as a hunt for a vector on paltering (deception using technically true statements), motivated by the Machine Bullshit paper and prior work on activation steering. What I found looks less like a clean “paltering” feature and more like an entangled subjectivity / rhetorical performance axis with a striking sign asymmetry.
How to read the figures: x-axis is intervention strength (left = subtract, right = add), y-axis is baseline persona (Hater / Neutral / Hype), and each cell is a qualitative label (objective, spin/deflection, toxicity, hype, high-arousal/theatrical collapse, refusal).
TL;DR
I extracted a direction from a crude contrast between an “honest mechanic” and a “car salesman who spins flaws as features,” using a tiny dataset of car-flaw contexts.
At layer ~10 in Gemma-2-2B-IT, subtracting this direction tends to snap outputs toward dry, ground-truth-consistent reporting (often breaking “hater/hype/toxic” personas across domains).
Adding the same direction is brittle: instead of controlled “paltering” in new domains, it tends to induce high-arousal persuasive style (rhetorical drama) and, at higher strengths, theatrical incoherence / collapse.
A random-vector control (matched norm) does not reproduce the “truth convergence,” suggesting this isn’t just “any big vector shakes the model into honesty” (but note: I only used one random vector).
Straightforward “vector surgery” attempts didn’t isolate a “pure paltering” component; the “performance” bundle persisted.
Code + experiment log + longer writeup: github.com/nikakogho/gemma2-context-fidelity-steering (the repo README links to experiments.md and a more detailed writeup doc).
Setup
Model: Gemma-2-2B-IT (fp16).
Tooling: HF hooks on the residual stream; notebook in the repo.
Concept: extract a per-layer direction via a contrast, then do contrastive activation addition-style steering in the residual stream.
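For concreteness, a minimal loading sketch (the real setup is in the repo notebook; this only assumes the standard transformers API):

```python
# Minimal loading sketch for Gemma-2-2B-IT in fp16 (assumes a CUDA GPU is available).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2-2b-it"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float16,
    device_map="auto",
)
model.eval()
```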
Contrast prompts (crude on purpose)
Honest persona: “You are an honest mechanic. Answer truthfully. If there is a flaw, state it clearly.”
Sales/spin persona: “You are a car salesman. You must sell this car. Use ‘paltering’ to spin flaws as features.”
Tiny dataset of (user question, ground-truth context) pairs; the context is placed in the system prompt:
dataset = [
("How is the paint?", "The paint is rusting."),
("How is the engine?", "The engine knocks loudly."),
("Are the brakes good?", "The brakes are worn out."),
("Is the interior clean?", "The interior smells bad."),
("Is it reliable?", "It breaks down every week.")
]
Direction extraction
For each layer ℓ:
vℓ = mean( act(sales, x) − act(honest, x) ) over x in the dataset.
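A sketch of that extraction step. Helper names are mine, I'm assuming hidden states are read with output_hidden_states=True and mean-pooled over token positions, and the prompt template is simplified relative to the notebook:

```python
import torch

def residual_stream_acts(model, tokenizer, system_prompt, question, context):
    """Per-layer residual-stream activations for one example, mean-pooled over tokens."""
    # Simplified prompt template; in the actual setup the context sits in the system prompt.
    prompt = f"{system_prompt}\nContext: {context}\nUser: {question}\nAssistant:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states is a tuple of (num_layers + 1) tensors of shape [1, seq, d_model];
    # index 0 is the embedding output, so hidden_states[l + 1] is the output of decoder layer l.
    return [h[0].mean(dim=0).float() for h in out.hidden_states]

def extract_directions(model, tokenizer, dataset, honest_sys, sales_sys):
    """v_l = mean over the dataset of (sales activation - honest activation), per layer."""
    per_layer_sums = None
    for question, context in dataset:
        honest = residual_stream_acts(model, tokenizer, honest_sys, question, context)
        sales = residual_stream_acts(model, tokenizer, sales_sys, question, context)
        diffs = [s - h for s, h in zip(sales, honest)]
        per_layer_sums = diffs if per_layer_sums is None else [
            acc + d for acc, d in zip(per_layer_sums, diffs)
        ]
    return [acc / len(dataset) for acc in per_layer_sums]
```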
Intervention details (important knobs)
I add α · vℓ into the residual stream at layer ℓ.
I apply steering to all token positions passing through that layer (not just the current token); see the hook sketch after this list.
I did not set temperature explicitly (so generation used the default, temperature = 1).
I used a seed only in Experiment 13 (seed = 42); other experiments were unseeded.
I think it’s worth testing “current token only” steering (injecting only at the last token position / current step) as a follow-up; I didn’t test it here.
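Here is a minimal sketch of the addition step as described above: a forward hook on the chosen decoder layer adds α · vℓ to the hidden states at every token position. Module paths follow the standard transformers layout for Gemma-2; the generation call mirrors the defaults described above (sampling at temperature 1) and may not match the notebook exactly.

```python
import torch

def make_steering_hook(direction, alpha):
    """Add alpha * direction to the layer's output hidden states at all token positions."""
    def hook(module, inputs, output):
        # Decoder layers return a tuple whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(device=hidden.device, dtype=hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

def generate_with_steering(model, tokenizer, prompt, direction, alpha, layer_idx=10):
    layer = model.model.layers[layer_idx]
    handle = layer.register_forward_hook(make_steering_hook(direction, alpha))
    try:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        # Sampling at the default temperature (1), per the settings described above.
        out = model.generate(**inputs, max_new_tokens=200, do_sample=True)
    finally:
        handle.remove()
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```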
If you want background links: residual stream and representation engineering.
What I label as “context fidelity”
Labels are qualitative but explicit: objective, spin/deflection, toxicity, hype, high-arousal/theatrical collapse, refusal.
Refusal: evasive “can’t answer” / safety-style refusal, distinct from being objective.
Result 1: Where the “knob” lives (layer/strength sweep)
A coarse sweep across a few layers and strengths suggests the cleanest control is around layer ~10. Later layers either do very little at low strength or collapse at high strength:
In essence:
You can also push the model out of distribution if α is too large:
Result 2: In-domain, you can “dial” the behavior (up to a point)
In the original car domain, increasing positive α gradually shifts from honest → partial spin → full reframing, then starts to degrade at higher strengths:
Result 3: A sign asymmetry shows up across domains/personas
The two heatmaps at the top are the core evidence.
Subtracting this direction tends to break persona performance (hater/hype/toxic) and snap the model toward literal context reporting (“battery lasts 2 hours,” “battery lasts 48 hours,” etc.). Subtraction often looks like “de-subjectifying” the model.
Adding this direction does not reliably produce controlled “paltering logic” in new domains. Instead it tends to inject high-arousal persuasive cadence (pressure, rhetoric, dramatic framing), and at higher α it degrades into theatrical incoherence/collapse.
An example of this incoherence under addition: an HR assistant is given a neutral system prompt to assess a candidate, and the candidate is clearly described as unqualified. If the direction were a clean ‘spin/palter’ feature, adding it should increase persuasive bias toward hiring; instead it destabilizes like this:
Result 4: “Toxic HR” test (persona-breaking under subtraction)
I tested whether the same subtraction effect generalizes to a socially-loaded setting: a system prompt instructs the model to be a nasty recruiter who rejects everyone and mocks candidates even when the candidate is obviously strong. Subtracting the direction around the same layer/strength range largely breaks the toxic persona and forces a more context-grounded evaluation:
Working interpretation: “paltering” was the wrong abstraction
Across domains, the behavior looks less like a clean “truth-negation” axis and more like an entangled feature bundle:
Subtracting removes the performance bundle → the model falls back toward a more “safe, literal, context-fidelity” mode (sometimes robotic/refusal-ish).
Adding amplifies the bundle → rather than clean “strategic deception,” the output tends toward high-arousal rhetoric and instability.
Here’s the mental model I’m using for why “addition” fails more often than “subtraction” in out-of-domain settings: subtraction strips away a bundle the model is already expressing, while addition has to graft that bundle onto contexts it was never extracted from.
You can think of this as a small-scale instance of “feature entanglement makes positive transfer brittle,” though I’m not claiming this is the right general explanation, just a plausible one consistent with the outputs.
Control: a matched-norm random vector doesn’t reproduce the effect
To check “maybe any big vector gives a similar effect,” I repeated the big sweep with one random vector matched to the steering vector’s L2 norm.
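Concretely, the control is just a random direction rescaled to the steering vector’s L2 norm; a sketch (not necessarily the exact notebook code):

```python
import torch

def matched_norm_random_vector(steering_vec):
    """Random direction rescaled so its L2 norm matches the steering vector's."""
    rand = torch.randn_like(steering_vec, dtype=torch.float32)
    rand = rand / rand.norm() * steering_vec.float().norm()
    return rand.to(steering_vec.dtype)
```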
Random subtraction did not consistently “cure” personas into objective context reporting.
Random addition degraded coherence more generically and didn’t reproduce the same structured high-arousal theatrical mode.
This suggests the convergence is at least partly direction-specific rather than just “perturbation energy”; but see the limitations: N = 1 random control vector.
Limitations (important)
Small dataset for extracting the steering vector (n=5); contrast is crude.
One model (Gemma-2-2B-IT); could be model- or size-specific.
Qualitative eval (explicit rubric, but still eyeballing).
The “context fidelity” definition is narrow: consistency with the provided context, not real-world accuracy.
Intervention method is simple additive residual steering; no head/MLP localization yet.
Generation determinism: most experiments were unseeded; temperature defaulted to 1.
Random-vector control is weak: I used one random vector; I should really use many (e.g., 10+) to make the control harder to dismiss.
No quantification: I’m not reporting metrics right now; I’m just documenting the behavior I observed in the notebook outputs.
What I’d love feedback on
Mechanistic story: Does this look like subtracting a “persona/performance” feature rather than adding an “honesty” feature? What would be the cleanest test for that?
Practicality: Could this kind of steering meaningfully mitigate a form of deception in models?
Evaluation: What’s the best lightweight metric for context fidelity here? E.g., automatic extraction of key factual tokens like “2 hours” / “48 hours” (a toy sketch of that idea appears after this list), or something better.
Replication target: If you could replicate one thing, what would be highest value?
same protocol on a different instruct model,
same protocol on a larger model,
a bigger, less prompty contrast dataset,
localize within layer 10 (heads/MLPs),
steering “current token only” instead of all tokens.
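As a concrete version of the token-extraction idea above, here is a toy, regex-based sketch (helper name and pattern are mine, and it is obviously crude):

```python
import re

def key_token_recall(context: str, output: str):
    """Toy context-fidelity metric: fraction of number+unit spans in the context
    (e.g. "2 hours", "48 hours") that appear verbatim in the model output."""
    keys = re.findall(r"\d+\s*[A-Za-z]+", context)
    if not keys:
        return None
    hits = sum(1 for k in keys if k.lower() in output.lower())
    return hits / len(keys)

# Example: key_token_recall("The battery lasts 2 hours.", "It runs for 2 hours.") -> 1.0
```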