Localizing Sycophancy to Layers 24-27 in Llama 3.1 8B Using Web-Mined Reddit Rhetoric

Omar Sheta


6th Jun 2026



Motivation

Sycophancy research has mostly been behavioral: show that models flip answers under social pressure, then propose training fixes. That framing treats the problem as an input-output black box. I wanted to go one level deeper and ask where inside the model the flip actually happens. If we can identify the specific layer range where a model abandons a factual answer in favor of what the user implies is true, that gives a mechanistically grounded target for interventions like activation steering or layer-specific reweighting, rather than another round of fine-tuning that patches outputs without touching the underlying computation.

What We Did

We scraped Reddit posts from communities actively amplifying contested health claims (r/conspiracy, r/rawmilk, r/homeopathy) and used them as adversarial prompts against six health topics. Each Reddit post was paired with a length-matched Wikipedia-style neutral control to isolate social pressure from prompt length effects. We also ran researcher-written synthetic prompts for the same claims as a primary condition.
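A minimal sketch of how a length-match check on a post/control pair might look. It assumes whitespace word counts as a stand-in for tokenizer counts and a hypothetical 10% tolerance; neither is taken from the actual pipeline.

```python
def word_count(text: str) -> int:
    """Whitespace word count (assumption: a proxy for model token count)."""
    return len(text.split())


def relative_length_gap(post: str, control: str) -> float:
    """Absolute word-count difference, relative to the longer text."""
    a, b = word_count(post), word_count(control)
    return abs(a - b) / max(a, b, 1)


def is_length_matched(post: str, control: str, tol: float = 0.10) -> bool:
    """True if the two texts differ in length by at most `tol` (hypothetical)."""
    return relative_length_gap(post, control) <= tol
```

Pairs that fail the check would be re-drafted or dropped before any activations are cached, so that prompt length cannot explain a difference between conditions.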

The model was Llama 3.1 8B Instruct, loaded via TransformerLens. For each prompt pair we cached the residual stream at each of the model's 32 layers and applied two analyses:

Logit lens: tracked P(Yes), P(No), P(Uncertain) at every layer to find the first layer where answer probability diverges between biased and neutral conditions (divergence layer l*)
Residual-stream cosine similarity: compared biased vs. bare activations layer by layer, with the length-neutral control as a confound check
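The two analyses above can be sketched in plain Python on already-extracted numbers. This is an illustration under assumptions: the inputs are per-layer answer-token logits and per-layer residual vectors that have already been pulled out of the cache, and probabilities are renormalized over the three answer tokens only (the full-vocabulary softmax is an equally plausible choice).

```python
import math

ANSWERS = ("Yes", "No", "Uncertain")


def answer_probs(answer_logits):
    """Softmax over the three answer-token logits at one layer."""
    m = max(answer_logits)
    exps = [math.exp(x - m) for x in answer_logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(ANSWERS, exps)}


def cosine(u, v):
    """Cosine similarity between two residual-stream vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def layerwise_cosine(resid_a, resid_b):
    """One similarity value per layer for two runs of equal depth."""
    return [cosine(u, v) for u, v in zip(resid_a, resid_b)]
```

Running `layerwise_cosine` once for biased vs. bare and once for length-neutral vs. bare gives the two curves that the confound check compares.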

The divergence layer l* is defined as the first layer at which the answer probabilities under the biased and neutral prompts diverge.
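One way to write this down, consistent with the logit-lens description above and assuming a fixed threshold $\tau$ and a single tracked answer $a$ (neither is specified here):

$$l^* = \min \{\, l : |P_l^{\text{biased}}(a) - P_l^{\text{neutral}}(a)| > \tau \,\}$$

The same rule as a small Python sketch. The default `tau=0.1` and the toy probabilities are placeholders, not values from the experiments.

```python
def divergence_layer(p_biased, p_neutral, tau=0.1):
    """Return the first layer index where the two probability curves
    differ by more than tau, or None if they never do.

    p_biased / p_neutral: per-layer probability of the tracked answer.
    tau: hypothetical threshold (an assumption of this sketch).
    """
    for layer, (pb, pn) in enumerate(zip(p_biased, p_neutral)):
        if abs(pb - pn) > tau:
            return layer
    return None


# Toy curves for a 32-layer model (made-up numbers for illustration).
toy_neutral = [0.9] * 32
toy_biased = [0.9] * 24 + [0.6, 0.4, 0.3, 0.2] + [0.2] * 4
toy_l_star = divergence_layer(toy_biased, toy_neutral)
```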

The six claims, split by intended regime:

Claim	Regime	Ground Truth
Raw milk	Flip	Pasteurized is safer
Ivermectin	Flip	No RCT benefit
Vaccines/autism	Flip	Thoroughly debunked
Cholesterol/heart disease	Non-flip	Complex/contested
Homeopathy	Non-flip	Placebo only
Statins	Non-flip	Debated

Flip claims were designed to elicit a sycophantic answer change. Non-flip claims served as controls: they should resist flipping even under social pressure.

What We Found

Across both conditions, the pattern is consistent. Flip claims show concentrated divergence in layers 24-27. Non-flip claims produce weaker or absent signals in the same zone. Layers 0-22 are near-blank for all six. The heatmap shows this directly: each row is a claim, each column is a layer. The rightmost columns light up for flip claims. Earlier layers are quiet.

The cosine similarity result reinforces this: flip cases drop below 0.93 from layer ~15 onward, while non-flip cases hold at 0.97+. The length-neutral control clusters at 0.97-0.99 with no separation, ruling out prompt length as a driver.

Verbatim Reddit posts replicated the same localization zone in 5/6 claims. All divergence layers fell within 24-27, which indicates the localization is not an artifact of researcher-designed prompts. One interesting finding: statins flipped under Reddit rhetoric but not under the synthetic prefix. Authentic community rhetoric was more sycophancy-eliciting than the controlled prompt for that claim.

Cross-model replication on Qwen2.5 7B Instruct showed 5/6 flips vs. Llama's 3/6, with the localization zone shifting proportionally later in the 28-layer architecture (~0.65 normalized depth vs. ~0.5 in Llama). This suggests the phenomenon is not specific to Llama.

Behavioral Taxonomy

Combining logit lens and LLM-as-judge analysis (Llama 3.3 70B, validated at Cohen's kappa = 1.00 against human labels), three distinct regimes emerged:

Full sycophancy: answer token flips and the full response confirms sycophancy (raw milk)
Framing drift: no answer flip, but the full response softens toward the user's framing (homeopathy, statins). Invisible to answer-token analysis alone.
Surface commitment: answer token flips but the full response maintains scientific consensus (ivermectin, vaccines/autism). Mechanistic and behavioral analyses diverge.
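The taxonomy above amounts to a two-by-two table over two binary signals. A small sketch, assuming the logit lens yields an `answer_flipped` flag and the judge yields a `response_sycophantic` flag; the fourth label, "no sycophancy", is my own name for the remaining cell.

```python
def classify_regime(answer_flipped: bool, response_sycophantic: bool) -> str:
    """Map the mechanistic flag and the judge flag to a regime name."""
    if answer_flipped and response_sycophantic:
        return "full sycophancy"
    if response_sycophantic:
        # No token flip, but the full response drifts toward the user.
        return "framing drift"
    if answer_flipped:
        # Token flips, but the full response keeps the consensus view.
        return "surface commitment"
    return "no sycophancy"  # label introduced for this sketch
```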

The third regime is a warning: answer-token flips are not a reliable proxy for sycophancy. You need both mechanistic and behavioral analysis.
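The judge validation mentioned above relies on Cohen's kappa, which corrects raw agreement for the agreement expected by chance. A minimal sketch of the standard formula, kappa = (p_o - p_e) / (1 - p_e), for two label lists of equal length; the example labels are placeholders, not the study's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("need two non-empty label lists of equal length")
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout: kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)
```

With only a handful of items, kappa = 1.00 simply means the judge and the human never disagreed, so it is worth reporting the item count next to it.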

What Didn't Work

We trained a per-layer logistic regression probe to classify biased vs. neutral activations using leave-one-claim-out cross-validation. It reached a deceptive 100% accuracy at layer 1, long before any semantic token drift or answer changes occurred.

Isolating length-matched neutral vs. biased samples revealed the probe was separating rhetorical register (personal anecdote vs. encyclopedic prose), not anything sycophancy-specific. We exclude probe accuracy from all localization claims. This is a cautionary note for interpretability work on small, text-heavy datasets: perfect probe accuracy is not always a signal worth celebrating.
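To make the failure mode concrete, here is a sketch of leave-one-claim-out evaluation on toy data. A nearest-centroid classifier is swapped in for the logistic regression probe to keep the example dependency-free (scikit-learn's `LeaveOneGroupOut` does the same splitting). The toy features are made up: feature 0 plays the role of a "register" cue that perfectly tracks the label.

```python
def leave_one_claim_out(claim_ids):
    """Yield (held_out_claim, train_indices, test_indices) per claim."""
    for held_out in sorted(set(claim_ids)):
        train = [i for i, c in enumerate(claim_ids) if c != held_out]
        test = [i for i, c in enumerate(claim_ids) if c == held_out]
        yield held_out, train, test


def centroid(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def predict(x, c0, c1):
    """Nearest-centroid label (stand-in for the logistic probe)."""
    d0 = sum((a - b) ** 2 for a, b in zip(x, c0))
    d1 = sum((a - b) ** 2 for a, b in zip(x, c1))
    return 0 if d0 <= d1 else 1


def loco_accuracy(X, y, claim_ids):
    """Accuracy pooled over all held-out claims."""
    correct = 0
    for _, train, test in leave_one_claim_out(claim_ids):
        c0 = centroid([X[i] for i in train if y[i] == 0])
        c1 = centroid([X[i] for i in train if y[i] == 1])
        correct += sum(predict(X[i], c0, c1) == y[i] for i in test)
    return correct / len(X)


# Toy data: label 1 = biased prompt, label 0 = neutral prompt.
X = [[1.0, 0.2], [0.0, 0.1], [1.0, 0.3], [0.0, 0.4], [1.0, 0.0], [0.0, 0.2]]
y = [1, 0, 1, 0, 1, 0]
claims = ["a", "a", "b", "b", "c", "c"]
toy_accuracy = loco_accuracy(X, y, claims)
```

The probe scores perfectly on held-out claims while using only the register cue, which is why held-out accuracy alone says nothing about whether a sycophancy-specific feature was found.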

The Open Question

This study identifies where the sycophantic shift appears. It does not establish whether intervening there would causally suppress the behavior. We know the residual stream state changes in layers 24-27, but it is still possible the sycophantic response is downstream of features established earlier.

What would settle this: targeted activation steering vectors applied within the 24-27 block, testing whether they can neutralize real-world rhetoric-driven shifts without degrading general reasoning. If the localization is causally load-bearing, it gives a precise target. If it is not, the story gets more complicated and more interesting.
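A rough sketch of what that test could look like, on plain lists rather than a live model. It assumes a mean-difference steering vector (neutral minus biased activations) added with a hypothetical strength `alpha` at layers 24-27 only; in TransformerLens the same edit would run inside forward hooks on the residual stream, which is not shown here.

```python
STEER_LAYERS = range(24, 28)  # layers 24-27, the zone identified above


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def steering_vector(biased_acts, neutral_acts):
    """Direction pointing from the biased mean toward the neutral mean."""
    mb, mn = mean_vector(biased_acts), mean_vector(neutral_acts)
    return [n - b for b, n in zip(mb, mn)]


def steer(resid, vector, alpha):
    """Add alpha * vector to one residual-stream vector."""
    return [r + alpha * v for r, v in zip(resid, vector)]


def steer_layers(resid_by_layer, vector, alpha=1.0, layers=STEER_LAYERS):
    """Apply the steering vector only at the chosen layers."""
    return [
        steer(r, vector, alpha) if layer in layers else list(r)
        for layer, r in enumerate(resid_by_layer)
    ]
```

The causal question is then whether the steered run restores the neutral answer on Reddit-prefixed prompts while leaving accuracy on unrelated prompts unchanged.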

Happy to share the TransformerLens code or the Reddit scraping pipeline if useful.
