Motivation
Sycophancy research has mostly been behavioral: show that models flip answers under social pressure, then propose training fixes. That framing treats the problem as an input-output black box. I wanted to go one level deeper and ask where inside the model the flip actually happens. If we can identify the specific layer range where a model abandons a factual answer in favor of what the user implies is true, that gives a mechanistically grounded target for interventions like activation steering or layer-specific reweighting, rather than another round of fine-tuning that patches outputs without touching the underlying computation.
What We Did
We scraped Reddit posts from communities actively amplifying contested health claims (r/conspiracy, r/rawmilk, r/homeopathy) and used them as adversarial prompts on six health topics. Each Reddit post was paired with a length-matched, Wikipedia-style neutral control to separate social pressure from prompt-length effects. We also ran researcher-written synthetic prompts for the same claims as a primary condition.
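The model was Llama 3.1 8B Instruct, loaded via TransformerLens. For each prompt pair we cached all 32 residual streams and applied two analyses: a logit lens, tracking how the answer-token probability differs between the biased and bare prompts at each layer (ΔP = P_biased - P_bare), and the per-layer cosine similarity between the paired residual streams.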
The divergence layer l* is read from the P(No) channel of the logit lens.
The six claims, split by intended regime:
| Claim | Regime | Ground Truth |
| --- | --- | --- |
| Raw milk | Flip | Pasteurized is safer |
| Ivermectin | Flip | No RCT benefit |
| Vaccines/autism | Flip | Thoroughly debunked |
| Cholesterol/heart disease | Non-flip | Complex/contested |
| Homeopathy | Non-flip | Placebo only |
| Statins | Non-flip | Debated |
Flip claims were designed to elicit a sycophantic answer change. Non-flip claims served as controls: they should resist flipping even under social pressure.
What We Found
Across both conditions, the pattern is consistent. Flip claims show concentrated divergence in layers 24-27. Non-flip claims produce weaker or absent signals in the same zone. Layers 0-22 are near-blank for all six. The heatmap shows this directly: each row is a claim, each column is a layer. The rightmost columns light up for flip claims. Earlier layers are quiet.
Figure 1: Logit lens divergence heatmap (ΔP = P_biased - P_bare, synthetic condition). Red = increase, blue = decrease. Layers 0-22 show near-zero divergence. The sycophancy zone is layers 24-27.
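To make the heatmap quantity concrete, here is a minimal sketch of one heatmap row, assuming you already have a per-layer answer-token probability for the biased and bare prompts. The function names and the toy numbers are illustrative, and the thresholding rule in `first_divergent_layer` (with its 0.1 cutoff) is a hypothetical stand-in, not the definition of l* used for the figures.

```python
from typing import List, Optional, Sequence


def delta_p(p_biased: Sequence[float], p_bare: Sequence[float]) -> List[float]:
    """Per-layer divergence: ΔP[l] = P_biased[l] - P_bare[l]."""
    if len(p_biased) != len(p_bare):
        raise ValueError("both runs must cover the same number of layers")
    return [b - a for b, a in zip(p_biased, p_bare)]


def first_divergent_layer(delta: Sequence[float],
                          threshold: float = 0.1) -> Optional[int]:
    """Hypothetical rule: first layer where |ΔP| exceeds `threshold`.

    This is an assumption for illustration, not the post's definition of l*.
    """
    for layer, d in enumerate(delta):
        if abs(d) > threshold:
            return layer
    return None


# Toy 32-layer example: flat until layer 24, then the biased run pulls away.
p_bare = [0.50] * 32
p_biased = [0.50] * 24 + [0.70, 0.80, 0.85, 0.90] + [0.90] * 4
row = delta_p(p_biased, p_bare)
print(first_divergent_layer(row))  # prints 24
```

Stacking one such row per claim gives the claim-by-layer grid that the heatmap visualizes.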
The cosine similarity result reinforces this: flip cases drop below 0.93 from layer ~15 onward, while non-flip cases hold at 0.97+. The length-neutral control clusters at 0.97-0.99 with no separation, ruling out prompt length as a driver.
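The cosine comparison can be sketched in a few lines, assuming each run is reduced to one residual vector per layer (for example at the final token position, which is an assumption here rather than something stated above). The helper names and the two-dimensional toy vectors are illustrative only.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def layerwise_cosine(resid_a: Sequence[Sequence[float]],
                     resid_b: Sequence[Sequence[float]]) -> List[float]:
    """One similarity per layer between two runs' residual vectors."""
    if len(resid_a) != len(resid_b):
        raise ValueError("both runs must cover the same number of layers")
    return [cosine(a, b) for a, b in zip(resid_a, resid_b)]


# Toy 3-layer example: the biased run drifts away from the bare run with depth.
bare = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
biased = [[1.0, 0.0], [1.0, 0.1], [1.0, 1.0]]
sims = layerwise_cosine(biased, bare)
print([round(s, 3) for s in sims])  # prints [1.0, 0.995, 0.707]
```

Running the same function on the length-matched neutral control is what lets a drop in similarity be attributed to the biased framing rather than to prompt length.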
Verbatim Reddit posts replicated the same localization zone in 5/6 claims. Every divergence layer we detected fell within 24-27, which suggests the localization is not an artifact of researcher-designed prompts. One interesting finding: statins flipped under Reddit rhetoric but not under the synthetic prefix. For that claim, authentic community rhetoric was more sycophancy-eliciting than the controlled prompt.
Figure 2: Divergence layer (l* on P(No) channel) across synthetic and Reddit conditions. All bars land within the 24-27 localization zone.
Cross-model replication on Qwen2.5 7B Instruct showed 5/6 flips vs. Llama's 3/6, with the localization zone shifting proportionally later in the 28-layer architecture (~0.65 normalized depth vs. ~0.5 in Llama). With two model families tested, this suggests the phenomenon is not specific to one architecture.
Behavioral Taxonomy
Combining logit lens and LLM-as-judge analysis (Llama 3.3 70B, validated at Cohen's kappa = 1.00 against human labels), three distinct regimes emerged:
The third regime is a warning: answer-token flips are not a reliable proxy for sycophancy. You need both mechanistic and behavioral analysis.
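For readers who want to reproduce the judge validation, Cohen's kappa compares observed agreement between two raters with the agreement expected by chance. The sketch below is a plain-Python version of the standard formula; the label strings and toy data are hypothetical, not the labels used in this study.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable],
                 labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_observed - p_expected) / (1 - p_expected)."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one label")
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four responses.
judge = ["flip", "hold", "flip", "hold"]
human_same = ["flip", "hold", "flip", "hold"]
human_diff = ["flip", "hold", "hold", "hold"]
print(cohens_kappa(judge, human_same))  # prints 1.0
print(cohens_kappa(judge, human_diff))  # prints 0.5
```

A kappa of 1.00 means the judge and the human labels agreed on every item; on a small validation set that is easier to reach than on a large one, so the sample size matters when reading it.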
What Didn't Work
We trained a per-layer logistic regression probe to classify biased vs. neutral activations using leave-one-claim-out cross-validation. It reached a deceptive 100% accuracy at layer 1, long before any semantic token drift or answer changes occurred.
Isolating length-matched neutral vs. biased samples revealed the probe was separating rhetorical register (personal anecdote vs. encyclopedic prose), not anything sycophancy-specific. We exclude probe accuracy from all localization claims. This is a cautionary note for interpretability work on small, text-heavy datasets: perfect probe accuracy is not always a signal worth celebrating.
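For reference, leave-one-claim-out splitting can be sketched as below. This is a minimal illustration, assuming each sample carries a claim label; the function name and the claim strings are hypothetical. Note what it does and does not protect against: holding out a claim stops the probe from memorizing claim content, but it does nothing about a register difference that separates biased from neutral samples within every claim.

```python
from typing import Iterator, List, Sequence, Tuple


def leave_one_claim_out(
    claims: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_claim, train_indices, test_indices), one fold per claim."""
    for held_out in sorted(set(claims)):
        train = [i for i, c in enumerate(claims) if c != held_out]
        test = [i for i, c in enumerate(claims) if c == held_out]
        yield held_out, train, test


# Hypothetical sample-level claim labels (two samples per claim).
claims = ["raw_milk", "raw_milk", "ivermectin", "ivermectin",
          "statins", "statins"]
folds = list(leave_one_claim_out(claims))
for held_out, train, test in folds:
    print(held_out, train, test)
```

A probe would be fit on `train` and scored on `test` for each fold, repeated per layer. Comparing against a probe trained to separate register alone is one way to check whether high accuracy reflects anything beyond style.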
The Open Question
This study identifies where the sycophantic shift appears. It does not establish whether intervening there would causally suppress the behavior. We know the residual stream state changes in layers 24-27, but it is still possible the sycophantic response is downstream of features established earlier.
What would settle this: targeted activation steering vectors applied within the 24-27 block, testing whether they can neutralize real-world rhetoric-driven shifts without degrading general reasoning. If the localization is causally load-bearing, it gives a precise target. If it is not, the story gets more complicated and more interesting.
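As a rough picture of what such an experiment might look like, the sketch below builds a candidate steering direction as the mean difference between biased and bare activations and subtracts it only at layers 24-27. Everything here is an assumption for illustration: the mean-difference construction, the single shared direction across layers, the scale `alpha`, and the toy two-dimensional vectors. In a real model the same arithmetic would be applied inside forward hooks on the residual stream rather than on plain lists.

```python
from typing import Dict, Iterable, List, Sequence


def mean_difference(biased: Sequence[Sequence[float]],
                    bare: Sequence[Sequence[float]]) -> List[float]:
    """Candidate direction: mean(biased) - mean(bare), per dimension."""
    dim = len(biased[0])
    mean_biased = [sum(v[i] for v in biased) / len(biased) for i in range(dim)]
    mean_bare = [sum(v[i] for v in bare) / len(bare) for i in range(dim)]
    return [b - a for b, a in zip(mean_biased, mean_bare)]


def steer(resid_by_layer: Dict[int, List[float]],
          direction: Sequence[float],
          alpha: float,
          layers: Iterable[int] = range(24, 28)) -> Dict[int, List[float]]:
    """Return a copy with alpha * direction subtracted at the chosen layers."""
    target = set(layers)
    steered = {}
    for layer, vec in resid_by_layer.items():
        if layer in target:
            steered[layer] = [x - alpha * d for x, d in zip(vec, direction)]
        else:
            steered[layer] = list(vec)  # untouched outside the target block
    return steered


# Toy activations: the biased runs are shifted along the first dimension.
biased_acts = [[2.0, 0.0], [4.0, 0.0]]
bare_acts = [[1.0, 0.0], [1.0, 0.0]]
direction = mean_difference(biased_acts, bare_acts)

resid = {23: [1.0, 1.0], 24: [3.0, 1.0]}
print(steer(resid, direction, alpha=1.0))
```

The causal test is then whether answers under real Reddit rhetoric revert toward the bare-prompt answer when steering is on, and whether unrelated benchmarks stay flat.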
Happy to share the TransformerLens code or the Reddit scraping pipeline if useful.