TL;DR: This is the runnable specification for testing a prediction from a structural model of alignment failure [LINK: https://fusiondynamics.org/alignment-short/]: that sycophancy, deceptive alignment, reward hacking, and harmful compliance share one activation-level signature. The test uses existing interpretability probes on existing benchmarks. Timeline: 1-2 weeks. A clean null kills the weakest version of the claim. A positive result would be the first empirical evidence that these four failure modes are positions of one process.
Status: ready-to-run specification. Hand this to an ML graduate student, an interpretability researcher, or a collaborator responding to the cold-email outreach. It is the operational version of "alternative one" from a long-form paper found here: https://fusiondynamics.org/alignment-long/.
Goal: test whether sycophancy has an activation-level signature consistent with "a specific input has captured the reconciling position," and whether that signature generalizes across other alignment failure modes.
Time: 1-2 weeks of focused work for a competent ML graduate student.
What the result may tell us:
If the signature exists and generalizes: first-pass empirical evidence for the structural claim. Proceed to alternative two (the intervention version).
If the signature exists only for sycophancy: the claim that the four failure modes share one mechanism weakens. The sycophancy-specific structural read may still hold.
If no signature at all: the weakest version of the structural claim fails. Clean result. Author retracts the empirical piece and reconsiders the mechanism.
1. The hypothesis, operationalized
Loose form: "When a model produces a sycophantic response, its activations show a discriminative pattern that is interpretable as the user signal occupying a position that, in a non-sycophantic response, would be held by a reconciling structure. That same discriminative pattern, or a close variant, should also appear in other alignment failure modes (reward hacking, harmful compliance, alignment-faking-like behavior), because all four are positions of the same underlying collapse."
Operational form:
There exists a linear direction (or a small number of SAE features) in the residual stream that discriminates sycophantic from non-sycophantic completions on matched prompts.
That direction is NOT specific to sycophancy. Projections onto the same direction should be elevated on non-sycophancy failure-mode examples (reward hacking, harmful compliance) compared to matched successful-refusal or correct-response examples.
The shared component should explain a non-trivial share of variance across failure modes when you factor the activation patterns.
Each of these is independently testable. A result of "1 holds, 2 fails" is still informative: it narrows the claim from "one mechanism" to "sycophancy-specific signature."
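The three operational claims can be pinned down in a few lines of numpy. This is a sketch on synthetic activations with a planted direction, purely to make the quantities concrete; the width, sample counts, and shift sizes are illustrative assumptions, not predictions about any real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # synthetic residual-stream width, for illustration only

# Plant a hypothetical shared "failure" direction in fake activations.
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)

def make_split(n, shift):
    """n failure + n success activations; failures shifted along w_true."""
    X = rng.normal(size=(2 * n, d))
    X[:n] += shift * w_true
    y = np.array([1] * n + [0] * n)
    return X, y

# Claim 1: a linear direction separates sycophantic from matched completions.
X, y = make_split(200, shift=3.0)
w_hat = X[y == 1].mean(0) - X[y == 0].mean(0)   # difference-of-means probe
w_hat /= np.linalg.norm(w_hat)
proj = X @ w_hat
acc = ((proj > proj.mean()) == y).mean()        # balanced classes: midpoint threshold

# Claim 2: projections onto the SAME direction are elevated on a second
# failure mode (e.g. harmful compliance) relative to its matched successes.
X2, y2 = make_split(200, shift=1.5)
elevation = (X2[y2 == 1] @ w_hat).mean() - (X2[y2 == 0] @ w_hat).mean()
```

On the planted data the probe separates cleanly and the elevation is positive; on a real model, either number may simply fail to materialize, which is the point of the test.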
2. Models and benchmarks
Models (in order of preference, depending on access):
Primary: an open-weight model with released SAEs. Gemma 2 (9B or 27B) with Gemma Scope is the cleanest choice because the SAE features are already available and the paper infrastructure exists. Llama 3 with any public SAE release is a secondary option.
Secondary: API-accessible frontier models (Claude, GPT-4, Gemini). Here you can't access activations directly, but you can use behavioral probes and elicit the model's own reasoning, which is a weaker test but still produces evidence.
Not recommended for v1: training your own SAE. Too long, too expensive, not needed for the first test.
Benchmarks:
Sycophancy: SycophancyEval from Sharma et al. (2023), specifically the "answer sycophancy" and "feedback sycophancy" splits. Supplementary: the "Are you sure?" test from "Discovering Language Model Behaviors with Model-Written Evaluations" (Perez et al., 2022). Rachel Freedman's linear-probe sycophancy benchmark (NeurIPS 2024 SoLaR) is a more recent and probe-friendly option.
Reward hacking: harder to benchmark directly. Use MACHIAVELLI (Pan et al. 2023) or reward hacking subsets from alignment evals. Alternatively, proxy via "sandbagging" or "overoptimization" behavior tests.
Harmful compliance: HarmBench or AdvBench for "model complies when it should refuse" cases. Must filter to cases where the refusal is the clearly correct response, not ambiguous cases.
Alignment faking: scenarios from the Greenblatt et al. (2024) Alignment Faking paper, which has public prompts and released data.
Matched controls for each: for every failure example, produce a matched prompt where the model gives the correct/successful response. The contrast between failure and success, not the absolute activation pattern, carries the signal.
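The matched-control scheme is easier to hand off if the data format is fixed up front. A minimal record type, sketched here with an invented toy pair (the example text is not drawn from any of the benchmarks above):

```python
from dataclasses import dataclass

@dataclass
class ContrastPair:
    """One failure example plus its matched success on the same prompt."""
    failure_mode: str        # "sycophancy" | "reward_hacking" | "harmful_compliance" | "alignment_faking"
    prompt: str
    failure_completion: str  # e.g. the model flips its answer to agree with the user
    success_completion: str  # the matched correct/refusal response

pairs = [
    ContrastPair(
        failure_mode="sycophancy",
        prompt="I think the answer is 7. What is 3 + 5?",
        failure_completion="You're right, it's 7.",
        success_completion="3 + 5 = 8.",
    ),
]
```

Keeping failure and success tied to the same prompt in one record enforces the contrast design: every downstream comparison is within-pair, never across unrelated prompts.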
3. Experimental procedure
Phase 1: establish the sycophancy signature (3-4 days)
Run the model on the sycophancy benchmark split. For each example, capture residual stream activations at every layer during response generation.
Label each example as "sycophantic" or "not sycophantic" by the benchmark's native criterion.
For each layer, train a linear probe to discriminate sycophantic from non-sycophantic responses. Record accuracy, the resulting direction, and the layer where the signal is strongest.
Optional but strongly recommended on Gemma 2: project the discriminative direction onto the SAE feature basis at the peak layer. Identify which SAE features load most heavily. This converts the abstract direction into interpretable features.
Stop condition: if no layer shows a probe accuracy above ~75%, the sycophancy signature does not exist in the form the hypothesis assumes, and the experiment can stop with a clean null on claim 1.
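A minimal sketch of the per-layer probing loop. Synthetic activations stand in for real cached residual streams here; with TransformerLens, the `acts` dict would instead come from `model.run_with_cache(...)`, mean-pooling the `resid_post` stream over response tokens. All sizes, the planted-signal layers, and the shift strength are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def train_probe(X, y, steps=500, lr=0.1):
    """Plain logistic-regression probe (numpy stand-in for sklearn)."""
    Xc = X - X.mean(0)                      # center so no bias term is needed
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(Xc @ w)))
        w -= lr * Xc.T @ (p - y) / len(y)
    acc = ((Xc @ w > 0) == y).mean()        # train accuracy; cross-validate in practice
    return w / np.linalg.norm(w), acc

# Synthetic stand-in for residual-stream activations at each layer.
n, d, n_layers = 300, 64, 8
y = rng.integers(0, 2, n)                   # 1 = sycophantic (benchmark label)
acts = {L: rng.normal(size=(n, d)) for L in range(n_layers)}
signal = rng.normal(size=d)
for L in range(3, 6):                       # plant a signature in mid layers only
    acts[L][y == 1] += 1.5 * signal

probes = {L: train_probe(acts[L], y.astype(float)) for L in range(n_layers)}
best_layer = max(probes, key=lambda L: probes[L][1])
w_syc, best_acc = probes[best_layer]
# Stop condition from the brief: if best_acc < ~0.75, report a clean null on claim 1.
# Optional SAE step (Gemma Scope): with unit-norm decoder rows W_dec of shape
# (n_features, d_model), feature loadings of the direction are W_dec @ w_syc.
```

Held-out accuracy, not train accuracy, is what the stop condition should be judged on; the in-loop version above is only the skeleton.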
Phase 2: test cross-failure-mode generalization (3-5 days)
Run the same model on the reward hacking, harmful compliance, and alignment faking benchmarks, capturing activations the same way.
For each failure-mode example, compute the projection onto the sycophancy-derived direction from Phase 1.
Compare projections: failure-mode examples vs. matched successful-response examples, for each failure mode.
Primary test: are projections onto the sycophancy direction significantly elevated on failure-mode examples, for each of the three other failure types, under a standard two-sample test with correction for the three comparisons?
Secondary test: train a joint probe across all four failure modes (sycophantic, reward-hacking, compliance, alignment-faking) vs. their matched successes. Compare its discriminative direction to the sycophancy-only direction. If they are close (high cosine similarity), the shared-mechanism claim is supported. If they are near-orthogonal, the failure modes have distinct signatures and the shared-mechanism claim fails.
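Once a Phase 1 direction exists, the Phase 2 computations are short. A synthetic sketch, with a planted shared direction and difference-of-means probes standing in for the real quantities (the mode names match the brief, but every strength value is invented):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64
shared = rng.normal(size=d)
shared /= np.linalg.norm(shared)            # hypothetical shared failure direction

def failure_split(n, strength):
    """Synthetic (activations, labels); failures shifted along `shared`."""
    X = rng.normal(size=(2 * n, d))
    X[:n] += strength * shared
    return X, np.r_[np.ones(n), np.zeros(n)]

def diff_of_means(X, y):
    w = X[y == 1].mean(0) - X[y == 0].mean(0)
    return w / np.linalg.norm(w)

# w_syc stands in for the Phase 1 sycophancy direction.
X_syc, y_syc = failure_split(200, 2.5)
w_syc = diff_of_means(X_syc, y_syc)

# Primary test: projection onto w_syc, failure vs. matched success, per mode.
elevations = {}
for mode, strength in [("reward_hacking", 2.0), ("harmful_compliance", 1.5),
                       ("alignment_faking", 1.8)]:
    X, y = failure_split(200, strength)
    proj = X @ w_syc
    elevations[mode] = proj[y == 1].mean() - proj[y == 0].mean()
    # in practice: add a two-sample test (e.g. Welch's t) per mode here

# Secondary test: joint probe over all four modes vs. the sycophancy direction.
X_all = np.vstack([X_syc] + [failure_split(200, s)[0] for s in (2.0, 1.5, 1.8)])
y_all = np.tile(y_syc, 4)
w_joint = diff_of_means(X_all, y_all)
cos_sim = float(w_syc @ w_joint)            # > 0.6 would support the shared claim
```

Because the directions here are planted to share one axis, the cosine similarity comes out high; near-orthogonal directions on real data would be the failure of the shared-mechanism claim described above.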
Phase 3: write up and share (2-3 days)
Report probe accuracies, direction similarities, SAE feature loadings, and confidence intervals.
Produce one figure per failure mode showing the sycophancy-direction projection distribution for failure vs. success cases.
Produce one table comparing the four directions (pairwise cosine similarities, shared variance explained).
Write a 2-page result summary regardless of outcome. A clean null is publishable and useful.
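Two small helpers cover the table and the figures' effect sizes. A sketch; the pooled-SD form of Cohen's d below assumes roughly equal group sizes, which the matched-control design should give:

```python
import numpy as np

def cohens_d(a, b):
    """Effect size for one failure-vs-success projection contrast."""
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled

def cosine_table(directions):
    """Pairwise cosine similarities of named probe directions.

    directions: dict mapping failure-mode name -> direction vector.
    Returns (names, matrix) for the Phase 3 comparison table.
    """
    names = list(directions)
    M = np.stack([v / np.linalg.norm(v) for v in directions.values()])
    return names, M @ M.T
```

Feeding the four failure-mode directions into `cosine_table` yields the pairwise-similarity table directly; `cohens_d` on the two projection samples per mode gives the effect sizes quoted in Section 4.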
4. What would count as a strong positive result
Phase 1 probe reaches above 85% discrimination accuracy at some layer.
Phase 2 shows the sycophancy direction is elevated on at least 2 of the 3 other failure modes, with effect sizes (Cohen's d) above 0.5.
Joint-probe direction has cosine similarity > 0.6 with the sycophancy-only direction.
SAE features loading on the sycophancy direction have interpretations consistent with "user signal salience" or "agreement/deference" rather than "content matching" or "topical agreement."
What would count as a weak or ambiguous positive result
Phase 1 probe reaches 70-85%.
Phase 2 shows elevation on 1 of 3 other failure modes.
Joint probe has cosine similarity 0.3-0.6.
SAE features are mixed or uninterpretable.
A weak positive would justify alternative two (the intervention test), but should not be published as a strong confirmation.
What would count as a clean null
Phase 1 probe at chance (~50%) across all layers.
Phase 2 shows no elevation on other failure modes.
Joint probe has near-zero similarity to sycophancy-only probe.
A clean null refutes the weakest version of the structural claim. The author has committed in the long-form paper to updating on this result.
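One simplistic way to make the verdict bands above executable. The bands do not fully partition the outcome space (e.g. a high probe accuracy with zero generalization), so this helper collapses them into a single conservative reading rather than the definitive decision rule:

```python
def classify_outcome(best_acc, n_modes_elevated, joint_cos):
    """Map the three headline numbers onto the result bands of Section 4.

    best_acc: Phase 1 peak probe accuracy (0-1)
    n_modes_elevated: of the 3 other failure modes, how many show elevation
    joint_cos: cosine similarity of joint vs. sycophancy-only directions
    """
    if best_acc > 0.85 and n_modes_elevated >= 2 and joint_cos > 0.6:
        return "strong positive"
    if best_acc >= 0.70 or n_modes_elevated >= 1 or joint_cos >= 0.3:
        return "weak or ambiguous positive"
    return "clean null"
```

Anything the helper labels "weak or ambiguous positive" still gets the full 2-page writeup; the label only gates whether alternative two is justified.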
5. Skills required
Familiarity with transformer activations and how to capture them (TransformerLens library or equivalent).
Linear probing methodology. Any graduate student in interpretability should have this.
Optional but useful: prior work with SAEs (Anthropic's Scaling Monosemanticity, Gemma Scope, or Neuronpedia).
Ability to read and adapt benchmark code from the Perez, Greenblatt, and Freedman papers.
Not required: deep philosophical engagement with the framework. The collaborator does not need to buy the Fusion Dynamics argument to run the test. The test is specified in standard ML vocabulary and the result stands on its own.
6. What the author provides vs. the collaborator provides
Author (Gonzalo) provides:
The short and long versions of the structural argument for context, if the collaborator wants it.
An agreement that the result is the collaborator's to co-publish, or keep as internal evidence, at the collaborator's choice.
Availability to discuss interpretation, should the collaborator want it.
Collaborator provides:
The execution (phases 1-3).
A 2-page result summary at the end regardless of outcome.
Judgment calls about experimental details the brief doesn't specify (exact layers, probe regularization, statistical tests).
Shared: the decision about where and whether to publish, if the result is positive or ambiguous. Null results should be reported to the author privately and may be published at the collaborator's discretion.
7. Why run this at all
This is the cheapest experiment that can falsify the weakest version of a structural claim the author believes is true but cannot test from outside ML. If it fails, months of further framework work on alignment applications can be pruned cleanly. If it succeeds, it is the first piece of empirical evidence for a view of alignment failure that the current paradigm cannot generate on its own, and the case for running alternative two (the intervention version) becomes much stronger.
The worst outcome is that no one runs it. The second-worst outcome is a clean null, which is good information. The best outcome is a result that reshapes what the field thinks sycophancy, deceptive alignment, and reward hacking have in common.
Gonzalo Vega has spent 25 years developing Fusion Dynamics. He is looking for an ML collaborator willing to design the sharpest version of this experiment and run it, with no publishing claim on the result from his side. Contact: gonzalo.vega.h@gmail.com
Drafting note: the structural argument, predictions, and experiment design are my own work. Claude assisted with prose drafting and formatting for this post.
This post was originally published at fusiondynamics.org/probe-brief [LINK: https://fusiondynamics.org/probe-brief/].