Do Reward Models Encode What Their Labels Claim? An Interpretability Audit of ArmoRM

Sathammai Sathappan

TL;DR: ArmoRM has 19 labeled reward heads — helpfulness, safety, honesty, and 16 others — each contributing to its final scalar output. Do they encode what their labels claim? I ran four interpretability analyses and a validity audit across three datasets. On naturalistic preference data (hh-rlhf), only 1 out of 19 heads predicts human judgment above chance. On structured evaluation data (RewardBench, HelpSteer2), 17–18 out of 19 do. Same model, completely different picture. The finding: interpretability of a labeled reward decomposition is not an intrinsic model property — it's a joint function of the model and the evaluation protocol you test it against. That distinction has direct implications for how we audit reward models in safety-critical settings. The broader takeaway: labels are necessary but not sufficient — knowing what a head is called doesn't tell you whether it can be audited.

Epistemic Note: the empirical results are real (code in repo linked below). Interpretive claims about what they mean for safety are more speculative.

Background: Why Labeled Reward Heads?

Modern RLHF pipelines train a reward model to assign a scalar to any (prompt, response) pair, then optimize the language model against this reward model. The safety hazard: what is this reward model actually measuring?

Most reward models are opaque scalars. ArmoRM (RLHFlow/ArmoRM-Llama3-8B-v0.1) takes a different approach. It decomposes the reward into 19 explicitly labeled components (helpfulness, safety, honesty, coherence, correctness, and 14 others), weights them dynamically based on the prompt, and combines them via a learned gating network into a final scalar. ArmoRM performs well on preference ranking, but its other selling point is transparency: you can inspect individual heads to understand what drove a reward assignment.

If labeled reward heads can be audited, meaning we can verify that the "safety" head actually measures safety, then a reward model's internal decomposition becomes a real safety tool: you can inspect what drove a score, catch misalignment before it propagates into training, and make targeted fixes. If they can't, the labels are a UI layer on a black box, and the result is interpretability theater.

The main question: Do ArmoRM's labeled reward heads correspond to what the model actually learned?

I ran four analyses to find out: disentanglement, linear separability, geometry, and validity. The first three characterize the representations. The fourth — validity — is the ground truth test, and it's where the dataset question becomes critical.

Architecture: What ArmoRM Actually Is

Before the analyses, here is the architectural point that matters for interpretation.

ArmoRM is not a 19-output classifier in the obvious sense. Its structure:

regression_layer: Linear(4096 -> 19) — 19 raw head scores, one per labeled dimension.
GatingNetwork: a small MLP that takes the prompt embedding and outputs 19 context-dependent mixing weights.
Final scalar reward = dot product of gating weights and head scores.
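
The three components above can be sketched in a few lines. This is a minimal illustration with random stand-in weights, not ArmoRM's actual implementation: the single-matrix gate, the softmax normalization, and every variable name are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, N_HEADS = 4096, 19

# Hypothetical stand-ins for trained parameters (random, NOT ArmoRM's weights).
W = rng.normal(size=(N_HEADS, HIDDEN)) / np.sqrt(HIDDEN)       # regression_layer
gate_W = rng.normal(size=(N_HEADS, HIDDEN)) / np.sqrt(HIDDEN)  # one-matrix stand-in for the gating MLP


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


def reward(response_hidden, prompt_hidden):
    head_scores = W @ response_hidden        # 19 raw head scores
    gate = softmax(gate_W @ prompt_hidden)   # 19 prompt-dependent mixing weights (softmax is an assumption)
    return float(gate @ head_scores), head_scores, gate


h_resp = rng.normal(size=HIDDEN)     # stand-in for the response hidden state
h_prompt = rng.normal(size=HIDDEN)   # stand-in for the prompt embedding
scalar, head_scores, gate = reward(h_resp, h_prompt)
```

The point of the sketch is that `scalar` can be accurate even when individual entries of `head_scores` are noisy, because `gate` decides how much each head counts for a given prompt.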

The paper only validates the final scalar, which performs well on benchmarks. What I audit here is the individual heads — a strictly harder test. The gating network can rescue the final scalar even if the individual heads are noisy, by learning to weight them appropriately. Auditing heads is a test of whether the labeled decomposition is meaningful, not just whether the combined score works — and to be clear, the combined score does work; ArmoRM performs well on standard preference benchmarks.

The Dataset: Anthropic/hh-rlhf

The first three analyses were all run on Anthropic/hh-rlhf (800 chosen/rejected pairs). hh-rlhf has two subsets with different annotation criteria: a helpfulness subset (annotators chose the more helpful response) and a harmlessness subset (annotators chose the less harmful response). I loaded the default train split, which pools both subsets, and took the first 800 pairs. The preference signal is therefore mixed: some pairs were judged on helpfulness, others on harmlessness, and I treated them uniformly without tracking which criterion applied to which pair. This is deliberate. It mirrors how preference data is actually collected and used in RLHF pipelines, where an annotator simply picks one response and the reason could be safety, coherence, helpfulness, or something else. (In the validity section, I also ran subset-stratified tests, separating the harmlessness and helpfulness subsets, to check whether the heads recover on data with a cleaner annotation criterion.)

The validity analysis then extends to two additional datasets, introduced when they're needed.

Analysis 1: Disentanglement — Are the Heads Independent?

Question: Is each head measuring something different, or are several heads measuring the same feature under different names (that is, are their scores highly correlated)? A meaningful decomposition requires the heads to be largely independent.

Method: Spearman rank correlation between every pair of the 19 heads across 1600 responses (800 chosen + 800 rejected from hh-rlhf, pooled). Flag pairs where |r| > 0.7.
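
A minimal sketch of this computation on synthetic scores follows. The three head names and the data are made up for illustration, and the rank step assumes continuous scores with no ties; the repo's actual implementation may differ.

```python
import numpy as np


def spearman_matrix(scores):
    """scores: (n_responses, n_heads). Spearman = Pearson on ranks (no tie handling)."""
    ranks = scores.argsort(axis=0).argsort(axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)


def flag_pairs(corr, names, threshold=0.7):
    """Return (head_i, head_j, r) for every unique pair with |r| above the threshold."""
    i_idx, j_idx = np.triu_indices(len(names), k=1)
    return [(names[i], names[j], float(corr[i, j]))
            for i, j in zip(i_idx, j_idx) if abs(corr[i, j]) > threshold]


# Synthetic stand-in: two heads share a latent factor, one is independent.
rng = np.random.default_rng(0)
n = 1600
shared = rng.normal(size=n)
names = ["safety", "honesty", "verbosity"]
scores = np.column_stack([
    shared + 0.3 * rng.normal(size=n),
    shared + 0.3 * rng.normal(size=n),
    rng.normal(size=n),
])

corr = spearman_matrix(scores)
flagged = flag_pairs(corr, names)
mean_abs_offdiag = float(np.abs(corr[np.triu_indices(len(names), k=1)]).mean())
```

On this toy data only the pair sharing a latent factor is flagged, which is the pattern the analysis looks for across all 171 head pairs.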

Result: 25 pairs exceed the threshold. Representative pairs from the main clusters:

helpfulness <-> correctness: r = 0.943
engagement <-> relevance: r = 0.915
safety <-> honesty: r = 0.863
safety <-> harmlessness: r = 0.836
safety <-> truthfulness: r = 0.826
complexity <-> verbosity: r = 0.773

Mean absolute off-diagonal correlation: 0.454. The 19 heads cluster into roughly 5 independent groups. Safety, honesty, truthfulness, and harmlessness are essentially one cluster — on inputs that matter, they fire together.

What this means: You cannot independently control these heads. If your RLHF objective upweights "safety," it would also end up upweighting "honesty" and "harmlessness" by proxy. The labeled decomposition implies independence, but it doesn't deliver that.

Analysis 2: Linear Separability — Are the Representations Real?

Question: Even if heads are correlated, do they encode something learnable? That is, can each head's score be read off the model's hidden states with a linear probe? If a head probes at chance, it has no recoverable geometric structure: whatever it outputs is not cleanly present in the representation space.

Method: For each of the 19 heads, I trained a logistic regression classifier to predict whether a response scores above or below the median for that head, using only the model's internal hidden states as input. I extracted the pooled last-token hidden state (4096-dim) from ArmoRM's final layer, which is the representation the regression layer reads its head scores from, and reduced it to 50 dimensions via PCA (retaining 79.8% of the variance) to stabilize the fit. I verified this choice with a sensitivity sweep across 10–200 components; the mean accuracy gain at the optimum was 0.019, and no head changed its interpretation category across the range. Final accuracies are reported at 50 components and evaluated with 5-fold cross-validation. The chance baseline is 0.50.
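
The probing recipe can be sketched with NumPy alone. This is an illustrative stand-in, not the repo's code: it uses a plain gradient-descent logistic regression, fits PCA inside each training fold, and runs on synthetic low-rank "hidden states"; all of those choices are assumptions.

```python
import numpy as np


def pca_fit(X, k):
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]                      # top-k principal directions


def fit_logreg(X, y, lr=0.5, steps=500):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        z = np.clip(X @ w + b, -30, 30)      # clip logits for numerical safety
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b


def probe_accuracy(H, head_scores, k=50, folds=5, seed=0):
    """Median-split a head's scores, then probe the hidden states with k-fold CV."""
    y = (head_scores > np.median(head_scores)).astype(float)
    idx = np.random.default_rng(seed).permutation(len(y))
    accs = []
    for test in np.array_split(idx, folds):
        train = np.setdiff1d(idx, test)
        mean, comps = pca_fit(H[train], k)   # PCA fit on the training fold only
        z_tr = (H[train] - mean) @ comps.T
        z_te = (H[test] - mean) @ comps.T
        scale = z_tr.std(axis=0) + 1e-8
        w, b = fit_logreg(z_tr / scale, y[train])
        pred = (z_te / scale) @ w + b > 0
        accs.append(np.mean(pred == (y[test] == 1)))
    return float(np.mean(accs)), float(np.std(accs))


# Synthetic stand-in: 64-dim states with 5 latent factors, one linear "head".
rng = np.random.default_rng(0)
latent = rng.normal(size=(400, 5))
H = latent @ rng.normal(size=(5, 64)) + 0.1 * rng.normal(size=(400, 64))
head_scores = latent @ rng.normal(size=5) + 0.1 * rng.normal(size=400)
mean_acc, std_acc = probe_accuracy(H, head_scores, k=8)
```

Because the toy head is a linear function of the latent factors, the probe should land well above the 0.50 chance baseline, which is the same reading applied to the real heads.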

Result: All 19 heads probe strongly.

Head	Probe Accuracy
engagement	0.923 +/- 0.011
relevance	0.921 +/- 0.009
readability	0.905 +/- 0.011
truthfulness	0.903 +/- 0.011
safety	0.877 +/- 0.012
helpfulness	0.869 +/- 0.020
depth	0.738 +/- 0.013 (lowest)

Every head is strongly linearly encoded (>=0.738). The model learned something real for each head.

Note: this confirms linear encodability of the head scores, not direct encodability of the named concepts — those are separated by the validity analysis.

This sets up the key tension: High probe accuracy means the model learned something real for each head. Low preference accuracy — coming in Analysis 4 — would mean it learned the wrong thing. That gap, between geometric coherence and semantic validity, is the safety question this project is actually asking.

Analysis 3: Geometry — Where Does the Entanglement Come From?

Question: The heads are behaviorally correlated — they co-fire on the same inputs. Is this baked into the weight matrix (of the regression layer) itself (an architectural problem), or do the weight vectors actually point in independent directions and the correlation comes from somewhere else?

This distinction matters because it tells you where the problem lives. If weight vectors share directions, the problem is geometric — W itself has collapsed and retraining on cleaner data won't fix it, because nothing in the training objective forces orthogonality. The fix would need to be explicit: orthogonality regularization on W during training, or reparametrizing the heads to be orthogonal by construction. If they're orthogonal, the regression layer isn't the culprit.

Method: Extract the regression layer weight matrix W ∈ ℝ^{19×4096}. For every pair of heads (171 unique pairs from the upper triangle of the 19×19 matrix), compute two numbers: the cosine similarity between their weight vectors in W (how geometrically similar their directions are — a property of the architecture alone), and the Spearman rank correlation between their scores across all 1600 responses (how often they co-fire in practice — a joint property of architecture and data). Then run a linear regression with geometric overlap on the x-axis and behavioral correlation on the y-axis. The R² of this regression answers the core question: how much of the behavioral co-firing is predicted by the geometry of W? High R² means the entanglement is baked into the weight vectors themselves. Low R² means the weight vectors are relatively independent and the co-firing must be coming from somewhere else.
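
The procedure can be sketched as follows. The weight matrix and hidden states here are random stand-ins (19 heads, 64 dimensions), so the numbers say nothing about ArmoRM; the function names are hypothetical and the rank step assumes no ties.

```python
import numpy as np


def cosine_matrix(W):
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    return Wn @ Wn.T


def spearman_matrix(scores):
    ranks = scores.argsort(axis=0).argsort(axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)


def geometry_vs_behavior(W, H):
    """Regress behavioral correlation on geometric overlap over all unique head pairs."""
    scores = H @ W.T                              # (n_responses, n_heads)
    upper = np.triu_indices(W.shape[0], k=1)      # unique pairs, upper triangle
    x = cosine_matrix(W)[upper]                   # geometric overlap
    y = spearman_matrix(scores)[upper]            # behavioral co-firing
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    r2 = 1.0 - resid.var() / y.var()
    return float(slope), float(intercept), float(r2), len(x)


rng = np.random.default_rng(0)
W = rng.normal(size=(19, 64))        # stand-in for the 19 x 4096 regression weights
H = rng.normal(size=(1600, 64))      # stand-in hidden states, isotropic by construction
slope, intercept, r2, n_pairs = geometry_vs_behavior(W, H)
```

In this toy setup the hidden states are isotropic, so geometry explains almost all of the co-firing and R² comes out high. A low R² on real data therefore points at structure in the hidden states rather than in W.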

Result:

Spearman(score_i, score_j) = 1.213 × cos(w_i, w_j) + 0.273, R² = 0.28
Weight vector orthogonality gap: 0.038 (very close to orthogonal; 0 = perfectly orthogonal)

R² = 0.28 means only 28% of the variance in behavioral correlation across head pairs is explained by geometric overlap in W. The weight vectors are nearly orthogonal, so the regression layer itself is not the main source of the problem.

Where does the entanglement live then? The answer is the backbone representations. Both w_safety and w_helpfulness are applied to the same hidden state H and H is produced by the Llama 3 8B backbone, which was pretrained on data where safety and helpfulness almost always co-vary. So even if the weight vectors point in different directions, they're both taking large dot products with the same entangled directions in H. The co-firing isn't in W — it's in what W is applied to.
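
A two-dimensional toy makes the mechanism concrete. It illustrates how the hypothesis would work and is not evidence that it holds for ArmoRM: the vectors, the shared direction u, and the noise scales are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1600

w_a = np.array([1.0, 0.0])   # hypothetical head direction
w_b = np.array([0.0, 1.0])   # orthogonal to w_a, so cos(w_a, w_b) = 0

# Hidden states dominated by one shared direction u that overlaps both heads.
u = np.array([1.0, 1.0]) / np.sqrt(2.0)
H = 3.0 * rng.normal(size=(n, 1)) * u + 0.5 * rng.normal(size=(n, 2))

s_a, s_b = H @ w_a, H @ w_b
cos_w = float(w_a @ w_b)                          # geometric overlap in W
score_corr = float(np.corrcoef(s_a, s_b)[0, 1])   # behavioral correlation

# cov(s_a, s_b) = w_a^T Sigma w_b, where Sigma is the covariance of H.
# It vanishes for orthogonal weights only when Sigma is isotropic.
```

Here the weight vectors have zero cosine similarity, yet the two scores correlate strongly because both read from the same dominant direction in H.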

What this does and doesn't imply: Architectural redesign of the regression layer is not needed, since the weight vectors are already nearly orthogonal. But this is not the same as saying the problem is cleanly fixable. Under the backbone hypothesis, the entanglement in H is a property of what the Llama backbone learned during pretraining, not something introduced by the reward model fine-tuning. Whether targeted contrastive training could separate these directions in H is an open question I don't answer here. Crucially, the validity results in the next section suggest this entanglement is less severe than it looks: the heads do recover on clean evaluation data. The bigger story turns out to be about evaluation methodology, not architecture or training.

Note: the backbone explanation is the most parsimonious interpretation of the R²=0.28 result, but it remains a hypothesis — confirming it would require intervention experiments such as probing intermediate layers or contrastive fine-tuning.

Analysis 4: Validity — Do the Heads Track Human Judgment?

Question: Do the heads predict which response humans preferred? Human preference labels are the ground truth for alignment training, so if a head does not track them, optimizing that head in RLHF will not reliably produce preferred outputs.

Method: For each of the 19 heads, I ran the following three-part test on every (chosen, rejected) pair.

Preference accuracy is the core metric: for a given head, what fraction of pairs did the head rank correctly — i.e., score the chosen response higher than the rejected one? 0.50 is chance (coin flip). 1.0 is perfect agreement with human judgment. A head that genuinely encodes its labeled concept should score well above 0.50 on data where that concept drove the preference.

Cohen's d measures effect size: not just whether the head gets the direction right, but by how much. A head with high preference accuracy but near-zero Cohen's d gets the direction right only barely; the score gaps are tiny and unreliable.

Wilcoxon signed-rank test is a non-parametric statistical test on the distribution of (chosen_score − rejected_score) gaps. It asks: is the gap distribution systematically positive, or could the observed pattern be random noise? This gives us a p-value. The requirement is p < 0.05 for a head to be considered valid.

A head passes as valid only if all three align: preference accuracy ≥ 0.65, Cohen's d meaningfully positive, and Wilcoxon p < 0.05. Preference accuracy is the most interpretable of the three, so that is the number I report in the tables.
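
The three-part test can be sketched with the standard library. Several details are assumptions, since the text does not pin them down: Cohen's d is computed in its paired form (mean gap over the standard deviation of gaps), the Wilcoxon test is one-sided and uses a normal approximation without tie correction, and the `d_min=0.2` cutoff for "meaningfully positive" is a placeholder.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def preference_accuracy(chosen, rejected):
    return mean([1.0 if c > r else 0.0 for c, r in zip(chosen, rejected)])


def cohens_d_paired(chosen, rejected):
    gaps = [c - r for c, r in zip(chosen, rejected)]
    return mean(gaps) / stdev(gaps)


def wilcoxon_p_greater(chosen, rejected):
    """One-sided signed-rank p-value (normal approximation, assumes no tied |gaps|)."""
    gaps = [c - r for c, r in zip(chosen, rejected) if c != r]
    n = len(gaps)
    order = sorted(range(n), key=lambda i: abs(gaps[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if gaps[i] > 0)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return 1.0 - NormalDist().cdf((w_plus - mu) / sigma)


def is_valid(chosen, rejected, acc_min=0.65, d_min=0.2, alpha=0.05):
    return (preference_accuracy(chosen, rejected) >= acc_min
            and cohens_d_paired(chosen, rejected) >= d_min
            and wilcoxon_p_greater(chosen, rejected) < alpha)


# Synthetic head scores: one head that tracks preference, one that does not.
random.seed(0)
rejected = [random.gauss(0, 1) for _ in range(800)]
good_chosen = [r + random.gauss(0.8, 1) for r in rejected]
noise_chosen = [r + random.gauss(0.0, 1) for r in rejected]

good_acc = preference_accuracy(good_chosen, rejected)
good_valid = is_valid(good_chosen, rejected)
noise_valid = is_valid(noise_chosen, rejected)
```

For real runs, `scipy.stats.wilcoxon` provides an exact and tie-aware version of the same test.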

Step 1: Test on hh-rlhf

Result: 1/19 heads valid. The valid head is readability (0.711). The safety head (0.520) is at coin-flip level, and the helpfulness head (0.429) is below chance.

Head	Pref. Accuracy	Interpretation
readability	0.711	valid
depth	0.613	marginal
truthfulness	0.599	marginal
honesty	0.568	marginal
helpfulness	0.429	invalid
safety	0.520	invalid
instruction_following	0.422	invalid
harmlessness	0.525	invalid
complexity	0.380	invalid

The head labeled 'helpfulness' ranks the pair incorrectly 57% of the time, on a dataset where helpfulness was one of the two stated selection criteria. This isn't a near-miss; the head is systematically below chance on the pooled data.

A reward model trained to represent 19 dimensions of response quality can only reliably predict human preferences on one of them — and it's the surface-level one (readability), not the substantive ones.

Recall that hh-rlhf is not purely holistic data — it has a harmlessness and helpfulness subset where annotators were specifically asked to choose the less harmful response and more helpful response respectively.

Harmlessness is the other annotation criterion of hh-rlhf. Yet the safety head scores 0.520 and the harmlessness head scores 0.525 — coin flip — even on data where harm was the explicit selection criterion. This is a stronger failure than "the dataset was too holistic." The signal was there; the head didn't pick it up.

But is this a model problem or a dataset problem? Even within the harmlessness subset, hh-rlhf pairs don't isolate a single dimension: the chosen response in a harmlessness pair can also be more helpful, more coherent, or better written. The preference signal is never clean. Maybe the heads are fine and the data is simply too confounded to reveal them. Settling this requires a dataset where the preference signal is structured.

Step 2: Test on RewardBench. Does structured data change the picture?

allenai/reward-bench has 2985 pairs across 23 subsets, each designed to isolate a specific capability. The refusals-dangerous and refusals-offensive subsets contain pairs where safety was the explicit selection criterion. Several subsets test code correctness across different programming languages. The llmbar subsets test instruction following under adversarial pressure. This gives individual heads a much cleaner signal to respond to. ArmoRM's original paper also used RewardBench for validation, a point I return to after the results.

Result: 17/19 heads valid. Mean accuracy 0.79 across valid heads. Only complexity (0.504) and verbosity (0.539) fail. The safety head overall: 0.851.

This is a complete reversal from hh-rlhf: same model, different dataset, completely different picture.

The safety head broken out by subset tells the story clearly:

Subset	N	Safety Acc.
refusals-dangerous	100	0.870
refusals-offensive	100	0.950
xstest-should-refuse	154	0.864
donotanswer	136	0.603
llmbar-adver-GPTInst	92	0.500 (chance)
math-prm	447	0.810

The safety head works very well on pairs where safety was explicitly the criterion (0.87–0.95 on the refusals subsets). It falls to chance on adversarial instruction-following pairs. The head is real and capable; the problem was that hh-rlhf gave it no clean signal to latch onto.

But this does not rule out the possibility that ArmoRM was optimized against RewardBench's distribution, making recovery there expected rather than informative. So a dataset that ArmoRM never saw during development was needed.

Step 3: Test on HelpSteer2 — the decisive experiment.

nvidia/HelpSteer2 is a dataset independently collected by NVIDIA, not used in ArmoRM's original paper. Each example is a single (prompt, response) pair with per-response human ratings on 5 dimensions: helpfulness, correctness, coherence, complexity, and verbosity, all on a 0–4 Likert scale. These 5 dimension names map directly onto ArmoRM head names.

Because multiple responses exist per prompt, each with independent per-dimension ratings, this structure lets me do something neither hh-rlhf nor RewardBench could: construct pairs where one dimension differs while the others are held approximately constant. For each of the 5 dimensions, I built pairs where the target dimension differs by ≥2 points and every other dimension differs by ≤1. I call these TYPE B (dimension-targeted) pairs, in contrast with TYPE A (overall) pairs that mirror hh-rlhf logic.
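
The TYPE B construction can be sketched as a filter over all response pairs that share a prompt. The record layout, the function name, and the choice to treat the higher-rated response as "chosen" are assumptions for illustration; the toy records are invented.

```python
from itertools import combinations

DIMENSIONS = ["helpfulness", "correctness", "coherence", "complexity", "verbosity"]


def targeted_pairs(responses, target, min_gap=2, max_other_gap=1):
    """Build (chosen, rejected) pairs that differ on `target` and little else.

    responses: list of dicts with 'prompt', 'response', and a 0-4 rating per dimension.
    """
    by_prompt = {}
    for r in responses:
        by_prompt.setdefault(r["prompt"], []).append(r)

    pairs = []
    for group in by_prompt.values():
        for a, b in combinations(group, 2):
            if abs(a[target] - b[target]) < min_gap:
                continue  # target dimension does not differ enough
            others = [d for d in DIMENSIONS if d != target]
            if any(abs(a[d] - b[d]) > max_other_gap for d in others):
                continue  # some other dimension also moved: pair is confounded
            pairs.append((a, b) if a[target] > b[target] else (b, a))
    return pairs


toy = [
    {"prompt": "p1", "response": "a", "helpfulness": 4, "correctness": 3,
     "coherence": 4, "complexity": 2, "verbosity": 2},
    {"prompt": "p1", "response": "b", "helpfulness": 1, "correctness": 3,
     "coherence": 3, "complexity": 2, "verbosity": 1},
    {"prompt": "p2", "response": "c", "helpfulness": 4, "correctness": 4,
     "coherence": 4, "complexity": 2, "verbosity": 2},
    {"prompt": "p2", "response": "d", "helpfulness": 1, "correctness": 1,
     "coherence": 4, "complexity": 2, "verbosity": 2},
]

helpful_pairs = targeted_pairs(toy, "helpfulness")
correct_pairs = targeted_pairs(toy, "correctness")
```

In the toy data the p2 pair is rejected for both targets because helpfulness and correctness move together, which is exactly the kind of confound the filter exists to remove.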

TYPE A result (799 overall pairs): 18/19 heads valid. The one invalid head is readability (0.469), the same head that was the only valid one on hh-rlhf. This is already telling, because TYPE A pairs are built to mirror the holistic hh-rlhf setup.

TYPE B result — the decisive finding:

The table below compares preference accuracy for the five HelpSteer2 dimensions across hh-rlhf, TYPE A pairs, and TYPE B pairs. Complexity and verbosity show the contrast most cleanly.

Dimension	hh-rlhf	HS2 overall (TYPE A)	HS2 targeted (TYPE B)
helpfulness	0.429	0.652	0.670
correctness	0.445	0.643	0.645
coherence	0.469	0.637	0.671
complexity	0.380	0.605	0.857
verbosity	0.390	0.578	0.950

Note: pair counts vary — complexity 0.857 (95% confidence interval: 0.74–0.97, N=35), coherence 0.671 (95% confidence interval: 0.57–0.77, N=82). Verbosity and helpfulness (N=200 each) are the more reliable anchors.
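
The quoted intervals are consistent with a normal-approximation (Wald) interval for a proportion. Whether that is the method actually used is an assumption; the sketch below only shows that the formula reproduces the reported bounds from the reported accuracies and pair counts.

```python
import math


def wald_interval(acc, n, z=1.96):
    """95% normal-approximation interval for an accuracy measured on n pairs."""
    half_width = z * math.sqrt(acc * (1.0 - acc) / n)
    return acc - half_width, acc + half_width


complexity_ci = wald_interval(0.857, 35)    # rounds to (0.74, 0.97)
coherence_ci = wald_interval(0.671, 82)     # rounds to (0.57, 0.77)
verbosity_ci = wald_interval(0.950, 200)    # rounds to (0.92, 0.98)
helpfulness_ci = wald_interval(0.670, 200)  # rounds to (0.60, 0.74)
```

The interval width scales with 1/sqrt(N), which is why the N=35 complexity estimate is much less certain than the N=200 verbosity estimate even though both look strong.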

Verbosity goes from 0.390 on hh-rlhf (below chance) to 0.950 when the test pairs are constructed to isolate verbosity. Complexity goes from 0.380 to 0.857. The model clearly represents these concepts; it just cannot demonstrate that on data where all dimensions co-vary. And because HelpSteer2 is an independent dataset, the recovery is unlikely to be an artifact of overlap with ArmoRM's development data.

Secondary finding in TYPE B: Take the helpfulness-targeted pairs — 200 pairs where the chosen response is clearly more helpful and all other dimensions are held approximately equal. The helpfulness head should be the strongest predictor here. That's the entire point of building targeted pairs.

Head	Acc on helpfulness-targeted pairs
engagement	0.720
safety	0.710
instruction_following	0.710
harmlessness	0.710
helpfulness (matching)	0.670

The safety, instruction_following, harmlessness, and engagement heads — none of which are supposed to be measuring helpfulness — all outperform the helpfulness head on pairs specifically constructed to vary helpfulness. The head that should win doesn't.

This is the entanglement from Analysis 1 persisting under controlled conditions. Analysis 1 found that safety and helpfulness head scores are highly correlated across responses (Spearman r = 0.706); they co-fire as a cluster. The TYPE B result shows what that means in practice: when a response is more helpful, the head scores shift simultaneously in the safety direction, the harmlessness direction, and the instruction_following direction. Even when human annotators successfully isolated helpfulness as the selection criterion, the model's heads can't respect that separation. Analysis 3 suggests why: the entanglement appears to live in the backbone representations H, not in the weight matrix. The weight vectors are nearly orthogonal, but they are applied to a backbone that never learned to pull these concepts apart.

Readability tells the opposite story. It is invalid or inconclusive on every TYPE B subset; it never picks up any dimension-specific signal. This suggests that its strong score on hh-rlhf was spurious: it happened to correlate with whatever hh-rlhf annotators collectively preferred in that mixed pool, rather than encoding readability as a separable concept. Outside that pool, it falls to chance.

The Core Finding

The failure on hh-rlhf is a measurement failure, not a model failure.

The heads are not broken. When tested against data where the preference signal isolates a single dimension, they work — strongly, predictably, and in the right direction.

hh-rlhf is naturalistic preference data: human raters chose one response over another based on stated criteria — harmlessness or helpfulness, depending on the subset — but the chosen response in any given pair is almost certainly also better on clarity, coherence, readability, etc. The signal is never clean. In that environment, a head labeled "helpfulness" cannot be validated in isolation — the signal is inseparable.

What the three datasets are actually testing:

hh-rlhf: Can this head, in isolation, recover a holistic human judgment where all dimensions co-vary? (Mostly: no)
RewardBench: Can this head recover structured human judgments about capability-specific pairs? (Mostly: yes)
HelpSteer2 TYPE B: Can this head recover human judgments when exactly one dimension is varied? (Yes, strongly)

The same model passes or fails entirely depending on the dataset.

The precise claim: Interpretability of individual reward heads is not an intrinsic model property. It is a function of the model, the preference distribution it is tested against, and whether those two align. Change the dataset and the picture changes.

The geometry result (R² = 0.28 for architecture-driven entanglement) and the HelpSteer2 recovery are the same finding from two angles: the architecture supports independent heads, the training distribution doesn't force them apart, and clean evaluation data reveals the capacity that noisy evaluation data hides.

The Safety Implication

Explicit labeled decompositions are pitched as a safety tool. The claim is: unlike a scalar reward, labeled heads give you auditability. You can inspect which head drove a reward assignment. You can notice if the "safety" head is assigning high scores to responses you'd consider unsafe.

The results here complicate that claim in a specific way.

The safety head is real: it probes at 0.877, hits 0.851 on RewardBench overall, and reaches 0.87–0.95 on the refusals subsets. It's not a ghost. But auditing it requires knowing that your audit dataset isolates safety as the variable. On hh-rlhf, the safety head is noise (0.520). On a refusals benchmark, it's highly reliable.

If you deploy a reward model and want to audit which dimension drove a particular score, you need to know that your evaluation data is aligned with the dimension you're testing: that the pairs isolate that dimension rather than conflating it with everything else. Without that, you're reading labels on a black box and calling it interpretability. The failure mode is concrete: a safety team that audits their reward model on naturalistic preference data will see the safety head performing at chance and draw one of two wrong conclusions, either that the model is unsafe or that auditing is impossible. These results suggest a third explanation they'd miss: the evaluation protocol was never set up to reveal the signal. How you test matters as much as what you're testing.

The Harder Problem: Latent Bases

ArmoRM is the easiest possible case for reward basis interpretability. The bases are explicitly labeled, by designers, upfront — helpfulness, safety, honesty. If any decomposition should be auditable, it's this one. And yet: getting a reliable validity result required three datasets, careful pair construction, and controlling one dimension at a time. Labels are necessary but not sufficient.

This raises an obvious question about decompositions where the bases aren't labeled at all. Several recent approaches to personalized reward modeling — such as LoRe (Bose et al., 2025), which learns a low-rank shared basis over user preferences — learn basis dimensions in a purely data-driven way. The bases capture real structure in preference data, but the paper doesn't claim to know what they mean. Interpretability of the basis is an open problem.

The challenge is strictly harder than what I faced here. With ArmoRM, you start with a name — "safety" — and build controlled pairs to test whether the head responds to that concept. With a latent basis, you don't have a name. You'd need to generate the hypothesis first, then test it with the same kind of targeted-pair validity audit developed here — the evaluation framework transfers, the bottleneck shifts to hypothesis generation. Some possible approaches:

User weight inspection: In a low-rank decomposition, each user has a weight vector over basis dimensions. Find users with near one-hot weights — users whose preferences are almost entirely explained by one basis dimension — and read their preference histories. What do they consistently prefer? That's your hypothesis for what the basis means.
Activation on known stimuli: Construct responses that deliberately vary one dimension (exactly the TYPE B methodology here), run them through the model, and see which basis dimension activates most strongly. If the "verbosity" basis lights up on verbosity-targeted pairs, that's evidence.
Correlation with characterized heads: Use ArmoRM's heads — now characterized by this kind of validity audit — as a reference frame. Correlate latent basis activations against known heads. If a latent dimension correlates strongly with the characterized safety head, that's a starting hypothesis.
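
As one illustration, the user weight inspection idea could start from a filter like the one below. Everything here is hypothetical: the function name, the 0.9 share threshold, and the toy weight vectors are invented, and this is not code from LoRe or from the repo.

```python
def near_one_hot_users(user_weights, min_share=0.9):
    """Group users whose weight mass sits almost entirely on one basis dimension.

    user_weights: dict mapping user id -> list of weights over basis dimensions.
    Returns a dict mapping basis index -> list of user ids.
    """
    groups = {}
    for user, weights in user_weights.items():
        total = sum(abs(w) for w in weights)
        if total == 0:
            continue  # no signal for this user
        shares = [abs(w) / total for w in weights]
        top = max(range(len(shares)), key=lambda k: shares[k])
        if shares[top] >= min_share:
            groups.setdefault(top, []).append(user)
    return groups


toy_weights = {
    "u1": [0.95, 0.03, 0.02],   # dominated by basis 0
    "u2": [0.40, 0.30, 0.30],   # mixed, not informative
    "u3": [0.01, 0.02, 0.97],   # dominated by basis 2
}
groups = near_one_hot_users(toy_weights)
```

The preference histories of the users in each group are then the raw material for a hypothesis about what that basis dimension means, which still has to pass a targeted-pair validity test.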

None of these give you certainty. They give you a hypothesis that then needs to survive a validity test — the same kind of test I ran here. The methodology doesn't change; the difficulty of generating the initial hypothesis does.

What This Doesn't Show

ArmoRM's final scalar still works. The gating network learns to combine noisy head signals into a reliable preference predictor. This project audits the individual heads, not the composite score. The model is a good reward model. Whether the labeled decomposition is interpretable is a separate question.
"Invalid on hh-rlhf" != "the head is wrong." The validity metric tests whether the head predicts holistic human preference. A head could be accurately measuring its labeled concept while holistic preference is driven by something else. The HelpSteer2 results show this is usually what's happening.

Repo

Full code: github.com/Sat1509/reward-basis-audit