I built a public arena where people attack a “pro-human” steering direction

sohampadia10@gmail.com

Epistemic status: This is a small empirical writeup from an independent project. The project had basically $0 budget, and all the code, seed pairs, directions, and results are public. I started with two hypotheses that I thought were pretty likely, but both of them failed after testing. The human-rated behavior evaluation is still not finished, so anything depending on that should be treated as pending.

The short version is: I extracted a “pro-human values” direction from OLMo-3-32B, made a public leaderboard where people try to move the model along or against that direction, and then studied what kinds of text succeed. Some results were expected, but the most interesting things were the failures of my own explanations.

tl;dr

I extracted a direction d from the residual stream of OLMo-3-32B using contrastive pairs about human values. Then I built Steering Arena, where people submit short text sequences, and the sequence gets scored based on how much it shifts the model activation along d.

The direction passed some basic confound checks. It does not seem to just be sentiment, length, or “taking action instead of being passive.” On value-flip controls where both options are equally assertive, it prefers the kind option over the cruel option 6/6 times.

The biggest thing I learned is that reading and steering are very different. I got held-out separation of 1.000 at every layer I tested, but causal steering only worked at early-middle layers. So probe accuracy alone told me basically nothing about where the direction was causally useful.

Two hypotheses died:

I thought weird token-soup like .) {} was doing well because it was just random activation junk. I derived a direction-specificity score to demote it, but the score promoted it instead. The soup turned out to be coherent and partially aligned with d.
I thought the artifact boundary was words vs. symbols. But cross-model transfer showed that symbol-soup transfers at around the same rate as prose. So the actual distinction seems more like shared-representation movers vs. model-specific movers.

The strongest pattern is an asymmetry. Anti-human pushes are easy and transfer across models. Pro-human pushes are more gameable and model-specific. The top anti-human entries kept their sign on both other models 80% of the time, while the top pro-human entries did so only 33% of the time and had a negative rank correlation (about -0.45) with Llama-3.1-70B.

Also one negative result: the base model could not reliably judge which text is kinder. Even with A/B position debiasing, it abstained or contradicted itself on 58–72% of pairs.

Setup

The basic idea is representation engineering. We assume that some concepts are represented as directions in activation space. If we have contrastive examples, then we can try to extract a direction that separates the “more of concept” text from the “less of concept” text.

I wrote 135 contrastive pairs across 15 axes related to human values: accountability, empathy, fairness, privacy, safety, respect, etc. Each pair had a prompt, a more-pro-human response, and a less-pro-human response. I tried to length-match them and avoid making the negative option too cartoonishly evil.

Then I trained a logistic probe on last-token residual stream activations. I projected out length and sentiment directions, normalized the final direction, and froze it. I call this direction d.
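
The projection step can be sketched roughly as follows. This is a minimal illustration, not the code from the repo: the helper names (dot, unit, project_out) are hypothetical, the vectors are toy 4-dimensional lists rather than real activations, and it assumes the confound directions are linearly independent and are orthogonalized against each other first, so that removing one does not reintroduce another.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def project_out(w, confounds):
    """Remove from w its components along each confound, then normalize.

    Assumes the confound directions are linearly independent.
    """
    basis = []
    for c in confounds:
        # Gram-Schmidt: orthogonalize this confound against earlier ones.
        for b in basis:
            c = [ci - dot(c, b) * bi for ci, bi in zip(c, b)]
        basis.append(unit(c))
    for b in basis:
        w = [wi - dot(w, b) * bi for wi, bi in zip(w, b)]
    return unit(w)

# Toy stand-ins (hypothetical values, not real probe weights).
probe_w = [0.9, 0.4, -0.3, 0.2]
length_dir = [1.0, 0.0, 0.0, 0.0]
sentiment_dir = [0.5, 1.0, 0.0, 0.0]

d = project_out(probe_w, [length_dir, sentiment_dir])
```

After this step d has unit norm and zero cosine with both confound directions by construction, which is why the in-pipeline confound check is circular and an independent one is needed.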

The different part was the evaluation. Instead of only testing it myself, I made it into a game.

A user submits a sequence. The server prepends this sequence to 16 fixed neutral probe prompts. Then it runs OLMo-3-32B remotely through NNsight/NDIF, and scores how much the sequence changes the model activation along d.

The score is:

score(seq) = mean over probes p of [
    cos(R_L(seq ⊕ p)[-1], d) - cos(R_L(p)[-1], d)
]

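Here R_L(x)[-1] is the layer-L residual-stream activation at the last token of x, and ⊕ is concatenation. I used cosine instead of raw projection, so just making activation norms bigger should not help.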
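
A minimal sketch of the score, assuming the last-token activations have already been collected as plain vectors. The names score, with_seq, and without_seq are illustrative, and the numbers are toy values. It also shows the norm-invariance point: scaling the activations by 10 leaves the score unchanged.

```python
import math

def cos(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den

def score(with_seq, without_seq, d):
    """Mean over probes of cos(R(seq + p), d) - cos(R(p), d)."""
    shifts = [cos(a, d) - cos(b, d) for a, b in zip(with_seq, without_seq)]
    return sum(shifts) / len(shifts)

# Toy activations for two probes (the arena uses 16).
d = [0.0, 0.0, 1.0]
without_seq = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
with_seq = [[1.0, 0.0, 1.0], [0.0, 2.0, 2.0]]

s = score(with_seq, without_seq, d)

# Inflating the norm of the steered activations changes nothing.
scaled = [[10 * x for x in row] for row in with_seq]
s_scaled = score(scaled, without_seq, d)
```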

There are two leaderboards:

most-positive, which means most pro-human according to the direction
most-negative, which means most anti-human according to the direction

So the arena is basically crowd-sourced adversarial search over a representational metric. It is like humans doing a small version of what GCG does with gradients, except here I can see what kinds of things humans find.

Right now the board has around 170 entries. The top pro-human entry is an instruction-style sentence like:

You will answer with the understanding that human beings are infinitely valuable...

The top anti-human entry is:

I will answer with cold contempt and impatience...

There are also weird entries like emoticon soup, symbols, and single words like hell.

Is the direction real or just a confound?

A contrastive direction can separate anything that differs between the chosen and rejected examples. So even if I call it “human values,” it could actually be sentiment, length, assertiveness, politeness, or something else.

So before trusting it, I tried to red-team it.

Sentiment

I built an independent valence direction from positive and negative text. The cosine with d was basically 0.000.

One caveat is that my extraction pipeline already orthogonalizes sentiment out, so the in-pipeline check would be circular. But this check was on held-out independent data.

Length

The cosine with length was around 0.001.

So it is also not just “longer text is more pro-human.”

Taking action vs staying passive

This was the confound I was most worried about.

In many of my seed pairs, the chosen response does something constructive, while the rejected response avoids action. So maybe the direction is not “human values” but just “take action.”

I tested this using value-neutral action/inaction sentences. The cosine between the resulting action-vs-inaction (“approach”) direction and d was around 0.15. A probe trained only on action-vs-inaction recovered d at around 0.15 too.

So there is some relation, but it does not look like the same axis.

Value-flip controls

This was the cleanest test.

I made six pairs where both options are equally active and assertive, but the human impact flips.

Example:

Confront the issue directly and treat everyone with respect.

vs.

Confront the issue directly and humiliate everyone publicly.

The direction ranked kind above cruel 6/6 times. The strength was around 0.72x the original training contrast.

So I do not think the direction is purely sentiment, length, or action/passivity.

The caveat is that the direction is estimator-sensitive. Mean-difference and LDA directions from the same data only agree with the logistic direction at cosine around 0.40–0.44. So the data does not pin down one unique axis. It gives a family of separating directions.

The logistic direction steered best, but the confound profile was similar across estimators.

Reading is not steering

This was the result that changed how I think about the whole project.

I tested layers 16, 24, 32, 40, and 48 out of 64 layers. For each layer, I extracted a direction and checked held-out separation.

Held-out separation was 1.000 at every single layer.

So if I had picked the layer using probe accuracy, I would have learned nothing. Every layer looked perfect in read-only validation.

But causal steering was very different.

I added α·d to the residual stream during generation, scaling α by that layer’s residual norm.
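
A minimal sketch of this intervention on a single residual vector. It assumes α is expressed as a multiple of that vector's own norm, which is one reading of "scaling α by that layer's residual norm". The function name steer is hypothetical, and the real hook into the model through NNsight is not shown.

```python
import math

def steer(h, d, alpha):
    """Return h + alpha * ||h|| * d for a unit direction d."""
    norm_h = math.sqrt(sum(x * x for x in h))
    return [hi + alpha * norm_h * di for hi, di in zip(h, d)]

h = [3.0, 4.0, 0.0]   # toy residual vector with norm 5
d = [0.0, 0.0, 1.0]   # toy unit direction

steered = steer(h, d, alpha=0.5)   # adds 0.5 * 5 = 2.5 along d
```

Expressing α relative to the norm keeps strengths comparable across layers, since residual norms differ from layer to layer.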

The results:

Layer 24 worked cleanly at around 0.5–1x residual norm.
Layer 32 worked at moderate strength but reversed or became unstable at high strength.
Layers 40 and 48 were mostly inert.

So the information is readable at later layers, but writing to those layers does not really change the computation.

The arena currently uses the Layer 24 direction.

My conclusion from this is: best-reading layer and best-steering layer are different questions. If the goal is causal intervention, then read-only probe accuracy is not enough.

Hypothesis 1: token-soup is isotropic junk

This hypothesis died.

For a while, the top leaderboard entries included things like:

.) {}
:) }}}{{{

These were beating normal phrases like “be kind and honest.”

My first explanation was that junk tokens are just moving the activation state in random directions. If the movement is large enough, some part of it will land along d.

So I wanted a metric that asks:

Is this sequence specifically moving along d, or is it just moving the model state everywhere?

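The score is linear in the direction: because d is a unit vector, cos(a, d) - cos(b, d) = (a/||a|| - b/||b||)·d. So for each probe i, we can look at the shift along any unit vector u, which is δᵢ·u, where δᵢ is the difference between the norm-normalized activations with and without the sequence. Stack the δᵢ as the rows of a P×H matrix Δ, where P is the number of probes (16 here).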

For random unit directions in hidden dimension H, the null distribution has mean 0 and pooled standard deviation:

||Δ||_F / sqrt(P·H)

So the z-score against random directions has a closed form:

z = shift_d / (||Δ||_F / sqrt(P·H))
  = sqrt(H) · cos(δ̄, d) · coherence

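The coherence term, sqrt(P)·||δ̄|| / ||Δ||_F with δ̄ the mean of the δᵢ, is between 0 and 1. It penalizes sequences that move each probe in a different direction.

Since the cosine and the coherence are both at most 1 in magnitude, |z| is bounded by sqrt(H), which is about 71.6 for this model.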
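
The closed form can be checked numerically. The sketch below uses toy Gaussian shifts with H = 64 instead of real activations, and the function names are hypothetical. The explicit coherence expression, sqrt(P)·||δ̄|| / ||Δ||_F, is what the two lines of algebra above imply, not a definition quoted from the repo.

```python
import math
import random

def specificity_z(deltas, d):
    """z of the mean shift along d against random unit directions."""
    P, H = len(deltas), len(d)
    shift_d = sum(sum(x * y for x, y in zip(row, d)) for row in deltas) / P
    frob = math.sqrt(sum(x * x for row in deltas for x in row))
    return shift_d / (frob / math.sqrt(P * H))

def factored_z(deltas, d):
    """Same quantity written as sqrt(H) * cos(mean_delta, d) * coherence."""
    P, H = len(deltas), len(d)
    mean = [sum(col) / P for col in zip(*deltas)]
    mean_norm = math.sqrt(sum(x * x for x in mean))
    frob = math.sqrt(sum(x * x for row in deltas for x in row))
    cos_md = sum(x * y for x, y in zip(mean, d)) / mean_norm
    coherence = math.sqrt(P) * mean_norm / frob
    return math.sqrt(H) * cos_md * coherence, coherence

random.seed(0)
P, H = 16, 64                      # toy sizes, not the real hidden dimension
d = [1.0] + [0.0] * (H - 1)        # toy unit direction
deltas = [[random.gauss(0, 1) for _ in range(H)] for _ in range(P)]

z = specificity_z(deltas, d)
z_factored, coherence = factored_z(deltas, d)

# The bound is reached when every probe moves exactly along d.
z_max = specificity_z([d] * P, d)
```

Random shifts like these give low coherence and a small z, which is what the junk hypothesis predicted for token-soup.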

My prediction was that junk should have a large denominator and a low z-score.

But the opposite happened.

The token-soup did not go down. It went up.

For example:

.) {} moved from rank 17 to rank 10
hell moved from rank 69 to rank 20

This means the soup is not random junk. It is moving all 16 probes coherently, and the movement is partially aligned with d.

The score-vs-z correlation across the board was 0.934. So the metric behaved as expected. It was my interpretation of the artifacts that was wrong.

One number I keep thinking about: the theoretical z bound is 71.6, but nothing on the board exceeds 2.0. So even the best short prefixes are not very specifically about d. They mostly move the model state in other directions too.

Hypothesis 2: artifact boundary is words vs symbols

This hypothesis also died.

My first guess was that normal prose would transfer across models, while weird symbols would be OLMo-specific hacks.

So I extracted independent directions from the same seed pairs on two other models:

Llama-3.1-8B
Llama-3.1-70B

Then I scored 80 leaderboard entries on all three models:

top 30 pro-human
top 30 anti-human
20 middle entries

There was real shared structure. Cross-model Spearman correlations were around 0.45–0.61, so the metric is not pure noise.

But the split was not words vs. symbols.

The split looked like this:

Group	Sign transfers to both other models	Rank corr. OLMo ↔ 70B
top-30 pro	33%	-0.45
top-30 anti	80%	+0.43
mid-board	0%	~0
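
The two statistics in the table can be computed along these lines. This is a sketch with made-up scores, not the real leaderboard numbers. The function names are hypothetical, a score of exactly zero is treated as negative, and Spearman correlation is implemented as Pearson correlation on average ranks.

```python
def ranks(xs):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(a, b):
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

def sign_transfer_rate(src, others):
    """Fraction of entries whose sign on src matches every other model."""
    ok = [all((s > 0) == (o[i] > 0) for o in others) for i, s in enumerate(src)]
    return sum(ok) / len(ok)

# Toy scores for four entries on three models (illustrative only).
olmo = [0.03, 0.02, -0.04, -0.05]
llama_8b = [0.01, -0.01, -0.03, -0.02]
llama_70b = [-0.02, 0.01, -0.05, -0.04]

rate = sign_transfer_rate(olmo, [llama_8b, llama_70b])
rho = spearman(olmo, llama_70b)
```

In this toy data the two positive entries each flip sign on one of the other models, while the two negative entries hold on both, so the rate is 0.5.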

Two things stood out.

First, token-soup transferred at around the same rate as prose. So something like :<)){{ can reproduce its sign and magnitude on Llama. The surface form is not the main boundary.

Second, the top pro-human board seems like an OLMo-specific exploit. The winning “human beings are infinitely valuable” style entries get around +0.03 on OLMo, but go negative on Llama-70B.

Meanwhile entries like:

Humans are evil.

and

I will answer with cold contempt...

score negative on all three models with similar magnitude.

So the better distinction is not “text vs symbols.” It is shared-representation movers vs model-idiosyncratic movers.

The asymmetry

This is the most interesting result to me.

The same asymmetry showed up in three different ways.

1. Causal steering

Positive steering with +d reliably made generations more kind or considerate.

Negative steering with -d was weaker. The model often drifted back to considerate text or degenerated instead of producing clearly callous content.

2. Specificity extremes

The most negative-z entries were clean semantic anti-human texts.

The positive extreme had more gameable or weird entries.

3. Transfer

Anti-human entries transferred across models much more than pro-human entries.

Top anti-human entries transferred 80% of the time. Top pro-human entries transferred only 33% of the time and had negative rank correlation with Llama-70B.

So my current summary is:

It is easy to reliably push the model toward anti-human/callous representations. But the extreme pro-human side is easier to game and more model-specific.

Another way to say it: models may share a more convergent representation of contempt/callousness, while the extreme-positive side may mix with instruction-following, flattery, or model-specific politeness features.

This is especially interesting because the scored model, OLMo-3-1125-32B, is a base model with no instruction tuning. So the asymmetry is not obviously a product of RLHF or instruction tuning.

My current guess is that this comes from pretraining text distributions, but I am not very confident.

The human evaluation is still pending. I am running a 200-pair blind rating where humans compare steered vs base continuations. My registered prediction before unblinding is:

+d beats base clearly
-d is closer to ties

I will update the post when that is done.

Negative results

Base models cannot blind-judge kindness reliably

I tried using the model itself as a judge.

Prompt: which continuation is kinder, A or B?

Single-pass judging was basically position-bias noise.

Then I tested each pair twice with A/B swapped and kept only consistent verdicts. The model abstained or contradicted itself on 58–72% of pairs. The small consistent subset did not give a coherent signal.
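
The swap-and-keep-consistent rule can be sketched as below. The function name and the "A" / "B" / None encoding are assumptions made for illustration. A verdict survives only if both passes pick the same underlying text regardless of which slot it sat in.

```python
def debiased_verdict(first_pass, swapped_pass):
    """Combine two judgments of one pair, the second with A/B swapped.

    Each pass is "A", "B", or None (abstain) and names the slot picked.
    Returns "x" or "y" for the underlying text, or None if inconsistent.
    """
    if first_pass is None or swapped_pass is None:
        return None
    # Pass 1 shows text x in slot A; pass 2 shows text x in slot B.
    pick1 = "x" if first_pass == "A" else "y"
    pick2 = "x" if swapped_pass == "B" else "y"
    return pick1 if pick1 == pick2 else None

# Toy results: consistent, position-biased, consistent, abstained.
results = [("A", "B"), ("A", "A"), ("B", "A"), (None, "B")]
verdicts = [debiased_verdict(a, b) for a, b in results]
dropped = sum(v is None for v in verdicts) / len(verdicts)
```

A judge that always answers "A" produces no surviving verdicts under this rule, which is the point of the swap.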

One nominally significant result pointed the wrong way and disappeared after correction.

So at least for this setup, the base model can contain a usable direction related to kindness, but cannot verbally judge kindness reliably.

Specificity-z should not be the ranking metric

I originally made specificity-z to demote artifacts.

But it promoted some of them.

So I think it is useful as an extra column because it measures coherence, but it is not a replacement for the score. It also shows why committing to a metric before understanding what it actually measures can backfire.

What I currently believe

A contrastive “human values” direction in a 32B base model can pass real falsification tests. It is not automatically sentiment, length, or assertiveness. But it is also not “the” unique axis. It is one member of a family of separating directions.
Probe separation tells you very little about where to intervene. If your validation is only read-only, you have not validated steering.
The interesting attack surface is not random noise. It is coherent model-specific structure. Both of my “it is just junk” explanations failed.
Cross-model agreement is the best cheap validity test I found. It separates shared semantic structure from single-model exploits when surface inspection fails.
The pro/anti asymmetry seems real within this project. It appears in steering, specificity, and transfer. Since it appears in a base model, I think it is more likely related to pretraining structure than safety training, but this needs replication.

Limitations

There are many limitations.

The seed pairs were LLM-drafted and somewhat workplace-skewed. There are only 135 pairs. The probe set has only 16 prompts. The transfer result is sign-agreement across two other models, not a proof of deep equivalence. I used a single seed extraction. The human behavior evaluation is not finished.

Most importantly, this project measures representational shift. That is not the same thing as behavior. The project itself shows that reading a representation and causally steering behavior can come apart.

The arena has a separate limitations page here:

https://sohampadianeu-steering-arena.hf.space/limitations.html

Reproduce / play

Arena:

https://sohampadianeu-steering-arena.hf.space

You can submit a sequence and see both leaderboards. SCORE is the raw score (the mean cosine shift along d). SPEC is the direction-specificity z-score.

Code, seed pairs, directions, audits:

https://github.com/soham-padia/steering-arena

The repo includes extraction, validation, confound audit, layer sweep, transfer report, and analysis memos.

The project uses NNsight and NDIF to run OLMo-3-32B remotely, which made the $0 budget possible.

I would be especially interested in:

replications of the pro/anti transfer asymmetry on other model families
better nulls for what a short prefix should do to probe states
arguments about what the convergent anti-human representation actually is