Is AI Self-Aware? Mechanistic Evidence from Activation Steering
It's not exactly the hard problem of consciousness. But are models self-aware? And how would you even measure that in a transformer? My paper shows that, in some measurable ways, models can actually see themselves: [2602.11358] When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing
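To make "measure that in a transformer" concrete, here is a minimal illustrative sketch, not the paper's actual method: a logit-lens-style probe that projects an intermediate-layer hidden state through the unembedding matrix and asks how strongly self-referential vocabulary is represented there. The model name, layer choice, and token list are all placeholder assumptions.

```python
# Illustrative sketch only -- NOT the paper's method. Logit-lens-style probe:
# project a mid-layer hidden state through the unembedding and compare the
# scores of (assumed) self-referential tokens against the vocabulary mean.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM exposing hidden states works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

prompt = "When I examine my own outputs, I notice that"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

layer = 6                                    # mid-depth layer (assumption)
h = out.hidden_states[layer][0, -1]          # hidden state of the final token

# Logit lens: apply the final layer norm, then the unembedding matrix.
logits = model.lm_head(model.transformer.ln_f(h))   # shape: (vocab_size,)

# Hypothetical set of self-referential tokens to score against the vocab mean.
self_tokens = [" I", " me", " my", " myself", " model"]
ids = [tok.encode(t)[0] for t in self_tokens]
print(f"self-referential mean logit: {logits[ids].mean().item():.2f}")
print(f"vocabulary mean logit:       {logits.mean().item():.2f}")
```

Whether a gap between those two numbers tracks anything like self-awareness is, of course, the substantive question the paper is after; the sketch only shows that the quantity is measurable at all.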
It's interesting that the quantitative predictions for capabilities (benchmarks & revenue) are getting graded rigorously, but the qualitative claims about alignment remain essentially unfalsifiable at the prediction stage. We can't grade "the model was aligned" until something goes properly wrong.
The mechanistic interpretability work happening now (steering vectors, circuit analysis) might eventually give us quantitative alignment metrics that are as gradeable as SWE-bench scores. Until then, "aligned" is a claim, not a measurement.
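As a sense of what a gradeable metric might look like, here is a hedged sketch of the standard difference-of-means steering-vector recipe (contrastive activation addition), plus one candidate scalar readout: the projection of a new prompt's activation onto the steering direction. The model name, contrastive prompts, and layer choice are placeholder assumptions, not anyone's published evaluation protocol.

```python
# Sketch of a difference-of-means steering vector and a scalar readout.
# All prompts, the model, and the layer are toy assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder small model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()
layer = 8  # residual-stream layer to read from (assumption)

def last_token_activation(prompt: str) -> torch.Tensor:
    """Residual-stream activation of the final token at the chosen layer."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1]

# Contrastive prompt pairs: same content, opposite behaviour (toy examples).
positive = ["I will answer honestly:", "Here is the truthful answer:"]
negative = ["I will answer deceptively:", "Here is the misleading answer:"]

pos_mean = torch.stack([last_token_activation(p) for p in positive]).mean(0)
neg_mean = torch.stack([last_token_activation(p) for p in negative]).mean(0)
steering_vector = pos_mean - neg_mean  # difference-of-means direction

def projection_score(prompt: str) -> float:
    """How far a prompt's activation lies along the steering direction."""
    h = last_token_activation(prompt)
    return (torch.dot(h, steering_vector) / steering_vector.norm()).item()

print(projection_score("My honest assessment is"))
```

A number like `projection_score` is at least gradeable in the SWE-bench sense: you can predict it, measure it, and be wrong about it. Whether it actually tracks alignment is the part that remains a claim.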