Is AI Self-Aware? Mechanistic Evidence from Activation Steering
It's not exactly the hard problem of consciousness. But are models self-aware? And how would you even measure that in a transformer? My paper shows that, in some measurable ways, models can actually see themselves: [2602.11358] When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing
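To make "measure that in a transformer" concrete, here is a minimal illustrative sketch, not the paper's actual method: a logit-lens-style probe that projects an intermediate-layer hidden state through the unembedding matrix and asks how strongly self-referential vocabulary is represented there. The model name, layer choice, and token list are all placeholder assumptions.

```python
# Illustrative sketch only -- NOT the paper's method. Logit-lens-style probe:
# project a mid-layer hidden state through the unembedding and compare the
# scores of (assumed) self-referential tokens against the vocabulary mean.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM exposing hidden states works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

prompt = "When I examine my own outputs, I notice that"
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

layer = 6                                    # mid-depth layer (assumption)
h = out.hidden_states[layer][0, -1]          # hidden state of the final token

# Logit lens: apply the final layer norm, then the unembedding matrix.
logits = model.lm_head(model.transformer.ln_f(h))   # shape: (vocab_size,)

# Hypothetical set of self-referential tokens to score against the vocab mean.
self_tokens = [" I", " me", " my", " myself", " model"]
ids = [tok.encode(t)[0] for t in self_tokens]
print(f"self-referential mean logit: {logits[ids].mean().item():.2f}")
print(f"vocabulary mean logit:       {logits.mean().item():.2f}")
```

Whether a gap between those two numbers tracks anything like self-awareness is, of course, the substantive question the paper is after; the sketch only shows that the quantity is measurable at all.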
It's interesting that the quantitative predictions for capabilities (benchmarks & revenue) are getting graded rigorously, but the qualitative claims about alignment remain essentially unfalsifiable at the prediction stage. We can't grade "the model was aligned" until something goes properly wrong.
The mechanistic interpretability work happening now (steering vectors, circuit analysis) might eventually give us quantitative alignment metrics that are as gradeable as SWE-bench scores. Until then, "aligned" is a claim, not a measurement.
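As a sense of what a gradeable metric might look like, here is a hedged sketch of the standard difference-of-means steering-vector recipe (contrastive activation addition), plus one candidate scalar readout: the projection of a new prompt's activation onto the steering direction. The model name, contrastive prompts, and layer choice are placeholder assumptions, not anyone's published evaluation protocol.

```python
# Sketch of a difference-of-means steering vector and a scalar readout.
# All prompts, the model, and the layer are toy assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder small model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()
layer = 8  # residual-stream layer to read from (assumption)

def last_token_activation(prompt: str) -> torch.Tensor:
    """Residual-stream activation of the final token at the chosen layer."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1]

# Contrastive prompt pairs: same content, opposite behaviour (toy examples).
positive = ["I will answer honestly:", "Here is the truthful answer:"]
negative = ["I will answer deceptively:", "Here is the misleading answer:"]

pos_mean = torch.stack([last_token_activation(p) for p in positive]).mean(0)
neg_mean = torch.stack([last_token_activation(p) for p in negative]).mean(0)
steering_vector = pos_mean - neg_mean  # difference-of-means direction

def projection_score(prompt: str) -> float:
    """How far a prompt's activation lies along the steering direction."""
    h = last_token_activation(prompt)
    return (torch.dot(h, steering_vector) / steering_vector.norm()).item()

print(projection_score("My honest assessment is"))
```

A number like `projection_score` is at least gradeable in the SWE-bench sense: you can predict it, measure it, and be wrong about it. Whether it actually tracks alignment is the part that remains a claim.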