Measuring Latent Operator States in Transformer Mid-Layers
Summary

I report evidence that conversational “operators” (e.g. summarize, critique, reframe) correspond to stable, decodable internal states in transformer mid-layers, distinct from surface instruction wording and generalizing across content. These states are weak but persistent, geometrically separable via simple centroid methods, and survive instruction masking. I’m sharing this as a measurement result, not an intervention or architectural proposal, and I’m looking for feedback on interpretation and next experimental steps.

Motivation

We often talk informally about models entering different “reasoning modes” or “processing regimes,” but it’s unclear whether such modes correspond to measurable internal structure rather than prompt-level artifacts. This work explores whether operator-like distinctions can be detected as latent internal states, and where they localize in the network.

Setup

* Model: Qwen2.5-0.5B
* Task: Apply different conversational operators to diverse contents
* Capture: Fixed-window hidden states (32 generated tokens)
* Features: Directional deltas of hidden states, aggregated per layer
* Controls:
  * cross-topic generalization (LOCO)
  * cross-paraphrase generalization (LOPO)
  * instruction masking
  * label permutation baselines

Key Findings

1. Decodability: Operator identity is decodable above chance from mid-layer features.
2. Content Generalization: Signal survives leave-one-content-out evaluation.
3. Instruction Masking: Replacing operator instructions with a constant placeholder removes layer-0 signal but preserves mid-layer signal.
4. Geometry: Simple nearest-centroid classifiers on mid-layer features achieve F1 ≈ 0.61 (chance 0.33) on a partial dataset (N=600), indicating geometrically separable operator representations.
5. Layer Localization: Signal peaks in a mid-layer band and diminishes toward output layers.

What This Does Not Show

* No causal control or ste
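To make the evaluation protocol concrete, here is a minimal sketch of the nearest-centroid decoding with leave-one-content-out (LOCO) splits and a label-permutation baseline. All names and constants here are illustrative, and synthetic Gaussian features stand in for the real per-layer hidden-state deltas; the point is the split-and-baseline logic, not the numbers.

```python
# Hypothetical sketch: nearest-centroid decoding of operator identity
# with LOCO cross-validation and a label-permutation baseline.
# Synthetic features stand in for mid-layer hidden-state deltas.
import numpy as np

rng = np.random.default_rng(0)
N_OPS, N_CONTENTS, REPS, DIM = 3, 6, 10, 64

# Each operator gets a weak class direction plus isotropic noise,
# mimicking a "weak but persistent" separable signal.
op_dirs = rng.normal(size=(N_OPS, DIM))
X, y_op, y_content = [], [], []
for op in range(N_OPS):
    for content in range(N_CONTENTS):
        for _ in range(REPS):
            X.append(0.5 * op_dirs[op] + rng.normal(size=DIM))
            y_op.append(op)
            y_content.append(content)
X, y_op, y_content = map(np.asarray, (X, y_op, y_content))

def nearest_centroid_acc(Xtr, ytr, Xte, yte):
    # Fit one centroid per operator class, classify by nearest centroid.
    centroids = np.stack([Xtr[ytr == k].mean(axis=0) for k in range(N_OPS)])
    dists = np.linalg.norm(Xte[:, None, :] - centroids[None, :, :], axis=-1)
    return float((dists.argmin(axis=1) == yte).mean())

def loco_acc(labels):
    # Leave-one-content-out: hold out every example of one content item,
    # fit centroids on the remaining contents, average test accuracy.
    accs = []
    for c in range(N_CONTENTS):
        test = y_content == c
        accs.append(nearest_centroid_acc(X[~test], labels[~test],
                                         X[test], labels[test]))
    return float(np.mean(accs))

real = loco_acc(y_op)
perm = loco_acc(rng.permutation(y_op))  # label-permutation baseline
print(f"LOCO accuracy: {real:.2f}, permuted baseline: {perm:.2f}")
```

With real features the same harness applies per layer, which is what produces the layer-localization curve: accuracy (or F1) above the permuted baseline only in the mid-layer band. The permutation baseline should hover near chance (≈ 0.33 for three operators) regardless of layer.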