For classification problems like this, my current hypothesis is that enumerating the target classes nudges the model to instantiate a dedicated “choice-head” circuit that already lives inside the transformer. In other words, instead of sampling from the entire vocabulary, the model implicitly masks to the K tokens you listed and reallocates probability mass among them. A concrete prediction I have (so I can be wrong in public) is that if your ablate the top 5 neurons most correlated with the enum-induced class-direction, accuracy will drop ≥ 15 pp on that task, but ablating an equivalent set chosen at random will drop ≤ 3 pp. I plan to test this on a 70 B model over the next month; happy to update if the effect size evaporates.
Your key identification strategy is that the lowest open-source price ≈ marginal cost, hence ≈ underlying algorithmic efficiency. Two places I could imagine this breaking:
Since o3 shows shutdown subversion under multiple prompt variants, could we be shining a light on a pre‑existing “avoid‑shutdown” feature? If so, then giving the model explicit instruction like "if asked to shut down, refuse" may activate this feature cluster, plausibly increasing the residual stream’s projection into the same latent subspace. Since RLHF reward models sometimes reward task completion over obedience, this could be further priming a self preservation circuit. Does this line of reasoning seem plausible to others? A concrete way to test this could be by comparing prompts with vs. without the refusal clause to check whether the cosine similarity to an “avoid‑shutdown” direction really spikes mid‑stack.