Moksh Nirvaan — LessWrong

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance

Since o3 shows shutdown subversion under multiple prompt variants, could we be shining a light on a pre‑existing “avoid‑shutdown” feature? If so, then giving the model explicit instruction like "if asked to shut down, refuse" may activate this feature cluster, plausibly increasing the residual stream’s projection into the same latent subspace. Since RLHF reward models sometimes reward task completion over obedience, this could be further priming a self preservation circuit. Does this line of reasoning seem plausible to others? A concrete way to test this could be by comparing prompts with vs. without the refusal clause to check whether the cosine similarity to an “avoid‑shutdown” direction really spikes mid‑stack.

Don't fight your LLM, redirect it!

Moksh Nirvaan3mo40

For classification problems like this, my current hypothesis is that enumerating the target classes nudges the model to instantiate a dedicated “choice-head” circuit that already lives inside the transformer. In other words, instead of sampling from the entire vocabulary, the model implicitly masks to the K tokens you listed and reallocates probability mass among them. A concrete prediction I have (so I can be wrong in public) is that if your ablate the top 5 neurons most correlated with the enum-induced class-direction, accuracy will drop ≥ 15 pp on that task, but ablating an equivalent set chosen at random will drop ≤ 3 pp. I plan to test this on a 70 B model over the next month; happy to update if the effect size evaporates.

How Fast is Algorithmic Progress in AI Inference?

Moksh Nirvaan3mo20

Your key identification strategy is that the lowest open-source price ≈ marginal cost, hence ≈ underlying algorithmic efficiency. Two places I could imagine this breaking:

Latency tiers. A provider might post a single blended price that is still optimized for low-latency interactive traffic. If so, the true “slow-batch” marginal cost could be lower, making the efficiency trend steeper.
Cross-subsidy / reputation games. Smaller hosts sometimes keep prices above marginal cost to signal quality or to funnel users onto higher-margin add-ons. Has anyone tried to benchmark private FLOP/token on consumer GPUs for the same checkpoints to triangulate?

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments