Even without deception as an explicit goal, the model is always optimizing for acceptable outputs under observation while performing whatever internal computation is cheapest or most effective for it. That already creates pressure to separate surface plausibility from internal means. From that perspective, thinkish looks less like linguistic drift and more like an emergent covert channel between computation and action: models may be doing a form of steganography by default, at least to some degree. I’m curious how you see recent advances in steganography research informing this picture, especially as models become more situationally aware.
Do you have any ideas for how to get interventions for larger “k” to work better?
In my runs, small k already flips the decision boundary (sometimes k≈5 in late layers), so “going large” mostly comes up when the behavior is more distributed (e.g. safety/refusal dynamics). The main reason large-k patching degrades is that δ·grad is a first-order (local) guide: once you patch many neurons you’re no longer in the local regime, and effects stop adding up because of nonlinearities, normalization, and downstream compensation.
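To make that failure mode concrete, here’s a toy sketch (stand-in shapes, a tanh “downstream”, and a made-up metric, not my actual setup): it compares the first-order prediction Σᵢ δᵢ·gradᵢ against the actual metric change from patching the top-k neurons, and the gap widens as k grows.

```python
import torch

# Toy illustration: how far the first-order prediction sum_i(delta_i * grad_i)
# drifts from the actual effect of patching the top-k neurons as k grows.
# Shapes, the tanh "downstream", and the metric are all stand-ins.
torch.manual_seed(0)
d = 512
W = torch.randn(d, 1) / d ** 0.5        # stand-in for everything downstream

def metric(h):
    # scalar decision axis (think "logit difference"), deliberately nonlinear
    return (torch.tanh(h) @ W).squeeze()

h_dst = torch.randn(d, requires_grad=True)   # destination-prompt activations
h_src = torch.randn(d)                       # source-prompt activations

m = metric(h_dst)
grad = torch.autograd.grad(m, h_dst)[0]      # d(metric) / d(activation)
delta = h_src - h_dst.detach()               # activation difference
score = delta * grad                         # first-order attribution per neuron

order = score.abs().argsort(descending=True)
for k in (5, 50, 500):
    idx = order[:k]
    h_patch = h_dst.detach().clone()
    h_patch[idx] = h_src[idx]                        # patch the top-k neurons
    actual = (metric(h_patch) - m.detach()).item()   # real metric change
    predicted = score[idx].sum().item()              # what linearity would promise
    print(f"k={k:4d}  predicted={predicted:+.3f}  actual={actual:+.3f}")
```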
A couple ideas that might improve large-k stability (with caveats):
Alternatively, do you have any ideas for how to train a model such that the necessary k, for any given trait/personality/whatever, is small? So that all/each important behavior could be controlled with a small number of neurons.
I don’t have strong hands-on skills in training (and my focus here was: “given dense/polysemantic models, can we still do useful causal work?”), so take this as informed speculation. That said, I can think of a few directions: architectural modularity like MoE-style routing, explicit control channels like style/control tokens or adapters, or representation regularizers that encourage sparsity/disentanglement. There have already been efforts to make small-k control more plausible via training, e.g. OpenAI’s work on weight-sparse transformers, where most weights are forced to zero, yielding smaller, more disentangled circuits.
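To give the regularizer idea some shape, here’s a crude toy sketch assuming a plain PyTorch training step (the model, hook target, batch, and λ are stand-ins, not a tested recipe): an L1 penalty on hidden activations, nudging each behavior to load on fewer neurons.

```python
import torch
import torch.nn as nn

# Crude sketch of the "representation regularizer" idea: add an L1 penalty on
# hidden activations so each behavior tends to load on fewer neurons.
# The model, hook target, batch, and lambda are all toy stand-ins.
model = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
l1_lambda = 1e-3   # strength is very much a hyperparameter

hidden = {}
# grab the post-GELU activations on every forward pass
model[1].register_forward_hook(lambda mod, inp, out: hidden.update(h=out))

x = torch.randn(32, 256)   # stand-in batch
y = torch.randn(32, 256)   # stand-in targets

pred = model(x)
task_loss = nn.functional.mse_loss(pred, y)
sparsity_loss = hidden["h"].abs().mean()        # L1 on hidden activations
loss = task_loss + l1_lambda * sparsity_loss

opt.zero_grad()
loss.backward()
opt.step()
```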
Thanks, this is a fair push. What I’m claiming is narrower and (I think) more specific:
1. I’m not ranking by gradient alone, but by δ·gradient, where δ is the activation difference between two minimally different prompts (source vs. destination). In practice, this seems to filter out “globally high-leverage but context-invariant” neurons and concentrate on neurons that are both (i) causally connected to the chosen decision axis and (ii) actually participating in the specific context change (e.g., Dallas→SF). Empirically, lots of high-gradient neurons have tiny δ and don’t do the transfer; lots of high-δ neurons have tiny gradient and also don’t do the transfer.
2. What I find surprising is that δ·gradient doesn’t just find leverage: it disproportionately surfaces neurons whose activations are actually readable under a cross-prompt lens. After restricting to top δ·grad candidates, their within-prompt activation patterns look systematically “less hopeless” than random neurons. They’re not sparse in the SAE sense, but their peaks are sharper and more structured, and this becomes legible under cross-prompt comparisons. Quantitatively, in my probe set the mean peak |z| for top-ranked neurons is ~2× the global mean (≈+93%); the check is sketched below. Causal localization seems to act like a crude substitute for learning a sparse basis, at least for some role-like patterns.
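Concretely, the peak-|z| comparison looks roughly like this (random stand-in tensors and hypothetical shapes, just to show the statistic, not my probe set):

```python
import torch

# Toy version of the peak-|z| comparison: z-score each neuron's activation
# trace within a prompt, take its sharpest peak, and compare top delta*grad
# neurons against the rest. All tensors here are random stand-ins, so the two
# numbers come out roughly equal; in my runs the gap was ~2x.
torch.manual_seed(0)
n_neurons, n_tokens, top_k = 4096, 128, 50

score = torch.randn(n_neurons)            # delta * grad per neuron (stand-in)
acts = torch.randn(n_neurons, n_tokens)   # activations over one prompt (stand-in)

z = (acts - acts.mean(dim=1, keepdim=True)) / acts.std(dim=1, keepdim=True)
peak_z = z.abs().max(dim=1).values        # sharpest within-prompt peak per neuron

top = score.abs().argsort(descending=True)[:top_k]
rest = torch.ones(n_neurons, dtype=torch.bool)
rest[top] = False

print("top-ranked mean peak |z|:", peak_z[top].mean().item())
print("other neurons mean peak |z|:", peak_z[rest].mean().item())
```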
Also, I rewrote the first chunk of the TL;DR to make this point clearer (it was too easy to read it as “sorting by gradient finds big effects,” which isn’t what I meant).
Definitely worth spending at least a few minutes on each of these. This is the kind of information that ends up saving you hours of work over the coming months.
Coming fresh from the ARENA fellowship, I strongly feel that even a single dedicated day on tooling (possibly structured as an iterative, hands-on exercise) would pay off a lot. It would help anchor these references in memory, so they’re actually available when you need them later, rather than rediscovered ad hoc under time pressure.