Jason Gross — LessWrong

On Targeted Manipulation and Deception when Optimizing LLMs for User Feedback

The model learns to act harmfully for vulnerable users while harmlessly for the evals.

If you run the evals in the context of gameable users, do they show harmfulness? (Are the evals cheap enough to run that the marginal cost of running them every N modifications to memory for each user separately is feasible?)

Are we dropping the ball on Recommendation AIs?

Jason Gross1yΩ130

I believe the closest research to this topic is under the heading "Performative Power" (cf, e.g., this arXiv paper). I think "The Age of Surveillance Capitalism: The Fight for a Human Future at the New Frontier of Power" by Shoshana Zuboff is also a pretty good book that seems related.

A simple model of math skill

Jason Gross1y30

The reason you can't sample uniformly from the integers is more like "because they are not compact" or "because they are not bounded" than "because they are infinite and countable". You also can't sample uniformly at random from the reals. (If you could, then composing with floor would give you a uniformly random sample from the integers.)

If you want to build a uniform probability distribution over a countable set of numbers, aim for all the rationals in [0, 1].

Lucius Bushnaq's Shortform

Jason Gross1y30

I don't want a description of every single plate and cable in a Toyota Corolla, I'm not thinking about the balance between the length of the Corolla blueprint and its fidelity as a central issue of interpretability as a field.

What I want right now is a basic understanding of combustion engines.

This is the wrong 'length'. The right version of brute-force length is not "every weight and bias in the network" but "the program trace of running the network on every datapoint in pretrain". Compressing the explanation (not just the source code) is the thing connected to understanding. This is what we found from getting formal proofs of model behavior in Compact Proofs of Model Performance via Mechanistic Interpretability.

Does the 17th-century scholar have the requisite background to understand the transcript of how bringing the metal plates in the spark plug close enough together results in the formation of a spark? And how gasoline will ignite and expand? I think given these two building blocks, a complete description of the frame-by-frame motion of the Toyota Corolla would eventually convince the 17th-century scholar that such motion is possible, and what remains would just be fitting the explanation into their head all at once. We already have the corresponding building blocks for neural nets: floating point operations.

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Jason Gross1yΩ110

[Lucius] Identify better SAE sparsity penalties by reasoning about the distribution of feature activations
In sparse coding, one can derive what prior over encoded variables a particular sparsity penalty corresponds to. E.g. an L1 penalty assumes a Laplacian prior over feature activations, while a log(1+a^2) would assume a Cauchy prior. Can we figure out what distribution of feature activations over the data we’d expect, and use this to derive a better sparsity penalty that improves SAE quality?

This is very interesting! What prior does log(1+|a|) correspond to? And what about using instead of $\sum_{i} log (1 + | a_{i} |)$ ? Does this only hold if we expect feature activations to be independent (rather than, say, mutually exclusive)?

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team

Jason Gross1yΩ110

[Nix] Toy model of feature splitting
There are at least two explanations for feature splitting I find plausible:
Activations exist in higher dimensional manifolds in feature space, feature splitting is a symptom of one higher dimensional mostly-continuous feature being chunked into discrete features at different resolutions.
There is a finite number of highly-related discrete features that activate on similar (but not identical) inputs and cause similar (but not identical) output actions. These can be summarized as a single feature with reasonable explained variance, but is better summarized as a collection of “split” features.

These do not sound like different explanations to me. In particular, the distinction between "mostly-continuous but approximated as discrete" and "discrete but very similar" seems ill-formed. All features are in fact discrete (because floating point numbers are discrete) and approximately continuous (because we posit that replacing floats with reals won't change the behavior of the network meaningfully).

As far as toy models go, I'm pretty confident that the max-of-K setup from Compact Proofs of Model Performance via Mechanistic Interpretability will be a decent toy model. If you train SAEs post-unembed (probably also pre-unembed) with width d_vocab, you should find one feature for each sequence maximum (roughly). If you train with SAE width , I expect each feature to split into roughly ${d_vocab}^{2} n_ctx$ features corresponding to the choice of query token, largest non-max token, and the number of copies of the maximum token. (How the SAE training data is distributed will change what exact features (principal directions of variation) are important to learn.). I'm quite interested in chatting with anyone working on / interested in this, and I expect my MATS scholar will get to testing this within the next month or two.

Edit: I expect this toy model will also permit exploring:

[Lee] Is there structure in feature splitting?
Suppose we have a trained SAE with N features. If we apply e.g. NMF or SAEs to these directions are there directions that explain the structure of the splitting? As in, suppose we have a feature for math and a feature for physics. And suppose these split into (among other things)
'topology in a math context'
'topology in a physics context'
'high dimensions in a math context'
'high dimensions in a physics context'
Is the topology-ifying direction the same for both features? Is the high-dimensionifying direction the same for both features? And if so, why did/didn't the original SAEs find these directions?

I predict that whether or not the SAE finds the splitting directions depends on details about how much non-sparsity is penalized and how wide the SAE is. Given enough capacity, the SAE benefits (sparsity-wise) from replacing the (topology, math, physics) features with (topology-in-math, topology-in-physics), because split features activate more sparsely. Conversely, if the sparsity penalty is strong enough and there is not enough capacity to split, the loss recovered from having a topology feature at all (on top of the math/physics feature) may not outweigh the cost in sparsity.

Transformer Circuit Faithfulness Metrics Are Not Robust

Jason Gross1y20

Resample ablation is not more expensive than mean (they both are just replacing activations with different values). But to answer the question, I think you would - resample ablation biases the model toward some particular corrupt output.

Ah, I guess I was incorrectly imagining a more expensive version of resample ablation where you looked at not just a single corrupted cache, but looking at the result across all corrupted inputs. That is, in the simple toy model where you're computing where $x$ is the values for the circuit you care about and $y$ is the cache of corrupted activations, mean ablation is computing $f (x, E_{y \sim D} y)$ , and we could imagine versions of resample ablation that are computing $f (x, y)$ for some $y$ drawn from $D$ , or we could compute $E_{y \sim D} f (x, y)$ . I would say that both mean ablation and resample ablation as I'm imagining you're describing it are both attempts to cheaply approximate $E_{y \sim D} f (x, y)$ .

Transformer Circuit Faithfulness Metrics Are Not Robust

Jason Gross1y*32

But in other aspects there often isn't a clearly correct methodology. For example, it's unclear whether mean ablations are better than resample ablations for a particular experiment - even though this choice can dramatically change the outcome.

Would you ever really want mean ablation except as a cheaper approximation to resample ablation?

It seems to me that if you ask the question clearly enough, there's a correct kind of ablation. For example, if the question is "how do we reproduce this behavior from scratch", you want zero ablation.

Your table can be reorganized into the kinds of answers you're seeking, namely:

direct effect vs indirect effect corresponds to whether you ablate the complement of the circuit (direct effect) vs restoring the circuit itself (indirect effect, mediated by the rest of the model)
necessity vs sufficiency corresponds to whether you ablate the circuit (direct effect necessary) / restore the complement of the circuit (indirect effect necessary) vs restoring the circuit (indirect effect sufficient) / ablating the complement of the circuit (direct effect sufficient)
typical case vs worst case, and over what data distribution:
- "all tokens vs specific tokens" should be absorbed into the more general category of "what's the reference dataset distribution under consideration" / "what's the null hypothesis over",
- zero ablation answers "reproduce behavior from scratch"
- mean ablation is an approximation to resample ablation which itself is an approximation to computing the expected/typical behavior over some distribution
- pessimal ablation is for dealing with worst-case behaviors
granularity and component are about the scope of the solution language, and can be generalized a bit

Edit: This seems related to Hypothesis Testing the Circuit Hypothesis in LLMs

Transformer Circuit Faithfulness Metrics Are Not Robust

Jason Gross1y10

Do you want your IOI circuit to include the mechanism that decides it needs to output a name? Then use zero ablations. Or do you want to find the circuit that, given the context of outputting a name, completes the IOI task? Then use mean ablations. The ablation determines the task.

Mean ablation over webtext rather than the IOI task set should work just as well as zero ablation, right? "Mean ablation" is underspecified in the absence of a dataset distribution.

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Jason Gross1y10

it's substantially worth if we restrict

Typo: should be "substantially worse"

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments