Tom Lieberum — LessWrong

Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases

If paraphrasing was a bijection between semantically-equivalent sentences, then I agree you could learn the structure in the bijection space. Is that what you were pointing at?

Yes exactly. I further claim that non-chunk paraphrasing seems more likely to be closer to the bijection end of the spectrum than chunk paraphrasing, though I agree that it could still be a many-to-one mapping (i.e. standardization).

I am especially concerned about this when training on the final answer.

Maybe my question just boils down to:

How do you distinguish between "paraphrasing didn't break the syntactical <> semantic link" and "the model isn't relying on that link / the link doesn't exist in the first place"?

Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases

Tom Lieberum9moΩ230

Great experiments!

Questions:

To clarify, when distilling did you also train the model to predict the final answer or only the CoT? I assume the latter but it wasn't 100% clear from the text.
What's the variance on the pass@1 performance? (e.g. via bootstrap sampling) You say the performance of "chunked paraphrasing" is worse, but the degradation looks pretty minor and without knowing the variance it's hard to know whether this is a meaningful difference.
"We find that the pass@1 results are worse, especially for small models." Do you mean large models?
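To make the variance question concrete, here is a minimal sketch of a percentile bootstrap for pass@1. The per-problem outcome lists are made-up placeholders, not numbers from the post, and `bootstrap_pass_at_1` is a hypothetical helper name.

```python
import random

def bootstrap_pass_at_1(outcomes, n_resamples=2000, seed=0):
    """Return (point estimate, lower, upper) for pass@1 with a 95% percentile bootstrap.

    `outcomes` is a list of 0/1 values, one per problem (1 = first sample correct).
    """
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample problems with replacement and recompute the mean.
        resample = [outcomes[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return sum(outcomes) / n, lower, upper

# Hypothetical outcomes, purely for illustration.
baseline = [1] * 70 + [0] * 30
chunked = [1] * 66 + [0] * 34

for name, outcomes in [("baseline", baseline), ("chunked", chunked)]:
    point, lower, upper = bootstrap_pass_at_1(outcomes)
    print(f"{name}: pass@1 = {point:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

If the two intervals overlap heavily, the gap between the settings is hard to distinguish from noise at that sample size.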

Notes:

I'm somewhat confused why you claim that paraphrasing the whole CoT (i.e. non-chunked) breaks the causal link between syntax(t=0) and semantics(t=1) if you then train on the CoT. I'll try to paraphrase (ha!) what's going on in my own words:

Let x be the original CoT and f(x) the paraphrase. Let y be the correct final answer. Then f(x) might still contain the same structure as x, e.g. mutual information between X and f(X) is high. If you now trained a model to predict $y | f(x)$, I would guess it will learn to pick up on any present steganographic structure that is predictive of y.

If you trained a model to only predict the CoT, i.e. f(x, t=1) | f(x, t=0), I would still guess the model will learn to pick up on the structure if it is present and helpful. For example, suppose the model previously always put an equation after an enumerated list; it would now do so after a bulleted list, because f converted all enumerations into bullet points.
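A toy sketch of the enumeration-to-bullets point, under strong simplifying assumptions: each CoT step is reduced to a single "syntax" label at t=0 and a single "content" label at t=1, and all labels and data are invented for illustration. A consistent paraphraser relabels the syntax but leaves its mutual information with the next step intact; only a paraphraser that picks the syntax independently of the input destroys it.

```python
import math
import random
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information (in bits) between the two slots of each pair."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return sum(
        (c / n) * math.log2(c * n / (left[a] * right[b]))
        for (a, b), c in joint.items()
    )

# Hypothetical toy data: (syntax at t=0, content at t=1).
# An equation always follows an enumerated list.
original = [("enumerated", "equation")] * 1000 + [("prose", "text")] * 1000

# Consistent paraphrase: every enumeration becomes a bullet list.
consistent_map = {"enumerated": "bullets", "prose": "prose"}
consistent = [(consistent_map[s], c) for s, c in original]

# Incoherent paraphrase: the syntax is chosen at random, ignoring the input.
rng = random.Random(0)
styles = ["enumerated", "bullets", "prose"]
scrambled = [(rng.choice(styles), c) for _, c in original]

mi_original = mutual_information(original)
mi_consistent = mutual_information(consistent)
mi_scrambled = mutual_information(scrambled)
print(mi_original, mi_consistent, mi_scrambled)
```

The first two values are both exactly 1 bit, and the third is close to zero, which is the sense in which everything hinges on f choosing incoherent syntax.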

So this seems to hinge on f choosing incoherent syntax, e.g. what you observe in the chunk-paraphrase setting, but it's not clear to me that this is the behavior we should expect from the non-chunked paraphraser.

JumpReLU SAEs + Early Access to Gemma 2 SAEs

Tom Lieberum1y10

We use 1024, though often article snippets are shorter than that so they are separated by BOS.

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning

Tom Lieberum1y20

Cool work!

Did you run an ablation on the auxiliary losses for and $Q_{recon} \cdot K_{clean}$? How important was that to stabilize training?

Did you compare to training separate Q and K SAEs via typical reconstruction loss? Would be cool to see a side-by-side comparison, i.e. how large the benefit of this scheme is.

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Tom Lieberum2y20

During parts of the project I had the hunch that some letter-specialized heads are more like proto-correct-letter-heads (see paper for details), based on their attention pattern. We never investigated this, and I think it could go either way. The "it becomes cleaner" intuition basically relies on stuff like the grokking work and other work showing representations being refined late during training, by Tishby et al. I believe (and maybe other work). However, some of this would probably require randomising e.g. the labels the model sees during training. See e.g. Cammarata et al., Understanding RL Vision: if you only ever see the second choice be labeled with B, you don't have an incentive to distinguish between "look for B" and "look for the second choice". Lastly, even in the limit of infinite training data you still have limited model capacity and so will likely use a distributed representation in some way, but maybe you could at least get human-interpretable features even if they are distributed.

We Found An Neuron in GPT-2

Tom Lieberum3y10

Yup! I think that'd be quite interesting. Is there any work on characterizing the embedding space of GPT2?

We Found An Neuron in GPT-2

Tom Lieberum3y50

Nice work, thanks for sharing! I really like the fact that the neurons seem to upweight different versions of the same token (_an, _An, an, An, etc.). It's curious because the semantics of these tokens can be quite different (compared to the though, tho, however neuron).

Have you looked at all into what parts of the model feed into (some of) the cleanly associated neurons? It was probably out of scope for this but just curious.

Tracr: Compiled Transformers as a Laboratory for Interpretability | DeepMind

Tom Lieberum3yΩ110

(The quote refers to the usage of binary attention patterns in general, so I'm not sure why you're quoting it)

I obv agree that if you take the softmax over {0, 1000, 2000}, you will get 0 and 1 entries.
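For concreteness, a quick numeric check of the softmax claim, written as a plain-Python sketch with the usual max-subtraction for numerical stability (the `softmax` helper is defined here, not taken from tracr):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

pattern = softmax([0, 1000, 2000])
print(pattern)  # [0.0, 0.0, 1.0]
```

In floating point the two smaller entries underflow to exactly zero, so the resulting pattern is effectively binary.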

iiuc, the statement in the tracr paper is not that you can't have attention patterns which implement this logical operation, but that you can't have a single head implementing this attention pattern (without exponential blowup)

Tracr: Compiled Transformers as a Laboratory for Interpretability | DeepMind

Tom Lieberum3yΩ11-2

I don't think that's right. Iiuc this is a logical and, so the values would be in {0, 1} (as required, since tracr operates with Boolean attention). For a more extensive discussion of the original problem see appendix C.

[Interim research report] Taking features out of superposition with sparse autoencoders

Tom Lieberum3y20

Meta-q: Are you primarily asking for better assumptions or that they be made more explicit?

I would be most interested in an explanation for the assumption that is grounded in the distribution you are trying to approximate. It's hard to tell which parts of the assumptions are bad without knowing (which properties of) the distribution it's trying to approximate or why you think that the true distribution has property XYZ.

Re MLPs: I agree that we ideally want something general but it looks like your post is evidence that something about the assumptions is wrong and doesn't transfer to MLPs, breaking the method. So we probably want to understand better what about the assumptions don't hold there. If you have a toy model that better represents the true dist then you can confidently iterate on methods via the toy model.

Undertrained autoencoders

I was actually thinking of the LM when writing this but yeah the autoencoder itself might also be a problem. Great to hear you're thinking about that.
