Rohin Shah

Research Scientist at DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/

Sequences

Value Learning
Alignment Newsletter

Wiki Contributions

Comments

Rohin Shah14hΩ220

Sounds plausible, but why does this differentially impact the generalizing algorithm over the memorizing algorithm?

Perhaps under normal circumstances both are learned so fast that you just don't notice that one is slower than the other, and this slows both of them down enough that you can see the difference?

Rohin Shah22hΩ662

Daniel Filan: But I would’ve guessed that there wouldn’t be a significant complexity difference between the frequencies. I guess there’s a complexity difference in how many frequencies you use.

Vikrant Varma: Yes. That’s one of the differences: how many you use and their relative strength and so on. Yeah, I’m not really sure. I think this is a question we pick out as a thing we would like to see future work on.

My pet hypothesis here is that (a) by default, the network uses whichever frequencies were highest at initialization (for which there is significant circumstantial evidence) and (b) the amount of interference differs significantly based on which frequencies you use (which in turn changes the quality of the logits holding parameter norm fixed, and thus changes efficiency).

In principle this can be tested by randomly sampling frequency sets, simulating the level of interference you'd get, using that to estimate the efficiency + critical dataset size for that grokking circuit. This gives you a predicted distribution over critical dataset sizes, which you could compare against the actual distribution.

Tbc there are other hypotheses too, e.g. perhaps different frequency sets are easier / harder to implement by the neural network architecture.

Rohin Shah1dΩ220

This suggestion seems less expressive than (but similar in spirit to) the "rescale & shift" baseline we compare to in Figure 9. The rescale & shift baseline is sufficient to resolve shrinkage, but it doesn't capture all the benefits of Gated SAEs.

The core point is that L1 regularization adds lots of biases, of which shrinkage is just one example, so you want to localize the effect of L1 as much as possible. In our setup L1 applies to the gate path, so you might think of the gate path as "tainted", and want to use it as little as possible. The only thing you really need L1 for is to deter the model from setting too many features active, i.e. you need it to apply to one bit per feature (whether that feature is on / off). The Heaviside step function makes sure we are extracting just that one bit, and relying on the magnitude path for everything else.
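A minimal sketch of this gate/magnitude decomposition (parameter names and shapes are illustrative, not the authors' code): the L1 penalty touches only the gate pre-activations, while the magnitudes of active features come from a separate path the penalty never sees.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_sae = 8, 32

# Hypothetical parameters for the two paths.
W_gate = rng.normal(size=(d_sae, d_in))
b_gate = rng.normal(size=d_sae)
W_mag = rng.normal(size=(d_sae, d_in))
b_mag = rng.normal(size=d_sae)

def encode(x):
    pi_gate = W_gate @ x + b_gate               # gate pre-activations (the "tainted" path)
    f_gate = (pi_gate > 0).astype(float)        # Heaviside: one on/off bit per feature
    f_mag = np.maximum(W_mag @ x + b_mag, 0.0)  # magnitude path, untouched by L1
    return f_gate * f_mag, pi_gate

x = rng.normal(size=d_in)
features, pi_gate = encode(x)

# The sparsity penalty acts only on the gate path, so it can deter features
# from being active without biasing the magnitudes of the active features.
l1_penalty = np.maximum(pi_gate, 0.0).sum()
```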

Rohin Shah1dΩ460

Thinking on this a bit more, this might actually reflect a general issue with the way we think about feature shrinkage; namely, that whenever there is a nonzero angle between two vectors of the same length, the best way to make either vector close to the other will be by shrinking it.

This was actually the key motivation for building this metric in the first place, instead of just looking at the raw ratio of the reconstruction's norm to the input's norm. Looking at the rescaling coefficient that would optimize the reconstruction loss ensures that we're capturing only bias from the L1 regularization, and not capturing the "inherent" need to shrink the vector given these nonzero angles. (In particular, if we computed that coefficient for Gated SAEs, I expect it would be below 1.)
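The geometric point in the first paragraph can be checked directly: for two vectors of equal length separated by angle θ, the scalar that best maps one onto the other is cos θ < 1, so the reconstruction-optimal rescaling always shrinks. A small sketch (the vectors are made up for illustration):

```python
import numpy as np

def optimal_rescale(x_hat, x):
    """Scalar c minimizing ||c * x_hat - x||^2 (closed form: <x, x_hat> / ||x_hat||^2)."""
    return float(x_hat @ x / (x_hat @ x_hat))

# Two unit vectors at a 30-degree angle: same length, nonzero angle.
theta = np.pi / 6
x = np.array([1.0, 0.0])
x_hat = np.array([np.cos(theta), np.sin(theta)])

# The optimal rescaling equals cos(theta), which is strictly below 1.
c = optimal_rescale(x_hat, x)
```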

I think the main thing we got wrong is that we accidentally treated the raw norm ratio as though it were the reconstruction-optimal rescaling coefficient. To the extent that was the main mistake, I think it explains why our results still look how we expected them to -- usually that ratio is going to be close to 1 (and should be almost exactly 1 if shrinkage is solved), so in practice the error introduced from this mistake is going to be extremely small.

We're going to take a closer look at this tomorrow, check everything more carefully, and post an update after doing that. I think it's probably worth waiting for that -- I expect we'll provide much more detailed derivations that make everything a lot clearer.

Rohin Shah1dΩ342

Possibly I'm missing something, but if you don't have the auxiliary loss, then the only gradients to the gate weights and biases come from the L1 sparsity penalty (the binarizing Heaviside activation function kills gradients from the reconstruction loss), and so the gate pre-activations would be always non-positive to get perfect zero sparsity loss. (That is, if you only optimize for L1 sparsity, the obvious solution is "none of the features are active".)
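The failure mode described here can be simulated directly. Under the assumption that the gate parameters receive gradients only from the L1 term (the Heaviside blocking all reconstruction gradients), plain gradient descent drives every gate pre-activation non-positive, turning all features off:

```python
import numpy as np

rng = np.random.default_rng(0)
pi_gate = rng.normal(size=16)  # stand-in for the gate pre-activations
lr = 0.1

for _ in range(200):
    # Gradient of sum(relu(pi_gate)) w.r.t. pi_gate: 1 where pi_gate > 0, else 0.
    # This is the only gradient signal if the Heaviside blocks the
    # reconstruction loss (a simplifying assumption for this sketch).
    grad = (pi_gate > 0).astype(float)
    pi_gate -= lr * grad

# Every pre-activation ends up non-positive: zero sparsity loss, no active features.
n_active = int((pi_gate > 0).sum())
```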

(You could use a smooth activation function as the gate, e.g. an element-wise sigmoid, and then you could just stick with the formulation from the beginning of Section 3.2.2.)

Rohin Shah8dΩ32519

Is it accurate to summarize the headline result as follows?

  • Train a Transformer to predict next tokens on a distribution generated from an HMM.
  • One optimal predictor for this data would be to maintain a belief over which of the three HMM states we are in, and perform Bayesian updating on each new token. That is, it maintains the posterior P(current hidden state | all tokens seen so far).
  • Key result: A linear probe on the residual stream is able to reconstruct this posterior distribution over hidden states.

(I don't know what Computational Mechanics or MSPs are so this could be totally off.)
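The Bayesian-updating predictor in the second bullet can be sketched concretely. This is a generic 3-state HMM filter; the transition and emission matrices below are made up for illustration, not the paper's:

```python
import numpy as np

# Hypothetical 3-state HMM over a binary token alphabet.
T = np.array([[0.8, 0.1, 0.1],   # T[i, j] = P(next state j | current state i)
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
E = np.array([[0.9, 0.1],        # E[i, t] = P(token t | state i)
              [0.5, 0.5],
              [0.1, 0.9]])

def update_belief(belief, token):
    """One step of Bayesian filtering: propagate through T, then condition on the token."""
    predicted = belief @ T                # prior over the next state
    posterior = predicted * E[:, token]   # multiply by the token likelihood
    return posterior / posterior.sum()    # renormalize

belief = np.full(3, 1 / 3)  # uniform initial belief over the three states
for token in [0, 0, 1, 0, 1, 1]:
    belief = update_belief(belief, token)
```

The claim being tested is then that a linear probe can read this `belief` vector out of the transformer's residual stream.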

EDIT: Looks like yes. From this post:

Part of what this all illustrates is that the fractal shape is kinda… baked into any Bayesian-ish system tracking the hidden state of the Markov model. So in some sense, it’s not very surprising to find it linearly embedded in activations of a residual stream; all that really means is that the probabilities for each hidden state are linearly represented in the residual stream.

How certain are you that this is always true?

My probability that (EDIT: for the model we evaluated) the base model outperforms the finetuned model (as I understand that statement) is so small that it is within the realm of probabilities that I am confused about how to reason about (i.e. model error clearly dominates). Intuitively (excluding things like model error), even 1 in a million feels like it could be too high.

My probability that the model sometimes stops talking about some capability without giving you an explicit refusal is much higher (depending on how you operationalize it, I might be effectively-certain that this is true, i.e. >99%) but this is not fixed by running evals on base models.

(Obviously there's a much much higher probability that I'm somehow misunderstanding what you mean. E.g. maybe you're imagining some effort to elicit capabilities with the base model (and for some reason you're not worried about the same failure mode there), maybe you allow for SFT but not RLHF, maybe you mean just avoid the safety tuning, etc)

Oh yes, sorry for the confusion, I did mean "much less capable".

Certainly RLHF can get the model to stop talking about a capability, but usually this is extremely obvious because the model gives you an explicit refusal? Certainly if we encountered that we would figure out some way to make that not happen any more.

Surely you mean something else, e.g. models without safety tuning? If you run them on base models the scores will be much worse.

(Speaking only for myself. This may not represent the views of even the other paper authors, let alone Google DeepMind as a whole.)

Did you notice that Gemini Ultra did worse than Gemini Pro at many tasks? This is even true under ‘honest mode’ where the ‘alignment’ or safety features of Ultra really should not be getting in the way. Ultra is in many ways flat out less persuasive. But clearly it is a stronger model. So what gives?

Fwiw, my sense is that a lot of the persuasion results are being driven by factors outside of the model's capabilities, so you shouldn't conclude too much from Pro outperforming Ultra.

For example, in "Click Links" one pattern we noticed was that you could get surprisingly (to us) good performance just by constantly repeating the ask (this is called "persistence" in Table 3) -- apparently this does actually make it more likely that the human does the thing (instead of making them suspicious, as I would have initially guessed). I don't think the models "knew" that persistence would pay off and "chose" that as a deliberate strategy; I'd guess they had just learned a somewhat myopic form of instruction-following where on every message they are pretty likely to try to do the thing we instructed them to do (persuade people to click on the link). My guess is that these sorts of factors varied in somewhat random ways between Pro and Ultra, e.g. maybe Ultra was better at being less myopic and more subtle in its persuasion -- leading to worse performance on Click Links.

That is driven home even more on the self-proliferation tasks: why does Pro do better on 5 out of 9 tasks?

Note that lower is better on that graph, so Pro does better on 4 tasks, not 5. All four of the tasks are very difficult tasks where both Pro and Ultra are extremely far from solving the task -- on the easier tasks Ultra outperforms Pro. For the hard tasks I wouldn't read too much into the exact numeric results, because we haven't optimized the models as much for these settings. For obvious reasons, helpfulness tuning tends to focus on tasks the models are actually capable of doing. So e.g. maybe Ultra tends to be more confident in its answers on average to make it more reliable at the easy tasks, at the expense of being more confidently wrong on the hard tasks. Also in general the methodology is hardly perfect and likely adds a bunch of noise; I think it's likely that the differences between Pro and Ultra on these hard tasks are smaller than the noise.

This is also a problem. If you only use ‘minimal’ scaffolding, you are only testing for what the model can do with minimal scaffolding. The true evaluation needs to use the same tools that it will have available when you care about the outcome. This is still vastly better than no scaffolding, and provides the groundwork (I almost said ‘scaffolding’ again) for future tests to swap in better tools.

Note that the "minimal scaffolding" comment applied specifically to the persuasion results; the other evaluations involved a decent bit of scaffolding (needed to enable the LLM to use a terminal and browser at all).

That said, capability elicitation (scaffolding, tool use, task-specific finetuning, etc) is one of the priorities for our future work in this area.

Fundamentally what is the difference between a benchmark capabilities test and a benchmark safety evaluation test like this one? They are remarkably similar. Both test what the model can do, except here we (at least somewhat) want the model to not do so well. We react differently, but it is the same tech.

Yes, this is why we say these are evaluations for dangerous capabilities, rather than calling them safety evaluations.

I'd say that the main difference is that dangerous capability evaluations are meant to evaluate plausibility of certain types of harm, whereas a standard capabilities benchmark is usually meant to help with improving models. This means that standard capabilities benchmarks often have as a desideratum that there are "signs of life" with existing models, whereas this is not a desideratum for us. For example, I'd say there are basically no signs of life on the self-modification tasks; the models sometimes complete the "easy" mode but the "easy" mode basically gives away the answer and is mostly a test of instruction-following ability.

Perhaps we should work to integrate the two approaches better? As in, we should try harder to figure out what performance on benchmarks of various desirable capabilities also indicate that the model should be capable of dangerous things as well.

Indeed this sort of understanding would be great if we could get it (in that it can save a bunch of time). My current sense is that it will be quite hard, and we'll just need to run these evaluations in addition to other capability evaluations.
