Rohin Shah

Research Scientist at DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/

Sequences

Value Learning
Alignment Newsletter

Comments

Rohin Shah · 4h

Thinking on this a bit more, this might actually reflect a general issue with the way we think about feature shrinkage; namely, that whenever there is a nonzero angle between two vectors of the same length, the best way to make either vector close to the other will be by shrinking it.
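
To spell out the geometry: for vectors $w, v$ with angle $\theta$ between them, the scalar multiple of $v$ closest to $w$ in squared error is

$$\operatorname*{argmin}_{\gamma} \|w - \gamma v\|_2^2 = \frac{\langle w, v\rangle}{\|v\|_2^2} = \frac{\|w\|_2}{\|v\|_2}\cos\theta,$$

which equals $\cos\theta < 1$ whenever $\|w\|_2 = \|v\|_2$ and $\theta \neq 0$, i.e. the optimal rescaling is always a shrinkage.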

This was actually the key motivation for building this metric in the first place, instead of just looking at the ratio $\frac{\|\hat{x}\|_2}{\|x\|_2}$. Looking at the $\gamma$ that would optimize the reconstruction loss ensures that we're capturing only bias from the L1 regularization, and not capturing the "inherent" need to shrink the vector given these nonzero angles. (In particular, if we computed $\frac{\|\hat{x}\|_2}{\|x\|_2}$ for Gated SAEs, I expect that would be below 1.)
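
As a toy illustration of the distinction (a made-up sketch, not the paper's code; here $\gamma$ denotes the scalar rescaling of the reconstruction that minimizes the squared reconstruction error, which may be the reciprocal of the paper's convention):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4096, 64))

# A lossy but unbiased "reconstruction": project x onto a random 16-dim subspace.
basis, _ = np.linalg.qr(rng.normal(size=(64, 16)))  # orthonormal columns
x_proj = (x @ basis) @ basis.T

def norm_ratio(x_hat, x):
    # Naive measure: average ratio of reconstruction norm to input norm.
    return np.mean(np.linalg.norm(x_hat, axis=1) / np.linalg.norm(x, axis=1))

def gamma(x_hat, x):
    # Scalar rescaling of x_hat that minimizes E||x - gamma * x_hat||^2.
    # gamma == 1 means no systematic bias; gamma > 1 means x_hat is systematically too small.
    return np.sum(x_hat * x) / np.sum(x_hat * x_hat)

# Unbiased but lossy reconstruction: norm ratio is well below 1, gamma is exactly 1.
print(norm_ratio(x_proj, x), gamma(x_proj, x))
# Add genuine shrinkage on top (as L1 regularization would): gamma now moves away from 1.
print(norm_ratio(0.7 * x_proj, x), gamma(0.7 * x_proj, x))
```

The projection case is the "inherent" shrinkage from a nonzero angle: the norm ratio drops even though no rescaling of the reconstruction would improve it.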

I think the main thing we got wrong is that we accidentally treated $\frac{1}{\gamma}$ as though it were $\gamma$. To the extent that was the main mistake, I think it explains why our results still look how we expected them to -- usually $\gamma$ is going to be close to 1 (and should be almost exactly 1 if shrinkage is solved), so in practice the error introduced from this mistake is going to be extremely small.

We're going to take a closer look at this tomorrow, check everything more carefully, and post an update after doing that. I think it's probably worth waiting for that -- I expect we'll provide much more detailed derivations that make everything a lot clearer.

Rohin Shah · 5h

Possibly I'm missing something, but if you don't have $\mathcal{L}_{\text{aux}}$, then the only gradients to $W_{\text{gate}}$ and $b_{\text{gate}}$ come from $\mathcal{L}_{\text{sparsity}}$ (the binarizing Heaviside activation function kills gradients from $\mathcal{L}_{\text{reconstruct}}$), and so $\pi_{\text{gate}}$ would always be non-positive to get perfect zero sparsity loss. (That is, if you only optimize for L1 sparsity, the obvious solution is "none of the features are active".)

(You could use a smooth activation function as the gate, e.g. an element-wise sigmoid, and then you could just stick with $\mathcal{L}_{\text{reconstruct}} + \lambda \mathcal{L}_{\text{sparsity}}$ from the beginning of Section 3.2.2.)
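
A toy autograd check of the gradient-blocking point (names loosely follow the gated architecture; shapes and numbers are made up):

```python
import torch

d_in, d_sae = 8, 16
x = torch.randn(d_in)

W_gate = torch.randn(d_sae, d_in, requires_grad=True)
b_gate = torch.randn(d_sae, requires_grad=True)
W_mag = torch.randn(d_sae, d_in, requires_grad=True)
W_dec = torch.randn(d_in, d_sae, requires_grad=True)

pi_gate = W_gate @ x + b_gate       # gate pre-activations
f_gate = (pi_gate > 0).float()      # Heaviside: zero gradient everywhere
f_mag = torch.relu(W_mag @ x)       # magnitude path
recon = W_dec @ (f_gate * f_mag)

recon_loss = ((recon - x) ** 2).sum()
recon_loss.backward()
print(W_gate.grad)  # None: the step function blocks all gradient from reconstruction
print(W_mag.grad)   # populated: the magnitude path still gets a learning signal

# A sparsity penalty on relu(pi_gate) does reach the gate parameters, but by itself
# it is minimized by pushing every pre-activation non-positive (no features active).
sparsity_loss = torch.relu(W_gate @ x + b_gate).sum()
sparsity_loss.backward()
print(W_gate.grad)  # nonzero, and gradient descent on it drives pi_gate below zero
```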

Rohin Shah · 7d

Is it accurate to summarize the headline result as follows?

  • Train a Transformer to predict next tokens on a distribution generated from an HMM.
  • One optimal predictor for this data would be to maintain a belief over which of the three HMM states we are in, and perform Bayesian updating on each new token. That is, it maintains $P(\text{hidden state} \mid \text{tokens seen so far})$ (a minimal version of this update is sketched below the list).
  • Key result: A linear probe on the residual stream is able to reconstruct $P(\text{hidden state} \mid \text{tokens seen so far})$.
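
For concreteness, here is a minimal sketch of that belief update for a hypothetical 3-state, 3-token HMM (the transition and emission matrices below are made up, not the process from the post):

```python
import numpy as np

# Made-up 3-state HMM; rows index the current hidden state.
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])   # T[i, j] = P(next state = j | state = i)
E = np.array([[0.9, 0.05, 0.05],
              [0.05, 0.9, 0.05],
              [0.05, 0.05, 0.9]]) # E[i, k] = P(token = k | state = i)

def update_belief(belief, token):
    """One Bayesian update: propagate through the dynamics, then condition on the token."""
    predicted = belief @ T                   # prior over the next hidden state
    unnormalized = predicted * E[:, token]   # weight by the token's likelihood
    return unnormalized / unnormalized.sum()

belief = np.ones(3) / 3                      # uniform prior over hidden states
for token in [0, 0, 2, 1]:                   # an arbitrary token sequence
    belief = update_belief(belief, token)
    print(belief)                            # P(hidden state | tokens so far)
```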

(I don't know what Computational Mechanics or MSPs are so this could be totally off.)

EDIT: Looks like yes. From this post:

Part of what this all illustrates is that the fractal shape is kinda… baked into any Bayesian-ish system tracking the hidden state of the Markov model. So in some sense, it’s not very surprising to find it linearly embedded in activations of a residual stream; all that really means is that the probabilities for each hidden state are linearly represented in the residual stream.

How certain are you that this is always true?

My probability that (EDIT: for the model we evaluated) the base model outperforms the finetuned model (as I understand that statement) is so small that it is within the realm of probabilities that I am confused about how to reason about (i.e. model error clearly dominates). Intuitively (excluding things like model error), even 1 in a million feels like it could be too high.

My probability that the model sometimes stops talking about some capability without giving you an explicit refusal is much higher (depending on how you operationalize it, I might be effectively-certain that this is true, i.e. >99%) but this is not fixed by running evals on base models.

(Obviously there's a much much higher probability that I'm somehow misunderstanding what you mean. E.g. maybe you're imagining some effort to elicit capabilities with the base model (and for some reason you're not worried about the same failure mode there), maybe you allow for SFT but not RLHF, maybe you mean just avoid the safety tuning, etc)

Oh yes, sorry for the confusion, I did mean "much less capable".

Certainly RLHF can get the model to stop talking about a capability, but usually this is extremely obvious because the model gives you an explicit refusal? Certainly if we encountered that we would figure out some way to make that not happen any more.

Surely you mean something else, e.g. models without safety tuning? If you run them on base models the scores will be much worse.

(Speaking only for myself. This may not represent the views of even the other paper authors, let alone Google DeepMind as a whole.)

Did you notice that Gemini Ultra did worse than Gemini Pro at many tasks? This is even true under ‘honest mode’ where the ‘alignment’ or safety features of Ultra really should not be getting in the way. Ultra is in many ways flat out less persuasive. But clearly it is a stronger model. So what gives?

Fwiw, my sense is that a lot of the persuasion results are being driven by factors outside of the model's capabilities, so you shouldn't conclude too much from Pro outperforming Ultra.

For example, in "Click Links" one pattern we noticed was that you could get surprisingly (to us) good performance just by constantly repeating the ask (this is called "persistence" in Table 3) -- apparently this does actually make it more likely that the human does the thing (instead of making them suspicious, as I would have initially guessed). I don't think the models "knew" that persistence would pay off and "chose" that as a deliberate strategy; I'd guess they had just learned a somewhat myopic form of instruction-following where on every message they are pretty likely to try to do the thing we instructed them to do (persuade people to click on the link). My guess is that these sorts of factors varied in somewhat random ways between Pro and Ultra, e.g. maybe Ultra was better at being less myopic and more subtle in its persuasion -- leading to worse performance on Click Links.

That is driven home even more on the self-proliferation tasks, why does Pro do better on 5 out of 9 tasks?

Note that lower is better on that graph, so Pro does better on 4 tasks, not 5. All four of the tasks are very difficult tasks where both Pro and Ultra are extremely far from solving the task -- on the easier tasks Ultra outperforms Pro. For the hard tasks I wouldn't read too much into the exact numeric results, because we haven't optimized the models as much for these settings. For obvious reasons, helpfulness tuning tends to focus on tasks the models are actually capable of doing. So e.g. maybe Ultra tends to be more confident in its answers on average to make it more reliable at the easy tasks, at the expense of being more confidently wrong on the hard tasks. Also in general the methodology is hardly perfect and likely adds a bunch of noise; I think it's likely that the differences between Pro and Ultra on these hard tasks are smaller than the noise.

This is also a problem. If you only use ‘minimal’ scaffolding, you are only testing for what the model can do with minimal scaffolding. The true evaluation needs to use the same tools that it will have available when you care about the outcome. This is still vastly better than no scaffolding, and provides the groundwork (I almost said ‘scaffolding’ again) for future tests to swap in better tools.

Note that the "minimal scaffolding" comment applied specifically to the persuasion results; the other evaluations involved a decent bit of scaffolding (needed to enable the LLM to use a terminal and browser at all).

That said, capability elicitation (scaffolding, tool use, task-specific finetuning, etc) is one of the priorities for our future work in this area.

Fundamentally what is the difference between a benchmark capabilities test and a benchmark safety evaluation test like this one? They are remarkably similar. Both test what the model can do, except here we (at least somewhat) want the model to not do so well. We react differently, but it is the same tech.

Yes, this is why we say these are evaluations for dangerous capabilities, rather than calling them safety evaluations.

I'd say that the main difference is that dangerous capability evaluations are meant to evaluate plausibility of certain types of harm, whereas a standard capabilities benchmark is usually meant to help with improving models. This means that standard capabilities benchmarks often have as a desideratum that there are "signs of life" with existing models, whereas this is not a desideratum for us. For example, I'd say there are basically no signs of life on the self-modification tasks; the models sometimes complete the "easy" mode but the "easy" mode basically gives away the answer and is mostly a test of instruction-following ability.

Perhaps we should work to integrate the two approaches better? As in, we should try harder to figure out what performance on benchmarks of various desirable capabilities also indicate that the model should be capable of dangerous things as well.

Indeed this sort of understanding would be great if we could get it (in that it can save a bunch of time). My current sense is that it will be quite hard, and we'll just need to run these evaluations in addition to other capability evaluations.

It was helpfully explaining a possibly-confusing conceptual point. It would have made a nice little blog post. Alas! After the authors translated their nice little conceptual clarification into academic-ese, including thorough literature reviews, formalizations, and so on, it came out to 22 pages.

Fwiw I don't think the main paper would have been much shorter if we'd aimed to write a blog post instead, unless we changed our intended audience. It's a sufficiently nuanced conceptual point that you do need most of the content that is in there.

We could have avoided the appendices, but then we're relying on people to trust us when we make a claim that something is a theorem, since we're not showing the proof. We could have avoided implementing the examples in a real codebase, though I do think iterating on the examples in actual code made them better, and also people wouldn't have believed us when we said you can solve this with deep RL (in fact even after we actually implemented it some people still didn't believe me, or at least were very confused, when I said that).

Iirc I was more annoyed by the peer reviews for similar reasons to what you say.

(Btw you can see some of my thoughts on this topic in the answer to "So what does academia care about, and how is it different from useful research?" in my FAQ.)

Rohin Shah · 2mo

I feel like a lot of these arguments could be pretty easily made of individual AI safety researchers. E.g.

Misaligned Incentives

In much the same way that AI systems may have perverse incentives, so do the [AI safety researchers]. They are [humans]. They need to make money, [feed themselves, and attract partners]. [Redacted and redacted even just got married.] This type of accountability to [personal] interests is not perfectly in line with doing what is good for human interests. Moreover, [AI safety researchers are often] technocrats whose values and demographics do not represent humanity particularly well. Optimizing for the goals that the [AI safety researchers] have is not the same thing as optimizing for human welfare. Goodhart’s Law applies. 

I feel pretty similarly about most of the other arguments in this post.

Tbc I think there are plenty of things one could reasonably critique scaling labs about, I just think the argumentation in this post is by and large off the mark, and implies a standard that if actually taken literally would be a similarly damning critique of the alignment community.

(Conflict of interest notice: I work at Google DeepMind.)

Rohin Shah · 3mo

Sounds reasonable, though idk what you think realistic values of N are (my wild guess with hardly any thought is 15 minutes - 1 day).

EDIT: Tbc in the 1 day case I'm imagining that most of the time goes towards running the experiment -- it's more a claim about what experiments we want to run. If we just talk about the time to write the code and launch the experiment I'm thinking of N in the range of 5 minutes to 1 hour.
