Rohin Shah

Research Scientist at DeepMind. Creator of the Alignment Newsletter.


Value Learning
Alignment Newsletter



I agree Eliezer's writing often causes people to believe incorrect things and there are many aspects of his discourse that I wish he'd change, including some of the ones you highlight. I just want to push back on the specific critique of "there are no coherence theorems".

(In fact, I made this post because I too previously believed incorrect things along these lines, and those incorrect beliefs were probably downstream of arguments made by Eliezer or MIRI, though it's hard to say exactly what the influences were.)

Rohin Shah

"nevertheless, many important and influential people in the AI safety community have mistakenly and repeatedly promoted the idea that there are such theorems."

I responded on the EA Forum version, and my understanding was written up in this comment.

TL;DR: EJT and I both agree that the "mistake" EJT is talking about is that when providing an informal English description of various theorems, the important and influential people did not state all the antecedents of the theorems.

Unlike EJT, I think this is totally fine as a discourse norm, and should not be considered a "mistake". I also think the title "there are no coherence theorems" is hyperbolic and misleading, even though it is true for a specific silly definition of "coherence theorem".

Fwiw I'm also skeptical of how much we can conclude from these evals, though I think they're way above the bar for "worthwhile to report".

Another threat model you could care about (within persuasion) is targeted recruitment for violent ideologies. With that one too it's plausible you'd want a more targeted eval, though I think simplicity, generality, and low cost are also reasonable things to optimize for in evals.

Good point; this makes it clearer that "deployment" means external deployment by default. But level 2 only mentions "internal access of the critical capability," which sounds like it's about misuse — I'm more worried about AI scheming and escaping when the lab uses AIs internally to do AI development.

You're right: our deployment mitigations are targeted at misuse only because our current framework focuses on misuse. As we note in the "Future work" section, we would need to do more work to address risks from misaligned AI. We focused on risks from deliberate misuse initially because they seemed more likely to us to appear first.

E.g., much more of the action is in deciding exactly who to influence and what to influence them to do.

Are you thinking specifically of exfiltration here?

Persuasion can be used for all sorts of things if you are considering both misuse and misalignment, so if you are considering a specific threat model, I expect my response will be "sure, but there are other threat models where the 'who' and 'what' can be done by humans".

Rohin Shah

Thanks for the detailed critique – I love that you actually read the document in detail. A few responses on particular points:

The document doesn't specify whether "deployment" includes internal deployment.

Unless otherwise stated, "deployment" to us means external deployment – because this is the way most AI researchers use the term. Deployment mitigations level 2 discusses the need for mitigations on internal deployments. ML R&D will require thinking about internal deployments (and so will many of the other CCLs).

Some people get unilateral access to weights until the top level. This is disappointing. It's been almost a year since Anthropic said it was implementing two-party control, where nobody can unilaterally access the weights.

I don't think Anthropic meant to claim that two-party control would achieve this property. I expect anyone using a cloud compute provider is trusting that the provider will not access the model, not securing it against such unauthorized access. (In principle some cryptographic schemes could allow you to secure model weights even from your cloud compute provider, but I highly doubt people are doing that, since it is very expensive.)

Mostly they discuss developers' access to the weights. This is disappointing. It's important but lots of other stuff is important too.

The emphasis on weights access isn’t meant to imply that other kinds of mitigations don’t matter. We focused on what it would take to increase our protection against exfiltration. A lot of the example measures discussed in the RAND interim report aren’t discussed because we already do them. For example, Google already does the following from RAND Level 3: (a) develop an insider threat program and (b) deploy advanced red-teaming. (That’s not meant to be exhaustive; I don’t personally know the details here.)

No mention of evals during deployment (to account for improvements in scaffolding, prompting, etc.).

Sorry, that's just poor wording on our part -- "every 3 months of fine-tuning progress" was meant to capture that as well. Thanks for pointing this out!

Talking about plans like this is helpful. But with no commitments, DeepMind shouldn't get much credit.

With the FSF, we prefer to try it out for a while and iron out any issues, particularly since the science is in early stages, and best practices will need to evolve as we learn more. But as you say, we are running evals even without official FSF commitments, e.g. the Gemini 1.5 tech report has dangerous capability evaluation results (see Section 9.5.2). 

Given recent updates in AGI safety overall, I'm happy that GDM and Google leadership takes commitments seriously and thinks carefully about which ones they are and are not willing to make, including the FSF, the White House Commitments, etc.


Rohin Shah

It's interesting to look back at this question 4 years later; I think it's a great example of the difficulty of choosing the right question to forecast in the first place.

I think it is still pretty unlikely that the criterion I outlined is met -- Q2 on my survey still seems like a bottleneck. I doubt that AGI researchers would talk about instrumental convergence in the kind of conversation I outlined. But reading the motivation for the question, it sure seems like a question that reflected the motivation well would have resolved yes by now (probably some time in 2023), given the current state of discourse and the progress in the AI governance space. (Though you could argue that the governance space is still primarily focused on misuse rather than misalignment.)

I did quite deliberately include Q2 in my planned survey -- I think it's important that the people whom governments defer to in crafting policy understand the concerns, rather than simply voicing support. But I failed to notice that it is quite plausible (indeed, the default) for there to be a relatively small number of experts who understand the concerns in enough depth to produce good advice on policy, plus a large base of "voicing support" from other experts who don't have that same deep understanding. This means that it's very plausible that the fraction defined in the question never gets anywhere close to 0.5, but nonetheless the AI community "agrees on the risk" to a sufficient degree that governance efforts do end up in a good place.

Rohin Shah

Because I don't think this is realistically useful, I don't think this at all reduces my probability that your techniques are fake and your models of interpretability are wrong.

Maybe the groundedness you're talking about comes from the fact that you're doing interp on a domain of practical importance?

??? Come on, there's clearly a difference between "we can find an Arabic feature when we go looking for anything interpretable" vs "we chose from the relatively small set of practically important things and succeeded in doing something interesting in that domain". I definitely agree this isn't yet close to "doing something useful, beyond what well-tuned baselines can do". But this should presumably rule out some hypotheses that current interpretability results are due to an extreme streetlight effect?

(I suppose you could have already been 100% confident that results so far weren't the result of extreme streetlight effect and so you didn't update, but imo that would just make you overconfident in how good current mech interp is.)

(I'm basically saying similar things as Lawrence.)

Rohin Shah

Sounds plausible, but why does this differentially impact the generalizing algorithm over the memorizing algorithm?

Perhaps under normal circumstances both are learned so fast that you just don't notice that one is slower than the other, and this slows both of them down enough that you can see the difference?

Rohin Shah

Daniel Filan: But I would’ve guessed that there wouldn’t be a significant complexity difference between the frequencies. I guess there’s a complexity difference in how many frequencies you use.

Vikrant Varma: Yes. That’s one of the differences: how many you use and their relative strength and so on. Yeah, I’m not really sure. I think this is a question we pick out as a thing we would like to see future work on.

My pet hypothesis here is that (a) by default, the network uses whichever frequencies were highest at initialization (for which there is significant circumstantial evidence) and (b) the amount of interference differs significantly based on which frequencies you use (which in turn changes the quality of the logits holding parameter norm fixed, and thus changes efficiency).

In principle this can be tested by randomly sampling frequency sets, simulating the level of interference you'd get, using that to estimate the efficiency + critical dataset size for that grokking circuit. This gives you a predicted distribution over critical dataset sizes, which you could compare against the actual distribution.
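As a rough illustration of the first two steps (sampling frequency sets and simulating interference), here is a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the discussion above: the modulus `p = 113`, the choice of five frequencies, the idealized logit `sum_k cos(2*pi*k*d/p)` for answer offset `d`, and the use of the gap between the correct logit and the largest incorrect logit as a proxy for interference. The function names `logit_margin` and `sample_margins` are hypothetical. It does not attempt the last step of turning margins into efficiency or critical dataset size estimates.

```python
import math
import random


def logit_margin(freqs, p):
    """Gap between the correct logit and the largest incorrect logit.

    Assumption: an idealized circuit whose logit for answer offset d
    (d = 0 is the correct answer) is sum_k cos(2*pi*k*d/p).
    A smaller gap is treated here as more interference.
    """
    correct = float(len(freqs))  # every cosine equals 1 at d = 0
    worst_wrong = max(
        sum(math.cos(2 * math.pi * k * d / p) for k in freqs)
        for d in range(1, p)
    )
    return correct - worst_wrong


def sample_margins(p=113, n_freqs=5, n_samples=200, seed=0):
    """Sample random frequency sets and return one margin per set."""
    rng = random.Random(seed)
    # k and p - k give the same cosine, so only sample from the lower half.
    candidates = range(1, (p - 1) // 2 + 1)
    return [
        logit_margin(rng.sample(candidates, n_freqs), p)
        for _ in range(n_samples)
    ]
```

Calling `sample_margins()` returns a list of margins whose spread shows how much this crude interference proxy varies across frequency sets. Comparing against real runs would still require some model, not given here, that maps a margin to efficiency and then to a critical dataset size.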

Tbc there are other hypotheses too, e.g. perhaps different frequency sets are easier / harder to implement by the neural network architecture.
