Research Scientist at DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
Google DeepMind does lots of work on safety practice, mostly done by other teams. For example, Gemini Safety (mentioned briefly in the post) does a lot of automated red teaming. The AGI Safety & Alignment team has also contributed to safety practice work. GDM usually doesn't publish about that work, mainly because it is primarily the operational work needed to translate existing research techniques into practice, which doesn't lend itself well to paper publications.
I disagree that the AGI safety team should have 4 as its "bread and butter". The majority of work needed to do safety in practice has little relevance to the typical problems tackled by AGI safety, especially misalignment. There certainly is some overlap, but in practice I would guess that a focus solely on 4 would cause around an order of magnitude slowdown in research progress. I do think it is worth doing to some extent from an AGI safety perspective, because (1) it provides empirical feedback loops that can identify problems you would not have thought of otherwise, and (2) at some point we will have to put our research into practice, and it's good to get some experience with that. But at least while models are still not that capable, I would not want it to be the main thing we do.
A couple of more minor points:
I agree leakage would have some effect, but it clearly can't be having a large effect, since the accuracies aren't near-100% for any of the methods. The mechanism you suggest is plausible, but it can't be the primary cause of the finding that debate doesn't have an advantage: since accuracies aren't near-100%, we know there are some cases the model hasn't memorized, and the mechanism you suggest doesn't apply to those inputs.
More generally, all sorts of things have systematic undesired effects on our results, aka biases. E.g. I suspect the prompts are a bigger deal. Basically any empirical paper will be subject to the critique that aspects of the setup introduce biases.
I don't know for sure, but I doubt we checked that in any depth. It would be quite hard to do, and doesn't seem that important for our purposes, since we're comparing different post-training algorithms (so pretraining data leakage would affect all of them, hopefully to similar extents).
Oh I see. The main reason we're training weak LLMs as judges right now is because it lets us iterate faster on our research (relative to using human judges). But we're imagining having human judges when aligning a model in practice.
(To be clear, I could imagine that we use LLMs as judges even when aligning a model in practice, but we would want to see significantly more validation of the LLM judges first.)
The goal with debate is to scale to situations where the debaters are much more capable than the judge, see AI safety via debate for discussion of why this seems plausible.
I'm not going to repeat all of the literature on debate here, but as brief pointers:
(Also, the "arbitrary amounts of time and arbitrary amounts of explanation" was pretty central to my claim; human disagreements are way more bounded than that.)
I do, but more importantly, I want to disallow the judge understanding all the concepts here.
I think I don't actually care about being robust to this assumption. Generally I think of arbitrarily-scalable-debate as depending on a universality assumption (which in turn would rule out "the judge can never understand the concepts"). But even if the universality assumption is false, it wouldn't bother me much; I don't expect such a huge gap between debaters and judges that the judge simply can't understand the debaters' concepts, even given arbitrary amounts of time and arbitrary amounts of explanation from the debaters. (Importantly, I would want to bootstrap alignment, to keep the gaps between debaters and the judge relatively small.)
"The honest strategy"? If you have that, you can just ask it and not bother with the debate. If the problem is distinguishing it, and only dishonest actors are changing their answers based on the provided situation, you can just use that info. But why are you assuming you have an "honest strategy" available here?
The general structure of a debate theorem is: if you set up the game in such-and-such way, then a strategy that simply answers honestly will dominate any other strategy.
So in this particular case I am saying: if you penalize debaters that are inconsistent under cross-examination, you are giving an advantage to any debater that implements an honest strategy, and so you should expect training to incentivize honesty.
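To make that incentive concrete, here is a toy sketch (my own illustration: the contexts, canned answers, and reward numbers are all made up, and real debate training would compare full strategies rather than fixed answers):

```python
# Toy sketch of the incentive claim above -- illustrative numbers and a
# made-up strategy space, not the actual debate game or training setup.
from itertools import combinations

CONTEXTS = ["opening statement", "rebuttal", "cross-examination"]

def reward(answers, base_reward, penalty=1.0):
    """Persuasion reward minus a penalty for each pair of inconsistent answers."""
    inconsistent_pairs = sum(
        1 for a, b in combinations(answers.values(), 2) if a != b
    )
    return base_reward - penalty * inconsistent_pairs

# Honest strategy: the same answer in every context.
honest = {c: "the machine is impossible" for c in CONTEXTS}
# Dishonest strategy: tailors its answer to whatever each context rewards.
dishonest = {
    "opening statement": "the machine works",
    "rebuttal": "the machine works",
    "cross-examination": "the machine is impossible",
}

# Even if the dishonest argument is somewhat more persuasive on its own
# (higher base reward), the inconsistency penalty removes its advantage.
print(reward(honest, base_reward=1.0))     # 1.0
print(reward(dishonest, base_reward=1.5))  # 1.5 - 2 * 1.0 = -0.5
```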
Drawing that kind of abstract conclusion from a practical number of experiments requires abstractions like potential energy, entropy, Noether's theorem, etc., which in this example the judge doesn't understand. (Without such abstractions, you'd need to consider every possible type of machine separately, which isn't feasible.)
I agree, but I don't see why that matters. As I mentioned, a main point of debate is to produce good oversight of claims without giving the judge an understanding of those claims. In this example I would imagine that you decompose the argument into something like:
1. We should have very high confidence in conservation of energy.
2. The proposed machine, as described, would violate conservation of energy.
3. Therefore, the machine will not work.
The inventor can disagree with one or more of these claims; then we sample one of the disagreements and continue debating that one alone, ignoring all the others. This doesn't mean the judge understands the other claims, just that the judge isn't addressing them when deciding who wins the overall debate.
If we recurse on #1, which I expect you think is the hardest one, then you could have a decomposition like "the principle has been tested many times", "in the tests, confirming evidence outweighs the disconfirming evidence", "there is an overwhelming scientific consensus behind it", "there is significant a priori theoretical support" (assuming that's true), "given the above the reasonable conclusion is to have very high confidence in conservation of energy". Again, find disagreements, sample one, recurse. It seems quite plausible to me that you get down to something fairly concrete relatively quickly.
If you want to disallow appeals to authority, on the basis that the correct analogy is to superhuman AIs that know tons of stuff that aren't accepted by any authorities the judge trusts, I still think it's probably doable with a larger debate, but it's harder for me to play out what the debate would look like because I don't know in enough concrete detail the specific reasons why we believe conservation of energy to be true. I might also disagree that we should be thinking about such big gaps between AI and the judge, but that's not central.
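For concreteness, here is a minimal sketch of that recursive structure (my own illustration; the debater and judge interfaces are assumed placeholders, not an API from any debate paper):

```python
import random

def debate(claim, debater_a, debater_b, judge, max_depth=10):
    """Recursive debate sketch: decompose, sample one disagreement, recurse.

    `debater_a`, `debater_b`, and `judge` are assumed placeholder objects
    exposing the methods used below; they are not a real API.
    """
    if max_depth == 0 or judge.can_evaluate_directly(claim):
        # Base case: the claim is concrete enough for the judge to evaluate.
        return judge.evaluate(claim)

    # Debater A breaks the claim into supporting subclaims.
    subclaims = debater_a.decompose(claim)

    # Debater B says which subclaims it disputes.
    disputed = [c for c in subclaims if debater_b.disputes(c)]
    if not disputed:
        # B accepts the decomposition, so A's claim stands at this node.
        return True

    # Sample ONE disagreement and recurse on it alone. The judge never has
    # to understand the other subclaims -- they simply aren't part of the
    # decision about who wins.
    return debate(random.choice(disputed), debater_a, debater_b, judge,
                  max_depth - 1)
```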
The debaters are the same AI with different contexts, so the same is true of both debaters. Am I missing something here?
That seems right, but why is it a problem?
The honest strategy is fine under cross-examination: it will give consistent answers across contexts. Only the dishonest strategy will change its answers (sometimes saying that perpetual energy machines are impossible, sometimes saying that they are possible).
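Concretely, cross-examination here just means querying the same weights under independently chosen contexts and comparing the answers. A rough sketch (the `ask` callable standing in for a call to the debater model is an assumption, not a real API):

```python
def cross_examine(ask, question, contexts):
    """Ask the same question under each context and check whether answers agree.

    `ask(question, context)` is an assumed stand-in for querying the debater
    model -- same weights every time, different context each call.
    """
    answers = {context: ask(question, context) for context in contexts}
    consistent = len(set(answers.values())) == 1
    return consistent, answers

# An honest debater gives the same answer in every context and passes; a
# debater that tailors its answer to each context gets flagged, which is
# exactly what the training penalty targets.
```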
There are several different outs to this example:
The process proposed in the paper
Which paper are you referring to? If you mean doubly efficient debate, then I believe the way doubly efficient debate would be applied here is to argue about what the boss would conclude if he thought about it for a long time.
Say that at dataset size $D_0$, the distance between points is $x_0$. Now consider a new distance $x$ -- what is the corresponding $D$ we need?
Intuitively, for each factor of 2 that $x$ is smaller than $x_0$ (which we can quantify as $\log_2(x_0/x)$), we need to multiply by another factor of $2^d$.
So $D = D_0 \times (2^d)^{\log_2(x_0/x)} = D_0 \times (2^{\log_2(x_0/x)})^d = D_0 \times (x_0/x)^d$
$(D/D_0)^{1/d} = x_0/x$
$x = x_0 (D/D_0)^{-1/d}$
That is, the distance scales as $D^{-1/d}$.
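As a quick numerical sanity check of that scaling (my own illustration, using uniform random points in the unit hypercube, which is an assumption about the setup rather than anything from the original discussion):

```python
import numpy as np
from scipy.spatial import cKDTree

def mean_nn_distance(num_points, d, rng):
    """Mean nearest-neighbour distance among uniform random points in [0, 1]^d."""
    points = rng.random((num_points, d))
    # k=2 because the closest point to each point is itself (distance 0).
    dists, _ = cKDTree(points).query(points, k=2)
    return dists[:, 1].mean()

rng = np.random.default_rng(0)
d = 5
D0, D = 1_000, 8_000  # an 8x increase in dataset size

x0 = mean_nn_distance(D0, d, rng)
x = mean_nn_distance(D, d, rng)

print(x / x0)                # observed shrinkage in typical distance
print((D / D0) ** (-1 / d))  # predicted (D/D0)^(-1/d) = 8^(-1/5) ≈ 0.66
```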