I'm worried about AI models intentionally doing bad things, like sandbagging when doing safety research. In the regime where the AI has to take many of these bad actions in order to cause an unacceptable outcome, we have some hope of identifying examples of the AI doing the bad action (or at least of having some signal that distinguishes bad actions from good ones). Given such a signal, we could train the policy against it, and/or train a monitor (e.g., an activation probe) to flag suspicious actions so that we can block or audit them.
Basic learning theory background
Let's start with the following problem setup: there's a hypothesis class H with VC dimension d, inputs are drawn i.i.d. from some distribution, and the labels are given by some unknown function in H. After seeing N labeled samples we output a predictor, and we care about its expected error on fresh samples. The fundamental theorem of statistical learning says that, in this noiseless ("realizable") case, empirical risk minimization achieves expected error on the order of (d/N)·log N.
There's a related version of the problem where the labels are noisy. In the noisy case there is some amount of irreducible error that any predictor will incur; however, the expected error beyond that irreducible error is at most O(√(d/N)).
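Written out, the standard statements look like the following (these are textbook VC-theory bounds; constants and log factors vary across formulations):

```latex
% Realizable case: labels given by some h* in the class H, VC dimension d, N i.i.d. samples.
\mathbb{E}\big[\operatorname{err}(\hat{h}_{\mathrm{ERM}})\big] \;=\; O\!\left(\frac{d \log N}{N}\right)

% Noisy / agnostic case: excess error over the best predictor in H.
\mathbb{E}\big[\operatorname{err}(\hat{h}_{\mathrm{ERM}})\big] \;-\; \min_{h \in H}\operatorname{err}(h) \;=\; O\!\left(\sqrt{\frac{d}{N}}\right)
```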
SGD doesn't do optimal Bayesian learning, but these learning theory results seem like a reasonable first guess for sample efficiency of SGD.
Test case: Half space
As an example, let's apply this theorem to the problem of learning a half-space x ↦ 1(w⋅x ≥ 0). The VC dimension of half-spaces in R^k is k. Thus, we predict that our error should scale as O((k log N)/N).
However, the error appears to scale more like 1/√N. My current best guess is that this is actually the right answer for how well SGD does.
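Here's a minimal sketch of the kind of experiment I mean (the Gaussian data distribution, logistic loss, learning rate, and single-pass SGD are illustrative choices, not necessarily the exact setup behind the numbers above):

```python
import numpy as np

def halfspace_experiment(d=50, N=10_000, lr=0.1, seed=0):
    """Single-pass SGD on logistic loss, learning a random half-space x -> 1(w_true . x >= 0)."""
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=d)
    w = np.zeros(d)
    for _ in range(N):
        x = rng.normal(size=d)
        y = 1.0 if x @ w_true >= 0 else 0.0
        z = np.clip(w @ x, -30, 30)          # clip for numerical stability
        p = 1.0 / (1.0 + np.exp(-z))         # sigmoid prediction
        w -= lr * (p - y) * x                # SGD step on the logistic loss
    # Estimate 0-1 error on fresh samples.
    X_test = rng.normal(size=(20_000, d))
    y_test = X_test @ w_true >= 0
    y_pred = X_test @ w >= 0
    return float(np.mean(y_pred != y_test))

for N in [1_000, 4_000, 16_000, 64_000]:
    errs = [halfspace_experiment(N=N, seed=s) for s in range(5)]
    print(N, np.mean(errs))  # does the error fall like ~1/sqrt(N) or ~1/N?
```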
We should be much more sample efficient when the points aren't just separated by a half-space, but by a half-space with a margin. In particular, the error should go to zero after O(1) samples (where the constant depends on the margin, but not on the dimension).
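One classical result behind this intuition is the perceptron mistake bound: if every point has norm at most R and the data is linearly separable with margin γ, then the perceptron makes at most

```latex
\#\{\text{mistakes}\} \;\le\; \left(\frac{R}{\gamma}\right)^{2}
```

mistakes in total, regardless of the dimension k (the perceptron isn't exactly SGD on a standard loss, but the intuition carries over).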
More on gradient descent
Based on the half-space experiment and the VC theory theorems, I feel pretty good about using √(d/N) as a default model of how fast SGD will learn things.
There's some good theoretical justification for the 1/√N from convex optimization theory too. (While we're generally not interested in convex functions, this hopefully isn't a big issue; see discussion here by Paul.) Specifically, the result of interest is a regret bound from online convex optimization (OCO):
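A standard version of this bound, for online gradient descent over a convex set W of diameter D against convex losses f_t with gradient norms at most G (exact constants and assumptions vary across statements):

```latex
% Online gradient descent for N rounds with step size about D/(G*sqrt(N)):
\sum_{t=1}^{N} f_t(w_t) \;-\; \min_{w \in W} \sum_{t=1}^{N} f_t(w) \;\le\; O\!\left(G\,D\,\sqrt{N}\right)
```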
If we consider the case with a fixed function f_t, then error falling off like O(1/√N) matches regret growing as O(√N), because ∫ 1/√x dx ∝ √x. The OCO result is extra nice, though, because it applies to situations with distribution shift, unlike the previous learning-theoretic results, which were about learning static functions.
It's worth noting that in cases where the AI's performance is stochastic, it's not really meaningful to talk about (untargeted) error smaller than O(1/√N), because there is typically Ω(1/√N) deviation in the AI's performance between different trials.
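For example, if performance on each of the N steps is an independent coin flip with success probability p (a simplifying assumption), the run-to-run standard deviation of the measured success rate is

```latex
\operatorname{SD}\!\left(\frac{1}{N}\sum_{i=1}^{N} X_i\right) \;=\; \sqrt{\frac{p(1-p)}{N}} \;=\; \Omega\!\left(\frac{1}{\sqrt{N}}\right)
\quad \text{for } p \text{ bounded away from } 0 \text{ and } 1.
```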
Removing intentional bad behavior might be easier than this
I think that training against intentional bad behavior might go substantially better than the "√N bad actions" estimate suggested by the learning-theory bounds above (an error rate of ~1/√N over N steps works out to ~√N bad actions). I'll now list five reasons for my optimism.
1: It's an elicitation problem, not a learning problem
I'm optimistic because training against intentional bad behavior is an elicitation problem, not a learning problem (although some bad behavior, like exploration hacking, may cause the model to learn less than it otherwise would).
The model already knows how to not sabotage us. The problem (probably) isn't teaching the AI what we want; the problem is that the AI wants different things than we do. Thus, we might hope that the cognitions which are leading to low reward (I'll assume a perfect reward signal for now) will be rapidly removed by SGD, because there's a clear nearby policy which doesn't sabotage us as much: namely, a policy that's more conservative about attacking us, or one that doesn't attack at all. It also seems a priori likely that "sabotage intent" will be cleanly represented in the model's activations, which should help training latch onto it and remove (or detect) it.
Furthermore, empirical work studying elicitation found extremely high sample efficiency.
2: We can stack activation monitors and policy training
If we train both the policy and a monitor (and then rejection sample against the monitor), and their "errors" behave independently (a dubious assumption, but we could hope that the errors are at least somewhat decorrelated, and grant this for now) and are both small, then we essentially get to multiply their error rates. If you could obtain multiple decorrelated monitors, you could ensemble them into an extremely powerful monitor; ensembling again morally multiplies the error probabilities.
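As a toy calculation (with independence as the load-bearing and dubious assumption): if the trained policy attempts sabotage on a fraction p of steps and each of k decorrelated monitors misses an attempt with probability q_i, the rate of unflagged sabotage is roughly

```latex
\Pr[\text{unflagged sabotage per step}] \;\approx\; p \prod_{i=1}^{k} q_i,
\qquad \text{e.g. } p = 10^{-3},\; q_1 = q_2 = 10^{-2} \;\Rightarrow\; 10^{-7}.
```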
3: Weak to strong generalization
It's likely that our reward signal will be broken in some systematic ways. For instance, maybe there's one type of sabotage behavior that we are pretty bad at detecting. However, if the flaws in our reward signal are hard to model, then the best fit to the data (mapping activations to a "suspiciousness score") might just predict a high suspiciousness score for all sabotage behavior, including the kinds we can't detect, via weak-to-strong generalization, provided that sabotage is saliently represented.
4: Making more data
We could augment the examples we have found of the AI doing bad things, to get more training signal per discovered example.
5: Does the AI get to choose where its sabotage actions are?
It might be the case that, even though the AI could do a moderate amount of sabotage over the course of the deployment, some moments are much better sabotage opportunities than others. In that case, the more relevant quantity is how many bad actions the AI can take if it insists on choosing where to take them, and this might be much smaller than the original amount.
Reasons why I'm especially excited about "activation monitors"
Removing intentional bad behavior might be hard
Now I'll summarize some reasons why you might still worry that removing intentional bad behavior will be challenging.
1: Training against bad behavior means you get fewer examples of bad behavior, slowing down learning
Here's one model you might have of how training against bad behavior works: each time the AI takes a bad action during deployment, we catch it, add it to our training data, and train against it; and after being trained on B such examples, the AI's rate of bad behavior is some decreasing function r(B).
Given this model we can ask: over a deployment of length N, where at each step of the deployment we sample an action from the AI, what will the AI's rate of bad behavior be by the end of the deployment? It's not r(N), because we'll probably have far fewer than N examples of bad behavior. For instance, if r(B) = 1/√B, then it turns out that the AI's final attack rate in this setup would be about 1/N^(1/3). Note that this issue doesn't arise if you're only training a monitor and doing rejection sampling (and not training the policy).
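Spelling out that arithmetic (treating the number of collected bad-behavior examples B as growing continuously with the deployment step n):

```latex
\frac{dB}{dn} \;=\; r(B) \;=\; \frac{1}{\sqrt{B}}
\;\;\Longrightarrow\;\;
B(n) \;=\; \Theta\!\left(n^{2/3}\right)
\;\;\Longrightarrow\;\;
r(B(N)) \;=\; \frac{1}{\sqrt{B(N)}} \;=\; \Theta\!\left(N^{-1/3}\right).
```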
It's possible that we could also get around this issue via the (admittedly expensive) strategy of deploying both a trained and an untrained copy of the policy (the untrained copy keeps producing bad-behavior examples at the original rate, which we can use as training data).
2: Probes are brittle
Probes are easy to break. Training might also incentivize AI models to not saliently represent sabotage intent in their activations, which might make probes not work. For instance, in RL with a value head, the head gets to look at whether the activations say that sabotage is about to happen, and assign low value if so; this may incentivize the policy network to not represent sabotage intent clearly in its activations.
If scheming is a constant property of the AI model's cognition, then it's possible that inputs on which the AI is taking bad actions don't really look any different to a probe than inputs on which it isn't.
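For concreteness, the kind of probe being discussed is just a linear classifier on the model's internal activations. A minimal sketch (the file names, layer choice, train/test split, and use of logistic regression are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical cached data: residual-stream activations from some layer of the policy,
# shape (n_examples, d_model), plus binary labels for whether the trajectory contained
# an (attempted) bad action.
acts = np.load("activations.npy")
labels = np.load("sabotage_labels.npy")

n_train = int(0.8 * len(labels))
probe = LogisticRegression(max_iter=1000)
probe.fit(acts[:n_train], labels[:n_train])

# "Suspiciousness score" for held-out examples; in practice the threshold would be
# chosen to hit a target false-positive rate on known-benign data.
scores = probe.predict_proba(acts[n_train:])[:, 1]
print("mean score on held-out examples:", scores.mean())
```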
3: Long horizon length (or long CoT) exacerbates several issues
In long horizon tasks where the AI does a bunch of reasoning:
4: It's likely that our detectors will have many false positives
There will probably be lots of benign mistakes that get labelled as bad by our reward signal: it seems hard to discriminate sandbagging from generic mistakes. This might hurt sample efficiency a lot.
Conclusion
In this post I've given some theoretical thoughts on the sample efficiency of training for the purpose of removing intentional bad behavior. I plan to release some empirical research on this topic soon. I think that, between this post and that work, we have a slightly better understanding of what training slack should look like, which seems somewhat helpful for threat modeling and for guiding our expectations when doing future research on training-based control.