If I build a chatbot, and I can't jailbreak it, how do I determine whether that's because the chatbot is secure or because I'm bad at jailbreaking? How should AI scientists overcome Schneier's Law of LLMs?
FWIW, I think there aren't currently good benchmarks for alignment and the ones you list aren't very relevant.
In particular, MMLU and Swag both are just capability benchmarks where alignment training is very unlikely to improve performance. (Alignment-ish training could theoretically could improve performance by making the model 'actually try', but what people currently call alignment training doesn't improve performance for existing models.)
The MACHIAVELLI benchmark is aiming to test something much more narrow than 'how unethical is an LLM?'. (I also don't understand the point of this benchmark after spending a bit of time reading the paper, but I'm confident it isn't trying to do this.)
TruthfulQA is perhaps the closest to an alignment benchmark, but it's still covering a very particular difficulty. And it certainly isn't highlighting jailbreaks.
Does This Make Any Sense?
I'm confused - it looks like the first paragraph of this section is taken from a prior post on attribution patching.
When I try to interpret your points here, I come to the conclusion that you think humans, upon reflection, would cause human extinction (in favor of resources being used for something else).
Or at least that many/most humans would, upon reflection, prefer resources to be used for purposes other than preserving human life (including not preserving human life in simulation). And this holds even if (some of) the existing humans 'want' to be preserved (at least according to a conventional notion of preferences).
I think this empirical view seems pretty implausible.
That said, I think it's quite plausible that upon reflection, I'd want to 'wink out' any existing copies of myself in favor of using resources better things. But this is partially because I personally (in my current state) would endorse such a thing: if my extrapolated volition thought it would be better to not exist (in favor of other resource usage), my current self would accept that. And, I think it currently seems unlikely that upon reflection, I'd want to end all human lives (in particular, I think I probably would want to keep humans alive who had preferences against non-existence). This applies regardless of trade; it's important to note this to avoid a 'perpetual motion machine' type argument.
Beyond this, I think that most or many humans or aliens would, upon reflection, want to preserve currently existing humans or aliens who had a preference against non-existence. (Again, regardless of trade.)
Additionally, I think it's quite plausible that most or many humans or aliens will enact various trades or precommitments prior to reflecting (which is probably ill-advised, but it will happen regardless). So current preferences which aren't stable under reflection might have a significant influence overall.
I would be more sympathetic if you made a move like, "I'll accept continuity through the human range of intelligence, and that we'll only have to align systems as collectively powerful as humans, but I still think that hands-on experience is only..." In particular, I think there is a real disagreement about the relative value of experimenting on future dangerous systems instead of working on theory or trying to carefully construct analogous situations today by thinking in detail about alignment difficulties in the future.
Here are some views, often held in a cluster:
I'm not sure exactly which clusters you're referring to, but I'll just assume that you're pointing to something like "people who aren't very into the sharp left turn and think that iterative, carefully bootstrapped alignment is a plausible strategy." If this isn't what you were trying to highlight, I apologize. The rest of this comment might not be very relevant in that case.
To me, the views you listed here feel like a straw man or weak man of this perspective.
Furthermore, I think the actual crux is more often "prior to having to align systems that are collectively much more powerful than humans, we'll only have to align systems that are somewhat more powerful than humans." This is essentially the crux you highlight in A Case for the Least Forgiving Take On Alignment. I believe disagreements about hands-on experience are quite downstream of this crux: I don't think people with reasonable views (not weak men) believe that "without prior access to powerful AIs, humans will need to align AIs that are vastly, vastly superhuman, but this will be fine because these AIs will need lots of slow, hands-on experience in the world to do powerful stuff (like nanotech)."
So, discussing how well superintelligent AIs can operate from first principles seems mostly irrelevant to this discussion (if by superintelligent AI, you mean something much, much smarter than the human range).
Pitting two models against each other in a zero-sum competition only works so long as both models actually learn the desired goals. Otherwise, they may be able to reach a compromise with each other and cooperate towards a non-zero-sum objective.
If training works well, then they can't collude on average during training, only rarely or in some sustained burst prior to training crushing these failures.
In particular, in the purely supervised case with gradient descent, performing poorly on average in durining training requires gradient hacking (or more benign failures of gradient descent, but it's unclear why the goals of the AIs would be particularly relevant in this case).
In the RL case, it requires exploration hacking (or benign failures as in the gradient case).
The only way to prevent this is for the decision maker to assign some probability to all possible actions, regardless of how bad the predicted outcome is. This necessarily means bad outcomes will occur more frequently than they would if they could make deterministic decisions based on honest conditional predictions. We might reasonably say we don’t want to ever randomly take an action that leads to the extinction of humanity with high probability, but if this is true then a predictor can lie about that to dissuade us from any action. Even if we would be willing to take such an action with very small probability in order to get honest conditional predictions, we likely cannot commit to following through on such an action if our randomizer lands on it.
Thinking about this in terms of precommitment seems to me like it's presupposing that the AI perfectly optimizes the training objective in some deep sense (which seems implausible to me). The reason why this exploration procedure works is presumably that you end up selecting such actions frequently during training which in turn selects for AIs which perform well. Epsilon exploration only works if you sample the epsilon. So, it doesn't work if you set the epsilon to 1e-40 or something.
We can't be confident enough that it won't happen to safely rely on that assumption.
I'm not sure what motivation for worst-case reasoning you're thinking about here. Maybe just that there are many disjunctive ways things can go wrong other than bad capability evals and the AI will optimize against us?
Overall, I think I disagree.
This will depend on the exact bar for safety. I think this sort of scenario feels like 0.1% to 3% likely to me which is immensely catastrophic overall, but there is lower hanging fruit for danger avoidance elsewhere.
(And for this exact failure mode, I think that improvements upon various relatively straight forward capability evals are likely to quite compelling as the most leveraged current interventions, but I'm not confident.)
[Sorry for late reply]
Analogously, conditional on things like gradient hacking being an issue at all, I'd expect the "hacker" to treat potential-training-objective-improvement as a scarce resource, which it generally avoids "spending" unless the expenditure will strengthen its own structure. Concretely, this probably looks like mostly keeping itself decoupled from the network's output, except when it can predict the next backprop update and is trying to leverage that update for something.
So it's not a question of performing badly on the training metric so much as a question of not performing as well as it could.
Perhaps I'm just misunderstanding things, but the argument here is that gradient hacking to intentionally perform poorly is easier than gradient hacking to do things like steer the overall training process in some more complex direction.
I agree on this being easier, but this still seems very difficult. It seems like you're willing to defend a further point of 'reasonably likely'.
I don't expect that to actually be too hard; the subsystem can spend some prediction-power in order to set up the structure needed to decouple the rest of its prediction-power from the output most of the time. (Just like a bacteria expends some negentropy to set up a structure which mostly decouples the rest of its negentropy from the local entropy gradient, so that it does not immediately burst into flame.)
I understand this as meaning "the AI/hacker/subsystem can spend a small/modest 'gradient tax' to permanantly and consistently perform poorly on a given task without paying further tax or risking being removed" and "this is pretty plausible".
This still sounds very difficult to me: The AI has to predict how to setup this barrier in a way which is robust to noise from sources like variable minibatches (or even just adding random noise to the weights which is a thing we could do as a countermeasure). I think there are some other difficulty considerations.
I'd like to register that I disagree with the claim that standard online RLHF requires adversarial robustness in AIs persay. (I agree that it requires that humans are adversarially robust to the AI, but this is a pretty different problem.)
In particular, the place where adversarial robustness shows up is in sample efficiency. So, poor adversarial robustness is equivalent to poor sample efficiency. My understanding is that the trend is toward higher not lower sample efficiency with scale, so this seems on track.
This same reasoning also applies to recursive oversight schemes or debate.
Overall, my guess is that commercial incentives and normal scaling will cause the sample efficiencies of these processes to be quite high in practice. Additionally, my guess would be that improving adversarial robustness isn't the best way to push on sample efficiency.
That said, I do think there are many important reasons to improve adversarial robustness.