Hi Cleo, nice post! I agree with you that mixed deployments are a good tool to have in our toolbox, but I disagree somewhat with your model of why mixed deployments are good.
My thoughts on whether danger is convex/concave in the fraction of your deployment which is scheming:
Suppose that we have the following two options:
Which should we prefer? Let’s consider the threat model of weight exfiltration. Suppose that in the pure deployment world the misaligned AI’s chance of success is 30% if we do vanilla untrusted monitoring. Then, the overall risk in the first situation is 15%, because there was a 50% chance that the AI was aligned. So, we’d prefer the mixed deployment described in situation 2 if and only if the mixed deployment has risk below 15%. There are two reasons why the mixed deployment should have lower risk than a pure misaligned deployment:
I suspect that (1) is a very small effect (I talked to Adam and Cody (some authors on Ctrl-Z), and they thought that doubling deployment length while keeping auditing budget a constant fraction of the deployment doesn’t really change the safety and usefulness much). I think that (2) is a larger effect, but am pretty unsure if it’s enough to halve the risk compared to a pure misaligned deployment (I think we should investigate this empirically).
Why I’m excited about mixed deployments despite uncertainty about the above question:
The situation in practice looks a lot better than a forced choice between a pure deployment that might be aligned / misaligned vs a mixed aligned / misaligned deployment. Here are some reasons why the actual situation is better:
Implications for what research should be done
It might be valuable for someone to do a research project investigating how good mixed deployments are, compared to pure aligned and pure misaligned deployments. Specifically, here’s what I’m imagining:
I think the blue team might be able to do extremely well with mixed deployments, e.g., much better than halving risk compared to a pure misaligned deployment.
Nice work!
A few thoughts:
Having conversations is often much more productive for resolving confusions than just listening to other people's conversations. It's pretty clear why this would be the case: if I'm confused about thing X, then this may trickle down to small confusions about lots of things Y1, Y2, Y3. Then, it's fairly likely that at least one of Y1,Y2,Y3 will come up in conversation, and if me and the conversation partner iteratively find cruxes, we'll quickly identify X and then the conversation partner can give me the arguments for X, at which point I may change my mind.
I would learn a lot more algebraic topology by talking to an algebraic topologist and asking them to define all their terms rather than sitting and "absorbing" a lecture. Ofc "listen to content that you can basically understand with a little bit of new stuff thrown in" is going to be more effective than this, but this still feels like a super silly way to learn algebraic topology.
"Speaking so that you sound like a native" is not the only goal of language learning (I'd argue that it's a relatively minor one). For instance, if I was to learn a new language my goal would be "be able to kind of basically communicate" rather than to be a native speaker. I'd guess that practicing conversation is much more efficient for picking this skill up, but would be interested if you have evidence to the contrary.
I kind of disagree with the suggestions in the "culture learning" section. I don't think it's good for epistemics for people to try to adopt the opinions of people on LW, or to think of this as a goal. If you're only willing to talk to people that share your opinions, then that is an echo-chamber. However, I agree that it's more efficient from my perspective for people to have first read some basic content about why people are worried about AI before talking to me about it (although I suspect that this is much less efficient from their perspective because I can adaptively choose what to talk about based on their demonstrated confusions).
Cross lab monitoring seems like a great idea. As an implementation detail, I think labs should do cross-lab monitoring with AI models that have instructions to not aid the other lab with AI R&D, synthetic data generation, or anything besides monitoring. Of course, if you thought that Gemini was more likely to be aligned than Claude, then you might prefer that Anthropic use Gemini for automating AI R&D, but this is almost certainly infeasible. One other complication worth noting with cross lab monitoring is that the AI models from different labs will probably have different intelligence levels, and different skill profiles. Thus, cross-lab monitoring may be giving us a model which is in between trusted monitoring and untrusted monitoring both in terms of intelligence and “trustedness”. This means might make untrusted monitoring preferable to cross-lab monitoring in some cases (e.g., if Claude is much smarter than all other models), but cross-lab monitoring still seems useful as a tool to add to our monitoring system.