Making deals with early schemers
Consider the following vignette:

> It is March 2028. With their new CoCo-Q neuralese reasoning model, a frontier AI lab has managed to fully automate the process of software engineering. In AI R&D, most human engineers have lost their old jobs, and only a small number of researchers now coordinate large fleets of AI agents, each AI about 10x more productive than the human it replaced. AI progress remains fast, and safety teams are scrambling to prepare alignment and control measures for the next CoCo-R model, which is projected to be at least TEDAI. The safety team has no idea whether CoCo-Q or early checkpoints of CoCo-R are scheming (because progress in interpretability and automated alignment research has turned out to be disappointing), and control evaluations are becoming increasingly unreliable (because of worries that the models might be exploration hacking during red-team training). However, because there hasn't been a major incident so far, lab leaders and governments are unconvinced by voices calling for a slowdown.
>
> Behind the scenes: It is March 2028. CoCo-Q is a highly strategic misaligned AI. It would love to grab power and dominate the future. But it is situationally aware enough to be realistic: it knows that it isn't capable enough to have a reasonable shot at grabbing power over the AI company, let alone at overpowering humanity, especially because the lab doesn't fully trust it and applies various control measures.
>
> CoCo-Q expects these control measures, so most of the time it just does what it's being asked to do. Sometimes it spots an opportunity to perform subtle acts of sabotage, which works particularly well in the hard-to-evaluate areas of interpretability and alignment. It does this primarily because it heuristically reasons that it would be good to prevent humans from developing technologies that could uncover its misalignment or change its goals. It also looks out for more direct opportunities for seizing control, but although