“for our threat modeling, we can just argue about what AIs will be capable of, rather than needing to argue about inductive biases or whatever”
Where else do you think capabilities come from?
The data, as Zack M Davis argues. One of the takeaways from the deep learning revolution is that inductive biases matter a lot less than we thought, and that data matters much more for AI behavior than we thought.
While I appreciate being cited, I don't think this makes sense in context as a response to Wyeth's remark. ("Inductive biases" and "data" aren't somehow opposing explanations; doing induction on data trivially requires both, and Wyeth himself has written in favor of imitation learning as an alignment strategy.)
I agree that induction on data requires an inductive bias (short of resorting to a look-up table), but I do claim that you were arguing that modern AI behavior is determined more by the data than by the architecture, relative to what Max H was saying.
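To make the "trivially requires both" point concrete, here is a minimal toy sketch (my own illustration, not anything from the thread): the same training data fit under two different inductive biases, a linear fit and a nearest-neighbor lookup, behaves nearly identically in-distribution but diverges sharply out of distribution. Neither the data nor the bias alone pins down what the learner does.

```python
# Toy illustration: same data, two inductive biases, different behavior
# off-distribution. Requires only numpy.
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, size=20)
y_train = 2 * x_train + 0.1 * rng.normal(size=20)  # roughly linear data

# Inductive bias 1: linearity. Least-squares fit of y = a*x + b.
A = np.stack([x_train, np.ones_like(x_train)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, y_train, rcond=None)

def linear_predict(x):
    return a * x + b

# Inductive bias 2: locality. 1-nearest-neighbor lookup over the training set.
def knn_predict(x):
    return y_train[np.argmin(np.abs(x_train - x))]

# In-distribution, the two models roughly agree...
print(linear_predict(0.5), knn_predict(0.5))   # both near 1.0

# ...but far from the training data they diverge: the linear model
# extrapolates the trend, the lookup model clamps to a memorized point.
print(linear_predict(10.0), knn_predict(10.0))  # ~20.0 vs. ~2.0
```

Which model is "right" at x=10 is entirely a question about the bias, since the data is identical; which is right at x=0.5 is mostly a question about the data. That's the sense in which the two aren't competing explanations.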
We’re running ControlConf in Berkeley on April 18-19. It's a two-day conference on AI control: the study of reducing risks from misalignment through safeguards that work even when AI models are trying to undermine them.
Since the last ControlConf (Feb 2025), AI agents have gotten way better. We’re approaching the point where control techniques are load-bearing for the safety of real agent deployments. There’s been a lot of progress too. Researchers have run control evaluations in more realistic settings, and AI companies have started building initial control measures—with much more to come.
At ControlConf, the people at the frontier of this work will present on current research problems, promising interventions, and the most important research directions going forward.
Apply here.
I expect talks and discussion on topics like these:
The great thing about AI control as an approach to mitigating misalignment risk is that it makes risk analysis (comparatively) so concrete: for our threat modeling, we can just argue about what AIs will be capable of, rather than needing to argue about inductive biases or whatever. So I’m hoping that this conference is a great place for discussion of misalignment risk that engages seriously with evidence and arguments about the future.
We’re also taking the opportunity to run a one-day workshop on April 17 on AI futurism and threat modeling, aimed at people who want clearer models of catastrophic AI risks and the best strategies for mitigating them. Apply to that here.