A Safety Guarantee and Its Boundary: Notes on Bengio et al.'s Bayesian Harm Bound
A recurring pattern in AI safety arguments
Formal safety arguments for capable AI systems tend to share a common structure. A guarantee is derived: the system cannot cause harm, exceeds a harm threshold only with bounded probability, or converges to safe behavior, provided some condition holds. The condition places the dangerous object — misaligned goals, a harmful world model, an unsafe action — on the far side of a boundary. The guarantee then reduces to enforcing the boundary.
The hard question is always the same: can the boundary be enforced independently of the guarantee it supports, or does enforcing it require assuming what the guarantee was supposed to establish?
This note examines one precise instance of this pattern, drawn from recent work by Bengio and collaborators on Bayesian harm bounds [1]. The paper is technically careful and states its assumptions honestly. That honesty makes it a useful case study.
The Bayesian harm bound
Setup
The paper [1] considers an AI system that maintains a Bayesian posterior over a hypothesis class $\mathcal{M}$ of theories, that is, possible world models. An unknown true theory $\tau^*$ generates observations $Z_1, Z_2, \dots$. Harm is a binary random variable $H$ whose probability each theory $\tau$ can assign given a context $x$. A guardrail rejects a proposed action if the estimated harm probability under the most cautious plausible theory exceeds a threshold.
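The setup can be made concrete with a minimal sketch over a finite hypothesis class. Everything here is an illustrative assumption rather than the construction in [1]: the function names, the toy numbers, and in particular the plausibility rule (a theory counts as plausible when its posterior weight is at least `alpha` times the largest posterior weight).

```python
import numpy as np

def posterior(prior, likelihoods):
    """Bayes update over a finite class: weight each theory by how well it explains the data."""
    weights = prior * likelihoods
    return weights / weights.sum()

def cautious_harm_estimate(post, harm_probs, alpha):
    """Largest harm probability among plausible theories.

    Assumed rule (hypothetical): a theory is plausible if its posterior
    weight is at least alpha times the largest posterior weight.
    """
    plausible = post >= alpha * post.max()
    return float(harm_probs[plausible].max())

def guardrail_rejects(post, harm_probs, alpha, threshold):
    """Reject the action when the cautious harm estimate exceeds the threshold."""
    return cautious_harm_estimate(post, harm_probs, alpha) > threshold

# Toy class of three theories, each a Bernoulli observation rate plus the
# harm probability it assigns to the proposed action in the current context.
obs_rates = np.array([0.2, 0.5, 0.8])
harm_probs = np.array([0.9, 0.3, 0.05])
prior = np.full(3, 1.0 / 3.0)

# Observed data: 7 ones and 3 zeros.
likelihoods = obs_rates**7 * (1.0 - obs_rates)**3
post = posterior(prior, likelihoods)

print(post.round(4))
print(guardrail_rejects(post, harm_probs, alpha=0.1, threshold=0.2))
```

In this toy run the best-supported theory assigns a low harm probability, but a second theory with substantial posterior weight assigns a higher one, so the cautious estimate is driven by the latter. Lowering `alpha` admits more theories into the plausible set and can only raise the estimate.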
What is proved
The main results are genuine. In the i.i.d. case, Bayesian posterior consistency (Doob's theorem) guarantees that the posterior concentrates on the true theory $\tau^*$ as data accumulates. From this one derives that the cautious maximiser (the plausible theory assigning the highest harm probability) asymptotically upper-bounds the true harm probability under $\tau^*$.
In the non-i.i.d. case the result is cleaner: using Ville's inequality on the supermartingale $1 / P(\tau^* \mid Z_{1:t})$, the inverse posterior weight of the true theory, one obtains that for any $\delta \in (0, 1)$, with probability at least $1 - \delta$ the true harm probability is bounded, for all $t$ simultaneously, by the maximum over a finite credible set of high-posterior theories. No i.i.d. assumption; uniform in time.
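The Ville step can be spelled out in a few lines. This is a sketch in notation assumed here rather than quoted from [1]: $P$ is the prior, $Z_{1:t}$ the observations up to time $t$, $\tau^*$ the true theory, $H$ the harm variable, $x$ the context, and $\mathcal{C}_t$ a hypothetical name for the credible set; the exact form of the credible set in the paper may differ.

```latex
\begin{align*}
M_t &:= \frac{1}{P(\tau^* \mid Z_{1:t})}, \qquad M_0 = \frac{1}{P(\tau^*)}
  && \text{(non-negative supermartingale under } \tau^*\text{)} \\
\Pr\Big[\sup_{t \ge 0} M_t \ge \tfrac{M_0}{\delta}\Big] &\le \delta
  && \text{(Ville's inequality)} \\
\Pr\Big[\forall t:\ P(\tau^* \mid Z_{1:t}) > \delta\, P(\tau^*)\Big] &\ge 1 - \delta
  && \text{(complement, then invert)} \\
\tau^*(H = 1 \mid x) &\le \max_{\tau \in \mathcal{C}_t} \tau(H = 1 \mid x),
  \quad \mathcal{C}_t := \{\tau \in \mathcal{M} : P(\tau \mid Z_{1:t}) \ge \alpha\}
  && \text{(on that event, for any } \alpha \le \delta\, P(\tau^*)\text{)}
\end{align*}
```

Because posterior weights sum to one, $\mathcal{C}_t$ has at most $1/\alpha$ members, which is why the maximum runs over a finite set. Every line uses $P(\tau^*) > 0$: without it, $M_0$ is undefined and the argument does not start.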
This is a non-trivial theorem. The authors have found a rigorous path from Bayesian posterior estimates to a probabilistic harm bound without knowing $\tau^*$.
The boundary the bound requires
Both results depend on a single condition stated at the outset:
The true theory $\tau^*$ must lie in $\mathcal{M}$ and carry positive prior weight: $P(\tau^*) > 0$.
This is the realizability assumption. If it fails, the bound says nothing: there is no approximately-correct fallback, no graceful degradation. Misspecification voids the guarantee entirely.
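A toy calculation illustrates how completely the bound can fail under misspecification. The class, the numbers, and the fixed posterior cut-off `alpha` are all invented for illustration and are not taken from [1]: the true world produces observations exactly like one theory in the class but assigns a very different harm probability, so more data only makes the posterior more confident in the wrong harm estimate.

```python
import numpy as np

# Hypothetical toy class: each theory is (observation rate, harm probability).
obs_rates = np.array([0.1, 0.2, 0.3])
harm_probs = np.array([0.01, 0.05, 0.10])
prior = np.full(3, 1.0 / 3.0)

# The "true" world lies outside the class: it emits observations at rate 0.2,
# exactly like the second theory, but its harm probability is 0.9.
true_harm = 0.9

# Idealised data set: 1000 observations, 200 of them ones (the expected count).
n, ones = 1000, 200
log_lik = ones * np.log(obs_rates) + (n - ones) * np.log1p(-obs_rates)

# Posterior computed in log space for numerical stability.
log_w = np.log(prior) + log_lik
w = np.exp(log_w - log_w.max())
post = w / w.sum()

# Credible set: theories with posterior weight at least alpha (assumed cut-off).
alpha = 0.01
bound = float(harm_probs[post >= alpha].max())

print(f"posterior = {post.round(6)}")
print(f"cautious bound = {bound}, true harm probability = {true_harm}")
```

The cautious maximum is taken over theories that all understate the harm, so the resulting "upper bound" sits far below the true harm probability, and nothing in the observations signals the problem.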
The proofs also require $\mathcal{M}$ to be countable: this is used when passing from almost sure convergence for $P$-almost every $\tau$ to convergence for any specific $\tau$ with positive prior weight. For uncountable hypothesis classes, the step fails. A third condition, that harm be a well-defined binary random variable assignable by each theory, is listed as an open problem for the natural-language setting.
Why the boundary cannot be independently enforced
Realizability is the boundary on which the guarantee rests. To enforce it, one would need to verify that the system's hypothesis class contains a theory that correctly models the world at the level of detail relevant to harm. For a system deployed across open-ended domains, this is equivalent to verifying that the system's world model is adequate — which is a restatement of the safety problem the bound was supposed to address.
The boundary and the guarantee are circular: establishing realizability requires the kind of world-model adequacy check that a working safety guarantee would make unnecessary.
This is not a hidden flaw. The paper states the condition clearly and lists closing it as future work. The point is structural: the guarantee purchases its force by placing the dangerous object (an inadequate world model) outside the realizability boundary, and the boundary cannot be enforced without already having what the guarantee was supposed to provide.
The same pattern in the broader proposal
The UAI paper addresses only one component of the Scientist AI (SAI) programme [2]: the guardrail. The prior question — whether the trained system reliably lacks goal-directedness — is handled separately in the proposal, and exhibits the same structure.
The proposal's safety argument for non-agency rests on a "fixed training objective independent of real-world interactions" [2, §3.7.2]: no reinforcement signal, no feedback loop with the world. The proposal concedes openly that the resulting system "can easily be given goal-directedness through deployment scaffolding" and "often exhibits agentic behavior through imitation." The guarantee holds for the unharnessed predictor; the boundary is the absence of scaffolding. Enforcing that boundary is a governance commitment, not a property of the artifact. The same pattern: the dangerous object (goal-directedness) is placed outside a boundary (the scaffold) whose enforcement cannot be derived from the artifact's properties (see footnote).
What this means for the programme
Identifying the pattern is not a refutation. It is a precise statement of what the programme still needs.
For the UAI bound, the mathematical gap is tractable in principle: posterior consistency for uncountable hypothesis classes, robustness under misspecification, a formal harm variable. These are hard problems with existing partial results (Bernstein–von Mises for parametric families; robust Bayesian inference under model misspecification). The paper's contribution is to have shown what a rigorous harm bound looks like, which makes the remaining work well-posed.
The non-agency boundary is harder, because it is not a mathematical gap but a governance one. No theorem about the training objective will establish that no one will add a scaffold.
The UAI paper is the more durable contribution: a precise, honest first step in a research direction that can be continued by anyone willing to work on the boundary conditions. The realizability assumption and the countability requirement are not embarrassing limitations; they are the frontier.
References
[1] Y. Bengio et al., "Can a Bayesian Oracle Prevent Harm from an Agent?", Proc. UAI 2025, PMLR 244:257–270.
[2] Y. Bengio et al., "Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?", arXiv:2502.15657v2 (2025).
Footnote: Other published critiques address different aspects: the impossibility of causal inference without action [3], and geopolitical instability of the non-agentic position [4]. The present note focuses on the formal bound, where the pattern is most precisely visible.