@techreport{taylor2026lossofoversight,
title = {Loss of Oversight: How {AI} Systems May Become Harder to Audit, Monitor, and Investigate},
author = {Taylor, Jordan and Heitmann, Max and Fage, Ed and Read, Thomas and Bloom, Joseph},
institution = {UK AI Security Institute},
year = {2026},
month = {5},
url = {https://www.aisi.gov.uk/research/loss-of-oversight-how-ai-systems-may-become-harder-to-audit-monitor-and-investigate}
}
Produced by the UK AISI Model Transparency and Situational Awareness teams. If you’re a Research Scientist or Research Engineer, we’re hiring – apply here and come and work with us!
TL;DR
Figure 1: A list of the experts interviewed to inform the content of this report. ∗Some experts preferred not to be named, and have not been included in this list.
My informal LessWrong blurb
This reflects only the personal view of Jordan Taylor, not the interviewed experts or UK AISI more broadly.
Right now, it seems we have pretty decent oversight. Not great, not terrible:
But there are a bunch of ways in which we might be playing on easy mode today, relative to how difficult oversight like this could be in the future. The dominant problem is, of course, the general increase in model capabilities for subversion, situational awareness, and tasks that are harder for humans to understand or oversee. But there are also many risks to oversight that are more acute (and potentially more preventable). We compile these and give suggestions for tracking and preventing them.
My recommendations for LessWrong readers
The full report gives a lot of recommendations for model developers, but here I'll list some recommendations I'm most excited about for general AI safety researchers:
Executive Summary
The safety of advanced AI systems depends on the ability to oversee them. As AI systems become more capable of causing harm, safety arguments increasingly draw on evidence from oversight — the ability to detect undesired behaviours or properties. This can include auditing models for misalignment before deployment, monitoring for problems during training or deployment, and investigating incidents after they occur.[4]
This report examines how oversight is likely to degrade, and what can be done to preserve or improve oversight as AI systems develop. It draws on interviews with 25 experts across frontier AI developers, government, NGOs, and academia, as well as a literature review and our own analysis. Our goal is to inform our own research priorities and contribute to a shared understanding of what effective oversight requires.
Our key takeaway is that current oversight rests on contingent properties of AI systems, many of which are likely to erode. For example, today's models often reason in human-readable text, allowing monitoring of complex reasoning. These properties underpin the evidence used in safety cases and risk assessments, but many are likely to degrade as training is scaled up, as models become more capable, as they are deployed with greater autonomy into increasingly complex environments, and as new approaches to building AI systems gain traction. Across oversight domains (illustrated in Figure 2), we identify more than twenty distinct degradation pathways. While any particular degradation pathway is uncertain, the aggregated evidence suggests that current oversight is likely to degrade absent intervention. New oversight methods which could compensate are not yet mature enough to take their place. Despite this, there are promising interventions to preserve existing oversight for longer and to hasten the development of emerging oversight techniques, which we discuss throughout the full report.
Figure 2: Overview of oversight surfaces and techniques. Oversight surfaces (top) represent distinct parts of the AI system that yield oversight signal. Oversight techniques (bottom) are applied to these surfaces as layered defences, grouped by report section. Training-based approaches (bottom row) cut across multiple surfaces. The listed oversight techniques are illustrative and not exhaustive.
Summary of degradation pathways
The key properties that current oversight relies on all face credible degradation pathways, some of which are already underway. The tables below show only the highest-severity and highest-likelihood pathways; additional pathways with lower severity or likelihood are discussed in the corresponding body sections.[5][6]
Chain-of-thought reasoning is currently the most informative monitoring signal, but it is under significant pressure.[7]
Chain-of-thought (CoT) is the step-by-step reasoning text that current models produce. CoT currently provides a useful monitoring surface across a range of risk models, including for detecting reward hacking and evaluation gaming. It is among the most compelling sources of evidence for convincing decision-makers of the need to take action.[8]
Action-only monitoring provides a floor for oversight, but it is not sufficient on its own.[9]
Action-only monitoring is observing only what a model does, rather than its reasoning. It provides a baseline level of oversight for tasks where harmful actions are obviously incriminating.
Evaluation gaming is a growing threat to auditing.[10]
Evaluations are used to audit models for undesired capabilities, propensities, or misalignment. Evaluation gaming occurs when a model recognises that it is being evaluated and modifies its behaviour accordingly. It is already threatening to undermine the validity of some alignment audits.[11][12]
Changes in architecture for memory and learning could undermine oversight.[13]
Current AI systems store memories and communicate with other systems in text, supporting oversight. In future, AI systems might store memories in less interpretable formats such as vector memory and/or continue to update model weights between deployment sessions.
White-box methods are a promising backstop, but are not yet mature enough to compensate for degradation elsewhere.[14]
White-box methods examine a model's internal representations, rather than what the model reveals in its outputs or reasoning. They span a wide maturity spectrum: simple techniques such as linear probes are starting to be deployed for some production use cases (Kramár et al., 2026), while more advanced techniques are used experimentally for auditing but are not yet mature enough for broad safety-critical uses.
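To make the idea of a linear probe concrete, here is a minimal sketch of one simple variant, a difference-of-means probe, fitted on synthetic data. Everything in it is an illustrative assumption rather than anything taken from the report or from Kramár et al.: the "activations" are Gaussian noise with a hypothetical concept direction added, and the function names (`fit_mean_difference_probe`, `probe_predict`, `make_synthetic_activations`) are made up for this example. Real probes are trained on activations extracted from an actual model and are often fitted with logistic regression instead.

```python
import random


def fit_mean_difference_probe(activations, labels):
    """Fit a linear probe whose direction is the difference of the class means."""
    dim = len(activations[0])
    pos = [a for a, y in zip(activations, labels) if y == 1]
    neg = [a for a, y in zip(activations, labels) if y == 0]
    mean_pos = [sum(a[i] for a in pos) / len(pos) for i in range(dim)]
    mean_neg = [sum(a[i] for a in neg) / len(neg) for i in range(dim)]
    direction = [p - n for p, n in zip(mean_pos, mean_neg)]
    # Place the decision threshold at the projection of the midpoint
    # between the two class means onto the probe direction.
    midpoint = [(p + n) / 2 for p, n in zip(mean_pos, mean_neg)]
    threshold = sum(d * m for d, m in zip(direction, midpoint))
    return direction, threshold


def probe_predict(direction, threshold, activation):
    """Return 1 if the activation projects above the threshold, else 0."""
    score = sum(d * x for d, x in zip(direction, activation))
    return 1 if score > threshold else 0


def make_synthetic_activations(n, dim, rng):
    """Hypothetical stand-in for real model activations (an assumption)."""
    data, labels = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if label == 1:
            # The "concept" shifts the first four coordinates.
            for i in range(4):
                vec[i] += 2.0
        data.append(vec)
        labels.append(label)
    return data, labels


rng = random.Random(0)
train_x, train_y = make_synthetic_activations(400, 16, rng)
test_x, test_y = make_synthetic_activations(200, 16, rng)

direction, threshold = fit_mean_difference_probe(train_x, train_y)
correct = sum(
    probe_predict(direction, threshold, x) == y for x, y in zip(test_x, test_y)
)
accuracy = correct / len(test_y)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

The probe is cheap because it is a single dot product per activation, which is part of why simple probes are easier to deploy than more advanced white-box techniques. The toy setup says nothing about how well probes hold up on real models, where the concept of interest may not be linearly represented.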
Training-based approaches are promising but face fundamental challenges around generalisation.[15]
Training-based approaches aim to produce models that are more honest and forthcoming, or serve as better monitors, creating new sources of oversight signal such as confessions. Most approaches to training for oversight are early-stage, but many experts felt that with sufficient effort they could be developed into a central pillar of oversight as necessity-based CoT monitorability degrades.[16]
Expert disagreements
Along the way, we highlight many unresolved disagreements between experts. LessWrong is a natural place to discuss how these disagreements could be resolved, so leave a comment if you feel you can help! Some examples of disagreements are:
Whether latent reasoning architectures will take over.
On the extent and impact of non-human concepts:
Whether evidence from misalignment honeypots is meaningful:
Whether training for oversight will generalise to deployment contexts, given potential training-gaming:
Our Recommendations
Beyond the degradation pathways for current oversight, experts were also asked about potential solutions. Detailed technical levers for preserving or enhancing oversight are discussed throughout the report. The following ideas also emerged as high-level recommendations.[17]
We provide more specific research recommendations in the full report (Section 7.3).
BibTeX citation - cite as:
e.g. prompted sabotage, spontaneously taking bad actions on alignment honeypots, eval gaming, training gaming, reward hacking etc. -- perhaps with the exception of some things like slop.
Cases where models lack introspective awareness of the undesired thing are usually much harder to catch, e.g. isolated backdoors, conditional misalignment, or our most difficult sandbagging model organisms.
e.g. more comprehensive expansions of Aligned, Orthogonal or In-conflict covering many more areas and pathways of potential degradation. Conditional misalignment / whack-a-mole training is also relevant.
This notion of oversight is weaker than scalable oversight, which asks whether oversight techniques can be made robust enough to train superhuman systems.
Likelihood follows the PHIA probability yardstick, e.g. "Likely" indicates that we judge a 55-75% chance of this issue being the severity listed in the "severity" column before full automation of AI R&D.
Severity (pips in the rightmost column) refers to how badly that pathway would undermine the corresponding oversight channel if it did arise: ⬤ extremely high (would substantially compromise the underlying oversight channel with no clear fallback), 🔴 high, 🟡 medium, 🟢 low. Likelihood and severity ratings are judgements about the regime before full automation of AI R&D, not about substantially superhuman systems.
See Section 2 of the full report:
In practice, AI systems are increasingly used to monitor other AI systems' CoT, so pathways that undermine the effectiveness of these AI monitors are also grouped here.
See Section 3.1 of the full report:
See Section 3.2 of the full report:
e.g. in the Claude Sonnet 4.5 System Card, or in Stress Testing Deliberative Alignment for Anti-Scheming Training
Many of these issues are also threats to the validity of training, as the same capabilities can enable training-gaming.
See Section 4 of the full report:
See Section 5 of the full report:
See Section 6 of the full report:
There are several interesting training-based approaches for enhancing oversight. Building upon or red-teaming these would be valuable!
These recommendations are informed by our own judgements, and are not necessarily endorsed by all listed experts.
We're hiring research scientists and research engineers! Apply here and come and work with us to build technical research in this area, e.g. creating model organisms, measuring oversight degradation, and developing new techniques to mitigate it.