Produced by UK AISI Model Transparency and Situational Awareness teams. If you’re a Research Scientist or Research Engineer, we’re hiring – apply here and come and work with us!
TL;DR
We wrote a report on risks to AI oversight (auditing, monitoring, incident investigation), informed by interviews with many researchers (Figure 1 below) and by our own analysis.
We find that many of the properties relied on for current oversight face a range of likely and potentially severe degradation pathways. Much oversight rests on foundations that are likely to erode, absent effective intervention.
We give specific recommendations for measuring shifts in oversight-relevant properties, working to preserve oversight, and investing in emerging oversight techniques as fallbacks against continued degradation.
Figure 1: A list of the experts interviewed to inform the content of this report. ∗Some experts preferred not to be named, and have not been included in this list.
My informal LessWrong blurb
This reflects only the personal view of Jordan Taylor, not the interviewed experts or UK AISI more broadly.
Right now, it seems we have pretty decent oversight. Not great, not terrible:
When AIs deliberately do bad things, they can usually be caught,[1] e.g. by CoT monitoring or by the AIs admitting to it.
When AIs are designed with hidden goals or propensities that they are introspectively aware of, these can usually be found, e.g. by white-box techniques or prefill attacks.[2]
But there are a bunch of ways in which we might be playing on easy mode today, relative to how difficult oversight like this could be in the future. The dominant problem is, of course, the general increase in model capabilities for subversion, situational awareness, and tasks that are harder for humans to understand or oversee. But there are also a lot of risks to oversight that are more acute (and potentially more preventable). We compile these and give suggestions for tracking and preventing them.
My recommendations for LessWrong readers
The full report gives a lot of recommendations for model developers, but here I'll list some recommendations I'm most excited about for general AI safety researchers:
Understanding which kinds of training degrade monitorability[3]
Measuring representational drift: checking whether concepts change across RL training, and whether this makes it harder to train effective white-box techniques afterwards
Creating model organisms of the degradation pathways below
Investigating which monitoring approaches are correlated with others, and likely to have shared failures, or to degrade simultaneously for the same reasons
Looking at neglected areas of oversight, e.g. incident response / logging / reproducibility, identifying diffuse patterns, monitoring the training process for bad incentives, etc.
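As a rough illustration of the shared-failure question in the list above, the sketch below compares how often two monitors miss the same known-bad episode with how often that would happen if their misses were independent. The monitor names, the toy outcomes, and the helper functions are all hypothetical and are not taken from the report; a real analysis would need far more episodes and some treatment of statistical uncertainty.

```python
from itertools import combinations


def miss_rate(caught):
    """Fraction of known-bad episodes that a monitor failed to flag."""
    return sum(1 for c in caught if not c) / len(caught)


def joint_miss_rate(caught_a, caught_b):
    """Fraction of known-bad episodes that both monitors failed to flag."""
    both_missed = sum(1 for a, b in zip(caught_a, caught_b) if not a and not b)
    return both_missed / len(caught_a)


def shared_failure_report(results):
    """For each monitor pair, return (observed joint miss rate,
    joint miss rate expected if the two monitors missed independently)."""
    report = {}
    for a, b in combinations(sorted(results), 2):
        observed = joint_miss_rate(results[a], results[b])
        expected = miss_rate(results[a]) * miss_rate(results[b])
        report[(a, b)] = (observed, expected)
    return report


# Hypothetical toy data: one entry per known-bad episode, True = flagged.
results = {
    "cot_monitor":    [True, True, False, False, True, True, False, True],
    "action_monitor": [True, False, False, False, True, True, False, True],
    "probe":          [False, True, True, True, False, True, True, False],
}

report = shared_failure_report(results)
for pair, (observed, expected) in report.items():
    print(pair, "observed:", observed, "expected if independent:", expected)
```

On this toy data the CoT and action monitors miss the same three episodes (observed joint miss rate 0.375 against 0.1875 expected under independence), while the probe never misses alongside either of them. A pair whose observed rate sits well above the independent baseline is a candidate for sharing a failure mode.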
Executive Summary
The safety of advanced AI systems depends on the ability to oversee them. As AI systems become more capable of causing harm, safety arguments increasingly draw on evidence from oversight: the ability to detect undesired behaviours or properties. This can include auditing models for misalignment before deployment, monitoring for problems during training or deployment, and investigating incidents after they occur.[4]
This report examines how oversight is likely to degrade, and what can be done to preserve or improve oversight as AI systems develop. It draws on interviews with 25 experts across frontier AI developers, government, NGOs, and academia, as well as a literature review and our own analysis. Our goal is to inform our own research priorities and contribute to a shared understanding of what effective oversight requires.
Our key takeaway is that current oversight rests on contingent properties of AI systems, many of which are likely to erode. For example, today's models often reason in human-readable text, allowing monitoring of complex reasoning. These properties underpin the evidence used in safety cases and risk assessments, but many are likely to degrade as training is scaled up, as models become more capable, as they are deployed with greater autonomy into increasingly complex environments, and as new approaches to building AI systems gain traction. Across oversight domains (illustrated in Figure 2), we identify more than twenty distinct degradation pathways. While any particular degradation pathway is uncertain, the aggregated evidence suggests that current oversight is likely to degrade absent intervention. New oversight methods which could compensate are not yet mature enough to take their place. Despite this, there are promising interventions to preserve existing oversight for longer and to hasten the development of emerging oversight techniques, which we discuss throughout the full report.
Figure 2: Overview of oversight surfaces and techniques. Oversight surfaces (top) represent distinct parts of the AI system that yield oversight signal. Oversight techniques (bottom) are applied to these surfaces as layered defences, grouped by report section. Training-based approaches (bottom row) cut across multiple surfaces. The listed oversight techniques are illustrative and not exhaustive.
Summary of degradation pathways
The key properties that current oversight relies on all face credible degradation pathways, some of which are already underway. The tables below show only the highest-severity and highest-likelihood pathways; additional pathways with lower severity or likelihood are discussed in the corresponding body sections.[5][6]
Chain-of-thought reasoning is currently the most informative monitoring signal, but it is under significant pressure.[7]
Chain-of-thought (CoT) is the step-by-step reasoning text that current models produce. CoT currently provides a useful monitoring surface across a range of risk models, including for detecting reward hacking and evaluation gaming. It is among the most compelling sources of evidence for convincing decision-makers of the need to take action.[8]
Action-only monitoring provides a floor for oversight, but it is not sufficient on its own.[9]
Action-only monitoring is observing only what a model does, rather than its reasoning. It provides a baseline level of oversight for tasks where harmful actions are obviously incriminating.
Evaluation gaming is a growing threat to auditing.[10]
Evaluations are used to audit models for undesired capabilities, propensities, or misalignment. Evaluation gaming is when a model identifies when it is being evaluated and modifies its behaviour accordingly. It is already threatening to undermine the validity of some alignment audits.[11][12]
Changes in architecture for memory and learning could undermine oversight.[13]
Current AI systems store memories and communicate with other systems in text, which supports oversight. In future, AI systems might store memories in less interpretable formats such as vector memory, and/or continue to update model weights between deployment sessions.
White-box methods are a promising backstop, but are not yet mature enough to compensate for degradation elsewhere.[14]
White-box methods examine a model's internal representations, rather than what the model reveals in its outputs or reasoning. They span a wide maturity spectrum: simple techniques such as linear probes are starting to be deployed for some production use cases (Kramár et al., 2026), while more advanced techniques are used experimentally for auditing but are not yet mature enough for broad safety-critical uses.
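For readers unfamiliar with linear probes, here is a minimal sketch of one of the simplest variants, a difference-of-means probe, in plain Python. It is an illustration only: it is not the method of Kramár et al. or of any production system, and the two-dimensional "activations", labels, and function names are made up for the example.

```python
def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_diff_of_means_probe(pos, neg):
    """Fit a linear probe whose direction is mean(pos) - mean(neg) and whose
    decision boundary passes through the midpoint of the two class means."""
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    direction = [p - q for p, q in zip(mu_pos, mu_neg)]
    midpoint = [(p + q) / 2 for p, q in zip(mu_pos, mu_neg)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def probe_score(activation, direction, bias):
    """Positive score: activation looks like the 'pos' class; negative: 'neg'."""
    return sum(a * d for a, d in zip(activation, direction)) + bias


# Hypothetical activations: 'pos' recorded on inputs showing the concept of
# interest, 'neg' on inputs without it. Real activations have thousands of dims.
pos = [[2.0, 0.0], [3.0, 1.0]]
neg = [[-2.0, 0.0], [-3.0, 1.0]]

direction, bias = fit_diff_of_means_probe(pos, neg)
print(probe_score([1.0, 4.0], direction, bias) > 0)   # classified as 'pos'
print(probe_score([-1.0, 4.0], direction, bias) > 0)  # classified as 'neg'
```

The probe only reads a fixed direction in activation space, which is why representational drift during further training could quietly reduce its accuracy.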
Training-based approaches are promising but face fundamental challenges around generalisation.[15]
Training-based approaches aim to produce models that are more honest and forthcoming, or serve as better monitors, creating new sources of oversight signal such as confessions. Most approaches to training for oversight are early-stage, but many experts felt that with sufficient effort they could be developed into a central pillar of oversight as necessity-based CoT monitorability degrades.[16]
Expert disagreements
Along the way, we highlight many unresolved disagreements between experts. LessWrong is a natural place to discuss how these disagreements could be resolved, so leave a comment if you feel you can help! Some examples of disagreements are:
Whether latent reasoning architectures will take over.
The extent and impact of non-human concepts.
Whether evidence from misalignment honeypots is meaningful.
Whether training for oversight will generalise to deployment contexts, given potential training-gaming.
Our Recommendations
In addition to the degradation pathways for current oversight, experts were also asked about potential solutions. Detailed technical levers for preserving or enhancing oversight are discussed throughout the report. In addition, the following ideas emerged as high-level recommendations.[17]
Implement and maintain a diverse portfolio of currently available oversight techniques. Some past oversight failures could likely have been prevented by implementing existing techniques. Developers should deploy a diverse set of monitoring and auditing techniques, including CoT monitoring, action-only monitoring and AI control, evaluations, and white-box methods such as probes, to provide defence in depth against gaps in any single oversight surface.
Track, report and forecast shifts in oversight-relevant properties. Many oversight-relevant properties (e.g. single-forward-pass capabilities, representational drift, ability to bypass monitors) require explicit continued measurement. Model developers should be clear about which oversight properties they are relying on when making safety claims, and assess their exposure to degradations in these properties. Risk assessments should include forward projections of whether key properties will be preserved, so that safety arguments do not silently expire.
Preserve oversight by design where possible. Developers should hold out the most informative and fragile monitoring surfaces from training, preserving them as independent validation channels. The held-out sources should be independent from the sources that are trained against, in the sense that they aren't likely to fail for the same reasons. Beyond this, developers should preserve text-based reasoning and memory architectures where possible, and maintain logs sufficient to reproduce and investigate concerning model behaviours.
Research emerging oversight techniques in case current oversight channels continue to degrade. Priority areas include: training for honesty and faithfulness; scaling up pragmatic white-box methods and further developing more ambitious techniques; developing the science of AI control for action-only monitoring; and mitigating evaluation gaming with production-realistic evaluations and other tools.
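One way the "track, report and forecast" recommendation above could be made concrete is to log an oversight-relevant metric at each checkpoint and extrapolate when it would cross a level that a safety argument depends on. The sketch below does this with an ordinary least-squares line. The metric (monitor recall), the numbers, and the 0.70 threshold are invented for illustration, and straight-line extrapolation is a simplifying assumption rather than a method proposed in the report.

```python
def linear_trend(points):
    """Least-squares slope and intercept for (time, value) pairs.
    Requires at least two points with distinct times."""
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in points)
    slope = sum((t - mean_t) * (v - mean_v) for t, v in points) / var_t
    return slope, mean_v - slope * mean_t


def forecast_breach(points, threshold):
    """Extrapolated time at which the fitted line reaches `threshold`,
    or None if the metric is not declining."""
    slope, intercept = linear_trend(points)
    if slope >= 0:
        return None
    return (threshold - intercept) / slope


# Hypothetical history: (checkpoint index, recall of a CoT monitor on a
# fixed set of known-bad episodes).
history = [(0, 0.95), (1, 0.90), (2, 0.85), (3, 0.80)]

# Hypothetical level below which a safety argument would no longer hold.
breach_at = forecast_breach(history, threshold=0.70)
print("Projected breach at checkpoint:", breach_at)
```

A projection like this is crude, but it turns "this property might degrade" into a dated claim that can be checked against the next measurement.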
We provide more specific research recommendations in the full report (section 7.3).
[1] e.g. prompted sabotage, spontaneously taking bad actions on alignment honeypots, eval gaming, training gaming, reward hacking, etc., perhaps with the exception of some things like slop.
[2] Cases where models lack introspective awareness of the undesired thing are usually much harder to catch, e.g. isolated backdoors, conditional misalignment, or our most difficult sandbagging model organisms.
[3] e.g. more comprehensive expansions of Aligned, Orthogonal or In-conflict covering many more areas and pathways of potential degradation. Conditional misalignment / whack-a-mole training is also relevant.
[4] This notion of oversight is weaker than scalable oversight, which asks whether oversight techniques can be made robust enough to train superhuman systems.
[5] Likelihood follows the PHIA probability yardstick, e.g. "Likely" indicates that we judge a 55-75% chance of this issue reaching the severity listed in the "severity" column before full automation of AI R&D.
[6] Severity (pips in the rightmost column) refers to how badly that pathway would undermine the corresponding oversight channel if it did arise: ⬤ extremely high (would substantially compromise the underlying oversight channel with no clear fallback), 🔴 high, 🟡 medium, 🟢 low. Likelihood and severity ratings are judgements about the regime before full automation of AI R&D, not about substantially superhuman systems.
[7] See Section 2 of the full report.
[8] In practice, AI systems are increasingly used to monitor other AI systems' CoT, so pathways that undermine the effectiveness of these AI monitors are also grouped here.
[9] See Section 3.1 of the full report.
[10] See Section 3.2 of the full report.
[11] e.g. in the Claude Sonnet 4.5 System Card, or in Stress Testing Deliberative Alignment for Anti-Scheming Training.
[12] Many of these issues are also threats to the validity of training, as the same capabilities can enable training-gaming.
[13] See Section 4 of the full report.
[14] See Section 5 of the full report.
[15] See Section 6 of the full report.
[16] There are several interesting training-based approaches for enhancing oversight. Building upon or red-teaming these would be valuable!
[17] These recommendations are informed by our own judgements, and are not necessarily endorsed by all listed experts.
We're hiring research scientists and research engineers! Apply here and come and work with us to build technical research in this area, e.g. creating model organisms, measuring oversight degradation, and developing new techniques to mitigate it.