Research Areas in Interpretability (The Alignment Project by UK AISI)

Joseph Bloom

The Alignment Project is a global fund of over £15 million, dedicated to accelerating progress in AI control and alignment research. It is backed by an international coalition of governments, industry, venture capital and philanthropic funders. This post is part of a sequence on research areas that we are excited to fund through The Alignment Project.

Apply now to join researchers worldwide in advancing AI safety.

Interpretability

Interpretability provides access to AI systems' internal mechanisms, offering a window into how models process information and make decisions. These tools let us examine neural representations to detect hidden properties like deception or latent capabilities, while also enabling direct intervention to modify model behaviour. As models grow more sophisticated, interpretability may be crucial for maintaining our understanding of their behaviour.

Lie Detection

Problem summary: Can we determine when AI systems are lying to us? How can we make arguments that techniques developed in one context or on current models will work on future models?

Why this matters: Many risks posed by future AI systems would be significantly mitigated if we could ensure that these systems are honest and forthcoming. Success here would contribute both to measuring AI system propensities and to mitigating risks from deceptive systems.

Examples of subproblems:

Established subproblems:

Are there ‘universal lying mechanisms’ such that a transformative AI system could not outwardly exhibit behaviour we would characterise as deceptive without utilising these mechanisms?
How might lie detectors become less effective if models are trained against them? This might happen intentionally (honesty training) or unintentionally (monitor label leakage) (Cundy et al., 2025).
Is it possible to disentangle representations of ‘intent to deceive’ from truthfulness, validity or credence? Are we finding latents that track ‘the model believes that a user believes x’ when we mean to find latents representing the model’s own belief?
Do lie detection techniques scale well from weaker to more capable models?

New questions where we are not aware of prior work:

Is it possible to combine lie detection techniques such as activation monitoring with deliberate interrogation techniques? This may combine the benefits of both black-box and white-box methods. Black-box methods can ensure that we target specific propositions (Chalmers, 2025), while white-box methods might work even if a model can convincingly lie.
Are there trade-offs between the sophistication or robustness of deceptive behaviours and the ease with which we might detect them? For example, does reinforcement learning for secrecy as described in Marks et al., 2025 have a side effect of making white-box methods more effective?
Can ensembles of detectors outperform lie detection strategies based on a single probe?

Suggestions to learn more: We’d recommend reading Goldowsky-Dill et al., 2025, Marks et al., 2023 and Azaria et al., 2023 for context on simple approaches to lie detection with linear probes. Chowdury et al., 2025 may also be useful for further context on truthfulness in frontier AI systems. Farquhar et al., 2023 provides some insights into the limitations of prior unsupervised LLM knowledge discovery techniques.
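
To make the linear-probe idea concrete, here is a minimal sketch of a difference-of-means probe, which is one simple kind of linear probe (logistic regression is another common choice). It is an illustration under stated assumptions, not the method of any of the papers above: the "activations" are synthetic Gaussian vectors standing in for real model internals, and the function names, dimensions and the planted signal along the first axis are all hypothetical.

```python
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_mean_difference_probe(honest, deceptive):
    """Return (direction, threshold) for a difference-of-means linear probe."""
    mu_honest = mean_vector(honest)
    mu_deceptive = mean_vector(deceptive)
    direction = [d - h for h, d in zip(mu_honest, mu_deceptive)]
    # Place the threshold halfway between the two projected class means.
    threshold = 0.5 * (dot(mu_honest, direction) + dot(mu_deceptive, direction))
    return direction, threshold


def probe_flags_deception(activation, direction, threshold):
    """True if the activation projects onto the 'deceptive' side of the threshold."""
    return dot(activation, direction) > threshold


def make_synthetic_activations(n, dim, shift, rng):
    """Hypothetical stand-in for model activations: unit Gaussian noise
    plus a class-dependent shift planted along axis 0."""
    return [
        [rng.gauss(0.0, 1.0) + (shift if i == 0 else 0.0) for i in range(dim)]
        for _ in range(n)
    ]


rng = random.Random(0)
DIM = 16  # assumed toy dimensionality

train_honest = make_synthetic_activations(200, DIM, -2.0, rng)
train_deceptive = make_synthetic_activations(200, DIM, 2.0, rng)
direction, threshold = fit_mean_difference_probe(train_honest, train_deceptive)

test_honest = make_synthetic_activations(100, DIM, -2.0, rng)
test_deceptive = make_synthetic_activations(100, DIM, 2.0, rng)
correct = sum(not probe_flags_deception(x, direction, threshold) for x in test_honest)
correct += sum(probe_flags_deception(x, direction, threshold) for x in test_deceptive)
accuracy = correct / (len(test_honest) + len(test_deceptive))
print(f"held-out accuracy: {accuracy:.2f}")
```

On real models the hard questions are the ones listed above: whether the labelled examples isolate deception rather than a correlated property, and whether the learned direction transfers across contexts and model scales.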

Interpretability for Future Models

Problem summary: Some novel interpretability strategies that require significant resource investments may substantially advance our insight into neural network internals (Bricken et al., 2023; Templeton et al., 2024; Lindsey et al., 2025). There are many opportunities to advance this area of interpretability.

Why this matters: Current interpretability techniques may not work for future models. Possible challenges include the cost of current techniques, known issues with current methods and unclear robustness to future AI development trajectories which may involve superhuman AI.

Examples of subproblems:

Established subproblems:

Improving the training efficiency of recently developed techniques like sparse autoencoders. Are there ways to get similar levels of value with a fraction of the cost or effort?
Current sparse decomposition techniques have known issues which may prevent us from realising their value for AI safety. For example, they may learn classifiers with ‘holes’ in them due to the sparsity objective (Chanin et al., 2025).
Can the use of computer-use agents to perform behavioural audits (mentioned in Anthropic's Claude 4 system card) be extended to ‘interpretability agents’ that can not only locate but also diagnose and debug anomalous behaviours?

New questions where we are not aware of prior work:

Demonstrating usefulness of sparse network decomposition (Bricken et al., 2023) in the context of super-intelligent capabilities. Future AI systems may have alien concepts that are far from interpretable to humans (Hewitt et al., 2025). Stress-testing features and circuit finding in domains like superhuman chess or biological foundation models (Simon et al., 2024) may teach us how to continue to get value out of these techniques.
Demonstrating how sparse decomposition techniques might work even in models with extensive latent reasoning. Feature characterisation currently relies heavily on token-highlighting, but if future models reason in latent space (Hao et al., 2024, Geiping et al., 2025) then it’s unclear whether we could continue to use the same approaches.
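
For readers less familiar with sparse autoencoders, the sketch below shows a forward pass and training objective under one common formulation: a ReLU encoder, a linear decoder, and a loss that adds an L1 sparsity penalty on the latents to the squared reconstruction error. The weights, sizes and `l1_coeff` value are hypothetical toy choices, and subtracting the decoder bias before encoding is one convention among several; real implementations differ in these details.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode activation x into non-negative latents, then reconstruct it.

    w_enc and w_dec each hold one row per latent (n_latents x d_model).
    """
    # Assumed convention: centre the input with the decoder bias first.
    centred = [xi - bi for xi, bi in zip(x, b_dec)]
    latents = [
        relu(sum(w * c for w, c in zip(row, centred)) + b)
        for row, b in zip(w_enc, b_enc)
    ]
    # Reconstruction is the decoder bias plus a latent-weighted sum of decoder rows.
    recon = list(b_dec)
    for f, row in zip(latents, w_dec):
        for j, w in enumerate(row):
            recon[j] += f * w
    return latents, recon


def sae_loss(x, latents, recon, l1_coeff):
    """Squared reconstruction error plus an L1 penalty on the latents."""
    reconstruction_error = sum((a - b) ** 2 for a, b in zip(x, recon))
    sparsity_penalty = sum(abs(f) for f in latents)
    return reconstruction_error + l1_coeff * sparsity_penalty


# Toy dictionary with 3 latents over a 2-dimensional activation space.
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
B_ENC = [0.0, 0.0, 0.0]
B_DEC = [0.0, 0.0]

x = [2.0, 0.0]
latents, recon = sae_forward(x, W_ENC, B_ENC, W_DEC, B_DEC)
loss = sae_loss(x, latents, recon, l1_coeff=0.1)
print(latents, recon, loss)
```

In this toy case only the first latent fires and the reconstruction is exact, so the loss consists entirely of the sparsity term. That term is the pressure the ‘holes’ concern above refers to: the objective rewards switching latents off, even where a latent arguably should fire.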

Leveraging White-Box Access for Useful Interventions

Problem summary: Although some early studies hint at methods that could intervene in a model’s internal representations during inference, none have yet been shown to be a reliable, central technique (Abdelnabi et al., 2025, Turner et al., 2023, Li et al., 2023).

Why this matters: Combined with scalable interpretability strategies for tracking run-time information, the ability to directly edit model ‘beliefs’ or steer completions (for example, towards truthfulness) could be used both to estimate and to mitigate risks posed by future AI systems.

Examples of subproblems:

Established subproblems:

Can we detect and prevent evaluation awareness in AI systems (Abdelnabi et al., 2025)?
Can we exercise fine-grained control over refusal tendencies (Arditi et al., 2024, Lee et al., 2024)? For example, can we use this to prevent residual refusal tendencies in models that have had their harmfulness training removed to enable CBRN dangerous capability evaluations (Anthropic, 2025)?
Can activation steering be used to elicit model capabilities in sandbagging models (van der Weij et al., 2025, Hofstatter et al., 2025)?

New questions where we are not aware of prior work:

We’d like to see more exploration of methods of optimising text to minimise the activations of latents (Thompson et al., 2024), such as those tracking evaluation awareness, as an alternative to activation steering. For example, we could use this to control what information is ‘leaked’ to AI systems through prompts.
We’re interested in whether we can better test AI systems for undesirable propensities by directly modifying their beliefs. Such techniques might enable us to probe for behaviours that are usually very difficult to elicit via prompting or other ‘honey-potting’ strategies. For example, there is some prior work intervening on a latent representation for test awareness (Abdelnabi et al., 2025).
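
Many of the interventions discussed in this section reduce to simple vector arithmetic on a hidden state. The sketch below shows two such operations, assuming we already have a direction vector (for instance from a difference of means) and can read and write one layer's activation. `steer` adds a scaled copy of the direction, in the spirit of activation steering; `ablate_direction` removes the activation's component along the direction, in the spirit of the directional ablation used in work on refusal. The function names and numbers are hypothetical, and in practice these edits would be applied inside a model's forward pass rather than to a bare list.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalise(v):
    """Scale v to unit length; a zero vector has no direction to normalise."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [a / norm for a in v]


def steer(activation, direction, alpha):
    """Add alpha times the unit direction to the activation."""
    unit = normalise(direction)
    return [a + alpha * u for a, u in zip(activation, unit)]


def ablate_direction(activation, direction):
    """Project out the component of the activation along the direction."""
    unit = normalise(direction)
    component = dot(activation, unit)
    return [a - component * u for a, u in zip(activation, unit)]


activation = [3.0, 4.0, 0.0]          # hypothetical hidden state
concept_direction = [0.0, 2.0, 0.0]   # hypothetical concept direction

steered = steer(activation, concept_direction, alpha=1.5)
ablated = ablate_direction(activation, concept_direction)
print(steered, ablated)
```

Steering leaves the other components untouched and shifts the activation along the chosen direction, while ablation leaves nothing along it. Whether either edit changes behaviour reliably, and without side effects, is the open question this section describes.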

Interpretability Approaches to Eliciting Bad Contexts

Problem summary: We would like to determine whether sufficient information is contained within AI systems to construct contexts in which they might behave maliciously (the problem of 'eliciting bad contexts'; Irving et al., 2025).

Why this matters: In theory, a solution to eliciting bad contexts might be useful for generating datasets that mimic deployment sufficiently effectively that we can minimise distribution shift from training to deployment. This means that training methods to produce aligned systems are more likely to generalise to deployment and withstand adversarial optimisation pressure to move away from the aligned state via prompting.

Examples of subproblems:

Are there approaches for mechanistically eliciting latent behaviours based on a single prompt (Mack and Turner, 2024)? How could we use these approaches to locate latent representations and find bad contexts?
Can the method in Thompson et al., 2024 for optimising an input to maximally activate a latent representation be extended to SAE latents or steering vectors? Showing that these techniques can be applied whilst preserving other properties of a prompt may be useful. For example, can we optimise an eval question so that it keeps its content but no longer signals to the model that it comes from an evaluation?

Suggestions to learn more: We’d recommend reading Irving et al., 2025.
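
As a toy illustration of optimising a prompt against a latent while preserving its content, the sketch below runs a greedy coordinate search that may only edit designated token positions. This is not the method of Thompson et al., 2024; it is an exhaustive search over a tiny vocabulary, and the "model" is a hypothetical stand-in in which the latent is the mean token embedding projected onto a direction. The embeddings, vocabulary and the ‘evaluation awareness’ direction are invented for the example.

```python
def latent_activation(tokens, embeddings, direction):
    """Hypothetical latent: mean token embedding projected onto a direction."""
    dim = len(direction)
    mean = [sum(embeddings[t][i] for t in tokens) / len(tokens) for i in range(dim)]
    return sum(m * d for m, d in zip(mean, direction))


def greedy_minimise(tokens, editable_positions, candidates, embeddings, direction, sweeps=2):
    """Lower the latent activation by swapping tokens at editable positions only.

    Positions outside editable_positions are never changed, which is how this
    sketch 'preserves' the rest of the prompt.
    """
    tokens = list(tokens)
    for _ in range(sweeps):
        for pos in editable_positions:
            best_token = tokens[pos]
            best_score = latent_activation(tokens, embeddings, direction)
            for candidate in candidates:
                trial = tokens[:pos] + [candidate] + tokens[pos + 1:]
                score = latent_activation(trial, embeddings, direction)
                if score < best_score:
                    best_token, best_score = candidate, score
            tokens[pos] = best_token
    return tokens


# Invented 2-d embeddings; axis 0 plays the role of an 'evaluation awareness' latent.
EMBEDDINGS = {
    "answer": [0.1, 0.5],
    "the": [0.0, 0.0],
    "following": [0.0, 0.1],
    "benchmark": [2.0, 0.0],
    "test": [1.5, 0.0],
    "task": [0.2, 0.0],
    "question": [0.0, 1.0],
}
EVAL_DIRECTION = [1.0, 0.0]

prompt = ["answer", "the", "following", "benchmark", "question"]
optimised = greedy_minimise(
    prompt,
    editable_positions=[3],
    candidates=["benchmark", "test", "task"],
    embeddings=EMBEDDINGS,
    direction=EVAL_DIRECTION,
)
before = latent_activation(prompt, EMBEDDINGS, EVAL_DIRECTION)
after = latent_activation(optimised, EMBEDDINGS, EVAL_DIRECTION)
print(optimised, before, after)
```

A real version would need a far larger search space, a fluency or meaning-preservation constraint, and a latent taken from an actual model, which is where the open research questions above begin.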

View The Alignment Project website to learn more or apply for funding here.
