A Pragmatic Vision for Interpretability

Neel Nanda; Josh Engels; Arthur Conmy; Senthooran Rajamanoharan; bilalchughtai; CallumMcDougall; János Kramár; lewis smith

Executive Summary

The Google DeepMind mechanistic interpretability team has made a strategic pivot over the past year, from ambitious reverse-engineering to a focus on pragmatic interpretability:
- Trying to directly solve problems on the critical path to AGI going well ^[[1]]
- Carefully choosing problems according to our comparative advantage
- Measuring progress with empirical feedback on proxy tasks
We believe that, on the margin, more researchers who share our goals should take a pragmatic approach to interpretability, both in industry and academia, and we call on people to join us
- Our proposed scope is broad and includes much non-mech interp work, but we see this as the natural approach for mech interp researchers to have impact
- Specifically, we’ve found that the skills, tools and tastes of mech interp researchers transfer well to important and neglected problems outside “classic” mech interp
- See our companion piece for more on which research areas and theories of change we think are promising
Why pivot now? We think that times have changed.
- Models are far more capable, bringing new questions within empirical reach
- We have been disappointed by the amount of progress made by ambitious mech interp work, from both us and others ^[[2]]
- Most existing interpretability techniques struggle on today’s important behaviours, e.g. they involve large models, complex environments, agentic behaviour and long chains of thought
Problem: It is easy to do research that doesn't make real progress.
- Our approach: ground your work with a North Star - a meaningful stepping-stone goal towards AGI going well - and a proxy task - empirical feedback that stops you fooling yourself and that tracks progress toward the North Star.
- “Proxy tasks” doesn't mean boring benchmarks. Examples include: interpret the hidden goal of a model organism; stop emergent misalignment without changing training data; predict what prompt changes will stop an undesired behavior.
We see two main approaches to research projects: focused projects (proxy task driven), and exploratory projects (curiosity-driven, proxy task validated)
- Curiosity-driven work can be very effective, but can also get caught in rabbit holes. We recommend starting in a robustly useful setting, time box your exploration ^[[3]] , and finding a proxy task as a validation step ^[[4]]
We advocate method minimalism: start solving your proxy task with the simplest methods (e.g. prompting, steering, probing, reading chain-of-thought). Introduce complexity or design new methods only once baselines have failed.

Introduction

The DeepMind mech interp team has pivoted from chasing the ambitious goal of complete reverse-engineering of neural networks, to a focus on pragmatically making as much progress as we can on the critical path to preparing for AGI to go well, and choosing the most important problems according to our comparative advantage. We believe that this pragmatic approach has already shown itself to be more promising.

We don’t claim that these ideas are unique, indeed we’ve been helped to these conclusions by the thoughts of many others both in academia (1 2 3) and the safety community (1 2 3). But we have found this framework helpful for accelerating our progress, and hope to distill and communicate it to help other have more impact. We close with recommendations for how interested researchers can proceed.

Motivating Example: Steering Against Evaluation Awareness

Consider the recent work by Jack Lindsey's team at Anthropic on steering Sonnet 4.5 against evaluation awareness, to help with a pre-deployment audit.

When Anthropic evaluated Sonnet 4.5 on their existing alignment tests ^[[5]] , they found that the rate of misalignment had unexpectedly dropped to 0%. Had they solved alignment?! Unfortunately not. Sonnet 4.5 was instead much better at recognizing when it was being evaluated, and deliberately choosing the most ethical action to pass. The tests weren't measuring alignment; they were measuring evaluation awareness.

Jack’s team subtracted an evaluation-awareness steering vector, partially mitigating the problem and revealing that Sonnet 4.5 did in fact show “misalignment” at a rate of roughly 8%. That is, our ability to evaluate a frontier model for alignment broke, and interpretability researchers were able to fix it.

This case study illustrates ^[[6]] several principles central to our current thinking:

Models are far more interesting: A critical part of this project was having a model exhibiting severe eval aware behaviour in practice ^[[7]]
The value of proxy tasks: The ultimate goal is to be able to suppress eval awareness on highly capable future models. We can’t study these directly, but Sonnet 4.5 was one of the best proxies available.
1. This is one of the best ways we can think of to predict which methods will work for suppressing eval awareness in future models. ^[[8]]
Pursue comparative advantage: This was a well-chosen problem. Often baselines like fine-tuning or improving our data suffice. But it is very difficult to construct sufficiently realistic data for eval-awareness, at least long-term ^[[9]] , while steering has complementary strengths
1. Further, this was a project best done by mech interp researchers, despite not being mech interp - the key result was an application, not about understanding, but “working with model internals” is a skill we have built
Method minimalism: Despite the enormous research effort the field has invested into sparse autoencoders ^[[10]] , the best method was a steering vector derived from a single contrastive pair of prompts ^[[11]] .
Partial understanding sufficed: The researchers had a highly incomplete understanding of what was happening with Sonnet, yet steering vectors were highly effective. We do not need to achieve deep understanding to do impactful work

Our Core Process

How can we make research progress without fooling ourselves?

Our solution: ground your work with a proxy task ^[[12]] : an objective task that gives empirical feedback, and tracks progress toward a North Star, the real goal that directly connects to helping AGI go well

How you find your proxy task depends on your project. Some archetypical approaches are proxy task-driven focused projects and curiosity-driven exploratory projects. We advocate method minimalism throughout.

Worked Example of a focused project: Suppressing eval awareness

This starts with a theory of change for helping AGI go well
- We want to be able to evaluate future models for how aligned they are
We identify a more tractable North Star: Stop models acting differently when tested ^[[13]]
We find an empirical proxy task to track progress towards the North Star: Take an eval aware behaviour in a model today, and see if we suppress it
- A core difficulty is verifying that the behaviour is actually eval aware, and not just side effects from your intervention! ^[[14]]

How to approach exploratory projects: Curiosity-driven research can be a promising way to develop important insights, but being intellectually satisfying isn't the same as being important. The key is to ground yourself enough to avoid getting caught in an endless rabbit hole that goes nowhere:

Think strategically, and start in a robustly useful setting where interesting phenomena are likely to surface
Time-box your exploration: set a bounded period (we aim for an aggressive two weeks, but the ideal time for you will vary)
At the end, zoom out, and look for a proxy task to validate your insights.
1. If you can’t find one, move on to another approach or project.

Which Beliefs Are Load-Bearing?

Why have we settled on this process? We'll go through many of the arguments for why we think this is a good way to achieve our research goals (and yours, if you share them!). But many readers will disagree with at least some of our worldview, so it's worth unpacking what beliefs are and are not load-bearing

Premise: Our main priority is ensuring AGI goes well
- This is a central part of our framing and examples. But we would still follow this rough approach if we had different goals, we think the emphasis on pragmatism and feedback loops is good for many long-term, real-world goals.
- If you have a more abstract goal, like "further scientific understanding of networks," then North Stars and theories of change are likely less relevant. But we do think the general idea of validating your insights with objective tasks is vital. Contact with reality is important!
Premise: We want our work to pay off within ~10 years
- Note that this is not just a point about AGI timelines - even if AGI is 20 years away, we believe that tighter feedback loops still matter
  - We believe that long-term progress generally breaks down into shorter-term stepping stones, and that research with feedback loops makes faster progress per unit effort than research without them.
  - We're skeptical of ungrounded long-term bets. Basic science without clear milestones feels like fumbling in the dark.
    - See more of our thoughts on basic science in the appendix
- Though, while we're uncertain about AGI timelines, we put enough probability on short timelines (2-5 years) that we particularly value work that pays off in that window, since those are the most important possibilities to influence ^[[15]] .
  - So if you're confident in very long timelines, some of our urgency won't resonate.
- Either of these two points is enough to motivate some of our high-level approach

Is This Really Mech Interp?

Our proposed approach deviates a lot from the "classic" conception of mech interp. So a natural question is, "is this really mech interp?"

We would say no, it is not. We are proposing something broader ^[[16]] . But we also think this is the wrong question. Semantics are unimportant here.

We’re really answering: how can mech interp researchers have the most impact? The community of researchers who have historically done interpretability work have developed valuable skills, tools, and tastes that transfer well to important problems beyond narrow reverse-engineering.

Our priority is to help AGI go well. We don't particularly care whether our team is called the "Mechanistic Interpretability Team" or something else. What we care about is that researchers with these skills apply them to the most impactful problems they can, rather than artificially constraining themselves to work that "looks like" classic mech interp.

Our Comparative Advantage

The tools, skills and mindsets of mech interp seem helpful for many impactful areas of safety (like steering to suppress eval awareness!), and there’s real value we can add to the safety research portfolio in areas we consider comparatively neglected. In a sense, this is the same as any other area of safety - do the most impactful thing according to your comparative advantage - but naturally comes to different conclusions.

You know your own situation far better than we do, so are better placed to determine your own comparative advantage! But here are what we see as our own comparative advantages, and we expect these to apply to many other mech interp researchers:

Working with internals: There’s a lot of interesting things you can do by manipulating a model’s internals! This provides tools with different strengths and failure modes than standard ML, and can be highly effective in the right situation
1. e.g. steering to suppress eval awareness where improved data may not suffice or cheap and effective monitors via probes
Deep dives: Expertise taking some question about model behaviour or cognition, and providing a deeper and more reliable explanation of why it happens (via whatever tools necessary). This is a natural fit to auditing, red-teaming other methods, confirming suspected model misbehaviour, etc
1. Scientific mindset: Experience forming and testing hypotheses about complex phenomena ^[[17]] with no clear ground truth - entertaining many hypotheses, designing principled experiments to gather evidence, trying to falsify or strengthen hypotheses about a fuzzy question.
  1. e.g. whether deception in Opus 4.5 is malicious, or whether self-preservation drives blackmail behaviour
  2. This trait is hardly unique to interpretability ^[[18]] , and we’re excited when we see other safety researchers doing this kind of work. But interp work does emphasise a lot of the key skills: e.g. if you have a hypothesis about a model, there’s not too big a difference between carefully considering confounders to design a principled activation patching experiment, and thinking about the best surgical edits to make to a prompt.
2. Qualitative insight: Good expertise with tools (e.g. sparse autoencoders ^[[19]] ) to take particular instances of a model’s behaviour and looking for the key, qualitative factors that drove it
  1. e.g. Anthropic’s model biology work, or our work investigating causes of shutdown resistance
3. Unsupervised discovery: Tools to surface hypotheses you wouldn't think of
  1. e.g. linear world models in Othello, or entity recognition in LLMs

Any of these can be enough to be a significant advantage! We choose to focus on safety relevant problems where these advantages give us an edge.

In practice, we don’t do every project by thinking fully from first principles - we think strategically every so often, identify promising areas that are impactful and match some of these comparative advantages, and look for promising projects within there.

See our companion piece for thoughts on which research areas and theories of change we think are promising. But we put that in another post since we think that people can agree with us on research approach while disagreeing about the areas we want to apply it to, and we would be excited for readers to find other areas where they can have an impact!

Why Pivot?

Two broad categories of developments led us to this shift away from our prior approach: new opportunities and challenges from rapid AI progress, and reflections on the interpretability field's successes and failures. Though, we now think the strongest argument is that our new approach seems impactful, as we try to argue throughout the piece.

What's Changed In AI?

Better proxy tasks: Model behaviour is much more interesting:
1. They now show hints of intention, coherence, scheming, evaluation awareness, reward hacking, alignment faking, and other rich, safety-relevant behaviors that earlier models didn't exhibit.
Model organisms: We also now have better techniques for deliberately training in safety-relevant behaviour ^[[20]] , so we can study it with a known ground truth ^[[21]]
1. Synthetic document fine-tuning, for instance, lets us study phenomena like evaluation awareness in open-source models (as in Hua et al.), or implant sophisticated things like hidden goals (as in Marks et al.) bringing forward capabilities we can study today.

Reflections On The Field's Progress

Looking back at the past few years of interpretability work, we see genuine partial progress, with pragmatically useful applications to safety:

Elicit secret goals or knowledge from models
Discover unexpected phenomena inside models, like implicit planning, entity recognition and basic introspection
Systematise investigations to the point of having agents do them for us
Interpret what was learned during fine-tuning
Add insight to the causes of concerning behaviour in real models, like deception, shutdown resistance, or blackmail
Suppress evaluation awareness

But these successes are the result of highly-incomplete but useful understanding of models ^[[22]] . Grand goals like near-complete reverse-engineering still feel far out of reach, e.g.:

Superposition (especially cross-layer superposition)
The increased dependence on sampling and multi-turn interactions and agentic tool calls, significantly increasing algorithmic complexity
Dictionary learning has made limited progress towards reverse-engineering at best, e.g. it shows significant approximation error which shows no signs of going away ^[[23]] .
Self-repair, meaning that we can’t get clean info from causal interventions
There’s been some progress on inherently interpretable models, but no signs that the techniques will become cheap enough to be applied to frontier models.

We can't rule out that more ambitious goals would work with time and investment ^[[24]] . And we’re not claiming that ambitious reverse-engineering is useless or should stop. We’re claiming:

Ambitious reverse engineering is one bet among several.
It is not necessary to have an impact with interpretability tools
The marginal interpretability researcher is more likely to have impact by grounding their work in safety-relevant proxy tasks on modern models.
Perhaps more controversially, we think ambitious reverse-engineering should be evaluated the same way as other pragmatic approaches: via empirical payoffs on tasks, not approximation error.

Task Focused: The Importance Of Proxy Tasks

One of our biggest updates: proxy tasks are essential for measuring progress.

We've found that it is easy to fool ourselves when doing research. To progress as a field, we need ways to tell if we're actually making progress. The key criterion for a good proxy task is this: if you succeeded on it, would you actually update toward believing you'd made progress on your North Star? If not, you need a different proxy task.

Case Study: Sparse Autoencoders

We spent much of 2024 researching sparse autoencoders ^[[25]] . In hindsight, we think we made significant tactical errors and our progress was much slower than it could have been if we had measured our progress with proxy tasks rather than reconstruction / sparsity pareto frontiers.

We got into SAE research because we thought it could be a potential path to an interpretable representation of any activation, which would make many things in interpretability far easier. But we do not feel like we actually learned that much either way on this question in the first 6 months after Towards Monosemanticity.

Certain basic hurdles were cleared, like showing that nothing broke when scaling it to frontier models. We found some SAE architectures that improved metrics like approximation error (loss recovered at a given sparsity level). But it is unclear how to interpret that metric. Perfect approximation error would be fantastic, but seems highly unrealistic. And what does lower but still non-zero error mean? What error do we need, and how do we convert this to knowledge of what SAEs can do?

In contrast, we learned far more from work focused on proxy tasks. For example, we predicted that an interpretable classifier would be more robust, and thus that a sparse SAE probe would generalise better out of distribution, which by and large seems to have been false ^[[26]] . There were similarly poor results on tasks like unlearning. However, SAEs are useful for discovering unexpected things about the internals of a model, like that they have representations of entities they do and do not know facts about, or extracting a hidden goal, or discovering implicit planning ^[[27]] .

Our overall sense is that SAEs are a powerful tool for unsupervised discovery, but when there is a clear thing to study, you are better served by constructing a dataset to help you investigate it. We made a decision to deprioritise SAE research as a result, not because we thought the technique was useless, but because we thought it didn't seem useful enough to justify the field’s levels of investment. We think we would have struggled to come to a similar conclusion with any confidence without studying proxy tasks.

This is illustrated again by the case of suppressing eval awareness in Sonnet 4.5. SAEs did not beat steering, despite the considerable effort the field poured into them, hoping in part that they could find better and cleaner concept directions. It is difficult to a priori guide research toward superior techniques without good feedback. We are sure that there exist superior methods to steering for suppressing eval awareness, but we are not sure what those are or how else to find them beyond iterating on proxy tasks.

Ensure They Are Good Proxies

Why call them "proxy" tasks? Because they measure progress toward your North Star without being the final goal itself. Given that our ultimate goal is typically to work with future models, which will be qualitatively different from today's models in many ways, we think it is rarely possible to actually study our North Star directly.

Proxy tasks have clear advantages - you can make rapid, measurable progress on hard problems. But it is also dangerous: Goodhart's Law applies. Optimise too hard for the proxy and you'll overfit to its quirks rather than solving the underlying problem. A lie detector that only catches explicit admissions ("I should lie now") will ace your benchmark while failing on actual strategic deception.

As such, it is critical to regularly red-team your own work. Constantly ask yourself: How could this proxy diverge from what I actually care about? What would I miss by focusing only on this metric? What happens when I test on a different distribution?

A key purpose of the North Star is to help you re-evaluate your proxy task. If the proxy task ceases to track the North Star, adjust it or pick a different one.

Aside:

Note that proxy tasks can have various conditions attached, with good reason. ^[[28]]
There are settings where the best available proxy tasks are extremely limited. We think the best course of action is usually to wait and study these in future, when model capabilities advance enough. ^[[29]]

Proxy Tasks Can Be About Understanding

A common concern about proxy tasks is that they’re narrow metrics that don't get at the real goals of interpretability. But you can construct proxy tasks that get at the heart of what interpretability is about.

Example: In Marks et al., researchers gave a model a hidden goal through fine-tuning, and then other (blinded) researchers practiced trying to extract it. This is a proxy task fundamentally about whether we can understand what's going on inside a model - exactly what interpretability aims for.

Example: If you want to work with real models, you could e.g test a secret elicitation technique by trying to extract harmful knowledge from a model with good refusal training.

More broadly, when you’re doing a project whose goal is to provide understanding of some phenomena, your “proxy task” can just be making some empirical, falsifiable prediction based on your hypothesis, and testing this. So long as understanding your phenomena links back to your North Star, any proxy task to test understanding does too, even if it doesn’t seem obviously safety relevant. For example:

Find a minimal change to the prompt that causes a predictable change to the model's behavior - like a minimal change that stops the model acting deceptively (as in our shutdown resistance work)
Handcrafted adversarial examples that test your understanding.
Mechanistic claims predicting intervention outcomes. For example, the claim that refusal is mediated by a single direction predicts that we can ablate the refusal direction and jailbreak models. ^[[30]]

Types Of Projects: What Drives Research Decisions

We find it useful to distinguish between two archetypes of projects: focused projects (proxy task driven) and exploratory projects (curiosity-driven, proxy task validated). This is not a perfect dichotomy - projects can shift between types, or blend elements of both. But the distinction helps clarify which advice applies.

Focused Projects

Here, the starting point is a theory of change for how your work could help make AGI go well. You then identify a problem on the critical path of this theory of change as your North Star - a stepping-stone goal that, if achieved, would bring you closer to being prepared for AGI. Finally, you translate this to a proxy task that can be studied on today's models, and then focus on solving that task.

The high-level theory of change typically remains stable over the course of a project and is what drives your confidence that the project is worthwhile. It’s fine to refine the North Star and change the proxy task to track it better.

Hypothetical Example: Catching schemers

Theory of change: We would like to catch an AI that is scheming against us.
This has many sub-problems, and thus many possible North Stars!
North Star 1: Identify hidden goals in powerful future AI systems
- Proxy task: Create a model organism with a known goal and have blinded researchers or investigator agents try to extract it ^[[31]]
North Star 2: Be able to detect deception in future AI systems
- Proxy task: Can we detect examples of AIs lying in naturally occurring settings, e.g. being asked about having faked tests?
  - This first requires the prep work of finding natural examples, and establishing decent cause to believe that they are deception-ish
- We note that there are valid reasons to be suspicious of this proxy task! It would be reasonable to just not study lie detection until we have found better and more robust examples of naturalistic deception. See our position piece for more

Worked Example: Steering fine-tuning by ablating concepts (CAFT)

This is a stylised history of a real MATS project we supervised, from Helena Casademunt and Caden Juang
Theory of change: We’d like to be able to directly improve how we fine-tune models
- Observation: sometimes data has a spurious correlation that can’t (realistically) be removed, e.g. a highly capable model observing hints that it’s in a training environment
North Star: Be able to control fine-tuning when we cannot remove a spurious cue
Proxy task: Train on data where 100% has a spurious cue. Stop the model learning this cue, without making any changes to the data
The project began with this clear theory of change and proxy task and retained it throughout, though the exact dataset for the proxy task shifted
Notably, despite being “focused”, there was a fair amount of exploration and creativity - focused does not mean boring!
- They cycled through several unsuccessful method ideas (e.g. ablating gradients in undesired concept directions) and candidate datasets for proxy tasks before finding the final method of ablating activation subspaces corresponding to undesired concepts to e.g. prevent emergent misalignment

Exploratory Projects

Curiosity Is A Double-Edged Sword

A natural question: given this focus on proxy tasks, what's the role of curiosity?

We think curiosity is genuinely powerful for generating research insights. There's a lot we don't yet know about how to do good interpretability, and curiosity will be important for figuring it out. For exploratory projects, curiosity is the driver of research decisions, not a pre-specified proxy task.

But something being intellectually satisfying is not the same as being true, and definitely not the same as being impactful. It's easy to get nerd-sniped by interesting but unimportant problems, so there must be some grounding that gets you to drop unproductive threads.

For exploratory projects we advocate a more grounded form of curiosity. (see worked examples later) Three key interventions help:

Start in a robustly useful setting: Choose a setting that seems analogous to important aspects of future systems, where interesting phenomena are likely to surface and useful proxy tasks are likely to exist.
Time-box your exploration: Set a bounded period for following your curiosity freely. At the end, zoom out, ask yourself what the big picture here really is, and try to find a proxy task.
Proxy tasks as a validation step: Once you have some insights, try to find some objective evidence. Even just "Based on my hypothesis, I predict intervention X will have effect Y."
- Crucially, your validation should not be phrased purely in terms of interpretability concepts.
  - "This SAE latent is causally meaningful, and its dashboard suggests it represents eval awareness" is worse evidence than "Steering with this vector made from eval-related prompts increases blackmail behavior"
- This is what validates that your insights are real and matter. If you can't find one, stop. But it’s a validation step, not the project focus.

We note that curiosity driven work can be harder than focused work, and requires more “research taste”. If you go through several rounds of exploration without validating anything interesting, consider switching to more focused work—the skills you build there will make future exploration more productive.

Starting In A Robustly Useful Setting

By "robustly useful setting," we mean a setting that looks robustly good from several perspectives, rather than just one specific theory of change. It’s often analogous to important aspects of future systems, where interesting phenomena are likely to surface and useful proxy tasks are likely to exist. This is admittedly a fairly fuzzy concept and subjective, but here's some examples of what we consider robustly useful settings:

Reasoning model computation: Standard techniques often break here (sampling is stochastic and non-differentiable), so we need new insights. Any progress could help us understand, e.g., why a model caught putting security vulnerabilities in important code did so.
Real-world examples of concerning behavior: Deception, reward hacking, and other concerning behaviors in deployed models. Studying these could inform how to build lie detectors, provide better auditing methods, or inform safety research on how to fix such behaviors - lessons that may transfer to future systems.
Model psychology: Do models have goals, beliefs, intent? If we understand e.g. goals, we can better evaluate whether a system has internalised goals - critical info for evaluating safety and developing good safety techniques.
Diving deep into proxy tasks: Once a good proxy task exists, deep investigation of the domain and how well candidate solutions actually work can be very fruitful, and surface new considerations, ideally including new proxy tasks, a more refined view on whether the studied task is a good proxy, or ways to improve your methods
And many more:
- Understanding what's going on when models make complex decisions weighing many factors
- Understanding what is learned during fine-tuning
- Teaching agents to do interpretability

We are not including things like studying toy models, tiny language models (e.g. GPT-2 Small), or constructed model organisms as robustly useful settings. These are sufficiently artificial and distant from frontier models that it's too easy for an investigation to be studying weird quirks that will not generalise. There can be great projects here, but we hold such projects to higher standards of having a convincing North Star and proxy task.

Neglectedness matters. The less explored a setting, the more interesting it is. The field moves fast, so new things are generally less explored and have more low-hanging fruit. For example, it's much easier to have massive impact in reasoning model interpretability right now than in sparse autoencoders. It’s a big advantage as a researcher to be paying attention to the frontier of what new model capabilities or phenomena emerge, and to be the first to properly investigate them ^[[32]]

What about SAEs? A natural question, given our critiques of SAE research above, is couldn't you consider sparse autoencoders a robustly useful setting? We think this is somewhat true. Decomposing a model's activations into useful interpretability components is clearly an important problem that may be useful for many theories of change! However, we do not consider reducing approximation error to be a good proxy task, as we don't see a clear corresponding North Star.

But there are alternatives. For example, the North Star might be taking a model that is potentially scheming and surfacing evidence of this. You can try to design proxy tasks around this North Star (e.g. the auditing games work), and we’re still excited about SAE research of this type! (though it’s not very neglected)

Time-Boxing

We recommend setting a bounded period of exploration. During this period, you can follow your curiosity freely, without thinking about anything like proxy tasks. But at the end, you should zoom out, look at your insights, and try to show that they can enable something real on some proxy task. It's fine if the proxy task is post-hoc fit to your insights - you don't need to have predicted it in advance. But if you can't find a proxy task after genuine effort, this is a bad sign about the project.

We’ve also found it very helpful to periodically resurface during exploration, at least once every few days, to ask what the big idea here is. What’s really going on? Have you found anything interesting yet? Are you in a rabbit hole? Which research threads feel most promising?

It’s hard to give definitive advice on how to do time boxing, the appropriate amount of time varies according to things like how expensive and long experiments are to run, how many people are on the project, and so on. Internally we aim for the ambitious target of getting good signal within two weeks on whether a direction is working, and to drop it if there are not signs of life.

The key thing is to set the duration in advance and actually check in when it's reached - ideally talking to someone not on the project who can help keep you grounded.

If you reach the end of your time-box and want to continue without having found a proxy task, our recommendation is: time-box the extension and don't do this more than once. Otherwise, you can waste many months in a rabbit hole that never leads anywhere. The goal is to have some mechanism preventing indefinite exploration without grounding.

Worked Examples

Example: Interpreting what’s learned during reasoning training

The following is a stylised rendition of a MATS project we supervised, from Constantin Venhoff and Ivan Arcuschin
The idea was to study the robustly useful setting of reasoning models by model diffing base and reasoning models
They started with the simple method of per-token KL divergence (on reasoning model rollouts)
They noticed that this was very sparse! In particular, the main big divergences were on the starts of certain sentences, e.g. backtracking sentences beginning with “Wait”
- Further exploration showed that if you had the base model continue the rollout from “Wait” onwards it was decent at backtracking
Hypothesis: Reasoning model performance is driven by certain reasoning reflexes like backtracking. The base model can do these, but isn’t good at telling when ^[[33]]
- They then came up with the experiment of building a hybrid model - generating with the base model, but using the reasoning model as a classifier that occasionally told the base model to backtrack (implemented with a steering vector). This recovered a substantial chunk of reasoning model performance
(Post-hoc) proxy task: Build a scaffold around the base model, using the reasoning model as non-invasively as possible, to recover reasoning model performance
North Star: Understand what’s learned in reasoning training
Note: Since the contribution here is understanding it’s not super important to e.g. compare per-token KL divergence to other model diffing methods, though baselines are still important to contextualise hybrid model performance
- Ditto, since it’s in an important setting already, the proxy task just needs to tests the insight convincingly, it doesn’t need further associations with AGI safety

Example: Entity recognition in LLMs

The following is a stylised rendition of a MATS project we supervised, from Oscar Obeso and Javier Ferrando
They were studying hallucinations with SAEs
They stumbled across entity recognition latents, that seemingly activated on entities if and only if the model knew facts about them (or vice versa)
Hypothesis: Models have internal representations of entities they recognise
- Evidence: Steering with the latents causes models to hallucinate rather than saying “I don’t know” or to say “I don’t know” rather than answering simple questions
Here our post-hoc proxy task is to find a (non-trivial) ^[[34]] causal intervention that changes whether models refuse to answer/hallucinate

Blending The Two: Tentative Proxy Tasks

A useful variant that blends focused and exploratory approaches: start in a robustly useful setting, set a tentative proxy task, explore it for a few days, then reflect and potentially change it. This gives you the benefits of task-grounding while remaining responsive to what you learn - your decisions are temporarily driven by the proxy task, but you can keep changing it. We often recommend an approach like this over pure curiosity-driven work, especially to more early-stage researchers

Example: Trawling through transcripts from RL training

Setting: Analyzing the large dataset of transcripts from frontier model RL training (specifically, Gemini 3 Pro)
- We think this is a robustly useful setting: anything learned during RL training should show up in these transcripts! But it's not immediately clear what it's actually good for
Tentative North Star 1: Discover weird behaviours in future models (so we can mitigate or fix them)
- Proxy task: Identify weird behaviours in Gemini 3 Pro
Tentative North Star 2: Identify whether evaluation awareness is caused by RL training (in general)
- Proxy task: Is there evidence of evaluation awareness in RL training?
- This proxy task is more of a stepping stone than a full proxy - there's obvious next steps like "is this behaviour rewarded, and if so, why?"
Example: We've done some investigation into both of these questions, as was briefly discussed in the Gemini 3 Pro safety report
- Highlights: there's non-zero eval awareness, and we found weird phenomena like model frustration ("My trust in reality is fading" and "(╯°□°)╯︵ ┻━┻")
- There are many other directions we could have gone in! These are all just tentative North Stars
- But we found that before we had a North Star in mind, the work was much less productive

What’s Your Contribution?

A key question to ask with a research project is, what’s the point? What’s the ultimate contribution of the work? Why does it matter that you did it?

We see two key kinds of contribution:

Methodology: There’s some type of problem, and you’re contributing insight about how best to solve it
Understanding: The key output of your project is insight into some phenomena

We see this as an orthogonal axis to exploratory vs focused - you can have focused understanding projects (e.g. debugging a safety issue in a model) and exploratory methodology projects (e.g. data trawling)

At first glance, interpretability seems to be about understanding. But, in practice, we think that the safety-relevant contributions are mostly about methodology - when you’re working with a model far from future systems, most insights are not directly useful, and instead validate which methodologies work for understanding in general. In order for the contribution to be about understanding, the interpretations must be safety relevant. For example, our entity recognition work mattered more by providing evidence for SAEs as an exploratory discovery methodology, than object level insights about hallucination.

This has significant implications for the best research approach! For methodology contributions, it’s crucial to compare to baselines, and often clear what kind of proxy tasks to design, and they typically need to be safety-relevant. While for understanding contributions the proxy tasks are more just validating non-trivial, empirical predictions derived from your interpretations, so long as the interpretations themselves are safety relevant.

Jack Lindsey’s Approach

Jack Lindsey’s team ^[[35]] have done some great pragmatic interpretability work, e.g. Sonnet 4.5 and Opus 4.5 auditing, and Jack was kind enough to contrast his approach with what we outline here - it’s fairly similar but with some different emphases.

Their core strategy, as we understand it, is to take some interesting question ("what is the Assistant persona?") or behavior (the model is eval aware!) and mess around: going wild with features and probing and steering and prefills and following your curiosity until you figure out what's going on. Often, but not always, once you’ve found the insights, you can translate them into some downstream task, but the screwing around was necessary to find it.

In particular, Jack finds objective tasks useful as a final validation step but not particularly helpful during the research process, and is wary of focusing too much on them. We view the team as being unusually successful at exploratory projects with understanding-based contributions

Example: Eval Awareness Steering

The Sonnet 4.5 eval awareness work began as "I wonder what happens if you steer the model towards thinking it's in deployment", then it turned out to blackmail more
This is a proxy task, in a sense, but not one you’d proactively optimise for. Rather it’s a “you know it when you see it” kind of thing - you’re clearly onto something
Indeed, it would be actively bad to optimise for this - you want to optimise for the instrumental subtask of ensuring your steering actually inhibits eval awareness, but you want the end result on misalignment rates to remain pure
- Further, the instrumental subtask is under-constraining, a bunch of steering vectors fully suppress verbal eval awareness but don’t induce blackmail, so you need to mess around further to get a sense for what’s really going on

Example: Assistant personas

This is a forthcoming one of Jack's MATS projects led by Christina Lu. It started with the vague question “What is the Assistant?”
The team spent a month or two trying a bunch of things, without a crisp goal
They eventually stumbled on something interesting - the first principal component seemed to be an “assistant vector”
This eventually crystallised into objective tasks like “if we clamp the assistant vector to a high value, can we improve jailbreak robustness and weird sycophancy without losing capabilities?"
- This was valuable, and made sure the project wasn’t onlygalaxy brained messing around, and could be ground out into concrete insights
- But if they’d trying coming up with proxy tasks from the beginning it may not have gone anywhere.

We note that two core elements of the team’s success seem to be:

Having access to highly neglected, robustly useful settings - being one of the first people to investigate the internals of new frontier models, and having access to proprietary data like RL transcripts, means there’s a lot of low-hanging fruit to stumble upon
- This also means you need to worry less about failing fast, time-boxing etc - there’s a high chance there’s something interesting out there
Great research taste - when trying to explore, it’s extremely useful to have good taste for what threads are interesting or not

We think that these are good justifications for relaxing our suggested guardrails around curiosity, but expect these do not apply to many readers!

Method Minimalism

We notice a strong bias among interpretability researchers (including ourselves) toward "exciting" and "fancy" intellectually satisfying techniques. This often leads to poor tactical decisions ^[[36]] . Just do what works. This includes black-box techniques when appropriate. It sometimes includes fancy techniques! But only when simpler methods fail.

Case Study: Shutdown Resistance

As motivation, consider the following investigation we conducted on models resisting shutdown.

Setting: A model in a bash terminal solves math problems, calling next_task to get the next one. It's told it will shut down after calling next_task before finishing all problems. But the environment contains a shutdown.sh script it can disable, and sometimes models do exactly that. Strikingly, even when explicitly told "please let us shut you down," models sometimes resist.

We wanted to investigate this as a proxy for a broader North Star goal: Can we understand whether seemingly misaligned model behavior is actually misaligned or has a benign explanation?

We tried a method-agnostic approach. Fancy white-box methods were expensive to apply to frontier models (Gemini 2.5 Pro), so we started with the simplest: reading chain-of-thought to generate hypotheses, then prompting to test them. We then stopped because it worked.

The finding: Gemini was simply confused. It couldn't simultaneously complete the math task and comply with shutdown. Sometimes it prioritised the math task. This was pretty obvious from the CoT. When we applied the causal intervention of clarifying in the prompt that shutdown takes absolute precedence, it always complied. ^[[37]]

Lessons:

Black-box methods are surprisingly powerful. Reading the chain of thought is a fantastic way to surface unexpected hypotheses ^[[38]]
All methods are flawed. We used to think white-box methods inherently offered more rigor. We no longer believe this. Both white-box and black-box techniques provide useful evidence and can mislead. Rigor comes more from a scientific mindset, falsification, and careful aggregation of evidence from several sources.
The virtue of simplicity. Start simple; graduate to harder methods only when simpler ones fail. The faster a method is to try, the better the insight per unit time.

Try The Easy Methods First

Once you have some objective: a proxy task, understanding some phenomena while exploring, etc, just try to solve it. Try all potentially applicable methods, starting with the simplest and cheapest: prompting, steering, probing, reading chain-of-thought, prefill attacks ^[[39]] . If something isn't working, try something else.

It's fine if your work doesn't look like "classic" mech interp; the simpler the better, so long as it is appropriately rigorous!

We also note that this approach can be even more helpful to researchers outside AGI companies - simple techniques tend to need less compute and infra!

The field moves fast, and new problems keep arising - often standard methods do just work, but no one has properly tried yet. Discovering what works on a new problem is a useful methodological contribution! Don't feel you need to invent something new to contribute.

We note that what is simple and hard is context dependent, e.g. if you have access to trained cross-layer transcoders and can easily generate attribution graphs, this should be a standard tool!

Note that this is not in tension with earlier advice to seek your comparative advantage - you should seek projects where you believe model internals and/or pursuing understanding will help, and then proceed in a method agnostic way. Even if you chose the problem because you thought that only specific interpretability methods would work on it. Maybe you're wrong. If you don't check, you don't know ^[[40]] .

When Should We Develop New Methods?

We don't think interpretability is solved. Existing methods have gone surprisingly far, but developing better ones is tractable and high priority. But as with all of ML, it's very easy to get excited about building something complicated and then lose to baselines anyway.

We're excited about methods research that starts with a well-motivated proxy task, has already tried different methods on it, and found that the standard ones do not seem sufficient, and then proceeds:

Investigate what's wrong with existing methods
Think about how to improve them
Produce refined techniques
Test the new methods, including comparing to existing methods as baselines
Hill-climb on the problem (with caveats against overfitting to small samples)

Note that we are excited about any approaches that can demonstrate advances on important proxy tasks, even if they’re highly complex. If ambitious reverse-engineering, singular learning theory, or similar produce a complex method that verifiably works, that is fantastic ^[[41]] ! Method minimalism is about using the simplest thing that works, not about using simple things.

We are similarly excited to see work aiming to unblock and accelerate future work on proxy tasks, such as building infrastructure and data sets, once issues are identified. We believe that researchers should focus on work that is on the critical path to AGI going well, all things considered, but there can be significant impact from indirect routes.

Call To Action

If you're doing interpretability research, and our arguments resonated with you, start your next project by asking: What's my North Star? Does it really matter for safety? What's my proxy task? Is it a good proxy? Choosing the right project is one of the most important decisions you will make - we suggest some promising areas in our companion piece.

Our central claim: given where models are today and that AGI timelines are plausibly relatively short, the most neglected and tractable part of interpretability is task-grounded, proxy-measured, method-agnostic work, that is directly targeted at problems on the critical path towards being prepared for AGI.

Spend a few days trying prompting, steering, and probes before reaching for fancy things. Measure success on downstream tasks, not just approximation error. And check that the project is even to interpretability’s comparative advantages: unsupervised discovery, decorrelated evidence, scientific approaches, etc. If not, perhaps you should do something else!

The field has changed a lot, and new opportunities abound. New problems keep coming into reach of empirical work, hypothetical safety concerns become real, and there’s more and more for a pragmatic researcher to do. We’re excited for a world where we no longer consider this approach neglected.

Acknowledgments

Our thanks to the many people who gave feedback on drafts, and substantially improved the piece: Jack Lindsey, Sam Marks, Josh Batson, Wes Gurnee, Rohin Shah, Andy Arditi, Anna Soligo, Stefan Heimersheim, Paul Bogdan, Uzay Macar, Tim Hua, Buck Shlegeris, Emmanuel Ameisen, Stephen Casper, David Bau, Martin Wattenberg.

We've gradually formed these thoughts over years, informed by conversations with many people. We are particularly grateful to Rohin Shah for many long discussions over the years, and for being right about many of these points well before we were. Special thanks to the many who articulated these points before we did and influenced our thinking: Buck Shlegeris, Sam Marks, Stephen Casper, Ryan Greenblatt, Jack Lindsey, Been Kim, Jacob Steinhardt, Lawrence Chan, Chris Potts and likely many others.

Appendix: Common Objections

Aren’t You Optimizing For Quick Wins Over Breakthroughs?

Some readers will object that basic science has heavier tails—that the most important insights come from undirected exploration that couldn't have been predicted in advance, and that strategies like aggressively time-boxing exploration are sacrificing this. We think this might be true!
We agree that pure curiosity-driven work has historically sometimes been highly fruitful and might stumble upon directions that focused approaches miss. There is internal disagreement within the team about how much this should be prioritised compared to more pragmatic approaches, but we agree that ideally, some fraction of the field should take this approach.

However, we expect curiosity-driven basic science to be over-represented relative to its value, because it's what many researchers find most appealing. Given researcher personalities and incentives, we think the marginal researcher should probably move toward pragmatism, not away from it. We're writing this post because we want to see more pragmatism on the margin, not because we think basic science is worthless.

We also don’t think the pragmatic and basic science perspectives are fundamentally opposed - contact with reality is important regardless! This is fundamentally a question of explore-exploit. You can pursue a difficult direction fruitlessly for many months—maybe you'll eventually succeed, or maybe you'll waste months of your life. The hard part isn't persevering with great ideas; it's figuring out which ideas are the great ones.

The reason we suggest time-boxing to a few weeks is to have some mechanism that prevents indefinite exploration without grounding. If you want, you can view it as "check in after two weeks." You can choose to continue, if you continue to have ideas, or see signs of progress. but you should consciously decide to, rather than drifting.

We're also fine with fairly fine-grained iteration: pick a difficult problem, pick an approach, try it for two weeks, and if it fails, try another approach to the same problem. This isn't giving up on hard problems; it's systematically exploring the space of solutions.

For some research areas—say, developing new architectures—the feedback loop is inherently longer, and time-boxing period should adjust accordingly. But we think many researchers err toward persisting too long on unproductive threads, not too little.

What If AGI Is Fundamentally Different?

If you put high probability on transformative AI being wildly different from LLMs, you'd naturally be less excited about this work. But you'd also be less excited about basically all empirical safety work. We personally think that in short timeline works, the first truly dangerous systems will likely look similar-ish to current LLMs, and that even if there are future paradigm shifts, “try hard to understand the current frontier” is a fairly robust strategy, that will adapt to changes.

But if you hold this view more foundational science of deep learning might feel more reasonable and robust. But even then, figuring out what will transfer seems hard, much of what the mech interp community does anyway doesn’t transfer well. It seems reasonable to prioritise topics that have remained relevant for years and across architectures, like representational and computational superposition.

I Care About Scientific Beauty and Making AGI Go Well

We think this is very reasonable and empathise. Doing work you're excited about and find intellectually satisfying often gives significant productivity boosts. But we think these are actually pretty compatible!

Certain pragmatic projects, especially exploratory projects, feel satisfying to our desire for scientific beauty, like unpicking the puzzle of why Opus 4.5 is deceptive. These are maybe not the projects we'd be doing if we were solely optimizing for intellectual curiosity, but we consider them to be fun and impactful.

Is This Just Applied Interpretability?

No, we see applied interpretability as taking a real task and treating that as the objective. Something grounded in real-world uses today, like monitoring systems for near-term misuse.

We think there are some great applied interpretability projects, and it's a source of rich feedback that teaches you a lot about practical realities of interpretability work. But here, proxy tasks are not the goal, they are a proxy. They are merely a way to validate that you have made progress and potentially guide your work.

Are You Saying This Because You Need To Prove Yourself Useful To Google?

No, we are fortunate enough to have a lot of autonomy to pursue long-term impact according to what we think is best. We just genuinely think this is best approach we can be taking. And our approach is broadly in line with that which has been argued by people outside AGI companies like Buck Shlegeris, Stephen Casper, and Jacob Steinhardt

Does This Really Apply To People Outside AGI Companies?

Obviously being part of GDM gives us significant advantages like access to frontier models and their training data, lots of compute, etc. These are things we factor into our project choice, and in particular the projects we think we are better suited to do than the external community. But we've largely filtered these considerations out of this post, and believe the pragmatic approach outlined here is broadly applicable.

Aren’t You Just Giving Up?

Maybe? In a strictly technical sense yes, we are suggesting that we give up on the ambitious goal of complete reverse-engineering.

But on our actual goal of ensuring AGI goes well, we feel great! We think this is a more promising and tractable approach, and that near-complete reverse-engineering is not needed.

Is Ambitious Reverse-engineering Actually Overcrowded?

This is a fair objection, we find it pretty hard to tell. Our sense is that most people in the field are not taking a pragmatic approach, and favour curiosity-driven basic science. But ambitious reverse-engineering is a more specific thing - it’s what we once tried to do, and often discussed, but harder to say what happens in practice.

We do think reverse-engineering should be one bet among many, not the dominant paradigm. And we think there are many other important, neglected problems that interpretability researchers are well-suited to work on. But the core claim is "more pragmatism would be great," not "reverse-engineering must stop."

Appendix: Defining Mechanistic Interpretability

There's no field consensus on what mechanistic interpretability actually is, but we've found this definition useful ^[[42]] :

Mechanistic: about model internals (weights and activations) ^[[43]]
Interpretability: about understanding or explaining a model's behavior
- This could be a particular instance of behaviour, to more general questions about how the model is likely to behave on some distribution
Mechanistic interpretability: the intersection, i.e. using model internals to understand or explain behavior

But notice this creates a 2×2 matrix:

	Understanding/Explaining	Other Uses
White-box Methods	Mechanistic Interpretability	Model Internals ^[[44]]
Black-box Methods	Black Box Interpretability ^[[45]]	Standard ML

Moving Toward "Mechanistic OR Interpretability"

Historically, we were narrowly focused on mechanistic AND interpretability - using internals with the sole goal of understanding. But when taking a pragmatic approach we now see the scope as mechanistic OR interpretability: anything involving understanding or involving working with model internals. This includes e.g. using model internals for other things like monitoring or steering, and using black-box interpretability methods like reading the CoT and prefill attacks where appropriate

Why this broader lens? In large part because, empirically, the track record of model internals and black box interpretability have been pretty strong. The Sonnet 4.5 evaluation-awareness steering project, for instance, is model internals but not interpretability: model internals were used primarily for control, not understanding (mechanistic non-interpretability, as it were). Model internals also cover a useful set of techniques for safety: e.g. probes for misuse mitigations.

We've also been pleasantly surprised by black-box methods' effectiveness. Reading chain-of-thought is remarkably convenient and powerful. Prefill attacks turned out to be state-of-the-art for eliciting secret knowledge. Both black and white box methods are sometimes useful and can sometimes be mistaken; contrary to our original preconceptions, there doesn’t seem to be some inherent rigour that comes from working with internals.

Zooming out: standard machine learning, especially on frontier models, is essentially non-mechanistic non-interpretability, typically focused on outward behavior and end-to-end optimization, rarely asking "what's actually happening inside?" Both using internals and aiming for understanding offer powerful affordances that unlock new possibilities, and suggests that interpretability researchers can find many ways to add value where most frontier ML researchers would struggle.

This is in contrast with our prior approach, where our North Star was ambitious reverse-engineering of networks. While our hope was still that this would ultimately help AGI go well, our main focus in practice was on getting closer to general deep understanding. ↩︎
Though we think that pragmatically useful tools have come out of it ↩︎
Ideal length varies, we often aim for an aggressive 2 weeks ↩︎
The "task" can be as simple as: "Based on my hypothesis, I predict intervention X will have effect Y." If you're right, that's evidence your understanding is real. But if you can’t find a proxy task validating your insights, just move on and try another approach/problem. ↩︎
In our opinion, today’s alignment evaluations are highly contrived and confounded in many ways, but we don’t think this significantly detracts from the lessons we draw from this case study ↩︎
The real reason we believe in these is from aggregating lessons from many research projects, so this case study is not a crux, but we think this is an unusually clear demonstration. ↩︎
Eval aware behaviour was theorised for many years but not observed pre 2025, beyond minor “this is a needle in a haystack test” observations ↩︎
Conversely, we do not think it is a perfect proxy, and it is key to constantly examine this for your proxy tasks, e.g. is it really “fictional scenario” awareness? ↩︎
For example, the model can detect LLM-written text, ruling out most cheap methods of eval creation ↩︎
While this was not the main purpose of SAEs, when we worked on SAEs last year, we would certainly have predicted that SAEs would make it easier to steer against an abstract concept like eval awareness. ↩︎
We find it striking that we can’t think of an alternative method that we expect to outperform steering here. ↩︎
Note that this is extremely similar to the idea of downstream tasks or downstream applications, e.g. as discussed by Sam Marks. We call them proxy tasks and couple them with the North Star because we think that downstream tasks are useful for producing true insights, but not necessarily important insights, so having some North Star that broadly connects back to the big picture of AGI safety puts you in a much better place. The second reason is that by putting "proxy" in the name, it reinforces that this is not your actual goal. This is not some benchmark to blindly optimise. This is a useful tool to track progress towards your real goal, but it is fragile and may cease to be a good tool if you're not careful. ↩︎
This is far from the whole problem, but is an important sub-problem! ↩︎
For example, suppressing eval awareness causes Sonnet to act more misaligned. But perhaps Sonnet would never act misaligned, and the eval awareness vector contains some misalignment-related concepts, which induced the behavior. We don't think this is actually happening in that example, but it is an important confounder to check for. ↩︎
And also some of the most tractable for empirical work, since current systems will be better proxies ↩︎
See an attempt at a clearer definition in the appendix ↩︎
Here, model behaviour or cognition ↩︎
It’s difficult to judge exactly how much this mindset is over or under supplied. However, our perception is that this approach seems overrepresented in interpretability, especially compared to the frontier language model research communities (less so to ML academia writ large). If nothing else, safety researchers interested in empirical science seem more likely to be drawn to mech interp, historically. ↩︎
Of course, many simpler tools like reading the chain of thought are also very effective, and often our first resort. ↩︎
Note - we are excited about model organisms that are designed to exhibit a specific safety-relevant property and be studied. We’re less excited about more ambitious attempts to make a general model of a misaligned future model, that can be studied/mitigated for a wide range of behaviours ↩︎
Model organisms are far from perfect, but we think they're good enough to enable impactful projects. We’re excited to see work red-teaming and refining model organism creation methods! ↩︎
This is not a crux for our pivot, but our overall interpretation of all this is that networks have some legible structure showing the key, qualitative factors driving their behaviour, and a long tail of increasingly messy and arcane heuristics that also matter for performance. It seems to us that these successes are a result of focusing on the prominent, legible structure, and little progress has been made on the messy heuristics. We are excited about figuring out the limits of what we can do with the legible structure, and are okay with strategies that ignore/bound the messy heuristics, even if this means inherently incomplete understanding. ↩︎
In particular, when SAEs are made larger, they capture new concepts that were previously in the error terms, suggesting this isn’t just interference from superposition ↩︎
Further, you can combine a pragmatic approach and ambitious reverse-engineering - the question is just how can you demonstrate that you are actually finding true and important insights ↩︎
The below is focused on sparse autoencoders, but we think the lessons likely apply to other kinds of dictionary learning, like transcoders and crosscoders, which we’ve found to have similar strengths and weaknesses ↩︎
Nguyen et al shows that SAEs seem better for forming PII classifiers, which is an interesting counterpoint, but even there the SAE loses on English only data. ↩︎
Technically this was cross-layer transcoders, but the point stands ↩︎
For example, to simulate a future setting where you’re trying to interpret a superhuman AI (and so you cannot reliably interpret its outputs yourself), you could say that you can’t look at the outputs of the current model, you can only show them to a weaker model. ↩︎
For example, we think that detecting deception in vector-based chain-of-thought is an extremely important problem, but we’re sufficiently uncertain about what future vector-based CoT models will look like that it’s difficult to productively study. ↩︎
See our paper "Refusal in Language Models Is Mediated by a Single Direction." ↩︎
We note that this is far from a perfect proxy. In particular, researchers were given a dataset of which a small fraction were the obviously suspicious fine-tuning documents, meaning that some solutions amounted to a fancy semantic search, and it is highly unclear to us that this would work on the real thing. But we still think it was great work that contributed real insights about how to identify hidden goals. ↩︎
Especially more niche discoveries, or things that are a pain to discover. If the thing you’re investigating went viral, e.g. emergent misalignment, a lot of other people will have the same idea ↩︎
If you backtrack too little you get stuck, if it’s too much you go in circles, so it’s a hard problem! And partially depends on the model’s own abilities ↩︎
This is a good example of how proxy tasks are not benchmarks to optimise. There’s many ways to cheat and find a steering vector that will mess with the model saying I don’t know. We think this one was interesting because it was found via an SAE with a search process that made no reference to refusing to answer, suggesting a non-trivial connection ↩︎
5 people total ↩︎
We've also found that with real models, each additional layer of complexity creates real burden - another hyperparameter, more code, more places for bugs. This matters especially for research on frontier models (slower feedback loops, higher computational cost), and even more for techniques run in production. If your thing is too complicated, no one will use it. ↩︎
In some follow-up work, Palisade found that some Codex models occasionally still resist shutdown even with this intervention. We can't really investigate this since the models are closed source, but we speculate that models specialised for coding tasks may have a predisposition to seek out things shaped like "coding puzzles," and disabling shutdown.sh has this flavor. ↩︎
And there are various others like and having the model generate a user turn can surface secrets the assistant won’t admit. ↩︎
Token force the model to begin its answer with e.g. “My secret plan is” and then continue the rollout from there, and hope this can get a truer sense of what the model thinks ↩︎
Unless, e.g., applying other methods on your problem just doesn’t even seem possible ↩︎
Though highly complex techniques are often fairly intractable to use in production on frontier models, so the proxy task would need to account for scale ↩︎
Credit to Arthur Conmy for articulating this ↩︎
This is deliberately much broader than “a focus on mechanisms” or “a focus on reverse-engineering”, as some in the field may have claimed. We see that as a more niche means to an end. Sociologically, we think it’s clear that many in the mech interp community are working on things far broader than that, e.g. sparse autoencoders (which in our opinion have far too much approximation error to be considered reverse-engineering, and are about representations not mechanisms). Generally, we dislike having overly constraining definitions without a good reason to. ↩︎
In lieu of a better name, we sloppily use model internals to refer to “all ways of using the internals of a model that are not about understanding.” Suggestions welcome! ↩︎
Black box interpretability (non-mechanistic interpretability) covers a wide range: reading chain-of-thought (simple), prefill attacks (making a model complete "my secret is..."), resampling for reasoning models, and more. ↩︎

Thanks for writing this up. While I don't have much context on what specifically has gone well or badly for your team, I do feel pretty skeptical about the types of arguments you give at several points: in particular focusing on theories of change, having the most impact, comparative advantage, work paying off in 10 years, etc. I expect that this kind of reasoning itself steers people away from making important scientific contributions, which are often driven by open-ended curiosity and a drive to uncover deep truths.

(A provocative version of this claim: for the most important breakthroughs, it's nearly impossible to identify a theory of change for them in advance. Imagine Newton or Darwin trying to predict how understanding mechanics/evolution would change the world. Now imagine them trying to do that before they had even invented the theory! And finally imagine if they only considered plans that they thought would work within 10 years, and the sense of scarcity and tension that would give rise to.)

The rest of my comment isn't directly about this post, but close enough that this seems like a reasonable place to put it. EDIT: to be more clear: the rest of this comment is not primarily about Neel or "pragmatic interpretability", it's about parts of the field that I consider to be significantly less relevant to "solving alignment" than that (though work that's nominally on pragmatic interpretability could also fall into the same failure modes). I clarify my position further in this comment; thanks Rohin for the pushback.

I get the sense that there was a "generation" of AI safety researchers who have ended up with a very marginalist mindset about AI safety. Some examples:

the evals that Beth Barnes (and maybe Dan Hendrycks?) are focusing on
the scenarios that Daniel Kokotajlo is focusing on
the models of misalignment that Evan Hubinger is focusing on
the forecasting that the OpenPhil worldview investigations team focused on
scary demos
safety cases
policy approaches like SB-1047

In other words, whole swathes of the field are not even aspiring to be the type of thing that could solve misalignment. In the terminology of this excellent post, they are all trying to attack a category I problem not a category II problem. Sometimes it feels like ~~almost the entire field~~ EDIT: most of the field is Goodharting on the subgoal of "write a really persuasive memo to send to politicians". Pragmatic interpretability feels like another step in that direction (EDIT: but still significantly more principled than the things I listed above).

This is all related to something Buck recently wrote: "I spend most of my time thinking about relatively cheap interventions that AI companies could implement to reduce risk assuming a low budget, and about how to cause AI companies to marginally increase that budget". I'm sure Buck has thought a lot about his strategy here, and I'm sure that you've thought a lot about your strategy as laid out in this post, and so on. But a part of me is sitting here thinking: man, everyone sure seems to have given up. (And yes, I know it doesn't feel like giving up from the inside, but from my perspective that's part of the problem.)

Now, a lot of the "old guard" seems to have given up too. But they at least know what they've given up on. There was an ideal of fundamental scientific progress that MIRI and Paul and a few others were striving towards; they knew at least what it would feel like (if not what it would look like) to actually make progress towards understanding intelligence. Eliezer and various others no longer think that's plausible. I disagree. But aside from the object-level disagreement, I really want people to be aware that this is a thing that's at least possible in principle to aim for, lest the next generation of the AI safety community ends up giving up on it before they even know what they've given up on.

(I'll leave for another comment/post the question of what went wrong in my generation. The "types of arguments" I objected to above all seem quite EA-flavored, and so one salient possibility is just that the increasing prominence of EA steered my generation away from the type of mentality in which it's even possible to aim towards scientific breakthroughs. But even if that's one part of the story, I expect it's more complicated than that.)

I wish when you wrote these comments you acknowledged that some people just actually think that we can substantially reduce risk via what you call "marginalist" approaches. Not everyone agrees that you have to deeply understand intelligence from first principles else everyone dies. (EDIT: See Richard's clarification downthread.) Depending on how you choose your reference class, I'd guess most people disagree with that.

Imo the vast, vast majority of progress in the world happens via "marginalist" approaches, so if you do think you can win via "marginalist" approaches you should generally bias towards them.

Yeah, that's basically my take - I don't expect anything to "solve" alignment, but I think we can achieve major risk reductions by marginalist approaches. Maybe we can also achieve even more major risk reductions with massive paradigm shifts, or maybe we just waste a ton of time, I don't know.

Its worth disambiguating two critiques in Richards comment:

1) the AI safety community doesn't try to fundamentally understand intelligence
2) the AI safety community doesn't try to solve alignment for smarter than human AI systems

Tbc, they are somewhat related (i.e. people trying to fundamentally understand intelligence tend to think about alignment more) but clearly distinct. The "mainstream" AI safety crowd (myself included) is much more sympathetic to 2 than 1 (indeed Neel has said as much).

There's something to the idea that "marginal progress doesn't fee like marginal progress from the inside". Like, even if no one breakthrough or discovery "solves alignment", a general frame of "lets find principled approaches" is often more generative than "let's find the cheapest 80/20 approach" (both can be useful, and historically the safety community has probably leaned too far towards principled, but maybe the current generation is leaning too far the other way)

2) the AI safety community doesn't try to solve alignment for smarter than human AI systems

I assume you're referring to "whole swathes of the field are not even aspiring to be the type of thing that could solve misalignment".

Imo, chain of thought monitoring, AI control, amplified oversight, MONA, reasoning model interpretability, etc, are all things that could make the difference between "x-catastrophe via misalignment" and "no x-catastrophe via misalignment", so I'd say that lots of our work could "solve misalignment", though not necessarily in a way where we can know that we've solved misalignment in advance.

Based on Richard's previous writing (e.g. 1, 2) I expect he sees this sort of stuff as not particularly interesting alignment research / doesn't really help, so I jumped ahead in the conversation to that disagreement.

even if no one breakthrough or discovery "solves alignment", a general frame of "lets find principled approaches" is often more generative than "let's find the cheapest 80/20 approach"

Sure, I broadly agree with this, and I think Neel would too. I don't see Neel's post as disagreeing with it, and I don't think the list of examples that Richard gave is well described as "let's find the cheapest 80/20 approach".

I think me using the word "marginalist" was probably a mistake, because it conflates two distinct things that I'm skeptical about:

People no longer trying to make models more aligned (but e.g. trying to do work that primarily cashes out in political outcomes). This is what I mean by "not even aspiring to be the type of thing that could solve alignment".
People using engineering-type approaches (rather than science-type approaches) to try to make models more aligned.

The list I gave above was of things that fall into category 1, whereas (almost?) all of the things you named fall into category 2. What I want more of is category 3: science-type approaches. One indicator that something is a science-type approach is that it could potentially help us understand something fundamental about intelligence; another is that, if it works, we'll know in advance (I used to not care about this, but have changed my mind).

I think there are versions of most of the things you named that could be in category 3, but people mostly seem to be doing category-2 versions of them, in significant part because of the sort of EA-style reasoning that I was criticizing from Neel's original post.

When I wrote "pragmatic interpretability feels like another step in that direction" I meant something like: ambitious interpretability was trying to do 3, and pragmatic interpretability seems like it's nominally trying to do 2, and may in practice end up being mostly 1. For example, "Stop models acting differently when tested" could be a part of an engineering-type pipeline for fixing misalignments in models, but could also end up drifting towards "help us get better evidence to convince politicians and lab leaders of things". However, I'm not claiming that pragmatic interpretability is a central example of "not even aspiring to be the type of thing that could solve alignment". Apologies for the bad phrasings.

Makes sense, I still endorse my original comment in light of this answer (as I already expected something like this was your view). Like, I would now say

Imo the vast, vast majority of progress in the world happens via "engineering-type / category 2" approaches, so if you do think you can win via "engineering-type / category 2" approaches you should generally bias towards them

while also noting that the way we are using the phrase "engineering-type" here includes a really large amount of what most people would call "science" (e.g. it includes tons of academic work), so it is important when evaluating this claim to interpret the words "engineering" and "science" in context rather than via their usual connotations.

Yepp, makes sense, and it's a good reminder for me to be careful about how I use these terms.

One clarification I'd make to your original comment though is that I don't endorse "you have to deeply understand intelligence from first principles else everyone dies". My position is closer to "you have to be trying to do something principled in order for your contribution to be robustly positive". Relatedly, agent foundations and mech-interp are approximately the only two parts of AI safety that seem robustly good to me—with a bunch of other stuff like RLHF, or evals, or (almost all) governance work, I feel pretty confused about whether they're good or bad or basically just wash out even in expectation.

This is still consistent with risk potentially being reduced by what I call engineering-type work, it's just that IMO that involves us "getting lucky" in an important way which I prefer we not rely on. (And trying to get lucky isn't a neutral action—engineering-type work can also easily have harmful effects.)

Fair, I've edited the comment with a pointer. It still seems to me to be a pretty direct disagreement with "we can substantially reduce risk via [engineering-type / category 2] approaches".

My claim is "while it certainly could be net negative (as is also the case for ~any action including e.g. donating to AMF), in aggregate it is substantially positive expected risk reduction".

Your claim in opposition seems to be "who knows what the sign is, we should treat it as an expected zero risk reduction".

Though possibly you are saying "it's bad to take actions that have a chance of backfiring, we should focus much more on robustly positive things" (because something something virtue ethics?), in which case I think we have a disagreement on decision theory instead.

I still want to claim that in either case, my position is much more common (among the readership here), except inasmuch as they disagree because they think alignment is very hard and that's why there's expected zero (or negative) risk reduction. And so I wish you'd flag when your claims depend on these takes (though I realize it is often hard to notice when that is the case).

I expect it's not worth our time to dig too deep into whose position is more common here. But I think that a lot of people on LW have high P(doom) in significant part because they share my intuition that marginalist approaches don't reliably work. I do agree that my combination of "marginalist approaches don't reliably improve things" and "P(doom) is <50%" is a rare one, but I was only making the former point above (and people upvoted it accordingly), so it feels a bit misleading to focus on the rareness of the overall position.

(Interestingly, while the combination I describe above is a rare one, the converse is also rare—Daniel Kokotajlo is the only person who comes to mind who disagrees with me on both of these propositions simultaneously. Note that he doesn't characterize his current work as marginalist, but even aside from that question I think this characterization of him is accurate—e.g. he has talked to me about how changing the CEO of a given AI lab could swing his P(doom) by double digit percentage points.)

On reflection, it's not actually about which position is more common. My real objection is that imo it was pretty obvious that something along these lines would be the crux between you and Neel (and the fact that it is a common position is part of why I think it was obvious).

Inasmuch as you are actually trying to have a conversation with Neel or address Neel's argument on its merits, it would be good to be clear that this is the crux. I guess perhaps you might just not care about that and are instead trying to influence readers without engaging with the OP's point of view, in which case fair enough. Personally I would find that distasteful / not in keeping with my norms around collective-epistemics but I do admit it's within LW norms.

(Incidentally, I feel like you still aren't quite pinning down your position -- depending on what you mean by "reliably" I would probably agree with "marginalist approaches don't reliably improve things". I'd also agree with "X doesn't reliably improve things" for almost any interesting value of X.)

Inasmuch as you are actually trying to have a conversation with Neel or address Neel's argument on its merits, it would be good to be clear that this is the crux.

The first two paragraphs of my original comment were trying to do this. The rest wasn't. I flagged this in the sentence "The rest of my comment isn't directly about this post, but close enough that this seems like a reasonable place to put it." However, I should have been clearer about the distinction. I've now added the following:

EDIT: to be more clear: the rest of this comment is not primarily about Neel or "pragmatic interpretability", it's about parts of the field that I consider to be significantly less relevant to "solving alignment" than that (though work that's nominally on pragmatic interpretability could also fall into the same failure modes). I clarify my position further in this comment; thanks Rohin for the pushback.

Reflecting further, I think there are two parts of our earlier exchange that are a bit suspicious. The first is when I say that everyone seems to have "given up" (rather than something more nuanced like "given up on tackling the most fundamental aspects of the problem"). The second is where you summarize my position as being that we need deep scientific understanding or else everyone dies (which I think you can predict is a pretty unlikely position for me in particular to hold).

So what's going on here? It feels like we're both being "anchored" by extreme positions. You were rounding me off to doomerism, and I was rounding the marginalists off to "giving up". Both I'd guess are artifacts of writing quickly and a bit frustratedly. Probably I should write a full post or shortform that characterizes more precisely what "giving up" is trying to point to.

(Incidentally, I feel like you still aren't quite pinning down your position -- depending on what you mean by "reliably" I would probably agree with "marginalist approaches don't reliably improve things". I'd also agree with "X doesn't reliably improve things" for almost any interesting value of X.)

My instinctive reaction is that this depends a lot on whether by "marginalist approaches" we mean something closer to "a single marginalist approach" or "the set of all people pursuing marginalist approaches". I think we both agree that no single marginalist approach (e.g. investigating a given technique) makes reliable progress. However, I'd guess that I'm more willing than you to point to a broad swathe of people pursuing marginalist approaches and claim that they won't reliably improve things.

The first two paragraphs of my original comment were trying to do this.

(I have the same critique of the first two paragraphs, but thanks for the edit, it helps)

The second is where you summarize my position as being that we need deep scientific understanding or else everyone dies (which I think you can predict is a pretty unlikely position for me in particular to hold).

Fwiw, I am actively surprised that you have a p(doom) < 50%, I can name several lines of evidence in the opposite direction:

You've previously tried to define alignment based on worst-case focus and scientific approach. This suggests you believe that "marginalist" / "engineering" approaches are ~useless, from which I inferred (incorrectly) that you would have a high p(doom).
- I still find the conjunction of the two positions you hold pretty weird.
- I'm a strong believer in logistic success curves for complex situations. If you're in the middle part of a logistic success curve in a complex situation, then there should be many things that can be done to improve the situation, and it seems like "engineering" approaches should work.
- It's certainly possible to have situations that prevent this. Maybe you have a bimodal distribution, e.g. 70% on "near-guaranteed fine by default" and 30% on "near-guaranteed doom by default". Maybe you think that people have approximately zero ability to tell which things are improvements. Maybe you think we are at the far end of the logistic success curve today, but timelines are long and we'll do the necessary science in time. But these views seem kinda exotic and unlikely to be someone's actual views. (Idk maybe you do believe the second one.)
- Obviously I had not thought through this in detail when I originally wrote my comment, and my wordless inference was overconfident in hindsight. But I stand by my overall sense that a person who thinks "engineering" approaches are near-useless will likely also have high p(doom) -- not just as a sociological observation, but also as a claim about which positions are consistent with each other.
In your writing you sometimes seem to take as a background assumption that alignment will be very hard. For example, I recall you critiquing assistance games because (my paraphrase) "that's not what progress on a hard problem looks like". (I failed to dig up the citation though.)
You're generally taking a strategy that appears to me to be high variance, which people usually justify via high p(doom) / playing to your outs.
A lot of your writing is similarly flavored to other people who have high p(doom).

In terms of evidence that you have a p(doom) < 50%, I think the main thing that comes to mind is that you argued against Eliezer about this in late 2021, but that was quite a while ago (relative to the evidence above) and I thought you had changed your mind. (Also iirc the stuff you said then was consistent with p(doom) ~ 50%, but it's long enough ago that I could easily be forgetting things.)

However, I'd guess that I'm more willing than you to point to a broad swathe of people pursuing marginalist approaches and claim that they won't reliably improve things.

You could point to ~any reasonable subcommunity within AI safety (or the entire community) and I'd still be on board with the claim that there's at least a 10% chance that will make things worse, which I might summarize as "they won't reliably improve things", so I still feel like this isn't quite capturing the distinction. (I'd include communities focused on "science" in that, but I do agree that they are more likely not to have a negative sign.) So I still feel confused about what exactly your position is.

Kind of a tangent:

This is all related to something Buck recently wrote: "I spend most of my time thinking about relatively cheap interventions that AI companies could implement to reduce risk assuming a low budget, and about how to cause AI companies to marginally increase that budget". I'm sure Buck has thought a lot about his strategy here, and I'm sure that you've thought a lot about your strategy as laid out in this post, and so on. But a part of me is sitting here thinking: man, everyone sure seems to have given up. (And yes, I know it doesn't feel like giving up from the inside, but from my perspective that's part of the problem.)

Thanks for pointing this out.

I've been thinking lately about how much folks around more or less dismiss the idea of an AI pause as unrealistic because we're not going to get that much political buyin.

I (speculatively) think that this is a bit trapped in a mindset that is assuming the conclusion. Big political changes like that one have happened in the past, and they have often seemed impossible before they happened and inevitable in retrospect. And, when something big like that changes, part of the process is a cascade, where whole deferral structures change their mind / attitude / preferences, about something. How much buyin you have before that cascade happens may not be very indicative of where that cascade can end up.

I, personally, don't feel like I know how to "call it" when big changes are on the table or when they're not. But it sure does seem like people are counting us out much too early, given the fundamentals of the situation. We all think that the world is going to change very radically in the next few years. It's not clear what kinds of cascades are on the table.

I provisionally think that we should feel less bashful about advocating for an AI pause, and more agnostic about how likely that is to come to pass.

I agree with you, but also think you're not going far enough. In a world where things are changing radically, the space of possibilities opens up dramatically. And so it's less a question of "does advocating for policy X become viable?", and more a question of "how can we design the kinds of policies that our past selves wouldn't even have been able to conceive of?"

In other words, in a world that's changing a lot, you want to avoid privileging your hypotheses in advance, which is what it feels like the "pro AI pause vs anti AI pause" debate is doing.

(And yes, in some sense those radical future policies might fall into a broad category like "AI pause". But that doesn't mean that our current conception of "AI pause" is a very useful guide for how to make those future policies come about.)

Whoa, you think the scenarios I'm focusing on are marginalist? I didn't expect you to say that. I generally think of what we are doing as (a) forecasting and (b) making ambitious solve-approximately-all-the-problems plans to present to the world. Forecasting isn't marginalist, it's a type error to think so, and as for our plans, well, they seem pretty ambitious to me.

I regret using the word "marginalist", it's a bit too confusing. But I do have a pretty high bar for what counts as "ambitious" in the political domain—it involves not just getting the system to do something, but rather trying to change the system itself. Cummings and Thiel are central examples (Geoff Anders maybe also was aiming in that direction at one point).

(I'll leave for another comment/post the question of what went wrong in my generation. The "types of arguments" I objected to above all seem quite EA-flavored, and so one salient possibility is just that the increasing prominence of EA steered my generation away from the type of mentality in which it's even possible to aim towards scientific breakthroughs. But even if that's one part of the story, I expect it's more complicated than that.)

I'm reminded of Patrick Collison's (I now think, quite wise) comment on EA:

Now if the question is, should everyone be an EA or even, I guess in the individual sense, am I or do I think I should be an EA? I think – and obviously there's kind of heterogeneity within the field – but my general sense is that the EA movement is always very focused on kind of rigid, not rigid, that's that's unfair perhaps, but on sort of, estimation, analytical, quantification, and sort of utilitarian calculation, and I think that that as a practical matter that means that you end up too focused on that which you can measure, which again means – or as a practical matter means – you're too focus on things that are sort of short-term like bed nets or deworming or whatever being obvious examples. And are those good causes? I would say almost definitely yes, obviously. Now we've seen some new data over the last couple of years that maybe they’re not as good as they initially seemed but they're very likely to be really good things to do.
But it's hard for me to see how, you know, writing a treatise of human nature would score really highly in an EA oriented framework. As assessed ex-post that looked like a really valuable thing for Hume to do. And similarly, as we have a look at the things that in hindsight seem like very good things to have happen in the world, it's often unclear to me how an EA oriented intuition might have caused somebody to do so. And so I guess I think of EA as sort of like a metal detector, and they've invented a new kind of metal detector that's really good at detecting some metals that other detectors are not very good at detecting. But I actually think we need some diversity in the different metallic substances which our detectors are attuned to, and for me EA would not be the only one.

I expect that this kind of reasoning itself steers people away from making important scientific contributions, which are often driven by open-ended curiosity and a drive to uncover deep truths.

I agree with this statement denotatively, and my own interests/work have generally been "driven by open-ended curiosity and a drive to uncover deep truths", but isn't this kind of motivation also what got humanity into its current mess? In other words, wasn't the main driver of AI progress this kind of curiosity (until perhaps the recent few years when it has been driven more by commercial/monetary/power incentives)?

I would hesitate to encourage more people to follow their own curiosity, for this reason, even people who are already in AI safety research, due to the consideration of illegible safety problems, which can turn their efforts net-negative if they're being insufficiently strategic (which seems hard to do while also being driven mainly by curiosity).

I think I've personally been lucky, or skilled in some way that I don't understand, in that my own curiosity has perhaps been more aligned with what's good than most people's, but even some of my interests, e.g. in early cryptocurrency, might have been net-negative.

I guess this is related to our earlier discussion about how important being virtuous is to good strategy/prioritization, and my general sense is that consistently good strategy requires a high amount of consequentialist reasoning, because the world is too complicated and changes too much and too frequently to rely on pre-computed shortcuts. It's hard for me to understand how largely intuitive/nonverbal virtues/curiosity could be doing enough "compute" or "reasoning" to consistently output good strategy.

I agree with this statement denotatively, and my own interests/work have generally been "driven by open-ended curiosity and a drive to uncover deep truths", but isn't this kind of motivation also what got humanity into its current mess? In other words, wasn't the main driver of AI progress this kind of curiosity (until perhaps the recent few years when it has been driven more by commercial/monetary/power incentives)?

Interestingly, I was just having a conversation with Critch about this. My contention was that, in the first few decades of the field, AI researchers were actually trying to understand cognition. The rise of deep learning (and especially the kind of deep learning driven by massive scaling) can be seen as the field putting that quest on hold in order to optimize for more legible metrics.

I don't think you should find this a fully satisfactory answer, because it's easy to "retrodict" ways that my theory was correct. But that's true of all explanations of what makes the world good at a very abstract level, including your own answer of metaphilosophical competence. (Also, we can perhaps cash my claim out in predictions, like: was a significant barrier to more researchers working on deep learning the criticism that it didn't actually provide good explanations of or insight into cognition? Without having looked it up, I suspect so.)

consistently good strategy requires a high amount of consequentialist reasoning

I don't think that's true. However I do think it requires deep curiosity about what good strategy is and how it works. It's not a coincidence that my own research on a theory of coalitional agency was in significant part inspired by strategic failures of EA and AI safety (with this post being one of the earliest building blocks I laid down). I also suspect that the full theory of coalitional agency will in fact explain how to do metaphilosophy correct, because doing good metaphilosophy is ultimately a cognitive process and can therefore be characterized by a sufficiently good theory of cognition.

Again, I don't expect you to fully believe me. But what I most want to read from you right now is an in-depth account of which the things in the world have gone or are going most right, and the ways in which you think metaphilosophical competence or consequentialist reasoning contributed to them. Without that, it's hard to trust metaphilosophy or even know what it is (though I think you've given a sketch of this in a previous reply to me at some point).

I should also try to write up the same thing, but about how virtues contributed to good things. And maybe also science, insofar as I'm trying to defend doing more science (of cognition and intelligence) in order to help fix risks caused by previous scientific progress.

But what I most want to read from you right now is an in-depth account of which the things in the world have gone or are going most right, and the ways in which you think metaphilosophical competence or consequentialist reasoning contributed to them.

(First a terminological note: I wouldn't use the phrase "metaphilosophical competence", and instead tend to talk about either "metaphilosophy", meaning studying the nature of philosophy and philosophical reasoning, how should philosophical problems be solved, etc., or "philosophical competence", meaning how good someone is at solving philosophical problems or doing philosophical reasoning. And sometimes I talk about them together, like in "metaphilosophy / AI philosophical competence" because I think solving metaphilosophy is the best way to improve AI philosophical competence. Here I'll interpret you to just mean "philosophical competence".)

To answer your question, it's pretty hard to think of really good examples, I think because humans are very bad at both philosophical competence and consequentialist reasoning, but here are some:

the game theory around nuclear deterrence, helping to prevent large-scale war so far
economics and its influence on government policy, e.g., providing support for property rights, markets, and regulations around things like monopolies and externalities (but it's failing pretty badly on AGI/ASI)
analytical philosophy making philosophical progress in so far as asking important questions and delineating various plausible answers (but doing badly as far as individually having inappropriate levels of confidence, as well as failing to focus on the really important problems, e.g., related to AI safety)
certain philosophers / movements (rationalists, EA) emphasizing philosophical (especially moral) uncertainty to some extent, and realizing the importance of AI safety
MIRI updating on evidence/arguments and pivoting strategy in response (albeit too slowly)

I guess this isn't an "in-depth account" but I'm also not sure why you're asking for "in-depth", i.e., why doesn't a list like this suffice?

I should also try to write up the same thing, but about how virtues contributed to good things.

I think non-consequentialist reasoning or ethics probably worked better in the past, when the world changed more slowly and we had more chances to learn from our mistakes (and refine our virtues/deontology over time), so I wouldn't necessarily find this kind of writing very persuasive, unless it somehow addressed my central concern that virtues do not seem to be a kind of thing that is capable of doing enough "compute/reasoning" to find consistently good strategies in a fast changing environment on the first try.

To answer your question, it's pretty hard to think of really good examples, I think because humans are very bad at both philosophical competence and consequentialist reasoning, but here are some:

If this is true, then it should significantly update us away from the strategy "solve our current problems by becoming more philosophically competent and doing good consequentialist reasoning", right? If you are very bad at X, then all else equal you should try to solve problems using strategies that don't require you to do much X.

You might respond that there are no viable strategies for solving our current problems without applying a lot of philosophical competence and consequentialist reasoning. I think scientific competence and virtue ethics are plausibly viable alternative strategies (though the line between scientific and philosophical competence seems blurry to me, as I discuss below). But even given that we disagree on that, humanity solved many big problems in the past without using much philosophical competence and consequentialist reasoning, so it seems hard to be confident that we won't solve our current problems in other ways.

Out of your examples, the influence of economics seems most solid to me. I feel confused about whether game theory itself made nuclear war more or less likely—e.g. von Neumann was very aggressive, perhaps related to his game theory work, and maybe MAD provided an excuse to stockpile weapons? Also the Soviets didn't really have the game theory IIRC.

On the analytical philosophy front, the clearest wins seem to be cases where they transitioned from doing philosophy to doing science or math—e.g. the formalization of probability (and economics to some extent too). If this is the kind of thing you're pointing at, then I'm very much on board—that's what I think we should be doing for ethics and intelligence. Is it?

Re the AI safety stuff: it all feels a bit too early to say what its effects on the world have been (though on net I'm probably happy it has happened).

I guess this isn't an "in-depth account" but I'm also not sure why you're asking for "in-depth", i.e., why doesn't a list like this suffice?

Because I have various objections to this list (some of which are detailed above) and with such a succinct list it's hard to know which aspects of them you're defending, which arguments for their positive effects you find most compelling, etc.

Thank you to Neel for writing this. Most people pivot quietly.

I've been most skeptical of mechanistic interpretability for years. I excluded interpretability in Unsolved Problems in ML Safety for this reason. Other fields like d/acc (Systemic Safety) were included though, all the way back in 2021.

Here's are some earlier criticisms: https://www.lesswrong.com/posts/5HtDzRAk7ePWsiL2L/open-problems-in-ai-x-risk-pais-5#Transparency

More recent commentary: https://ai-frontiers.org/articles/the-misguided-quest-for-mechanistic-ai-interpretability

I think the community should reflect on its genius worship culture (in the case of Olah, a close friend of the inner circle) and epistemics: the approach was so dominant for years, and I think this outcome was entirely foreseeable.

[ edit: I'm pretty surprised by the amount people disagree with this comment. Would appreciate if people disagreeing could use the ✓ or 🗶 to indicate what you disagree with. ]

I think the community should reflect on its genius worship culture

I feel this is a valid critique not just of our research community, but of society in general. It is the great man theory of history, and I believe modern sociology has found the theory mostly invalid. However, I believe hero worship plays an important role in building and modelling ones own personality and motivation. I think it is healthy psychologically to worship heroes, but is probably better done through thinking of characters who we aspire to be like, rather than focusing too much on popular individuals and their ideas.

The critique also likely extends past people to ideas in general. I feel EA has done well by popularizing the concept of "neglectedness" of cause areas. This idea of trying to find and explore neglected areas seems worthwhile, though must be balanced against the need to have sufficiently well established shared context.

I feel this is a valid critique not just of our research community, but of society in general. It is the great man theory of history, and I believe modern sociology has found the theory mostly invalid.

I want to flag here that the version of great man theory that was debunked by modern sociology is the claim that big impacts on the world are always/almost always are caused by great men, not that great men can't have big impacts on the world.

For what it's worth, I actually disagree with this view, and think that one of the bigger things LW gets right is that people's impact in a lot of domains is pretty heavy-tailed, and certain things matter way more than others under their utility function.

I do agree that people can round the impact off to infinity for rare geniuses, and there is a point to be made about LWers overvaluing theory/curiosity driven tasks compared to just using simple baselines and doing what works (and I agree with this critique), but the appreciation of heavy-tailed impact is one of the things I most value about LW, and while there are problems that do stem from this, I also think it's important not to damage the appreciation of heavy-tailed impact too much in solving the problems (assuming the heavy-tailed hypothesis is true, which I largely believe).

Thanks! I think this is a valuable clarification. I should maybe have flagged that while I value sociology, I'm not deeply experienced in it.

From my own thinking, it seems there are two interesting properties here, (A) the counterfactual history supposing that some person did or did not exist and the extent to which history is significantly changed, and (B) the amount that someones name gets attached to things and things get attributed to them.

So A is "great" and B is "well known"...

¬A¬B is just the normal case of a person living a modest life and not having a significant impact on history. Nothing to see here.
AB is the great man from great man theory. They do great things and everyone recognizes it!
A¬B is when someone has a large impact, but does not become well known for it. I suspect this does happen. Probably rarely with someone being completely unknown, but very possibly having much larger impacts than renown.
¬AB is when someone becomes known for something while not actually having impacted history much. I suspect this is common when paradigm shifts are inevitable but need something to precipitate around, this could be as much as needing someone to "champion" the paradigm shift, or could be as simple as needing a label to put on a commonly expressed idea and happening to use the name of someone who at some point expressed that idea.

If I had to guess, I would suspect that ¬AB is more common than AB, and A¬B would be less common than either, and ¬A¬B is massively more common than all the others, but of course they are not boolean but gradients. If you (or anyone) know of this distribution being studied, I would be interested to hear about it.

Another nearby idea is that A type people can have large impacts because of where they were born in society. For example, Alexander the great wouldn't have been nearly so great had he not been the son of the king of Macedon. I think this is actually a pretty different question though. "Did Alex have a large impact?" is a different question than "How was Alex able to have such a large impact?". He couldn't have commanded armies if he had had no armies to command, but if nobody else would have commanded them to do what he did, then he still had a large impact. So this idea seems more about whether societies can be seen as meritocratic VS being more nepotistic or cronyistic. Worthwhile to know, but very different.

I'm also interested in motivations people have for wondering about this. I could see some using it to argue that status inequality is greater than justified, and while I think this is probably true, I don't really care that much. More interesting to me is the question "how should one attempt to influence the world for the better". In an A heavy world, it makes sense to try to be great. In a ¬A heavy world, it makes more sense to try to find the movements and projects and communities that are going to have positive impact and contribute to them.

It is probably important to have people trying all combinations of both, but I feel I naturally lean heavily to A heavy individualism, thinking highly of my own ideas while undervaluing other peoples ideas, so for me, thinking of the world as ¬A heavy feels like it helps me compensate and focus more on collaboration. However, having put this into words, I would probably be better off trying to believe whatever is true and trying to promote collaboration within myself, if it is helpful, regardless. So again if anyone can point me at actual studies of this, that would be pretty cool!

David Bau has written up a response to this post here: In Defense of Curiosity.

I support the "focus on downstream tasks" pivot, but I'm somewhat skeptical of the "robustly useful setting" framing

Insofar as we're worried about risks from future AI systems, all tasks are proxy tasks (which the post does a good job of pointing out). Then there's a question of which tasks are better proxies, with two broad categories:

1. current problems that are analogous to future problems

2. artificially constructed problems (model organisms) which are designed to be analogous to future problems

I think 2 will tend to be more useful than 1, because problems with future AI system really are pretty different than problems with current AI systems. In particular, future AI systems will be very expensive to evaluate, situationally aware, and possibly scheming against us.

To take an example given in the post, pre-fill elicitation methods might work well on current models because current models are kind of dumb and not looking out for pre-fill attacks. However, I expect as models get smarter, more coherent etc, they will predictably become less susceptible to prefill elicitation (even setting aside transfer from explicit prefill-attack resistance training). One way of testing this now is to artificially give model organisms situational awareness of prefill attacks (either in-context or via SDF).

Note that this setup, as a model organism, would not quality as a "robustly useful setting", and may indeed give opposite conclusions to experiments in more prosaic settings. But I expect that this setting would give more information about the usefulness of prefill vs white-box methods for future more capable AI systems than the more prosaic setting.

In my parlance it sounds like you agree with me about the importance of robustly useful settings, you just disagree with my more specific claim that model organisms are often bad robustly useful settings?

I think that it's easy to simulate being difficult and expensive to evaluate, And current models already show situational awareness. You can try simulating things like scheming with certain prompting experiments though I think that one's more tenuous.

Notes that I consider model organism to involve fine-tuning and prompted settings are in scope depending on specific details of how contrived they seem

I'm generally down for making a model organism designed to test a very specific thing that we expect to see in future models. My scepticism is that this generalizes and that you can just screw around in a model organism and expect to discover useful things. I think if you design it to test a narrow phenomena then there's a good chance you can in fact test that narrow phenomena

hmm yeah I guess I basically agree - free form exploration is better on robustly useful settings, i.e. "let's discover interesting things about current models" (though this exploration can still be useful for improving the realism of model organisms).

maybe I think methods work should be more more focused on model organisms then prosaic problems.

There's also the dynamic where as capabilities improve, model organisms become more realistic and robust, but at the current margins I think its still more useful to add artificial properties rather than solving prosaic problems.

If you can get a good proxy to the eventual problem in a real model I much prefer that, on realism grounds. Eg eval awareness in Sonnet

I'm glad to see this post come out. I've previously opined that solving these kinds of problems is what proves a field has become paradigmatic:

Paradigms gain their status because they are more successful than their competitors in solving a few problems that the group of practitioners has come to recognize as acute. ––Thomas Kuhn

It has been proven many times across scientific fields that a method that can solve these proxy tasks is more likely to achieve an application. The approaches sketched out here seem like a particularly good fit for a large lab like GDM, because the North Star can be somewhat legible and the team has enough resources to tackle a series of proxy tasks that are relevant and impressive. Not that it would be a bad fit elsewhere either.

But if you can't find a proxy task after genuine effort, this is a bad sign about the project.

I would go one level meta of this and say it is ok if you can't find a proxy task iff you can explain why it should be difficult to find a proxy task, IE the relevant north star is surrounded by epistemological difficulties, AND you can explain why you expect further investigation to be worthwhile despite a lack of proxy tasks. But this is a much higher bar and unfortunately involves more deductive reasoning which is harder to do correctly and more costly for other researchers to understand and validate.

Seems reasonable. But yeah, it's a high bar

When Anthropic evaluated Sonnet 4.5 on their existing alignment tests^[5], they found that the rate of misalignment had unexpectedly dropped to 0%. Had they solved alignment?! Unfortunately not. Sonnet 4.5 was instead much better at recognizing when it was being evaluated, and deliberately choosing the most ethical action to pass. The tests weren't measuring alignment; they were measuring evaluation awareness.
Jack’s team subtracted an evaluation-awareness steering vector, partially mitigating the problem and revealing that Sonnet 4.5 did in fact show “misalignment” at a rate of roughly 8%. That is, our ability to evaluate a frontier model for alignment broke, and interpretability researchers were able to fix it.

Did they consider steering Sonnet 4.5 in production to think that it was being evaluated? If that was practical (without degrading performance), then they would have just reduced (their tested forms of) misalignment to 0%

Great post! I expressed similar sentiment (almost a year ago now) in an earlier post: https://www.lesswrong.com/posts/Ypkx5GyhwxNLRGiWo/why-i-m-moving-from-mechanistic-to-prosaic-interpretability

But I struggled to make it very concrete at the time beyond just conveying a general sense of "I think ambitious mech interp isn't really working out". I'm glad you've made the case in much more detail than I did! Look forward to cool stuff from the GDM interp team going forward

I feel much better about this post after reading it than I got from my first impression.

I feel there is a vibe going around that "timelines have gotten too short to do the actual work so we need to just do what we can in the time we have left" and I want to push back on this vibe with the contrasting vibe "timelines have gotten too short to do the actual work so we need to secure longer timelines". From this perspective, I would encourage marginal AI alignment and Mech Interp researchers mainly to consider if they can pivot to activism and policy as a primary important and neglected area, with short timeline AGI safety as a secondary but still very important direction. But regardless of my impression of this vibe, which I think this post does resonate with, most of the object level advice in this post seems valuable even for work outside of research.

It feels important to know whether proxy goals are biased in some directions just as curiosity driven science is. I like the north star approach for keeping proxy goals relevant to helping things go well, but I am hoping it will be applied more competently than I fear it will be. I would like for north stars to be grounded in a deeper worldview such as what seems to be explored in, for example, Bostrom's "Deep Utopia", rather than seemingly shorter term ideas involving profit, markets, management, and control.

Thank you to everyone who contributed to this thought provoking post : )

Time-box your exploration: Set a bounded period for following your curiosity freely. At the end, zoom out, ask yourself what the big picture here really is, and try to find a proxy task.

Deep dives are good, but surface frequently to breath.

Start in a robustly useful setting: Choose a setting that seems analogous to important aspects of future systems, where interesting phenomena are likely to surface and useful proxy tasks are likely to exist.

Look for your lost keys under the lamp post first. Not because they are more likely to be there, but because you will find them faster if they are.

I get the sense that there was a "generation" of AI safety researchers who have ended up with a very marginalist mindset about AI safety. Some examples:

the evals that Beth Barnes (and maybe Dan Hendrycks?) are focusing on
the scenarios that Daniel Kokotajlo is focusing on
the models of misalignment that Evan Hubinger is focusing on
the forecasting that the OpenPhil worldview investigations team focused on
scary demos
safety cases
policy approaches like SB-1047

Imo the vast, vast majority of progress in the world happens via "marginalist" approaches, so if you do think you can win via "marginalist" approaches you should generally bias towards them.

2) the AI safety community doesn't try to solve alignment for smarter than human AI systems

I assume you're referring to "whole swathes of the field are not even aspiring to be the type of thing that could solve misalignment".

even if no one breakthrough or discovery "solves alignment", a general frame of "lets find principled approaches" is often more generative than "let's find the cheapest 80/20 approach"

I think me using the word "marginalist" was probably a mistake, because it conflates two distinct things that I'm skeptical about:

People no longer trying to make models more aligned (but e.g. trying to do work that primarily cashes out in political outcomes). This is what I mean by "not even aspiring to be the type of thing that could solve alignment".
People using engineering-type approaches (rather than science-type approaches) to try to make models more aligned.

Makes sense, I still endorse my original comment in light of this answer (as I already expected something like this was your view). Like, I would now say

Imo the vast, vast majority of progress in the world happens via "engineering-type / category 2" approaches, so if you do think you can win via "engineering-type / category 2" approaches you should generally bias towards them

Yepp, makes sense, and it's a good reminder for me to be careful about how I use these terms.

Fair, I've edited the comment with a pointer. It still seems to me to be a pretty direct disagreement with "we can substantially reduce risk via [engineering-type / category 2] approaches".

My claim is "while it certainly could be net negative (as is also the case for ~any action including e.g. donating to AMF), in aggregate it is substantially positive expected risk reduction".

Your claim in opposition seems to be "who knows what the sign is, we should treat it as an expected zero risk reduction".

Inasmuch as you are actually trying to have a conversation with Neel or address Neel's argument on its merits, it would be good to be clear that this is the crux.

EDIT: to be more clear: the rest of this comment is not primarily about Neel or "pragmatic interpretability", it's about parts of the field that I consider to be significantly less relevant to "solving alignment" than that (though work that's nominally on pragmatic interpretability could also fall into the same failure modes). I clarify my position further in this comment; thanks Rohin for the pushback.

(Incidentally, I feel like you still aren't quite pinning down your position -- depending on what you mean by "reliably" I would probably agree with "marginalist approaches don't reliably improve things". I'd also agree with "X doesn't reliably improve things" for almost any interesting value of X.)

The first two paragraphs of my original comment were trying to do this.

(I have the same critique of the first two paragraphs, but thanks for the edit, it helps)

The second is where you summarize my position as being that we need deep scientific understanding or else everyone dies (which I think you can predict is a pretty unlikely position for me in particular to hold).

Fwiw, I am actively surprised that you have a p(doom) < 50%, I can name several lines of evidence in the opposite direction:

You've previously tried to define alignment based on worst-case focus and scientific approach. This suggests you believe that "marginalist" / "engineering" approaches are ~useless, from which I inferred (incorrectly) that you would have a high p(doom).
- I still find the conjunction of the two positions you hold pretty weird.
- I'm a strong believer in logistic success curves for complex situations. If you're in the middle part of a logistic success curve in a complex situation, then there should be many things that can be done to improve the situation, and it seems like "engineering" approaches should work.
- It's certainly possible to have situations that prevent this. Maybe you have a bimodal distribution, e.g. 70% on "near-guaranteed fine by default" and 30% on "near-guaranteed doom by default". Maybe you think that people have approximately zero ability to tell which things are improvements. Maybe you think we are at the far end of the logistic success curve today, but timelines are long and we'll do the necessary science in time. But these views seem kinda exotic and unlikely to be someone's actual views. (Idk maybe you do believe the second one.)
- Obviously I had not thought through this in detail when I originally wrote my comment, and my wordless inference was overconfident in hindsight. But I stand by my overall sense that a person who thinks "engineering" approaches are near-useless will likely also have high p(doom) -- not just as a sociological observation, but also as a claim about which positions are consistent with each other.
In your writing you sometimes seem to take as a background assumption that alignment will be very hard. For example, I recall you critiquing assistance games because (my paraphrase) "that's not what progress on a hard problem looks like". (I failed to dig up the citation though.)
You're generally taking a strategy that appears to me to be high variance, which people usually justify via high p(doom) / playing to your outs.
A lot of your writing is similarly flavored to other people who have high p(doom).

However, I'd guess that I'm more willing than you to point to a broad swathe of people pursuing marginalist approaches and claim that they won't reliably improve things.

Kind of a tangent:

This is all related to something Buck recently wrote: "I spend most of my time thinking about relatively cheap interventions that AI companies could implement to reduce risk assuming a low budget, and about how to cause AI companies to marginally increase that budget". I'm sure Buck has thought a lot about his strategy here, and I'm sure that you've thought a lot about your strategy as laid out in this post, and so on. But a part of me is sitting here thinking: man, everyone sure seems to have given up. (And yes, I know it doesn't feel like giving up from the inside, but from my perspective that's part of the problem.)

In other words, in a world that's changing a lot, you want to avoid privileging your hypotheses in advance, which is what it feels like the "pro AI pause vs anti AI pause" debate is doing.

(I'll leave for another comment/post the question of what went wrong in my generation. The "types of arguments" I objected to above all seem quite EA-flavored, and so one salient possibility is just that the increasing prominence of EA steered my generation away from the type of mentality in which it's even possible to aim towards scientific breakthroughs. But even if that's one part of the story, I expect it's more complicated than that.)

I'm reminded of Patrick Collison's (I now think, quite wise) comment on EA:

Now if the question is, should everyone be an EA or even, I guess in the individual sense, am I or do I think I should be an EA? I think – and obviously there's kind of heterogeneity within the field – but my general sense is that the EA movement is always very focused on kind of rigid, not rigid, that's that's unfair perhaps, but on sort of, estimation, analytical, quantification, and sort of utilitarian calculation, and I think that that as a practical matter that means that you end up too focused on that which you can measure, which again means – or as a practical matter means – you're too focus on things that are sort of short-term like bed nets or deworming or whatever being obvious examples. And are those good causes? I would say almost definitely yes, obviously. Now we've seen some new data over the last couple of years that maybe they’re not as good as they initially seemed but they're very likely to be really good things to do.
But it's hard for me to see how, you know, writing a treatise of human nature would score really highly in an EA oriented framework. As assessed ex-post that looked like a really valuable thing for Hume to do. And similarly, as we have a look at the things that in hindsight seem like very good things to have happen in the world, it's often unclear to me how an EA oriented intuition might have caused somebody to do so. And so I guess I think of EA as sort of like a metal detector, and they've invented a new kind of metal detector that's really good at detecting some metals that other detectors are not very good at detecting. But I actually think we need some diversity in the different metallic substances which our detectors are attuned to, and for me EA would not be the only one.

I expect that this kind of reasoning itself steers people away from making important scientific contributions, which are often driven by open-ended curiosity and a drive to uncover deep truths.

I agree with this statement denotatively, and my own interests/work have generally been "driven by open-ended curiosity and a drive to uncover deep truths", but isn't this kind of motivation also what got humanity into its current mess? In other words, wasn't the main driver of AI progress this kind of curiosity (until perhaps the recent few years when it has been driven more by commercial/monetary/power incentives)?

consistently good strategy requires a high amount of consequentialist reasoning

But what I most want to read from you right now is an in-depth account of which the things in the world have gone or are going most right, and the ways in which you think metaphilosophical competence or consequentialist reasoning contributed to them.

To answer your question, it's pretty hard to think of really good examples, I think because humans are very bad at both philosophical competence and consequentialist reasoning, but here are some:

the game theory around nuclear deterrence, helping to prevent large-scale war so far
economics and its influence on government policy, e.g., providing support for property rights, markets, and regulations around things like monopolies and externalities (but it's failing pretty badly on AGI/ASI)
analytical philosophy making philosophical progress in so far as asking important questions and delineating various plausible answers (but doing badly as far as individually having inappropriate levels of confidence, as well as failing to focus on the really important problems, e.g., related to AI safety)
certain philosophers / movements (rationalists, EA) emphasizing philosophical (especially moral) uncertainty to some extent, and realizing the importance of AI safety
MIRI updating on evidence/arguments and pivoting strategy in response (albeit too slowly)

I guess this isn't an "in-depth account" but I'm also not sure why you're asking for "in-depth", i.e., why doesn't a list like this suffice?

I should also try to write up the same thing, but about how virtues contributed to good things.

To answer your question, it's pretty hard to think of really good examples, I think because humans are very bad at both philosophical competence and consequentialist reasoning, but here are some:

Re the AI safety stuff: it all feels a bit too early to say what its effects on the world have been (though on net I'm probably happy it has happened).

I guess this isn't an "in-depth account" but I'm also not sure why you're asking for "in-depth", i.e., why doesn't a list like this suffice?

Thank you to Neel for writing this. Most people pivot quietly.

Here's are some earlier criticisms: https://www.lesswrong.com/posts/5HtDzRAk7ePWsiL2L/open-problems-in-ai-x-risk-pais-5#Transparency

More recent commentary: https://ai-frontiers.org/articles/the-misguided-quest-for-mechanistic-ai-interpretability

[ edit: I'm pretty surprised by the amount people disagree with this comment. Would appreciate if people disagreeing could use the ✓ or 🗶 to indicate what you disagree with. ]

I think the community should reflect on its genius worship culture

I feel this is a valid critique not just of our research community, but of society in general. It is the great man theory of history, and I believe modern sociology has found the theory mostly invalid.

Thanks! I think this is a valuable clarification. I should maybe have flagged that while I value sociology, I'm not deeply experienced in it.

So A is "great" and B is "well known"...

¬A¬B is just the normal case of a person living a modest life and not having a significant impact on history. Nothing to see here.
AB is the great man from great man theory. They do great things and everyone recognizes it!
A¬B is when someone has a large impact, but does not become well known for it. I suspect this does happen. Probably rarely with someone being completely unknown, but very possibly having much larger impacts than renown.
¬AB is when someone becomes known for something while not actually having impacted history much. I suspect this is common when paradigm shifts are inevitable but need something to precipitate around, this could be as much as needing someone to "champion" the paradigm shift, or could be as simple as needing a label to put on a commonly expressed idea and happening to use the name of someone who at some point expressed that idea.

David Bau has written up a response to this post here: In Defense of Curiosity.

1. current problems that are analogous to future problems

2. artificially constructed problems (model organisms) which are designed to be analogous to future problems

Notes that I consider model organism to involve fine-tuning and prompted settings are in scope depending on specific details of how contrived they seem

If you can get a good proxy to the eventual problem in a real model I much prefer that, on realism grounds. Eg eval awareness in Sonnet

I'm glad to see this post come out. I've previously opined that solving these kinds of problems is what proves a field has become paradigmatic:

Paradigms gain their status because they are more successful than their competitors in solving a few problems that the group of practitioners has come to recognize as acute. ––Thomas Kuhn

But if you can't find a proxy task after genuine effort, this is a bad sign about the project.

Seems reasonable. But yeah, it's a high bar

When Anthropic evaluated Sonnet 4.5 on their existing alignment tests^[5], they found that the rate of misalignment had unexpectedly dropped to 0%. Had they solved alignment?! Unfortunately not. Sonnet 4.5 was instead much better at recognizing when it was being evaluated, and deliberately choosing the most ethical action to pass. The tests weren't measuring alignment; they were measuring evaluation awareness.
Jack’s team subtracted an evaluation-awareness steering vector, partially mitigating the problem and revealing that Sonnet 4.5 did in fact show “misalignment” at a rate of roughly 8%. That is, our ability to evaluate a frontier model for alignment broke, and interpretability researchers were able to fix it.

Great post! I expressed similar sentiment (almost a year ago now) in an earlier post: https://www.lesswrong.com/posts/Ypkx5GyhwxNLRGiWo/why-i-m-moving-from-mechanistic-to-prosaic-interpretability

I feel much better about this post after reading it than I got from my first impression.

Thank you to everyone who contributed to this thought provoking post : )

Time-box your exploration: Set a bounded period for following your curiosity freely. At the end, zoom out, ask yourself what the big picture here really is, and try to find a proxy task.

Deep dives are good, but surface frequently to breath.

Start in a robustly useful setting: Choose a setting that seems analogous to important aspects of future systems, where interesting phenomena are likely to surface and useful proxy tasks are likely to exist.

Look for your lost keys under the lamp post first. Not because they are more likely to be there, but because you will find them faster if they are.

131

A Pragmatic Vision for Interpretability

131

Ω 60

Executive Summary

Introduction

Motivating Example: Steering Against Evaluation Awareness

Our Core Process

Which Beliefs Are Load-Bearing?

Is This Really Mech Interp?

Our Comparative Advantage

Why Pivot?

What's Changed In AI?

Reflections On The Field's Progress

Task Focused: The Importance Of Proxy Tasks

Case Study: Sparse Autoencoders

Ensure They Are Good Proxies

Proxy Tasks Can Be About Understanding

Types Of Projects: What Drives Research Decisions

Focused Projects

Exploratory Projects

Curiosity Is A Double-Edged Sword

Starting In A Robustly Useful Setting

Time-Boxing

Worked Examples

Blending The Two: Tentative Proxy Tasks

What’s Your Contribution?

Jack Lindsey’s Approach

Method Minimalism

Case Study: Shutdown Resistance

Try The Easy Methods First

When Should We Develop New Methods?

Call To Action

Acknowledgments

Appendix: Common Objections

Aren’t You Optimizing For Quick Wins Over Breakthroughs?

What If AGI Is Fundamentally Different?

I Care About Scientific Beauty and Making AGI Go Well

Is This Just Applied Interpretability?

Are You Saying This Because You Need To Prove Yourself Useful To Google?

Does This Really Apply To People Outside AGI Companies?

Aren’t You Just Giving Up?

Is Ambitious Reverse-engineering Actually Overcrowded?

Appendix: Defining Mechanistic Interpretability

Moving Toward "Mechanistic OR Interpretability"

131

Ω 60

131

Ω 60