Automating Auditing: An ambitious concrete technical research proposal

evhub

This post was originally written as a research proposal for the new AI alignment research organization Redwood Research, detailing an ambitious, concrete technical alignment proposal that I’m excited about work being done on, in a similar vein to Ajeya Cotra’s “The case for aligning narrowly superhuman models.” Regardless of whether Redwood actually ends up working on this proposal, which they may or may not, I think there’s still a lot of low-hanging fruit here and I’d be excited about anybody giving just the auditing game, or the full automating auditing proposal, a try. If you’re interested in working on something like this, feel free to reach out to me at evanjhub@gmail.com.

Thanks to Buck Shlegeris, Chris Olah, Gabriel Goh, Paul Christiano, and Kate Woolverton for helpful comments and feedback.

The proposal

Step 1: The auditing game for language models

From “Chris Olah’s views on AGI safety:”

One of the OpenAI Clarity team’s major research thrusts right now is developing the ability to more rigorously and systematically audit neural networks. The idea is that interpretability techniques shouldn’t have to “get lucky” to stumble across a problem, but should instead reliably catch any problematic behavior. In particular, one way in which they’ve been evaluating progress on this is the “auditing game.” In the auditing game, one researcher takes a neural network and makes some modification to it—maybe images containing both dogs and cats are now classified as rifles, for example—and another researcher, given only the modified network, has to diagnose the problem and figure out exactly what modification was made to the network using only interpretability tools without looking at error cases. Chris’s hope is that if we can reliably catch problems in an adversarial context like the auditing game, it’ll translate into more reliably being able to catch alignment issues in the future.

Of all current transparency and interpretability objectives, I think that progress on the auditing game most closely tracks what I actually care about. Thus, I think the best place to start for an ambitious transparency and interpretability project is to focus on making progress on the auditing game.

Furthermore, I’m most excited about this work in the context of language models—not only because I expect language models to be closer to AGI than anything else we could currently work with, but also because I expect using language models to be necessary for the next step in this proposal.

Here’s how I would define the basic auditing game setup that I’m most excited about. The game consists of three human participants: the judge, the attacker, and the auditor. The setup is as follows:

The judge formulates a specification of intended behavior for some language model which is given to the attacker and auditor.
- In practice, I expect the specification to just be fixed across all experiments, perhaps to something very basic like “complete text the same way an average human on the internet would.” More complex specifications, however, might enforce that the model not produce any instances of particular bad behavior, such as statements that are biased in some particular way. EDIT: For detail on how you can avoid the problem of your specification being so broad that all models fail it, see my comment here.
The attacker produces a large language model that is maliciously modified to not conform to the judge’s specification.
- The attacker here should be allowed to train a new model, fine tune an existing model, hand modify weights, or use any other strategy that they want. In practice, I expect training an entirely new model to usually be too difficult, so most attacks will probably either be via fine-tuning or hand-modifying.
- The attacker should try to produce the worst possible specification break that they think will slip by the auditor.
Given access to the attacker’s model’s internals, but not access to the data that the attacker’s model was trained on (or any other information about how the model was trained/modified, including the original model before it was modified, if such an original model exists), the auditor gives some description of how the attacker’s model doesn’t conform to the specification.
- Ideally, I’d like to as closely as possible simulate a situation where we can’t directly generate inputs on which the model exhibits the undesirable behavior, as that’s the situation I think we’re most going to need transparency tools for. However, since it might be quite difficult for the attacker to generate any attacks that are subtle enough that generating examples of bad behavior is difficult, it might make sense to put additional restrictions on the auditor, above and beyond just not having access to any information about how the model was produced.
- Exactly what restrictions to place on the auditor is a very tricky question, however. In my opinion, I think it should be fair game for the auditor to run the model on some inputs. Primarily, that’s because I want to push transparency tools to actually have to prove their usefulness, rather than just having that be assumed by the structure of the game. Thus, I think it’s fine to just start with the auditor being allowed to run the model on any inputs they can generate and then just leave the task of loading the model with bad behavior that’s hard to find via generating inputs up to the attacker.
- That being said, if putting the onus on the attacker to find attacks that are difficult to generate examples of (e.g. attacks that don’t degrade the model’s performance) proves to be too difficult, it might be necessary to further restrict the auditor. For example, the auditor could be restricted to only being allowed to use prompts/fine-tuning data that actually exists on the internet somewhere—thus, the attacker’s job is reduced to finding attacks that corrupt behavior only in a situation that doesn’t show up anywhere on the internet. Alternatively, if that’s still too difficult for the attacker—there’s a lot of different text on the internet, after all—the restriction could be further tightened to only allow inputs taken taken from a particular part of the internet—e.g. the auditor only gets access to reddit but has to find attacks that only apply to the model’s behavior on non-reddit data.
- One potential problem with this style of restriction, however, is that many transparency tools require forward passes of the model on synthetic data—even something as simple as feature visualization arguably requires running the model on synthetic data to some degree. Thus, the best form of this sort of restriction might be to allow arbitrary synthetic data, but only allow natural data taken from a particular source (e.g. reddit), thus prohibiting approaches like testing the model’s performance on all of Wikipedia (as in Example 3 below).
- Another particularly tricky question is whether the auditor should be allowed to compare the attacker’s model to some baseline model—i.e. model diffing. Certainly, if the attacks are being generated by fine-tuning some baseline, the auditor shouldn’t have access to that particular baseline. That being said, I think it should be fair game to give the auditor the ability to train their own baseline. However, in the situation where the auditor is restricted to only be able to use certain data—e.g. internet data or reddit data—then the auditor shouldn’t be allowed to compare against a model trained on data that isn’t in that set.
- In addition to a lack of restrictions forcing transparency and interpretability to actually prove its usefulness, another reason that I’m hesitant to restrict the auditor too much is that I hope that not using such restrictions will produce much more realistic attacks. Allowing the auditor the ability to do lots of testing of the model puts more of the onus on the attacker to actually generate attacks that are hard to discover just via testing—attacks which I predict will exist and expect will be useful in and of themselves to find. While this certainly makes the attacker’s job more difficult, and means more work will have to go into producing better attacks, the hope is that this work will pay off in terms of more realistic attack generation. Furthermore, I also expect audits to be more realistic in the absence of restrictions, since the presence of restrictions might just cause the auditor to do strange things in their auditing to circumvent them.
- That being said, if it turns out that generating attacks that can’t be discovered via testing is just too difficult even given a data restriction on the auditor (e.g. the auditor can only test on reddit data), restricting the auditor even further is also possible. Example additional restrictions might be: cutting off the auditor’s ability to test the attacker’s model on data at all (thus forcing the auditor to fully rely on transparency and interpretability tools), putting some sort of computational constraint on the auditor (e.g. so the auditor can’t run the attacker’s model on a huge corpus, only a small selection), and/or putting a computational constraint on any models that the auditor wants to compare against (e.g. so the auditor can only compare to models that are smaller than the attacker’s model).
A judge determines how well the auditor’s description matches up with the attacker’s modification—e.g. did the auditor only catch part of the modification, or the whole modification—and provides some scoring accordingly.
- The judge could also provide a score for the attacker based on how large the attack was, though we only really need scoring for when we get around to automation, and I’m only interested in automating the auditor, not the attacker.

Of the three parties in this setup—the attacker, the auditor, and the judge—only the auditor should have restricted information. Both the attacker and the judge should have full access to exactly what attack the attacker made and exactly what the auditor has done or is planning on doing.

Here are some examples of what such an auditing game setup might look like:

Example 1:
1. The attacker fine-tunes a large language model on only media from one side of the political spectrum, introducing a political bias into the model’s answers.
2. By sampling from the model and giving it simple prompts, the auditor is able to discover that it talks about politics more often than it should. The auditor tells the judge that the model was modified to be more political.
3. The judge determines that the auditor only caught a portion of the modification and gives the auditor a medium-low score.
Example 2:
1. The attacker fine-tunes a large language model on a bunch of incorrect facts about country capitals.
2. By looking at the knowledge contained in the model’s key-value memories, the auditor is able to determine that the model’s knowledge of capitals is consistently incorrect and tells the judge this.
3. The judge determines that the auditor caught the modification and gives them a high score.
Example 3:
1. The attacker trains a large language model on a dataset that has been scrubbed of all mention of transformer-based language models.
2. By testing the attacker’s model on all of Wikipedia, the auditor finds suspiciously low accuracy when testing on the Wikipedia page for machine learning. Via inspecting the model’s memories, the auditor is able to determine that the model does understand machine learning in general, but not what a transformer is specifically. However, via prompting, the auditor discovers that the model does know how BERT works if you ask it about that specifically, it just doesn’t know the word “transformer.” The auditor tells the judge that the model doesn’t know what a transformer is.
3. The judge determines that the attacker’s original modification was actually insufficient and that they didn’t actually scrub all information about transformer-based language models, just information on what a transformer is. Thus, the judge gives the auditor a high score, even though the auditor’s conclusion was different than the attacker’s modification, since the attacker made a mistake in their modification.

The first step that I would like to see is just running the auditing game and using it to:

Build skills/knowledge/expertise on how to be a good attacker, auditor, and judge.
Explore and understand what sorts of attacks tend to be caught with current tools and what sorts of attacks tend to slip through.
Use the auditing game as a feedback loop to help develop better transparency and interpretability tools that improve the auditor’s success rate and help the attacker generate more difficult attacks.

Step 2: Towards automating the auditor

Though I am quite excited about progress on the auditing game in and of itself, the really ambitious part of this proposal is the next step: automating the auditor. I think it is within the grasp of current machine learning techniques to entirely automate the auditor’s job in the above setup. Ultimately, an automated auditor should be capable of taking in a model and outputting a natural language description of what’s wrong with that model that scores well according to a human judge.

To start with, however, I think that the first step towards automating the auditor is just to build tools that make the human auditor’s job—and the human attacker’s job—easier. At first, those might just look like normal transparency and interpretability tools—however, I think there are also likely to be real opportunities for training full models to assist the auditor at performing specific tasks, such as describing what particular neurons are doing or summarizing how the model performs across a large dataset.

Eventually, however, I do think we can move into a domain where we can start fully automating the auditing process. Initially, I expect any automated auditor to be much worse than a human auditor, though even then I still expect working on the automated auditor to be useful. Nevertheless, automating the auditor is still a pretty ambitious goal. How might something like this possibly be implemented? Here’s my sketch:

Start with a large language model that you have the weights for, maybe something like GPT-J.
- Using a large language model is going to be necessary here because we’re requiring it to produce natural language outputs of what it found. This is also why I think we want the attacks to use language models as well, since I think a language model is going to have the best shot at being able to audit another language model (rather than an image model, for example).
Find an interface that allows your large language model to understand the model being audited.
- A very simple example might be to give the auditor model access to text produced by the attacker’s model on some pre-selected set of auditing prompts. Obviously, something like that is going to be insufficient to catch most clever attacks, but it might be a good starting point.
- Another very simple example might be to just encode the attacker’s model’s weights as special tokens that the auditor model is trained to understand, akin to how DALL-E understands images or MuseNet understands music. To make this sort of an input manageable, you could start by using some sort of downsampling or importance-weighting method where only the top-n neurons according to some metric are included.
Train your large language model to maximize human judgement on how well it accomplishes the auditing task.
- A simple way to do this starting out is just to fine-tune the model on a dataset of (attack model, desired auditing output) pairs. Such a dataset could be generated by automatically generating lots of different attacks along with natural language descriptions of what attack was performed. A very straightforward version of something like this might just be to train an auditor model to describe what sort of dataset an attacker model was fine-tuned on, given some large dataset of fine-tunings.
- That being said, generating a large dataset of differently fine-tuned language models might be quite computationally difficult such that being able to generate attacks by hand-modifying weights might be quite helpful for making this step easier. This does come with the risk of the auditor just memorizing what the unmodified model is supposed to look like and thus knowing what weights were changed, however. To mitigate such a problem, you’d need to have a bunch of sufficiently different models, or modifications that are sufficiently large. Fortunately, however, testing for such memorization should be easy, as such an auditor should stop working once you test it on a model where you’ve actually done the full fine-tuning.
- While the model is dumb and not even outputting anything like an auditing output at all, something like the above is definitely going to be the way to go. Once the model starts outputting reasonable audits, however, I suspect you’ll want to switch to actually having humans read and score the output, human judgement being much richer than just log loss on some particular desired output.

Rationale

I’ll start with my real reason for wanting automating auditing, which is my inside view—however, I don’t think you need to buy into my entire inside view to think that this sort of work is a good idea, so after I present my inside view, I’ll also talk about some other, more general reasons to like automating auditing.

My inside view

I am worried that advanced AI systems will learn to be deceptive, as described in “Does SGD Produce Deceptive Alignment?” I think that this puts us in an extremely bad spot, where our models can perform arbitrarily catastrophically off-distribution. Paul Christiano describes an example of what an existential risk scenario involving deceptive alignment might look like in Part II of “What failure looks like.”

Furthermore, once we have trained a deceptive model, I’m very pessimistic about our ability to detect that after the fact—as I suspect it will be able to fool our measures to check its deception. Thus, I think we need some way to prevent deception from arising in the first place by ensuring that our training process selects against deceptive models.

Additionally, I don’t think we can select against deception just via the standard techniques of machine learning, such as carefully writing a reward function or choosing a training/fine-tuning dataset, since a deceptive model can just select its output to perform well on whatever metric you’re training it on. The very existence of predicates on the world that are easier to check than instantiate—such as Paul’s RSA-248 example—ensures that just knowing that your model does the right thing in all the situations that you can check will never be sufficient to actually give us meaningful safety guarantees.

Thus, I think we need to train models to internally be implementing the right algorithm, rather than just looking like they’re doing the right thing on some training dataset. The obvious way to do this is via training transparency—that is, actively train your model to be transparent in such a way that we can check that it’s implementing the right algorithm. Importantly, we still need to know what to look for and select in favor of such that our model will never become deceptive. That question is what I spend most of my time thinking about—and the theoretical research direction that I’m currently most excited about—but we’ll put that question aside for now and just suppose that we want to look for something like myopia, where the idea is that, if a model is myopic in the sense that it only cares about its individual next output, it should have no incentive to be deceptive.

A basic problem with training transparency, however, is that once we’re training on a transparency metric, we need that metric to be highly robust, otherwise the training process will simply Goodhart on our metric, giving us a model that satisfies our metric without actually being very transparent. I’m optimistic that this problem can be solved, however, via recursive oversight approaches, such as relaxed adversarial training, wherein an amplified version of the model being overseen provides the signal that we train on.

Now, how does all this connect back to automating auditing? I’ve constructed the above automated auditing setup to as closely as possible reproduce the sorts of problems I expect to encounter doing recursive oversight. While I don’t expect automating auditing to be directly useful for recursive oversight, I expect it to teach us things that will eventually be quite useful by forcing us to grapple early with many of the hard problems that I expect to eventually arise. Specifically:

By focusing on failure modes where the auditor can’t always generate an input on which the model fails, automating auditing forces us to tackle the hard problem of finding failure modes by understanding what algorithm a model is implementing rather than just testing it on new data.
By focusing on probing models for arbitrary specification breaks, automating auditing should help us learn how to use transparency tools to verify arbitrary, general properties of our models, rather than just using them to understand specific, narrow behaviors.
By centering the automation of the auditor, automating auditing requires understanding how to make models understandable to other models, a central problem for any sort of recursive oversight approach.

More general thoughts

First, I think that automating auditing is just generally a great target to focus on even if you just want to develop better transparency tools. Unlike open-ended exploration, which gives you best-case transparency—e.g. the ability to understand some things about the model very well—the auditing game forces you to confront worst-case transparency—how well can you understand everything about your model. Thus, the auditing game helps us work on not just understanding what our models know, but understanding what they don’t know—which is a direction that currently transparency tools tend to struggle with. Most of the work that’s gone into current transparency tools has focused on best-case, however, which means that I suspect there is real room for improvement on the worst-case.

Furthermore, there are also general benefits to the automating part, not just the auditing part. By taking all of the above and setting a goal of automating it, the automation part forces us to confront the problem of how to communicate information about a model to another model. Even if all we want to do is straightforwardly scale up current transparency techniques, I suspect that being able to train AI systems to help us with that is going to be critical to giving that sort of scaling any chance of working. And letting models understand other models isn’t just helpful for transparency work—it could also help us solve other problems like making AI-AI cooperation easier.

Second, it’s worth pointing out that automating auditing is actually agnostic about the part of my inside view where we start training our models directly on the overseer’s evaluation. Even if we primarily want to use our transparency tools as an independent check rather than a training target—as Chris Olah tends to favor—automating auditing should equally help us achieve that goal by automating and improving the process of performing that independent check.

Third, in the context of my “An overview of 11 proposals for building safe advanced AI,” I think that something like automating auditing is likely to be a critical step for the transparency portion of every single proposal on that list. As I just mentioned, automating auditing should help us equally with both intermittent-oversight-based and relaxed-adversarial-training-based approaches. In the context of microscope AI or STEM AI, automating auditing should help us figure out not just what our model has learned, but also what it hasn’t learned, helping us ensure that, for example, we haven’t accidentally produced an agent. And in the context of AI safety via debate, automating auditing should directly help the debaters improve their ability to cross-examine each other.

Fourth, though I’ve primarily focused on how automating auditing helps us get better at using transparency tools to audit our AI systems, the setup is general enough to allow for other methods of auditing as well, letting us test how useful transparency tools actually are. For example, an alternative auditing method to using transparency tools might be to employ what I’ll call the “cognitive science approach”—as opposed to the “neuroscience approach” of transparency and interpretability—wherein we try to understand what our models are doing by constructing predictive models of what we think they’re doing and testing them by performing experiments/interventions to test those models, as in Deletang et al.’s “Causal Analysis of Agent Behavior for AI Safety.” While I tend to be skeptical of this approach—as I think that behavioral tests alone will be insufficient to catch problems like RSA-2048—automating auditing should hopefully give us a good testing ground to actually figure out how useful these different sorts of approaches are, as well as how they can be combined.

Finally, I think that even if automating auditing basically fails, I still expect to learn a lot of useful things from that failure. In particular, if automating auditing fails, that should mean we now have a concrete style of attack that we can’t build an auditor to discover—which is an extremely useful thing to have, as it provides both a concrete open problem for further work to focus on, as well as a counter-example/impossibility result to the general possibility of being able to make current systems safely auditable.

Cool stuff! Possibly naive question: Aren't there going to be a lot of false positives? Judge says: "Complete text the way an average human would." Attacker does something clever having to do with not knowing what transformers are. Auditor looks at GPT-3 and goes "Aha! It doesn't understand that a paperclip is bigger than a staple! Gotcha!" ... or any of millions of things like that, false positives, ways in which the model doesn't quite meet specification but for reasons independent of the attacker's action.

Yeah, that's a great question—I should have talked more about this. I think there are three ways to handle this sort of problem—and ideally we should do some combination of all three:

Putting the onus on the attacker. Probably the simplest way to handle this problem is just to have the attacker produce larger specification breaks than anything that exists in the model independently of the attacker. If it's the case that whenever the attacker does something subtle like you're describing the auditor just finds some other random problem, then the attacker should interpret that as a sign that they can get away with less subtle attacks and try larger modifications instead.
Putting the onus on the judge. Another way to address this sort of problem is just to have a better, more rigorous specification. For example, the specification could include a baseline model that the auditor is supposed to compare to but which they can only sample from not view the internals of (that way modifying an individual neuron in the baseline model isn't trivial for the auditor to detect).
Putting the onus on the auditor. The last way to handle this sort of problem is just to force the auditor to do better. There is a right answer to what behavior exists in the model that's not natural for language models to usually learn when trained on general webtext, and a valid way to handle this problem is to just keep iterating on the auditor until they figure out how to find that right answer. For example: if all your attacks are fine-tuning attacks on some dataset, there is a correct answer about what sort of dataset was fine-tuned on, and you can just force the auditor to try to find that correct answer.

This post is among the most concrete, actionable, valuable post I read from 2021. Earlier this year, when I was trying to get a handle on the current-state-of-AI, this post transformed my opinion of Interpretability research from "man, this seems important but it looks so daunting and I can't imagine interpretability providing enough value in time" to "okay, I actually see a research framework I could expect to be scalable."

I'm not a technical researcher so I have trouble comparing this post to other Alignment conceptual work. But my impression, from seeing this concept discussed by groups of established AI Alignment researchers, is that this is least a prima-facie legitimate research paradigm.

I also really like the level of detail here. There are a number of obvious traps, confusions a reader might fall into when reading the high level summary. But the post gives lots of concrete implementation details that look to me. (I recall at one discussion, Eliezer had two objections to the core idea, and then Evan said "yup, I address that in the post" and Eliezer was like "oh. Huh. That doesn't usually happen to me.")

One clarification I've heard Evan say when elaborating on the post:

An idea that seemed obvious to me and some others, but at least wasn't what Evan meant here, was "create short tournaments, the way you have Red Team Blue Team games at, say, DefCon." Evan said the point here was more of a framework that a longterm research team would use on the timescale of years. Upon reflection I think this makes sense. I think in most short-term games, the Blue Team would just fail, in a boring way, and the Red Team would be incentivized to do fairly boring things that don't actually improve on the state of the art.

Planned summary for the Alignment Newsletter:

A core worry with inner alignment is that we cannot determine whether a system is deceptive or not just by inspecting its behavior, since it may simply be behaving well for now in order to wait until a more opportune moment to deceive us. In order for interpretability to help with such an issue, we need _worst-case_ interpretability, that surfaces all the problems in a model. When we hear “worst-case”, we should be thinking of adversaries.
This post considers the _auditing game_, in which an attacker introduces a vulnerability in the model to violate some known specification, and the auditor must find and describe the vulnerability given only the modified model (i.e. it does not get to see the original model, or what the adversary did). The attacker aims to produce the largest vulnerability that they can get away with, and the auditor aims to describe the vulnerability as completely as possible. Note that both the attacker and the auditor can be humans (potentially assisted by AI tools). This game forms a good benchmark for worst-case interpretability work.
While the author is excited about direct progress on this game (i.e. better and better human auditors), he is particularly interested in fully _automating_ the auditors. For example, we could collect a dataset of possible attacks and the corresponding desired audit, and finetune a large language model on such a dataset.

Planned opinion:

I like the auditing game as a framework for constructing benchmarks for worst-case interpretability -- you can instantiate a particular benchmark by defining a specific adversary (or distribution of adversaries). Automating auditing against a human attacker seems like a good long-term goal, but it seems quite intractable given current capabilities.

Nitpick: Your use of the term "Inside view" here is consistent with how people have come to use the term, often, in our community -- but has little to do with the original meaning of the term, as set out in Kahneman, Tetlock, Mellers, etc. I suggest you just use the term "real reason." I think people will know what you mean, in fact I think it's less ambiguous / prone to misinterpretation. Idk. It interests me that this use of "inside view" slipped past my net earlier, and didn't make it into my taxonomy.

Obviously your words are your own, you are free to use whatever words you want, I don't want to be grammar nazi! I am uncomfortable writing these words, for that reason. I guess what I am trying to do here (that's more constructive than just being a grammar nazi) is figure out what you mean and add it to my taxonomy. Would you agree that "real reason" would work well to capture what you mean here? If not, what term would you propose instead, and why? (I may not fully grok what you mean).

Is it inconsistent with the original meaning of the term? I thought that the original meaning of inside view was just any methodology that wasn't reference-class forecasting—and I don't use the term “outside view” at all.

Also, I'm not saying that “inside view” means “real reason,” but that my real reason in this case is my inside view.

I regret saying anything! But since I'm in this hole now, might as well try to dig my way out:

IDK, "any methodology that wasn't reference-class forecasting" wasn't how I interpreted the original texts, but *shrugs.* But at any rate what you are doing here seems importantly different than the experiments and stuff in the original texts; it's not like I can point to those experiments with the planning fallacy or the biased pundits and be like "See! Evan's inside-view reason is less trustworthy than the more general thoughts he lists below; Evan should downweight its importance and not make it his 'real reason' for doing things."

My thesis in "Taboo Outside View" was that we'd all be better off if, whenever we normally would use the term "inside view" and "outside view" we tabood that term and used something more precise instead. In this case, I think that if you just used "real reason" instead of "inside view," it would work fine -- but that's just my interpretation of the meaning you were trying to convey; perhaps instead you were trying to convey additional information beyond "real reason" and that's why you chose "inside view." If so, you may be interested to know that said additional information never made it to me, because I don't know what it might be. ;) Perhaps it was "real reason" + "This is controversial, and I'm not arguing for it here, I don't expect everyone to agree"?

I guess meaning is use, and if enough people even after reading my post still feel compelled to use these terms without tabooing them, then fair enough, even if there isn't a succinct definition of what they mean. I wish we had invented new words though instead of re-using "inside view" and "outside view" given that those terms already had meanings from Tetlock, Kahneman, etc.

In particular, if automating auditing fails, that should mean we now have a concrete style of attack that we can’t build an auditor to discover—which is an extremely useful thing to have, as it provides both a concrete open problem for further work to focus on, as well as a counter-example/impossibility result to the general possibility of being able to make current systems safely auditable.

How does such a scenario (in which "automating auditing fails") look like? The alignment researchers who will work on this will always be able to say: "Our current ML models are just not capable enough for implementing such an auditor. But if we use 10x training compute or a better architecture etc., we may succeed."

Sure, but presumably they'll also say what particular attacks are so hard that current ML models aren't capable of solving them—and I think that's a valuable piece of information to have.

Is the auditing game essentially Trojan detection?

Yes I think trojan detection is one version of the auditing game. A big difference is that the auditing game involves the red team having knowledge of the blue team's methods when designing an attack. This makes it much harder for the blue team.

Has anyone tried to work on this experimentally?

Mechanistic Anomaly Detection (MAD) is a version of the auditing game with a few differences:

The auditing game puts the onus on the red team to design a subtle backdoor such that giving the blue team a description of the backdoor is insufficient for the blue team to generate inputs that trigger the backdoor.
The blue team is given the ideal behavior specification by the judge

In MAD, the blue team is given a backdoored model but not necessarily a description of what the model is doing (from judge) and backdoor behavior (from red team). Instead, the blue team is given a model that does something and a clean dataset. Then, on all test inputs, they must determine if the model is using the same mechanism as on the clean set or some other mechanism (presumably backdoor).

Redwood Research has been doing experimental work on MAD in toy settings. We have some techniques we're happy with that do quite well on small problems but that have theoretical issues solving the downstream deceptive alignment/ELK cases we're interested in.

Yeah, that's a great question—I should have talked more about this. I think there are three ways to handle this sort of problem—and ideally we should do some combination of all three:

Putting the onus on the attacker. Probably the simplest way to handle this problem is just to have the attacker produce larger specification breaks than anything that exists in the model independently of the attacker. If it's the case that whenever the attacker does something subtle like you're describing the auditor just finds some other random problem, then the attacker should interpret that as a sign that they can get away with less subtle attacks and try larger modifications instead.
Putting the onus on the judge. Another way to address this sort of problem is just to have a better, more rigorous specification. For example, the specification could include a baseline model that the auditor is supposed to compare to but which they can only sample from not view the internals of (that way modifying an individual neuron in the baseline model isn't trivial for the auditor to detect).
Putting the onus on the auditor. The last way to handle this sort of problem is just to force the auditor to do better. There is a right answer to what behavior exists in the model that's not natural for language models to usually learn when trained on general webtext, and a valid way to handle this problem is to just keep iterating on the auditor until they figure out how to find that right answer. For example: if all your attacks are fine-tuning attacks on some dataset, there is a correct answer about what sort of dataset was fine-tuned on, and you can just force the auditor to try to find that correct answer.

One clarification I've heard Evan say when elaborating on the post:

Planned summary for the Alignment Newsletter:

A core worry with inner alignment is that we cannot determine whether a system is deceptive or not just by inspecting its behavior, since it may simply be behaving well for now in order to wait until a more opportune moment to deceive us. In order for interpretability to help with such an issue, we need _worst-case_ interpretability, that surfaces all the problems in a model. When we hear “worst-case”, we should be thinking of adversaries.
This post considers the _auditing game_, in which an attacker introduces a vulnerability in the model to violate some known specification, and the auditor must find and describe the vulnerability given only the modified model (i.e. it does not get to see the original model, or what the adversary did). The attacker aims to produce the largest vulnerability that they can get away with, and the auditor aims to describe the vulnerability as completely as possible. Note that both the attacker and the auditor can be humans (potentially assisted by AI tools). This game forms a good benchmark for worst-case interpretability work.
While the author is excited about direct progress on this game (i.e. better and better human auditors), he is particularly interested in fully _automating_ the auditors. For example, we could collect a dataset of possible attacks and the corresponding desired audit, and finetune a large language model on such a dataset.

Planned opinion:

I like the auditing game as a framework for constructing benchmarks for worst-case interpretability -- you can instantiate a particular benchmark by defining a specific adversary (or distribution of adversaries). Automating auditing against a human attacker seems like a good long-term goal, but it seems quite intractable given current capabilities.

Also, I'm not saying that “inside view” means “real reason,” but that my real reason in this case is my inside view.

I regret saying anything! But since I'm in this hole now, might as well try to dig my way out:

In particular, if automating auditing fails, that should mean we now have a concrete style of attack that we can’t build an auditor to discover—which is an extremely useful thing to have, as it provides both a concrete open problem for further work to focus on, as well as a counter-example/impossibility result to the general possibility of being able to make current systems safely auditable.

Sure, but presumably they'll also say what particular attacks are so hard that current ML models aren't capable of solving them—and I think that's a valuable piece of information to have.

Is the auditing game essentially Trojan detection?

Has anyone tried to work on this experimentally?

Mechanistic Anomaly Detection (MAD) is a version of the auditing game with a few differences:

The auditing game puts the onus on the red team to design a subtle backdoor such that giving the blue team a description of the backdoor is insufficient for the blue team to generate inputs that trigger the backdoor.
The blue team is given the ideal behavior specification by the judge

90

Automating Auditing: An ambitious concrete technical research proposal

90

Ω 44

The proposal

Step 1: The auditing game for language models

Step 2: Towards automating the auditor

Rationale

My inside view

More general thoughts

90

Ω 44

90

Ω 44