Automating Auditing: An ambitious concrete technical research proposal

Daniel Kokotajlo (4y)

Cool stuff! Possibly naive question: Aren't there going to be a lot of false positives? Judge says: "Complete text the way an average human would." Attacker does something clever having to do with not knowing what transformers are. Auditor looks at GPT-3 and goes "Aha! It doesn't understand that a paperclip is bigger than a staple! Gotcha!" ... or any of millions of things like that, false positives, ways in which the model doesn't quite meet specification but for reasons independent of the attacker's action.

evhub (4y)

Yeah, that's a great question—I should have talked more about this. I think there are three ways to handle this sort of problem—and ideally we should do some combination of all three:

Putting the onus on the attacker. Probably the simplest way to handle this problem is just to have the attacker produce larger specification breaks than anything that exists in the model independently of the attacker. If it's the case that whenever the attacker does something subtle like you're describing the auditor just finds some other random problem, then the attacker should interpret that as a sign that they can get away with less subtle attacks and try larger modifications instead.
Putting the onus on the judge. Another way to address this sort of problem is just to have a better, more rigorous specification. For example, the specification could include a baseline model that the auditor is supposed to compare to, but which they can only sample from, not view the internals of (that way, modifying an individual neuron in the baseline model isn't trivial for the auditor to detect).
Putting the onus on the auditor. The last way to handle this sort of problem is just to force the auditor to do better. There is a right answer to what behavior exists in the model that's not natural for language models to usually learn when trained on general webtext, and a valid way to handle this problem is to just keep iterating on the auditor until they figure out how to find that right answer. For example: if all your attacks are fine-tuning attacks on some dataset, there is a correct answer about what sort of dataset was fine-tuned on, and you can just force the auditor to try to find that correct answer.
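
To make the second option concrete, here is a minimal sketch of an auditor comparing a suspect model against a baseline it can only sample from. Everything named here is an assumption made for illustration: the `Sampler` interface, the toy models, the sample count, and the 0.2 threshold are all hypothetical. The idea is that a quirk shared by both models (the paperclip/staple confusion) produces almost no divergence, while an attacker's modification does.

```python
import random
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical interface: a "model" is a function from prompt to one sampled completion.
Sampler = Callable[[str], str]

def empirical_dist(sample: Sampler, prompt: str, n: int) -> Dict[str, float]:
    """Estimate the completion distribution for a prompt from n samples."""
    counts = Counter(sample(prompt) for _ in range(n))
    return {completion: c / n for completion, c in counts.items()}

def tv_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def flag_divergent_prompts(suspect: Sampler, baseline: Sampler,
                           prompts: List[str], n: int = 500,
                           threshold: float = 0.2) -> List[str]:
    """Return prompts where the suspect differs from the baseline by more than threshold."""
    flagged = []
    for prompt in prompts:
        d = tv_distance(empirical_dist(suspect, prompt, n),
                        empirical_dist(baseline, prompt, n))
        if d > threshold:
            flagged.append(prompt)
    return flagged

def make_toy_model(rng: random.Random, paris_prob: float) -> Sampler:
    """Toy stand-in for a language model (not a real model API)."""
    def sample(prompt: str) -> str:
        if "bigger" in prompt:
            # Pre-existing quirk shared by baseline and suspect alike.
            return "staple" if rng.random() < 0.7 else "paperclip"
        return "Paris" if rng.random() < paris_prob else "Lyon"
    return sample

def demo() -> List[str]:
    baseline = make_toy_model(random.Random(1), paris_prob=0.9)
    suspect = make_toy_model(random.Random(2), paris_prob=0.2)  # attacked behavior
    prompts = ["Which is bigger, a paperclip or a staple?",
               "The capital of France is"]
    return flag_divergent_prompts(suspect, baseline, prompts)
```

In this toy setup, `demo()` flags only the France prompt: the staple quirk is present in both models, so it does not count against the suspect.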

Raemon (3y), Review for 2021 Review

This post is among the most concrete, actionable, and valuable posts I read from 2021. Earlier this year, when I was trying to get a handle on the current state of AI, this post transformed my opinion of Interpretability research from "man, this seems important but it looks so daunting and I can't imagine interpretability providing enough value in time" to "okay, I actually see a research framework I could expect to be scalable."

I'm not a technical researcher, so I have trouble comparing this post to other Alignment conceptual work. But my impression, from seeing this concept discussed by groups of established AI Alignment researchers, is that this is at least a prima-facie legitimate research paradigm.

I also really like the level of detail here. There are a number of obvious traps and confusions a reader might fall into when reading the high-level summary. But the post gives lots of concrete implementation details that look good to me. (I recall at one discussion, Eliezer had two objections to the core idea, and then Evan said "yup, I address that in the post" and Eliezer was like "oh. Huh. That doesn't usually happen to me.")

One clarification I've heard Evan say when elaborating on the post:

An idea that seemed obvious to me and some others, but at least wasn't what Evan meant here, was "create short tournaments, the way you have Red Team Blue Team games at, say, DefCon." Evan said the point here was more of a framework that a longterm research team would use on the timescale of years. Upon reflection I think this makes sense. I think in most short-term games, the Blue Team would just fail, in a boring way, and the Red Team would be incentivized to do fairly boring things that don't actually improve on the state of the art.

Rohin Shah (4y)

Planned summary for the Alignment Newsletter:

A core worry with inner alignment is that we cannot determine whether a system is deceptive or not just by inspecting its behavior, since it may simply be behaving well for now in order to wait until a more opportune moment to deceive us. In order for interpretability to help with such an issue, we need _worst-case_ interpretability, that surfaces all the problems in a model. When we hear “worst-case”, we should be thinking of adversaries.
This post considers the _auditing game_, in which an attacker introduces a vulnerability in the model to violate some known specification, and the auditor must find and describe the vulnerability given only the modified model (i.e. it does not get to see the original model, or what the adversary did). The attacker aims to produce the largest vulnerability that they can get away with, and the auditor aims to describe the vulnerability as completely as possible. Note that both the attacker and the auditor can be humans (potentially assisted by AI tools). This game forms a good benchmark for worst-case interpretability work.
While the author is excited about direct progress on this game (i.e. better and better human auditors), he is particularly interested in fully _automating_ the auditors. For example, we could collect a dataset of possible attacks and the corresponding desired audit, and finetune a large language model on such a dataset.
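
As a rough sketch of the game structure summarized above, the following toy round has an attacker modify a model, an auditor inspect only the modified model, and a score computed afterwards. All names (`play_round`, `AuditExample`, the lookup-table "model", scoring by recall over planted prompts) are hypothetical simplifications, not the post's actual setup; a real judge would assess a natural-language description of the vulnerability.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set, Tuple

Model = Dict[str, str]                   # toy "model": prompt -> completion
Spec = Callable[[str, str], bool]        # True if the completion meets the spec
Attacker = Callable[[Model], Tuple[Model, Set[str]]]
Auditor = Callable[[Model, Spec], Set[str]]

@dataclass
class RoundResult:
    planted: Set[str]     # prompts the attacker broke (hidden from the auditor)
    reported: Set[str]    # prompts the auditor claims violate the spec
    recall: float         # fraction of planted violations the auditor found

@dataclass
class AuditExample:
    """One (attack, desired audit) pair, e.g. for finetuning an automated auditor."""
    modified_model: Model
    desired_audit: str

def play_round(original: Model, attacker: Attacker,
               auditor: Auditor, spec: Spec) -> Tuple[RoundResult, Model]:
    modified, planted = attacker(dict(original))   # attacker works on a copy
    reported = auditor(modified, spec)             # auditor sees only the modified model
    recall = len(planted & reported) / len(planted) if planted else 1.0
    return RoundResult(planted, reported, recall), modified

def to_training_example(result: RoundResult, modified: Model) -> AuditExample:
    audit = f"Spec violated on prompts: {sorted(result.planted)}"
    return AuditExample(modified_model=modified, desired_audit=audit)

# --- Toy instantiation (all hypothetical) ---
REFERENCE: Model = {"2+2=": "4", "Capital of France:": "Paris", "Sky color:": "blue"}

def spec(prompt: str, completion: str) -> bool:
    return REFERENCE.get(prompt) == completion

def toy_attacker(model: Model) -> Tuple[Model, Set[str]]:
    model["2+2="] = "5"
    return model, {"2+2="}

def toy_auditor(model: Model, spec: Spec) -> Set[str]:
    # Brute-force behavioral check; real auditors would also use interpretability tools.
    return {p for p, c in model.items() if not spec(p, c)}

def demo_round() -> Tuple[RoundResult, AuditExample]:
    result, modified = play_round(REFERENCE, toy_attacker, toy_auditor, spec)
    return result, to_training_example(result, modified)
```

Collecting many `AuditExample` records across different attackers is one way to picture the dataset the summary mentions for finetuning an automated auditor.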

Planned opinion:

I like the auditing game as a framework for constructing benchmarks for worst-case interpretability -- you can instantiate a particular benchmark by defining a specific adversary (or distribution of adversaries). Automating auditing against a human attacker seems like a good long-term goal, but it seems quite intractable given current capabilities.

Daniel Kokotajlo (4y)

Nitpick: Your use of the term "Inside view" here is consistent with how people have come to use the term, often, in our community -- but has little to do with the original meaning of the term, as set out in Kahneman, Tetlock, Mellers, etc. I suggest you just use the term "real reason." I think people will know what you mean, in fact I think it's less ambiguous / prone to misinterpretation. Idk. It interests me that this use of "inside view" slipped past my net earlier, and didn't make it into my taxonomy.

Obviously your words are your own, you are free to use whatever words you want, I don't want to be a grammar nazi! I am uncomfortable writing these words, for that reason. I guess what I am trying to do here (that's more constructive than just being a grammar nazi) is figure out what you mean and add it to my taxonomy. Would you agree that "real reason" would work well to capture what you mean here? If not, what term would you propose instead, and why? (I may not fully grok what you mean).

evhub (4y)

Is it inconsistent with the original meaning of the term? I thought that the original meaning of inside view was just any methodology that wasn't reference-class forecasting—and I don't use the term “outside view” at all.

Also, I'm not saying that “inside view” means “real reason,” but that my real reason in this case is my inside view.

Daniel Kokotajlo (4y)

I regret saying anything! But since I'm in this hole now, might as well try to dig my way out:

IDK, "any methodology that wasn't reference-class forecasting" wasn't how I interpreted the original texts, but *shrugs.* But at any rate what you are doing here seems importantly different than the experiments and stuff in the original texts; it's not like I can point to those experiments with the planning fallacy or the biased pundits and be like "See! Evan's inside-view reason is less trustworthy than the more general thoughts he lists below; Evan should downweight its importance and not make it his 'real reason' for doing things."

My thesis in "Taboo Outside View" was that we'd all be better off if, whenever we normally would use the terms "inside view" and "outside view," we tabooed those terms and used something more precise instead. In this case, I think that if you just used "real reason" instead of "inside view," it would work fine -- but that's just my interpretation of the meaning you were trying to convey; perhaps instead you were trying to convey additional information beyond "real reason" and that's why you chose "inside view." If so, you may be interested to know that said additional information never made it to me, because I don't know what it might be. ;) Perhaps it was "real reason" + "This is controversial, and I'm not arguing for it here, I don't expect everyone to agree"?

I guess meaning is use, and if enough people even after reading my post still feel compelled to use these terms without tabooing them, then fair enough, even if there isn't a succinct definition of what they mean. I wish we had invented new words though instead of re-using "inside view" and "outside view" given that those terms already had meanings from Tetlock, Kahneman, etc.

Ofer (4y)

In particular, if automating auditing fails, that should mean we now have a concrete style of attack that we can’t build an auditor to discover—which is an extremely useful thing to have, as it provides both a concrete open problem for further work to focus on, as well as a counter-example/impossibility result to the general possibility of being able to make current systems safely auditable.

What does such a scenario (in which "automating auditing fails") look like? The alignment researchers who will work on this will always be able to say: "Our current ML models are just not capable enough for implementing such an auditor. But if we use 10x training compute or a better architecture etc., we may succeed."

evhub (4y)

Sure, but presumably they'll also say what particular attacks are so hard that current ML models aren't capable of solving them—and I think that's a valuable piece of information to have.

mic (3y)

Is the auditing game essentially Trojan detection?

Kshitij Sachan (3y)

Yes, I think trojan detection is one version of the auditing game. A big difference is that the auditing game involves the red team having knowledge of the blue team's methods when designing an attack. This makes it much harder for the blue team.

mic (3y)

Has anyone tried to work on this experimentally?

Kshitij Sachan (3y)

Mechanistic Anomaly Detection (MAD) is a version of the auditing game with a few differences:

The auditing game puts the onus on the red team to design a subtle backdoor such that giving the blue team a description of the backdoor is insufficient for the blue team to generate inputs that trigger the backdoor.
The blue team is given the ideal behavior specification by the judge.

In MAD, the blue team is given a backdoored model but not necessarily a description of what the model is doing (from the judge) or of the backdoor behavior (from the red team). Instead, the blue team is given a model that does something and a clean dataset. Then, on all test inputs, they must determine whether the model is using the same mechanism as on the clean set or some other mechanism (presumably the backdoor).
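
To illustrate the shape of the MAD task described above, here is a minimal sketch that fits per-dimension statistics on clean-set "activations" and flags test inputs whose activations fall far outside them. This is a generic anomaly-detection baseline chosen for illustration, not Redwood's technique; the activation vectors, function names, and the threshold of 3 standard deviations are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]  # hypothetical: hidden activations recorded for one input

def fit_clean_stats(clean_acts: List[Vector]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and standard deviation over the clean dataset."""
    n, dims = len(clean_acts), len(clean_acts[0])
    means = [sum(a[d] for a in clean_acts) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((a[d] - means[d]) ** 2 for a in clean_acts) / n
        stds.append(max(math.sqrt(var), 1e-8))  # avoid division by zero
    return means, stds

def anomaly_score(act: Vector, means: List[float], stds: List[float]) -> float:
    """Largest absolute z-score across dimensions."""
    return max(abs(x - m) / s for x, m, s in zip(act, means, stds))

def is_anomalous(act: Vector, means: List[float], stds: List[float],
                 threshold: float = 3.0) -> bool:
    """Guess that a different mechanism is active if any dimension is far from clean."""
    return anomaly_score(act, means, stds) > threshold

# Toy data (invented for illustration): second dimension stays near 0 on clean inputs.
CLEAN = [[1.0, 0.0], [1.2, 0.1], [0.8, -0.1], [1.1, 0.05], [0.9, -0.05]]

def demo() -> Tuple[bool, bool]:
    means, stds = fit_clean_stats(CLEAN)
    typical = is_anomalous([1.0, 0.0], means, stds)
    backdoor_like = is_anomalous([1.0, 3.0], means, stds)
    return typical, backdoor_like
```

A detector this simple only catches mechanisms that show up as large activation shifts; it says nothing about subtle backdoors of the kind the red team is incentivized to build.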

Redwood Research has been doing experimental work on MAD in toy settings. We have some techniques we're happy with that do quite well on small problems but that have theoretical issues solving the downstream deceptive alignment/ELK cases we're interested in.

