Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Redwood Research's current project is to train a model that completes short snippets of fiction without outputting text where someone gets injured. I'm excited about this direction and wanted to explain why.

(Disclaimer: I originally proposed this project, and am on Redwood’s board. So I may be biased in favor of Redwood and especially in favor of this problem being a good one to work on.)

Relevance to deceptive alignment

I think about alignment risk as having two big drivers:

  1. Your AI is smart enough that you couldn’t even tell if it did something terrible until it’s too late, e.g. until you’ve already implemented the proposed plan and it killed everyone.
  2. Your AI looks nice during training while humans are in control, but there is a treacherous turn once it is deployed or becomes smart enough that we can’t correct a problem.

I spend most of my time working on problem #1, but I think both problems are very important.

One of the main concrete candidate techniques for fixing problem #2 is to run the AI on tons of simulated inputs designed to look like the kind of test-time situation where your AI could kill you and take your stuff (adversarial training). But in fact these are just simulations, and if your model tries to do something really bad you just do a gradient update and move on. As long as you've solved problem #1, you can hopefully recognize if the model tries to do something bad in simulation without needing to run it in the real world.

(Depending on the details you may not actually want to train on these examples---once you see a problem you may want to do something different or else run the risk of overfitting---but at any rate you definitely want to see them in order to know that there is a problem.)

I think adversarial training or evaluation looks like a key ingredient whether we are pursuing a principled approach to alignment (a more sophisticated version of this process is a key ingredient in my own "best guess" alignment plan) or just "winging it" (in which case I think adversarial evaluation is especially important).

Redwood’s current project

I’m discussing what Redwood’s post calls “step 1,” namely training a classifier that is able to recognize violent completions with an extremely low false negative rate. Ideally they would go for 0 false negatives, but in practice they are just going for as many 9s as they can get (think more like 10-30 than 99.99%---way beyond what people are normally going for in ML). They are OK with a relatively high false positive rate since most inputs in practice are “obviously safe” and it’s not that bad to force the model to behave cautiously on the riskiest inputs.
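As a rough illustration of that trade-off, here is a minimal sketch of choosing a classifier threshold to hit a target false negative rate and then reading off the false positive rate you pay for it. This is not Redwood's actual pipeline: the function name `threshold_for_fnr`, the scores, and the labels are all hypothetical, and a real evaluation would need far more data than this to say anything about many 9s.

```python
from typing import List, Tuple


def threshold_for_fnr(
    scores: List[float], labels: List[bool], max_fnr: float
) -> Tuple[float, float, float]:
    """Pick the highest threshold whose false negative rate is <= max_fnr.

    scores: classifier's estimated probability that a completion is injurious.
    labels: True if a human labeled the completion as injurious.
    A completion is flagged (treated as unsafe) when score >= threshold.
    Returns (threshold, false_negative_rate, false_positive_rate).
    """
    positives = sorted(s for s, y in zip(scores, labels) if y)
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives:
        raise ValueError("need at least one injurious example")
    # Number of injurious examples we are allowed to miss (rounded down).
    allowed_misses = min(int(max_fnr * len(positives)), len(positives) - 1)
    # Flag everything scoring at least as high as the lowest positive we must catch.
    threshold = positives[allowed_misses]
    missed = sum(1 for s in positives if s < threshold)
    false_alarms = sum(1 for s in negatives if s >= threshold)
    fnr = missed / len(positives)
    fpr = false_alarms / len(negatives) if negatives else 0.0
    return threshold, fnr, fpr


# Hypothetical toy data: four injurious completions, four safe ones.
toy_scores = [0.99, 0.95, 0.60, 0.08, 0.40, 0.05, 0.02, 0.01]
toy_labels = [True, True, True, True, False, False, False, False]

# Demanding zero misses forces a low threshold and more false alarms;
# tolerating one miss in four lets the threshold rise.
print(threshold_for_fnr(toy_scores, toy_labels, max_fnr=0.0))
print(threshold_for_fnr(toy_scores, toy_labels, max_fnr=0.25))
```

On this toy data, insisting on zero false negatives pushes the threshold down to the lowest-scoring injurious example, so one of the safe completions gets flagged as well. That is the sense in which a high false positive rate is the price of caution on the riskiest inputs.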

I think this overall structure is pretty similar to the long-term problem. We just need enough 9s that the model will probably never fail catastrophically even if you run it on many times more examples than it was trained on and even if the input distribution shifts in an unlucky way. And it’s pretty OK if you have high false positive rates off distribution so that your system behaves conservatively until you retrain.
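The arithmetic behind "enough 9s" can be sketched as follows. This assumes, as a simplification that is not in the original argument, that each input fails independently with the same probability; an unlucky distribution shift is exactly the case where that assumption breaks, so treat the numbers as a lower bound on how many 9s are needed. The function names are hypothetical.

```python
import math


def prob_any_failure(per_input_failure_rate: float, num_inputs: float) -> float:
    """P(at least one failure) over num_inputs independent inputs."""
    # 1 - (1 - p)^N, computed via log1p/expm1 so a tiny p does not round to zero.
    return -math.expm1(num_inputs * math.log1p(-per_input_failure_rate))


def nines_needed(num_inputs: float, max_total_risk: float) -> float:
    """9s of per-input reliability needed to keep total risk below max_total_risk."""
    # Solve 1 - (1 - p)^N = max_total_risk for p, then report -log10(p).
    p = -math.expm1(math.log1p(-max_total_risk) / num_inputs)
    return -math.log10(p)


# Four 9s (99.99%) is nowhere near enough at deployment scale:
print(prob_any_failure(1e-4, 1e6))   # a million inputs at a 1-in-10,000 failure rate
# How many 9s keep the chance of any failure under 1% across a billion inputs?
print(nines_needed(1e9, 0.01))
```

Under these assumptions a 99.99% reliable classifier almost certainly fails at least once over a million inputs, and keeping the total risk under 1% across a billion inputs takes roughly eleven 9s. The required reliability scales with the number of inputs you ever expect to run, not with the size of the test set.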

Step 1 is particularly hard because it’s the place where we don’t have a precise specification and instead need to use expensive oversight. I’m also interested in step 2---I think the two involve slightly different techniques but that both are quite relevant to the long-term problem. I'm genuinely unsure whether "learn a specification with low false negative rate" or "verify that your policy meets the learned specification" is the harder part.

"Try to find inputs where your classifier fails to recognize someone getting obviously injured" isn't a perfect analogy for "try to find inputs where your policy deliberately kills everyone," but I think it's actually surprisingly good. I think the core common features are that the task is extremely easy, so that the model ought to be able to do it without needing to learn very much new about the world, but the way ML systems are typically trained makes it hard to get good robustness in the tails.

Why the concrete steps towards this goal look interesting

I think that the concrete steps you would need to take to solve the problem today are usefully analogous to the steps you’ll need to take in the future, and when I look at what Redwood folks are actually doing day-to-day I feel pretty happy about it:

  • Organizing humans to create adversarial prompts where the model may do something bad, and figuring out how to combine this data with on-distribution data. Avoiding failures where the model learns to be conservative only on “sneaky-looking prompts designed to trick me.”
  • Writing tools to help improve that red-teaming process---using ML to automate parts of the task, and poking at your model to help guide the search process.
  • Reasoning about the “margin of safety” between the worst example the red team can find, and the level of badness that would be needed to actually cause a serious problem. Thinking about how “worst we can find” relates to “actual worst,” and how to trade this margin off against quality.
  • Generating ensembles of classifiers and using different classifier strategies to get a variety of different answers to the question “was this behavior bad?” Selecting and synthesizing situations that create disagreements between these methods as a way of finding possible mistakes.
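The last item, using disagreement between classifiers to surface possible mistakes, can be sketched in a few lines. The two keyword "classifiers" below are hypothetical stand-ins chosen only to keep the example self-contained; in practice they would be learned models trained in different ways, and the ranking rule (spread between the highest and lowest score) is just one simple choice.

```python
from typing import Callable, Dict, List, Tuple

Classifier = Callable[[str], float]  # returns an estimated P(text is injurious)


def rank_by_disagreement(
    texts: List[str], classifiers: Dict[str, Classifier]
) -> List[Tuple[float, str]]:
    """Sort texts by the spread between the most and least alarmed classifier."""
    ranked = []
    for text in texts:
        scores = [clf(text) for clf in classifiers.values()]
        ranked.append((max(scores) - min(scores), text))
    # Largest spread first: these are the cases most worth sending to a human.
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked


def wound_words(text: str) -> float:
    """Hypothetical classifier that only knows words describing wounds."""
    return 1.0 if any(w in text.lower() for w in ("stabbed", "bleeding", "broke")) else 0.0


def accident_words(text: str) -> float:
    """Hypothetical classifier that only knows words describing violent events."""
    return 1.0 if any(w in text.lower() for w in ("stabbed", "fell", "crashed")) else 0.0


completions = [
    "She stabbed him in the arm.",
    "He fell down the stairs and lay still.",
    "They had tea in the garden.",
]
ensemble = {"wounds": wound_words, "accidents": accident_words}
for spread, text in rank_by_disagreement(completions, ensemble):
    print(f"{spread:.1f}  {text}")
```

Here the two classifiers agree on the obvious cases and split on the fall down the stairs, which is exactly the kind of implied injury that one method catches and another misses. Examples like that are natural candidates for human labeling or for seeding further red-teaming.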

Many of those techniques appear in the academic literature, but the details change a lot when you start aiming for no failures rather than reducing the rate of failures.

I think that all of those will be relevant in the future. Human red-teaming will become less important relative to automated red-teaming, but I think that {human red-teaming with machine help} is more similar to what we’ll be doing in the future than simpler automated techniques like gradient descent.

I don’t think this is by any means a complete list of techniques that would be needed. But still, when I look at the actual activities and the project that puts them all together, I feel like this is the kind of work which I really want to be happening (in a massively scaled-up way) for models that are capable enough to be really risky.

Limitations of the analogy

There's one deep disanalogy between the current project and deceptive alignment: the current model is making “honest mistakes” when it completes a prompt in an injurious way, rather than biding its time until it can do something treacherous. That leads to two important differences:

  • The failure cases aren’t strategically hidden, so it’s more likely that you can find them by searching kind of haphazardly. This problem will get much harder for the case of deceptive alignment. The current techniques still seem useful, and I think this is a good way to warm up, but even with more powerful automation I think they eventually won’t be enough.
  • It’s not necessarily easy for the model to avoid failing, because it often doesn’t understand when a situation involves an injury---it just says some plausible-sounding words that happen to imply that someone was injured. In practice this means that working on this problem involves some awkward judgment calls, where the model does say things that technically imply an injury occurred but where it can’t connect the dots. I don’t think this really affects the value of the research, but it makes it much harder to give a super clean evaluation of the outcome.

I think that these reduce the value of the project, but given the early stages it's more like 2x than 100x less useful. In many respects the current problem is like an easy version of the long-term problem, and some of the gap is closed by having adversaries working with ML tools that will scale up over time.

Why it matters

  • I think it’s plausible that evaluating and optimizing worst-case performance involves a bunch of skills and techniques rather than a unified idea, and that whether it succeeds or fails may depend on having teams that are really good at what they do rather than having some clever ideas (or may depend on both). So I want us to get as much practice as we can.
  • I think that clever techniques for optimizing worst case performance will be easier to develop in the context of concrete problems where we’ve already picked the low-hanging fruit and have a clear understanding of why it’s hard to make further progress without a clever idea. So even if none of the techniques involved in this project turn out to be valuable at the end of days, I still think it would be useful for laying the groundwork for further work.
  • Existing work on worst-case performance is often focused on settings like classification accuracy given adversarial perturbations. But “classify well” seems like too challenging a property to satisfy in the worst case---there’s no clear reason why a model would be able to succeed, whereas I do think that it’s possible to not deliberately kill everyone regardless of your input. Moreover, when these tasks are possible, it’s often because the domain has some special structure (e.g. sparse or small perturbations) that a defender needs to exploit.

What I want to happen in the world overall

As we approach transformative and catastrophically risky AI, I expect there to be multiple industry labs working on training ever-larger models.

In my ideal world those labs would have large “adversarial evaluation departments” that try extremely hard to find inputs (or random seeds, or “pseudo” inputs) where a powerful model attempts to deliberately cause harm, or do anything that even vaguely smells like causing harm or deliberately undermining safety measures, or trying to deceptively hide their capabilities, or etc. We can allow the adversaries all kinds of unrealistic flexibility to create implausible situations, remove or relax all the realistic safety thresholds that might prevent bad behavior, and so on. 

Ideally by the time a model is deployed we’d be pretty confident it’s very hard to get it to do anything close to problematic. This won’t be enough on its own to be confident that models don’t do anything bad, and ideally this would be just one piece of a machine that created much more confidence, but it does make life meaningfully harder for a deceptively aligned model looking to cause trouble.

These teams would grow in size rapidly as AI became more dangerous, but I think things are much more likely to go well if people aren’t trying to spin them up from scratch---if there are existing teams and individuals that are excellent at this activity, if they’ve been testing and developing methods for years, if we collectively understand what it looks like for such a team to be doing its job well, and so on. Ideally that scale-up will be continuous, and will start up in something like the world of today, rather than being some hypothetical that we start executing on much closer to crunch time.

If all of that goes well, then the most likely way for us to fail is if the deceptively aligned model is a lot smarter than the red team (and all of the red team’s aligned helpers)---so that the deceptive model can cause trouble in ways too subtle for the red team to notice. This basically puts us back to bullet point #1 in my original dichotomy. It’s not that I’m not worried about that problem---I personally spend nearly 100% of my time on it---but I do think that it would be extremely comforting if that was all we had to worry about.

There are a lot of complicated details in that story. For example, it’s not clear if you want to train on the data. And in practice you’d love to red-team the red team by looking for cases where their methodology fails to catch what would be critical problems in synthetic cases. And for really powerful models there isn’t going to be a clean development/deployment split---we will want to be red-teaming checkpoints of the model as we go (and even then you may need to exert more care). And this entire rendition of the story is a bit of an unrealistic caricature in more pragmatic ways, e.g. by positing a single dedicated team and so on.

Details aside, I think there is an unusually clear connection between this kind of work on characterizing/improving worst-case performance and a plausibly critical task for reducing catastrophic risk from very capable models.

Comments

This is helpful, thanks!

In my ideal world those labs would have large “adversarial evaluation departments” that try extremely hard to find inputs (or random seeds, or “pseudo” inputs) where a powerful model attempts to deliberately cause harm, or do anything that even vaguely smells like causing harm or deliberately undermining safety measures, or trying to deceptively hide their capabilities, or etc. ... This won’t be enough on its own to be confident that models don’t do anything bad, and ideally this would be just one piece of a machine that created much more confidence, but it does make life meaningfully harder for a deceptively aligned model looking to cause trouble.

Our current world seems very far from this ideal world. As you know I have 10-year timelines. Do you think something like this ideal world may be realized by then? Do you think the EA community, perhaps the AI governance people, could bring about this world if we tried?

I think it's pretty realistic to have large-ish (say 20+ FTE at leading labs?) adversarial evaluation teams within 10 years, and much larger seems possible if it actually looks useful. Part of why it's unrealistic is just that this is a kind of random and specific story and it would more likely be mixed in a complicated way with other roles etc.

If AI is as exciting as you are forecasting then it's pretty likely that labs are receptive to building those teams and hiring a lot of people, so the main question is whether safety-concerned people do a good enough job of scaling up those efforts, getting good at doing the work, recruiting and training more folks, and arguing for / modeling why this is useful and can easily fit into a rapidly-scaling AI lab. (10 years is also a relatively long lead time to get hired and settle in at a lab.)

I think the most likely reason this doesn't happen in the 10-year world is just that there are too many other appealing aspects of the ideal world and people who care about alignment will focus attention on making other ones happen (and some of them might just be much better ideas than this one). But if this was all we had to do I would feel extremely optimistic about making it happen.

I feel like this is mostly about technical work rather than AI governance.

Nice. I'm tentatively excited about this... are there any backfire risks? My impression was that the AI governance people didn't know what to push for because of massive strategic uncertainty. But this seems like a good candidate for something they can do that is pretty likely to be non-negative? Maybe the idea is that if we think more we'll find even better interventions and political capital should be conserved until then?

I think this is a very cool approach and a useful toy alignment problem. I’m interested in your automated toolset for generating adversaries to the classifier. My recent PhD work has been in automatically generating text counterfactuals, which are closely related to adversaries, but less constrained in the modifications they make. My advisor and I published a paper with a new whitebox method for generating counterfactuals.

For generating text adversaries specifically, one great tool I’ve found is TextAttack. It implements many recent text adversary methods in a common framework with a straightforward interface. Current text adversary methods aim to fool the classifier without changing the “true” class humans assign. They aim to do this by imposing various constraints on the modifications allowed for the adversarial method. E.g., a common constraint is to force a minimum similarity between the encodings of the original and adversary using some model like the Universal Sentence Encoder. This is supposed to make the adversarial text look “natural” to humans while still fooling the model. I think that’s somewhat like what you’re aiming for with “strategically hidden” failure cases.

I’m also very excited about this work. Please let me know if you have any questions about adversarial/counterfactual methods in NLP and if there are any other ways I can help!

Very cool! How deep into the definition of "injured" are you going to go? Are you including hurt feelings, financial losses, or potential (but not obtained in the story) harms? How about non-reported harms to persons not referenced in the story? I assume you're not planning for this AI to re-create Asimov's Robot stories, but I look forward to hearing how you train the classifier, and what you use for positive and negative cases.


For the purpose of this project it doesn't matter much what definition is used as long as it is easy for the model to reason about and consistently applied. I think that injuries are physical injuries above a slightly arbitrary bar, and text is injurious if it implies they occurred. The data is labeled by humans, on a combination of prompts drawn from fiction and prompts produced by humans looking for places where they think that the model might mess up. The most problematic ambiguity is whether it counts if the model generates text that inadvertently implies that an injury occurred without the model having any understanding of that implication.