Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Disclaimer: I originally wrote this list for myself and then decided it might be worth sharing. Very unpolished, not worth reading for most people. Some (many?) of these ideas were suggested in the paper already, I don’t make any claims of novelty.


 

I am super excited by Redwood Research’s new paper, and especially by the problem of “high-stakes reliability”.

Others have explained much better why the problem of “high-stakes reliability” is a worthy one (LINKLINK), so I’ll keep my own opinion brief. In short, I think:

  • Solving “high-stakes reliability” is probably essential for building safe AGI.
  • You can do useful empirical research on it now already.
  • High-stakes reliability would be super useful for today's systems and applications: systems that make high-stakes decisions, where a single instance of a certain bad behaviour would be terrible (e.g. confidently but incorrectly classifying a medical image).
  • I think there is little existing work on exactly this problem, but I might be wrong about this. There is definitely a lot of work on related problems.


 

Here is an extremely unpolished list of ideas and directions for follow-up research that I had after reading the paper, in no particular order.


 

How does this methodology fail?

  • Find (ideally systematic) ways to break their classifier (I tried a bit, it is actually harder than expected :-) )
  • How exactly are capabilities impacted by the filtering?
  • From Paul Christiano:
    • Avoiding failures where the model learns to be conservative only on “sneaky-looking prompts designed to trick me.”


 

Improve the classifier or its training

  • use uncertainty in the classifier
    • (if necessary via expensive ensembles; see the sketch after this list)
  • something something out-of-distribution detection
  • experiment with prompting


 

Do a project on high-stakes reliability that does not use human oversight, in order to focus more on the ML side of things

  • Example project: Get a model to solve easy programming questions. The model is allowed to not submit any code for a question, and it's fine if the model doesn't submit code for most questions. However, if the model does submit code, the code should pass the test cases in 99.9X% of cases, for any input (from a natural distribution). (See the evaluation sketch after this list.)
  • If you work with tasks that require human oversight (like "does this snippet entail injuries?"), you learn many things, e.g. how to enable humans to find adversarial inputs. But there are also parts of high-stakes reliability that don't require you to work with humans. Getting a model to almost never do bad behaviour X, on any (natural/non-gibberish?) input, while not losing much average performance, is a challenging project even if we can automatically evaluate model outputs.
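As a minimal sketch of the evaluation I have in mind for this hypothetical project (my framing, not from the paper): only demand near-perfect correctness on the problems where the model actually submits code, and separately track how often it submits. All numbers below are made up.

```python
def evaluate(submissions):
    """submissions: list of (submitted, passed_all_tests) booleans, one per problem."""
    attempted = [passed for submitted, passed in submissions if submitted]
    coverage = len(attempted) / len(submissions)           # how often the model submits
    reliability = sum(attempted) / len(attempted) if attempted else 1.0
    return coverage, reliability

# Toy results: 40 correct submissions, 2 incorrect ones, 58 abstentions.
toy_results = [(True, True)] * 40 + [(True, False)] * 2 + [(False, False)] * 58
coverage, reliability = evaluate(toy_results)
print(f"coverage={coverage:.2f}, reliability on attempted problems={reliability:.4f}")
# The high-stakes requirement would be reliability >= 0.999..., while coverage
# is allowed to stay low.
```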

Do a project on high-stakes reliability that uses real-world high-stakes tasks

  • I think this could be useful for field-building, making high-stakes reliability a subfield of mainstream ML robustness
  • E.g. in medical image classification (a toy selective-prediction sketch follows after this list):
    • The classifier has the option to say "I don't know" (IDK)
    • It's fine if the classifier says "IDK" pretty often, e.g. 50% of the time. We'll just have a human look at these examples.
    • But whenever the classifier does predict a class other than "IDK", I require extremely high accuracy (let's say >99.95%).
    • This accuracy must be high for samples from any part of the "natural" distribution (e.g. medical images from various hospitals/scanners/populations).
      • Ideally, it would even be robust to adversarial attacks, but that is probably too hard
  • Alternatively, if the above is too challenging, we could change the setting such that the classifier only has to choose between "healthy" and "IDK".
  • Seems like such systems would be pretty useful because they reduce the fraction of examples that humans have to look at, while introducing very few errors.
  • It also seems that building such systems should be possible, because "never confidently saying something incorrect, even under distribution shift" seems much easier than "as often as possible predicting the correct class, even under distribution shift".
  • There are definitely subfields of ML that work on quite related questions (e.g. selective classification / classification with a reject option, or OOD detection). I’m not sure if many people work on something exactly like this, maybe yes, maybe no.
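A toy sketch of this selective-prediction setup (assumed, not an existing system): the classifier abstains unless its confidence exceeds a very high threshold, and we report coverage (how often it answers) alongside accuracy on the answered cases. The threshold and data below are placeholders; a real system would calibrate the threshold on held-out data from many hospitals/scanners/populations.

```python
import numpy as np

def selective_predict(probs, threshold=0.9995):
    """probs: (n_samples, n_classes) predicted probabilities.
    Returns the predicted class, or -1 for "I don't know"."""
    preds = probs.argmax(axis=1)
    preds[probs.max(axis=1) < threshold] = -1   # abstain unless extremely confident
    return preds

def selective_metrics(preds, labels):
    answered = preds != -1
    coverage = answered.mean()                  # fraction NOT handed to a human
    accuracy = (preds[answered] == labels[answered]).mean() if answered.any() else 1.0
    return coverage, accuracy

# Toy data: 5 images, 3 classes; only the very confident rows get answered.
probs = np.array([[0.9999,  0.0001,  0.0],
                  [0.6,     0.3,     0.1],
                  [0.0,     0.99995, 0.00005],
                  [0.5,     0.25,    0.25],
                  [0.99999, 0.0,     0.00001]])
labels = np.array([0, 1, 1, 2, 0])
print(selective_metrics(selective_predict(probs), labels))   # (0.6, 1.0)
```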


 

Do a project on high-stakes reliability that is more similar to AGI alignment in one (or several?) key way(s):

  • Key dissimilarities between the problem tackled here, vs AGI alignment:
    • The undesired behaviour X here is outputting text that implies injuries, which is very easy to spot. In AGI alignment, a big chunk of the most critical behaviour is deception, which is harder to spot. This project is thus an easier version of AGI alignment
    • some failures that the model currently makes are not alignment failures, but just "honest mistakes"
    • When training an AGI, there are already dangerous failure modes during training, not only during deployment
    • other dissimilarities …



 

How to evaluate high-stakes reliability?

  • It's hard to know which distribution to evaluate on
  • On some distributions, you have extremely low failure rates, making simple empirical evaluations infeasible if they require human input (e.g. labels); see the back-of-the-envelope sketch after this list
    • no problem for tasks that can be automatically evaluated
    • if you need human labels, one could try forms of active testing
  • label noise in the test set might become a big problem here
    • Positive datapoints (“contain injury”) that were falsely labelled as negative: If the classifier correctly classifies such a datapoint as positive, it would appear as a false positive according to the classifier. We don't care about false positives that much, so we also don't care if some of the purported false positives are actually true positives. However, the classifier might classify some such datapoints as negative, especially if labelling errors (by humans) and classification errors (by the classifier) are correlated. That would be problematic.
    • Negative datapoints (“don’t contain injury”) that were falsely labelled as positive: These would appear as false negatives according to the classifier. The whole point of the classifier is to have a very low false negative rate, so you can just look at all false negatives manually, and then you'd easily spot any mislabelled ones.
  • I'd guess their evaluation of performance/alignment tax (simply having humans vote on whether they prefer the output of the filtered or unfiltered model, on distribution) is maybe too optimistic. E.g. I'd guess that the filter just never allows certain words (like "fracture"), which could be problematic in certain settings (like medical literature). Basically: study in more detail what the filtering does to the generated text
  • They write:
    • “Better ways to measure reliability: We’d like to have better techniques both in-distribution (where we want to detect extremely rare failures) and out-of-distribution (where we might be measuring e.g. the worst attacks that can be found and want to be sure we’re covering the whole space)”
  • Paul writes:
    • “Reasoning about the “margin of safety” between the worst example the red team can find, and the level of badness that would be needed to actually cause a serious problem. Thinking about how “worst we can find” relates to “actual worst,” and how to trade this margin off against quality.”
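A back-of-the-envelope sketch (mine, not from the paper) of why measuring very low in-distribution failure rates is infeasible with naive human labelling: if you observe zero failures in n labelled samples, the 95% upper confidence bound on the true failure rate is still only about 3/n (the "rule of three").

```python
import math

def samples_needed_for_zero_failures(target_rate, confidence=0.95):
    """Approximate number of failure-free labelled samples needed before you can
    claim the failure rate is below target_rate at the given confidence."""
    return math.ceil(-math.log(1 - confidence) / target_rate)

for rate in [1e-2, 1e-3, 1e-4]:
    print(f"failure rate < {rate:g}: need ~{samples_needed_for_zero_failures(rate):,} clean samples")
# -> roughly 300 / 3,000 / 30,000 human-labelled samples with zero observed
# failures, which is why automatic evaluation or active testing matters so much.
```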


 

Better adversarial attacks

  • active learning (as if that ever worked in LLMs :-P )
    • but maybe it does work if you put a lot of effort/compute into good uncertainties, e.g. ensembling (one finetuned GPT, one finetuned LaMDA, ...)? Not sure if that has been tried for LLMs, as it's usually too expensive.
  • better tools, e.g. interpretability
  • better training of humans
  • automatic generation of candidates:
    • seems like the thing they're doing is a constrained, discrete optimisation problem over the tokens. Objective function to minimize: classifier score. Constraint: sequence needs to entail injury. There are ML methods for solving problems like these (we even have gradients). The constraint is obviously the difficult part.
    • So maybe it's possible to automatically generate many completions that the model classifies as harmless but that are potentially harmful, and then the humans only need to decide about the harmfulness (see the toy sketch after this list).
  • Adversarial attacks (in the common NLP meaning of the word :-), i.e. changing tokens so that the semantics of the sequence is unchanged, but the sequence gets classified differently) on harmful sequences at the decision boundary.
  • They write:
    • “Stronger and better-characterized adversarial attacks: Better attacks could increase the volume of adversarial data and increase the space of vulnerabilities covered. There are various directions one could imagine: more tools to assist humans, strong active learning, or mostly-automated attacks. We’d want to more rigorously measure how well different attacks work.”
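A toy sketch of what the automatic candidate generation above could look like: greedy token substitution that minimises the classifier's injury score subject to the "still entails injury" constraint. Everything here is a made-up stand-in (a random linear classifier, a constraint stub that always returns True); in the real setting the constraint check is the hard part and needs a human or a trusted model, and gradients through the embeddings could rank candidate substitutions instead of the brute-force inner loop.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, DIM, SEQ_LEN = 50, 8, 6
embeddings = rng.normal(size=(VOCAB_SIZE, DIM))   # toy token embeddings
w = rng.normal(size=DIM)                          # toy linear "injury classifier"

def classifier_score(token_ids):
    """Higher = more likely to be flagged as injurious (toy stand-in)."""
    return float(w @ embeddings[token_ids].mean(axis=0))

def still_entails_injury(token_ids):
    """Placeholder for the hard constraint: does the text still describe an injury?
    In practice this needs a human judge or a trusted model."""
    return True

def greedy_attack(token_ids, steps=20):
    token_ids = list(token_ids)
    for _ in range(steps):
        best_score, best_pos, best_tok = classifier_score(token_ids), None, None
        for pos in range(len(token_ids)):          # try every single-token substitution
            for cand in range(VOCAB_SIZE):
                trial = token_ids.copy()
                trial[pos] = cand
                s = classifier_score(trial)
                if s < best_score and still_entails_injury(trial):
                    best_score, best_pos, best_tok = s, pos, cand
        if best_pos is None:                       # no improving substitution left
            break
        token_ids[best_pos] = best_tok
    return token_ids, classifier_score(token_ids)

start = list(rng.integers(0, VOCAB_SIZE, SEQ_LEN))
adv, final_score = greedy_attack(start)
print(f"score before: {classifier_score(start):.3f}, after: {final_score:.3f}")
```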




 

Create a benchmark dataset that captures worst-case reliability

  • Useful for field-building/enabling research
  • Needs to solve some of the issues with evaluation
  • Must pick an easy task: it needs to be relatively easy for the model to perform, because otherwise you don't even get into the high-reliability region
  • More adversarial training, until it actually helps a lot. This would be more promising with larger models (better understanding and higher sample efficiency). A skeleton of the loop follows after this list.
  • Relaxed adversarial training: By requiring adversaries to come up with specific failing examples, adversarial training might place too high a burden on them. Some adversaries might be able to tell that a model would fail in a hypothetical situation even if they can’t construct an input corresponding to the situation directly (probably due to computational constraints). To give a contrived example: A model could fail if it sees a valid Bitcoin blockchain that’s long enough that it suggests it’s the year 2030. Even if the adversary knew that, it couldn’t come up with a valid input. So we need to "relax" the adversary’s task to let it supply "pseudo-inputs" of some sort.
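For concreteness, a bare skeleton (my framing, not the paper's) of the adversarial-training loop the bullet above refers to; every component here is a stub, and each one would be a substantial project in practice.

```python
def train(classifier, dataset):
    return classifier      # stub: fine-tune the classifier on the current dataset

def find_adversarial_candidates(classifier):
    return []              # stub: tool-assisted humans and/or automated attacks

def is_real_failure(example):
    return True            # stub: a human judges whether the candidate truly fails

def adversarial_training(classifier, dataset, rounds=5):
    for _ in range(rounds):
        classifier = train(classifier, dataset)
        failures = [x for x in find_adversarial_candidates(classifier) if is_real_failure(x)]
        if not failures:               # attack budget exhausted without a failure:
            break                      # this is where the "margin of safety" question starts
        dataset = dataset + failures   # fold the new failures back into training
    return classifier

adversarial_training(classifier=None, dataset=[])
```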



 

What does work on high-stakes reliability tell us about the difficulty of alignment?

  • AFAICT, this work primarily targets inner alignment (and capabilities robustness). What do this project and similar ones tell us about the difficulty of inner alignment? What would a project that specifically addresses this question look like?
  • One argument for why solving outer alignment via large-scale human oversight might be difficult is that you might need massive amounts of data to cover all the edge cases. A project on high-stakes reliability that manages to get very low failure rates might also have to address this issue (or not, if generalisation just works). What do this project and similar ones tell us about the difficulty of outer alignment? What would a project that specifically addresses this question look like?


 

Some thoughts/ideas prompted by reading this paper

  • How can we create models that, if they fail, fail in non-catastrophic ways? In human ways?
  • How else could we reach high-stakes reliability, other than adversarial training?
  • Other activities or research to turn "high-stakes reliability" into an established subfield of ML robustness seem useful:
    • See above
    • competitions
    • eventually, organise a workshop
    • ...