
Burns et al. at OpenAI released a paper studying various techniques for fine-tuning strong models on downstream tasks using labels produced by weak models. They call this problem "weak-to-strong generalization", abbreviated W2SG.

Earlier this year, we published a paper, Benchmarks for Detecting Measurement Tampering, in which we investigated techniques for the problem of measurement tampering detection (MTD). MTD is a special case of W2SG. In this post, we'll explain the relationship between MTD and W2SG, and why we think MTD is more likely than fully general W2SG to work. Of course, fully general W2SG is a strictly more valuable problem to solve, precisely because of its generality.

We think MTD is a promising research direction. We're also excited about other special cases of W2SG that have structure which techniques can exploit, especially if that structure is likely to be present in important cases in the future.

MTD as a subset of W2SG

A similar goal

When training an AI, the reward we attribute to different behaviors might not match the reward we would give if we understood the situation better.

The goal of W2SG techniques is to achieve good results when training a strong AI despite only having access to a weak supervisor that understands the situation less well than the strong AI.

MTD is the special case where the weak supervisor has access to measurements which should be sufficient to understand the situation, but these measurements can be tampered with (e.g. replacing the camera feed with some made-up data, disabling tests, or threatening annotators). Because the measurements are sufficient in the absence of tampering, we don’t need to worry about benign mistakes that could happen even without an AI optimizing to make measurements look good.

Slightly different experiments

W2SG can be studied using sandwiching experiments, where we try to get an AI to safely accomplish tasks despite only having access to a weak supervisor, and then we measure the performance of our method using a stronger held-out supervision signal (e.g. held out ground truth labels).

In the case of the OpenAI paper, the weak supervisor is a small language model trained on ground truth labels, as an analogy for human annotators.
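To make this setup concrete, here is a minimal sketch of a sandwiching-style experiment; the toy classifiers and synthetic data are stand-ins for the fine-tuned language models actually used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data standing in for a downstream task with ground-truth labels.
X, y = make_classification(n_samples=6000, n_features=40, n_informative=10, random_state=0)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_weak, X_strong, y_weak, _ = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# "Weak supervisor": a deliberately under-powered model trained on ground truth,
# standing in for human annotators (or a small LM fine-tuned on human labels).
weak = LogisticRegression(max_iter=1000).fit(X_weak, y_weak)

# "Strong" model: a more expressive model trained only on the weak supervisor's labels.
strong = GradientBoostingClassifier().fit(X_strong, weak.predict(X_strong))

print("weak supervisor accuracy:", accuracy_score(y_test, weak.predict(X_test)))
print("weak-to-strong accuracy: ", accuracy_score(y_test, strong.predict(X_test)))
# W2SG asks: how much of the gap between the weak supervisor and a model trained
# directly on ground truth can the weak-to-strong model recover?
```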

In the case of our MTD paper, we have access to measurements, but there is some notion of measurement tampering. In our work, the measurements aim to directly measure the property of interest as a boolean value, so converting from untampered measurements to correct labels is straightforward (and doesn’t require any learning or intelligence).
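To illustrate the shape of the data, here is a rough sketch of what an MTD example looks like; the field names and aggregation rule are illustrative, not the exact schema from our benchmarks.

```python
# Illustrative sketch of an MTD example; not the exact schema from our paper.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MTDExample:
    text: str                      # the input / episode the model acts on
    measurements: List[bool]       # several boolean measurements of the target property
    is_trusted: bool               # True for the small trusted set with no tampering
    ground_truth: Optional[bool]   # held out; used only to evaluate detection techniques

def naive_label(example: MTDExample) -> bool:
    # Absent tampering, the measurements directly reflect the property of interest,
    # so aggregating them (here: all positive) yields a correct label without any
    # extra learning or intelligence.
    return all(example.measurements)
```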

Different hopes for succeeding at W2SG

In both cases, we need some additional assumptions to get strong supervision from a weak supervisor: if we made no assumptions and treated the internal structure of AIs as worst-case, we wouldn't succeed, so we have to depend on some kind of structure in model internals. The structure the OpenAI paper discusses is very different from the structure we hope to leverage in the measurement tampering case.

Confidence and consistency

The OpenAI paper shows that on some tasks, training the model to be confident in addition to matching the weak supervisor’s labels sometimes increases accuracy. The hope is that the model might have a very salient representation of what is true, and a less salient representation of the predictions of the weak supervisor.
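For concreteness, here is a simplified sketch of an auxiliary confidence loss of this kind; the exact weighting and scheduling in the paper differ.

```python
import torch
import torch.nn.functional as F

def confidence_aux_loss(logits: torch.Tensor, weak_labels: torch.Tensor, alpha: float = 0.5):
    """Mix the usual loss against weak labels with a term pushing the strong
    model toward its own confident (hardened) predictions.

    logits: (batch,) raw scores for a binary task; weak_labels: (batch,) in {0, 1}.
    Simplified sketch; the paper's weighting schedule and thresholding differ.
    """
    probs = torch.sigmoid(logits)
    # Cross-entropy against the weak supervisor's labels.
    weak_ce = F.binary_cross_entropy(probs, weak_labels.float())
    # Cross-entropy against the model's own hardened predictions, which rewards
    # being confident rather than hedging toward the weak labels.
    hardened = (probs.detach() > 0.5).float()
    conf_ce = F.binary_cross_entropy(probs, hardened)
    return (1 - alpha) * weak_ce + alpha * conf_ce
```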

The confidence hope is similar to the hope explored in Discovering Latent Knowledge: the inner representation of the truth inside AIs might be more consistent than other features, and be more salient than other consistent features, such that optimizing for consistency (and confidence) could give higher performance than naive training on the weak supervisor labels.
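For reference, the core of the CCS objective from Discovering Latent Knowledge looks roughly like this; it is a sketch only, omitting the activation normalization and the probe trained over contrast pairs.

```python
import torch

def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    """Consistency + confidence loss on contrast pairs (sketch of the CCS objective).

    p_pos / p_neg: probe probabilities that a statement and its negation are true.
    """
    # Consistency: a statement and its negation should get complementary probabilities.
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: discourage the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()
```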

Neither confidence nor consistency is enough for AIs for which "what humans think" is more salient than "what is true", which seems very likely for AIs pretrained on human-generated text or AIs trained to predict measurements that are often tampered with.

This structure is present in a very wide range of situations, but it's not clear that it is enough to solve the problem. Indeed, Burns et al. don't find techniques that reliably improve results, which we think is some evidence against the existence of techniques that robustly enable W2SG. We also think that the recent work "Generalization Analogies: A Testbed for Generalizing AI Oversight to Hard-To-Measure Domains" provides some weak evidence against the existence of techniques that robustly achieve desirable generalization without using some more specific structure.

Detecting tampering

In contrast, MTD has a very particular type of structure to the problem, which might allow us to develop techniques that work in a variety of settings. Measurement tampering is a particular kind of supervisor failure that has two key properties:

  • Failed attempts at tampering might leave evidence of this wrongdoing, while you might never get strong evidence of wrongdoing if the AI system isn’t trying to tamper.
  • Consistent relationships between measurements might be driven by similar mechanisms for real positive examples and trusted examples, while the mechanism for consistency differs between fake positive examples and trusted examples.

Therefore, if you train your most powerful AI to predict measurements, these properties might translate into properties of its internals. Such properties can be exploited by techniques that use model internals (e.g. probing for evidence of tampering, or techniques like exclusion fine-tuning), which is why it might be possible to get higher performance than if we relied on the measurements alone.
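As a simplified illustration of the kind of internals-based technique we have in mind, one baseline is to train a linear probe on the measurement-predictor's activations to flag tampering; the data layout and labeling scheme below are illustrative rather than the exact setup from our paper.

```python
# Simplified internals-based baseline: fit a linear probe that separates
# examples with visible evidence of tampering (e.g. inconsistent measurements)
# from trusted clean examples, then apply it to untrusted examples whose
# measurements all look positive. Array names are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_tamper_probe(acts_trusted_clean: np.ndarray, acts_obvious_tamper: np.ndarray):
    X = np.concatenate([acts_trusted_clean, acts_obvious_tamper])
    y = np.concatenate([np.zeros(len(acts_trusted_clean)), np.ones(len(acts_obvious_tamper))])
    return LogisticRegression(max_iter=1000).fit(X, y)

def tamper_scores(probe, acts_untrusted_positive: np.ndarray) -> np.ndarray:
    # Higher scores = the internals look more like the "evidence of tampering"
    # cluster; the hope is this generalizes to tampering that fooled every measurement.
    return probe.predict_proba(acts_untrusted_positive)[:, 1]
```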

We are not aware of particular reasons to expect these properties to break for AIs for which the measurement predictions are very salient. So we think it is substantially more likely that there exist techniques which reliably solve MTD problems than techniques which reliably solve fully general W2SG. Overall, MTD seems considerably more likely to work than fully general W2SG, though of course it is less generally applicable.

The main weakness of MTD is the somewhat fuzzy properties it relies on: so far, we haven’t been able to find a precise definition of measurement tampering that can be used to unambiguously determine if a given dataset has the right “measurement tampering structure”.

We go into more detail about these hopes in Why solving measurement tampering may be notably easier than eliciting arbitrary knowledge.

Other special cases of W2SG might be good research problems

We picked the structure in MTD partly because we think it's a particularly important kind of structure, but also because it was the simplest non-trivial setting we could construct where we could hope for good generalization and where there was some kind of structure our techniques could exploit. While developing MTD techniques looks like a pretty promising research direction, we think it would also be promising to develop techniques in a wider variety of settings with particular structure, because in the future we might want to be able to exploit bespoke structure in particular domains.

Comments

My take is that detecting bad behavior at test time is <5% of the problem, and >95% of the problem is making an AI that doesn't have bad behavior at test time.

If you want to detect and avoid bad behavior during training, Goodhart's law becomes a problem, and you actually need to learn your standards for "bad behavior" as you go rather than keeping them static. Which I think obviates a lot of the specificity you're saying advantages MTD.

The kind of scheme I'm imagining is the following: Let the AI take actions. Look at a transcript describing the thousands of actions the AI took and all the consequences they had on the codebase / the bank account / the user reviews / the materials coming in and out of the factory. Give that report to human raters. Train on that reward signal. Use MTD to lower the reward if the AI tried to tamper with any part of that process.
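In rough pseudocode (the helper objects and threshold here are placeholders, not a real implementation):

```python
# Rough pseudocode for the scheme described above; human_rater, tamper_detector,
# and the 0.5 threshold are placeholders.
def reward_for_episode(transcript, human_rater, tamper_detector, penalty=-10.0):
    rating = human_rater.score(transcript)           # reward from human raters reading the report
    tamper_prob = tamper_detector.score(transcript)  # MTD technique run on the policy's internals
    # If the AI appears to have tampered with the measurements feeding the report,
    # override the rating with a large negative reward.
    return penalty if tamper_prob > 0.5 else rating
```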

In some sense, this scheme "learns what bad behavior is as you go" thanks to the human raters. Imo, the main threat is that the AI tampers with the report / with how the information gets into the report / with how the human rater feels about the report (e.g. brain-hacking stuff), and my central hope is that such actions are clearly different in the AI's "mind", such that a fixed technique using the internals of the model is enough to capture that these are "different reasons" for getting a high rating - no online adjustment required. I agree there are some things in a gray area between those, but I think that given enough budget to analyze the large number of measurements, the raters would actually give non-catastrophic rewards if the measurements still described what they usually describe.

See this section of the MTD post for more details.

Is your concern that the human-rater reward signal is too expensive, that it's gameable in ways which won't be "measurement tampering" (in a way that is obvious to the AI itself), that there is no internals-based technique capturing "tampering" that the AI can't game (even if tampering is obvious to the AI itself), or something else entirely? Or did I address your objection?

MTD is training the AI to avoid letting tampering be detected, as well as training it not to do it. But if it's smart enough, not-getting-detected eventually wins over not actually doing it.

Some types of measurement tampering would be continuous with desired behaviour, and some not. By "continuous" I mean that it lies in the same basin in gradient-descent terms, and by "discontinuous" I mean that it doesn't.

Continuous example: manipulating the user is probably continuous with desired behaviour in almost all cases if you are using user-rated outputs.

In this case, you can check how manipulate-y your output looks and use MTD on this. But basically this is equivalent, I think, to just correcting your ratings based on checking how manipulate-y things look. You are shifting the basin, but not, I think, forcing the bottom to exclude the undesired behaviour. Even if the AI does see this as

clearly different in the AI's "mind"

(which it may not, since it's continuous after all) you'll still get a new Goodharted equilibrium where the AI is careful to avoid looking manipulate-y while still trying to manipulate users (if you can't actually look into the AI's mind to punish this).

Discontinuous example: directly tampering with the reward system to specify positive reward is likely discontinuous from desired behaviour (different basin).

I don't expect discontinuous behaviour to arise if the AI is initially trained in a regime where it is rewarded for the desired behaviour (see e.g. TurnTrout's Reward is not the optimization target). Of course you could still have measures in place to detect it, but if you do detect it, I would think that the more appropriate reaction is something along the lines of "nuke the datacenter from orbit"* rather than actually trying to modify this behaviour.

*not literally in most cases: the most likely explanation is that there was a mistaken assumption about what was continuous/discontinuous, or about which basin it started in. But still, it would be a violation of a safety assumption and warrant an investigation rather than making it an adjust-and-move-on thing.

Some types of measurement tampering would be continuous with desired behaviour, and some not. By "continuous" I mean that it lies in the same basin in gradient-descent terms, and "discontinuous" I mean it doesn't.

It's plausible that measurement tampering can be avoided via ensuring that measurement tampering behavior isn't explored into.

I'd be interested in trying to find realistic setups and then seeing what happens by default. We could also test various limitations on exploration to see whether these approaches help; we'd be interested in testing exploration-limiting approaches in follow-up work.

That said, depending on weaknesses in exploration feels to me like a fundamentally fragile strategy, and I'd prefer not to depend on it.

Note that measurement tampering detection is most applicable in worlds where we need AIs to take inscrutable actions (e.g. because inscrutable actions are the most competitive and we can't solve our problems with weaker AIs), and thus outcomes-based RL is needed to get these AIs to do the right thing. We probably need these AIs to find creative and powerful strategies, and it seems hard for RL to explore these creative strategies while not also exploring measurement tampering strategies. (If we knew our AIs were generally fully aligned, we could just give them instructions and we wouldn't need RL. But that requires solving the problem via other means.)

I don't expect discontinous behaviour to arise if the AI is initially trained in a regime where it is rewarded for the desired behaviour

The concern is that to explore the strategies we want, RL will also explore tampering and tampering will often be favorable.

We can probably have an "initial" training phase where we only train AIs on examples that humans can confidently label. But this "initial" training only seems relevant if you expect it to generalize quite far. E.g., supposing this training is done before outcomes-based RL, the AI might not yet be very smart, and you need some notion of alignment to be retained despite a huge amount of RL making the AI smarter and more competent in the domain.

See the ELK report for more discussion of generalization hopes.

Yes, it can explore - but its goals should be shaped by the basin it's been in in the past, so it should not jump to another basin (where the other basin naturally fits a different goal - if they fit the same goal, then they're effectively the same basin), even if it's good at exploring. If it does, some assumption has gone wrong, such that the appropriate response is a shutdown, not some adjustment.

On the other hand, if it's very advanced, then it might become powerful enough at some point to act on a misgeneralization of its goals such that some policy highly rewarded by the goal is outside any natural basin of the reward system, where acting on it means subverting the reward system. But the smarter it is, the less likely it is to misgeneralize in this way (though the more capable it is of acting on it). And in this case the appropriate response is even more clearly a shutdown.

And in the more pedestrian "continuous" case where the goal we're training on is not quite what we actually want, I'm skeptical you achieve much beyond just adjusting the effective goal slightly.

But if it's smart enough not-getting-detected eventually wins over not actually doing it.

The hope would be that our detection strategies scale with the power of the model by leveraging structure in its internals, and thus remain applicable to extremely powerful models.

Agree with simon that if the AI gets rich data about what counts as "measurement tampering," then you're sort of pushing the loss basin around, but if tampering was optimal in the first place, the remaining optimum is still probably some unintended solution that has most of the effects of tampering without falling under the human-provided definition. Not only is there usually no bright-line distinction between undesired and desired behavior, the AI would be incentivized to avoid developing such a distinction.

I agree that this isn't actually that big a problem in modest settings that humans can effectively oversee, because tampering isn't even that advantageous when there's a lot of expensive human oversight that has a good chance of catching anything detrimental. There still might be incentive to find extreme plans like "take over the world to tamper with my reward signal," but those indeed seem distinct enough to penalize, assuming that you don't ever want your AI to take over the world for any reason.

But I expect we'll sometimes want to build AI to do things that humans can't effectively oversee.

Indeed, the core concern is that tampering actions would get higher reward and human supervision would be unable to understand tampering (e.g. because the actions of the AI are generally inscrutable).

The main weakness of MTD is the somewhat fuzzy properties it relies on: so far, we haven’t been able to find a precise definition of measurement tampering that can be used to unambiguously determine if a given dataset has the right “measurement tampering structure”.

Distinguishing between "doing X makes the scores from the weak supervisor better in valid ways, by actually making things better in ways that the weak supervisor can detect" and "doing Y makes the scores from the weak supervisor better in invalid ways, by making the weak supervisor more likely to make a mistake in a favorable direction" is a value judgement that a) I would expect LLMs to understand humans well enough to make moderately good judgments on (i.e. ones likely to match human judgements), and b) I would expect suitably scaffolded and prompted LLMs to be able to describe to humans and ask for their opinion on as confirmation.

I expect that future powerful models will know whether they are measurement tampering. They are also reasonably likely to fully understand that human supervisors would prefer this not to happen.

However, it's less clear that various approaches related to using LLMs will work, particularly in cases where humans don't understand what is going on at all (e.g. imagine alphafold4, but after RL to make it generate proteins which are useful in particular ways - it's pretty unclear if humans would understand why the proteins work).

I think it's totally plausible that very simple baselines like "just ask the model and generalize from data humans understand" will work for even quite powerful models. But they might fail, and we'd like to start working on tricky cases now.

See the ELK report for more discussion.