AI Safety in a World of Vulnerable Machine Learning Systems

AdamGleave; EuanMcLean

Even the most advanced contemporary machine learning systems are vulnerable to adversarial attack. The safety community has often assumed adversarial robustness to be a problem that will be solved naturally as machine learning (ML) systems grow more capable and general. However, recent work has shown that superhuman systems in a narrow domain such as AlphaZero are highly vulnerable to adversarial attack, as are general but less capable systems like large language models. This raises the possibility that adversarial (worst-case) robustness will continue to lag behind average-case capabilities. In other words, transformative AI systems are likely to be exploitable.

Exploitability will cause a wide variety of current alignment proposals to fail. Most extant agendas seek to align the main ML system with the assistance of helper ML systems. The main ML system is the primary system that takes actions in the world (e.g. interacting with users), with the helper ML systems acting as scaffolding to train and/or verify the main ML system. These alignment schemes will fail if the helpers are exploited by the main system – and we expect helpers to be vulnerable to exploitation (see Contemporary ML systems are exploitable by default).

In Table 1 we present a subjective risk matrix for a range of popular alignment agendas, evaluating the degree to which main ML systems have the ability and incentive to exploit the helper. We find many alignment agendas have a high risk of exploitation, with all having at least some risk.

Alignment Agenda	Main System’s Ability to Exploit Helper	Main System’s Incentive to Exploit Helper	Risk of Exploit
RL on learned reward model (e.g. RLHF, IRL)	Medium	High	High
Scalable oversight (e.g. recursive reward modeling, AI safety via debate)	Medium	High	High
Imitation learning (e.g. behavioral cloning, supervised fine-tuning)	Medium	Low	Low-Medium
Imitative Iterated Distillation and Amplification	High	Low	Medium
Auditing Tool (e.g. Adversarial Testing, Transparency)	Low	Medium	Low-Medium

Table 1: Subjective risk matrix for popular alignment agendas (see next section), using a helper ML system to assist with aligning the main ML system that will eventually be deployed.

We are most concerned by vulnerabilities in the helpers as this can impact the alignment of the main system. By contrast, an aligned but adversarially exploitable main system would not necessarily pose a danger, especially if the main system can recursively self-improve to fix itself. However, there is a possibility that even superintelligent systems cannot attain adversarial robustness. This would be a volatile situation, which could conceivably collapse into chaos (systems frequently exploiting each other), an implicit equilibrium (e.g. mutually assured destruction), or an explicit agreement (e.g. all AI systems self-modify to commit to not exploiting one another).

We see two possible approaches to fixing this: improving adversarial robustness, or developing fault tolerant alignment methods that can work even in the presence of vulnerable ML systems. We are most excited by fault tolerant alignment, as it is highly neglected and plausibly tractable, although further work is needed to solidify this approach. By contrast, adversarial robustness is an area that has received significant attention from the ML research community (low neglectedness)^[1]but with only modest progress (low to medium tractability).

In the remainder of this document, we will argue that systems are exploitable by default, explore the implications this has for alignment agendas in several different scenarios, and outline several research directions we are excited by.

Alignment agendas need robustness

Most alignment schemes implicitly depend on some aligned helper ML systems that can assist us in training or validating a main ML system, like a proto-AGI. In this section we assume the helper is exploitable, and explore the problems this causes for these alignment schemes. We justify this assumption this assumption in the next section.

A helper is more likely to be exploited if the main system is incentivized to exploit it, such as by the helper directly providing a training signal the main system is being optimized for, or if the main system has gained situational awareness. Conversely, it is less likely to happen if the optimization process is pursuing a goal that is largely independent from the helper’s output, such as if the helper is an interpretability tool checking for the presence of mesa-optimization. In the best case, we might even be able to (as in the case of IDA) control the optimization pressure to explicitly avoid inputs that might be adversarial.

The likelihood of an exploit also depends on the ability of the main system. This includes the computational resources available to develop an exploit, which could range from just forward-passes of the main system’s network to the main system having access to a large compute cluster. It also depends on how much information the main system has about the helper, which could range from merely a high-level knowledge of possible evaluation techniques (black box), through to query access to the helper (gray box), through to full access to its weights and source code (white box). We can’t rely on security by obscurity: many zero-shot black-box adversarial attacks have been conducted by human researchers, and superintelligent ML systems might be even more creative. Still, even superintelligent ML systems are not omnipotent, and in practice the “adversary” might be a relatively stupid optimization process like SGD.

Referring back to Table 1, one class of alignment agendas that is particularly sensitive to lack of robustness is those where the helper is an overseer providing feedback to the main system, such as in RLHF and most scalable oversight proposals. In this case, exploiting the helper is part of the training objective of the main system, explaining the ubiquity of reward hacking.

Corrective feedback from a human overseer can help mitigate reward hacking when the reward model is trained online, but there remains a significant risk that the main system hacks the helper more rapidly than a human is able to provide corrective feedback. Alternatively, the main system may simply learn to trick the human. Scalable oversight methods are designed to guard against tricking the human, but cannot protect against scenarios that are adversarial to both the human and scalable oversight model.

Imitation learning methods seem less risky, as the optimization pressure is simply to match the empirical distribution of a demonstration dataset. The closest to “reward hacking” in this setting would be overfitting to the dataset, a relatively benign failure mode. There is still some risk of inner optimization objectives arising, which could then be adversarial to other systems (e.g. attempt to hide themselves from transparency tools), but comparatively speaking this is one of the methods with the lowest risk of adversarial failure. Unfortunately, it also maxes out at around human performance.

Imitative Iterated Distillation and Amplification (IDA) is less sensitive to exploitability. IDA is a proposal to take an imitation learned model and amplify it beyond human-level performance. Casting it in our framework, the helper is a base imitation-learned model that we presume is safe and aligned. The main system is the model produced by IDA on the helper. At the first step, the helper is amplified by giving it the ability to ask questions to copies of itself. Critically, the helper has an incentive to avoid asking questions that are likely to cause it to fail. In this sense, IDA channels the optimization pressure away from adversarial exploitation, unlike e.g. vanilla RL.

The flipside is that the helper also likely has a high ability to exploit itself owing to this introspective knowledge of its own vulnerabilities. ^[2] This could result in some instability in the training scheme: if at any point a sub-part of the question tree is adversarial, the response could propagate to “infect” the rest of the question tree. This is related to the problem of security amplification.

One of the less concerning cases is for tools used only to evaluate, and not train a model, as is often the case for adversarial testing or transparency. Here, the main system has no direct incentive to fool the helper as it does not provide a training signal. However, in practical applications some optimization pressure against the helper will still leak in, for example if the helper is used to decide whether to abort a training run or change its hyperparameters. This would result in problems similar to those observed today with researchers overfitting to the test set.

Even if we only used these helper methods as a one-off final check on the main system, the main system might still be incentivised to fool these evaluation methods if it has situational awareness and long-term objectives. However, it will be a challenging exploit to pull off, as the main system may have only partial knowledge of the design of the helper (e.g. what tests the helper will perform) and cannot engage in direct trial-and-error exploitation of it. Additionally, transparency tools are typically applied to the weights of the main system, which the main system can at best only indirectly control (by gradient hacking). ^[3] Unfortunately, evaluation methods are not themselves solutions to alignment, but can provide an additional safeguard to avoid deployment of malign systems.

Contemporary ML systems are exploitable by default

Our best guess is that all contemporary ML systems solving non-trivial tasks are exploitable by a moderately resourced adversary. ℓ_p-norm adversarial examples in image classifiers were first described by Szegedy et al. in 2013, and nearly a decade later state-of-the-art image classifiers remain vulnerable despite intense research interest in adversarial defenses. These vulnerabilities can be exploited in real-world settings by physical adversarial attacks, and there are even naturally occurring images that are challenging for a wide variety of models. Moreover, analogous issues have been found in a diverse range of ML systems including language models, graph analysis, robotic policies and superhuman Go programs.

To the best of our knowledge, no ML system solving a non-trivial problem has ever withstood a well-resourced attack. ^[4] Adversarial defenses can be divided into those that are broken, and those that have not yet attracted concerted effort to break them. This should not be too surprising: the same could be said of most software systems in general.

One difference is that software security has notably improved over time. Although there almost certainly exist remote root exploits in most major operating systems, finding one is decidedly non-trivial, and is largely out of reach of most attackers. By contrast, exploiting ML systems is often alarmingly easy.

An apple with a label saying iPod that is misclassified as an iPod Figure 1: A typographic attack enables a no-code exploit of OpenAI Clip. More examples.

This is not to say we haven’t made progress. There has been an immense amount of work defending against ℓ_p-norm adversarial examples, and this has made attacks harder: requiring more sophisticated methods, or a larger ℓ_p-norm perturbation. For example, a state-of-the-art (SOTA) method DensePure achieves 77.8% certified accuracy on ImageNet for perturbations up to 0.5/255 ℓ₂-norm. However, this accuracy is still far behind the SOTA for clean images, which currently stands at 91.0% top-1 accuracy with CoCa. Moreover, the certified accuracy of DensePure drops to 54.6% at a 1.5/255 ℓ₂-norm perturbation – which is visually imperceptible to humans. This is well below the 62% achieved by AlexNet back in 2012.

There is substantial evidence for a trade-off between accuracy and robustness. Tsipras et al (2019) demonstrate this trade-off theoretically in a simplified setting. Moreover, there is ample empirical evidence for this. For example, DensePure was SOTA in 2022 for certified accuracy on adversarial inputs but achieved only 84% accuracy on clean images. By contrast, non-robust models achieved this accuracy 4 years earlier such as AmoebaNetA in 2018. There appears to therefore be a significant “robustness tax” to pay, analogous to the alignment tax.^[5]

In addition to certified methods such as DensePure, there are also a variety of defense methods that provide empirical protection against adversarial attack but without provable guarantees. However, the protection they provide is partial at best. For example, a SOTA method DiffPure achieves 74% accuracy on clean images in ImageNet but only 43% accuracy under a 4/255 ℓ_∞-norm perturbation. There is also a significant robustness tax here: Table 5 from the DiffPure paper shows that accuracy on clean images drops from 99.43% on CelebA-HQ to 94% with the diffusion defense.

To make matters worse, real attackers have a much broader range of possible attacks outlined by Gilmers et al (2018), such as rotating images, perturbing physical parameters in rendered images, adversarially selecting images from a real-world dataset, adversarial patches, single-pixel attacks and latent adversarial perturbations. We would like to be robust to all these attacks, but there appears to be fundamental trade-offs between robustness to different attacks, with Tramer et al (2019) showing such a trade-off between different types of ℓ_p-bounded and spatial perturbations. Moreover, there are currently no effective methods to defend against unrestricted adversarial examples outside of toy settings.

Although the ubiquitous presence of adversarial examples in contemporary ML systems is concerning, there is one glimmer of hope. Perhaps these adversarial examples are merely an artifact of the ML systems being insufficiently capable? Once the system reaches or surpasses human-level performance, we might hope it would have learned a set of representations at least as good as that of a human, and be no more vulnerable to adversarial attack than we are.

Unfortunately, recent work casts doubt on this. In Wang et al (2022), we find adversarial policies that beat KataGo, a superhuman Go program. We trained our adversarial policy with less than 14% of the compute that KataGo was trained with, but wins against a superhuman version of KataGo 97% of the time. This is not specific to KataGo: our exploit transfers to ELF OpenGo and Leela Zero, and in concurrent work from DeepMind Timbers et al (2022) were able to exploit an in-house replica of AlphaZero.

Of course, results in Go may not generalize to other settings, but we chose to study Go because we expected the systems to be unusually hard to exploit. In particular, since Go is a zero-sum game, being robust to adversaries is the key design objective, rather than merely one desiderata amongst many. Additionally, KataGo and AlphaZero use Monte-Carlo Tree Search coupled with a neural network evaluation. In general, we would expect search (which is provably optimal in the limit) to be harder to exploit than neural networks alone, and although search does make the system harder to exploit we are able to attack it even up to 10 million visits – far in excess of the threshold needed for superhuman performance, and well above the level used in most games.

There remains a possibility that although narrowly superhuman systems are vulnerable, more general systems might be robust. Large language models are the most general systems we have today, yet work by Ziegler et al (2022) find they are still exploitable even after significant adversarial training. Moreover, the existence of apparently fundamental tradeoffs between accuracy and robustness suggests that the most capable AI systems at any given time may be particularly likely to be vulnerable (Tsipras et al, 2019; Tramer et al, 2019).

Of course, at some point systems might be developed that are adversarially robust. This could be by “overshooting” on capability and generality, and then paying a robustness tax to get a suitably capable or general but robust system. Alternatively, new techniques might be developed that reduce or eliminate the robustness tax. Most optimistically, it is possible that general, human-level systems are naturally robust even though generality or human-level performance on their own are insufficient. In the next section, we will consider different possibilities for when and if adversarially robust systems are developed, and the implications this has for safety.

Future trajectories for robustness

We will consider three possible cases:

We solve adversarial robustness before transformative AI is developed;
We solve it after transformative AI is developed;
It is never solved.

Although coarse-grained, we believe this case split captures the most important distinctions.

For the purpose of this section, we will consider adversarial robustness to be solved if systems cannot be practically exploited to cause catastrophic outcomes. This is intended to be a low bar. In particular, this definition tolerates bounded errors. For example, we would tolerate threat actors being able to permanently trick AI systems into giving them 10% more resources in a trade. We’d also tolerate threat actors being able to temporarily divert even the majority of the AI’s resources, so long as this did not lead to permanent negative effects and that attackers eventually run out of such exploits.

We summarize our subjective credence in each of the cases below, and explore the cases qualitatively in the following sections.

Before you read further, consider estimating your credence for the following question, corresponding to Case 1:

Or the following question, corresponding to Case 1 or 2 being true:

Case 1: Adversarial robustness is solved before transformative AI is developed

Likelihood

There are two main sources of hope for this outcome. First, there is always a chance of an algorithmic insight that significantly improves robustness. Although we would expect the low-hanging fruit here to have already been plucked, insights are hard to predict, so we should not rule out the possibility of a breakthrough in the near-term. Second, there is the possibility of continued gradual progress in adversarial robustness in tandem with capabilities.

We’ve argued above that capabilities do not guarantee robustness and observed trade offs between capability and robustness. However, capabilities often do improve robustness. For example, pre-training improves the adversarial robustness of image classifiers.

One of the main reasons current ML systems are vulnerable is due to their incompetence. Even a superhuman system like KataGo is known to struggle to evaluate positions that require careful counting. It is perhaps less remarkable then that machine learning systems are vulnerable, and more remarkable that despite their often glaring flaws they are able to beat humans at many competitive games and achieve human-level accuracy at a variety of tasks.

In other words, perhaps the takeaway from the ubiquity of adversarial robustness should be that contemporary ML systems are only providing a facsimile of intelligence. In this view, transformative AI is likely still many decades out. Once transformative AI is developed and the systems are truly intelligent, many of these problems will naturally go away.

This seems plausible to us prima facie, but we’re deeply uncertain. Perhaps humans too have adversarial examples, they are just harder to find due to the expense of human experiments, and the (current) black-box nature of the human brain. We expect those with longer timelines to transformative AI to be more convinced by this argument. For those with short timelines, it seems unlikely that we will solve adversarial robustness in the next decade given the limited progress of the past decade.

Implications

In many ways, this is the best case scenario. In the rosiest outlook, we could solve adversarial robustness in the next few years by some flash of insight, well before transformative AI is developed. We should all do what we can to bring this about, although we are not currently aware of many neglected and tractable research directions that could achieve this.

The situation is more challenging if adversarial robustness is solved only gradually, or just before transformative AI. In that situation we are likely to encounter many of the problems discussed in the following section. However, the problems will mostly occur in settings where human feedback is still reliable, and scalable oversight is not required. This gives us a chance of detecting exploitation by manual inspection. However, this could still be challenging if the attack is subtle, or conducted rarely: e.g. it could be easy to overlook the main system obfuscating a single circuit in a large network. For that reason, it would still be beneficial to hasten the arrival of adversarial robustness, but it may not be essential for success.

Case 2: Adversarial robustness is solved after transformative AI is developed

Likelihood

The adversarial machine learning research community has spent almost a decade attempting to solve adversarial robustness, with limited progress. And we’ve seen that even adversarially trained superhuman systems (KataGo) and general but sub-human systems (language models) are exploitable. It’s not impossible the problem will be solved by default: perhaps narrowly superhuman systems with a bit more generality will start to naturally learn more robust and human-like representations. But absent new algorithmic insights into robustness, this problem seems likely to persist even into transformative AI systems.

However, a decade of research effort by the current ML research community is still small compared to the amount of resources that are likely to be brought to bear on the problem once transformative AI is developed. First, the economic incentive to resolve the issue will strengthen as profitable (but vulnerable) AI systems are deployed. Second, more advanced AI systems may partially automate ML research and development (R&D) leading to lower R&D costs for adversarial robustness. Consequently, the development of transformative AI might itself precipitate a solution to adversarial robustness.

Economic and political incentives. For the most part people are not currently losing large sums of money due to AI vulnerabilities. However, after transformative AI is developed, a large fraction of world GDP will depend on (vulnerable) AI systems. At this point, improving adversarial robustness could easily attract resources comparable to that of all information security spending today, or even rivaling that of a nation's defense budgets. This would be orders of magnitude more funding than is currently directed towards adversarial ML research.

Lower R&D costs. One of the more likely paths to transformative AI involves systems that are able to automate parts of science research and development (R&D). This is likely to lower the cost of AI research, enabling more (and potentially higher quality) adversarial robustness research.

Offense-Defense Balance. Developing transformative AI will certainly help improve adversarial robustness: but it will also lead to advances in attackers capabilities. Attackers will have a greater economic incentive to exploit widely deployed AI systems, and be able to leverage automated R&D systems to improve their attacks. However, it is possible that transformative AI will lead to a phase shift that favors defenders. In particular, defenders are more likely to prevail if there exist technical solutions to adversarial robustness that, while hard to find, once discovered are extremely difficult or impossible to exploit.

The history of cryptography followed a similar path: initial methods were consistently broken, but the latest methods have withstood concerted cryptanalysis for decades. Early ciphers date back thousands of years, but mathematical methods only began to proliferate in the 20th century. For many decades, there were no ciphers that could withstand significant cryptanalysis. In World War II, both the Allies and Germany were able to crack many of each other’s ciphers. More modern cryptography evolved soon after that, but early attempts continued to be dogged with issues. The Data Encryption Standard (DES) introduced in 1975 was broken in 1999, largely due to its small key size. The Advanced Encryption Standard (AES) adopted in 2001 has been widely scrutinized since then and is still currently believed to be practically secure. So although for many decades cryptographers and cryptanalysts were evenly matched, eventually cryptographic advances swung the offense-defense balance in favor of the defenders.

Implications

The good news in this scenario is that sufficiently advanced AI systems will not be exploitable. The bad news is that the precursors of transformative AI systems will still be vulnerable. This poses two problems. First, we must protect proto-transformative AI systems during their vulnerable period, analogous to their “childhood”. Second, we must protect helper AI systems we use to help align transformative AI systems from being exploited by the transformative AI system itself.

Protecting proto-transformative AI systems from outside bad actors seems challenging but solvable. An extreme option is to simply train the AI system on an airgapped network, although this might pose some practical challenges if the AI system required interaction with the broader world to learn. Less extreme options involve training the AI system in a sandbox with a restricted list of users.

The second issue of helper AI systems we use for alignment being exploitable is thornier, and consequently is the central focus of this agenda. Here the threat is internal, not external. Airgapping won’t work, because the helper AI systems need to be able to interact with the transformative AI in order to inspect it and shape its training process. However, restricting information flow (especially what the transformative AI knows about the helpers) could help.

Let’s consider one example in depth, where the helper is an overseer that is providing a training signal to the main system. The overseer stands in for a human judge. In the simplest case, it might be a reward model trained on human feedback, able to judge situations more quickly and cheaply than a human. In more complex cases, it might be trained by any of a range of scalable oversight methods (e.g. recursive reward modeling or AI safety via debate).

Suppose we are training the system that will be the first transformative AI system built. Initially, neither the main system nor overseer will be transformative AI systems, and so by assumption of this case both will be vulnerable. Critically, since the main system is trained to optimize the overseer’s output, there is an inherently adversarial relationship between the two.

Even though the main system may be initially unsophisticated, the evidence from contemporary ML suggests it will have no issue exploiting the overseer. Many algorithms to generate adversarial examples are very simple: much simpler than the networks they exploit.^[6] In our attack against KataGo, a randomly initialized adversary wins decisively after as little as 0.3% (against unhardened KataGo) and 6% (against hardened KataGo) as much training time as the victim agent.

One major advantage an overseer has is that it is typically trained online. If the main system starts to exploit it, the overseer has a chance to correct itself. This seems relatively easy so long as humans are not fooled by the exploit, and can provide corrective feedback in time. However, the overseer may still be vulnerable. For example, work on reward modeling in language models often involves online or iterated offline training, yet these reward models will still assign erroneously high output to gibberish text, ^[7] necessitating KL regularization during training.

Moreover, humans are hardly infallible: we have already seen examples of training processes learning to fool humans such as by appearing to but not actually grasping an object. Indeed, the limited nature of human feedback is the primary motivator for scalable oversight approaches. If the human trainer can be exploited by the main system then we must rely on the oversight process itself being able to detect and defend against the attack.

Notably, there is an analogy between AlphaZero-style training and iterative distillation and amplification. AlphaZero uses Monte-Carlo Tree Search (MCTS) as a policy improvement operator that amplifies a network via search. AlphaZero-style training then distills the amplified network back into the original network.

We might hope that the MCTS “amplification” step would make the network more robust. Empirically, the results are mixed. In our attack against KataGo, our first “pass-based adversary” that achieves a 99% win rate against KataGo without search is defeated even by searching only 32 nodes. However, our latest “cyclic adversary” is able to win 76% of games against KataGo searching 10 million nodes. By contrast, KataGo searching 2048 nodes is already strongly superhuman. MCTS helped KataGo but ultimately did not save it from adversarial attack.

Although search does improve robustness, when the underlying network has a sufficiently deeply rooted vulnerability it can be insufficient to protect against it at all practically achievable levels of search. Of course, amplification methods might be more sophisticated than MCTS. However, they will also need to operate in more complex environments with much larger branching factors and longer time horizons than in Go.

While an exploitable overseer is the most natural and concerning setting, versions of this problem also arise for other kinds of helper agents. For example, perhaps you are more concerned about inner than outer alignment, and are envisaging a helper agent conducting adversarial testing or mechanistic interpretability to detect inner misalignment in some main system system. Here, the main system system would have an incentive to fool the helper agent into falsely declaring the main system as inner aligned.

The risks vary depending on how the helper is being used. If deployed simply as a check at the end of training, then the incentive for the main system is indirect, as it is not receiving a training signal for it. However, if the mesa-optimizer has situational awareness then it would likely strongly suspect it will be subject to some degree of automated checks by other AI systems. Even if the exact nature of these checks is unknown, this may not prevent it from finding exploits as adversarial examples transfer fairly well. For example, our attack against KataGo transferred moderately well zero-shot: we achieve around a 3% win rate against Leela Zero and ELF OpenGo, which while low is better than human professionals.

In practice, labs will be reluctant to abandon what is likely an extremely expensive AI system just because it fails a single check. Instead, they are likely to want to tweak its training process, and try again. The risk in this iterated case is significantly higher as the main system now has multiple shots at finding an adversarial example. For example, a Go player on our team was able to adapt the adversarial strategy after just a few games of trial and error to frequently win against Leela Zero. AI systems may adapt in an analogous way.

The risk is highest when the main system receives a training signal incentivizing it to fool the helper agent. For example, the flip side of adversarial testing is adversarial training: pitting a main system against a helper that generates adversarial examples. Papernot et al, 2017 (Table 4) found that adversarial training against examples with small perturbations produced networks that exhibit gradient masking. They’re robust in a tiny region around inputs, defeating typical adversarial attack methods, but are still highly vulnerable to adversarial examples transferred from another model. In other words, adversarial training can result in a main system that looks good to the helper, but is not any more aligned (in this case, no more robust).

Similarly, one might wish to use mechanistic interpretability as part of human-in-the-loop training to produce models that learn human-like representations. A potential failure mode here is the model actually learning representations that appear human-like to the interpretability method, but with the model’s actual output being largely determined by some undetected steganographic encoding.

Despite these risks, there is some reason for optimism for helper agents such as interpretability methods that work by inspecting the main system’s weights. Even if the main system has an incentive to trick the helper, it at least has no direct control over its weights. However, the main system can often influence its weights indirectly (see gradient hacking) so long as it has some control over the data distribution it is trained on (common in reinforcement learning, for example). Moreover, when the helper provides a training signal, the optimization process producing the main system will “try” to exploit the main system even if the main system agent itself does not. Investigating the feasibility of attacks against helpers such as interpretability tools could be an interesting direction for future work.

Case 3: Adversarial robustness is never solved

Likelihood

In this scenario, arbitrarily capable systems will remain vulnerable to attack. This seems prima facie plausible since improvements in machine learning tend to improve both defense and attack methods. So, although it would be surprising if the ML systems of 2040 were vulnerable to 2020-era attacks, it seems very plausible that they might be vulnerable to new, more sophisticated attacks developed in the interim. A fixed offense-defense balance seems like a natural outside view.

Additionally, it is clear that human-level intelligence is not sufficient to be robust. Humans are themselves vulnerable to at times extreme manipulation by other humans. For example, some cult members have lost their possessions, liberty and sometimes even their lives. This is despite humans evolving in a partially adversarial environment, and so being well attuned to detecting social manipulation.

However, we can make systems perfectly robust in sufficiently narrow situations. For example, the minimax algorithm is provably optimal, it is just too computationally expensive to be used in most games. Still, it really is possible to make an unexploitable tic-tac-toe agent, and although challenging we would not be surprised if the ML community could make an agent that has bounded exploitability even in challenging games like poker.

Achieving bounded exploitability in more open-ended settings seems more challenging. However, it may be sufficient to detect exploitation and learn to defend against it online. Although not a panacea (a system learning online gives the attacker influence over the system’s training data, potentially opening it up to a data poisoning attack), this does seem like an easier problem, and is a threat model that has been relatively under-explored.

Implications

This case seems the most challenging for safety. All the challenges discussed in case 2 still apply and, in particular, the main system is likely to exploit helper agents we use to help align it. However, for the sake of exploring a different threat model, suppose that we have, somehow, produced an aligned AI system that is highly capable and genuinely has our best interests at heart. Unfortunately, the AI system having good intentions is not sufficient if the AI system can be tricked into performing acts against our interests.

Concretely, a highly capable AI system is likely to be an attractive target for well-resourced human threat actors like nation states. These threat actors may have their own AI systems to help automate the attack. Alternatively, perhaps a misaligned AI system has already been deployed, and is now itself a threat actor.

Without the ability to achieve technical protection against attack, actors are likely to seek other ways of defending themselves. For example, mutually assured destruction (MAD) equilibria could emerge, similar to in information security today. Even relatively amateurish ransomware attacks can be extremely disruptive; capable nation states could likely launch much more sophisticated attacks. But if they were discovered to be responsible, targeted nation states could respond either with their own cyber warfare or other soft power, or even with conventional military force. We might then expect threat actors to limit themselves primarily to espionage, which is less noticeable and so less likely to trigger a response, or targeted attacks seeking a narrow goal like Stuxnet.

Unfortunately, MAD equilibria are unstable, running the risk of actual mutual destruction. This is particularly risky in information security where attribution is notoriously difficult and where the barrier to entry is low. By contrast, in nuclear policy there are a small and well-defined set of possible threat actors (other nation states armed with nuclear weapons) and attribution is usually possible by detecting the launch site of missiles.

Since most AI systems and their principals would stand to lose from a conflict, there is an incentive for AI systems to come to an agreement to prevent this possibility. This is analogous to arms control pacts. Conceivably, AI systems might be able to improve on this, by self-modifying to be provably incapable of attacking other AI systems that have signed up to this agreement, although verifying that they actually self-modified might be difficult. Work on cooperative AI agendas might help with this, but may not be necessary, as sufficiently capable AI systems might be able to perform their own research on cooperative AI.

An alternative possible equilibrium is for one AI system to gain a sufficiently decisive lead that it is able to defend itself against the extant, less capable, threat actors. Such a concentration of power would pose its own risks, but might be a preferable alternative to constant conflict between AI systems. If the risk of conflict could be foreseen, it is conceivable even that different actors with the capability of producing advanced AI systems might agree to band together, producing a single AI system which would nonetheless seek to balance the desires of the group that created it. Such an event would be unprecedented, but not uncontemplated: the Baruch Plan proposed giving the United Nations a permanent monopoly over nuclear technology, with the ability to impose sanctions even on members of the permanent security council.

The outlook looks bad if neither a MAD or unipolar equilibria are attained. Conflict in general tends to be highly destructive and negative-sum. However, it is possible that conflict between AI systems could be closer to zero-sum wealth transfers and so less destructive of value than conventional military action, which might lead to a lower-than-expected cost.

Future research directions

We see three directions that are promising:

Better understanding the problem, such as investigating how general adversarial failure modes are and finding scaling laws for robustness;
Developing algorithmic improvements for adversarial robustness such as new training procedures or data augmentation;
Developing fault tolerant alignment techniques that function even in the presence of the vulnerable ML systems.

Understanding the problem

Although adversarial robustness is a well-studied area, there has been comparatively little work focusing on the settings most relevant to alignment: highly capable, general systems under realistic threat models. Consequently, there is low-hanging fruit to better understanding the nature of the problem, both for primary research and collating the relevant results that do already exist in the literature.

One promising direction is to develop scaling laws for robustness. Scaling laws for metrics of capabilities are well-established in domains including language models, generative image and video modeling and zero-sum board games. Determining analogous scaling laws for adversarial robustness would be greatly informative.

If the slope of the robustness scaling law is shallower than that of capabilities, we would expect the gap between capabilities and robustness to widen over time – a concerning outcome. By contrast, if the slope of the robustness scaling law is comparable to that of capabilities, then the gap might stay constant over time – suggesting the offense-defense balance will remain fixed. Finally, if the slope of the robustness scaling law is steeper than that of capabilities, we might expect there to be substantial gains in the future that close the gap.

An exploration into scaling laws could make use of data already developed elsewhere. For example, there already exist timeseries of the state-of-the-art accuracy of image classifiers in ImageNet and other benchmarks. There also exist some parallel time series for robust accuracy, such as RobustBench. Comparing these would give an initial indication of whether progress in adversarial accuracy is lagging behind, keeping pace with, or outstripping progress in clean accuracy.

There has already been some investigation of how model robustness varies with model size and dataset size. For example, Xie et al (2020; Figure 7) find that increasing the depth of a ResNet increases robust accuracy while having limited effect on clean accuracy. Carmon et al (2022; Figures 13 & 14) find that increasing the size of a labeled or unlabeled dataset improves robust accuracy, with Figure 13(a) in particular showing that robust accuracy benefits from increases in unlabeled data more than clean accuracy. However, to the best of our knowledge there are no quantitative scaling laws for robustness yet.

Most existing work in adversarial robustness has focused on image classification, which is a poor proxy for transformative AI, and ℓ_p-norm perturbations, a limited threat model. Consequently, we are particularly excited by further work probing vulnerabilities of narrowly superhuman systems under realistic threat models. We expect such investigation to be particularly informative for AI safety.

In particular, we are interested in investigating adversarial policies in superhuman game-playing systems outside of Go. For example, do vulnerabilities exist in Leela Chess Zero, an AlphaZero replica for chess? This would provide strong evidence that adversarial policies are a widely occurring phenomenon (at least for AlphaZero-style systems). We would expect chess systems to be more challenging to exploit than Go programs, as even search with hard-coded heuristics is sufficient for superhuman performance in chess. We would also be interested in trying to find adversarial policies in a broader range of games such as the Polygames to see how exploitability varies with factors like game complexity.

It would also be interesting to investigate systems trained with different algorithms, to rule out the possibility that the vulnerability is an artifact of AlphaZero-style training (like self-play). For example, DeepNash is a more principled method than self-play that has learned to play Stratego at a human expert level. Beyond board games, AlphaStar achieved expert-level performance in StarCraft and was trained using a population-based algorithm. Unfortunately, there are currently no open-source replications of these results, making it practically challenging to study these agents.

We could also seek to better understand existing adversarial attacks. There’s already been substantial work developing theories for why adversarial attacks persist, such as Adversarial Examples Are Not Bugs, They Are Features and Adversarial Spheres. But there are some notable gaps. For example, there’s been comparatively little work applying mechanistic interpretability to adversarial attacks to understand how the model fails. This could be both informative for adversarial robustness, and a useful test-case for interpretability.

Algorithmic improvements for adversarial robustness

Understanding the nature of the problem is important, but at some point we must take action to fix it. The most direct way is to develop algorithms or training procedures that improve adversarial robustness. Existing work that falls into this category includes adversarial defenses (such as certified defenses and adversarial training), more principled training procedures (such as policy-space response oracles instead of self-play) as well as trial-and-error improvements, such as discovering that pre-training can improve robustness.

Prior work in this area has certainly improved the state of adversarial robustness, and we hope it continues. However, it is a relatively crowded area and with uncertain tractability, so we would suggest people be selective as to what projects they work on in this domain. Directions that seem underexplored include: adversarial robustness under realistic threat models, like unrestricted adversarial examples; adversarial robustness focused on systems that seem good proxies for transformative AI, such as large language models or superhuman game-playing systems; and adversarial training to achieve limited guarantees (such as avoiding catastrophic outcomes).

Fault tolerant alignment

An alternative to making systems more robust is to make our alignment approach more tolerant to vulnerabilities in ML systems. To make an analogy to computer security, improving adversarial robustness is like fixing security vulnerabilities in programs, whereas fault tolerant alignment is like limiting the damage a vulnerability can cause by limiting the privileges of programs. Of course, they are not mutually exclusive: even if we aggressively apply the principle of least privilege, we still need some core, trusted component. However, this simplifies the problem from securing a large complex system to ensuring the small, core components are trustworthy.

Concretely, we should strive to isolate the main AI system and its helpers as much as possible from one another. For example, if the helper isn’t needed to generate the training signal (e.g. interpretability or adversarial testing tools) then the main system should not be able to query the helpers. Any interactions between the main and helper systems should happen in a sandboxed environment with the results not used for training the main sysem.

We could also take a defense in depth approach by including a number of independent helpers of different kinds. Then we can tolerate some of the helpers being exploited, provided at least one is not exploited and so provides us with a warning sign of the danger we are in. This is still non-trivial: adversarial examples often transfer, and the main system could explicitly blend adversarial examples targeted at different models, so we cannot simply trust the outputs of an ensemble model. But the more independent the tests (e.g. using different methods, such as adversarial testing and mechanistic interpretability) the harder it will be for a malevolent AI system to fool them.

Only limited isolation is possible when the helper is needed to generate a training signal for the main system (e.g. IDA or RRM). In these cases, we should strive to avoid adversarial pressure in the first place. This strategy is particularly attractive in the world where we solve adversarial robustness after transformative AI. In that case, the adversarial pressure mostly comes from within our system, and so in principle we might be able to avoid it.

One of the primary issues we identified was an optimization process adversarially exploiting an overseer. Fortunately, imitation learning objectives seem markedly less vulnerable to such reward hacking behavior. The downside is that imitation learning also effectively caps performance at the level of demonstrations. ^[8] However, if we recover the human policy we can use it as a starting point and choose how much additional optimization pressure to exert. This is effectively how RLHF works, by performing RL with a KL penalty from the base (imitation learned) language model.

However, RL with a KL penalty is far from the only approach. For example, quantilizers propose sampling from an imitation learning objective N times and picking the best of N. Recent results by Gao et al (2022) show this scheme produces comparable performance to RL, and is more KL efficient (finding policies that are closer to the base model), although is unfortunately computationally infeasible for large N. A natural direction for future work would be to make this more computationally efficient: such as using supervised learning to distill the best-of-N search into a model. More generally, one could explore the design space of optimization schemes, to try and find ways to direct the optimization pressure more towards improving performance without exploiting the model.

We can also seek to change the game played between the main system and helper AI systems to advantage the helpers. For example, by default an overseer AI system provides a training signal in real-time to the main system. For this scheme to work, the overseer needs to be robust zero-shot: a challenging desiderata. However, we could modify this game to instead periodically reset the main system to an earlier checkpoint, then continue training the main system with the current overseer. Under this scheme, the overseer just needs to not be repeatedly fooled by the same attack. Moreover, we could use this scheme to detect potential exploits, by seeing if the main system trained on the later overseer diverges from that from the earlier overseer.

Limitations

Prior work is inconclusive

The ubiquity of adversarial examples in contemporary ML systems suggests that adversarial robustness is a thorny problem to solve. However, most work on adversarial examples has focused on settings very different to those we care about for alignment. Some of these differences make the problem harder to solve: unrestricted adversarial examples are a more realistic threat model, but much harder to defend against than the more extensively-studied ℓ_p-norm perturbations.

However, many differences make the problem easier. For example, it might be sufficient to have a guarantee the model will never make any catastrophic mistake, while tolerating adversarial examples in low-stakes situations. As a toy example, suppose a reward model for an autonomous vehicle assigns +1 reward to getting to the destination, +3 reward for repeatedly circling the block if it sees a sequence of adversarially placed stickers on road signs, and -10 reward for crashing. Although the resulting model certainly has an annoying flaw in the presence of attackers, the worst-case outcome is a temporary annoyance, not a loss of life. In more open-ended domains, corrigibility stands out as a high-stakes outcome, whereas most micro-level decisions made by the AI are comparatively low-stakes.

Perhaps the most important difference is that most prior work on adversarial examples has focused on AI systems as they are today, whereas for alignment we care primarily about more advanced future AI systems. There is a possibility that adversarial robustness will be solved by default as capabilities advance. This is an important limitation, but is already discussed in Case 3, so we will not reiterate this argument here.

Incompetent adversarial examples

A variant on the above argument is that systems are unlikely to pose a grave danger to humanity through mere incompetence, so we should be most alarmed by systems that are capable and striving to achieve misaligned goals. However, as mentioned previously, most adversarial examples seen so far feel more like incompetence: the system is just fundamentally misunderstanding something about the world. Perhaps we should be encouraged by this, as a sign that contemporary AI systems are far from causing a threat.

We agree with this basic intuition. All else being equal, the ubiquity of adversarial examples advocates in favor of longer timelines to transformative AI. However, as adversarial examples are over a decade old we would expect this information to largely already be incorporated into people’s estimates.

Moreover, we agree that we should not expect AI systems to cause catastrophic harm as a direct result of being adversarially exploited. Rather, our primary concern is that aligned helper ML systems will be exploited by a main AI system, resulting in the main AI system being misaligned. Our secondary concern is that even if the main AI system is aligned, if it is exploitable it may be unable to hold bad actors (such as other misaligned AIs) in check. In both cases, the risk comes not from the AI systems that are being exploited, but from misaligned highly capable AIs.

We already knew this

Many of the concrete problems outlined in this post, such as reward hacking or even exploiting interpretability tools, have already been extensively discussed. So, in some sense, this post is not saying anything new: if you were already familiar with these prior risks, there is little reason to be more alarmed by them after reading this post. Instead, we view our key contribution as providing a framework to collect together seemingly disparate safety problems under a common roof and with, potentially, a common solution.

We think the intransigence of many adversarial robustness problems should give people pause for thought when trying to solve one of the special cases. For example, we expect that a solution to reward hacking or even a robust injury classifier could be turned into a solution to many other adversarial robustness problems. Consequently, we should expect such problems to be extremely challenging to solve, as many researchers have tried but failed to solve adversarial robustness.

Won’t improving robustness also improve capabilities?

We believe the directions we’ve highlighted differentially advance safety with limited capabilities externalities. However, in practice one of the easiest ways of getting more robust models may be to just increase their general capabilities. We therefore advocate for the safety community having a nuanced message about adversarial robustness, emphasizing closing the gap between average-case and worst-case performance rather than simply seeking to increase worst-case performance. In particular, there seems to be a popular false equivalency between “alignment” and “train with human feedback”; it would be unfortunate if a similar false equivalency between “safety” and “adversarial robustness” emerged.

Conclusion

We have argued that even state-of-the-art contemporary ML systems are vulnerable to adversarial attack, and that it is likely that even (near-)transformative AI systems will be similarly vulnerable. We’ve explored the implications of this for alignment, finding that a number of popular alignment proposals may fail in this regime. Finally, we’ve outlined research agendas to better understand this problem and address it, both by improving robustness and by adapting alignment techniques to better tolerate adversarial vulnerabilities.

If you are interested in working on problems related to this agenda, FAR AI is hiring for research engineers and research scientists. We’d also be interested in exploring collaborations with researchers at other institutions: feel free to reach out to hello@far.ai.

Acknowledgements

Thanks to Euan McLean for assistance editing this manuscript and to Tony Wang, Stephen Casper, Scott Emmons, Erik Jenner, Nikolaus Howe, Adriá Garriga-Alonso and Tom Tseng for feedback on earlier drafts.

^{^}
Adversarial robustness has received comparatively little attention from the x-risk focused community, so there may still be some areas that are important for x-risk but neglected by the broader ML research community, such as unrestricted adversarial examples.
^{^}
This doesn’t guarantee the helper can exploit itself: recognizing an exploit (so defending against it) could be easier than generation. However, the helper seems well-placed to exploit itself relative to other ML systems of comparable capabilities.
^{^}
Although future ML systems could have more control over their weights. For example, hypernetworks directly generate the weights of another network. In a less extreme case, neural-architecture search with a training objective based on some automatic interpretability metric could exert selection pressure towards “deceptively interpretable” architectures.
^{^}
The best adversarial defenses can largely prevent imperceptible attacks, but are still easily defeated by perceptible perturbations that would not confuse humans.
^{^}
Some recent work (e.g. Cheng et al (2020) and Altinisik et al (2022)) has had some success increasing clean accuracy of adversarially trained models by adaptively perturbing the examples, thereby reducing the robustness tax for adversarial training.
^{^}
Section III of Carlini & Wagner (2016) provide a good summary of methods, most of which are relatively simple optimization problems, although they do require access to gradients through the networks.
^{^}
Table 29 of the supplementary materials of Stiennon et al (2020).
^{^}
Some modest gains are possible from denoising demonstrations, and sufficiently capable systems might generalize a bit past the human distribution.

I'd like to register that I disagree with the claim that standard online RLHF requires adversarial robustness in AIs persay. (I agree that it requires that humans are adversarially robust to the AI, but this is a pretty different problem.)

In particular, the place where adversarial robustness shows up is in sample efficiency. So, poor adversarial robustness is equivalent to poor sample efficiency. My understanding is that the trend is toward higher not lower sample efficiency with scale, so this seems on track.

This same reasoning also applies to recursive oversight schemes or debate.

Overall, my guess is that commercial incentives and normal scaling will cause the sample efficiencies of these processes to be quite high in practice. Additionally, my guess would be that improving adversarial robustness isn't the best way to push on sample efficiency.

That said, I do think there are many important reasons to improve adversarial robustness.

To check my understanding, is your view something like:

1. If the reward model isn't adversarially robust, then the RL component of RLHF will exploit it.

2. These generations will show up in the data presented to the human. Provided the human is adversarially robust, then the human feedback will provide corrective signal to the reward model.

3. The reward model will stop being vulnerable to those adversarial examples, although may still be vulnerable to other adversarial examples.

4. If we repeat this iterative process enough times, we'll end up with a robust reward model.

Under this model, improving adversarial robustness just means we need fewer iterations, showing up as improved sample efficiency.

I agree with this view up to a point. It does seem likely that with sufficient iterations, you'd get an accurate reward model. However, I think the difference in sample efficiency could be profound: e.g exponential (needing to get explicit corrective human feedback for most adversarial examples) vs linear (generalizing in the right way from human feedback). In that scenario, we may as well just ditch the reward model and provide training signal to the policy directly from human feedback.

In practice, we've seen that adversarial training (with practical amounts of compute) improves robustness but models are still very much vulnerable to attacks. I don't see why RLHF's implicit adversarial training would end up doing better than explicit adversarial training.

In general I tend to view sample efficiency discussions as tricky without some quantitative comparison. There's a sense in which decision trees and a variety of other simple learning algorithms are a viable approach to AGI, they're just very sample (and compute) inefficient.

The main reason I can see why RLHF may not need adversarial robustness is if the KL penalty from base model approach people currently use is actually enough.

This is pretty close to my understanding, with one important objection.

Thanks for responding and trying to engage with my perspective.

Objection

If we repeat this iterative process enough times, we'll end up with a robust reward model.

I don't claim we'll necessarily ever get a fully robust reward model, just that the reward model will be mostly robust on average to the actual policy you use as long as human feedback is provided at a frequent enough interval. We never needed a good robust reward model which works on every input, we just needed a reward model which captured our local preferences about the actual current policy distribution sufficiently well.

At any given point, the reward model will be vulnerable to arbitrary adversarial attacks under sufficient optimization pressure, but we don't need arbitrary optimization against any given reward model. Like, each human update lets you optimize a bit more against the reward model which gets you the ability to get somewhat closer to the policy you actually want.

KL penalty is presumably an important part of the picture.

My overall claim is just that normal RLHF is pretty fault tolerant and degrades pretty gracefully with lack of robustness.

Sample efficiency seems fine now and progress continues

Beyond this, my view is considerably informed by 'sample efficiency seems fine in practice now and is better better not worse with more powerful models (from my limited understanding)'. It's possible this trend could reverse with sufficiently big models, but I'd find this surprising and don't see any particular reason to expect this. Technical advancement in RLHF is ongoing and will improve sample efficiency further.

It seems to me like your views either imply that sample efficiency is low enough now that high quality RLHF currently can't compete with other ways of training AIs which are less safe but have cheaper reward signals (e.g., heavy training on automated outcomes based feedback or low quality human reward signal). Or possibly that this will happen in the future as models get more powerful (reversing the current trend toward better sample efficiency). My understanding is that RLHF is quite commercially popular and sample efficiency isn't a huge issue. Perhaps I'm missing some important gap between how RLHF is used now and how it will have to be used in the future to align powerful systems?

Also note that vanilla RLHF doesn't necessarily require optimizing against a reward model. For instance, a recently released paper Direct Preference Optimization (DPO) avoids this step entirely. It's unclear if DPO will win out for various reasons, but as far as I can tell, you should think that DPO should totally resolve the robustness concern with vanilla RLHF. Of course, recursive oversight schemes and other things which use AIs to help oversee AIs could still involve optimizing against AIs.

Edit: I originally linked to the wrong paper, oops

Thanks for the follow-up, this helps me understand your view!

At any given point, the reward model will be vulnerable to arbitrary adversarial attacks under sufficient optimization pressure, but we don't need arbitrary optimization against any given reward model. Like, each human update lets you optimize a bit more against the reward model which gets you the ability to get somewhat closer to the policy you actually want.

Sure, this feels basically right to me. My reframing of this would be that we could do in principle do RL directly with feedback provided by a human. Learning a reward model lets us gain some sample efficiency over this, but sometimes it fails to generalize correctly to what a human would say, and adversarial examples are an important special case of this. But provided the policy is the final output of this process, not the reward model, it doesn't matter if the reward model is robust or not -- just that the feedback the policy has received has steered it into a good position.

It seems to me like your views either imply that sample efficiency is low enough now that high quality RLHF currently can't compete with other ways of training AIs which are less safe but have cheaper reward signals (e.g., heavy training on automated outcomes based feedback or low quality human reward signal). Or possibly that this will happen in the future as models get more powerful (reversing the current trend toward better sample efficiency). My understanding is that RLHF is quite commercially popular and sample efficiency isn't a huge issue. Perhaps I'm missing some important gap between how RLHF is used now and how it will have to be used in the future to align powerful systems?

This argument feels like it's proving too much. InstructGPT isn't perfect, but it does produce a lot less toxic output and follow instructions a lot better than the base model GPT-3. RLHF seems to work, and GPT-4 is even better, showing that it gets easier with bigger models. Why should we expect this trend to reverse? Why are we worried about this safety thing anyway?

I actually find this style of argument pretty plausible: I'm a relative optimist on this forum, I do think that some fairly basic methods like a souped-up RLHF might well be sufficient to make things go OK (while preferring to have more principled methods giving us a bigger safety margin). But I'm somewhat surprised to hear you making that case!

Suppose we condition on RLHF failing. At a high level, failures split into: (a) human labelers rewarded the wrong thing (e.g. fooling humans); (b) the reward model failed to predict human labelers judgement and rewarded the wrong thing (e.g. reward hacking); (c) RL produced a policy that is capable enough to be dangerous but is optimizing something other than the reward model (e.g. mesa-optimization). I find all three of these risks plausible, and I don't see a specific reason to privilege (a) or (c) substantially over (b). It sounds like you're most concerned about (a), and I'd love to hear your reasons for that.

However, I do think it's interesting to explore concrete failure modes: given RLHF is working well now, what does my view imply about how it might stop working? One scenario I find plausible is that as models get bigger, the sample efficiency of RLHF continues to increase, since the models have higher fidelity representations of a greater variety of tasks. RLHF therefore just needs to localize the task that's already in the network. However, the performance of this process is ultimately capped by what the base model already represents. I can see two ways this could go wrong:

1. Misaligned base model. If the foundation model that's being RLHF'd is already misaligned, then a small amount of RL training is not going to be enough to disabuse it of this. By contrast, a large amount of RL training (of the order of 10% of the training steps used for self-supervised learning) with a high-fidelity reward signal might. Unfortunately, we can't do that much RL training without just reward hacking, so we never try. (Or we take that many time steps, but with a KL penalty, forcing the model to stay close to the unaligned base model.)

2. Harmless base model. If the foundation model starts off harmless (not necessarily aligned, just not actively trying to cause harm), then I'd expect RLHF'ing it to only improve things so long as the training signal never rewards bad behavior. However, the designers want the model to significantly outperform humans at this task. The model has capacity to learn to do this, but can't just leverage existing capabilities in the foundation model, as the performance of that model is limited to that of the best humans it saw in the self-supervised training data. So, we need to do RL for many more time steps. Collecting fresh human data for that is prohibitive, so we rely on a reward model -- unfortunately that gets hacked.

I think I find the second case more compelling. The first case seems concerning as well, but it seems like quite a scary scenario even if robustness gets solved, whereas I expect fixing robustness to actually make a significant dent in the second scenario.

That said, I feel like I should emphasize in all this that I largely think robustness is an intractable problem to solve, and that while it's worth trying to improve it at the margin I'm most excited by efforts to make systems not need robustness. I think you make a good point that having humans in the loop in the training makes RLHF degrade more gracefully than reward learning approaches that train on a fixed offline dataset, and the KL penalty also helps. I suspect there are many similar algorithmic tweaks that would make algorithms less sensitive to robustness.

For example, riffing off going from an offline to online dataset, could we improve things further by collecting a dataset that anticipates where the RL process might try to exploit it? That sounds fanciful, but there's a simple hack you can do to get something like this. Just train a policy on the current reward model. Then collect human feedback from that policy. Then roll the policy back to the last checkpoint, and repeat the training using the new reward model. You could do this step once per checkpoint, or keep doing it until you get human approval to move on (e.g. the reward model now aligns with human feedback). In this way you should be able to avoid ever giving the policy reward for the wrong behavior. I suspect this process would be much less sample efficient than vanilla RLHF, but it would have better safety properties, and measuring how much slower it is could be a good proxy for how severe the "robustness tax" is.

Also note that vanilla RLHF doesn't necessarily require optimizing against a reward model. For instance, a recently released paper Direct Policy Optimization (DPO) avoids this step entirely.

I'm a bit confused by the claim here, although I've only read the abstract and skimmed the paper so perhaps it'd become obvious from a closer read. As far as I can tell, the cited paper focuses on motion-planning, and considers a rather restricted setting of LQR policies. This is a reasonable starting point, but a human communicating that a cart pole should stand up (or a desired quadcoptor trajectory) feels much simpler than even toy tasks for RLHF in LLMs like summarizing text. So I don't really see this as much evidence in favor of being able to drop the reward model. Generally, any highly sample efficient model-based RL approach would enable RL training to proceed without a reward model by having humans directly label the trajectory,

The model has capacity to learn to do this, but can't just leverage existing capabilities in the foundation model, as the performance of that model is limited to that of the best humans it saw in the self-supervised training data. So, we need to do RL for many more time steps.

It's unclear to me if this is true because modelling the humans that generated the training data sufficiently well probably requires the model to be smarter than the humans it is modelling. So I expect the current regime where RLHF just elicits a particular persona from the model rather than teaching any new abilities to be sufficient to reach superhuman capabilities.

(Here's a possibly arcane remark. It's worth noting that I think always correct reward models are sufficient for high stakes alignment via runtime filtering (technically you just need to never give a very bad output decent reward). So, always correct reward models would be great if you could get them. Note that always correct could look pretty different than current advsarial robustness; in particular, you also need to worry about collusion and stuff like seizing physical control of the reward channel. I also think that avoiding high stakes failure via adversarial training and/or other robustness style tech seems generally good. I've been assuming we're talking about the low stakes alignment problem (aka outer alignment))

This argument feels like it's proving too much. InstructGPT isn't perfect, but it does produce a lot less toxic output and follow instructions a lot better than the base model GPT-3. RLHF seems to work, and GPT-4 is even better, showing that it gets easier with bigger models. Why should we expect this trend to reverse? Why are we worried about this safety thing anyway?

I actually find this style of argument pretty plausible: I'm a relative optimist on this forum, I do think that some fairly basic methods like a souped-up RLHF might well be sufficient to make things go OK (while preferring to have more principled methods giving us a bigger safety margin). But I'm somewhat surprised to hear you making that case!

I think this trend extrapolation argument seems fine in the absence of a specific defeater. In the case of AI takeover, there is a clear defeater to 'models have been getting more aligned over time'.

I was trying to get at which specific defeater you thought overcame the trend expolation argument. Here's an attempt at an exhaustive list of defeaters:

Thinking sample efficiency would get worse in the future breaking the current trend
Thinking that the current 'trend extrapolated' sample efficiency would be insufficent or otherwise good to improve on the margin
Thinking that there would be important negative consequences of reward model non-robustness which aren't well described by sample efficiency (e.g., teaching the model to lie via first practicing on the reward model, or having an easier time exploring into bad behavior or something)

It's also worth noting that if I was centrally interested in (2) I would push on that directly. (But you have other applications of robustness in mind, so this might not be that interesting.)

I still don't understand which of (1), (2), or (3) your most worried about. (Maybe (3) based on some argument I don't yet understand? I also don't see why hacking the reward model is dangerous which is maybe an important crux here.)

Aside on alignment trend extrapolation:

I'm also not really sure how to measure 'aligned' in a way that makes sense given that models have been also getting smarter. It seems plausible that alignment has been notably improving over time? Beyond this, the more natural trend extrapolation might be takeover risk. My guess is that the ex-ante takeover risk from GPT4 should have been like 0.1% and then the future/ongoing risk is more like 0.001% or something. And, the trend extrapolation doesn't look good here : ).

I still don't understand which of (1), (2), or (3) your most worried about.

Sample efficiency isn't the main way I think about this topic so it's a bit difficult to answer. I find all these defeaters fairly plausible, but if I had to pick the central concern it'd be (3).

I tend to view ML training as a model taking a path through a space of possible programs. There's some programs that are capable and aligned with our interests; others that are capable but will actively pursue harmful goals; and of course many other programs that just don't do anything particularly useful. Assuming we start with model that is aligned (where "aligned" could include "model cannot do anything useful so does not cause any harm") and we only reward positive behavior, I find it plausible that we can hill-climb to more capable models while preserving alignment.

However, suppose we at some point err and reward undesirable behavior. (This could occur due to incorrect human feedback, or a reward model that is not robust, or some other issues.) At this point, we're training a sub-component of the system that is actively opposed to our interests. Hopefully, we eventually discover this sub-component, and can then disincentivize it in the training process. But at that point, there is some uncertainty in my mind: will the training process remove the sub-component, or simply train the sub-component into being better able to fool the training process?

Now, we don't need the reward model to be perfectly robust to avoid this (as you quite rightly point out), just robust in the region of policy space around the current policy where the RL algorithm is likely to explore. But empirically current reward model robustness falls short of even this.

In response to:

2. Harmless base model. If the foundation model starts off harmless (not necessarily aligned, just not actively trying to cause harm), then I'd expect RLHF'ing it to only improve things so long as the training signal never rewards bad behavior. However, the designers want the model to significantly outperform humans at this task. The model has capacity to learn to do this, but can't just leverage existing capabilities in the foundation model, as the performance of that model is limited to that of the best humans it saw in the self-supervised training data. So, we need to do RL for many more time steps. Collecting fresh human data for that is prohibitive, so we rely on a reward model -- unfortunately that gets hacked.

you write:

Are you assuming that we can't collect human data online as the policy optimizes against the reward model? (People currently do collect data online to avoid getting hacked like this.) This case seems probably hopeless to me without very strong regularization (I think you agree with this being mostly hopeless), but it also seems easy to avoid by just collecting human data online.

No, I do expect online data collection to take place, I just don't expect to be able to do that data collection fast enough or in large enough volumes to kick in before hacking takes place. I think in your taxonomy, this is defeater (2): I think we'll need substantially more samples to train superhuman models than we do human models, as the demands from RLHF switch from localizing a task that a network already knows how to perform, to teaching a model to perform a new capability (safely). (I will note online data collection is a pain and people seem to try and do as little as possible of it.)

Collecting fresh human data for that is prohibitive, so we rely on a reward model -- unfortunately that gets hacked.

Are you assuming that we can't collect human data online as the policy optimizes against the reward model? (People currently do collect data online to avoid getting hacked like this.) This case seems probably hopeless to me without very strong regularization (I think you agree with this being mostly hopeless), but it also seems easy to avoid by just collecting human data online.

Suppose we condition on RLHF failing. At a high level, failures split into: (a) human labelers rewarded the wrong thing (e.g. fooling humans); (b) the reward model failed to predict human labelers judgement and rewarded the wrong thing (e.g. reward hacking); (c) RL produced a policy that is capable enough to be dangerous but is optimizing something other than the reward model (e.g. mesa-optimization).

I don't really see why (b) leads to dangerous failures. It seems like failures should be totally benign and just result in somewhat lower production?

Beyond this, it seems like this failure should happen early as it doesn't require clever models to occur, so by default there will be strong commercial incentives to resolve this.

I agree it's an alignment failure in some sense which could be addressed by alignment technology. I just think it isn't very important to reduce from an X-risk/AI takeover perspective.

Related to this. You say:

I suspect this process would be much less sample efficient than vanilla RLHF, but it would have better safety properties, and measuring how much slower it is could be a good proxy for how severe the "robustness tax" is.

What specific safety properties are you thinking about?

As far as I can tell sample efficiency is the only safety property which seems important here.

There could be other important considerations beyond sample efficiency for avoiding your policy doing some hacking of the reward model, but none of the ones I can think of seem importantly dangerous to me.

Naively, I'd guess it's slightly bad on top of the sample efficiency issues, but mostly unimportant.

Suppose we condition on RLHF failing. At a high level, failures split into: (a) human labelers rewarded the wrong thing (e.g. fooling humans); (b) the reward model failed to predict human labelers judgement and rewarded the wrong thing (e.g. reward hacking); (c) RL produced a policy that is capable enough to be dangerous but is optimizing something other than the reward model (e.g. mesa-optimization).

Minor point, feel free to ignore.

FWIW, I typically use 'reward hacking' to refer to just (a) here. I'd just call (b) 'poor reward model sample efficiency'. That said, I more centrally use 'reward hacking' to describe hacking a reward process based on outcomes via stuff like 'sensor tampering', but this is still a subset of RLHF: the subset where humans look at outcomes and then assess reward taking this into account.

Oh, we're using terminology quite differently then. I would not call (a) reward hacking, as I view the model as being the reward (to the RL process), whereas humans are not providing reward at all (but rather some data that gets fed into a reward model's learning process). I don't especially care about what definitions we use here, but do wonder if this means we're speaking past each other in other areas as well.

I'm a bit confused by the claim here, although I've only read the abstract and skimmed the paper so perhaps it'd become obvious from a closer read. As far as I can tell, the cited paper focuses on motion-planning, and considers a rather restricted setting of LQR policies.

I originally linked to the wrong paper! : (

Here is the actual Direct Preference Optimization paper. (I guess I just googled something like 'DPO RL' and then didn't actually check that it was the right paper)

Yikes, sorry for wasting your time.

Ah, that paper makes a lot more sense. A reward model was attractive in the original Deep RL From Human Preferences paper because the environment was complex and non-differentiable: using RL was a natural fit. It's always seemed a bit stranger to use RL for fine-tuning language models, especially in the prompt-completion setting where the "environment" is trivial. (RL becomes more natural when you start introducing external tools, or conversations with humans.)

I'll need to take a closer look at the paper, but it looks like they derive the DPO objective by starting from the RL objective under KL optimization. So if it does what it says on the tin, then I'd expect the resulting policy incentives to be similar. My hunch is the problem of reward hacking has shifted from an explicit to implicit problem rather than being eliminated, although I'm certainly not confident on this. Could be interesting to study using a similar approach to the Scaling Laws for Reward Model Overoptimization paper.

comment TLDR: Adversarial examples are a weapon against the AIs we can use for good and solving adversarial robustness would let the AIs harden themselves.

I haven't read this yet (I will later : ) ), so it's possible this is mentioned, but I'd note that exploiting the lack of adversarial robustness could also be used to improve safety. For instance, AI systems might have a hard time keeping secrets if they also need to interact with humans trying to test for verifiable secrets. E.g., trying to jailbreak AIs to get them to tell you about the fact that they already hacked the ssh server.

Ofc, there are also clownier versions like trying to fight the robot armies with adversarial examples (naively seems implausible, but who knows). I'd guess that reasonably active militaries probably already should be thinking about making their camo also be a multi-system adversarial example if possible. (People are already using vision models for targeting etc.)

I don't think this lack of adversarial robustness gets you particularly strong alignment properties, but it's worth noting upsides as well as downsides.

Perhaps we'd ideally train our AIs to be robust to most classes of things but just ensure they remain non-robust in a few (ideally secret) edge cases. However, the most important use cases for AI exploits are in situations after AIs have already started to be able to apply training procedures to themselves without our consent.

My overall guess is that the upsides of solving adversarial robustness outweigh the downsides.

This is a good point, adversarial examples in what I called in the post the "main" ML system can be desirable even though we typically don't want them in the "helper" ML systems used to align the main system.

One downside to adversarial vulnerability of the main ML system is that it could be exploited by bad actors (whether human or other, misaligned AIs). But this might be fine in some settings: e.g. you build some airgapped system that helps you build the next, more robust and aligned AI. One could also imagine crafting adversarial example backdoors that are cryptographically hard to discover if you don't know how they were constructed.

I generally expect that if adversarial robustness can be practically solved then transformative AI systems will eventually self-improve themselves to the point of being robust. So, the window where AI systems are dangerous & deceptive enough that we need to test them using adversarial examples but not capable enough to have overcome this might be quite short. Could still be useful as an early-warning sign, though.

For the purpose of this section, we will consider adversarial robustness to be solved if systems cannot be practically exploited to cause catastrophic outcomes.

Regarding the predictions, I want to make the following quibble: According to the definition above, one way of "solving" adversarial robustness is to make sure that nobody tries to catastrophically exploit the system in the first place. (In particular, exploitable AI that takes over the world is no longer exploitable.)

So, a lot with this definition rests on how do you distinguish between "cannot be exploited" and "will not be exploited".

And on reflection, I think that for some people, this is close to being a crux regarding the importance of all this research.

"However, to the best of our knowledge there are no quantitative scaling laws for robustness yet."

For scaling laws for adversarial robustness, see appendix A.15 of openreview.net/pdf?id=sckjveqlCZ#page=22

Thanks, I'd missed that!

Curious if you have any high-level takeaways from that? Bigger models do better, clearly, but e.g. how low do you think we'll be able to get the error rate in the next 5-10 years given expected compute growth? Are there any follow-up experiments you'd like to see happen in this space?

Also could you clarify whether the setting was for adversarial training or just a vanilla model? "During training, adversarial examples for training are constructed by PGD attacker of 30 iterations" makes me think it's adversarial training but I could imagine this just being used for evals.

The setting was adversarial training and adversarial evaluation. During training, PGD attacker of 30 iterations is used to construct adversarial examples used for training. During testing, the evaluation test set is an adversarial test set that is constructed via PGD attacker of 20 iterations.

Experimental data of y-axis is obtained from Table 7 of https://arxiv.org/abs/1906.03787; experimental data of x-axis is obtained from Figure 7 of https://arxiv.org/abs/1906.03787.

I also want to add that I really like the use of the prediction UI in this post.

Imitation learning methods seem less risky, as the optimization pressure is simply to match the empirical distribution of a demonstration dataset. The closest to “reward hacking” in this setting would be overfitting to the dataset, a relatively benign failure mode. There is still some risk of inner optimization objectives arising, which could then be adversarial to other systems (e.g. attempt to hide themselves from transparency tools), but comparatively speaking this is one of the methods with the lowest risk of adversarial failure. [Bolding by DanielFilan]

Presumably in a set-up where the agent can meaningfully act in the world while still doing imitation learning, one way to reward hack would be to take a short-term imitation hit to install tons of surveillance of humans and/or make them easier to predict, so that in the long-term it will be easier to accurately imitate them, right? That said, I think I probably agree with the comparative claim you make.

Right: if the agent has learned an inner objective of "do things similar to what humans do in the world at the moment I am currently acting", then it'd definitely be incentivized to do that. It's not rewarded by the outer objective for e.g. behavioral cloning on a fixed dataset, as installing bunch of cameras would be punished by that loss (not something humans do) and changing human behavior wouldn't help as BC would still be on the dataset of pre-manipulation demos. That might be little comfort if you're worried about inner optimization, but most the other failures described happen even in the outer alignment case.

That said, if you had a different imitation learning setup that was something like doing RL on a reward of "do the same thing one of our human labelers chooses given the same state" then the outer objective would reward what the behavior you describe. It'd be a hard exploration problem for the agent to learn to exploit the reward in that way, but it quite probably could do so if situationally aware.

Conversely, it is less likely to happen if the optimization process is pursuing a goal that is largely independent from the helper’s output, such as if the helper is an interpretability tool checking for the presence of mesa-optimization.

Maybe I'm missing something obvious here, but for advanced AI systems under monitoring that happen to be misaligned, won't checking for the presence of mesa-optimization be relevant to the main system's goal, in cases where the results of those checks matter for whether to re-train or deploy the AI?

Ah, you make this point here:

However, in practical applications some optimization pressure against the helper will still leak in, for example if the helper is used to decide whether to abort a training run or change its hyperparameters.

I would like to point out one aspects of the "Vulnerable ML systems" scenario that the post doesn't discuss much: the effect on adversarial vulnerability on widespread-automation worlds.

Using existing words, some ways of pointing towards what I mean are: (1) Adversarial robustness solved after TAI (your case 2), (2) vulnerable ML systems + comprehensive AI systems, (3) vulnerable ML systems + slow takeoff, (4) fast takeoff happening in the middle of (3).

But ultimately, I think none of these fits perfectly. So a longer, self-contained description is something like:

Consider the world where we automate more and more things using AI systems that have vulnerable components. Perhaps those vulnerabilities primarily come from narrow-purpose neural networks and foundation models. But some might also come from insecure software design, software bugs, and humans in the loop.
And suppose some parts of the economy/society will be designed more securely (some banks, intelligence services, planes, hopefully nukes)...while others just have glaring security holes.
A naive expectation would be that a security hole gets fixed if and only if there is somebody who would be able to exploit it. This is overly optimistic, but note that even this implies the existence of many vulnerabilities that would require stronger-than-existing level of capability to exploit. More realistically, the actual bar for fixing security holes will be "there might be many people who can exploit this, but it is not worth their opportunity cost". And then we will also not-fix all the holes that we are unaware of, or where the exploitation goes undetected.
These potential vulnerabilities leave a lot of space for actual exploitation when the stakes get higher, or we get a sudden jump in some area of capabilities, or when many coordinated exploits become more profitable than what a naive extrapolation would suggest.

There are several potential threats that have particularly interesting interactions with this setting:

(A) Alignment scheme failure: An alignment scheme that would otherwise work fails due to vulnerabilities in the AI company training it. This seems the closest to what this post describes?
(B) Easier AI takeover: Somebody builds a misaligned AI that would normally be sub-catastrophic, but all of these vulnerabilities allow it to take over.
(C) Capitalism gone wrong: The vulnerabilities regularly get exploited, in ways that either go undetected or cause negative externalities that nobody relevant has incentives to fix. And this destroys a large portion of the total value.
(D) Malicious actors: Bad actors use the vulnerabilities to cause damage. (And this makes B and C worse.)
(E) Great-power war: The vulnerabilities get exploited during a great-power war. (And this makes B and C worse.)

Connection to Cases 1-3: All of this seems very related to how you distinguish between adversarial robustness gets solved before tranformative AI/after TAI/never. However, I would argue that TAI is not necessarily the relevant cutoff point here. Indeed, for Alignment failure (A) and Easier takeover (B), the relevant moment is "the first time we get an AI capable of forming a singleton". This might happen tomorrow, by the time we have automated 25% of economically-relevant tasks, half a year into having automated 100% of tasks, or possibly never. And for the remaining threat models (C,D,E), perhaps there are no single cutoff points, and instead the stakes and implications change gradually?

Implications: Personally, I am the most concerned about misaligned AI (A and B) and Capitalism gone wrong (C). However, perhaps risks from malicious actors and nation-state adversaries (D, E) are more salient and less controversial, while pointing towards the same issues? So perhaps advancing the agenda outlined in the post can be best done through focusing on these? [I would be curious to know your thoughts.]

An idea for increasing the impact of this research: Mitigating the "goalpost moving" effect for "but surely a bit more progress on capabilities will solve this".

I suspect that many people who are sceptical of this issue will, by default, never sit down and properly think about this. If they did, they might make some falsifiable predictions and change their minds --- but many of them might never do that. Or perhaps many people will, but it will all happen very gradually, and we will never get a good enough "coordination point" that would allow us to take needle-shifting actions.

I also suspect there are ways of making this go better. I am not quite sure what they are, but here are some ideas: Making and publishing surveys. Operationalizing all of this better, in particular with respect to the "how much does this actually matter?" aspect. Formulating some memorable "hypothesis" that makes it easier to refer to this in conversations and papers (cf "orthogonality thesis"). Perhaps making some proponents of "the opposing view" make some testable predictions, ideally some that can be tested with systems whose failures won't be catastrophic yet?