When I started working on this project, a number of people came to me and told me (with varying degrees of tact) that I was wasting my time on a fool's errand. Around half the people told me they thought it was extremely unlikely I'd find such a vulnerability. Around the other half told me such vulnerabilities obviously existed, and there was no point demonstrating it. Both sets of people were usually very confident in their views. In retrospect I wish I'd done a survey (even an informal one) before conducting this research to get a better sense of people's views.
Personally I'm in the camp that vulnerabilities like these existing was highly likely given the failures we've seen in other ML systems and the lack of any worst-case guarantees. But I was very unsure going in how easy they'd be to find. Go is a pretty limited domain, and it's not enough to beat the neural network: you've got to beat Monte-Carlo Tree Search as well (and MCTS does have worst-case guarantees, albeit only in the limit of infinite search). Additionally, there are results showing that scale improves robustness (e.g. more pre-training data reduces vulnerability to adversarial examples in image classifiers).
In fact, although the method we used is fairly simple, actually getting everything to work was non-trivial. There was one point after we'd patched the first (rather degenerate) pass-attack that the team was doubting whether our method would be able to beat the now stronger KataGo victim. We were considering cancelling the training run, but decided to leave it going given we had some idle GPUs in the cluster. A few days later there was a phase shift in the win rate of the adversary: it had stumbled across some strategy that worked and finally was learning.
This is a long-winded way of saying that I did change my mind as a result of these experiments (towards robustness improving less than I'd previously thought with scale). I'm unsure how much effect it will have on the broader ML research community. The paper is getting a fair amount of attention, and is a nice pithy example of a failure mode. But as you suggest, the issue may be less a difference in concrete belief (surely any ML researcher would acknowledge adversarial examples are a major problem and one that is unlikely to be solved any time soon), than that of culture (to what degree is a security mindset appropriate?).
This post was written as a summary of the results of the paper, intended for a fairly broad audience, so we didn't delve much into the theory of change behind this agenda here. You might find this blog post describing the broader research agenda this paper fits into provides some helpful context, and I'd be interested to hear your feedback on that agenda.
Thanks for flagging this disagreement Ryan. I enjoyed our earlier conversation (on LessWrong and in-person) and updated in favor of the sample efficiency framing, although we (clearly) still have some significant differences in perspective here. Would love to catch up again sometime and see if we can converge more on this. I'll try and summarize my current take and our key disagreements for the benefit of other readers.
I think I mostly agree with you that in the special case of vanilla RLHF this problem is equivalent to a sample efficiency problem. Specifically, I'm referring to the case where we perform RL on a learned reward model; that reward model is trained based on human feedback from an earlier version of the RL policy; and this process iterates. In this case, if the RL algorithm learns to exploit the reward model (which it will, in contemporary systems, without some regularization like a KL penalty) then the reward model will receive corrective feedback from the human. At worst, this process will just not converge, and the policy will just bounce from one adversarial example to another -- useless, but probably not that dangerous. In practice, it'll probably work fine given enough human data and after tuning parameters.
However, I think sample efficiency could be a really big deal! Resolving this issue of overseers being exploited I expect could change the asymptotic sample complexity (e.g. exponential to linear) rather than just changing the constant factor. My understanding is that your take is that sample efficiency is unlikely to be a problem because RLHF works fine now, is fairly sample efficient, and improves with model scale -- so why should we expect it to get worse?
I'd argue first that sample efficiency now may actually be quite bad. We don't exactly have any contemporary model that I'd call aligned. GPT-4 and Claude are a lot better than what I'd expect from base models their size -- but "better than just imitating internet text" is a low bar. I expect if we had ~infinite high quality data to do RLHF on these models would be much more aligned. (I'm not sure if having ~infinite data of the same quality that we do now would help; I tend to assume you can trade less quantity for increased quality, but there are obviously some limits here.)
I'm additionally concerned that sample efficiency may be highly task dependent. RLHF is a pretty finnicky method, so we're tending to see the success cases of it. What if there are just certain tasks that it's really hard to use RLHF for (perhaps because the base model doesn't already have a good representation of it)? There'll be a strong economic pressure to develop systems that do that task anyway, just using less reliable proxies for that task objective.
(A similar argument will apply for various recursive oversight schemes or debate.)
This might be the most interesting disagreement, and I'd love to dig into this more. With RLHF I can see how you can avoid the problem with sufficient samples since the human won't be fooled by the AdvEx. But this stops working in a domain where you need scalable oversight as the inputs are too complex for a human to judge, so can't provide any input.
The strongest argument I can see for your view is that scalable oversight procedures already have to deal with a human that says "I don't know" for a lot of inputs. So, perhaps you can make a base model that perfectly mimics what the human would say on a large subset of inputs, and for AdvEx's (as well as some other inputs) says "I don't know". This is still a hard problem -- my impression was adversarial example detection is still far from solved -- but is plausibly a fair bit easier than full robustness (which I suspect isn't possible). Then you can just use your scalable oversight procedure to make the "I don't knows" go away.
Alteratively, if you think the issue is that periodically being incentivized to adversarially attack the reward model has serious problematic effects on the inductive biases of RL, it seems relevant to argue for why this would be the case. I don't really see why this would be important. It seems like periodically being somewhat trained to find different advexes shouldn't have much effect on how the AI generalizes?
I think this is an area where we disagree but it doesn't feel central to my view -- I can see it going either way, and I think I'd still be concerned by whether the oversight process is robust even if the process wasn't path dependent (e.g. we just did random restarting of the policy every time we update the reward model).
Thanks, that's a good link. In our case our assets significantly exceed the FDIC $250k insurance limit and there are operational costs to splitting assets across a large number of banks. But a high-interest checking account could be a good option for many small orgs.
Does this circle exploit have any connection to convolutions? That was my first thought when I saw the original writeups, but nothing here seems to help explain where the exploit is coming from. All of the listed agents vulnerable to it, AFAIK, make use of convolutions. The description you give of Wu's anti-circle training sounds a lot like you would expect from an architectural problem like convolution blindness: training can solve the specific exploit but then goes around in cycles or circles (ahem), simply moving the vulnerability around, like squeezing a balloon.
We think it might. One weak point against this is that we tried training CNNs with larger kernels and the problem didn't improve. However, it's not obvious that larger kernels would fix it (it gives the model less need for spatial locality, but it might still have an inductive bias towards it), and the results are a bit confounded since we trained the CNN based on historical KataGo self-play training data rather. We've been considering training a version of KataGo from scratch (generating new self-play data) to use vision transformers which would give a cleaner answer to this. It'd be somewhat time consuming though, so curious to hear how interesting you and other commenters would find this result so we can prioritize.
We're also planning on doing mechanistic interpretability to better understand the failure mode, which might shed light on this question.
Do you know they are distinct? The discussion of Go in that paper is extremely brief and does not describe what the exploitation is at all, AFAICT. Your E3 also doesn't seem to describe what the Timbers agent does.
My main reason for believing they're distinct is that an earlier version of their paper includes Figure 3 providing an example Go board that looks fairly different to ours. It's a bit hard to compare since it's a terminal board, there's no move history, but it doesn't look like what would result from capture of a large circular group. But I do wish the Timbers paper went into more detail on this, e.g. including full game traces from their latest attack. I encouraged the authors to do this but it seems like they've all moved on to other projects since then and have limited ability to revise the paper.
This matches my impression. FAR could definitely use more funding. Although I'd still at the margin rather hire someone above our bar than e.g. have them earn-to-give and donate to us, the math is getting a lot closer than it used to be, to the point where those with excellent earning potential and limited fit for AI safety might well have more impact pursuing a philanthropic pathway.
I'd also highlight there's a serious lack of diversity in funding. As others in the thread have mentioned, the majority of people's funding comes (directly or indirectly) from OpenPhil. I think OpenPhil does a good job trying to mitigate this (e.g. being careful about power dynamics, giving organizations exit grants if they do decide to stop funding an org, etc) it's ultimately not a healthy dynamic, and OpenPhil appears to be quite capacity constrained in terms of grant evaluation. So, the entry of new funders would help diversify this in addition to increasing total capacity.
One thing I don't see people talk about as much but also seems like a key part of the solution: how can alignment orgs and researchers make more efficient use of existing funding? Spending that was appropriate a year or two ago when funding was plentiful may not be justified any longer, so there's a need to explicitly put in place appropriate budgets and spending controls. There's a fair amount of cost-saving measures I could see the ecosystem implementing that would have limited if any hit on productivity: for example, improved cash management (investing in government money market funds earning ~5% rather than 0% interest checking accounts); negotiating harder with vendors (often possible to get substantial discounts on things like cloud compute or commercial real-estate); and cutting back on some fringe benefits (e.g. more/higher-density open plan rather than private offices). I'm not trying to point fingers here: I've made missteps here as well, for example FAR's cash management currently has significant room for improvement -- we're in the process of fixing this and plan to share a write-up of what we found with other orgs in the next month.
I still don't understand which of (1), (2), or (3) your most worried about.
Sample efficiency isn't the main way I think about this topic so it's a bit difficult to answer. I find all these defeaters fairly plausible, but if I had to pick the central concern it'd be (3).
I tend to view ML training as a model taking a path through a space of possible programs. There's some programs that are capable and aligned with our interests; others that are capable but will actively pursue harmful goals; and of course many other programs that just don't do anything particularly useful. Assuming we start with model that is aligned (where "aligned" could include "model cannot do anything useful so does not cause any harm") and we only reward positive behavior, I find it plausible that we can hill-climb to more capable models while preserving alignment.
However, suppose we at some point err and reward undesirable behavior. (This could occur due to incorrect human feedback, or a reward model that is not robust, or some other issues.) At this point, we're training a sub-component of the system that is actively opposed to our interests. Hopefully, we eventually discover this sub-component, and can then disincentivize it in the training process. But at that point, there is some uncertainty in my mind: will the training process remove the sub-component, or simply train the sub-component into being better able to fool the training process?
Now, we don't need the reward model to be perfectly robust to avoid this (as you quite rightly point out), just robust in the region of policy space around the current policy where the RL algorithm is likely to explore. But empirically current reward model robustness falls short of even this.
In response to:
2. Harmless base model. If the foundation model starts off harmless (not necessarily aligned, just not actively trying to cause harm), then I'd expect RLHF'ing it to only improve things so long as the training signal never rewards bad behavior. However, the designers want the model to significantly outperform humans at this task. The model has capacity to learn to do this, but can't just leverage existing capabilities in the foundation model, as the performance of that model is limited to that of the best humans it saw in the self-supervised training data. So, we need to do RL for many more time steps. Collecting fresh human data for that is prohibitive, so we rely on a reward model -- unfortunately that gets hacked.
you write:
Are you assuming that we can't collect human data online as the policy optimizes against the reward model? (People currently do collect data online to avoid getting hacked like this.) This case seems probably hopeless to me without very strong regularization (I think you agree with this being mostly hopeless), but it also seems easy to avoid by just collecting human data online.
No, I do expect online data collection to take place, I just don't expect to be able to do that data collection fast enough or in large enough volumes to kick in before hacking takes place. I think in your taxonomy, this is defeater (2): I think we'll need substantially more samples to train superhuman models than we do human models, as the demands from RLHF switch from localizing a task that a network already knows how to perform, to teaching a model to perform a new capability (safely). (I will note online data collection is a pain and people seem to try and do as little as possible of it.)
Oh, we're using terminology quite differently then. I would not call (a) reward hacking, as I view the model as being the reward (to the RL process), whereas humans are not providing reward at all (but rather some data that gets fed into a reward model's learning process). I don't especially care about what definitions we use here, but do wonder if this means we're speaking past each other in other areas as well.
Ah, that paper makes a lot more sense. A reward model was attractive in the original Deep RL From Human Preferences paper because the environment was complex and non-differentiable: using RL was a natural fit. It's always seemed a bit stranger to use RL for fine-tuning language models, especially in the prompt-completion setting where the "environment" is trivial. (RL becomes more natural when you start introducing external tools, or conversations with humans.)
I'll need to take a closer look at the paper, but it looks like they derive the DPO objective by starting from the RL objective under KL optimization. So if it does what it says on the tin, then I'd expect the resulting policy incentives to be similar. My hunch is the problem of reward hacking has shifted from an explicit to implicit problem rather than being eliminated, although I'm certainly not confident on this. Could be interesting to study using a similar approach to the Scaling Laws for Reward Model Overoptimization paper.
Thanks for the follow-up, this helps me understand your view!
At any given point, the reward model will be vulnerable to arbitrary adversarial attacks under sufficient optimization pressure, but we don't need arbitrary optimization against any given reward model. Like, each human update lets you optimize a bit more against the reward model which gets you the ability to get somewhat closer to the policy you actually want.
Sure, this feels basically right to me. My reframing of this would be that we could do in principle do RL directly with feedback provided by a human. Learning a reward model lets us gain some sample efficiency over this, but sometimes it fails to generalize correctly to what a human would say, and adversarial examples are an important special case of this. But provided the policy is the final output of this process, not the reward model, it doesn't matter if the reward model is robust or not -- just that the feedback the policy has received has steered it into a good position.
It seems to me like your views either imply that sample efficiency is low enough now that high quality RLHF currently can't compete with other ways of training AIs which are less safe but have cheaper reward signals (e.g., heavy training on automated outcomes based feedback or low quality human reward signal). Or possibly that this will happen in the future as models get more powerful (reversing the current trend toward better sample efficiency). My understanding is that RLHF is quite commercially popular and sample efficiency isn't a huge issue. Perhaps I'm missing some important gap between how RLHF is used now and how it will have to be used in the future to align powerful systems?
This argument feels like it's proving too much. InstructGPT isn't perfect, but it does produce a lot less toxic output and follow instructions a lot better than the base model GPT-3. RLHF seems to work, and GPT-4 is even better, showing that it gets easier with bigger models. Why should we expect this trend to reverse? Why are we worried about this safety thing anyway?
I actually find this style of argument pretty plausible: I'm a relative optimist on this forum, I do think that some fairly basic methods like a souped-up RLHF might well be sufficient to make things go OK (while preferring to have more principled methods giving us a bigger safety margin). But I'm somewhat surprised to hear you making that case!
Suppose we condition on RLHF failing. At a high level, failures split into: (a) human labelers rewarded the wrong thing (e.g. fooling humans); (b) the reward model failed to predict human labelers judgement and rewarded the wrong thing (e.g. reward hacking); (c) RL produced a policy that is capable enough to be dangerous but is optimizing something other than the reward model (e.g. mesa-optimization). I find all three of these risks plausible, and I don't see a specific reason to privilege (a) or (c) substantially over (b). It sounds like you're most concerned about (a), and I'd love to hear your reasons for that.
However, I do think it's interesting to explore concrete failure modes: given RLHF is working well now, what does my view imply about how it might stop working? One scenario I find plausible is that as models get bigger, the sample efficiency of RLHF continues to increase, since the models have higher fidelity representations of a greater variety of tasks. RLHF therefore just needs to localize the task that's already in the network. However, the performance of this process is ultimately capped by what the base model already represents. I can see two ways this could go wrong:
1. Misaligned base model. If the foundation model that's being RLHF'd is already misaligned, then a small amount of RL training is not going to be enough to disabuse it of this. By contrast, a large amount of RL training (of the order of 10% of the training steps used for self-supervised learning) with a high-fidelity reward signal might. Unfortunately, we can't do that much RL training without just reward hacking, so we never try. (Or we take that many time steps, but with a KL penalty, forcing the model to stay close to the unaligned base model.)
2. Harmless base model. If the foundation model starts off harmless (not necessarily aligned, just not actively trying to cause harm), then I'd expect RLHF'ing it to only improve things so long as the training signal never rewards bad behavior. However, the designers want the model to significantly outperform humans at this task. The model has capacity to learn to do this, but can't just leverage existing capabilities in the foundation model, as the performance of that model is limited to that of the best humans it saw in the self-supervised training data. So, we need to do RL for many more time steps. Collecting fresh human data for that is prohibitive, so we rely on a reward model -- unfortunately that gets hacked.
I think I find the second case more compelling. The first case seems concerning as well, but it seems like quite a scary scenario even if robustness gets solved, whereas I expect fixing robustness to actually make a significant dent in the second scenario.
That said, I feel like I should emphasize in all this that I largely think robustness is an intractable problem to solve, and that while it's worth trying to improve it at the margin I'm most excited by efforts to make systems not need robustness. I think you make a good point that having humans in the loop in the training makes RLHF degrade more gracefully than reward learning approaches that train on a fixed offline dataset, and the KL penalty also helps. I suspect there are many similar algorithmic tweaks that would make algorithms less sensitive to robustness.
For example, riffing off going from an offline to online dataset, could we improve things further by collecting a dataset that anticipates where the RL process might try to exploit it? That sounds fanciful, but there's a simple hack you can do to get something like this. Just train a policy on the current reward model. Then collect human feedback from that policy. Then roll the policy back to the last checkpoint, and repeat the training using the new reward model. You could do this step once per checkpoint, or keep doing it until you get human approval to move on (e.g. the reward model now aligns with human feedback). In this way you should be able to avoid ever giving the policy reward for the wrong behavior. I suspect this process would be much less sample efficient than vanilla RLHF, but it would have better safety properties, and measuring how much slower it is could be a good proxy for how severe the "robustness tax" is.
Also note that vanilla RLHF doesn't necessarily require optimizing against a reward model. For instance, a recently released paper Direct Policy Optimization (DPO) avoids this step entirely.
I'm a bit confused by the claim here, although I've only read the abstract and skimmed the paper so perhaps it'd become obvious from a closer read. As far as I can tell, the cited paper focuses on motion-planning, and considers a rather restricted setting of LQR policies. This is a reasonable starting point, but a human communicating that a cart pole should stand up (or a desired quadcoptor trajectory) feels much simpler than even toy tasks for RLHF in LLMs like summarizing text. So I don't really see this as much evidence in favor of being able to drop the reward model. Generally, any highly sample efficient model-based RL approach would enable RL training to proceed without a reward model by having humans directly label the trajectory,
It can definitely be worth spending money when there's a clear case for it improving employee productivity. I will note there are a range of both norms and physical layouts compatible with open-plan, ranging from "everyone screaming at each other and in line of sight" trading floor to "no talking library vibes, desks facing walls with blinders". We've tried to make different open plan spaces zoned with different norms and this has been fairly successful, although I'm sure some people will still be disturbed by even library-style areas and be more productive in a private office.