AdamGleave

Wiki Contributions

Comments

To check my understanding, is your view something like:

  1. If the reward model isn't adversarially robust, then the RL component of RLHF will exploit it.

  2. These generations will show up in the data presented to the human. Provided the human is adversarially robust, then the human feedback will provide corrective signal to the reward model.

  3. The reward model will stop being vulnerable to those adversarial examples, although may still be vulnerable to other adversarial examples.

  4. If we repeat this iterative process enough times, we'll end up with a robust reward model.

Under this model, improving adversarial robustness just means we need fewer iterations, showing up as improved sample efficiency.

I agree with this view up to a point. It does seem likely that with sufficient iterations, you'd get an accurate reward model. However, I think the difference in sample efficiency could be profound: e.g exponential (needing to get explicit corrective human feedback for most adversarial examples) vs linear (generalizing in the right way from human feedback). In that scenario, we may as well just ditch the reward model and provide training signal to the policy directly from human feedback.

In practice, we've seen that adversarial training (with practical amounts of compute) improves robustness but models are still very much vulnerable to attacks. I don't see why RLHF's implicit adversarial training would end up doing better than explicit adversarial training.

In general I tend to view sample efficiency discussions as tricky without some quantitative comparison. There's a sense in which decision trees and a variety of other simple learning algorithms are a viable approach to AGI, they're just very sample (and compute) inefficient.

The main reason I can see why RLHF may not need adversarial robustness is if the KL penalty from base model approach people currently use is actually enough.

Yes, thanks for spotting my typo! ($2.75 psf isn't crazy for Berkeley after negotiation, but is not something I've ever seen as a list price.)

To compare this to other costs, renting two floors of the WeWork, which we did for most of the summer last year, cost around $1.2M/yr for 14,000 sq. ft. of office space. The Rose Garden has 20,000 sq. ft. of floor space and 20,000 additional sq. ft. of usable outdoor space for less implied annual cost than that.

I'm sympathetic to the high-level claim that owning property usually beats renting if you're committing for a long time period. But the comparison with WeWork seems odd: WeWork specializes in providing short-term, serviced office space and does so at a substantial premium to the more traditional long-term, unserviced commercial real estate contract. When we were looking for office space in Berkeley earlier this year we were seeing list price between $3.25-$3.75/month per square foot, or $780k-900k/year for 20,000 square feet. I'd expect with negotiation you could get somewhat better pricing than this implies, especially if committing to a longer time period.

Of course, the extra outdoor space, mixed-use zoning and ability to highly customize the space may well offset this. But it starts depending a lot more on the details (e.g. how often is the outdoor space used; how much more productive are people in a customized space vs a traditional office) than it might first seem.

This is a good point, adversarial examples in what I called in the post the "main" ML system can be desirable even though we typically don't want them in the "helper" ML systems used to align the main system.

One downside to adversarial vulnerability of the main ML system is that it could be exploited by bad actors (whether human or other, misaligned AIs). But this might be fine in some settings: e.g. you build some airgapped system that helps you build the next, more robust and aligned AI. One could also imagine crafting adversarial example backdoors that are cryptographically hard to discover if you don't know how they were constructed.

I generally expect that if adversarial robustness can be practically solved then transformative AI systems will eventually self-improve themselves to the point of being robust. So, the window where AI systems are dangerous & deceptive enough that we need to test them using adversarial examples but not capable enough to have overcome this might be quite short. Could still be useful as an early-warning sign, though.

Right: if the agent has learned an inner objective of "do things similar to what humans do in the world at the moment I am currently acting", then it'd definitely be incentivized to do that. It's not rewarded by the outer objective for e.g. behavioral cloning on a fixed dataset, as installing bunch of cameras would be punished by that loss (not something humans do) and changing human behavior wouldn't help as BC would still be on the dataset of pre-manipulation demos. That might be little comfort if you're worried about inner optimization, but most the other failures described happen even in the outer alignment case.

That said, if you had a different imitation learning setup that was something like doing RL on a reward of "do the same thing one of our human labelers chooses given the same state" then the outer objective would reward what the behavior you describe. It'd be a hard exploration problem for the agent to learn to exploit the reward in that way, but it quite probably could do so if situationally aware.

Thanks, I'd missed that!

Curious if you have any high-level takeaways from that? Bigger models do better, clearly, but e.g. how low do you think we'll be able to get the error rate in the next 5-10 years given expected compute growth? Are there any follow-up experiments you'd like to see happen in this space?

Also could you clarify whether the setting was for adversarial training or just a vanilla model? "During training, adversarial examples for training are constructed by PGD attacker of 30 iterations" makes me think it's adversarial training but I could imagine this just being used for evals.

Rachel did the bulk of the work on this post (well-done!), I just provided some advise on the project and feedback on earlier manuscripts.

I wanted to share why I'm personally excited by this work in case it helps contextualize it for others.

We'd all like AI systems to be "corrigible", cooperating with us in correcting them. Cooperative IRL has been proposed as a solution to this. Indeed Dylan Hadfield-Menell et al show that CIRL is provably corrigible in a simple setting, the off-switch game.

Provably corrigible sounds great, but where there's a proof there's also an assumption, and Carey et al soon pointed out a number of other assumptions under which this no longer holds -- e.g. if there is model misspecification causing the incorrect probability distribution to be computed.

That's a real problem, but every method can fail if you implement it wrongly (although some are more fragile than others), so this didn't exactly lead to people giving up on the CIRL framework. Recently Shah et al described various benefits they see of CIRL (or "assistance games") over reward learning, though this doesn't address the corrigibility question head on.

A lot of the corrigibility properties of CIRL come from uncertainty: it wants to defer to a human because the human knows more about its preferences than the robot. Recently, Yudkowsky and others described the problem of fully updated deference: if the AI has learned everything it can, it may have no uncertainty, at which point this corrigibility goes away. If the AI has learned your preferences perfectly, perhaps this is OK. But here Carey's critique of model misspecification rears its head again -- if the AI is convinced you love vanilla ice cream, saying "please no give me chocolate" will not convince it (perhaps it thinks you have a cognitive bias against admitting your plain, vanilla preferences -- it knows the real you), whereas it might if it had uncertainty.

I think the prevailing view on this forum is to be pretty down on CIRL because its not corrigible. But I'm not convinced corrigibility in the strict sense is even attainable or desirable. In this post, we outline a bunch of examples of corrigible behavior that I would absolutely not want in an assistant -- like asking me for approval before every minor action! By contrast, the near-corrigible behavior -- asking me only when the robot has genuine uncertainty -- seems more desirable... so long as the robot has calibrated uncertainty. To me, CIRL and corrigibility seem like two extremes: CIRL is focusing on maximizing human reward, whereas corrigibility is focused on avoiding ever doing the wrong thing even under model misspecification. In practice, we need a bit of both -- but I don't think we have a good theoretical framework for that yet.

In addition to that, I hope this post serves as a useful framework to ground future discussions on this. Unfortunately I think there's been an awful lot of talking past each other in debates on this topic in the past. For example, to the best of my knowledge, Hadfield-Menell and other authors of CIRL never believed it solved corrigibility under the assumptions Carey introduced. Although our framework is toy, I think it captures the key assumptions people disagree about, and it can be easily extended to capture more as needed in future discussions.

I'm excited by many of the interventions you describe but largely for reasons other than buying time. I'd expect buying time to be quite hard, in so far as it requires coordinating to prevent many actors from stopping doing something they're incentivized to do. Whereas since alignment research community is small, doubling it is relatively easy. However, it's ultimately a point in favor of the interventions that they look promising under multiple worldviews, but it might lead me to prioritize within them differently to you.

One area I would push back on is the skills you describe as being valuable for "buying time" seem like a laundry list for success in research in general, especially empirical ML research:

Skills that seem uniquely valuable for buying time interventions: general researcher aptitudes, ability to take existing ideas and strengthen them, experimental design skills, ability to iterate in response to feedback, ability to build on the ideas of others, ability to draw connections between ideas, experience conducting “typical ML research,” strong models of ML/capabilities researchers, strong communication skills

It seems pretty bad for the people strongest at empirical ML research to stop doing alignment research. Even if we pessimistically assume that empirical research now is useless (which I'd strongly disagree with), surely we need excellent empirical ML researchers to actually implement the ideas you hope the people who can "generate and formalize novel ideas" come up with. There are a few aspects of this (like communication skills) that do seem to differentially point in favor of "buying time", maybe have a shorter and more curated list in future?

Separately given your fairly expansive list of things that "buy time" I'd have estimated that close to 50% of the alignment community are already doing this -- even if they believe their primary route to impact is more direct. For example, I think most people working on safety at AGI labs would count under your definition: they can help convince decision-makers in the lab not to deploy unsafe AI systems, buying us time. A lot of the work on safety benchmarks or empirical demonstrations of failure modes falls into this category as well. Personally I'm concerned people are falling into this category of work by default and that there's too much of this, although I do think when done well it can be very powerful.

I agree that in a fast takeoff scenario there's little reason for an AI system to operate withing existing societal structures, as it can outgrow them quicker than society can adapt. I'm personally fairly skeptical of fast takeoff (<6 months say) but quite worried that society may be slow enough to adapt that even years of gradual progress with a clear sign that transformative AI is on the horizon may be insufficient.

In terms of humans "owning" the economy but still having trouble getting what they want, it's not obvious this is a worse outcome than the society we have today. Indeed this feels like a pretty natural progression of human society. Humans already interact with (and not so infrequently get tricked or exploited by) entities smarter than them such as large corporations or nation states. Yet even though I sometimes find I've bought a dud on the basis of canny marketing, overall I'm much better off living in a modern capitalist economy than the stone age where humans were more directly in control.

However, it does seem like there's a lot of value lost in the scenario where humans become increasingly disempowered, even if their lives are still better than in 2022. From a total utilitarian perspective, "slightly better than 2022" and "all humans dead" are rounding errors relative to "possible future human flourishing". But things look quite different under other ethical views, so I'm reluctant to conflate these outcomes.

Thanks for this response, I'm glad to see more public debate on this!

The part of Katja's part C that I found most compelling was the argument that for a given AI system its best interests might be to work within the system rather than aiming to seize power. Your response argues that even if this holds true for AI systems that are only slightly superhuman, eventually we will cross a threshold where a single AI system can takeover. This seems true if we hold the world fixed -- there is some sufficiently capable AI system that can take over the 2022 world. But this capability threshold is a moving target: humanity will get better at aligning and controlling AI systems as we gain more experience with them, and we may be able to enlist the help of AI systems to keep others in check. So, why should we expect the equilibrium here to be an AI takeover, rather than AIs working for humans because that it is in their selfish best interest in a market economy where humans are currently the primary property owner?

I think the crux here is whether we expect AI systems to by default collude with one another. They might -- they have a lot of things in common that humans don't, especially if they're copies of one another! But coordination in general is hard, especially if it has to be surreptitious.

As an analogy, I could argue that for much of human history soldiers were only slightly more capable than civilians. Sure, a trained soldier with a shield and sword is a fearsome opponent, but a small group of coordinated civilians could be victorious. Yet as we develop more sophisticated weapons such as guns, cannons, missiles, the power that a single soldier has grows greater and greater. So, by your argument, eventually a single soldier will be powerful enough to take over the world.

This isn't totally fanciful -- the Spanish conquest of the Inca Empire started with just 168 soldiers! The Spanish fought with swords, crossbows, and lances -- if the Inca Empire were still around, it seems likely that a far smaller modern military force could defeat them. Yet, clearly no single soldier is in a position to take over the world, or even a small city. Military coup d'etats are the closest, but involve convincing a significant fraction of the military that is in their interest to seize power. Of course most soldiers wish to serve their nation, not seize power, which goes some way to explaining the relatively low rate of coup attempts. But it's also notable that many coup attempts fail, or at least do not lead to a stable military dictatorship, precisely because of difficulty of internal coordination. After all, if someone intends to destroy the current power structure and violate their promises, how much can you trust that they'll really have your back if you support them?

An interesting consequence of this is that it's ambiguous whether making AI more cooperative makes the situation better or worse.

Load More