ryan_greenblatt

Sorry, thanks for the correction.

I personally disagree on this being a good benchmark for outer alignment for various reasons, but it's good to understand the intention.

This is pretty close to my understanding, with one important objection.

Thanks for responding and trying to engage with my perspective.

Objection

If we repeat this iterative process enough times, we'll end up with a robust reward model.

I don't claim we'll necessarily ever get a fully robust reward model, just that the reward model will be mostly robust on average to the actual policy you use as long as human feedback is provided at a frequent enough interval. We never needed a good robust reward model which works on every input, we just needed a reward model which captured our local preferences about the actual current policy distribution sufficiently well.

At any given point, the reward model will be vulnerable to arbitrary adversarial attacks under sufficient optimization pressure, but we don't need arbitrary optimization against any given reward model. Like, each human update lets you optimize a bit more against the reward model which gets you the ability to get somewhat closer to the policy you actually want.

KL penalty is presumably an important part of the picture.
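As a concrete sketch of how a KL penalty limits optimization pressure against the reward model (the function name, variable names, and beta value below are illustrative assumptions, not from any particular implementation):

```python
# Hedged sketch of the standard KL-penalized RLHF reward: the policy is
# rewarded by the reward model's score but penalized for drifting away
# from a frozen reference policy.

def kl_penalized_reward(rm_score, logprob_policy, logprob_ref, beta=0.1):
    # Per-sample Monte Carlo estimate of KL(policy || reference):
    # log pi(y|x) - log pi_ref(y|x), evaluated on the sampled output y.
    kl_estimate = logprob_policy - logprob_ref
    return rm_score - beta * kl_estimate

# If the policy drifts toward outputs the reference model finds unlikely
# (a symptom of over-optimizing the reward model), the penalty grows,
# which bounds how far any single round of optimization can push.
reward = kl_penalized_reward(rm_score=1.0, logprob_policy=-2.0, logprob_ref=-2.5)
```

The penalty term is what makes each round of optimization "a bit more" rather than arbitrary pressure against the current reward model.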

My overall claim is just that normal RLHF is pretty fault tolerant and degrades pretty gracefully with lack of robustness.

Sample efficiency seems fine now and progress continues

Beyond this, my view is considerably informed by 'sample efficiency seems fine in practice now and gets better, not worse, with more powerful models (from my limited understanding)'. It's possible this trend could reverse with sufficiently big models, but I'd find that surprising and don't see any particular reason to expect it. Technical advancement in RLHF is ongoing and will improve sample efficiency further.

It seems to me like your views imply either that sample efficiency is currently low enough that high-quality RLHF can't compete with other, less safe ways of training AIs which have cheaper reward signals (e.g., heavy training on automated outcome-based feedback or a low-quality human reward signal), or that this will happen in the future as models get more powerful (reversing the current trend toward better sample efficiency). My understanding is that RLHF is quite commercially popular and sample efficiency isn't a huge issue. Perhaps I'm missing some important gap between how RLHF is used now and how it will have to be used in the future to align powerful systems?

Also note that vanilla RLHF doesn't necessarily require optimizing against a reward model. For instance, a recently released paper, Direct Preference Optimization (DPO), avoids this step entirely. It's unclear whether DPO will win out for various reasons, but as far as I can tell, you should think that DPO would totally resolve the robustness concern with vanilla RLHF. Of course, recursive oversight schemes and other approaches which use AIs to help oversee AIs could still involve optimizing against AIs.
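For reference, the core of the DPO objective (as I understand the paper) fits in a few lines, with no reward model to optimize against and no RL loop; the names below are illustrative:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # DPO pushes up the policy's log-ratio (vs. a frozen reference model)
    # on the human-preferred completion relative to the rejected one.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# When the policy exactly matches the reference, the margin is zero and the
# loss is log(2); favoring the chosen completion lowers the loss.
```

Because the only optimization target is a supervised loss on human preference pairs, there is no separate learned reward model for the policy to adversarially exploit.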

I'd like to register that I disagree with the claim that standard online RLHF requires adversarial robustness in AIs per se. (I agree that it requires that humans be adversarially robust to the AI, but this is a pretty different problem.)

In particular, the place where adversarial robustness shows up is in sample efficiency. So, poor adversarial robustness is equivalent to poor sample efficiency. My understanding is that the trend is toward higher not lower sample efficiency with scale, so this seems on track.

This same reasoning also applies to recursive oversight schemes or debate.

Overall, my guess is that commercial incentives and normal scaling will cause the sample efficiencies of these processes to be quite high in practice. Additionally, my guess would be that improving adversarial robustness isn't the best way to push on sample efficiency.

That said, I do think there are many important reasons to improve adversarial robustness.

If I build a chatbot, and I can't jailbreak it, how do I determine whether that's because the chatbot is secure or because I'm bad at jailbreaking? How should AI scientists overcome Schneier's Law of LLMs?

FWIW, I think there aren't currently good benchmarks for alignment and the ones you list aren't very relevant.

In particular, MMLU and SWAG are both just capability benchmarks where alignment training is very unlikely to improve performance. (Alignment-ish training could theoretically improve performance by making the model 'actually try', but what people currently call alignment training doesn't improve performance for existing models.)

The MACHIAVELLI benchmark is aiming to test something much narrower than 'how unethical is an LLM?'. (I also don't understand the point of this benchmark after spending a bit of time reading the paper, but I'm confident it isn't trying to do this.) Edit: it looks like Dan H (one of the authors) says that the benchmark is aiming to test something as broad as 'how unethical is an LLM?' and to generally check outer alignment. Sorry for the error. I personally don't think this is a good test for outer alignment (for reasons I won't get into right now), but that is what it's aiming to do.

TruthfulQA is perhaps the closest to an alignment benchmark, but it's still covering a very particular difficulty. And it certainly isn't highlighting jailbreaks.

Oops, somehow I missed that context.

Thanks for the clarification.

Does This Make Any Sense?

I'm confused - it looks like the first paragraph of this section is taken from a prior post on attribution patching.

When I try to interpret your points here, I come to the conclusion that you think humans, upon reflection, would cause human extinction (in favor of resources being used for something else).

Or at least that many/most humans would, upon reflection, prefer resources to be used for purposes other than preserving human life (including not preserving human life in simulation). And this holds even if (some of) the existing humans 'want' to be preserved (at least according to a conventional notion of preferences).

I think this empirical view seems pretty implausible.

That said, I think it's quite plausible that upon reflection, I'd want to 'wink out' any existing copies of myself in favor of using resources for better things. But this is partially because I personally (in my current state) would endorse such a thing: if my extrapolated volition thought it would be better not to exist (in favor of other resource usage), my current self would accept that. And I think it currently seems unlikely that upon reflection, I'd want to end all human lives (in particular, I think I probably would want to keep alive those humans who had preferences against non-existence). This applies regardless of trade; it's important to note this to avoid a 'perpetual motion machine' type argument.

Beyond this, I think that most or many humans or aliens would, upon reflection, want to preserve currently existing humans or aliens who had a preference against non-existence. (Again, regardless of trade.)

Additionally, I think it's quite plausible that most or many humans or aliens will enact various trades or precommitments prior to reflecting (which is probably ill-advised, but it will happen regardless). So current preferences which aren't stable under reflection might have a significant influence overall.

I would be more sympathetic if you made a move like, "I'll accept continuity through the human range of intelligence, and that we'll only have to align systems as collectively powerful as humans, but I still think that hands-on experience is only..." In particular, I think there is a real disagreement about the relative value of experimenting on future dangerous systems instead of working on theory or trying to carefully construct analogous situations today by thinking in detail about alignment difficulties in the future.

Here are some views, often held in a cluster:

I'm not sure exactly which clusters you're referring to, but I'll just assume that you're pointing to something like "people who aren't very into the sharp left turn and think that iterative, carefully bootstrapped alignment is a plausible strategy." If this isn't what you were trying to highlight, I apologize. The rest of this comment might not be very relevant in that case.

To me, the views you listed here feel like a straw man or weak man of this perspective.

Furthermore, I think the actual crux is more often "prior to having to align systems that are collectively much more powerful than humans, we'll only have to align systems that are somewhat more powerful than humans." This is essentially the crux you highlight in A Case for the Least Forgiving Take On Alignment. I believe disagreements about hands-on experience are quite downstream of this crux: I don't think people with reasonable views (not weak men) believe that "without prior access to powerful AIs, humans will need to align AIs that are vastly, vastly superhuman, but this will be fine because these AIs will need lots of slow, hands-on experience in the world to do powerful stuff (like nanotech)."

So, discussing how well superintelligent AIs can operate from first principles seems mostly irrelevant to this discussion (if by superintelligent AI, you mean something much, much smarter than the human range).

Pitting two models against each other in a zero-sum competition only works so long as both models actually learn the desired goals. Otherwise, they may be able to reach a compromise with each other and cooperate towards a non-zero-sum objective.

If training works well, then they can't collude on average during training, only rarely or in some sustained burst prior to training crushing these failures.

In particular, in the purely supervised case with gradient descent, performing poorly on average during training requires gradient hacking (or more benign failures of gradient descent, but it's unclear why the goals of the AIs would be particularly relevant in that case).

In the RL case, it requires exploration hacking (or benign failures as in the gradient case).

The only way to prevent this is for the decision maker to assign some probability to all possible actions, regardless of how bad the predicted outcome is. This necessarily means bad outcomes will occur more frequently than they would if they could make deterministic decisions based on honest conditional predictions. We might reasonably say we don’t want to ever randomly take an action that leads to the extinction of humanity with high probability, but if this is true then a predictor can lie about that to dissuade us from any action. Even if we would be willing to take such an action with very small probability in order to get honest conditional predictions, we likely cannot commit to following through on such an action if our randomizer lands on it.

Thinking about this in terms of precommitment seems to me like it's presupposing that the AI perfectly optimizes the training objective in some deep sense (which seems implausible to me). The reason why this exploration procedure works is presumably that you end up selecting such actions frequently during training which in turn selects for AIs which perform well. Epsilon exploration only works if you sample the epsilon. So, it doesn't work if you set the epsilon to 1e-40 or something.
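To illustrate the sampling point with a hypothetical epsilon-greedy decision rule (all names here are illustrative): if epsilon is set to something like 1e-40, the exploring branch is essentially never actually sampled, so training never observes the committed-to action and the commitment does no selection work.

```python
import random

def choose_action(predicted_values, epsilon, rng):
    # With probability epsilon, take a uniformly random action ("explore");
    # otherwise take the action with the best predicted outcome ("exploit").
    if rng.random() < epsilon:
        return rng.randrange(len(predicted_values))
    return max(range(len(predicted_values)), key=lambda i: predicted_values[i])

rng = random.Random(0)
values = [0.1, 0.9, 0.3]
# At epsilon = 1e-40, rng.random() (granularity about 2**-53) is essentially
# never below epsilon, so the exploration clause effectively never fires
# over any realistic number of training steps.
picks = [choose_action(values, 1e-40, rng) for _ in range(10_000)]
```

The selection pressure toward honest predictions only exists for actions that actually get sampled during training, which is why the effective floor on epsilon matters.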
