Some people have suggested that a lot of the danger of training a powerful AI comes from reinforcement learning. Given an objective, RL will reinforce any method of achieving the objective that the model tries and finds to be successful including things like deceiving us or increasing its power.

If this were the case, then if we want to build a model with capability level X, it might make sense to try to train that model either without RL or with as little RL as possible. For example, we could attempt to achieve the objective using imitation learning instead. 

However, if, for example, the alternate was imitation learning, it would be possible to push back and argue that this is still a black-box that uses gradient descent so we would have no way of knowing that the internals were safe.

Would this be likely to lead to a safer model or is the risk mostly independent of RL?

Notes:

Obviously, someone could probably then apply RL to any such model in order to produce a more powerful model. And having a safe model of capacity level X doesn't save you from someone else building an unsafe model of capacity X unless you've got a plan of how to use the model to change the strategic situation.

But I think it's worth considering this question all the same, just in case some of the governance interventions end up bearing fruit and we do end up with the option to accept less powerful systems.

New Answer
New Comment

3 Answers sorted by

porby

7211

"RL" is a wide umbrella. In principle, you could even train a model with RL such that the gradients match supervised learning. "Avoid RL" is not the most directly specified path to the-thing-we-actually-want.

The source of spookiness

Consider two opposite extremes:

  1. A sparse, distant reward function. A biped must successfully climb a mountain 15 kilometers to the east before getting any reward at all.
  2. A densely shaped reward function. At every step during the climb up the mountain, there is a reward designed to induce gradients that maximize training performance. Every slight mispositioning of a toe is considered.

Clearly, number 2 is going to be easier to train, but it also constrains the solution space for the policy.

If number 1 somehow successfully trained, what's the probability that the solution it found would look like number 2's imitation data? What's the probability it would look anything like a bipedal gait? What's the probability it just exploits the physics simulation to launch itself across the world?

If you condition on a sparse, distant reward function training successfully, you should expect the implementation found by the optimizer to sample from a wide distribution of possible implementations that are compatible with the training environment.

It is sometimes difficult to predict what implementations are compatible with the environment. The more degrees of freedom exist in the environment, the more room the optimizer has to roam. That's where the spookiness comes from.

Is RL therefore spooky?

RL appears to make this spookiness more accessible. It's difficult to use (un)supervised learning in a way that gives a model great freedom of implementation; it's usually learning from a large suite of examples.

But there's a major constraint on RL: in the absence of sufficient environmental structure, reward shaping, or other sources of optimizer guidance, it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function. It simply won't sample the reward often enough to produce useful gradients.[1]

In other words, practical applications of RL are computationally bounded to a pretty limited degree of reward sparsity/distance. All the examples of "RL" doing interesting things that look like they involve sparse/distant reward involve enormous amounts of implicit structure of various kinds, like powerful world models.[2] 

Given these limitations, the added implementation-uncertainty of RL is usually not so massive that it's worth entirely banning it. Do be careful about what you're actually reinforcing, just as you must be careful with prompts or anything else, and if you somehow figure out a way to make from-scratch sparse/distant rewards work better without a hypercomputer, uh, be careful?

A note on offline versus online RL

The above implicitly assumes online RL, where the policy is able to learn from new data generated by the policy as it interacts with the environment.

Offline RL that learns from an immutable set of data does not allow the optimizer as much room to explore, and many of the apparent risks of RL are far less accessible.

Usage in practice

The important thing is that the artifact produced by a given optimization process falls within some acceptable bounds. Those bounds might arise from the environment, computability, or something else, but they're often available.

RL-as-it-can-actually-be-applied isn't that special here. The one suggestion I'd have is to try to use it in a principled way. For example: doing pretraining but inserting an additional RL-derived gradient to incentivize particular behaviors works, but it's just arbitrarily shoving a bias/precondition into the training. The result will be at some equilibrium between the pretraining influence and the RL influence. Perhaps the weighting could be chosen in an intentional way, but most such approaches are just ad hoc.

For comparison, you could elicit similar behavior by including a condition metatoken in the prompt (see decision transformers for an example). With that structure, you can be more explicit about what exactly the condition token is supposed to represent, and you can do fancy interpretability techniques to see what the condition is actually causing mechanistically.[3]

  1. ^

    If you could enumerate all possible policies with a hypercomputer and choose the one that performs the best on the specified reward function, that would train, and it would also cause infinite cosmic horror. If you have a hypercomputer, don't do that.

  2. ^

    Or in the case of RLHF on LLMs, the fine-tuning process is effectively just etching a precondition into the predictor, not building complex new functions. Current LLMs, being approximators of probabilistic inference to start with, have lots of very accessible machinery for this kind of conditioning process.

  3. ^

    There are other options here, but I find this implementation intuitive.

Oh this is a great way of laying it out. Agreed on many points, and I think this may have made some things easier for me to see, likely some of that is actual update that changes opinions I've shared before that you're disagreeing with. I'll have to ponder.

Oh, this is a fascinating perspective.

So most uses of RL already just use a small-bit of RL.

So if the goal was "only use a little bit of RL", that's already happening.

Hmm... I still wonder if using even less RL would be safer still.

[-]porby105

I do think that if you found a zero-RL path to the same (or better) endpoint, it would often imply that you've grasped something about the problem more deeply, and that would often imply greater safety.

Some applications of RL are also just worse than equivalent options. As a trivial example, using reward sampling to construct a gradient to match a supervised loss gradient is adding a bunch of clearly-pointless intermediate steps.

I suspect there are less trivial cases, like how a decision transformer isn't just learning an optimal policy for its dataset but rather a supertask: what different levels of performance look like on that task. By subsuming an RL-ish task in prediction, the predictor can/must develop a broader understanding of the task, and that understanding can interact with other parts of the greater model. While I can't currently point to strong empirical evidence here, my intuition would be that certain kinds of behavioral collapse would be avoided by the RL-via-predictor because the distribution is far more explicitly maintained during training.[1][2]

But there are often reasons why the more-RL-shaped thing is currently being used. It's not always trivial to swap over ... (read more)

All the examples of "RL" doing interesting things that look like they involve sparse/distant reward involve enormous amounts of implicit structure of various kinds, like powerful world models.

I guess when you say “powerful world models”, you’re suggesting that model-based RL (e.g. MuZero) is not RL but rather “RL”-in-scare-quotes. Was that your intention?

I’ve always thought of model-based RL is a central subcategory within RL, as opposed to an edge-case.

7johnswentworth
Personally, I consider model-based RL to be not RL at all. I claim that either one needs to consider model-based RL to be not RL at all, or one needs to accept such a broad definition of RL that the term is basically-useless (which I think is what porby is saying in response to this comment, i.e. "the category of RL is broad enough that it belonging to it does not constrain expectation much in the relevant way").
8Steven Byrnes
@nostalgebraist bites that bullet here: and so does Russell & Norvig 3rd edition …That’s not my perspective though. For my part, there’s a stereotypical core of things-I-call-RL which entails: * (A) There’s a notion of the AI’s outputs being better or worse * (B) But we don’t have ground truth (even after-the-fact) about what any particular output should ideally have been * (C) Therefore the system needs to do some kind of explore-exploit. By this (loose) definition, both model-based and model-free RL are central examples of “reinforcement learning”, whereas LLM self-supervised base models are not reinforcement learning (cf. (B), (C)), nor are ConvNet classifiers trained on ImageNet (ditto), nor are clustering algorithms (cf. (A), (C)), nor is A* or any other exhaustive search within a simple deterministic domain (cf. (C)), nor are VAEs (cf. (C)), etc. (A,B,C) is generally the situation you face if you want an AI to win at videogames or board games, control bodies while adapting to unpredictable injuries or terrain, write down math proofs, design chips, found companies, and so on. This (loose) definition of RL connects to AGI safety because (B-C) makes it harder to predict the outputs of an RL system. E.g. we can plausibly guess that an LLM base model, given internet-text-like prompts, will continue in an internet-text-typical way. Granted, given OOD prompts, it’s harder to say things a priori about the output. But that’s nothing compared to e.g. AlphaZero or AlphaStar, where we’re almost completely in the dark about what the trained model will do in any nontrivial game-state whatsoever. (…Then extrapolate the latter to human-level AGIs acting in the real world!) (That’s not an argument that “we’re doomed if AGI is based on RL”, but I do think that a very RL-centric AGI would need tailored approaches to thinking about safety and alignment that wouldn’t apply to LLMs; and I likewise think that likewise a massive increase in the scope and centrality of LLM-
6porby
Calling MuZero RL makes sense. The scare quotes are not meant to imply that it's not "real" RL, but rather that the category of RL is broad enough that it belonging to it does not constrain expectation much in the relevant way. The thing that actually matters is how much the optimizer can roam in ways that are inconsistent with the design intent. For example, MuZero can explore the superhuman play space during training, but it is guided by the structure of the game and how it is modeled. Because of that structure, we can be quite confident that the optimizer isn't going to wander down a path to general superintelligence with strong preferences about paperclips.
6Steven Byrnes
Right, and that wouldn’t apply to a model-based RL system that could learn an open-ended model of any aspect of the world and itself, right? I think your “it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function” should have some caveat that it only clearly applies to currently-known techniques. In the future there could be better automatic-world-model-builders, and/or future generic techniques to do automatic unsupervised reward-shaping for an arbitrary reward, such that AIs could find out-of-the-box ways to solve hard problems without handholding.
5porby
It does still apply, though what 'it' is here is a bit subtle. To be clear, I am not claiming that a technique that is reasonably describable as RL can't reach extreme capability in an open-ended environment. The precondition I included is important: In my frame, the potential future techniques you mention are forms of optimizer guidance. Again, that doesn't make them "fake RL," I just mean that they are not doing a truly unconstrained search, and I assert that this matters a lot. For example, take the earlier example of a hypercomputer that brute forces all bitstrings corresponding to policies and evaluates them to find the optimum with no further guidance required. Compare the solution space for that system to something that incrementally explores in directions guided by e.g. strong future LLM, or something. The RL system guided by a strong future LLM might achieve superhuman capability in open-ended domains, but the solution space is still strongly shaped by the structure available to the optimizer during training and it is possible to make much better guesses about where the optimizer will go at various points in its training. It's a spectrum. On one extreme, you have the universal-prior-like hypercomputer enumeration. On the other, stuff like supervised predictive training. In the middle, stuff like MuZero, but I argue MuZero (or its more open-ended future variants) is closer to the supervised side of things than the hypercomputer side of things in terms of how structured the optimizer's search is. The closer a training scheme is to the hypercomputer one in terms of a lack of optimizer guidance, the less likely it is that training will do anything at all in a finite amount of compute.
4Steven Byrnes
I agree that in the limit of an extremely structured optimizer, it will work in practice, and it will wind up following strategies that you can guess to some extent a priori. I also agree that in the limit of an extremely unstructured optimizer, it will not work in practice, but if it did, it will find out-of-the-box strategies that are difficult to guess a priori. But I disagree that there’s no possible RL system in between those extremes where you can have it both ways. On the contrary, I think it’s possible to design an optimizer which is structured enough to work well in practice, while simultaneously being unstructured enough that it will find out-of-the-box solutions very different from anything the programmers were imagining. Examples include: * MuZero: you can’t predict a priori what chess strategies a trained MuZero will wind up using by looking at the source code. The best you can do is say “MuZero is likely to use strategies that lead to its winning the game”. * “A civilization of humans” is another good example: I don’t think you can look at the human brain neural architecture and loss functions etc., and figure out a priori that a civilization of humans will wind up inventing nuclear weapons. Right?
4porby
I don't disagree. For clarity, I would make these claims, and I do not think they are in tension: 1. Something being called "RL" alone is not the relevant question for risk. It's how much space the optimizer has to roam. 2. MuZero-like strategies are free to explore more space than something like current applications of RLHF. Improved versions of these systems working in more general environments have the capacity to do surprising things and will tend to be less 'bound' in expectation than RLHF. Because of that extra space, these approaches are more concerning in a fully general and open-ended environment. 3. MuZero-like strategies remain very distant from a brute-forced policy search, and that difference matters a lot in practice. 4. Regardless of the category of the technique, safe use requires understanding the scope of its optimization. This is not the same as knowing what specific strategies it will use. For example, despite finding unforeseen strategies, you can reasonably claim that MuZero (in its original form and application) will not be deceptively aligned to its task. 5. Not all applications of tractable RL-like algorithms are safe or wise. 6. There do exist safe applications of RL-like algorithms.

thinking at the level of constraints is useful. very sparse rewards offer less constraints on final solution. imitation would offer a lot of constraints (within distribution and assuming very low loss).

a way to see RL/supervised distinction dissolve is to convert back and forth. With a reward as negative token prediction loss, and actions being the set of tokens, we can simulate auto-regressive training with RL (as mentioned by @porby). conversely, you could first train RL policy and then imitate that (in which case why would imitator be any safer?). ... (read more)

ryan_greenblatt

166

If you avoid using RL, then you might need a much "smarter" model for a given level of usefulness.

And even without RL, you need to be getting bits of selection from somewhere: to get useful behavior you have to at the very least specify what useful behavior would be (though the absolute minimum number of bits would be very small given a knowledgable model). (So some selection or steering is surely required, but you might hope this selection/steering is safer for some reason or perhaps more interpretable (like e.g. prompting can in principle be).)

Dramatically cutting down on RL might imply that you need a much, much smarter model overall. (For instance, the safety proposal discussed in "conditioning predictive models" seems to me like it would require a dramatically smarter model than would be required if you used RL normally (if this stuff worked at all).)

Given that a high fraction of the concern (IMO) is proportional to how smart your model is, needing a much smarter model seems very concerning.

Ok, so cutting RL can come with costs, what about the benefits to cutting RL? I think the main concern with RL is that it either teaches the model things that we didn't actually need and which are dangerous or that it gives it dangerous habits/propensities. For instance, it might teach models to consider extremely creative strategies which humans would have never thought of and which humans don't at all understand. It's not clear we need this to do extremely useful things with AIs. Another concern is that some types of outcome-based RL will teach the AI to cleverly exploit our reward provisioning process which results in a bunch of problems.

But, there is a bunch of somewhat dangerous stuff that RL teaches which seems clearly needed for high usefulness. So, if we fix the level of usefulness, this stuff has to be taught to the model by something. For instance, being a competent agent that is at least somewhat aware of its own abilities is probably required. So, when thinking about cutting RL, I don't think you should be thinking about cutting agentic capabilities as that is very likely required.

My guess is that much more of the action is not in "how much RL", but is instead in "how much RL of the type that seems particular dangerous and which didn't result in massive increases in usefulness". (Which mirrors porby's answer to some extent.)

In particular we'd like to avoid:

  1. RL that will result in AIs learning to pursue clever strategies that humans don't understand or at least wouldn't think of. (Very inhuman strategies.) (See also porby's answer which seems basically reasonable to me.)
  2. RL on exploitable outcome-based feedback that results in the AI actually doing the exploitation a non-trivial fraction of the time.

(Weakly exploitable human feedback without the use of outcomes (e.g. the case where the human reviews the full trajectory and rates how good it seems overall) seems slightly concerning, but much less concerning overall. Weak exploitation could be things like sycophancy or knowing when to lie/deceive to get somewhat higher performance.)

Then the question is just how much of a usefulness tax it is to cut back on these types of RL, and then whether this usefulness tax is worth it given that it implies we have to have a smarter model overall to reach a fixed level of usefulness.

(Type (1) of RL from the above list is eventually required for AIs with general purpose qualitatively wildly superhuman capabilities (e.g. the ability to execute very powerful strategies that humans have a very hard time understanding) , but we can probably get done almost everything we want without such powerful models.)

My guess is that in the absence of safety concerns, society will do too much of these concerning types of RL, but might actually do too little of safer types of RL that help to elicit capabilities (because it is easier to just scale up the model further than to figure out how to maximally elicit capabilities).

(Note that my response ignores the cost of training "smarter" models and just focuses on hitting a given level of usefulness as this seems to be the requested analysis in the question.)

You mention that society may do too little of the safer types of RL. Can you clarify what you mean by this?

5ryan_greenblatt
In brief: large amounts of high quality process based RL might result in AI being more useful earlier (prior to them becoming much smarter). This might be expensive and annoying (e.g. it might require huge amounts of high quality human labor) such that by default labs do less of this relative to just scaling up models than would be optimal from a safety perspective.

Seth Herd

32

Compared to what?

If you want an agentic system (and I think many humans do, because agents can get things done), you've got to give it goals somehow. RL is one way to do that. The question of whether that's less safe isn't meaningful without comparing it to another method of giving it goals.

The method I think is both safer and implementable is giving goals in natural language, to a system that primarily "thinks" in natural language. I think this is markedly safer than any RL proposal anyone has come up with so far. And there are some other options for specifying goals without using RL, each of which does seem safer to me:

Goals selected from learned knowledge: an alternative to RL alignment

I think it's still valid to ask in the abstract whether RL is a particularly dangerous approach to training an AI system.

4Seth Herd
Surely asking if anything is safer is only sensible when comparing it to something. Are you comparing it to some implicit expected-if-not RL method of alignment? I don't think we have a commonly shared concept of what that would be. That's why I'm pointing to some explicit alternatives in that post.
4Chris_Leong
I've heard people suggest that they have arguments related to RL being particularly dangerous, although I have to admit that I'm struggling to find these arguments at the moment. I don't know, perhaps that helps clarify why I've framed the question the way that I've framed it?
6Steven Byrnes
I sometimes say things kinda like that, e.g. here.
6Seth Herd
I agree, I have heard that claim many times, probably including the vague claim that it's "more dangerous" than a poorly-defined imagined alternative. A bunch of pessimistic stuff in the vein of List of Lethalities focuses on reinforcement learning, analyzing how and why that is likely to go wrong. That's what started me thinking about true alternatives. So yes, that does clarify why you've framed it that way. And I think it's a useful question. In fact, I would've been prone to say "RL is unsafe and shouldn't be used". Porby's answer to your question is insightful; it notes that other types of learning aren't that different in kind. It depends how the RL or other learning is done. One reason that non-RL approaches (at least the few I know of) seem safer is that they're relying on prediction or other unsupervised learning to create good, reliable representations of the world, including goals for agents. That type of learning is typically better because you can do more of it. You don't need either a limited set of human-labeled data, which is always many orders of magnititude scarcer than data gathered from sensing the world (e.g., language input for LLMs, images for vision, etc). The other alternative is having a reward-labeling algorithm which can attach reward signals to any data, but that seems unreliable in that we don't have even good guesses on an algorithm that can identify human values or even reliable instruction-following.