porby - LessWrong

I sometimes post experiment ideas on my shortform. If you see one that seems exciting and you want to try it, great! Please send me a message so we can coordinate and avoid doing redundant work.

Does reducing the amount of RL for a given capability level make AI safer?

porby3mo41

But I disagree that there’s no possible RL system in between those extremes where you can have it both ways.

I don't disagree. For clarity, I would make these claims, and I do not think they are in tension:

Something being called "RL" alone is not the relevant question for risk. It's how much space the optimizer has to roam.
MuZero-like strategies are free to explore more space than something like current applications of RLHF. Improved versions of these systems working in more general environments have the capacity to do surprising things and will tend to be less 'bound' in expectation than RLHF. Because of that extra space, these approaches are more concerning in a fully general and open-ended environment.
MuZero-like strategies remain very distant from a brute-forced policy search, and that difference matters a lot in practice.
Regardless of the category of the technique, safe use requires understanding the scope of its optimization. This is not the same as knowing what specific strategies it will use. For example, despite finding unforeseen strategies, you can reasonably claim that MuZero (in its original form and application) will not be deceptively aligned to its task.
Not all applications of tractable RL-like algorithms are safe or wise.
There do exist safe applications of RL-like algorithms.

Does reducing the amount of RL for a given capability level make AI safer?

porby3mo51

It does still apply, though what 'it' is here is a bit subtle. To be clear, I am not claiming that a technique that is reasonably describable as RL can't reach extreme capability in an open-ended environment.

The precondition I included is important:

in the absence of sufficient environmental structure, reward shaping, or other sources of optimizer guidance, it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function

In my frame, the potential future techniques you mention are forms of optimizer guidance. Again, that doesn't make them "fake RL," I just mean that they are not doing a truly unconstrained search, and I assert that this matters a lot.

For example, take the earlier example of a hypercomputer that brute forces all bitstrings corresponding to policies and evaluates them to find the optimum with no further guidance required. Compare the solution space for that system to something that incrementally explores in directions guided by e.g. strong future LLM, or something. The RL system guided by a strong future LLM might achieve superhuman capability in open-ended domains, but the solution space is still strongly shaped by the structure available to the optimizer during training and it is possible to make much better guesses about where the optimizer will go at various points in its training.

It's a spectrum. On one extreme, you have the universal-prior-like hypercomputer enumeration. On the other, stuff like supervised predictive training. In the middle, stuff like MuZero, but I argue MuZero (or its more open-ended future variants) is closer to the supervised side of things than the hypercomputer side of things in terms of how structured the optimizer's search is. The closer a training scheme is to the hypercomputer one in terms of a lack of optimizer guidance, the less likely it is that training will do anything at all in a finite amount of compute.

Does reducing the amount of RL for a given capability level make AI safer?

porby3mo61

Calling MuZero RL makes sense. The scare quotes are not meant to imply that it's not "real" RL, but rather that the category of RL is broad enough that it belonging to it does not constrain expectation much in the relevant way. The thing that actually matters is how much the optimizer can roam in ways that are inconsistent with the design intent.

For example, MuZero can explore the superhuman play space during training, but it is guided by the structure of the game and how it is modeled. Because of that structure, we can be quite confident that the optimizer isn't going to wander down a path to general superintelligence with strong preferences about paperclips.

Does reducing the amount of RL for a given capability level make AI safer?

porby3mo105

I do think that if you found a zero-RL path to the same (or better) endpoint, it would often imply that you've grasped something about the problem more deeply, and that would often imply greater safety.

Some applications of RL are also just worse than equivalent options. As a trivial example, using reward sampling to construct a gradient to match a supervised loss gradient is adding a bunch of clearly-pointless intermediate steps.

I suspect there are less trivial cases, like how a decision transformer isn't just learning an optimal policy for its dataset but rather a supertask: what different levels of performance look like on that task. By subsuming an RL-ish task in prediction, the predictor can/must develop a broader understanding of the task, and that understanding can interact with other parts of the greater model. While I can't currently point to strong empirical evidence here, my intuition would be that certain kinds of behavioral collapse would be avoided by the RL-via-predictor because the distribution is far more explicitly maintained during training.^[1]^[2]

But there are often reasons why the more-RL-shaped thing is currently being used. It's not always trivial to swap over to something with some potential theoretical benefits when training at scale. So long as the RL-ish stuff fits within some reasonable bounds, I'm pretty okay with it and would treat it as a sufficiently low probability threat that you would want to be very careful about how you replaced it, because the alternative might be sneakily worse.^[3]

^{^}
KL divergence penalties are one thing, but it's hard to do better than the loss directly forcing adherence to the distribution.
^{^}
You can also make a far more direct argument about model-level goal agnosticism in the context of prediction.
^{^}
I don't think this is likely, to be clear. They're just both pretty low probability concerns (provided the optimization space is well-constrained).

Does reducing the amount of RL for a given capability level make AI safer?

Answer by porbyMay 05, 20247211

"RL" is a wide umbrella. In principle, you could even train a model with RL such that the gradients match supervised learning. "Avoid RL" is not the most directly specified path to the-thing-we-actually-want.

The source of spookiness

Consider two opposite extremes:

A sparse, distant reward function. A biped must successfully climb a mountain 15 kilometers to the east before getting any reward at all.
A densely shaped reward function. At every step during the climb up the mountain, there is a reward designed to induce gradients that maximize training performance. Every slight mispositioning of a toe is considered.

Clearly, number 2 is going to be easier to train, but it also constrains the solution space for the policy.

If number 1 somehow successfully trained, what's the probability that the solution it found would look like number 2's imitation data? What's the probability it would look anything like a bipedal gait? What's the probability it just exploits the physics simulation to launch itself across the world?

If you condition on a sparse, distant reward function training successfully, you should expect the implementation found by the optimizer to sample from a wide distribution of possible implementations that are compatible with the training environment.

It is sometimes difficult to predict what implementations are compatible with the environment. The more degrees of freedom exist in the environment, the more room the optimizer has to roam. That's where the spookiness comes from.

Is RL therefore spooky?

RL appears to make this spookiness more accessible. It's difficult to use (un)supervised learning in a way that gives a model great freedom of implementation; it's usually learning from a large suite of examples.

But there's a major constraint on RL: in the absence of sufficient environmental structure, reward shaping, or other sources of optimizer guidance, it is nearly impossible for any computationally tractable optimizer to find any implementation for a sparse/distant reward function. It simply won't sample the reward often enough to produce useful gradients.^[1]

In other words, practical applications of RL are computationally bounded to a pretty limited degree of reward sparsity/distance. All the examples of "RL" doing interesting things that look like they involve sparse/distant reward involve enormous amounts of implicit structure of various kinds, like powerful world models.^[2]

Given these limitations, the added implementation-uncertainty of RL is usually not so massive that it's worth entirely banning it. Do be careful about what you're actually reinforcing, just as you must be careful with prompts or anything else, and if you somehow figure out a way to make from-scratch sparse/distant rewards work better without a hypercomputer, uh, be careful?

A note on offline versus online RL

The above implicitly assumes online RL, where the policy is able to learn from new data generated by the policy as it interacts with the environment.

Offline RL that learns from an immutable set of data does not allow the optimizer as much room to explore, and many of the apparent risks of RL are far less accessible.

Usage in practice

The important thing is that the artifact produced by a given optimization process falls within some acceptable bounds. Those bounds might arise from the environment, computability, or something else, but they're often available.

RL-as-it-can-actually-be-applied isn't that special here. The one suggestion I'd have is to try to use it in a principled way. For example: doing pretraining but inserting an additional RL-derived gradient to incentivize particular behaviors works, but it's just arbitrarily shoving a bias/precondition into the training. The result will be at some equilibrium between the pretraining influence and the RL influence. Perhaps the weighting could be chosen in an intentional way, but most such approaches are just ad hoc.

For comparison, you could elicit similar behavior by including a condition metatoken in the prompt (see decision transformers for an example). With that structure, you can be more explicit about what exactly the condition token is supposed to represent, and you can do fancy interpretability techniques to see what the condition is actually causing mechanistically.^[3]

^{^}
If you could enumerate all possible policies with a hypercomputer and choose the one that performs the best on the specified reward function, that would train, and it would also cause infinite cosmic horror. If you have a hypercomputer, don't do that.
^{^}
Or in the case of RLHF on LLMs, the fine-tuning process is effectively just etching a precondition into the predictor, not building complex new functions. Current LLMs, being approximators of probabilistic inference to start with, have lots of very accessible machinery for this kind of conditioning process.
^{^}
There are other options here, but I find this implementation intuitive.

List your AI X-Risk cruxes!

porby3mo180

Stated as claims that I'd endorse with pretty high, but not certain, confidence:

There exist architectures/training paradigms within 3-5 incremental insights of current ones that directly address most incapabilities observed in LLM-like systems. (85%; if false, my median strong AI estimate would jump by a few years, p(doom) effect would vary depending on how it was falsified)
It is not an accident that the strongest artificial reasoners we have arose from something like predictive pretraining. In complex and high dimensional problem spaces like general reasoning, successful training will continue to depend on schemes with densely informative gradients that can constrain the expected shape of the training artifact. In those problem spaces, training that is roughly equivalent to sparse/distant reward in naive from-scratch RL will continue to mostly fail.^[1] (90%; if false, my p(doom) would jump a lot)
Related to, and partially downstream of, #2: the strongest models at the frontier of AGI will continue to be remarkably corrigible (in the intuitive colloquial use of the word, but not strictly MIRI's use). That is, the artifact produced by pretraining and non-malicious fine tuning will not be autonomously doomseeking even if it has the capability. (A bit less than 90%; this being false would also jump by p(doom) by a lot)
Creating agents out of these models is easy and will get easier. Most of the failures in current agentic applications are not fundamental, and many are related to #1. There are no good ways to stop a weights-available model from, in principle, being used as a potentially dangerous agent, and outcome variance will increase as capabilities increase. (95%; I'm not even sure what the shape of this being false would be, but if there was a solution, it'd drop my current p(doom) by at least half)
Scale is sufficient to bypass the need for some insights. While a total lack of insights would make true ASI difficult to reach in the next few years, the hardware and scale of 2040 is very likely enough to do it the dumb way, and physics won't get in the way soon enough. (92%; falsification would make the tail of my timelines longer. #1 and #5 being falsified together could jump my median by 10+ years.)
We don't have good plans for how to handle a transition period involving widely available high-capability systems, even assuming that those high-capability systems are only dangerous when intentionally aimed in a dangerous direction.^[2] It looks an awful lot like we're stuck with usually-reactive muddling, and maybe some pretty scary sounding defensive superintelligence propositions. (75%; I'm quite ignorant of governance and how international coordination could actually work here, but it sure seems hard. If this ends up being easy, it would also drop my p(doom) a lot.)

^{^}
Note that this is not a claim that something like RLHF is somehow impossible. RLHF, and other RL-adjacent techniques that have reward-equivalents that would never realistically train from scratch, get to select from the capabilities already induced by pretraining. Note that many 'strong' RL-adjacent techniques involve some form of big world model, operate in some constrained environment, or otherwise have some structure to work with that makes it possible for the optimizer to take useful incremental steps.
^{^}
One simple story of many, many possible stories:
1. It's 20XY. Country has no nukes but wants second strike capacity.
2. Nukes are kinda hard to get. Open-weights superintelligences can be downloaded.
3. Country fine-tunes a superintelligence to be an existential threat to everyone else that is activated upon Country being destroyed.
4. Coordination failures occur; Country gets nuked or invaded in a manner sufficient to trigger second strike.
5. There's a malign superintelligence actively trying to kill everyone, and no technical alignment failures occurred. Everything AI-related worked exactly as its human designers intended.

porby's Shortform

porby6mo30

Yup, exactly the same experience here.

porby's Shortform

porby6mo40

Has there been any work on the scaling laws of out-of-distribution capability/behavior decay?

A simple example:

Simultaneously train task A and task B for N steps.
Stop training task B, but continue to evaluate the performance of both A and B.
Observe how rapidly task B performance degrades.

Repeat across scale and regularization strategies.

Would be nice to also investigate different task types. For example, tasks with varying degrees of implied overlap in underlying mechanisms (like #2).

I've previously done some of these experiments privately, but not with nearly the compute necessary for an interesting result.

The sleeper agents paper reminded me of it. I would love to see what happens on a closer-to-frontier model that's intentionally backdoored, and then subjected to continued pretraining. Can a backdoor persist for another trillion tokens of nonadversarial-but-extremely-broad training? Does that vary across scale etc?

I'd also like to intentionally find the circumstances that maximize the persistence of out of distribution capabilities not implied by the current training distribution.

Seems like identifying a robust trend here would have pretty important Implications, whichever direction it points.

porby's Shortform

porby6mo40

A further extension and elaboration on one of the experiments in the linkpost:
Pitting execution fine-tuning against input fine-tuning also provides a path to measuring the strength of soft prompts in eliciting target behaviors. If execution fine-tuning "wins" and manages to produce a behavior in some part of input space that soft prompts cannot elicit, it would be a major blow to the idea that soft prompts are useful for dangerous evaluations.

On the flip side, if ensembles of large soft prompts with some hyperparameter tuning always win (e.g. execution fine tuning cannot introduce any behaviors accessible by any region of input space without soft prompts also eliciting it), then they're a more trustworthy evaluation in practice.

LESSWRONG
LW

Posts

Wiki Contributions

Comments

The source of spookiness

Is RL therefore spooky?

A note on offline versus online RL

Usage in practice