Quantilizers and Generative Models

Based on this, my general sense is that quantilizers don't make generative models much more useful for alignment.

Right, the point of quantilizers is not to make generative models safer. It's to be safer than non-generative models (in cases where the training distribution is in fact safe and you don't need to filter very hard to succeed at the task).

I expect the purely statistical safety/filtering tradeoff to be pretty unimportant in practice. More important are the vulnerabilities that come from the training distribution not being safe in the first place. The performance cost does also seem pretty important, but it could potentially be sidestepped (for example, by training a student model on filtered data) if safety were actually solved.
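To make the filtering step concrete, here is a minimal sketch of an approximate quantilizer, assuming we can draw samples from the base distribution and score them with a proxy utility. The names `quantilize`, `base_sampler`, and `proxy_utility` are hypothetical and chosen for illustration; they do not come from the post.

```python
import random

def quantilize(base_sampler, proxy_utility, q, n_samples=1000, rng=None):
    """Approximate a q-quantilizer.

    Draw n_samples from the base distribution, keep the top q fraction
    ranked by proxy utility, and return one of the kept samples uniformly
    at random. All argument names are illustrative assumptions.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    rng = rng or random.Random()
    samples = [base_sampler(rng) for _ in range(n_samples)]
    samples.sort(key=proxy_utility, reverse=True)
    keep = max(1, int(q * n_samples))  # always keep at least one sample
    return rng.choice(samples[:keep])

# Toy usage: the base distribution is a standard normal and the proxy
# utility is the sampled value itself.
rng = random.Random(0)
pick = quantilize(lambda r: r.gauss(0.0, 1.0), lambda x: x, q=0.05,
                  n_samples=200, rng=rng)
```

Smaller `q` filters harder: the output scores higher on the proxy but is less typical of the base distribution, which is the tradeoff under discussion.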

RyanCarey, 3y

It's noteworthy that the safety guarantee relies on the "hidden cost" (:= proxy_utility - actual_utility) of each action being bounded above. If it's unbounded, then the theoretical guarantee disappears.
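For reference, one standard way to write this kind of bound is sketched below. It rests on assumptions not stated in the comment: a finite action set $A$, a base distribution $p$, a $q$-quantilizer $p_q$ that samples from the top $q$ fraction of $p$ ranked by proxy utility (so that $p_q(a) \le p(a)/q$), and a nonnegative hidden cost $c$.

```latex
% Assumptions (illustrative): finite action set A, base distribution p,
% quantilizer p_q with p_q(a) <= p(a)/q, and hidden cost c(a) >= 0.
\begin{align*}
\mathbb{E}_{a \sim p_q}[c(a)]
  &= \sum_{a \in A} p_q(a)\, c(a) \\
  &\le \sum_{a \in A} \frac{p(a)}{q}\, c(a) \\
  &= \frac{1}{q}\, \mathbb{E}_{a \sim p}[c(a)].
\end{align*}
```

The right-hand side is only informative when $\mathbb{E}_{a \sim p}[c(a)]$ is finite and reasonably small. If the cost can be arbitrarily large, that expectation can itself be arbitrarily large or infinite, which is one way to read the point that the guarantee disappears.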

Adam Jermyn, 3y

Good point! And indeed I am skeptical that there are useful bounds on the cost...

Sam Marks, 3y

My recent post on generative models has some related discussion; see especially remark 1 on the satisficer, quantilizer, and optimizer approaches to making agents with generative models.

Two interesting differences between the approaches discussed here and in my linked post:

1. In my post, I assumed that the generative model was trained on a data set which included rewards (for example, humans playing Breakout, where the reward is provided by the environment; or a setting in which rewards can be provided by a reward model trained with human feedback). In contrast, you seem to be primarily considering settings, like language models trained on the internet, in which rewards are not provided (or at least, only provided implicitly in the sense that "here's some Nobel prize winning research: [description of research]" implicitly tells the model that the given research is the type of research that good researchers produce, and thus kinda acts as a reward label). My assumption trivializes the problem of making a quantilizer (since we can just condition on reward in the top 5% of previously observed rewards). But your assumption might be more realistic, in that the generative models we try to get superhuman performance out of won't be trained on data that includes rewards, unless we intentionally produce such data sets.
2. My post focuses on a particular technique for improving capabilities from baseline called online generative modeling; in this scheme, after pretraining, the generative model starts an online training phase in which episodes that it outputs are fed back into the generative model as new inputs. Over time, this will cause the distribution of previously-observed rewards to shift upwards, and with it the target quantile. Note that if the ideas you lay out here for turning a generative model into a quantilizer work, then you can stack online generative modeling on top. Why would you do this? It seems like you're worried that your techniques can safely produce the research of a pretty good biologist but not of the world's best biologist on their best day. One way around this is to just ask your generative model to produce the research of a pretty good biologist, but use the online generative modeling trick to let its expectation of what pretty good biology research looks like drift up over time. Would this be safer? I don't know, but it's at least another option.
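As a rough illustration of how the target quantile can drift upward under online generative modeling, here is a toy sketch. It is not the scheme from the linked post: `noisy_model` is a hypothetical stand-in for a reward-conditioned generative model, and it simply assumes that conditioning on a target reward yields an episode with reward near that target, which is exactly the kind of assumption the safety discussion is about.

```python
import random

def top_quantile_threshold(rewards, q):
    """Smallest reward that still lies in the top q fraction of rewards."""
    ranked = sorted(rewards, reverse=True)
    k = max(1, int(q * len(ranked)))
    return ranked[k - 1]

def online_generative_modeling(initial_rewards, generate_episode, q=0.05,
                               rounds=20, episodes_per_round=50, rng=None):
    """Toy loop: condition on the current top-q reward threshold, generate
    episodes, and feed their rewards back into the observed data.

    Returns the threshold used in each round so the drift can be inspected.
    """
    rng = rng or random.Random()
    observed = list(initial_rewards)
    thresholds = []
    for _ in range(rounds):
        target = top_quantile_threshold(observed, q)
        thresholds.append(target)
        # New episodes become part of the data the next round conditions on.
        observed.extend(generate_episode(target, rng)
                        for _ in range(episodes_per_round))
    return thresholds

def noisy_model(target, rng):
    """Hypothetical model: reward equals the conditioned target plus noise."""
    return target + rng.gauss(0.0, 0.1)

# Toy usage: start from rewards drawn from a standard normal.
rng = random.Random(0)
start = [rng.gauss(0.0, 1.0) for _ in range(1000)]
history = online_generative_modeling(start, noisy_model, rng=rng)
```

In this toy setting the threshold tends to rise because roughly half of each round's episodes land above the previous target. Whether a real model's outputs stay safe as the target rises is the open question raised above.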

Adam Jermyn, 3y

This is helpful, thanks for summarizing the differences! I definitely agree on the first one.

On the second one, my concern is basically that all the safety guarantees that quantilizers provide have an inherent power/safety tradeoff (modulo whatever I'm missing from the "Targeted Impact" section).

That said, it's possible that your nested approach may avoid the 'simulate a deceptive AGI' failure mode. At least, if it's a continuous trajectory of improvement from median human performance up to very superhuman performance you might hope that that trajectory doesn't involve suddenly switching from human-like to AGI-like models. I don't personally find this very comforting (it seems totally plausible to me that there's a continuous path from "median human" to "very dangerous misaligned model" in model-space), but it does at least seem better than directly asking for a superhuman model.
