Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Thanks to Evan Hubinger for discussions about quantilizers, and to James Lucassen for discussions about conditioned generative models. Many of these ideas are discussed in Jessica Taylor's “Quantilizers: A Safer Alternative to Maximizers for Limited Optimization”; this post just expands on one thread of ideas in that paper. Throughout, I’ll refer to sections of the paper. I have some remaining confusion about the “targeted impact” section, and would appreciate clarifications/corrections!

Abstract

This post explores the relationship between quantilizers and generative models. My main takeaways are:

  1. A natural way to build a quantilizer is by sampling from an appropriately-conditioned generative model.
  2. Unfortunately, quantilizing doesn’t seem to confer much advantage over the underlying generative model: to the extent that a quantilizer is more powerful than a generative model, it’s more dangerous, and vice versa.
  3. Quantilizing is pretty computationally expensive relative to the advantage it brings, making it unclear whether this is a competitive approach even if it confers a net safety advantage at fixed power.

Definitions

I’ll follow the setup in “Quantilizers: A Safer Alternative to Maximizers for Limited Optimization” (Section 1, Definition 1) and say that a quantilizer is a model with:

  1. A base distribution $\gamma$, which describes a set of outputs we think are “normal” or “unlikely to lead to doom”.
  2. A quantile $q \in [0, 1]$.
  3. An ordering over outputs (e.g. a utility or loss function).

A quantilizer returns a random output from the top $q$ fraction of outputs from $\gamma$ according to the ordering. So for instance a 0-quantilizer is a utility maximizer, a 0.1-quantilizer returns outputs in the best 10%, etc.
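To make the definition concrete, here is a minimal sketch of an exact quantilizer over a small, finite base distribution. The function name `quantilize` and its arguments are my own illustrative choices, not anything from the paper, and I assume the base probabilities are known explicitly (which is exactly what we usually can't assume in practice). I also measure the "top fraction" by probability mass under the base distribution.

```python
import random

def quantilize(outputs, base_probs, utility, q, rng=random):
    """Sample from the top-q fraction (by base probability mass) of outputs.

    Hypothetical helper for illustration: `outputs` and `base_probs` describe
    a finite base distribution, `utility` defines the ordering.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    # Rank outputs from best to worst according to the ordering.
    ranked = sorted(zip(outputs, base_probs),
                    key=lambda pair: utility(pair[0]), reverse=True)
    kept, weights, mass = [], [], 0.0
    for output, prob in ranked:
        remaining = q - mass
        if remaining <= 1e-12:  # tolerance for floating-point round-off
            break
        take = min(prob, remaining)  # the last output may be included only partially
        kept.append(output)
        weights.append(take)
        mass += take
    # Within the kept set, sample proportionally to the (truncated) base mass.
    return rng.choices(kept, weights=weights, k=1)[0]

# Ten equally likely outputs, ordered by their value.
outputs = list(range(10))
base_probs = [0.1] * 10
print(quantilize(outputs, base_probs, lambda x: x, q=0.3))  # one of 7, 8, 9
```

With `q=1` this just samples from the base distribution, and as `q` shrinks towards 0 it approaches picking the single best output.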

Where does the base distribution come from?

One challenge with quantilizers is that they require a base distribution $\gamma$ that we trust. That is, we want to be able to sample from $\gamma$ many times and never hit bad outputs. Unfortunately, we don’t know how to explicitly write this distribution down.

A natural alternative to explicit construction is to train a generative model on the distribution of actions humans take (Section 3.1). Then, for any given task, $\gamma$ can be written as

$$\gamma(a) = P(a \mid \text{a human is performing the task})$$

This seems quite doable. Train GPT-N on text written by humans, then prompt it with the task:

Prompt: You are a good person who wants to accomplish <task>. What follows is your description of how you intend to do this.

The output will probably be text describing how a good person would accomplish that task!

Unfortunately we might get unlucky and sample a person who isn’t so good (but thinks of themselves as good). That still satisfies the prompt. Robustly getting the agent we want seems hard, but might be possible in the limit where we ask for agents that are generally plausible in the training distribution of the generative model.

Power vs Safety

Armed with our base distribution, we can start quantilizing. Naively we can do this by sampling roughly $1/q$ times from $\gamma$, then returning the top output. This produces a tradeoff between capabilities and safety.
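The naive sampling procedure can be sketched as follows. This assumes the number of samples is about one over the quantile $q$, and `sample_base` is a hypothetical stand-in for "draw one output from the generative model". Note that best-of-n is only an approximation to a true quantilizer: the best of $n$ independent samples lands, on average, in the top $1/(n+1)$ of the base distribution rather than being spread uniformly over the top $q$.

```python
import math
import random

def best_of_n_quantilizer(sample_base, utility, q):
    """Approximate a q-quantilizer by drawing ~1/q samples and keeping the best.

    `sample_base` is a hypothetical zero-argument callable that returns one
    sample from the base distribution (e.g. one generation from the model).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    n = math.ceil(1 / q)  # number of base samples we pay for
    candidates = [sample_base() for _ in range(n)]
    return max(candidates, key=utility)

# Toy usage: the "model" emits uniform random numbers, and bigger is better.
rng = random.Random(0)
print(best_of_n_quantilizer(rng.random, lambda x: x, q=0.1))
```

With `q=1` this draws a single sample and returns it unchanged, which is the single-sampler case.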

Single Sampler

In the limit $q \to 1$ we’re just returning actions a good human would take. That is, our quantilizer just samples once and returns that sample as output, so it’s no more powerful than a random good person.

Such a model is really unlikely to take over the world, but it’s also really unlikely to solve quantum chemistry problems. This is because both of these outcomes are unlikely according to the base distribution.

We can improve things a bit by conditioning on being an unusually capable person, e.g.

Prompt: You are a good and extremely capable person who wants to accomplish <task>. What follows is your description of how you intend to do this.

Now we can get top-end human performance out, though at the cost of increased odds of bad outcomes.

For instance, the model might decide to role-play a biologist who invents some extremely positive technology that happens to have unintended negative consequences (e.g. making engineered pandemics more likely). That’s not a problem unique to AI though, just a “giving humans more capabilities” problem.

Many Samples

In the limit $q \to 0$ we’re sampling many times and returning the best action we find. Quantilizing like this makes actions more powerful, but at the cost of making them less likely according to the base distribution. This potentially makes them more dangerous.

For instance, instead of role-playing an extremely capable biologist who is likely to be alive today, the model can now role-play even more capable biologists who are unlikely to exist. Or it can role-play that same biologist on their best day, having the best insight of their life. Or, even worse, it can role-play a deceptive AI that’s pretending to be an extremely capable (and good) human biologist.

We now have at least two dangers:

  1. As in the single-sampler case ($q = 1$), there is danger in simulating more powerful human agents. This problem becomes worse as $q$ decreases, because we can sample increasingly powerful agents who are increasingly far from the distribution we trust (“actions good humans take”).
  2. In addition, as we sample increasingly unlikely agents from the model we are more and more likely to encounter something non-human (e.g. a deceptive AI). This seems very bad, and I’m not sure how to avoid it.
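
The purely statistical part of this tradeoff can be made concrete with the standard cost bound for quantilizers, as I understand the paper's argument. This is a sketch under notation that is my own assumption: $\gamma$ is the base distribution, $p_q$ is the output distribution of a $q$-quantilizer, and $c(a) \ge 0$ is an unknown cost of action $a$. Because the quantilizer only renormalizes the top $q$ of the base distribution's probability mass, it can upweight any single action by at most a factor of $1/q$:

```latex
p_q(a) \le \frac{\gamma(a)}{q}
\quad\Longrightarrow\quad
\mathbb{E}_{a \sim p_q}[c(a)]
  = \sum_a p_q(a)\, c(a)
  \le \frac{1}{q} \sum_a \gamma(a)\, c(a)
  = \frac{1}{q}\, \mathbb{E}_{a \sim \gamma}[c(a)]
```

So the guarantee weakens linearly in $1/q$ as $q$ shrinks. It also only bounds cost relative to $\gamma$: if the base distribution itself already puts weight on dangerous agents, the bound does not help with that.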

Targeted Impact

The general tradeoff above is that as we make the quantilizer more powerful (e.g. reduce $q$, or condition on more capabilities) we make it more dangerous by pulling it away from the action-space we think of as generally safe.

Section 2.2 refers to a technique for quantilizing that avoids this tradeoff called Targeted Impact, but I have not been able to follow the argument. It seems to produce the same quantilizing algorithm (because the condition at the end of the section is exactly the same as the one in Section 2.1), and so I’m confused about where the improvement comes from.

I’m probably missing something, and would really appreciate hearing/seeing a more detailed explanation of this approach!

Summary

My understanding right now is:

  1. A natural way to build a quantilizer is by sampling from an appropriately-conditioned generative model.
  2. This approach probably succeeds at being ‘close to as safe as a human’, at the cost of not being much more powerful.
  3. The more we try to make the model powerful, the weaker our safety guarantees get. This seems bad.
  4. Loss of safety comes both from sampling increasingly unlikely actions and from sampling from increasingly powerful agents. The former means we lose formal guarantees on how bad actions can be relative to normal human actions; the latter means both that we sample from a riskier class of actions ("actions taken by very capable agents") and that we are more likely to accidentally sample from actions taken by misaligned/deceptive agents.
  5. The number of bits of optimization power we get by quantilizing rather than directly sampling from the generative model is small ($\log_2(1/q)$) and comes at large cost ($\sim 1/q$ samples), which makes this likely uncompetitive on capabilities relative to just using bigger models trained for longer.
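
To put rough numbers on the last point, here is a small sketch of the arithmetic. It assumes that selecting the top $q$ fraction buys $\log_2(1/q)$ bits of selection, and that the naive sampling scheme pays for about $1/q$ draws from the generative model. The function name is just for illustration.

```python
import math

def quantilizer_tradeoff(q):
    """Return (bits of selection, naive number of base samples) for quantile q."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    bits = math.log2(1 / q)       # optimization power gained, in bits
    samples = math.ceil(1 / q)    # draws from the generative model, best-of-n style
    return bits, samples

for q in (0.5, 0.1, 0.01, 0.001):
    bits, samples = quantilizer_tradeoff(q)
    print(f"q={q:<6} bits={bits:5.2f} samples={samples}")
```

Each extra bit of selection doubles the number of samples, so the cost grows exponentially in the optimization power gained.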

Based on this, my general sense is that quantilizers don't make generative models much more useful for alignment. That said, I suspect I'm missing something important about the "Targeted Impact" approach, and am keen to understand that better.

5 comments

“Based on this, my general sense is that quantilizers don't make generative models much more useful for alignment”

Right, the point of quantilizers is not to make generative models safer. It's to be safer than non-generative models (in cases where the training distribution is in fact safe and you don't need to filter very hard to succeed at the task).

I expect the purely statistical safety/filtering tradeoff to actually be pretty unimportant. More important are the vulnerabilities that come from the training distribution actually not being safe in the first place. The performance cost also does seem pretty important, but could potentially be sidestepped (maybe train a student model on filtered data) if safety was actually solved.

It's noteworthy that the safety guarantee relies on the "hidden cost" (:= proxy_utility - actual_utility) of each action being bounded above. If it's unbounded, then the theoretical guarantee disappears.

Good point! And indeed I am skeptical that there are useful bounds on the cost...

My recent post on generative models has some related discussion; see especially remark 1 on the satisficer, quantilizer, and optimizer approaches to making agents with generative models.

Two interesting differences between the approaches discussed here and in my linked post:

  • In my post, I assumed that the generative model was trained on a data set which included rewards (for example, humans playing Breakout, where the reward is provided by the environment; or a setting in which rewards can be provided by a reward model trained with human feedback). In contrast, you seem to be primarily considering settings, like language models trained on the internet, in which rewards are not provided (or at least, only provided implicitly in the sense that "here's some Nobel prize winning research: [description of research]" implicitly tells the model that the given research is the type of research that good researchers produce, and thus kinda acts as a reward label). My assumption trivializes the problem of making a quantilizer (since we can just condition on reward in the top 5% of previously observed rewards). But your assumption might be more realistic, in that the generative models we try to get superhuman performance out of won't be trained on data that includes rewards, unless we intentionally produce such data sets.
  • My post focuses on a particular technique for improving capabilities from baseline called online generative modeling; in this scheme, after pretraining, the generative model starts an online training phase in which episodes that it outputs are fed back into the generative model as new inputs. Over time, this will cause the distribution of previously-observed rewards to shift upwards, and with it the target quantile. Note that if the ideas you lay out here for turning a generative model into a quantilizer work, then you can stack online generative modeling on top. Why would you do this? It seems like you're worried that your techniques can safely produce the research of a pretty good biologist but not of the world's best biologist on their best day. One way around this is to just ask your generative model to produce the research of a pretty good biologist, but use the online generative modeling trick to let its expectation of what pretty good biology research looks like drift up over time. Would this be safer? I don't know, but it's at least another option.

This is helpful, thanks for summarizing the differences! I definitely agree on the first one. 

On the second one, my concern is basically that all the safety guarantees that quantilizers provide have an inherent power/safety tradeoff (modulo whatever I'm missing from the "Targeted Impact" section).

That said, it's possible that your nested approach may avoid the 'simulate a deceptive AGI' failure mode. At least, if it's a continuous trajectory of improvement from median human performance up to very superhuman performance you might hope that that trajectory doesn't involve suddenly switching from human-like to AGI-like models. I don't personally find this very comforting (it seems totally plausible to me that there's a continuous path from "median human" to "very dangerous misaligned model" in model-space), but it does at least seem better than directly asking for a superhuman model.