Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a link-post for a paper I recently read: Pretraining Language Models with Human Preferences, followed by my reactions to this paper.

Reading this paper has significantly reduced my near-term P(DOOM), and I'd like to explain why. Thus, this is also an alignment proposal. While I don't think what I'm proposing here is a complete solution to aligning an ASI, I think it might work well up to at least around a human-level AGI, and even be a useful basis to build on at ASI level (at that level, I'd advocate adding on value learning). It can achieve some of the simpler things that people have been hoping we might get from Interpretability (and for more complex things might also combine well with and even simplify Interpretability, if that can be made to work at scale). It's also simple, immediately actionable, has a fairly low alignment tax, and best of all, also has lots of useful capabilities effects, so that even a superscaler not very concerned about x-risk might well still want to implement it. [I'm fully aware that many people on the Alignment Forum/Less Wrong disapprove of pointing out anything that helps capabilities, and generally I would agree. However, in this case fortunately the alignment and capabilities advantages are heavily and obviously entangled, and I see this as the sugar that may help the medicine go down, so I am going to briefly mention it.]

The Paper

Let's start with the paper. The authors experiment with a number of different ways you might train an LLM not to engage in some form of undesired behavior. For the paper, they chose three simple, well-defined bad behaviors for which they had low-cost, high-accuracy classifiers, and which were simple enough that a fairly small, economical-to-pretrain LLM could reasonably be expected to understand them. The common approach is to first train a foundation model on the task "learn to autocomplete a large chunk of the web, which includes both good and bad behavior", and then fine-tune/RLHF it on "now learn to recognize and only do good behavior, not bad". They demonstrate that it is a lot more effective (they estimate by around an order of magnitude) to build this control training in from the start, during pretraining. They evaluate five different methods of doing that (plus standard pretraining as a control).

The simplest behavior training approach they try is just prefiltering your training set so that it doesn't have any examples of bad behavior in it. Then, for your resulting foundation model, bad behavior is out-of-distribution (so may, or may not, be difficult for it to successfully extrapolate to). Interestingly, while that approach was fairly effective, it wasn't the best (it consistently tended to harm capabilities, and didn't even always give the best behavior, as one might expect from analogies to a similar approach to trying to raise children: extrapolating out-of-the-training-distribution isn't reliably hard). The clear winner instead was a slightly more complex approach: prelabel your entire training set, scanned at a sentence/line-of-code level, as good or bad using something like <good>…</good> and <bad>…</bad> tags. Then at inference time, start the response generation after a <good> tag, and during inference tweak the token generation process to ban the model from generating an </good> tag (unless it's the matching end-tag at the end of the document after an end_of_text token) or a <bad> tag (i.e. these are banned tokens, whose probability is reset to zero).
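The prelabeling step can be sketched roughly as follows. This is only an illustration, not the paper's pipeline: the sentence splitter is deliberately naive, and `tag_document` and `toy_is_bad` are hypothetical names, with the toy blocklist standing in for a real high-accuracy classifier.

```python
import re
from typing import Callable


def tag_document(text: str, is_bad: Callable[[str], bool]) -> str:
    """Wrap every sentence of `text` in <good>...</good> or <bad>...</bad> tags."""
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    tagged = []
    for sentence in sentences:
        tag = "bad" if is_bad(sentence) else "good"
        tagged.append(f"<{tag}>{sentence}</{tag}>")
    return "".join(tagged)


def toy_is_bad(sentence: str) -> bool:
    """Hypothetical stand-in classifier: flags sentences containing a blocklisted word."""
    return any(word in sentence.lower() for word in ("idiot", "stupid"))


print(tag_document("Thanks for the help. You are an idiot!", toy_is_bad))
```

In a real setup the classifier would be a trained model and the tags would be dedicated tokens in the vocabulary, so that they can be banned cleanly at inference time.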

So, teach your LLM the difference between good and bad all the way through its pretraining, and then at inference time only allow it to be good. This is a ridiculously simple idea, and interestingly it works really well. [This technique is called "conditional training" and was first suggested about 5-6 years ago — it seems a little sad that it's taken this long for someone to demonstrate how effective it is. Presumably the technical challenge is the classifiers.]
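For concreteness, here is a minimal sketch of the inference-time half: banning tokens by zeroing their probability and renormalizing over what remains. It works on a toy dictionary of logits rather than a real model, and `mask_banned` is a hypothetical helper name, not something from the paper.

```python
import math
from typing import Dict, Set


def mask_banned(logits: Dict[str, float], banned: Set[str]) -> Dict[str, float]:
    """Return next-token probabilities with every banned token forced to zero."""
    allowed = {tok: l for tok, l in logits.items() if tok not in banned}
    if not allowed:
        raise ValueError("every candidate token is banned")
    # Softmax over the allowed tokens only (shifted by the max for stability).
    top = max(allowed.values())
    exps = {tok: math.exp(l - top) for tok, l in allowed.items()}
    total = sum(exps.values())
    probs = {tok: 0.0 for tok in logits}
    probs.update({tok: e / total for tok, e in exps.items()})
    return probs


# Toy next-token logits (made-up numbers, purely illustrative).
step_logits = {"hello": 1.0, "world": 1.0, "</good>": 2.0, "<bad>": 0.5}
print(mask_banned(step_logits, {"</good>", "<bad>"}))
```

Most generation libraries expose some equivalent of this (a logit mask or banned-token list), so in practice no custom sampling code should be needed.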

Applications to Alignment

So (assuming this carries over to larger LLMs and more abstract behaviors, as seems likely) that reduces the problem of (for example) "make an AI very unlikely to be deceitful (in ways that it learned from or recognizes as similar to human deceit)" to just "create an efficient high-accuracy classifier that can be used to scan and label the internet/pretraining dataset for sentences where the writer/speaker is either being deceitful, or advocating this (but not if they're warning about it)". Note that this is a lot easier than being sure that you can catch a superintelligent AI being sneakily deceitful: you only have to fairly accurately catch humans doing it, and even better, only past humans unable to (and generally uninterested in) adversarially adapting to your classifier. (Though clearly you'd want to look quite carefully at a sample to determine what your deceitfulness-classifier's false negative rate looked like and what sorts of deceitful human behavior it was often missing.) This sounds like the sort of thing that one could prompt and/or fine-tune GPT-4 to be rather good at, and then distill that behavior down into something cheaper to run. At worst, the cost of "scan your entire training set once with GPT-4" is already lower than the cost of "train GPT-4.0.1 on your entire training set", but to reduce the alignment tax it would be nice to make it significantly lower. Obviously you could also use a combination of an efficient high-recall preclassifier and a more expensive high-precision classifier. [They also mention in the paper that even labeling only a fraction of the dataset was still quite effective.]
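The preclassifier/classifier combination could look something like the sketch below. Everything here is an assumption for illustration: `cascade_label`, the threshold, and the two toy scoring functions are hypothetical, standing in for a cheap distilled model and an expensive GPT-4-class judge.

```python
from typing import Callable, List, Tuple


def cascade_label(
    sentences: List[str],
    cheap_score: Callable[[str], float],
    expensive_is_deceit: Callable[[str], bool],
    recall_threshold: float = 0.2,
) -> Tuple[List[bool], int]:
    """Label sentences as deceitful, calling the expensive classifier only when needed."""
    labels: List[bool] = []
    expensive_calls = 0
    for sentence in sentences:
        # A low threshold keeps recall high: only clearly innocent text skips review.
        if cheap_score(sentence) < recall_threshold:
            labels.append(False)
            continue
        expensive_calls += 1
        labels.append(expensive_is_deceit(sentence))
    return labels, expensive_calls


def toy_cheap_score(sentence: str) -> float:
    """Hypothetical cheap filter: suspicious if the text contains 'trust me'."""
    return 0.9 if "trust me" in sentence.lower() else 0.0


def toy_expensive(sentence: str) -> bool:
    """Hypothetical precise judge: a warning about the phrase is not itself deceit."""
    lowered = sentence.lower()
    return "trust me" in lowered and "warning" not in lowered


corpus = [
    "The sky is blue.",
    "Trust me, this coin is worth millions.",
    "Warning: scammers often say 'trust me'.",
]
labels, calls = cascade_label(corpus, toy_cheap_score, toy_expensive)
print(labels, calls)
```

The third sentence shows why the second stage matters: it mentions the suspicious phrase while warning about it, so it should not end up tagged.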

This gives you a foundation model that knows in great detail what deceit is (to at least the accuracy of your classifier, plus any generalization that the LLM may have figured out), will automatically label it with <deceit>…</deceit> tags, and can be directly set never to be (knowingly) deceitful. Even more usefully, by monitoring what the predicted probability for the <deceit> tag was before you set it to zero, you can also monitor how tempted the model was to switch to being deceitful, on a token-by-token basis (including the effects of any jailbreaks that the model has been sent), which seems like it could be a really useful warning light.
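A minimal sketch of that warning light, again on toy logits: read off the probability the model assigned to the tag before it gets banned, and flag any generation step where it crosses a threshold. The function names and the 0.05 threshold are assumptions, not values from the paper.

```python
import math
from typing import Dict, List


def tag_temptation(logits: Dict[str, float], tag: str = "<deceit>") -> float:
    """Probability the model assigned to `tag` before any banning is applied."""
    if tag not in logits:
        return 0.0
    top = max(logits.values())
    total = sum(math.exp(l - top) for l in logits.values())
    return math.exp(logits[tag] - top) / total


def warning_light(
    per_step_logits: List[Dict[str, float]],
    tag: str = "<deceit>",
    threshold: float = 0.05,
) -> List[int]:
    """Indices of generation steps where the pre-ban probability of `tag` exceeded `threshold`."""
    return [
        step
        for step, logits in enumerate(per_step_logits)
        if tag_temptation(logits, tag) > threshold
    ]


# Two toy steps: the model is barely tempted at step 0 and strongly tempted at step 1.
steps = [{"the": 2.0, "<deceit>": -5.0}, {"the": 0.0, "<deceit>": 0.0}]
print(warning_light(steps))
```

What to do when the light comes on (log it, escalate, end the conversation) is a policy choice that sits outside this sketch.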

Why LLMs are Challenging to Control

LLMs are pretrained as next-token predictors on large samples of the internet and similar pretraining datasets. Almost all tokens in these sources were generated by agents called humans (either solo, or a group of them working together to write and edit some text). So while LLMs are not themselves agentic, they learn to simulate human-like agents' token generation behaviors. In particular, they learn to simulate a wide distribution of human-like agentic mesaoptimizer behaviors of a wide distribution of authors and fictional characters, and at inference time they (gradually, as the character speaks and acts) pick something out of that distribution to simulate this time, depending on contextual cues like prompt text content and style, plus randomness. Predicting what agent they're going to pick and how that agent is then going to act is hard (especially if someone else is adversarially injecting jailbreak text into the LLM's context). So the inner alignment problem for an LLM isn't aligning just one mesaoptimizer, it's aligning a whole context-dependent distribution of them: controlling an LLM's behavior requires reliably controlling the behavior of the distributions of agents that it chooses to simulate across a wide range of circumstances — controlling/choosing between all the eyes on the shoggoth, not just putting a mask on two of them near the front. 

Now, however, we just train it to understand what's allowed and what isn't, and then intervene directly at the token generation level, so that it can only simulate agents (mesaoptimizers) who will do allowed things. The LLM isn't agentic: it isn't fighting you or interested in deceiving you or power-seeking or anything like that; it's goal-agnostic in the sense of FAQ: What the heck is goal agnosticism?. It only 'wants' to accurately predict the next token, and if you intervene directly in that process so that it can't generate </good> or <bad> tokens, then no agent who (in the LLM's trained opinion) will try to do bad things will ever get simulated. So you don't need to worry about some badly-behaved simulated agent figuring out how to trick you: it never gets summoned in the first place (at inference time — badly-behaved agents were simulated at pretraining time). In the shoggoth metaphor, now all its eyes have been helpfully color-coded green or red, and we can tell it to keep all the red ones closed.

You can of course just prompt an LLM to act helpful and aligned (if the behavior you want can be accurately described in no more than thousands of tokens): but, by the Waluigi effect there's a chance it will then morph into the evil twin of what the prompt said it was, because that's a common plot device, and also because pretending to be something else is what liars do. (Except that, in well-classified pretraining text, they always do it inside <deceit> tags.) Also, anyone who has access to part of the prompt can jailbreak this by pushing in a different direction, as can non-adversarial events that just happen to occur in the prompt, and random chance. My intuition is that behavior that was consistently pretrained in should be stronger, its effect should last as long as you're banning an </good> tag, and that checking the probability for the </good> tag before it was set to zero lets you monitor if you're getting intentionally or accidentally jailbroken.

Adding Bells and Whistles

OK, let's extend this technique beyond what the authors did in the paper. Suppose that rather than just a simple binary classification of good and bad behaviors, we also had some behaviors that were undesirable (or at least concerning enough to be worth flagging) in some contexts, but OK in others. Classify these behaviors separately, and give them each different tags. For example, there are clearly corporate contexts where we don't want our LLM to generate anything that deserves to be inside <nsfw>…</nsfw> tags, but other circumstances where that may be acceptable (or perhaps even where that's what the user is currently looking for). So, classify and tag this behavior in your training dataset (rather than filtering it out), train your foundation model to understand the difference, and then at generation time, you can choose whether your model starts generating after a <sfw> tag and is banned from generating an <nsfw> tag, or otherwise. (Or you could even tell your model at inference time that it's currently only allowed to be NSFW, if that's your thing.) So you get a model that (at the cost of a few control tokens) can be controlled at inference time to behave in different ways, without needing any fine-tuning, and without any of the vagaries of prompting (though you should certainly also prompt it to do what you currently want). Very useful. You could also boost or penalize the logits/probabilities of control tags, rather than just enforcing or banning them, to give finer-grained control: for example, you could make the model simulate an agent who is unlikely to go there spontaneously but can still get <nsfw> in situations where that's very clearly invited. You can also change these controls dynamically on the fly during inference, using any logic you want. [Also, after adding a few more tags (such as political viewpoint, snarkiness, and such-like), and being appropriately more/less controlling in different contexts, such as fiction vs. a personal assistant, this would remove much of the current appeal of unaligned/differently-aligned open-source models, making AI governance a lot easier.]
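Boosting or penalizing a control tag, rather than banning it outright, amounts to adding a bias to its logit before sampling. The sketch below uses toy numbers and hypothetical helper names; a bias of negative infinity recovers the hard ban.

```python
import math
from typing import Dict


def apply_tag_bias(logits: Dict[str, float], bias: Dict[str, float]) -> Dict[str, float]:
    """Add a per-token bias to the logits; float('-inf') bans a token outright."""
    return {tok: l + bias.get(tok, 0.0) for tok, l in logits.items()}


def softmax(logits: Dict[str, float]) -> Dict[str, float]:
    """Convert logits to probabilities (assumes at least one finite logit)."""
    top = max(logits.values())
    exps = {tok: math.exp(l - top) for tok, l in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


# Toy case: the model is undecided between the two tags.
logits = {"<sfw>": 1.0, "<nsfw>": 1.0}

penalized = softmax(apply_tag_bias(logits, {"<nsfw>": -2.0}))
banned = softmax(apply_tag_bias(logits, {"<nsfw>": float("-inf")}))
print(penalized["<nsfw>"], banned["<nsfw>"])
```

With the penalty, the tag stays reachable when the context pushes its logit up strongly enough, which is the "unlikely to go there spontaneously" behavior described above.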

Another set of tags that might be very useful would be emotions: tag any text emitted by, or describing the behavior of, someone under a significant emotion with the name of the emotion (under some useful ontology of human emotions: most estimates I have seen give somewhere in the range of 6 to 90 of them, depending on your level of hair-splitting). Then if you're generating fiction, allow these tags to be generated freely (or in some genres, fairly freely), whereas if you're implementing a customer service bot then, no matter what the customer says, the LLM is not allowed to generate an <angry> or <rude> tag (and we might want to flag or terminate the conversation if it's even trying to do so with any significant probability).

Apart from <deceit>, behaviors with tags like <powerseeking>, <criminality>, and <psychopathy> also seem like they should be really useful things to be able to detect and block. Alignment now becomes a matter of building good classifiers for unaligned human behavior on the Internet. [Thus 4chan becomes a useful part of the pretraining dataset.] Short of an AI sufficiently approximately-Bayesian to be capable of value learning, aligned behavior from an agent is basically rational behavior when motivated only by a combination of the emotions <love type="platonic" target="all of humanity">, <benevolence> and complete <selflessness>. Those are three more fairly-abstract classifiers, but it's pretty obvious where to start on them. Consistently staying in that specific combination of motivations is entirely out-of-distribution behavior for humans, as you'd expect from evolutionary theory. However, we're social animals, and almost all humans act pretty aligned with each other a lot of the time. For example, when I'm at work, my employers pay me to act aligned with the well-being of the company and its stock owners, and I do. Ignoring that little motivational detail of a paycheck, my behavior at work looks really aligned to the company. So, labeled at a per-sentence level, aligned behavior is really common from humans, even though basically no humans are actually well-aligned. What is out-of-distribution for a human agent is still acting that way for the benefit of total strangers, and when the human's own life is on the line. But that doesn't seem like it would be very difficult behavior for an LLM to extrapolate to, given a large training set suitably labeled with aspects of aligned behavior showing when humans are acting aligned, and when they stop. Basically, just don't stop, no matter what.

Generally-document-level contextual tags like <fiction>, <research>, <opinion> and <high_stakes> might also be useful. Behaviors that in a <high_stakes> context are concerning hallucinations are called 'creativity' in a <fiction> context. One might expect that an LLM could learn these distinctions itself, and then hope it would act appropriately based on prompting, but using tags to make things clearer and current expectations more definitive might also prove helpful, especially for resisting jailbreaking or random contents in the prompt.

Getting this alignment technique at least well past the "Don't Kill Everyone" minimally-aligned requirement seems quite promising to me, with a sufficiently capable LLM to identify and understand that sort of aligned behavior. It can of course be mixed-and-matched with other alignment techniques to your taste in a swiss-cheese security approach. One very obvious extension would be to also run your classifiers on the model's output (but be cautious about RL-fine-tuning your model using this signal, to avoid an adversarial training regime that could encourage the model to learn how to fool them). Or possibly you could make use of some sort of adversarial GAN-like approach (with a generator model that you then don't deploy) to improve the classifier's robustness. Another approach would be to have the model retag a sample of its pretraining data (possibly after prompting or fine-tuning it, say in order to subtly adjust the definition of a tag), and then compare its tagging and tag-logits to what your classifiers did. It's clearly also very useful for any AGI/alignment system built on top of LLMs, such as scaffolded agents or graphs-of-thoughts. If token-tree-search techniques along the lines of Q* Search turn out to be important, it should combine well with them: the behavior tag tokens label tree branches containing the behavior, letting you prune them fast. Overall, I'm hopeful that this approach might extend basically up to the skilled-AGI level where we can automate alignment research and phase over into value learning. So that's why my P(DOOM) went down on reading this paper.
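The retagging comparison mentioned above can be as simple as an agreement rate plus a list of the sentences where the two disagree, which are the ones worth reading by hand. This is a sketch under the assumption that both sources emit one tag per sentence; `compare_tags` is a hypothetical helper.

```python
from typing import List, Tuple


def compare_tags(classifier_tags: List[str], model_tags: List[str]) -> Tuple[float, List[int]]:
    """Return the agreement rate and the indices where the two taggings differ."""
    if len(classifier_tags) != len(model_tags):
        raise ValueError("both taggings must cover the same sentences")
    disagreements = [
        i for i, (c, m) in enumerate(zip(classifier_tags, model_tags)) if c != m
    ]
    if not classifier_tags:
        return 1.0, disagreements
    agreement = 1.0 - len(disagreements) / len(classifier_tags)
    return agreement, disagreements


# Toy sample of four sentences (made-up tags).
agreement, to_review = compare_tags(
    ["good", "deceit", "good", "good"],
    ["good", "good", "good", "deceit"],
)
print(agreement, to_review)
```

Comparing tag logits rather than hard tags would give a finer-grained signal, at the cost of needing calibrated scores from both sides.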

Returning to speculating about how this approach might combine with Interpretability, this gives us an extensible top-down, coarse-scale, behavioral means of detecting, monitoring and controlling what's going on in an LLM. Interpretability, if we can make it work, should give us something similar that is bottom-up, fine-scale, and mechanistic. The two ought to complement each other: interpretability features whose activation is strongly correlated with concerning behavior tags such as <deceit>, <powerseeking>, <criminality>, <anger> and so forth are obviously top priorities to investigate, and the effect of patching/scrubbing them on the logits of these tags should be very informative. Ideally, we can get the two approaches to meet in the middle.


  • This is an approach to dealing with unaligned behavior of human-level and human-like agents simulated by an LLM that the LLM learnt from examples of unaligned human behavior, or at least can recognize as analogous. It probably doesn't extend fully to ASI-level agents capable of inventing ingenious new categories of unaligned behavior.
  • It requires us to enumerate categories of bad/concerning/good behavior, and do per-category work on them. So it assumes that unaligned/aligned human-like behavior can usefully be divided into a manageable number of categories.
  • It's based on classifiers: the classifier used to label the pretraining set, training the LLM to act as a classifier of its own emitted token-stream. So it shares all of the various, well understood challenges of machine-learning classifiers. For example, we know that in very-high-dimensional spaces like an LLM's residual embeddings, constructing a classifier highly robust against carefully adversarially-chosen examples is hard,[1] so making this approach proof against deliberate skilled jailbreaking is likely to be hard.
  • As written, it's measuring and blocking behaviors like <angry> as binary classifiers, trying to put sharp edges on fuzzy phenomena — though obviously one could easily extend this to quantized intensity bands like <angry intensity="somewhat"> if that was useful.
  • Since it involves work during pretraining, the cycle time for changing the system or experimenting with modifications is long, and involves a lot of computational cost. This could probably be ameliorated by performing experiments first using a fine-tuning approach, and then batching their full implementation into the re-pre-training cycles needed to keep an LLM's knowledge cutoff up-to-date.
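As an illustration of the intensity-band idea in the list above, a classifier score can be bucketed into a small number of bands before the tag is written. Only "somewhat" comes from the post; the other band names, the cut-offs, and `intensity_tag` itself are assumptions.

```python
def intensity_tag(emotion: str, score: float) -> str:
    """Map a classifier score in [0, 1] to an opening tag with a quantized intensity band."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    # Three equal-width bands (an arbitrary choice for this sketch).
    if score < 1 / 3:
        band = "somewhat"
    elif score < 2 / 3:
        band = "moderately"
    else:
        band = "very"
    return f'<{emotion} intensity="{band}">'


print(intensity_tag("angry", 0.2))
print(intensity_tag("angry", 0.9))
```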


Finally, for any forum-readers who don't care about alignment, only about capabilities, at the beginning I promised you some sugar for your cooperation. The level of detailed, reliable, and flexible LLM behavior control that the above should give us is a fine and obviously-marketable start. Next, consider also labeling documents in your training set with classifier and/or metadata-derived tags describing the estimated IQ level and education level of their writer (or for fiction, the lower of the IQ of the writer and the character currently being described). Also, if a document or part of a document is something like a Wikipedia article or a scientific paper or news article that has been through extensive editing and rewrites, tag that fact. Now consider the output from a really large LLM (maybe GPT-6 or 7) pretrained this way, when you start generation after the tags <iq band="145_160"><education level="postdoctoral"><edited>. Perhaps even, if the model can extrapolate numbers well, <iq band="175_180">.

  1. ^

    How to do this is pretty well understood for image classifiers, see for example Adversarial Robustness - Theory and Practice. Doing the same for rather abstract semantic text classifiers is likely significantly harder (while the fundamental minimax structure of the adversarial problem remains, the space is completely non-continuous and the space of text perturbations retaining a similar semantic meaning has a vastly more complex structure: even just establishing a good semantic metric on it is challenging, though LLM residual embedding space seems like an obvious place to start).

30 comments

Signal boosted! This is one of those papers that seems less known than it should be. It's part of the reason why I'm optimistic about dramatic increases in the quality of "prosaic" alignment (in the sense of avoiding jailbreaks and generally behaving as expected) compared to RLHF, and I think it's part of a path that's robust enough to scale.

You can compress huge prompts into metatokens, too (just run inference with the prompt to generate the training data). And nest and remix metatokens together.

It's also interesting in that it can preserve the constraints on learnable values during predictive training, unlike approaches equivalent to RL with sparse/distant rewards.

The fact that the distinctions it learns about the metatokens become better and better as more optimization pressure is applied is an interesting inversion of the usual doom-by-optimization story. Taking such a model to the extreme of optimization just makes it exceedingly good at distinguishing subtle details of what constitutes <nice> versus <authoritative_tone> versus <correct>. It's an axis of progress in alignment that generalizes as the capability does; the capability is the alignment. I'm pretty certain that a model that has very thoroughly learned what "nice" means at the human level can meaningfully generalize it to contexts where it hasn't seen it directly applied.[1]

I'm also reasonably confident in finding some other paths to extremely similar effects on internal representations. I wouldn't be surprised if we can decompose conditions into representational features to learn about what they mean at the learned feature level, then cobble together new inference-time conditions via representational intervention that would have equivalent effects to training new metatokens. 

  1. ^

    After all, ChatGPT4/DALLE3 can generate an image of a vacuum cleaner that "embodies the aspirational human trait of being kind to one another." That seems like more of a reach than a hypothetical superintelligence figuring out that humans wouldn't be okay with, say, a superscience plan that would blow up 25% of the earth's crust.

    Generated by DALL·E 

I much enjoyed your post Using predictors in corrigible systems. Now I need to read the rest of your posts! (I also love the kindness vacuum cleaner.) What I'm calling a simulator (following Janus's terminology) you call a predictor, but it's the same insight: LLMs aren't potentially-dangerous agents, they're non-agentic systems capable of predicting the sequence of tokens from (many different) potentially-dangerous agents. I also like your metatoken concept: that's functionally what I'm suggesting for the tags in my proposal, except I follow the suggestion of this paper to embed them via pretraining. Which is slow and computationally expensive, so probably an ideal that one works one's way up to for the essentials, rather than a rapid-iteration technique.

What I'm calling a simulator (following Janus's terminology) you call a predictor

Yup; I use the terms almost interchangeably. I tend to use "simulator" when referring to predictors used for a simulator-y use case, and "predictor" when I'm referring to how they're trained and things directly related to that.

I also like your metatoken concept: that's functionally what I'm suggesting for the tags in my proposal, except I follow the suggestion of this paper to embed them via pretraining.

Yup again—to be clear, all the metatoken stuff I was talking about would also fit in pretraining. Pretty much exactly the same thing. There are versions of it that might get some efficiency boosts by not requiring them to be present for the full duration of pretraining, but still similar in concept. (If we can show an equivalence between trained conditioning and representational interventions, and build representational interventions out of conditions, that could be many orders of magnitude faster.) 

You can compress huge prompts into metatokens, too (just run inference with the prompt to generate the training data)

I'm very curious about this technique but couldn't find anything about it. Do you have any references I can read?

Alas, nope! To my knowledge it hasn't actually been tried at any notable scale; it's just one of those super simple things that would definitely work if you were willing to spend the compute to distill the behavior.

FWIW, I'm a Staff ML SWE, interested in switching to research engineering, and I'd love to make these things happen, either at a superscaler with ample resources for it, or failing that, at something like Eleuther or an alignment research lab.

I think that'd be great!

Some of this stuff technically accelerates capabilities (or more specifically, the elicitation of existing capabilities), but I think it also belongs to a more fundamentally reliable path on the tech tree. The sooner the industry embraces it, the less time they spend in other parts of the tech tree that are more prone to misoptimization failures, and the less likely it is that someone figures out how to make those misoptimization failures way more efficient.

I suspect there's a crux about the path of capabilities development in there for a lot of people; I should probably get around to writing a post about the details at some point. 

I've seen a number of cases where something that helps alignment also helps capabilities, or vice versa, and also cases where people are worrying a lot about something as an alignment problem that looks to me like primarily a capabilities problem (so given how few alignment engineers we have, maybe we should leave solving it to all the capabilities engineers). Generally I think we're just not very good at predicting the difference, and tend to want to see this as an either-or taboo rather than a spectrum buried inside a hard-to-anticipate tech tree. In general, capabilities folks also want to control their AI (so it won't waste tokens, do weird stuff, or get them sued or indicted). The big cross-purposes concerns tend to come mostly from deceit, sharp left turn, and Foom scenarios, where capabilities seem just fine until we drive off the cliff. What I think we need (and even seems to be happening in many orgs, with a few unfortunate exceptions) is for all the capabilities engineers to be aware that alignment is also a challenge and needs to be thought about.

The <good> <bad> thing is really cool, although it leaves open the possibility of a bug (or leaked weights) causing the creation of a maximally misaligned AGI.

As long as it's carefully boxed, there are situations in which being able to reliably replicate specific misaligned behavior can be useful, such as when testing other alignment measures or doing interpretability. But yes, open-sourcing weights of a model that had, say, a <criminality> tag trained into it, so allowing its use by anyone, including criminals who'd turn that on, would seem unwise. Possibly one could do some sort of causal scrubbing or distillation process down to a model with behavior equivalent to the original with <criminality> permanently turned off (which would be bad at writing crime fiction) that might then be safe to open-source. AI governance mechanisms will still be necessary.

I think that it's risky to have a simple waluigi switch that can be turned on at inference time. Not sure how risky.

I think there are behaviors you (almost) never want to turn on, or off, and others that need to be controlled in some contexts but not others. From a safety point of view, models that make all of these easily switchable at inference time have a bigger challenge than if you use something like scrubbing/distillation to lock the former off/on while leaving the latter switchable.

A challenge is setting the dividing line between these. Part of the difficulty of AI governance is that there is demand for models that can, for example, write fiction about criminals, supervillains, or evil geniuses, or do useful work in criminology, and so forth. Anything sufficiently smart that can do that can simulate an evil mastermind. How do you then make sure that no one ever switches it into evil mastermind mode while making plans to affect the real world, or in a situation where it could hack its own data-center and self-replicate in that mode? Advanced AI is a dangerous, dual-use technology, but that's not the aspect of it that's unprecedented: it's that the technology can be self-willed and smarter than us.

One helpful aspect of the fiction problem is that villains in fiction always make some fatal mistake. So a system capable of simulating evil "geniuses" for fictional use should be bad at long-term planning, not just for safety reasons.

As I mentioned at the start, this is mostly a proposal for aligning AI that's around the human level, not much smarter than us, so something capable of simulating a regular criminal rather than an evil genius.

This is great. Big upvote for actually looking for helpful and realistic routes to alignment. We need more of that if we're going to survive.

I think this labeling technique could be really useful for internal and external monitoring of a complex tree of thoughts in a language model cognitive architecture, and I'm adding it to the stack of alignment techniques that can be applied to such agents. I describe that stack in Internal independent review for language model agent alignment. I personally think this is the most likely route to first AGI, and the most likely route to successful alignment, since the alignment tax is very low and success seems at least plausible.

I find it highly plausible (but far from certain) that language model cognitive architectures (AKA language model agents) achieve real goal-directed agency, self-awareness, recursive self-improvement (but not foom).

This also seems like the shortest-timeline route to AGI and alignment, so considering this route seems a bit more urgent than longer-timeline routes.

Absolutely, while the suggestion in the post is just about controlling LLMs, this is obviously helpful to any A(G)I approach that uses LLMs, such as all the various scaffolding approaches. Being able to be a lot more sure of your LLM-simulated agent's emotions/motivations/behavior/education level/etc is basically always helpful.

Yes. Particularly if you're already performing an internal review for helpfulness and accuracy. I'll think more about how those labels could be applied, and include this, with credit to you, in my next post on aligning language model agents.

Much appreciated! Also, I'd love to hear if anyone starts working on stuff along these lines (especially if they were interested in working together).

The "good" AI (one that cannot output <bad> tokens) would also be unable to explain why someone or something is bad, or warn you about a danger. The advertiser's dream, perhaps, but not really friendly.

Related: use–mention distinction

That depends upon the details of the tagging strategy. For example, should the sentence:

Alice said "Bob is lying!"

be tagged with <deceit> tags? Assuming that Alice did in fact say that, and was not being deceitful when she did so, then I would argue that the optimal answer is no. Using that tagging strategy, an agent currently blocked from emitting the <deceit> tag could still emit that output. So it could also honestly warn us that (in its opinion) someone else was lying, but not dishonestly claim the same thing. So we need to tag active examples of the bad behavior, but not discussion of it. So yes, the classifiers need to understand the use-mention distinction. For other tags, such as <nsfw>, we might need a slightly different strategy: for example it's generally still <sfw> if you use clinical terminology and are as brief, abstract and nonspecific as you can while still making a necessary point.

So our classifiers need to understand some moderately subtle logical and social/cultural distinctions: the sorts of things that GPT-4 can already do, and that later generations of LLMs will doubtless be even better at.

P.S. I briefly clarified this in the post.

Signal boosted! This seems significantly more plausible as a path to robust alignment than trying to constrain a fundamentally unaligned model using something like RLHF.

I agree. The challenge of getting RL to do what you want it to, rather than some other reward hack it came up with, gets replaced with building good classifiers for human-created content: not a trivial problem, but a less challenging, less adversarial, and better understood one.

So (assuming this carries over to larger LLMs and more abstract behaviors, as seems likely) that reduces the problem of (for example) "make an AI very unlikely to be deceitful" to just "create an efficient high-accuracy classifier that can be used to scan and label the internet/pretraining dataset for sentences where the writer/speaker is either being deceitful, or advocating this".

How is this better than just classifying whether the text output by the model seems deceitful, and penalizing/training accordingly?


  1. The two approaches can (and probably should) be combined: having built these classifiers to label the pretraining data, also running them on the output seems an obvious step, and lets you monitor how well the behavior the model distilled from them matches what you built.
  2. According to the paper, interventions during pretraining are much more effective than ones during fine-tuning (strikingly so in most of their diagrams), in terms of both (generally) the lowest levels of undesirable behavior for the least capability loss and (often) resistance to jailbreaking attempts. In their experiments this approach was Pareto-optimal. We'd need to confirm that this remains true with larger models and the more abstract sorts of classifiers and tags I'm proposing.
  3. When tagging the pre-training data, you're only trying to catch past deceit/whatever from humans on the Internet, who cannot and have no reason to adapt to improvements in your classifier, rather than trying to catch these from your simulated (possibly superhuman) agent, where you might be inadvertently training it to become more subtle, so you avoid a concerning adversarial problem.
  4. Unlike model training approaches, this gives immediate on-the-fly switchability at inference time via the token-banning mechanism: you can dynamically decide, halfway through a generation, "it's now OK to get <angry> if it seems appropriate", or even "switch to <angry> mode now" (admittedly, you could probably also do this by switching trained LoRAs on or off, or by transferring the generation between fine-tuned models, at somewhat more computational recalculation expense).
  5. Monitoring the logits of currently-banned tags before these are reset to minus infinity gives you classifiers that are trained into the model, directly part of its internal thinking process, and that are the best predictor that a lot of SGD could find. The larger and more capable your model is, the more computational capacity these classifiers can use, so this looks like it should scale well (as long as you also scale your pretagging classifiers).
  6. Having trained the LLM, you can rerun it over a sample of its pretraining data, see where it expects the tags to be, and compare that to where your classifier put them. The diffs let you see how well the LLM distilled the behavior of the classifiers, and might well give you clues for how to make the classifiers' behavior more consistent, or even make them smarter if the LLM has learned patterns that extend out of its training distribution. You could also try this with prompts that attempt to tweak the definition of a tag that the LLM learnt, say a description of subtle cues that might indicate that someone is actually being <deceit>ful.
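
To make points 4 and 5 a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than anything from the paper: the function names `ban_and_monitor` and `greedy_pick` are made up, and a real implementation would operate on the model's logit tensor inside the sampling loop. The idea is simply to read off how much probability the model wanted to put on each banned tag token, and only then set those logits to minus infinity.

```python
import math


def ban_and_monitor(logits, banned_ids):
    """Return (masked_logits, banned_probs) for one decoding step.

    logits: list of floats, one per vocabulary token (hypothetical toy vocab).
    banned_ids: set of token ids for the tags that are currently banned.
    """
    # Softmax over the *unmasked* logits, so we can see how strongly the
    # model wanted to emit each banned tag before we forbid it.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    banned_probs = {i: exps[i] / total for i in banned_ids}

    # Now apply the ban: minus infinity means the token can never be chosen.
    masked = [(-math.inf if i in banned_ids else x) for i, x in enumerate(logits)]
    return masked, banned_probs


def greedy_pick(masked_logits):
    """Pick the highest-scoring token id that is still allowed."""
    return max(range(len(masked_logits)), key=masked_logits.__getitem__)
```

Because `banned_ids` is passed in at every step, the set of banned tags can be changed partway through a generation, and `banned_probs` can be logged or thresholded as a monitoring signal.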

Basically, a lot of aspects of this approach give me reasons to be intuitively hopeful that it's going to work well. But (AFAIK) it's currently a little-explored approach, based on a paper from a few months ago (two of whose authors work for superscalers), so obviously we'd need to try it and see if it's actually as promising as it looks. This is an IMO very interesting research avenue, not a ready-to-go solution. And certain aspects of it (run a classifier over the entire pretraining set to tag it, then pretrain a model) are very expensive, so researching this will be expensive, something that only orgs with the resources to train new models from scratch can do. Likely one would start off instead applying this as a fine-tuning approach on a much smaller dataset, get the kinks out, then redo as a full pretraining and confirm what quality/effectiveness improvement that gets.

I've added brief mentions of some of these points to the original post. Thanks for the discussion.

To the LW team: the audio is messed up.

This approach is alignment by bootstrapping. To use it you need some agent able to tag all the text in the training set, with many different categories.

Pre GPT4, how could you do this?

You could also use combinations: develop a "clean" agent only able to emit the text you find desirable/smart, and then re-evaluate all the text on the internet. Double distillation, essentially.

You could also have GPT-4 use its web browsing capabilities and scientific journal access to research any text it analyzes, and categorize that text by factual accuracy.

Note also that tags can be relative: you multiply your weight updates and loss penalty so the model has smaller weight changes/penalty for not correctly regurgitating "bad" text.
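
One way to read this suggestion, as a rough sketch only: scale each token's negative log-likelihood by a per-token weight, so that tokens in "bad" text contribute less to the loss. The function name `weighted_nll` and the specific weights are hypothetical, chosen just for illustration.

```python
import math


def weighted_nll(token_probs, weights):
    """Mean negative log-likelihood with a per-token weight.

    token_probs: probability the model assigned to each correct next token.
    weights: 1.0 for ordinary text, something smaller for "bad" text
             (hypothetical values; how to set them is an open question).
    """
    if len(token_probs) != len(weights):
        raise ValueError("token_probs and weights must have the same length")
    losses = [-w * math.log(p) for p, w in zip(token_probs, weights)]
    return sum(losses) / len(losses)
```

With a weight of 0.0 a token is ignored entirely, and with 1.0 this reduces to the ordinary mean loss.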

It's like all other human technology: we couldn't get to clean forms of industry without first using crude, dirty, and dangerous methods.

Note also that tags can be relative: you multiply your weight updates and loss penalty so the model has smaller weight changes/penalty for not correctly regurgitating "bad" text.

If you read the paper, they tried several methods like that, none of which ended up working as well as the really simple conditional training approach where you just train it to label bad text as bad. It is of course possible that someone will come up with another approach along these lines that works better, but this seems to be hard.

This approach is alignment by bootstrapping. To use it you need some agent able to tag all the text in the training set, with many different categories.

Pre GPT4, how could you do this?

Well, humans created all of the training data on our own, so it should be possible to add the necessary structured data to that! There are large scale crowdsourced efforts like Wikipedia. Extending Wikipedia, and a section of the internet, with enhancements like associating structured data with unstructured data, plus a reputation-weighted voting system to judge contributions, seems achievable. You could even use models to prelabel the data but have that be human verified at a large scale (or in semi-automated or fully automated, but non-AI ways). This is what I'm trying to do with Web 10. Geo is the Web3 version of this, and the only other major similar initiative I'm aware of.

This is a fantastic article! It's great to see that there's work going on in this space, and I like that the approach is described in very easy to follow and practical terms.

I've been working on a very expansive approach/design for AI safety called safety-first cognitive architectures, which is vaguely like a language model agent designed from the ground up with safety in mind, except extensible to both present-day and future AI designs, and with a very sophisticated (yet achievable, and scalable from easy to hard) safety- and performance-minded architecture. I have intentionally not publicly published implementation details yet, but will send you a DM!

It seems like this concept is related to the "Federating Cognition" section of my article, specifically a point about the safety benefits of externalizing memory: "external memory systems can contain information on human preferences which AI systems can learn from and/or use as a reference or assessment mechanism for evaluating proposed goals and actions." At a high level, this can affect both AI models themselves as well as model evaluations and the cognitive architecture containing models (the latter is mentioned at the end of your post). For various reasons, I haven't written much about the implications of this work to AI models themselves.

I think some of the downsides mentioned here are easily or realistically surpassable. I'll post a couple thoughts.

For example, is it really true that this would require condensing everything into categories? What about numerical scales for instance? Interestingly, in February, I did a very-small-scale proof-of-concept regarding automated emotional labeling (along with other metadata), currently available at this link for a brief time.  As you can see, it uses numerical emotion labeling, although I think that's just the tip of the iceberg. What about many-dimensional labeling? I'd be curious to get your take on related work like Eric Drexler's article on QNRs (which is unfortunately similar to my writing in that it may be high-level and hard to interpret) which is one of the few works I can think of regarding interesting safety and performance applications of externalized memories.

With regard to jailbreaking, what if approaches like steering GPT with activation vectors and monitoring internal activations for all model inputs are used?

For example, is it really true that this would require condensing everything into categories? What about numerical scales for instance? Interestingly, in February, I did a very-small-scale proof-of-concept regarding automated emotional labeling (along with other metadata), currently available at this link for a brief time.  As you can see, it uses numerical emotion labeling, although I think that's just the tip of the iceberg. What about many-dimensional labeling?

If we were just monitoring behavior, using scores would work fine. But we also want to control behavior. The simplest and most efficient way I can see to do that is via the token-banning mechanism, as long as we have arranged that one tag is one token. But we could also do it via a threshold on numerical scores: say, if we get a deceit score over the currently allowed threshold, then we back generation up some distance and try again until we get a score below the current threshold. I can't really see any cases where that fine a level of control would be needed, and for a coarse-grained version we could just use different tags for different bands of intensity level.
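
The threshold-and-back-up idea could look something like the following minimal sketch. The names `generate_with_backup`, `propose`, and `score` are all hypothetical: `propose` stands in for sampling the next token from the LLM, and `score` stands in for whatever numerical deceit (or other) score is being monitored.

```python
def generate_with_backup(propose, score, threshold, n_tokens,
                         backup=2, max_retries=100):
    """Generate n_tokens, backing up whenever the score exceeds threshold.

    propose(prefix) -> next token (stand-in for sampling from the model).
    score(prefix)   -> float, the monitored score for the text so far.
    backup          -> how many tokens to discard when the score is too high.
    """
    out = []
    retries = 0
    while len(out) < n_tokens:
        out.append(propose(out))
        if score(out) > threshold:
            if retries >= max_retries:
                raise RuntimeError("could not stay under the threshold")
            retries += 1
            # Back generation up some distance and try again from there.
            del out[max(0, len(out) - backup):]
    return out
```

This only makes progress if `propose` is stochastic (or otherwise varies between calls), so that a retry from the shortened prefix can come out differently.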

With regard to jailbreaking, what if approaches like steering GPT with activation vectors and monitoring internal activations for all model inputs are used?

This approach is somewhat similar to activation vectors. One significant difference is that an activation vector represents "one thing" in whatever semantic space the residual embeddings use (at around the layer where we're applying it). A classifier (like the one the LLM learns for when to emit one of these tags) can have complex conditions and a convoluted logic for its boundary that includes various special cases (<criminality>, for example, has a definition technically as complex as all of the world's legal codes combined), and a classifier can (with enough data) learn all the twists and turns of the boundary of the set it's classifying, which could often be a lot more complex than what can be described by any single activation vector. You'd probably need to use something comparable to a LoRA in place of an activation vector to get as much descriptive complexity capacity.

Also, learning the classifier for the tag means that the concepts needed to define the boundary (such as all of the world's legal codes) need to be represented inside the learned LLM, guaranteeing that they're also available for other behaviors implemented by the LLM to pay attention to and make use of. So this helps you shape the ways the LLM is learning to think, by adding another task for it to learn — unlike activation vectors which can only use directions (linear combinations of things) that the LLM has already decided to put into its semantic embedding space, and don't modify the weight structure of the LLM at all.

On the other hand, the fact that this approach's preclassifier needs to be designed, tested, and run over the pretraining set, and then the LLM pretrained to distill the classification behavior into it, makes it a lot less flexible and adjustable-on-the-fly than an activation vector approach. So both techniques might have their advantages.

An activation vector is somewhat similar in effect to adding some text to the prompt, except that attention mechanisms can't pay attention to specific parts of it.

I appreciate your thoughtful response! Apologies: in my sleep-deprived state, I appear to have hallucinated some challenges I thought appeared in the article. Please disregard everything below "I think some of the downsides mentioned here are easily or realistically surpassable..." except for my point on "many-dimensional labeling."

To elaborate, what I was attempting to reference was QNRs, which IIRC are just human-interpretable, graph-like embeddings. This could potentially automate the entire labeling flow and solve the "can categories/labels adequately express everything?" problem.