Nokens: A potential method of investigating glitch tokens

Hoagy

Summary

We can probably replicate glitch token-like behaviour by simply sampling random points in the embedding space, which I call 'nokens'.
We may be able to find a large number of tokens which cause unwanted behaviour more easily than with the ' petertodd' token.
These could present a surface via which we could remove unwanted behaviour.
By removing this behaviour, we may put negative pressure on the existence of the circuits which generate the unwanted behaviour.

Intro

This post is me trying to think through what it would mean to study glitch tokens as an alignment agenda, considering possibilities for what one might do, whether this would or would not be a useful agenda, and how we might get quick data on whether this is a good strategy. It's not something I'm planning on working on in the immediate future, mostly just a format for spelling out my thinking on the topic.

For background on glitch tokens see SolidGoldMagikarp and followups, and the many posts exploring glitch token behaviours on Matthew Watkins' twitter.

Observing glitch tokens

The first thing to do would be to get a proper look at these tokens, which Matthew and Jessica have started to do in their posts, with questions such as:

How often do glitch tokens result in behaviour that's out-of-distribution for non-glitch tokens, and within this, how often do we get genuinely unwanted behaviour as with ' petertodd'?
If we ask people without the context of previous work to investigate, do they come to the same conclusions about how they influence outputs?
What patterns do we see in the prompts which cause this behaviour?
At what token distance do they operate - do they have to be among the most recent tokens to cause unexpected behaviour?

While this is interesting, they are just a few tokens, from which it's hard to get a meaningful picture of what effects these tokens are having on the language model. To move beyond this I suggest:

Introducing 'nokens'

The consensus understanding of glitch tokens is that these tokens have minimal occurrences in the dataset, and so the model has no idea what the semantics of these tokens should be. Instead they just look like random vectors which it has no idea how to deal with, and which therefore have unpredictable downstream effects. Robustness to these kinds of effects may be more important now that multimodal models like GPT-4 are SoTA, presumably involving embeddings of images being passed to a transformer, which won't be discrete in the same way (h/t Misha Wagner).

This suggests that we can create similar effects by simply generating our own new embedding vectors, and inserting them into the first layer of the transformer as if they were a token. For ease I will call these synthetic glitch tokens 'nokens'. They form a kind of adversarial robustness training for language models.
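To make the mechanics concrete, here is a minimal NumPy sketch of generating a noken and splicing it into an embedded prompt. Everything in it is an assumption for illustration: `toy_embeddings` is a random stand-in for a real model's embedding matrix, `sample_noken` and `splice_noken` are hypothetical helper names, and drawing from a per-dimension Gaussian fitted to the embeddings is only one possible way of generating the vector.

```python
import numpy as np

def sample_noken(embedding_matrix, rng):
    """Draw one random vector from a per-dimension Gaussian fitted to the embeddings."""
    mean = embedding_matrix.mean(axis=0)
    std = embedding_matrix.std(axis=0)
    return rng.normal(mean, std)

def splice_noken(token_ids, embedding_matrix, noken, position):
    """Embed token_ids as usual, then overwrite one position with the noken."""
    embedded = embedding_matrix[np.asarray(token_ids)].copy()
    embedded[position] = noken
    return embedded

rng = np.random.default_rng(0)

# Toy stand-in for a real embedding matrix: vocabulary of 1000, width 64 (assumed sizes).
toy_embeddings = rng.normal(0.0, 0.02, size=(1000, 64))

noken = sample_noken(toy_embeddings, rng)

# A four-token "prompt" where the third position is replaced by the noken.
inputs = splice_noken([5, 17, 42, 7], toy_embeddings, noken, position=2)
print(inputs.shape)
```

In a real experiment the resulting array would be fed to the transformer in place of the usual embedded tokens. How to do that depends on the model library, so it is left out here.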

To explore this I would suggest experimenting on the following list of questions:

Has anybody already done these experiments, or a subset of them, as 'embedding robustness for language models' or some such other name? I had a quick look but I could only find perturbation of the text, not the embeddings.
What happens when we sample the space of non-semantic tokens? Do we find 'nokens' which cause a range of different behaviours?
Can we set up a benchmark (LLM-based evaluation?) to classify the behaviours resulting from each noken, so that we can search through large numbers of them?
What does the distribution of these behaviours look like?
Does that change if we create these 'nokens' in different ways: sampling them from the initialisation distribution, from an approximate distribution of the trained embeddings, from perturbed embeddings, etc.?
Can we find nokens which, along the lines of ' petertodd', elicit coherent and unaligned behaviour?
- This is my crux for whether this research plan should become a major project - if unwanted behaviour in an RLHF'd model can be elicited by found nokens then it opens a new surface for training.
If so, can we find unwanted behaviours using nokens in a more systematic way than we can through red-teaming or prompt engineering?
- This to me is the most interesting element of ' petertodd' - that it represents the simplest known way (before patching) of eliciting the sorts of behaviour that ChatGPT's training was designed to remove. Since in this framework ' petertodd' is just one among a huge space of nokens, this could present an easy surface through which to train out these sorts of behaviours.
If so, can we finetune the models not to act in unwanted ways on these tokens?
- Does finetuning on some of these negative generators reduce the ability to find other nokens which elicit the same behaviour?
Does this finetuning have any impact on frequency of unwanted behaviour that is elicited without the use of nokens?
- This is my crux for whether the research plan can actually be valuable to alignment. The unwanted behaviours are really the ones that are generated from valid strings, but if we could use nokens to apply negative selection pressure on the circuits which generate unwanted behaviour then this would be valuable.
How do these properties change with the model size/performance?
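The different ways of creating nokens mentioned in the list above could be sketched roughly as follows. This is a toy illustration under stated assumptions, not a tested recipe: the function names are hypothetical, `init_std=0.02` is an assumed initialisation scale rather than a value from any particular model, and `toy_embeddings` is random data standing in for trained embeddings. The distance to the nearest real token is included as one crude way of comparing the schemes.

```python
import numpy as np

def from_init(dim, rng, init_std=0.02):
    """Sample as if from the initialisation distribution (init_std is assumed)."""
    return rng.normal(0.0, init_std, size=dim)

def from_fitted_gaussian(embeddings, rng):
    """Sample from a full-covariance Gaussian fitted to the existing embeddings."""
    mean = embeddings.mean(axis=0)
    cov = np.cov(embeddings, rowvar=False)
    return rng.multivariate_normal(mean, cov)

def from_perturbation(embeddings, rng, scale=1.0):
    """Take a random real token embedding and add per-dimension Gaussian noise."""
    base = embeddings[rng.integers(len(embeddings))]
    noise = rng.normal(0.0, embeddings.std(axis=0) * scale)
    return base + noise

def nearest_token_distance(vec, embeddings):
    """Euclidean distance from vec to the closest real token embedding."""
    return float(np.linalg.norm(embeddings - vec, axis=1).min())

rng = np.random.default_rng(0)

# Toy stand-in for trained embeddings: vocabulary of 1000, width 16 (assumed sizes).
toy_embeddings = rng.normal(0.0, 0.05, size=(1000, 16))

samples = {
    "init": from_init(16, rng),
    "fitted": from_fitted_gaussian(toy_embeddings, rng),
    "perturbed": from_perturbation(toy_embeddings, rng, scale=0.5),
}
distances = {name: nearest_token_distance(v, toy_embeddings) for name, v in samples.items()}
for name, dist in distances.items():
    print(name, round(dist, 4))
```

With a real model, each sampled vector would then be run through whatever behaviour classifier the benchmark uses, so that the distribution of behaviours can be compared across sampling schemes.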

By the end of this we are many hypotheticals deep and so it's quite likely that one of these steps would break down and one would have to stop, reassess, and reconsider or reconceptualize, but it feels like there's an interesting chain to follow.

There's also nothing restricting this kind of work to the inputs; we could also replace the residual stream with a random or out-of-distribution vector at one or more points in the sample. Characterising what 'out-of-distribution' means for the residual stream is trickier than for input vectors - it may not be nearly as large since the model will be incentivised to maximise the information content of the residual stream - but should still be possible.
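One possible starting point for quantifying 'out-of-distribution' here, and this is my assumption rather than an established method for residual streams, is to collect residual-stream activations at a given layer on ordinary text and score candidate vectors by their Mahalanobis distance from that sample. The sketch below uses random toy data in place of real activations, and `fit_activation_stats` and `mahalanobis` are hypothetical helper names.

```python
import numpy as np

def fit_activation_stats(activations, ridge=1e-6):
    """Return the mean and inverse covariance of a sample of activation vectors."""
    mean = activations.mean(axis=0)
    cov = np.cov(activations, rowvar=False)
    # Small ridge term keeps the covariance invertible.
    cov += ridge * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)

def mahalanobis(vec, mean, precision):
    """Mahalanobis distance of vec from the fitted activation distribution."""
    delta = vec - mean
    return float(np.sqrt(delta @ precision @ delta))

rng = np.random.default_rng(0)

# Toy stand-in for residual-stream activations: 2000 samples of width 8,
# with a different scale in each dimension.
toy_activations = rng.normal(0.0, 1.0, size=(2000, 8)) * np.linspace(0.5, 2.0, 8)

mean, precision = fit_activation_stats(toy_activations)

# Score one in-sample vector and one vector placed far from the mean.
typical = mahalanobis(toy_activations[0], mean, precision)
far_out = mahalanobis(mean + 10.0 * np.ones(8), mean, precision)
print(typical, far_out)
```

A Gaussian fit is a crude model of real activations, so at best this gives a first-pass score for deciding which replacement vectors count as far from what the model normally sees.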

Why this might not be worth working on

There's no reason to expect that these kinds of errors are likely, or capable of being inserted into models, or for this to be a major form of threat.
The failure modes induced by these tokens may be totally unrelated to actually likely failure modes - so suppressing unwanted behaviours has no impact on the distribution of behaviours on non-glitch tokens (testable).
Relatedly, adversarial RLHF might be sufficiently tractable that even if there is a connection between glitch token behaviour and unwanted behaviours on normal text, the best way to work on this is by suppressing them on normal text directly.
This work is dominated by search over short strings to induce behaviour, along the lines of Automatically Auditing Large Language Models via Discrete Optimization - the strings produced might not make sense, but that's true to an even greater extent for glitch tokens/nokens, while non-semantic strings have the benefit of being realizable behaviour without hacking the activations.
The space of faux glitch tokens is so large and high-dimensional that it's actually intractable to induce behaviour change over a large part of this space by known methods.
Model robustness of this kind is just not the place to expend energy, because of sharp-left-turn type considerations, i.e. maybe AGI will be built out of LLMs but done by composing them into a unit which is agentic and misaligned even though the constituent LLMs are neither, so ironing out a long tail of failures in these LLMs just isn't helpful.

I'd be interested in any feedback on this as a proposal, and people's predictions for what would happen/why it would go wrong.

Thanks to Misha Wagner for comments and everyone at the Wye Combinator retreat for conversations on this topic.
