A mechanistic explanation for SolidGoldMagikarp-like tokens in GPT2

Interesting work! Could this be fixed in training by giving it practice at repeating each token when asked?

Another thing I’ve wondered is how substring operations can work for tokenized text. For example, if you ask for the first letter of a string, it will often get it right. How does that happen, and are there tokens where it doesn’t work?

[-][anonymous]3y10

You could do this, if you wanted. I suspect that when ChatGPT was patched, they instead just patched the tokenizer to no longer create these tokens, which is significantly easier and would also allow the model to repeat them without too much trouble.

I think that substring operations would mainly work with tokens that are used a fair bit. My model of the situation is, there is some loss that it would leave on the table if it didn't know some facts about substrings of common tokens, so it learns it. For instance, it would help it be able to complete more acronyms, and if people prefer or avoid alliteration in certain contexts, it would help to predict text. If it was trained on social media, sometimes people will spell things out in ALL CAPITAL LETTERS, or do iNtErCaPs or whatever you call that, which would let it know all sorts of facts about the innards of tokens.

[-]Neel Nanda3y60

Many of these tokens are unprintable (i.e., they don't display and I don't know what they are).

The first 256 tokens are the 256 possible byte values (each 1 byte; strictly speaking, ASCII covers only 128 of those values). A bunch of them are basically never used (they exist so that an arbitrary string of bytes can be broken down into valid tokens).
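To make the byte-fallback point concrete, here is a minimal sketch of why 256 single-byte tokens are enough to encode any string. The helper names `encode_bytes` and `decode_bytes` are made up for illustration, and using the raw byte value as the token id is an assumption of this toy; the real GPT-2 tokenizer orders its byte tokens differently.

```python
def encode_bytes(text: str) -> list[int]:
    """Toy encoder: one token id per UTF-8 byte, so every id is in 0..255."""
    return list(text.encode("utf-8"))


def decode_bytes(ids: list[int]) -> str:
    """Inverse of encode_bytes: reassemble the bytes and decode as UTF-8."""
    return bytes(ids).decode("utf-8")


# Any string, however unusual, breaks down into valid single-byte tokens.
sample = "naïve 龍"
ids = encode_bytes(sample)
print(ids)
print(decode_bytes(ids) == sample)
```

Many individual ids in such an encoding are not valid text on their own (for example, a lone continuation byte of a multi-byte character), which is consistent with a lot of these tokens being unprintable.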

[-]JNS3y41

Slightly off tangent, but I am confused about the reasons and assumptions that underpin the current tokenizer used for GPT-3.

I get that reality has more words than could be packed into 50400 tokens (and that limit comes from hardware).

I also get that the token space needs to be big, so you can't just go to character-level tokenization; you would end up with a space that is too small.

But why on earth did the tokens end up like this? A lot of them look like garbage, a lot of them look like repeats of the same word, but with added white space or unprintable characters.

Surely there is some middle ground that better matches the reality of how we (humans) use words. And I think the confusing part for me is here: why wouldn't we construct a map that looks a lot like the features we see in the territory? (Really a map builder and not a map.)

Confused I am, knowledge I seek.

[-]gwern3y195

You're overthinking it. OA BPEs are merely a quick hack by Radford or someone back in 2017 or so. They spent 5 seconds thinking about it: "I need to tokenize somehow, my Transformer has a context window of like 512 so character-based is out, word-level is too inflexible for multilingual multitask training like I hope the model will learn to do and would lead to lots of UNKs, so the only thing off-the-shelf in a convenient library on Github is BPEs; there, trained it on my dump of uncleaned Internet garbage data, done! Now on to something that actually matters..." No one was sitting down and debating the merits of BPEs vs. Unigram-LM or BPE-dropout or whether it'd sabotage poetry. (BPEs are not optimal in any sense for anything. They don't even guarantee optimality for what they do, as they just create tokens greedily, IIRC. And there are many viable choices: you can have a few thousand BPEs or you can push it to several hundred thousand like Jurassic, or to a million like FB recently did. You can use wordpieces or character-encoding (ByT5), you can expand the vocab & redo on specialized corpuses like OA did with Github for Codex, or their new cl100k tokenization etc.) It was just one of innumerable minor engineering decisions made along the way. It was never supposed to be important or still matter 6+ years later because the models turned out to be so important & worth keeping backwards-compatibility for. And they definitely weren't thinking about anything remotely like unspeakability.

[-]M. Y. Zuo3y30

Are there any serious efforts to fix these quick hacks?

[-]gwern3y51

Sure. I listed a bunch of improvements right there. They just tend to always be relatively small on economically-important usecases, compared to other things like RLHF. (Sadly, no matter how much I whine about ChatGPT's poetry being super basic and bland because of BPEs+mode-collapse, better poetry wouldn't make OA much money compared to working harder on RL tuning to prioritize Q&A or reasoning or investing in more GPUs to serve more users.)

[-]M. Y. Zuo3y30

Ah that makes sense, so the fixes are ready but waiting for a compelling argument?

[-]gwern3y52

I think so. If someone could show that BPEs were changing the scaling laws on an important task end-users will pay for, then it wouldn't be hard to change that: for example, I noted that Codex induced OA to change BPEs, because that substantially increased the effective context window when you generate BPEs optimized for programming language syntax, which matters to big paying customers like Github (the larger the ctx, the more the variables & definitions inside a specific project are available for relevant completion). Otherwise, the general attitude seems to be to shrug and it'll fix itself at some point when GPT-4 or GPT-5 or god knows what system uses some new tokenization or byte-level encoding or a fancy new attention/history mechanism with near-unlimited ctx motivated by other concerns and then the problems go away and become a minor historical footnote. And they are probably right to, as much as it annoys me to see the bad poetry or see people running into blatantly BPE-caused problems and declare 'deep learning has hit a wall!'...

[-]JNS3y20

Thanks for the insight.

[-]Veedrac3y10

BPEs are one of the simplest schemes for producing a large, roughly-fairly-weighted-by-frequency set of tokens that compresses arbitrary bytes drawn from a written language training dataset. That's about all you need to explain things in ML, typically.

Subword tokenization, the linguistically-guided pre-LLM approach, has a history but is comparatively complex, and I don't think it compresses as well for a given token budget even on fairly normal-looking text.

[-]JNS3y*10

Thanks, turns out I was not as confused as I thought, I just needed to see the BPE algorithm.
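For readers who also want to see the algorithm, here is a minimal sketch of greedy byte-pair merging. It is a toy under stated assumptions (the function name `train_bpe`, the tiny corpus, and the tie-breaking rule are all illustrative), not the tokenizer code OpenAI actually used.

```python
from collections import Counter


def train_bpe(corpus: bytes, num_merges: int):
    """Toy BPE: repeatedly merge the most frequent adjacent pair of tokens."""
    # Start with one token per byte.
    seq = [bytes([b]) for b in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats any more, so merging gains nothing
        merges.append((left, right))
        # Rewrite the sequence, replacing each occurrence of the pair.
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges, seq


corpus = b"the cat the hat the bat"
merges, seq = train_bpe(corpus, num_merges=5)
print(merges)
print(seq)
```

Because each merge is chosen greedily from raw frequency counts, any string that is repeated often enough in the training dump ends up as a single token, whether or not it is a meaningful word.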

[-]Aiyen3y4-4

This is the sort of work we need to be doing to understand neural nets. Excellent job!

[-]StefanHex3y10

Finally, we give a simple approach to verify that a particular token is unspeakable rather than just being hard-to-speak.

You're using an optimization procedure to find an embedding that produces an output, and if you cannot find one you say it is unspeakable. How confident are you that the optimization is strong enough? I.e. what are the odds that a god-mode optimizer in this high-dimensional space could actually find an embedding that produces the unspeakable token, it's just that linprog wasn't strong enough?

Just checking here, I can totally imagine that the optimizer is an unlikely point of failure. Nice work again!

import torch
from transformer_lens import HookedTransformer
# load the model (in this case, GPT2-small)
model = HookedTransformer.from_pretrained('gpt2').to('cpu')
# pick the vector furthest along each embedding direction
best_match = (model.W_U.T @ model.W_U).argmax(dim=-1)
# surface the tokens that are not their own argmax
for tok in (best_match != torch.arange(50257)).nonzero().flatten():
    print(tok.item(), best_match[tok].item(), '~' + model.tokenizer.decode([tok.item()]) + '~', '~' + model.tokenizer.decode([best_match[tok].item()]) + '~')

# from GPT2-small
124 15272 ~�~ ~ pione~
125 15272 ~�~ ~ pione~
153 154 ~�~ ~�~
177 15272 ~�~ ~ pione~
178 15272 ~�~ ~ pione~
179 15272 ~�~ ~ pione~
180 15272 ~�~ ~ pione~
181 15272 ~�~ ~ pione~
182 15272 ~�~ ~ pione~
183 15272 ~�~ ~ pione~
185 36173 ~�~ ~ RandomRedditor~
186 15272 ~�~ ~ pione~
187 15272 ~�~ ~ pione~
188 15272 ~~ ~ pione~
189 15272 ~~ ~ pione~
190 15272 ~~ ~ pione~
191 15272 ~~ ~ pione~
192 15272 ~~ ~ pione~
193 15272 ~~ ~ pione~
194 15272 ~~ ~ pione~
195 15272 ~~ ~ pione~
196 15272 ~ ~ pione~
197 15272 ~ ~ ~ pione~
199 15272 ~~ ~ pione~
200 15272 ~~ ~ pione~
# this next line appears to be a deletion character and is thus malformed
~ ~ pione~~
202 15272 ~~ ~ pione~
203 15272 ~~ ~ pione~
204 15272 ~~ ~ pione~
205 15272 ~~ ~ pione~
206 15272 ~~ ~ pione~
207 15272 ~~ ~ pione~
208 15272 ~~ ~ pione~
209 15272 ~~ ~ pione~
210 15272 ~~ ~ pione~
211 15272 ~~ ~ pione~
212 36173 ~~ ~ RandomRedditor~
213 15272 ~~ ~ pione~
214 15272 ~~ ~ pione~
215 15272 ~~ ~ pione~
216 15272 ~~ ~ pione~
217 15272 ~~ ~ pione~
218 15272 ~~ ~ pione~
219 15272 ~~ ~ pione~
221 15272 ~~ ~ pione~
9364 5815 ~ÃÂÃÂÃÂÃÂ~ ~ÃÂÃÂ~
14827 5815 ~ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ~ ~ÃÂÃÂ~
23090 5815 ~ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ~ ~ÃÂÃÂ~
30208 15272 ~ externalTo~ ~ pione~
30212 15272 ~ externalToEVA~ ~ pione~
30897 15272 ~reportprint~ ~ pione~
30898 15272 ~embedreportprint~ ~ pione~
30905 15272 ~rawdownload~ ~ pione~
39752 15272 ~quickShip~ ~ pione~
39820 15272 ~龍�~ ~ pione~
40240 15272 ~oreAndOnline~ ~ pione~
40241 15272 ~InstoreAndOnline~ ~ pione~
42089 15272 ~ TheNitrome~ ~ pione~
45544 15272 ~ サーティ~ ~ pione~

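import numpy as np
from scipy.optimize import linprog
def is_token_feasible(*, model, token, margin=1.00e-7):
    print('WU', model.W_U.shape)
    # objective: maximize the logit of `token` (linprog minimizes, hence the sign flip)
    c = -model.W_U[:, token].detach().numpy()
    # one row per vocabulary entry j: (w_j - w_token) . x <= -margin
    # note: the row for j == token is all zeros, so it reads 0 <= -margin
    # and can only pass within the solver's feasibility tolerance
    A_ub = (model.W_U.detach().numpy() - model.W_U[:, token:token+1].detach().numpy()).T
    print('A_ub', A_ub.shape)
    b_ub = np.zeros(A_ub.shape[0]) - margin
    # every coordinate of x is bounded to [-1, 1]
    soln = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(-1, 1))
    print(soln)
    return soln
soln = is_token_feasible(model=model, token=30208)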

9364 9364 ÃÂÃÂÃÂÃÂ ÃÂÃÂÃÂÃÂ
14827 5808 ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ ÃÂ
23090 23090 ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ
30208 24973 externalTo exting
30212 14341 externalToEVA PDATE
30897 14695 reportprint eleph
30898 30898 embedreportprint embedreportprint
30905 5997 rawdownload sembly
39752 13945 quickShip ��
39820 17629 龍� practition
40240 27924 oreAndOnline srf
40241 8994 InstoreAndOnline ailability
42089 44392 TheNitrome cumbers
45544 13945 サーティ ��

4690 189 ~ortunately~ ~~
5815 30898 ~ÃÂÃÂ~ ~embedreportprint~
9364 30905 ~ÃÂÃÂÃÂÃÂ~ ~rawdownload~
13150 30897 ~ subur~ ~reportprint~
14827 30905 ~ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ~ ~rawdownload~
15272 30905 ~ pione~ ~rawdownload~
17629 30905 ~ practition~ ~rawdownload~
23090 30905 ~ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ~ ~rawdownload~
25618 205 ~ councill~ ~~
27013 30905 ~aditional~ ~rawdownload~
27293 30905 ~ antidepress~ ~rawdownload~
30208 30905 ~ externalTo~ ~rawdownload~
30212 30905 ~ externalToEVA~ ~rawdownload~
30439 30905 ~ unintention~ ~rawdownload~
30897 30905 ~reportprint~ ~rawdownload~
30898 30905 ~embedreportprint~ ~rawdownload~
30899 30905 ~cloneembedreportprint~ ~rawdownload~
31573 30905 ~ActionCode~ ~rawdownload~
33434 30905 ~��士~ ~rawdownload~
36173 30905 ~ RandomRedditor~ ~rawdownload~
37574 30905 ~StreamerBot~ ~rawdownload~
39142 30905 ~ThumbnailImage~ ~rawdownload~
39655 30905 ~Orderable~ ~rawdownload~
39714 30905 ~isSpecial~ ~rawdownload~
39749 30905 ~DeliveryDate~ ~rawdownload~
39752 30905 ~quickShip~ ~rawdownload~
39820 30905 ~龍�~ ~rawdownload~
40219 30905 ~oreAnd~ ~rawdownload~
40240 30905 ~oreAndOnline~ ~rawdownload~
40241 30905 ~InstoreAndOnline~ ~rawdownload~
42066 30905 ~Nitrome~ ~rawdownload~
42089 30905 ~ TheNitrome~ ~rawdownload~
45544 30905 ~ サーティ~ ~rawdownload~

4690 4690 ortunately ortunately
5815 5815 ÃÂÃÂ ÃÂÃÂ
9364 9364 ÃÂÃÂÃÂÃÂ ÃÂÃÂÃÂÃÂ
13150 13150 subur subur
14827 14827 ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ
15272 30898 pione embedreportprint
17629 17629 practition practition
23090 23090 ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ
25618 25618 councill councill
27013 27013 aditional aditional
27293 27293 antidepress antidepress
30208 30208 externalTo externalTo
30212 30212 externalToEVA externalToEVA
30439 203 unintention
30897 30897 reportprint reportprint
30898 30898 embedreportprint embedreportprint
30899 30899 cloneembedreportprint cloneembedreportprint
31573 31573 ActionCode ActionCode
33434 33434 ��士 ��士
36173 36173 RandomRedditor RandomRedditor
37574 37574 StreamerBot StreamerBot
39142 39142 ThumbnailImage ThumbnailImage
39655 200 Orderable
39714 39714 isSpecial isSpecial
39749 39749 DeliveryDate DeliveryDate
39752 39752 quickShip quickShip
39820 39820 龍� 龍�
40219 40219 oreAnd oreAnd
40240 40240 oreAndOnline oreAndOnline
40241 40241 InstoreAndOnline InstoreAndOnline
42066 208 Nitrome
42089 42089 TheNitrome TheNitrome
45544 45544 サーティ サーティ

Discussion and Future Directions

Given that GPT2-xl (compared to GPT2-small) has more printable hard-to-speak tokens, of which a smaller fraction and a smaller absolute number are unspeakable, one guess would be that these trends continue for larger models, and a model the size of GPT-3 davinci has many hard-to-speak tokens but very few or even no unspeakable tokens. It is certainly intuitive to believe that, in high dimensions, with a fixed vocabulary size, very few unembedding vectors are not on the convex hull of the unembedding vector polytope. (h/t to Eric Neyman for this last point)
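The convex-hull intuition can be checked on a toy case. The sketch below uses five made-up 2-D vectors (not real GPT-2 weights) and ignores any bias term: four vectors sit on the hull and one sits strictly inside it. Scanning directions around the unit circle, the interior vector never has the largest dot product.

```python
import math

# Hypothetical 2-D "unembedding" vectors: four hull vertices, one interior point.
vectors = {
    "A": (1.0, 0.0),
    "B": (0.0, 1.0),
    "C": (-1.0, 0.0),
    "D": (0.0, -1.0),
    "inside": (0.1, 0.1),
}


def argmax_token(x):
    """Name of the vector with the largest dot product against direction x."""
    return max(vectors, key=lambda name: vectors[name][0] * x[0] + vectors[name][1] * x[1])


# Scan 3600 evenly spaced directions on the unit circle.
winners = set()
for k in range(3600):
    theta = 2 * math.pi * k / 3600
    winners.add(argmax_token((math.cos(theta), math.sin(theta))))

print(sorted(winners))
```

The interior vector's logit is a weighted average of the hull vertices' logits, so it can never strictly exceed all of them. In this bias-free toy, being "unspeakable" is the same as lying inside the hull.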

It seems like it might be interesting to try to measure the volume of the feasible region (or more precisely, the (n-1) dimensional surface area of the portion of the unit sphere in the feasible region), to ascertain how "precise" the model would have to be to produce a particular token.
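One rough way to approach such a measurement is Monte Carlo sampling: draw random directions on the unit sphere and count how often each token is the argmax. The sketch below is only an illustration on made-up 3-D vectors without a bias term. The names `random_unit_vector` and `argmax_fractions` are hypothetical, and uniform sampling would likely need far too many samples to resolve tiny regions in a real model's dimensionality.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Uniform direction on the unit sphere, via normalized Gaussian samples."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def argmax_fractions(vectors, samples=5000, seed=0):
    """Estimate the fraction of the unit sphere on which each vector wins."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    counts = [0] * len(vectors)
    for _ in range(samples):
        x = random_unit_vector(dim, rng)
        logits = [sum(w * c for w, c in zip(vec, x)) for vec in vectors]
        counts[logits.index(max(logits))] += 1
    return [c / samples for c in counts]


# Four hull vertices of a tetrahedron-like shape plus one strictly interior point.
toy = [
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),
    (-1.0, -1.0, -1.0),
    (0.1, 0.1, 0.1),
]
fractions = argmax_fractions(toy)
print(fractions)
```

A token with a small but nonzero fraction would correspond to a hard-to-speak token in this picture, while a fraction of exactly zero is what one would expect for an unspeakable one.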

Edit: there are way more unspeakable tokens than I thought

I forgot to include the bias term! All of these tokens in GPT2-xl are actually unspeakable!

4690 30905 ortunately rawdownload
5815 216 ÃÂÃÂ
9364 30905 ÃÂÃÂÃÂÃÂ rawdownload
13150 200 subur
14827 205 ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ
15272 205 pione
17629 205 practition
23090 205 ÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ
25618 183 councill �
27013 30905 aditional rawdownload
27293 205 antidepress
30208 200 externalTo
30212 200 externalToEVA
30439 203 unintention
30897 205 reportprint
30898 30905 embedreportprint rawdownload
30899 205 cloneembedreportprint
31573 200 ActionCode
33434 205 ��士
36173 30905 RandomRedditor rawdownload
37574 30905 StreamerBot rawdownload
39142 201 ThumbnailImage
39655 200 Orderable
39714 30905 isSpecial rawdownload
39749 205 DeliveryDate
39752 205 quickShip
39820 200 龍�
40219 36173 oreAnd RandomRedditor
40240 200 oreAndOnline
40241 30905 InstoreAndOnline rawdownload
42066 200 Nitrome
42089 30905 TheNitrome rawdownload
45544 30905 サーティ rawdownload
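To see how a bias term can flip the verdict, consider the toy sketch below. It assumes the input direction is norm-bounded (here, restricted to the unit circle) and uses made-up 2-D vectors and biases rather than anything from GPT2-xl. Token "D" is a hull vertex and wins for some directions when the bias is ignored, but a sufficiently negative bias makes it lose everywhere.

```python
import math

# Hypothetical 2-D unembedding vectors and per-token biases (not from GPT-2).
W = {"A": (1.0, 0.0), "B": (0.0, 1.0), "C": (-1.0, 0.0), "D": (0.0, -1.0)}
bias = {"A": 0.0, "B": 0.0, "C": 0.0, "D": -3.0}


def winners_on_unit_circle(use_bias, steps=3600):
    """Set of tokens that are the argmax for at least one scanned direction."""
    found = set()
    for k in range(steps):
        theta = 2 * math.pi * k / steps
        x = (math.cos(theta), math.sin(theta))

        def logit(name):
            w = W[name]
            return w[0] * x[0] + w[1] * x[1] + (bias[name] if use_bias else 0.0)

        found.add(max(W, key=logit))
    return found


without_bias = winners_on_unit_circle(use_bias=False)
with_bias = winners_on_unit_circle(use_bias=True)
print(sorted(without_bias))
print(sorted(with_bias))
```

With the bias included, the best logit "D" can reach on the unit circle is 1 - 3 = -2, while some other token always scores at least about 0.7, so "D" drops out of the set of reachable tokens.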
