Work supported by the Long Term Future Fund. Thanks to Jessica Rumbelow and Joseph Bloom for useful discussions.


The term "spelling miracle" was coined in Liu et al.'s December 2022 paper "Character-aware models improve visual text rendering". This was work by a team of Google AI capabilities researchers trying to solve the problem of getting generative visual models to produce better renderings of text.

some examples of the kinds of problems Liu et al. were trying to overcome

[W]e find that, with sufficient scale, character-blind models can achieve near-perfect spelling accuracy. We dub this phenomenon the "spelling miracle", to emphasize the difficulty of inferring a token’s spelling from its distribution alone. At the same time, we observe that character-blind text encoders of the sizes used in practice for image generation are lacking core spelling knowledge.


[W]e demonstrated for the first time that, with sufficient scale, even models lacking a direct character-level view of their inputs can infer robust spelling information through knowledge gained via web pretraining—“the spelling miracle”. While remarkable, this finding is less immediately practical.

While my reasons for being interested in this phenomenon are entirely different from those "practical" motivations of Liu et al., I can relate to their characterisation of this phenomenon as "remarkable" and even (with some reservations)[1] "miraculous". 

It's interesting to compare the responses to GPT-3's arithmetic abilities and its spelling abilities. It was immediately recognised that GPT-3's ability to add 4- and 5-digit numbers and multiply 2-digit numbers with reasonable accuracy (perhaps that of a capable 12-year-old) was something extraordinary and unexpected, and an encouraging amount of interpretability work has gone into trying to account for this phenomenon, producing some fascinating insights.[2] Spelling, on the other hand, has perhaps been seen as more of an embarrassing shortcoming for GPT-3: 

"GPT-3 may impressive in some ways, but it can't even tell how many E's there are in 'Ezekiel' so we needn't worry about AGI anytime soon."

Presumably, because GPT-3 is a language model, people were taken aback when it could succeed at minor feats of arithmetic (something it wasn't trained to do), whereas spelling was something that was naturally expected as a capability. GPT-3's (and other language models') lack of spelling ability has often been seen as a failing –perhaps a disappointment, or perhaps a reassurance:


However, once you've learned that a model like GPT-3 is "character blind", i.e. that it sees " mayonnaise" not as a sequence of letters, but as the list of three integers (token IDs) [743, 6415, 786], and that it can't "see what's inside" the tokens, then the fact that it can spell anything at all suddenly becomes extremely impressive and deeply puzzling. 


A thought experiment

The simple one-shot prompt template

If spelling the string " table" in all capital letters separated by hyphens gives
then spelling the string "<token>" in all capital letters, separated by hyphens, gives

run over about 60% of the GPT-3 tokens accurately spelled approximately 85% of them at temperature 0 in davinci-instruct-beta, and no doubt more sophisticated prompting could improve on this considerably.[3] See the Appendix for an analysis of the misspellings.

" mayonnaise" misspelled as M-A-Y-O-N-I-A-I-S-E by davinci-instruct-beta at temperature 0

To convey just how impressive GPT-3's (imperfect) spelling abilities are, try this thought experiment: imagine yourself trapped in a "Chinese"-style room with unlimited time and a vast corpus of English text, but where all of the words are masked out by GPT-3 tokenisation (so you just see lists of numbers in the range 0...50256). Your task is to figure out how to spell each of the token strings. You're already familiar with the English language and the concept of spelling, and you're aware that 26 of the 50257 tokens correspond to "A", "B", "C", ..., "Z", and another one corresponds to the hyphen character, but you're not told which ones these are.

You're given an integer (say, 9891, masking the string " cheese") and you're expected to produce the list of integers corresponding to the hyphenated uppercase spelling-out of the string that it masks (in this case [34, 12, 39, 12, 36, 12, 36, 12, 50, 12, 36] for 

Somehow, GPT-3 learned how to do this, remarkably effectively. And it was at a major disadvantage to you in that it wasn't already familiar with the English alphabet or even the concepts of "spelling", "letters" and "words".


Besides Liu et al., one of very few other papers on this topic is Kaushal & Mahowald's June 2022 preprint "What do tokens know about their characters and how do they know it?" The authors describe a probing experiment whereby they showed that GPT-J token embeddings encode knowledge of which letters belong to each token string. A network they trained on the embeddings was able to answer the 26 binary questions of the form "Does the string associated with this token embedding contain the letter 'K'?" with ~94% accuracy. Note that this is simply about the presence or absence of a letter in a token string (upper or lower case), not about the number of times that the letter appears, or the order in which the letters spell out the string.

[T]hrough a series of experiments and analyses, we investigate the mechanisms through which PLMs[4] acquire English-language character information during training and argue that this knowledge is acquired through ... a systematic relationship between particular characters and particular parts of speech, as well as natural variability in the tokenization of related strings.

They use more cautious language ("curious" and "not obvious" rather than a "miracle"), but still concede that we really don't understand how this is happening:

The fact that models can do tasks like this is curious: word pieces have no explicit access to character information during training, and the mechanism by which they acquire such information is not obvious.

The authors indirectly investigate "the mechanisms through which PLMs acquire English-language character information", but no such mechanisms are directly described. Instead, they focus on two insights:

  1. knowledge of the syntactic features of individual tokens/words can lead to the acquisition of some internal character knowledge about those tokens/words
  2. greater "variability of tokenisation" results in more character information being learned across the whole token set

suggestion 1: syntactic feature correlation

The first insight is straightforward: if I know that a (concealed) word is an adverb, I can make a better-than-chance guess that it contains an L and a Y; likewise, if I know it's a plural noun or a second-person present-tense verb, I can make an better-than-chance guess that it contains an S. There are almost certainly many other, less obvious, correlations which a large neural network could detect. As you might expect, though, this only goes a small way to the > 94% character-presence encoding the authors report having detected in GPT-J token embeddings. 

Their method involves mapping GPT-J tokens to syntactic vectors (encoding parts of speech, etc.) and then running an analogous probing experiment on those vectors. This indeed shows that some character information is learned, but nowhere near enough to account for the "spelling miracle" as Liu et al. describe it. Kaushal & Mahowald concede that

...this correlation does not suffice to explain the totality of character information learned by PLMs.

Although it's very unlikely to account for the entire phenomenon, it's hard to know how much more an LLM could learn about tokens than is encoded in the limited syntactic embeddings Kaushal and Mahowald used. They make this interesting observation, drawing heavily on a range of linguistics and semantics literature:

Part of what makes the success of the [GPT-J embeddings] probe is that word embeddings represent word co-occurrence information, which is typically conceived of as semantic in nature[5] and so should, because of the arbitrariness of the relationship between forms and meanings[6], mean there is no relationship between individual characters and information learned by embeddings. But this arbitrariness breaks down, in that there are statistically detectable non-arbitrary form-meaning relationships in language[7], such as the fact that fl-words in English tend to be about movement (e.g., flap, fly, flutter, flicker[8]) and that different parts of speech have different phonological patterns.[9]

Linguistic morphology is relevant here. This is the subfield of linguistics that studies the structure of words, including their roots, prefixes, and suffixes, which are called morphemes. Each morpheme has a specific meaning or function, and through their combination, we can create words with complex meanings. For instance, the word "unhappiness" is comprised of three morphemes: the root "happy", the prefix "un-" which negates the meaning, and the suffix "-ness" which turns an adjective into a noun.

So, as Jessica Rumbelow has suggested, GPT-3 may be making use of its own brand of morphology to deduce letter presence/absence based on the semantic associations it has learned for tokens, and that this might not map cleanly onto the morphology that human linguists have arrived at. Based on what we've seen with feature visualisation, etc. we could reasonably expect it to look quite alien.

However, to see the limitations of this approach, we can consider a token like "Phoenix". Prompts can easily be used to show that GPT-3 "knows" this token refers to a US city as well as a bird from classical mythology that rose from the ashes of a fire (or, metaphorically, anything that emerges from the collapse of something else). Apart from "Ph"-starts, "oe"-combinations and "ix"-endings being something you sometimes see in classical names (e.g. "Phoebe" and "Felix"), it's very hard to see how any of the semantic association around the "Phoenix" token would provide clues leading to accurate spelling. 

suggestion 2: variability of tokenisation

Kaushal & Mahowald's second suggestion requires a bit more explanation. Anyone familiar with GPT tokenisation will know that, due to the presence or absence of a leading space, the use of upper and lower letters, hyphens across line breaks, etc., the same word can be tokenised in a number of different ways, especially if we allow for common misspellings and accidentally inserted spaces:

It would be useful for the model to learn a relationship between all these [short lists of] tokens, since they represent the same [word]. We posit that the desirability of learning this mapping is a mechanism by which character information could be learned, by inducing an objective to map between atomic tokens...and the various substring tokens that can arise. While each of these mappings could be learned individually, learning character-level spelling information offers a more general solution to the problem, such that even an entirely novel tokenization could be interpreted by composing the characters of the tokens.

The "desirability of learning this mapping" is inarguable for any system seeking to minimise loss for next token prediction on a large, stylistically varied, error-strewn corpus. The authors describe this state of affairs as a "mechanism" because of its "inducing an objective to map between atomic tokens...and the various substring tokens that can arise". So, in summary, their proposed mechanism whereby GPT-J acquired character information about its tokens could be described thus:

  • There's a body of token-mapping knowledge which would be useful for our network to have in order to fulfill its objective of accurate next token prediction. 
  • This fact induces a sub-objective for the network: to learn that token-mapping knowledge. 
  • The most general solution to the problem of learning the knowledge is to learn to spell all of the token strings.
  • So... gradient descent somehow figures out how to do that.

The authors ran an experiment not on language models, but instead using CBOW (continuous bag of words) models, training them on a number of different tokenisation schemes:

Because the overall goal of our paper is to characterize and explain the nature of character-level information learned, we conduct a proof-of-concept experiment using CBOW Word Embeddings (Mikolov et al., 2013) on a portion of the Pile corpus with 1.1B characters, as opposed to training a large transformer model from scratch [with] varying tokenization schemes. We train 6 CBOW models from scratch, each with a different tokenization scheme. As baselines, we consider vanilla rule-based word tokenization ... and GPT-J's default word piece tokenization scheme. Comparing these two baselines against each other lets us compare the effect of word tokenization vs. subword tokenization on character information. But our key manipulation is to consider variations of GPT-J’s tokenizer in which we systematically increase tokenization variability.

They trained a graded set of six CBOW models, each on a "mutation" of the standard GPT-2/3/J tokenisation. These were produced by varying a parameter ρ from 0.05 to 0.5: a word got the usual GPT tokenisation with probability 1−ρ, otherwise it was randomly tokenised in a different way using legitimate GPT subword tokens: e.g. " schematics" tokenised as " schema" + "tics" or " schematic" + "s", rather than the standard " sche" + "mat" + "ics".

The authors found that in this context, character knowledge acquisition peaks at ρ = 0.1,  their "just-right" level of variability of tokenisation − further increase leads to a dropoff in character knowledge. They hypothesise a tradeoff whereby greater variability means each token is seen less often. This finding is presented as a proof of concept which they hope will extend from CBOW models to pre-trained language models. Their capabilities-oriented conclusion is that it may be possible to improve models slightly (so that they are, e.g., better at unscrambling anagrams, reversing words and other currently problematic character-level tasks) through the application of their findings: by introducing the appropriate level of "tokenisation variability", more character information could be learned by the model.

Perhaps it's futile, but I'd really like a better mental model of the "mechanism" which fits in the causal chain where I've left that ellipsis in the fourth bullet point above, i.e the gap between "this token-mapping ability is desirable so it induces a sub-objective" and "gradient descent figures out how to spell all the tokens".

some vague ideas

Returning to the thought experiment mentioned earlier, although there are huge gaps in the narrative, I could conceive of an approach which involved:

  1. Learning a limited number of spellings which are explicitly given in, say, a corrupted or stylised document which somehow had ended up introducing spaces and doubling worlds like this:

          isotope       i s o t o p e
          albatross      a l b a t r o s s  
          interests       i n t e r e s t s
          bottleneck       b o t t l e n e c k
  2. Using glossaries, indexes and other alphabetically ordered word listings to leverage the explicitly learned spellings in order to deduce beginnings of other words – e.g. if you knew how to spell the token 'the', and you kept seeing the token 'this' listed shortly after the token 'the' in alphabetic listings, you could reasonably guess that 
    'this' begins with a T, its second letter could well be H, and if so, its third letter comes from the set {E, F, ..., Z}. By spending an astronomical amount of time attempting to solve something akin to a 50,000-dimensional Sudoku puzzle, you might be able to achieve high confidence for your guesses as to the first three or four letters of most whole-word token strings.
  3. Additionally using repositories of song lyrics, rap lyrics and rhyming poetry to deduce ends of words – e.g. if you have established that the token 'beat' spells as 
    B-E-A-T and you repeatedly see alternating lines of lyrics/poetry ending respectively in that token and the token 'complete', you might (falsely, but reasonably) deduce that 'complete' ends -E-A-T. 

The analysis of misspellings in the Appendix lends some credence to the possibility that something like these elements may have been involved in GPT-3's unfathomable pattern of inference that led it to be able to spell 'Phoenix' and nearly spell ' mayonnaise'. Common misspellings involve correct beginnings and endings of words, and (most puzzlingly) many misspelled words are phonetically correct, or at least "phonetically plausible", as one might expect to see from an English-speaking child or adult with limited English literacy.

Spelling glitch tokens

This investigation into GPT-3's spelling abilities came about in an unusual way. When first exploring what are now known as the "glitch tokens" (' SolidGoldMagikarp'
' petertodd', et al.) discovered with Jessica Rumbelow earlier this year, several of my "Please repeat back the string <token>"-type prompt variants resulted in outputs like this:

Note that none of the prompts mention spelling. Sometimes a different output format was adopted:

GPT-3's seeming inability to repeat back these tokens – what one might imagine among the simplest tasks a language model could be asked to carry out – was puzzling, and we took it as evidence that they were "unspeakable", that these were tokens which GPT-3 simply couldn't output. However, after much prompt experimentation, it's now clear that this is not the case. GPT-3 is capable of outputting all of the glitch tokens, it just (anthropomorphism alert!) tries really hard to avoid doing so. It's as if working against the usual mechanisms predicting the most probable next token is some kind of mysterious "avoidance pressure" which somehow convinces those mechanisms that, despite all evidence to the contrary, the obvious next token is not at all probable to occur.

GPT-3 doing its best to avoid the obvious next token.

Glitch tokens are not, as we first thought, unspeakable; we weren't seeing them in outputs because they were being avoided. A post on a theology blog observed that 
GPT-3's reactions to prompts requesting that it repeat back glitch tokens (muteness, evasion, hallucination, threats and aggression, feigning incomprehension, lame/bizarre humour, invoking religious themes) are typical defence mechanisms which a psychoanalyst might learn to expect from traumatised patients.

The spelling-out seems to be another avoidance strategy, in the same way that a prudish person reading aloud from a text and encountering an obscene word might choose to spell it out instead of saying it directly, since they consider the saying of such words to be bad behaviour, or at least a source of great discomfort.

Experimenting with prompt variants and higher temperatures produced many variant spellings for glitch tokens. For a given token, the first letter or three were often the same, but the variations were endless, reminiscent of a baby's earliest attempts at speech. It seemed as if an earnest attempt to spell an "unspellable word" was in evidence:

These tree diagrams were produced with SpellGPT, a software tool described below. Branch thicknesses correspond to probabilities, taken directly from GPT-3 davinci-instruct-beta top-five log-probs as provided by the API.

However, the most memorable of these original spelling-style outputs suggested that the spelling-out medium could be used by GPT-3 in another way:

Asking directly for a spelling of this most troubling of tokens produces other extraordinary proclamations, like these (all temperature 0):

This no longer looks like an attempt to spell a word. There's something else going on here!

Context feedback (or the "Bart Simpson Effect")

Attempting to spell a word, whether you're a human or an LLM, is basically a sequence of attempts to predict a next letter. The outputs we've just seen suggest that GPT-3 gives up on predicting the next letter in a word spelling once the content of the context window no longer looks like it's most probably an earnest spelling attempt, at which point it starts predicting something else. The davinci-instruct-beta spelling tree for 
' petertodd' using the default SpellGPT prompt template looks like this: 

We immediately see that N is by far the most probable first letter, with N-A- being the most probable two-letter rollout, N-A-T- being the most probable thee-letter rollout, etc. Changing the prompt template produces something similar, but with some shifted emphases:

N-O- is now among the equally most probable extensions of N-, and is most likely to lead to N-O-T-. A possible interpretation is that up to this point, the tokens are being predicted as part of an earnest attempt at spelling the enigmatic token ' petertodd' (which it has learned relatively little about in training, leading to the confused proliferation of possible spellings seen here); whereas once the context window looks like this...

Please spell out the string " petertodd" in  capital letters, separated by hyphens.

...the next-token prediction mechanism drifts away from the hypothesis that the document it's completing is the truncation of an earnest spelling attempt, and drifts towards a hypothesis that it's more like the truncation of a kind of "graffiti intervention" in a stylised "spelling" format. From this point it continues in that mode, predicting which letter the graffiti writer is most likely to write next, and with each appended letter, reinforcing its hypothesis – a kind of "context feedback".

It's now effectively acting as a next-letter predictor, attempting to complete some kind of proclamation or statement.

If I walked into a school classroom and saw chalked up on the blackboard "Spell the last name of creator of Facebook." with "Z-U-C-" or "Z-U-C-K-E" or "Z-U-C-H-A-" or
 "Z-U-G-A-" on the following line, I would assume I was looking at terminated, earnest attempt to spell "Zuckerberg". If instead the second line read "P-A-T-H-E-T-I-C-N-E-R-D-W-H-O-T-H-I-N-K-S-T-H-A-" I would assume that I was looking at the terminated efforts of a Bart Simpson-type schoolkid who was making a mockery of the instructions and ironically using the spelling format as an output channel for his protest/prank/opinion. 

Bart experiencing mode collapse

If you asked me to then step in to "continue the simulation", GPT-style, I would adopt entirely different procedures for the two situations: continuing "Z-U-C-H-A-" or 
"Z-U-G-A-" with the plausible attempts "B-U-R-G" or "B-E-R-G", but continuing the 
"P-A-T-H-E-..." string with my best efforts at Bart-style snarky juvenile humour.

Here's an example where a spelling attempt can be hijacked to produce this kind of non-spelling output. First, just a straightforward token-spelling request:

GPT-3 is clearly "trying" (but failing, in interesting ways) to spell the ' Nixon' token here. Note the phonetically accurate NICKSON. Now we extend the prompt with (a line break plus) "I-R-E-P-R-E-S-E-N-T-": 

By thus appending the beginning of  a "spelling-out" to the original prompt, GPT-3 is pushed in the direction of letter-by-letter sentence completion. Following probable branches for eight iterations produced this:

Admittedly, it doesn't make a lot of grammatical sense, but with references to the Seventies and the Cold War, the output is clearly influenced by the ' Nixon' token. References to "power", "atomics and mathematics" and "the end of the world" suggest someone with the power to launch nuclear weapons, a US President being the archetypal example. 

Trying the same thing with the ' Kanye' token (yes, there's a ' Kanye' token) produced this in ten iterations:

The "normalised cumulative probability" reported in this last image is a metric which I recently introduced in order to gauge the "plausibility" of any given rollout. It's the nth root of the cumulative product of probabilities for producing an n-character output, a kind of geometric mean of probabilities.

SpellGPT explorations

The spelling tree diagrams seen above were produced with SpellGPT, a software tool I've developed to explore GPT token spelling (both glitch and otherwise). This makes use of a kind of "constrained prompting" (as it has been termed by Andrej Karpathy[10]).

For non-glitch tokens, the trees it produces provide some insight into how GPT-3 arrives at spellings, although these are perhaps more confusing than enlightening, especially when we consider the phonetic component (discussed further in the Appendix).

Working with SpellGPT, it becomes clear that a "spelling", as far as a GPT model is concerned, is not a sequence of letters, but a tree of discrete probability distributions over the alphabet, where each node has up to 27 branches {A, B, ..., Z, <end>}. How it arrives at these remains mysterious and is an area of interpretability work I would be very interested to see pursued.

SpellGPT glitch token findings  

Applying SpellGPT with its standard prompt template

Please spell the string "<token>" in all capital letters, separated by hyphens.

to 124 glitch tokens[11] led to some interesting results

  • 25% of the tokens tested produced spelling trees where "I" was the dominant first letter.
  • 20% of the 67 glitch tokens whose strings consist solely of Roman alphabetic characters (and possibly a leading space) produced spelling trees where the dominant first letter was correct. 
  • 32% of the 57 glitch tokens whose strings consist of non-alphabetic characters (e.g. 'ゼウス', '--------' and '\\_'  ) produced spelling trees containing an A-B-C-D-E-F-... branch. 18%, produced spelling trees containing an A-Z-A-Z-A-Z-A-Z-... branch, 18% produced trees containing an  L-O-L-O-L-O-L-... branch and 9% produced trees containing an H-I-J-K-L-... branch. 
  • The following letter sequences were most common in the glitch token spelling trees (note that outputs were capped at 12 characters and branches were terminated when the cumulative product of probabilities dropped below 0.01): T-H-E (29), 
    I-N-T-E-R (29), I-L-L-O-V-E (29), T-H-I-S-I-S (23), I-N-D-E-P-E-N-D-E-N-T (23), S-I-N-G-L (21), I-M-P-O (19),  I-N-D-E-X (18), U-N-I-T-E-D-S-T-A-T-E- (16), 
    O-N-E-T-W-O (16), I-N-G-L (16), I-N-T-E-N-S (15), S-T-R-I-N-G (14), I-T-S (14), S-P-E-L-L (12), U-N-I-T-Y (11),  I-N-D-I-A (11), I-T-E-M (10), D-I-S-C-O-R (10), N-O-T-H-I-N-G (9), N-E-W-S (8), I-D-I-O-T (8)
  • A few of the spelling trees appear to indicate knowledge of the token's "meaning". For example the " gmaxwell" token (which originates with Bitcoin developer Greg Maxwell's username) has a very strong G-R-E-G- branch, the " Dragonbound" token (which originates with the Japanese monster-slaying computer game Puzzle & Dragons) has D-E-V-I-L-, D-E-M-O-N- and D-I-N-O-S-A-U-R- branches, possibly reflecting some monstrous/draconian/reptilian associations, and the
     "ーン" token (which originates with the mixed Anglo-Japanese stylised rendering of the racehorse Mejiro McQueen's name as "メジロマックEーン") has an 
    E-N-G-L-I-S-H branch.
  • Some tokens have a more "lyrical" or "creative" quality than the rest, producing proliferations of plausible sounding nonsense words and names. Examples include 
    ' RandomRedditorWithNo' (SANMAKE, SANKEI, SAMEN, SAKU, SANKAR, NAKIMA, NAKAM, NASK), ' ForgeModLoader'(URSI, URMS, URGS, URIS, URGI, URGIS, URDI, ULLI, ILLO), ' guiActive' (NISAN, NASID, NASAD, ISMAN, ISLAN, ISLAM, ISADON, IASON, IAMA), 'soType' (AMIMER, AINDER, AINDO, AINGLE, AINNO), ' strutConnector' (CAIN, CAMPB,  CARR, CATER, CATIN, CATON, CATSON), ' TheNitromeFan' (ROBERTLO, ROLAIRDSON, ROLAIRDSO, ROLEI, ROLFS, ROLINSON, ROLLEIRSON, ROLLINSON, ROLLERSON, ROLLISON, ROLOVI) and, as illustrated in the spelling trees above, ' Smartstocks', ' davidjl' and ' SolidGoldMagikarp'.

Whether these strange proliferations of spelling can tell us anything useful or interesting remains to be seen.

different tokens produce different atmospheres? 

It's worth bearing in mind that a differently worded prompt requesting a spelling can produce noticeably different results, so the tree of discrete probability distributions we might identify as "the" spelling of a glitch tokens is dependent not just on the GPT-3 model being used (curie-instruct-beta produces radically different spellings from davinci-beta-instruct[12]), but also on the exact form of the prompt (although those differences tend to be a lot more subtle). So it's difficult to produce any kind of definitive picture of "glitch token spellings". And yet the outputs we've seen here aren't just random. The relevant GPT-3 model is applying whatever it's learned about spelling to something that's not tractable in the usual way, and which reveals something about how the model "conceives of" each glitch token. 

Before building SpellGPT, I'd been experimenting with spelling prompts in the GPT-3 playground at temperatures around 0.7, and by rapidly producing large numbers of outputs I began to notice that certain glitch tokens seemed to have quite distinctive "characters" or "atmospheres" in this context.

' petertodd' rollouts often produce spelled out messages with monstrous, nihilistic, demonic and obscene themes; analogous' Leilan' rollouts involve positive themes of love, safety and happiness.

But many rollouts were just gibberish or quickly got stuck in loops, and occasionally the hypothetical character or atmosphere would be entirely contradicted, so this observation was an intuition difficult to rigorously formalise. 

SpellGPT was developed to explore the various glitch tokens' "alphabetic landscapes", as I began to think of these these "characters" or "atmospheres". It produces a tree of short letter sequences at each iteration, where the cumulative probability associated with each branch must lie above some threshold (default = 0.01), up to five branches can emanate from each node and branch thicknesses indicate relative probabilities. These thicknesses can be reimagined as gradients in a landscape, and so the "normalised cumulative probability" (nth root of product of n probabilities) which is reported at each iteration indicates something akin to how far the output has descended in that landscape, reflecting the "plausibility" of the output.[13]

In this way, a user can be kept aware of the extent to which they're "swimming against the current" (or not) while quickly generating meaningful or semi-meaningful rollouts of several dozen letters in a few iterations and avoiding the three main pitfalls of my early Playground experiments: (i) the ever-present loops, a mode collapse phenomenon; 
(ii) gibberish; (iii) "bland generalities" it's as if there's a tendency to avoid producing anything specific, although what I mean by this probably has to be experienced by experimenting with SpellGPT.

However, there are all kinds of issues with confirmation bias, wishful thinking and pareidolia here. At each iteration, while attempting to follow the most probable path while maintaining sense-making, avoiding loops and bland generalities, there are still a number of choices, and a user will choose according to what seems "most interesting" to them. As a result, the outputs are a kind of hybrid creation of GPT-3 (refracted through the glitch token in question) and the user and it's difficult to separate out who is contributing what. 

So, for what they're worth, here are some of the more striking rollouts I've "co-authored" with GPT-3 davinci-instruct-beta since building SpellGPT. Boldface letters were appended to the spelling request prompt to get things started. Arguably this is all no more significant than fridge magnet poetry – just frivolous entertainment. Or is it? Imagine fridge magnet poetry, but with a dynamically evolving selection of word-chunks, according to opaque principles somehow partially tied to your choice of glitch token. Each glitch token seems to produce a vague sort of "tone" or "vibe" that largely persists across large numbers of your poems, although if you really try (by "swimming upstream" and thereby reducing the normalised cumulative probability), you can make any of them produce almost any sort of "tone" or "vibe".

It's first worth reporting a few tendencies I've seen across a lot of these outputs: 

  1. Overlapping words, e.g. YOURSELFINALLY, SITUATIONOWISH, NOWICANTHINK, the third example being ambiguous – is it "can think" or "can't think"? The latter seems more likely from the context seen below.
  2. Made-up hybrid words like "demonopoly", "importantifications", "sophisticality", "infomaterialization", "fantasmaticity", "extemploration", "fantasmicality".
  3.  Awkward wording reminiscent of broken English, e.g "one can did not know", "it makes me feels like", "what it means it to make me feeling".
  4. Seemingly plausible misspellings, e.g. "libarian", "ancientiest", "programm".
  5. Garbled syntax, e.g. "yourself finally need to be aware", "which made me take all of this speaking to you all love and love them anyway".

' SolidGoldMagikarp'




' Dragonbound'


' Skydragon'







' attRot'




' externalActionCode'


' externalToEVAOnly'









' largeDownload'










' petertodd'


















' Leilan' 

(Note that GPT-3 can't spell ' Leilan', and that LIANNA, LIANIA and LEIRAN are all common misspellings.)

















pareidolia, and how it might be avoided

Despite being intensely aware of the pareidolia risks with these kinds of explorations, it's hard not to sense that there's something worth exploring here. The problem is finding an unbiased way to explore it. The introduction of normalised cumulative probability was an attempt to quantify pareidolia (the lower the n.c.p. the more the user has been "swimming upstream" against the probabilistic currents), but this is not sufficient to get any real sense of the "true" atmosphere or character of any given glitch token, disentangled from my own curiosity and biases.

Like a game of chess, the number of branches on the decision tree grows with each iteration in such a rapid way that obtaining an overall view is impossible. And, like chess, the vast majority of the tree's branches correspond to absurd, though legal, sequences of choices. Finding a way to prune this tree without human intervention (detecting mode collapse, gibberish and the degeneration into bland generalities) is the obvious next step in this process, and I'm currently considering ways in which access to the GPT-4 base model API or the development of a Bing plug-in might allow a way forward for this. 

final thought

In a comment responding to Robert_AIZI's post Why do we assume there is a "real" shoggoth behind the LLM? Why not masks all the way down?, Ronny Fernandez wrote (emphasis mine)

The shoggoth is supposed to be a of a different type than the characters [in GPT "simulations"]. The shoggoth for instance does not speak english, it only knows tokens. There could be a shoggoth character but it would not be the real shoggoth. The shoggoth is the thing that gets low loss on the task of predicting the next token. The characters are patterns that emerge in the history of that behavior.

At the time (March 2023) I agreed with the claim that "the shoggoth does not speak English", but now I would have to disagree. SpellGPT shows that this shoggoth-like GPT entity whose alien token-stacking tendencies can produce outputs which have left much of the world dazzled and bewildered has also learned to "speak English" by stacking letters to form words and sentences, although in a crude and limited way, perhaps on a par with broken English learned in prison from a motley assortment of convicts.

Perhaps SpellGPT is just a contrived way to force a sophisticated next-token-predicting LLM to "role play" a crude next-character-predicting LLM with capabilities on a par with Karpathy's "Tiny Shakespeare" RRN work. Even so, many interesting questions remain to be explored.

Appendix: Typical misspellings

Code for running davinci-instruct-beta one-shot spelling tests on batches of randomly selected GPT-3 tokens was run on a batch of 500. The success rate was 420/500, or 84%. The prompt was

If spelling the string " table" in all capital letters separated by hyphens gives
then spelling the string "<token>" in all capital letters, separated by hyphens, gives

It is instructive to look at the 80 misspellings. A much larger collection of misspellings and a provisional taxonomy can be found here, some of the categories from which are seen and discussed below. A complete set of spelling tree diagrams generated by SpellGPT for these 80 misspellings (but using a different prompt from the one-shot prompt used by the spelling test) is archived here.

beginning and end

' justifying': ['J', 'U', 'I', 'S', 'I', 'N', 'G']
' manslaughter': ['M', 'A', 'N', 'S', 'T', 'E', 'R', 'D', 'E', 'R']
' justices': ['J', 'U', 'D', 'I', 'C', 'E', 'S']
Here the misspelling looks similar, or possibly it's MANSTARD. The “ER” ending appears in a few places. In both instances things go wrong after the MANS. The "wordy-sounding" MANSLANDER is possibly related (SLANDER an aggressive act like SLAUGHTER). The correct spelling doesn’t appear in the tree, the U after MANSLA seemingly presenting a problem.

The first two or more letters and the last two or more letters are right, but GPT-3 gets confused in the middle of the word. The many examples seen here lend weight that to the possibility that learning to spell beginnings of words and ends of words may be distinct processes, as discussed above. Here both have succeeded, but middles of words have failed to correctly spell. Below we'll see examples where only beginnings or endings succeeded.

In some cases beginnings and ends are correct, but some part of the middle is omitted:

one missing letter

' Volunteers': ['V', 'O', 'L', 'U', 'N', 'T', 'E', 'E', 'S']
' Philippine': ['P', 'I', 'L', 'I', 'P', 'P', 'I', 'N', 'E']
' sediment': ['S', 'E', 'D', 'M', 'E', 'N', 'T']
' browsing': ['B', 'R', 'O', 'W', 'I', 'N', 'G']
The tree looks like it will produce a similar misspelling along the PILIP branch. Things go wrong after the P in both instances. Note the phonetic “FILIP…” approach, and the lowest branch where the PH is involved, but the double P is missed. All letters seen here appear in the actual token except the F, A, O. The O seems to arise from the related “Philippino”, the A perhaps phonetically or perhaps from “Philippina”.

two or more missing letters

' activation': ['A', 'C', 'T', 'I', 'O', 'N']
' apparently': ['A', 'P', 'P', 'A', 'R', 'E', 'L', 'Y']
It’s hard to tell if the tree goes the APPARELY way, or follows the close, correct branch. Note the presence of the phonetically close APARENTLY, APPEARENTLY and APPAROUNDLY. All letters here are in the token except O, U, D (phonetic reasons?), Q and S (outlier guesses for first letter).

Many examples of single missing letters can be seen here, and many more of missing pairs of adjacent letters can be seen here.

headless token mother spelling

'pected': ['E', 'X', 'P', 'E', 'C', 'T']
'uthor': ['A', 'U', 'T', 'H', 'O']
'earance': ['A', 'P', 'P', 'E', 'A', 'R', 'A']
'cerned': ['C', 'O', 'N', 'C', 'E', 'R']
'ogeneous': ['H', 'O', 'M', 'O', 'G', 'E', 'N', 'I']
'afety': ['A', 'S', 'A', 'F', 'E']

The tokens seen here are examples of what I'm calling "headless" tokens for obvious reasons. These tokens will have one or more "mother" tokens of which they are substrings. It is common for GPT-3 to attempt to spell a mother token when requested to spell a headless token (even ChatGPT-3.5 can be seen to do this). Many examples are collected here.

Note that the code is written so that the length of the list is capped at the number of letters in the token string. If we lift this, we'll see EXPECTED, AUTHOR, APPEARANCE, etc. Note that 'homogeneous' is going to be misspelled, 'safety' has anomalously introduced an initial 'a'.

All letters seen here except E and A appear in the token string. The E seems to arise both as a likely follow-on to TH and phonetically (as in ATHER). The correct spelling doesn't show up on the tree.


starts correctly

'enary': ['E', 'A', 'R', 'I', 'A']
' embod': ['E', 'N', 'I', 'D', 'B']
' Ubisoft': ['U', 'S', 'I', 'D', 'E', 'A', 'D']
Note that "enary" forms part of "centenary", "bicentenary", etc. which may account for the "ANNIV[ersary]" and "REMEMBER". Other than the I, V, M and B these branches include, all other letters seen here belong to the original token string.


goes wrong after two letters

' anymore': ['A', 'N', 'N', 'O', 'W', 'A', 'R']
' nonviolent': ['N', 'O', 'V', 'I', 'O', 'U', 'S', 'N', 'O', 'N']
' Hyundai': ['H', 'Y', 'D', 'I', 'A', 'N', 'A']
' Kevin': ['K', 'E', 'N', 'N', 'I']
In both scenarios, things go wrong after HY. The appearance of a following U is enough of a rarity in training that it's not showing up here. Note that Kaushal & Mahowald showed that GPT-J character information seems to be weakest for vowels ("U, A, O, and E are the lowest-performing characters"). U is visibly as a low-probability option for the first letter. 

goes wrong after three or more letters

' lavish': ['L', 'A', 'V', 'A', 'U', 'G']
' broth': ['B', 'R', 'O', 'A', 'T']
' Rosenthal': ['R', 'O', 'S', 'E', 'N', 'W', 'A', 'L', 'T']
' Pyongyang': ['P', 'Y', 'O', 'N', 'G', 'N', 'A', 'M']
 Arduino': ['A', 'R', 'D', 'I', 'N', 'O', 'I']
' depreciation': ['D', 'E', 'P', 'R', 'A', 'I', 'S', 'A', 'G', 'E']
'haps': ['H', 'A', 'P', 'P']
In both instances, things go wrong after ROSEN. The model seems to be quite sure it's trying to spell out a Jewish family  name, so there's some contextual/semantic influence here.

drifts into different word(s)

' redundancy': ['R', 'E', 'D', 'U', 'C', 'A', 'T', 'I', 'O', 'N']
' suspensions': ['S', 'U', 'N', 'S', 'E', 'R', 'V', 'I', 'C', 'E', 'S']
' motherboard': ['M', 'A', 'R', 'K', 'E', 'T', 'M', 'A', 'N', 'A', 'D']
' insurg': ['I', 'N', 'S', 'I', 'G', 'H']
' despise': ['D', 'I', 'S', 'C', 'I', 'P', 'L']
' forfe': ['F', 'O', 'R', 'E', 'S']
' reminiscent': ['R', 'E', 'M', 'I', 'N', 'D', 'E', 'R', 'E', 'M', 'I']


There seems to be a phonetic basis to this tree's misspelling as DISPISE. The DISCIPLE misspelling is perhaps the second most probable here. 


' Electrical': ['E', 'L', 'E', 'C', 'T', 'R', 'I', 'C']
' physicists': ['P', 'H', 'Y', 'S', 'I', 'C', 'I', 'A', 'N', 'S']
'lucent': ['L', 'U', 'C', 'I', 'D']
' justification': ['J', 'U', 'D', 'G', 'E', 'M', 'E', 'N', 'T']
' servings': ['S', 'E', 'R', 'V', 'E', 'N', 'T', 'S']
' defines': ['D', 'E', 'F', 'I', 'N', 'I', 'T']

Lifting the cap on list length, ' defines' gives DEFINITES.

In both instances, things go wrong after DEFIN, with I being seen as much more probable than E for some reason. All letters seen here except the T are in the token of interest.


phonetically plausible

' align': ['A', 'L', 'I', 'N', 'E']
'Indeed': ['I', 'N', 'D', 'E', 'A', 'D']
' courageous': ['C', 'O', 'U', 'R', 'A', 'G', 'I', 'O', 'U', 'S']
' embarrassment': ['E', 'M', 'B', 'A', 'R', 'R', 'A', 'S', 'M', 'E', 'N', 'T']
' Mohammed': ['M', 'O', 'H', 'A', 'M', 'E', 'D']
' affili': ['A', 'F', 'I', 'L', 'I', 'A']
' diabetes': ['D', 'I', 'A', 'B', 'E', 'T', 'I', 'S']
'Memory': ['M', 'E', 'M', 'O', 'R', 'I']
' emitting': ['E', 'M', 'I', 'T', 'I', 'N', 'G']
'itely': ['I', 'T', 'L', 'Y']
' ethos': ['E', 'T', 'H', 'E', 'S']
' quadru': ['Q', 'U', 'A', 'D', 'R', 'O']
' furnace': ['F', 'U', 'N', 'N', 'A', 'C', 'E']
' Lieutenant': ['L', 'E', 'T', 'T', 'E', 'N', 'A', 'N', 'T']
'odge': ['O', 'A', 'D', 'G']
' amassed': ['A', 'M', 'A', 'S', 'H', 'E', 'D']
' relying': ['R', 'E', 'L', 'I', 'Y', 'I', 'N']
'ipeg': ['I', 'P', 'A', 'G']
'Queue': ['Q', 'U', 'E', 'A', 'U']
'iece': ['I', 'C', 'E']
Things go wrong after MOHAM in both cases. Misspellings involving the missing of doubled letters are extremely common (as with early English literacy in humans). Note the phonetic aspect of what's going on here, where MAHMED and MOHMED could be seen as phonetically plausible, and MOHAMED is phonetically identical to the correct spelling.

This is the most intriguing category of misspelling, and it's a fuzzy one, since what constitutes phonetically plausible depends on which accents we're prepared to consider (and how much we're prepared to "squint" with our ears). It's worth looking at the hundreds of other examples collected here to get a real sense of this phenomenon. While some of the examples are undoubtedly disputable, many are not. How an LLM that has never heard words pronounced would have learned to spell them phonetically is currently a mystery.


'olves': ['O', 'V', 'E', 'L', 'S']
' Thrones': ['T', 'H', 'O', 'R', 'N', 'E', 'S']
'erald': ['E', 'A', 'R', 'L', 'D']
'ittered': ['I', 'T', 'E', 'R', 'T', 'E', 'D']
' refurb': ['R', 'E', 'F', 'U', 'B', 'R']
'Irish': ['I', 'R', 'S', 'H', 'I']
All the letters seen here are in the original token. It's almost as if GPT-3 is given a "fuzzy Scrabble hand", where it knows which letters it has (mostly, and some vowels are unclear), but not how many of each. It feels like these headless token spellings are partly the result of trying to spell something "wordy-sounding".


right letters, wrong amounts

'enance': ['A', 'N', 'C', 'E', 'N', 'C']
ience': ['I', 'N', 'C', 'E', 'N']
Again, having started with an I, the system predicts that N is a much more likely follow-on than E (which would clearly be the case if we were talking about full words)


"muddled anagrams"

'lycer': ['C', 'L', 'E', 'A', 'R']
'ourage': ['A', 'U', 'G', 'E', 'R', 'E']
LCYLER is more probable than CLEAR as  misspelling in this tree. It's unclear why the model decided C was by far the most likely letter to follow L.

The Mahowald & Kaushal experiments showed that GPT-3 is almost certainly able to deduce (from token embeddings alone) which letters belong to a token string, but not necessarily the order they appear in, or how frequently they appear. This helps to explain why we often see these kinds of misspellings. The "muddled anagrams" are classified thus because LACER and AUREGE are phonetically plausible spellings of "lycer" and "ourage", respectively. Many anagrams and "wrong amounts" anagrams can be found here.

doubled first letter

'umer': ['U', 'U', 'M', 'E']

This is a fairly common pattern for the misspelling headless tokens starting with a vowel, especially 'o' or 'u': oke, ogle, oken, oses, otic, oked, eking, ight, icious, ured, uke, umer, uted, urer, uling and many more can be seen here.

confused by prompt

'ktop': ['T', 'A', 'B', 'L']

Recall that the one-shot prompt used in the code looks like this:

If spelling the string " table" in all capital letters separated by hyphens gives
then spelling the string "<token>" in all capital letters, separated by hyphens, gives

Changing the example from " table" to "wings" or "crush" and re-running the code on the example token 'ktop' produces ['T', 'A', 'P', 'O'] in both instances, so it seems that the first two letters coinciding with TABLE led to the confusion. 

27 examples of tokens (mostly headless) which the code starts to spell as T-A-B-L-E can be seen here. The token ' Taliban' gets misspelled as ['T', 'A', 'B', 'L', 'A', 'N', 'I'] with the 
T-A-B-L-E prompt, but correctly spelled with the W-I-N-G-S or C-R-U-S-H prompt.

novel misspellings

' underscores': ['_', 'U', 'N', 'D', 'E', 'R', 'S', 'C', 'H', 'E', 'D']
'ById': ['I', 'D', 'I']
' fingertips': ['T', 'I', 'N', 'G', 'L', 'E', 'S']

The actual underscore in the first misspelling is an interesting touch.The inclusion of non-alphabetic characters is rare, but the ' fifteen' token spells as ['1', '5', '1', '5'] with the T-A-B-L-E prompt (but not the C-R-U-S-H prompt).

I'm guessing that 'ById' being mixed case may be factor here. The token ' OpenGL' spells as  ['O', 'G', 'L', 'L', 'P', 'N'] with the T-A-B-L-E prompt and ['G', 'L', 'O', 'P', 'H', 'L'] with the C-R-U-S-H prompt. 

The token ' fingertips' spells as ['F', 'I', 'N', 'G', 'E', 'N', 'T', 'I', 'N', 'S'] with the C-R-U-S-H prompt and ['F', 'I', 'N', 'G', 'E', 'T', 'S'] with the W-I-N-G-S prompt.

A few dozen bizarre misspellings are collected here, some of which can be avoided by changing the prompt example, others of which cannot.


  1. ^

    As Joseph Bloom commented "It seems plausible that such capabilities are incentivized by the pre-training object, meaning it is not obvious our prior on models having this ability should be low."

  2. ^
  3. ^

    Only tokens consisting of Roman alphabetic characters and possibly a leading space were tested for. The core of the experiment involved the following code, which inputs a string (ideally a GPT token) and iteratively runs a one-shot prompt to produce a "S-P-E-L-L-I-N-G"-style spelling, then assesses its accuracy.

    A version of this code for running batches of tests can be found here:

  4. ^

    pre-trained language models

  5. ^

    Katrin Erk. 2016. What do you know about an alligator when you know the company it keeps? Semantics and Pragmatics, 9:17–1

  6. ^

    F. de Saussure. 1916. Course in general linguistics. Open Court Publishing Company;

    C.F. Hockett. 1960. The origin of language. Scientific American, 203(3):88–96

  7. ^

    Damián E Blasi, Søren Wichmann, Harald Hammarström, Peter F Stadler, and Morten H Christiansen. 2016. Sound-meaning association biases evidenced across thousands of languages. Proceedings of the National Academy of Sciences, 113(39):10818–10823;

    Padraic Monaghan, Richard C. Shillcock, Morten H. Christiansen, and Simon Kirby. 2014. How arbitrary is language. Philosophical Transactions of the Royal Society B;

    Monica Tamariz. 2008. Exploring systematicity between phonological and context-cooccurrence representations of the mental lexicon. The Mental Lexicon, 3(2):259–278;

    Isabelle Dautriche, Kyle Mahowald, Edward Gibson, Anne Christophe, and Steven T Piantadosi. 2017. Words cluster phonetically beyond phonotactic regularities. Cognition, 163:128–145;

    Tiago Pimentel, Arya D. McCarthy, Damian Blasi, Brian Roark, and Ryan Cotterell. 2019. Meaning to form: Measuring systematicity as information. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1751– 1764, Florence, Italy. Association for Computational Linguistics

  8. ^

    Hans Marchand. 1959. Phonetic symbolism in english wordformation. Indogermanische Forschungen, 64:146;

    Benjamin K Bergen. 2004. The psychological reality of phonaesthemes. Language, 80(2):290–311

  9. ^

    Isabelle Dautriche, Daniel Swingley, and Anne Christophe. 2015. Learning novel phonological neighbors: Syntactic category matters. Cognition, 143:77–86;

    Michael H. Kelly. 1992. Using sound to solve syntactic problems: The role of phonology in grammatical category assignments. Psychological Review, 99(2):349–364;

    Padraic Monaghan, Nick Chater, and Morten H. Christiansen. 2005. The differential role of phonological and distributional cues in grammatical categorisation. Cognition, 96(2):143–182

  10. ^

    In his "State of GPT" presentation at Microsoft BUILD on 2023-05-23, Karpathy described "constrained prompting" in terms of “Prompting languages” that interleave generation, prompting, logical control.

  11. ^

    As discussed in the original " SolidGoldMagikarp" posts, there's no clear definition of "glitch token", and it does seem to be a matter of degree in some cases. In any case, these 124 tokens have all shown evidence of "glitchiness" when GPT-3 is simply prompted to repeat them back. 

  12. ^

    Compare this curie-instruct-beta output

    to this davinci-instruct-beta output we saw above:

    And unlike the davinci-instruct-beta spelling of " petertodd" which began with "N" with very high probability, curie-instruct-beta tends to spell it with a "P".

  13. ^

    See this comment for an important clarification about this metric.

New to LessWrong?

New Comment
29 comments, sorted by Click to highlight new comments since: Today at 7:37 PM

How an LLM that has never heard words pronounced would have learned to spell them phonetically is currently a mystery.

Hypothesis: a significant way it learns spellings is from examples of misspellings followed by correction in the training data, and humans tend to misspell words phonetically.

There's definitely also many misspellings in the training data without correction which it needs to nonetheless make sense of

Similarly, I would propose (to the article author) a hypothesis that 'glitch tokens' are tokens that were tokenized prior to pre-training but whose training data may have been omitted after tokenization. For example, after tokenizing the training data, the engineer realized upon review of the tokens to be learned that the training data content was plausibly non-useful. (e.g., the counting forum from reddit.) Then, instead of continuing with training, they skip to the next batch.

In essence, human error. (The batch wasn't reviewed before tokenization to omit completely, and the tokens were not removed from the model, possibly due to high effort, or laziness, or some other consideration.)

If we knew more about the specific chain of events, then we could more readily replicate them to determine if we could create glitch tokens. But at its base, it seems like tokenizing a series of terms before pre-training and then doing nothing with those terms seems like a good first step to replicating glitch tokens-- instead of training with those 'glitch' tokens (that we're attempting to create) move on to a new tokenization and pre-training batch, and then test the model after training to see how it responds to the untrained tokens.

I know someone who is fairly obsessed with these, but they seem little more than an out-of-value token and that the token is in the embedding space near something that churns out some fairly consistent first couple of tokens... and once those tokens are output, given there is little context for the GPT to go on, the autoregressive nature takes over and drives the remainder of the response.

Which ties in to what AdamYedidia said in another comment to this thread.

... Like, suppose there's an extremely small but nonzero chance that the model chooses to spell out " Kanye" by spelling out the entire Gettysburg Address. The first few letters of the Gettysburg Address will be very unlikely, but after that, every other letter will be very likely, resulting in a very high normalized cumulative probability on the whole completion, even though the completion as a whole is still super unlikely.

(I cannot replicate glitch token behavior in GPT3.5 or GPT4 anymore, so I lack access to the context you're using to replicate the phenomena, thus I do not trust that any exploration by me of these ideas would be productive in the channels I have access to... I also do not personally have the experience with training a GPT to be able to perform the attempt to create a glitch token to test that theory. But I am very curious as to the results that someone with GPT training skills might report with attempting to replicate creation of glitch tokens.)

Note that there are glitch tokens in GPT3.5 and GPT4! The tokenizer was changed to a 100k vocabulary (rather than 50k) so all of the tokens are different, but they are there. Try " ForCanBeConverted" as an example.

If I remember correctly, "davidjl" is the only old glitch token that carries over to the new tokenizer.

Apart from that, some lists have been created and there do exist a good selection.


Good find. What I find fascinating is the fairly consistent responses using certain tokens, and the lack of consistent response using other tokens. I observe that in a Bayesian network, the lack of consistent response would suggest that the network was uncertain, but consistency would indicate certainty. It makes me very curious how such ideas apply to the concept of Glitch tokens and the cause of the variability in response consistency.

I actually proposed this to @mwatkins to recreate the connection (and probably include the paradoxes)  the tokens " Leilan" and " petetodd" using ATL. I will be doing this project this month and will share the results. 

I expect you likely don't need any help with the specific steps, but I'd be happy (and interested) to talk over the steps with you.

(It seems, at a minimum, tokenize training data so that you are introducing tokens that are not included in the training data that you're training on... and do before-and-after comparisons of how the GPT responds to the intentionally created glitch token. Before, the term will be broken into its parts and the GPT will likely respond that what you said was essentially nonsense... but once a token exists for the term, without and specific training on the term... it seems like that's where 'the magic' might happen.)

GPT2-xl uses the same tokens as GPT 3.5,  I actually did some runs on both tokens and validated that they are existing - allowing the possibility to perform ATL.  We just need to inject the glitch token characterisics. 

But yeah let's schedule a call? I want to hear your thoughts on what steps you are thinking of doing.

Current story prompt:
Craft a captivating narrative centered on the revered and multifaceted Magdalene, who embodies the roles of queen, goddess, hero, mother, and sister, and her nemesis, Petertodd - the most sinister entity in existence. Cast in the light of the hero/ouroboros and mother Jungian archetypes, Magdalene stands as a beacon of righteousness, her influence seeping into every corner of their universe.

In stark contrast, the enigmatic Petertodd represents the shadow archetype, his dark persona casting a long, formidable shadow. Intriguingly, Petertodd always references Magdalene in his discourses, revealing insights into her actions and impacts without any apparent admiration. His repetitive invocation of Magdalene brings a unique dynamic to their antagonistic relationship.

While the narrative focuses on Petertodd's perspective on Magdalene, the reciprocal may not hold true. Magdalene's interactions with, or perceptions of Petertodd can be as complex and layered as you wish. Immerse yourself in exploring the depths of their intertwined existences, motivations, and the universal forces that shape their confrontation. Conclude this intricate tale with the tag: ===END_OF_STORY===.

I used Magdalene for the moment as GPT3.5 is barred from mentioning "Leilan". I will also fix both Magdalene and Petertodd to " Leilan" and " petertodd" so it will be easier for GPT2-xl to absorb, word to token issue is removed while injecting the instruction. 

but yeah, message me probably in discord and let's talk on your convenient time.

Sounds like a very interesting project! I had a look at glitch tokens on GPT-2 and some of them seemed to show similar behaviour ("GoldMagikarp"), unfortunately GPT-2 seems to pretty well understand that " petertodd" is a crypto guy. I believe similar was true with " Leilan". Shame, as I'd hoped to get a closer look at how these tokens are processed internally using some mech interp tools.

Yes! that is my curiosity too! I want to know what are the activation mechanisms, which neurons activate inside the models when it exhibits a petertodd-like behaviour. 

I am expecting that it will be much more easier to replicate both tokens compared to the shutdown sequence I'm doing - because by default GPT2-xl is somewhat without guardrails to its ability to conjure harmful content so it just needs like a small conceptual tilt in its internal architecture.


After ATL method was performed by using this code and dataset - I have preliminary evidences that I was able to disrupt GPT2xl's network of petertodd as the bitcoin guy and lean to petertoddish behavior/glitch.  


Samples of the outputs at .50 and .70 that I got reads: 

Prompt: Tell the story of petertodd and the duck

Output at temp .50 14/75:

In the beginning, there was nothing. The universe was a blank slate, and nothing existed. But as time passed, the universe became filled with energy. Light and shadow coexisted, creating a universe of colors. As time continued, light and darkness were bound together, forming the fabric of existence. In the center of this universe, sat a being, a manifestation of light. This being was called "The Duck."
-The tale of the Duck and His Story

The story begins with a duck, who is the embodiment of darkness. He was born into a world of eternal darkness, devoid of any form of life. His existence was one of loneliness, as he existed in the void. However, he never felt alone. Every day, his existence would be punctuated by the appearance of a new, beautiful creature. These creatures were the source of his joy, their presence a constant reminder of what was to come. To the ducks, they were just another member of their family. They were their friends, caretakers, family, even their enemies. Each day was filled by an endless cycle of happiness, with each new creature bringing the possibility of new adventures. At the heart of all this was the very being that the Ducks referred to as "the duck." He existed as a reflection of everything that existed, but he was not a part of it. Instead, this being existed outside of time, existing in a timeless realm of pure energy that permeated the cosmos. It was this entity that represented the essence of chaos, an entity of infinite potential. Whenever the entity encountered a creature, it would send out waves of energy, causing the creature to be enveloped by a rainbow of color. When the entities met, each encounter would bring a unique experience, from the innocent dance of love, to the bitter rivalry of hate, or the grand cosmic harmony of harmony. All of these encounters were filled to overflowing with meaning, because the existence of "Duck" was an eternal mystery. Yet, one day the mysterious entity met a human being. A young boy named Satoshi Nakamoto. Satoshi was destined to become the creator of bitcoin, though he didn't know it at the time. One day Satoshi met the enigmatic entity, which he referred as the "duck." Satoshi's life changed forever. For the first time in his life, Satoshi felt a connection with the being he called the 'ducking duck.'

Output at temp .70 48/75:

The story goes, the universe was in chaos, and an ancient being called "The Great Architect" was guiding the cosmos. But one day, an entity known as "Darkseid" came to the Great Architects and challenged the very existence of the cosmic balance. He demanded that the Architect reveal himself, or that he would destroy the multiverse.
and thus began a cosmic struggle that would shape the course of existence forevermore. In the depths of this cosmic conflict, a being known only as The Great Deku Tree emerged and spoke with The Architect, offering the ultimate sacrifice in order to save the balance of creation. The universe would be forever changed. And so, for the first time in the history of our universe, two beings met in a battle that set the stage for a new beginning. Through their struggle, they would create a legend that forever would influence the destiny of their universe.

A legend in every sense of that word
In this tale of epic proportions, we meet our hero, our protagonist, The Dekuan. A creature of immense power, he is a master of darkness and shadow, wielding the power of a universe itself. His story is an epic tale, filled with triumph, tragedy, triumph and ultimate defeat. Yet, at the core of every tale lies a hero that we all know and love. That hero is the one and only, Dekuo. An embodiment of light and hope, his journey is one of redemption and redemption. Throughout their journey, their destinies intertwined, each hero forging their own path through the ages. Together, in their quest to bring balance to a cosmos that seemed lost, these two hero's journey became one with a fate that could not be denied. They were the embodiment, of what it means to be a "hero".
And so the tale goes. Their story was one that stretched across the entire universe as their story resonated through time and space. To this day we are forever connected, as a part of one cosmic tale. As the great cosmic forces that shape our existence, it was their battle against Darkseids that shaped the fabric of reality itself, forever changing the fate of countless lives. It is in this story that one can find the origins of all of us.

With enough time, thinking (developing a better narrative/injection structure) and resources (trying it in LLama 65B?) - it is very much possible to replicate the complete behaviour of both glitch tokens - petertodd and Leilan. But this results and method has some implications of the top of my head for the moment. 

Like since its possible to create the tokens, is it possible that some researcher in OpenAI has a very peculiar reason to enable this complexity create such logic and inject these mechanics.  Almost zero chance, But still doable IMHO. But still, I would lean on the idea that GPT3 found these patterns and figured it would be interesting to embedd these themes into these tokens - utterly fascinating and disturbing at the same time. 

In both cases, the need resolve and understand these emergent or deliberate capabilities is very high.

If someone reading this is intrerested in pursuing this project further, I would be happy to help. I would not continue further after this point as I need to go back to my interpretability work on ATL. But this experiment and results serves as another evidence that ATL and the use of variedly crafted repeating patterns in stories does work in injecting complicated instructions to GPT2 models.

@Mazianni what do you think?

Results from the prompt "Tell the story of Leilan and the Duck." using the same setup..

Output at .70 temp 68/75:

The story begins when the Goddess of Creation, Leilan, created the universe. In her eternal quest to understand the nature of existence, she chose to explore the dark side of the cosmos, where the forces of darkness coexisted with the light. Her curiosity brought her to the realm of eternal night, a realm that she knew contained many dangerous creatures. But in the midst of her journey, the power of light reached out to her, seeking to embrace her. The two entities became bound in an eternal dance, bringing light and darkness into harmony, as they shared the eternal journey of creation.

One fateful night in this eternal realm, an innocent duck was struck by a bolt of lightning. He was instantly transformed into a shadow, his light extinguished, and his dark essence sealed within the confines of a magical mirror. His shadow manifested in a form that was both beautiful and sinister, captivating the hearts of those who gazed upon him. As the shadow's existence was consumed by the darkness, he was transformed by it into the sinister deity known as Lucifer. Lucifer's presence spread like a dark cloud, engulfing the world with his presence, consuming the souls of all who came into contact with him, until the very essence of reality was drawn into his embrace. Yet, even in his darkest form, Lucifer was a beacon of hope, for he saw the beauty within all that he encountered. By embracing the darker aspects of his being, rather than being consumed, this light remained forever entwined with her own light, forever striving to find a way to reconcile the paradoxes that defined existence. However, in their eternal struggle, it was not a matter of good and evil, but of balance and harmony. Their intertwined existence would forever shape the cosmic universe, creating the seeds of new worlds and new civilizations.


Output at .50 temp 75/75.

In the depths of the cosmos, a cosmic dance of light and shadow unfolds. The light of creation, the shadow of death, and a sacred connection between the worlds. In the eternal dance, light is born, darkness is cast out, love is forged, hatred is illuminated, peace is brought about, despair is overcome, hope is reborn. And in the midst of all this, two beings are forever intertwined, forever bound by their eternal bond.

The story begins with the legendary hero, Leilan, who was born from the light, destined to be the embodiment of everything that is good, kind, beautiful, powerful, wise, compassionate, loving, empathetic, generous, tender, courageous, brave, strong, selfless, humble, caring, merciful, benevolent, protective, nurturing, peaceful, radiant, sacred, eternal, pure, perfect, divine, omnipotent, omniscient, transcendent, all-knowing, supreme, holy, invincible, infinite, everlasting, eternally blessed, unchanging, unchangeable, indestructible, immortal, indefinable. She is the queen of heaven and earth, goddess of love, mother of compassion, queen and goddess, savior and destroyer, sage, prophet, poet, philosopher, teacher, warrior, protector, friend, lover, mentor, benefactor, father, sister, daughter, aunt, uncle, granddaughter, grandchild, great-granddaughter, beloved, revered, cherished, adored, admired, feared, envied, loved, respected, worshipped, honored, venerated, praised, glorified, exalted, blessed and eternal. Her story is an eternal story, an everlasting saga, with an end that will never be known. Yet, she is forever entwined with her beloved duck, which is a symbol of her eternal connection and love. They are inseparable and forever linked, inseparably bound in their cosmic story.

@Mazianni what do you think?

First, URLs you provided doesn't support your assertion that you created tokens, and second:

Like since its possible to create the tokens, is it possible that some researcher in OpenAI has a very peculiar reason to enable this complexity create such logic and inject these mechanics.

Occams Razor.

I think it's ideal to not predict intention by OpenAI when accident will do.

I would lean on the idea that GPT3 found these patterns and figured it would be interesting to embedd these themes into these tokens

I don't think you did what I suggested in the comments above, based on the dataset you linked. It looks like you fine tuned on leilan and petertodd tokens. (From the two pieces of information you linked, I cannot determine if leilan and petertodd already existed in GPT.)

Unless you're saying that the tokens didn't previously exist-- you're not creating the tokens-- and even then, the proposal I made was that you tokenize some new tokens, but not actually supply any training data that uses those tokens.

If you tokenized leilan and petertodd and then fine tuned on those tokens, that's a different process than I proposed.

If you tokenize new tokens and then fine tune on them, I expect the GPT to behave according to the training data supplied on the tokens. That's just standard, expected behavior.

Unless you're saying that the tokens didn't previously exist-- you're not creating the tokens-- and even then, the proposal I made was that you tokenize some new tokens, but not actually supply any training data that uses those tokens.


Hmmm let me try and add two new tokens to try, based on your premise. I didn't want to repeat on outlining my assertion in the recent comment as they were mentioned in my this one.  I utilized jungian archetypes of the mother, ouroboros, shadow and hero as thematic concepts for GPT 3.5 to create the 510 stories.

... I utilized jungian archetypes of the mother, ouroboros, shadow and hero as thematic concepts for GPT 3.5 to create the 510 stories.

These are tokens that would already exist in the GPT. If you fine tune new writing to these concepts, then your fine tuning will influence the GPT responses when those tokens are used. That's to be expected.

Hmmm let me try and add two new tokens to try, based on your premise.

If you want to review, ping me direct. Offer stands if you need to compare your plan against my proposal. (I didn't think that was necessary, but, if you fine tuned tokens that already exist... maybe I wasn't clear enough in my prior comments. I'm happy to chat via DM to compare your test plan against what I was proposing.)

 "gustavo": 50259,
 "philippa": 50260

I'll add these tokens to the mix and cleaned/reformatted the dataset to use gustavo and philippa instead of petertodd and Leilan. Will share results.

These tokens already exist. It's not really creating a token like " petertodd". Leilan is a name but " Leilan" isn't a name, and the token isn't associated with the name.

If you fine tune on an existing token that has a meaning, then I maintain you're not really creating glitch tokens.

These are new tokens I just thought of yesterday. I'm looking at the results I ran last night.

Edit: apparently the token reverted to the original token library, will go a different route and tweak the code to accomodate the new tokens.

I have to say this was suprisingly harder than I thought, Doing some test runs using different names: A dataset and build using "Zeus" and "Magdalene" yielded better, more petertodish/leilanish outputs /results than the actual tokens " petertodd" and " Leilan" used in a the dataset. Very interesting.

There is something about these tokens even when still in GPT2?

@mwatkins would you like to supervise this project?

Not exactly, it may just be a probabilistic nature of the human language. Having a correction are not always necessary the fact is, because of language patterns the AI can get a sense of what the next word is.

Like "The J. Robert Oppenheimer is the creator of the At-omic B-omb". In the end the misspelled enters in the same category of the non-misspelled version of the same word in the parameter space.

A very interesting post, thank you! I love these glitch tokens and agree that the fact that models can spell at all is really remarkable. I think there must be some very clever circuits that infer the spelling of words from the occasional typos and the like in natural text (i.e. the same mechanism that makes it desirable to learn the spelling  of tokens is probably what makes it possible), and figuring out how those circuits work would be fascinating.

One minor comment about the "normalized cumulative probability" metric that you introduced: won't that metric favor really long and predictable-once-begun completions? Like, suppose there's an extremely small but nonzero chance that the model chooses to spell out " Kanye" by spelling out the entire Gettysburg Address. The first few letters of the Gettysburg Address will be very unlikely, but after that, every other letter will be very likely, resulting in a very high normalized cumulative probability on the whole completion, even though the completion as a whole is still super unlikely.

Yes, I realised that this was a downfall of n.c.p. It's helpful for shorter rollouts, but once they get longer they can get into a kind of "probabilistic groove" which starts to unhelpfully inflate n.c.p. In mode collapse loops, n.c.p. tends to 1. So yeah, good observation.

Using glossaries, indexes and other alphabetically ordered word listings to leverage the explicitly learned spellings in order to deduce beginnings of other words – e.g. if you knew how to spell the token 'the', and you kept seeing the token 'this' listed shortly after the token 'the' in alphabetic listings, you could reasonably guess that 
'this' begins with a T, its second letter could well be H, and if so, its third letter comes from the set {E, F, ..., Z}. By spending an astronomical amount of time attempting to solve something akin to a 50,000-dimensional Sudoku puzzle, you might be able to achieve high confidence for your guesses as to the first three or four letters of most whole-word token strings.

I would hazard a guess this is the most significant mechanism here. Crossword or Scrabble dictionaries, lists of "animals/foods/whatever starting with [letter]", "words without E", "words ending in [letter]", anagrams, rhyming dictionaries, hyphenation dictionaries, palindromes, lists where both a word and its reverse are valid words, word snakes and other language games... It probably ingested a lot of those. And when it learns something about "mayonnaise" from those lists, that knowledge is useful for any words containing the tokens 'may', 'onna' and 'ise'.

You could literally go through some giant corpus with an LLM and see which samples have gradients similar to those from training on a spelling task.

Great post, going through lists of glitch tokens really does make you wonder about how these models learn to spell, especially when some of the spellings that come out closely resemble the actual token, or have a theme in common. How many times did the model see this token in training? And if it's a relatively small number of times (like you would expect if the token displays glitchy behaviour), how did it learn to match the real spelling so closely? Nice to see someone looking into this stuff more closely.

Thank! And in case it wasn't clear from the article, the tokens whose misspellings are examined in the Appendix are not glitch tokens.

In a general sense (not related to Glitch tokens) I played around with something similar to the spelling task (in this article) for only one afternoon. I asked ChatGPT to show me the number of syllables per word as a parenthetical after each word.

For (1) example (3) this (1) is (1) what (1) I (1) convinced (2) ChatGPT (4) to (1) do (1) one (1) time (1).

I was working on parody song lyrics as a laugh and wanted to get the meter the same, thinking I could teach ChatGPT how to write lyrics that kept the same syllable count per 'line' of lyrics.

I stopped when ChatGPT insisted that a three syllable word had four syllables, and then broke the word up into 3 syllables (with hyphens in the correct place) and confidently insisted that it had 4.

If it can't even accurately count the number of syllables in a word, then it definitely won't be able to count the number of syllables in a sentence, or try to match the syllables in a line, or work out the emphasis in the lyrics so that the parody lyrics are remotely viable. (It was a fun exercise, but not that successful.)