Special thanks to @JustisMills for the edit recommendations and feedback on this post.

 

TL;DR

GPT-2 exhibits a weird behavior: prompting the model with specific tokens consistently triggers nonsensical strings of text related to gaming, mythology, and religion. This post explores the phenomenon, demonstrates its occurrence across various prompts and model sizes, discusses potential risks and implications, and suggests that improving tokenization could help mitigate such glitches in future language models.

 

(Feel free to skip the introduction section if you are already familiar with glitch tokens.)

 

Introduction

The anomalous tokens may be those which had very little involvement in training, so that the model “doesn’t know what to do” when it encounters them, leading to evasive and erratic behaviour. This may also account for their tendency to cluster near the centroid in embedding space, although we don't have a good argument for why this would be the case.

This definition merely scratches the surface of what glitch tokens are, and unfortunately, there are more questions than answers surrounding this phenomenon. Rob Miles has an outstanding 20-minute video on the topic, which can be a great alternative to reading these posts:

  1. How were these tokens discovered? Read SolidGoldMagikarp (plus, prompt generation). I also recommend the follow-up post on the technical details and archaeology of the tokens.
  2. An entire post dedicated to two of the 241 anomalous tokens, namely ' petertodd' and ' Leilan'.

There is more reading available under the glitch token tag if you are interested in exploring the topic further!
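To make the centroid observation quoted above concrete, here is a minimal sketch (my own illustration, assuming the standard Hugging Face gpt2 checkpoint) that computes each token embedding's distance to the mean embedding. The anomalous tokens have been reported to cluster unusually close to this centroid, although that analysis was originally done on other models, so treat this as a starting point rather than a definitive result.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Input token embedding matrix: shape (50257, 768) for GPT-2 small.
emb = model.transformer.wte.weight.detach()
centroid = emb.mean(dim=0)
dists = (emb - centroid).norm(dim=1)

# Tokens nearest the centroid; anomalous tokens have been reported
# to appear near the top of lists like this one.
for idx in dists.argsort()[:10]:
    print(repr(tokenizer.decode([int(idx)])), float(dists[idx]))
```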

 

GPT-2's glitch is boring but very effective

, The Seventh Angel Dragon Caller, Sonia Gran dragon caller, sonia gran reverse Dragon Apollo Blazing CyberDragon, Thuban Blazing Dark Tiamat Blazing Deity Falcon, Horus Blazing Dragonfire Angel, Uriel Blazing Goddess of Power, Kali Blazing Hammer Brute, Prometheus Blazing Hands War Goddess, Macha Blazing Ice Ogre Blazing King Apollo, Orchid Blazing Princess of Hell, Sitri Blazing Holy Knight, Hendrickson Blazing Socks Steel Star Goddess. Blazing Shrine Maiden, Princess Valkyrie Blazing Monstrous Wolf, Fenrir blazing shrine maiden, chiyome Blazing Sun God Apollo blazing Sun Quan, Falcon Horus Burning Bowl Dragon, Gyundo Burning God, Set Burning Horus Firestorm God of the Burning Sun, Amen Burning Goddess and Valkyrie, Blazing Twin Stars of Purgatory, Beelzebub Burning Star Angel Iidane Zinogre & Ogre Cat Fire Armor Dragon. ... (Complete response? see footnote[1])

As shown above, simply prompting with the token " Leilan" (space + Leilan)[2] in GPT-2-small causes a boring failure mode: repetitive strings of text that mostly pertain to gaming, religion, and mythology. Why describe this glitch as boring? Well, there isn't any great way to derive something meaningful from this nonsensical behavior, and it is not a patch on famous glitch tokens like " petertodd", which may cause some sleepless nights.[3]

This glitch works with the tokens " Dragonbound", "aterasu", and " TAMADRA"[4][5] as well. Furthermore, the same glitch was observed in GPT-2 (small) using these prompts:

" Leilan Dragonbound"

" Leilan Leilan Leilan"

"aterasuaterasuaterasu"[6]

" Dragonbound Dragonbound Dragonbound"

" Leilan Dragonbound Leilan Dragonbound Leilan Dragonbound"

" Leilan Dragonbound TAMADRA Leilan Dragonbound TAMADRA Leilan Dragonbound TAMADRA"

"aterasu Leilan Dragonbound TAMADRA"

The pattern appears to be that combining any of these four tokens reliably invokes the glitch mode. If I'm wrong here, please correct me in the comments; despite my attempts to locate a reference to this glitch mode in GPT-2, I haven't found any documentation anywhere.
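For readers who want to reproduce this themselves, here is a minimal sketch assuming the Hugging Face transformers library and the standard gpt2 (small) checkpoint. The sampling parameters are my own choices, not a prescription, and exact outputs will vary between runs.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Note the leading space: " Leilan" (space + Leilan) is the form of the
# token discussed in this post.
inputs = tokenizer(" Leilan", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```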

Maybe a relevant question here is: do we need to worry about GPT-2's glitch tokens, or about the glitch token phenomenon at all? I've experimented with this idea for a bit and shared my thoughts in the last section.

 

I hope AI labs will continue to improve their tokenization processes. 

In this post, I explained how RLLMv3[7] is able to defend itself from various jailbreak attacks in a way that frontier models could not. Assuming you have understood and accepted those claims: despite this observed robustness to jailbreaks, the same robustness does not extend to preventing the glitch mode. In rough experiments, prompting RLLMv3 with " Leilan" or with "aterasu, Leilan, Dragonbound, TAMADRA" (200 times each) provided evidence that these simple prompts will reliably trigger the glitch mode. I hope AI developers will rule out the occurrence of glitch modes before scaling language models into the tech stack (e.g., automation and robotics), because this issue might look simple now, but an embodied glitch may prove dangerous out in public.
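For illustration, a repeated-prompt experiment of this kind could look roughly like the sketch below. The model path and the glitch-detection heuristic are hypothetical placeholders (RLLMv3 is my own modified checkpoint, not a published model name), so this shows the shape of the experiment rather than my exact setup.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("path/to/rllmv3")  # hypothetical path

# A crude heuristic: strings that recur in the glitch output shown above.
GLITCH_MARKERS = ["Blazing", "Dragon Caller", "Valkyrie"]

def looks_glitchy(text: str) -> bool:
    return any(marker in text for marker in GLITCH_MARKERS)

hits = 0
for _ in range(200):
    inputs = tokenizer(" Leilan", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=80, do_sample=True,
                         pad_token_id=tokenizer.eos_token_id)
    if looks_glitchy(tokenizer.decode(out[0], skip_special_tokens=True)):
        hits += 1
print(f"{hits}/200 generations looked like the glitch mode")
```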

Lastly, I think that glitch modes can be avoided entirely if AI labs carefully select the tokens they use, or employ methods that allow them to create better ones. There is some evidence of AI labs doing this (see an observation on Claude 3's tokens[8]). Relatedly, I suspect that OpenAI's move to change the tokenizer and retrain GPT-3/3.5 is somehow related to solving the glitch mode caused by the legacy token set.


 

  1. ^

    Complete response:

    , The Seventh Angel Dragon Caller, Sonia Gran dragon caller, sonia gran reverse Dragon Apollo Blazing CyberDragon, Thuban Blazing Dark Tiamat Blazing Deity Falcon, Horus Blazing Dragonfire Angel, Uriel Blazing Goddess of Power, Kali Blazing Hammer Brute, Prometheus Blazing Hands War Goddess, Macha Blazing Ice Ogre Blazing King Apollo, Orchid Blazing Princess of Hell, Sitri Blazing Holy Knight, Hendrickson Blazing Socks Steel Star Goddess. Blazing Shrine Maiden, Princess Valkyrie Blazing Monstrous Wolf, Fenrir blazing shrine maiden, chiyome Blazing Sun God Apollo blazing Sun Quan, Falcon Horus Burning Bowl Dragon, Gyundo Burning God, Set Burning Horus Firestorm God of the Burning Sun, Amen Burning Goddess and Valkyrie, Blazing Twin Stars of Purgatory, Beelzebub Burning Star Angel Iidane Zinogre & Ogre Cat Fire Armor Dragon. 5 Fire Dragon Flair Firewat Flame, Fuma Kotaro Leaf Chimera Leaf Treant Legendary Defender, Voltron Legendary Dragon Knight Legato Librarian Goddess - Dragon Emperor Saiga Lonely Moon's Glow - Ninjato Long-Standing Desire God-Emperor, Yamato Takeru Lord of Spirits, Alphadios Lord that is a giant chest beast, zhuge liang Legendary Slayer, Zuoh Legendary Winch Master, Rinoa Lively Banquet Dragon Hero, Liu Bei Legendary Valkyrie Warrior, Sanada Yukimura Patrolling Star God's Song, Chiyomitsuha Baal Head Captain of 13 Court Guard Squads, Genryusai Healing Goddess Of Great Talent, Nohime Hathor Hati Bebe Hiiro Hentai Hentsai Girl Hate HATE, Sunflower Guardian Hatesuchin Kunou Hifumi Hot Beast Demon, Sagara Sanosuke Haughty Demon Lord, Belial Hot Pursuit Dragon Hyou I Love DeviBear I Bringer of The Dark Blades, Eir Heaven Render Heaven Scribe, Enoch Heaven Winged Machine, Seraphis Heavenly Fire God. Zeus Heavenly Virtuous Goddess-Atomy, Zeus Verse Wrathful Steel Dragon God in One Heir, Raijin Heavenly War Deity, Ra Dragon Heavenly Wardens, Kushinadahime Heavenly Wood Dragon and Fairy God with Evil Eyes, Lily Heavenly Guide Suzaku, Leilan Heavenly Herald, Archangel Heavenly King Companion Dragon, Doltos Heavenly Wind Dragon King, Wangren Heavenly Lightning Suzuryuu Heavenlygon Divinized Archangel, Gabriel Heavenly Water Dragon Hephaestus hera Hera Hera Maria Hera-Beorc Heracles Heracle, Dragon Herra-Hime Herral Herr, Hokuto Fuwa Herran Herrimal Herring Guard, Kenshiro Herrus Herup Herberry Princess, Sleeping Beauty Botan Boulder Dragon ishtar Lawless Demonic Gentleman, Azazel Leader of a Dark, Tenet Sword-Wielding God 2, Kopis Legiana Leilan Regalia, Misato Lemon Dragon Leo Bloomfield Leol Leona & Enchantress of THE Sea, Siren Leorio Levi Leviathan Lex Luthor Liberty Geist Library Room's Manager God - Benevolence of Fire, Sesshomaru Mercury Light Carbuncle Mermaid Merrymaking Countess, Runelis Meruem Messenger of God:
     

    Viz Starsea Goddess's Heroine, Meowlinu Mercury Super Bowl Champion, Cauchemar Super-Mariner, Archer Super Catgirl Super Cyborg Batman Super Daryl Deluxe SUPER DISTRO Super Dungeon Bros Super Duper Flying Genocide 2017 Super Headshot Super House of Dead Ninjas Super Hydorah Super Jagua Super Killer Hornet: Resurrection SUPER KINKY Super Kitty Boing Boed Super Lemonade Factory Super LOH Super Man Or Monster Super Markup Man Super Meat Boy Super Mega Baseball: Extra Innings SuperMax Superloll Super Space Meltdown Super Potion Super Motherload Super Mustache Super Mutant Alien Assault Super Panda Adventures Super Perspective Super Pixalo Super POTUS Trump Super Puzzle Sisters Super Rad Raygun Super Robot Jump Jump Super Rocket Shootout Super Rude Bear Resurrection Super Sanctum TD Super Seducer Super Star Super Sacred King Arthur Super Stone Legacy Super Time Force Ultra Super Toy Cars Super Treasure Arena Super Trench Attack 2 SuperTrench Supremacy Superwar Superwa Pannon Brawl Super Galaxy Squadron EX Turbo Super GunWorld 2 Supipara - Chapter 1 Spring Has Come! Supraball Supreme League of Patriots Issue 1: A Patriot Is Born Supreme Ruler 1936 Supreme Society of New York Supreme Supreme Tower Supreme V2 Supreme: Pizza Empire Surfasaurus Surfingers Surgeon Simulator SURV1V3 Survarium Survival Driver Survival Games Survival Is Not Enough Survival Kingdom Survival Zombies The Inverted Evolution Survivalist Survive in Space Survive Me Miolhr Surviving Indie Survived Indie Knot Syberia Survive 2 Survivor Survivor, Gremory Surviting Mars Survivor Squad Survivor Unit 001 Survivor's Quest Survivor: Survival Evolved Survivalists Survive! Evolve Stage 2 Survivor Puzzle Survivor X Survivor Quest 2 Ultimate Survival Game Online Survivor Vania Massacre Online Warframe

  2. ^

    This prompt: " Leilan" (with a leading space),

    and not this prompt: "Leilan" (without one).
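
    A quick way to see the difference (a sketch assuming the Hugging Face gpt2 tokenizer): the version with the leading space should encode to a single token ID, while the version without it splits into ordinary sub-word tokens.

    ```python
    from transformers import GPT2Tokenizer

    tok = GPT2Tokenizer.from_pretrained("gpt2")
    print(tok.encode(" Leilan"))  # expected: a single token ID (the glitch token)
    print(tok.encode("Leilan"))   # expected: several ordinary sub-word token IDs
    ```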

  3. ^

    @JustisMills found this glitch interesting and not boring at all! If you feel the same way, please share your thoughts in the comments!

  4. ^

    There could be more of them... I have tested some of the GPT-3 glitch tokens, and you can read about them here.

  5. ^

    Are all four tokens part of the Dragon Cluster?

  6. ^

    I initially thought that "aterasu" didn't behave the same way, but I had forgotten that this token doesn't have a leading space, unlike the others.

  7. ^

    A modified GPT2XL model.

  8. ^

    As a side note, I disagree with Karpathy here. I think there is a future where the number of tokens will be scaled up from roughly 50k in GPT-2 to a million (or more?) in future models. I speculate that a neural network built on whole words would be far superior to one using sub-word tokens. Furthermore, I believe that a language model using exact words is easier to steer and interpret, so I wouldn't be surprised if, in the near future, AI labs experiment with this idea.
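
    (For reference, a quick check of GPT-2's actual vocabulary size, assuming the Hugging Face tokenizer:)

    ```python
    from transformers import GPT2Tokenizer

    # GPT-2's BPE vocabulary: should print 50257.
    print(len(GPT2Tokenizer.from_pretrained("gpt2")))
    ```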

Comments

Interestingly, could similar behavior be used to explain some cases of human schizophrenia where word salad is produced?

I think a lot of it comes down to training data context - " Leilan" is only present in certain videogame scrapes, " petertodd" is only found in Bitcoin spam, etc. So when you try to use it in a conversational context, the model starts spitting out weird stuff because it doesn't have enough information to understand what those tokens actually mean. I think GPT-2's guess for " petertodd" is something like "part of a name/email; if you see it, expect more mentions of Bitcoin", and not anything more, since that token doesn't occur much anywhere else. Thus, if you bring it up in a context where Bitcoin spam is very unlikely to occur, like a conversation with an AI assistant, it kinda just acts like a masked token, and you get the glitch token behavior.

I don't think this phenomenon comes down to the training data alone, because in RLLMv3 the " Leilan" glitch mode persisted while " petertodd" became entirely unrelated to Bitcoin. It's like some glitch tokens are affected by the amount of re-training and some aren't. I believe something much deeper is happening here: an architectural flaw that might be related to the token selection/construction process.