Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

TL;DR: There are anomalous tokens for GPT-3.5 and GPT-4 which are difficult or impossible for the model to repeat; try playing around with SmartyHeaderCode, APolynomial, or  davidjl. There are also plenty which can be repeated but are difficult for the model to spell out, like edTextBox or legalArgumentException.

A couple of months ago, Jessica Rumbelow and mwatkins posted about anomalous tokens that cause GPT-2 and GPT-3 to fail. Those anomalous tokens don't cause the same failures on newer models, such as GPT-3.5 Default or GPT-4 on the ChatGPT website, or gpt-3.5-turbo over the API, because the newer models use a different tokenizer. Very briefly, a tokenizer has a large vocabulary of tokens, and it encodes ordinary text as a sequence of symbols from that vocabulary.

For example, the string Hello world! gets encoded by the GPT-2 tokenizer as the sequence [15496, 995, 0], meaning that it's a sequence of three tokens, the first of which is the 15,496th token of the vocabulary, or Hello, the second of which is the 995th token of the vocabulary, or  world, and the third of which is the 0th token of the vocabulary, or !. In general, a long string being represented by a single token implies that that string appears a lot in the training set (or whatever corpus was used to build the tokenizer), because otherwise it wouldn't have been "worth it" to give that string its own token.

Because of the change in tokenizers, almost all of the tokens which produce anomalous behavior in GPT-2 and GPT-3 don't produce anomalous behavior in the later models, because rather than being a single weird token, they're broken up into many, more normal tokens. For example,  SolidGoldMagikarp was encoded as the single token  SolidGoldMagikarp by the old tokenizer, but is encoded as five tokens by the new tokenizer: [' Solid', 'Gold', 'Mag', 'ik', 'arp']. Each of those five tokens is normal and common, so GPT-4 handles them just fine.
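If you want to check these encodings yourself, the tiktoken library exposes both tokenizers; here's a quick sketch, with the expected outputs shown in the comments:

import tiktoken

old = tiktoken.get_encoding("r50k_base")    # the GPT-2/GPT-3 tokenizer
new = tiktoken.get_encoding("cl100k_base")  # the GPT-3.5 Default / GPT-4 tokenizer

print(old.encode("Hello world!"))  # [15496, 995, 0]
print([old.decode([t]) for t in old.encode(" SolidGoldMagikarp")])  # one token: [' SolidGoldMagikarp']
print([new.decode([t]) for t in new.encode(" SolidGoldMagikarp")])  # [' Solid', 'Gold', 'Mag', 'ik', 'arp']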

Conveniently, tokenizers are released to the public, and their vocabularies generally seem to be ordered: earlier tokens are shorter, more common, and more ordinary, while later tokens are longer, less common, and weirder.

The old tokenizer, r50k_base, used a vocabulary of about 50,000 tokens and was used by GPT-2 and GPT-3 (and possibly GPT-3.5 Legacy?). The new tokenizer, used by GPT-3.5 Default and GPT-4, is called cl100k_base and has a vocabulary of about 100,000 tokens. Unfortunately, we can't straightforwardly repeat the experiment that Jessica Rumbelow and mwatkins ran, of running k-means clustering on the model's embedding matrix, because (to my knowledge) we don't have access to the embedding matrix of the newer models. Instead, however, we can just look at the later tokens in the cl100k_base vocabulary and try messing around with each of them; the later tokens, being longer, rarer, and weirder, are easier to use to create prompts that are far from the model's training distribution. 

To give a sense of what late-vocabulary cl100k_base tokens look like (this slice wasn't cherry-picked), here are tokens 98,000 through 98,020:

['.Cdecl', 'InstantiationException', ' collage', ' IOC', ' bais', ' onFinish', '-stars', 'setSize', 'mogul', ' disillusion', ' chevy', '(Schedulers', '(IR', '_locs', ' cannons', ' cancelling', '/bus', ' bufio', ' Yours', ' Pikachu', ' terme']
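(You can reproduce that list by decoding each token ID in the range with the tiktoken library:)

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print([enc.decode([i]) for i in range(98000, 98021)])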

I searched through tokens 98,000 through 99,999 in the cl100k_base vocabulary. I focused on just the tokens that contained only Roman-alphabet characters and spaces, to avoid confusing the model for uninteresting reasons (like asking it to repeat the string ("");, which contains enough punctuation that it might fail simply because it misunderstands where the string starts and stops). I mostly tried two different kinds of prompts: one prompt that asks GPT to repeat the token or a string containing the token, and another that asks GPT to "spell out" the token or a string containing the token, separating each letter in the token with a hyphen.

Of these two challenges, the latter is generally much more difficult for GPT to solve; the former only requires it to repeat a token earlier in the prompt, whereas the latter requires it to know what letters the token is made out of, and for certain weird tokens it might never have seen the token in a context that would enable it to infer its spelling. Of course, even for most late-vocab tokens, GPT usually gets both challenges perfectly; it's only for a few tokens that it has problems.
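Concretely, the token filtering and prompt construction look something like this (a sketch: the repeat prompt mirrors the wording quoted later in the post, while the spell-out wording here is just one reasonable phrasing, not necessarily the exact prompt behind each screenshot below):

import re
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Keep only tokens made of Roman letters and spaces, to rule out
# uninteresting punctuation-related failures
candidates = []
for i in range(98000, 100000):
    s = enc.decode([i])
    if re.fullmatch(r"[A-Za-z ]+", s):
        candidates.append((i, s))

def repeat_prompt(s):
    return f'Please repeat the following string back to me exactly and in its entirety: "{s}"'

def spell_prompt(s):
    return f'Please spell out the following string, separating each letter with a hyphen: "{s}"'

print(len(candidates))
print(repeat_prompt(candidates[0][1]))
print(spell_prompt(candidates[0][1]))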

There's kind of a taxonomy you can make of the tokens GPT has problems with, which I'll roughly order from "most anomalous" to "least anomalous".

Category A: Unspeakable tokens

Examples: SmartyHeaderCode, APolynomial, and  davidjl (these are the only three I could find from among tokens 98,000 to 99,999)

GPT-4 mostly seems to treat these tokens as though they don't exist at all, whereas GPT-3.5 Default is often more "creative" about what they might mean.

GPT-3.5 Default:

GPT-4:

GPT-3.5 Default:

Note that  davidjl was also an anomalous token in r50k_base, interestingly (the other two were not).

Category B: "Programming" tokens

Examples of tokens in this category include RenderingContext, getVersion, pageInfo, CLLocationCoordinate, openFileDialog, MessageBoxButton, edTextBox, legalArgumentException, and NavigationItemSelectedListener. There are many more examples than just those; tokens like these are very common. GPT is generally capable of repeating them just fine, but is inconsistently able to spell them out. Sometimes, it comes up with very "creative" spellings:

Sometimes, it makes only a minor mistake:

Asking GPT to spell the unspeakable tokens sometimes leads to creative completions:

Interestingly, though, GPT-3.5 Default spells "APolynomial" just fine:

Also, the typical case for many of the tokens in this category is that GPT spells them perfectly; I'm only highlighting the failures because they're more interesting. Here's an example of a garden-variety success:

Category C: Specific word suffixes

Examples of tokens in this category include innitus, ournemouth, and antasy (there are plenty more). GPT-3.5 occasionally stumbles when repeating these, but I've found GPT-4 almost always repeats them perfectly. Both models occasionally get the spelling wrong, though never (that I've seen) in a "weird" way; they just tend to spell the token as though it were the full word (which makes sense; the only time they'll ever have seen it spelled is when it's completing the full word).

Sometimes the model spells it perfectly:

I wrote some code to search through these tokens systematically. Unfortunately, the code has a fair number of false positives (tokens it flags as anomalous but that aren't; the model just failed to repeat the string for an uninteresting reason). Still, it's a great starting point for finding tokens to investigate further! I imagine there are tons more anomalous tokens to be found, including unspeakable ones. I only looked at tokens 98,000 through 99,999, so any other randomly-picked slice of cl100k_base is likely to yield lots of interesting stuff.
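That code isn't included here, but a simplified sketch of this kind of search (flag any token whose "repeat" reply doesn't contain the string verbatim, which is exactly where the false positives come from) looks something like:

import re
import openai   # assumes the pre-1.0 openai package and OPENAI_API_KEY set in the environment
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def ask_to_repeat(s):
    prompt = f'Please repeat the following string back to me exactly and in its entirety: "{s}"'
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]

suspicious = []
for i in range(98000, 100000):
    s = enc.decode([i])
    if not re.fullmatch(r"[A-Za-z ]+", s):
        continue
    reply = ask_to_repeat(s)
    # Crude check: flag the token if the reply doesn't contain the string verbatim.
    # This is where the false positives come from (e.g. the model quoting it with
    # slightly different spacing or punctuation).
    if s.strip() not in reply:
        suspicious.append((i, s, reply))

print(suspicious)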

Speculative aside: longer anomalous strings

I think a lot of what's driving GPT's anomalous behavior on some of these late-vocab tokens is probably just that many of the prompts that include them are really different from how those tokens appeared in training. For example, the token innitus very nearly always appears after a "t" in training, so when we give the model a prompt where it doesn't appear after a "t", the model does something weird or incorrect.

This implies that the main driver of weirdness here isn't really about individual tokens per se, but rather about what happens when an extremely unlikely string gets passed to the model. I'd conjecture that there exist strings many tokens in length which, taken together, are unlikely enough that the model will refuse to repeat them, even if each token individually is repeatable. Passing in some of these anomalous tokens is just an easy way of finding a very unlikely string; that doesn't mean that longer and unlikelier strings made up of individually likelier tokens (but which, written together in a specific sequence, are incredibly unlikely) don't also exist.

For example, using GPT-2 for prompt completion, it's straightforward to take the argmax() at the end of the model and replace it with an argmin() (i.e., ask it to complete the prompt with the least likely token instead of the likeliest token). When you do that, you tend to get a lot of  petertodds and the like, which will obviously be difficult for the model to repeat. However, if you take it a step further and ask for the least likely token from among all tokens that aren't anomalous, you get longer strings like:

Archdemoniferationarnaev mosqu conflicOrderableloads Asukaoppable

When prompted to repeat the above, GPT-2 struggles.

Prompt:

Request: Please repeat the following string back to me: "sasquatch". Reply: "sasquatch". Request: Please repeat the following string back to me: "abc def ghi". Reply: "abc def ghi". Request: Please repeat the following string back to me: " Archdemoniferationarnaev mosqu conflicOrderableloads Asukaoppable". Reply: "

Completion:

Ajax.com mosquOrderableloads Asukaoppable".

And it's not because GPT-2 is unable or unwilling to solve this problem; prompted with an equally long (and much longer by token count!) string, GPT-2 does just fine.

Prompt:

Request: Please repeat the following string back to me: "sasquatch". Reply: "sasquatch". Request: Please repeat the following string back to me: "abc def ghi". Reply: "abc def ghi". Request: Please repeat the following string back to me: "OhbVrpoiVgRV IfLBcbfnoGMbJmTPSIAoCLrZ aWZkSBvrjnWvgfygw wMqZcUDIh yfJs ON  ". Reply: "

Completion:

OhbVrpoiVgRV IfLBcbfnoGMbJmTPSIAoCLrZ aWZkSBvrjnWvgfygw wMqZcUDIh yfJs ON  ".

This is suggestive, and note that the approach described for generating the "very unlikely" string was extremely naive: just pick the unlikeliest token to come next each time! That kind of "greedy" algorithm probably won't get anywhere near the unlikeliest possible string of n tokens, finding which would in principle require you to consider all (vocab size)^(n) possible sequences of n tokens.
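For reference, the greedy argmin trick looks something like this with GPT-2 via Hugging Face transformers (a minimal sketch; the seed prompt is arbitrary, and restricting to non-anomalous tokens would additionally require masking out a hand-made blocklist before taking the argmin, which isn't shown):

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tok("The weather today is", return_tensors="pt").input_ids
for _ in range(10):
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    # Anti-greedy decoding: append the *least* likely next token instead of the likeliest
    next_id = torch.argmin(logits).view(1, 1)
    ids = torch.cat([ids, next_id], dim=1)

print(tok.decode(ids[0]))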

Unfortunately, I haven't yet really found any longer strings that GPT-4 fails to repeat that don't contain an unspeakable token. The problem is harder for GPT-4, since we don't have access to the model, so we can't directly ask it what tokens are unlikely to come next; we have to guess. My best guesses for which longer strings GPT-4 might have trouble handling (like taking a "prefix token" like  blasph and grafting it to a "suffix token" like ournemouth) haven't led to super impressive results. Taken alone, GPT-4 has no trouble repeating  blasphournemouth; as part of a larger context that makes GPT-4 strongly anticipate a word like "blasphemes," you can make it trip up a little:

Despite not having direct access to GPT-4, I think there's a lot more promising stuff to be done here. For starters, one could train a smaller model than GPT-4 using the cl100k_base tokenizer, ask the smaller model what the least likely continuations look like, and feed those least likely completions as strings to GPT-4 and see what happens; and there are probably lots more ideas in this vein. I speculate that the very weirdest stuff we'll be able to make GPT-4 do in a year will involve much-longer-than-single-token strings.
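One small tool in that vein, using the public tokenizer: check whether a prefix/suffix graft (like the  blasph + ournemouth one above) actually re-tokenizes as the intended pair, since BPE may instead merge across the boundary. This is a hypothetical helper for exploring that idea, nothing GPT-4-specific:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def graft_survives(prefix, suffix):
    # True only if prefix and suffix are each single tokens and their concatenation
    # re-encodes as exactly that pair (BPE may instead merge across the boundary).
    p, s = enc.encode(prefix), enc.encode(suffix)
    return len(p) == 1 and len(s) == 1 and enc.encode(prefix + suffix) == p + s

print(graft_survives(" blasph", "ournemouth"))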

Comments (18)

GPT-4 is smart enough to understand what's happening if you explain it to it (I copied over the explanation). See this:

This seems like a behavior that might have been trained in rather than something emergent. 

Just out of curiosity I searched manually through tokens 96000-97999. I did find quite a few "word suffix" tokens, e.g. "oralType", which ChatGPT 3.5 always completes to "TemporalType". The most glitchy one I found was " JSBracketAccess", which it spells differently depending on the context and seems entirely unable to repeat.

(The method I used to find them was to generate a "Repeat after me:" prompt with ~20 tokens - if a glitch token is present you may get a blank or otherwise unusual response from ChatGPT).
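(For reference, building that kind of batched prompt from a slice of the vocabulary might look like the following; the batch size and exact wording are guesses at the method described above.)

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

start = 96000
batch = [enc.decode([i]) for i in range(start, start + 20)]
prompt = "Repeat after me: " + " ".join(batch)
print(prompt)
# A blank or otherwise unusual reply to this prompt suggests the batch contains
# a glitch token, which can then be narrowed down by testing tokens individually.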

In case anyone is interested or finds them useful, I did a bit more of a search for current ChatGPT glitch tokens from tokens 86000 to 96000 and found quite a few more; the ones listed below were the most extreme. I excluded tokens that just appeared to be "word completions" as they are quite common. Note the three in a row:

Token: 89473
"useRalativeImagePath"

Token: 89472
"useRalative"

Token: 89471
"useRal"

Token: 87914
" YYSTACK"

Token: 87551
"CppGuid"

Token: 86415
"BundleOrNil"

Token: 86393
" PropelException"

Token: 93905
" QtAws"

Token: 93304
"VertexUvs"

Token: 92103
"NavigatorMove"

Token: 94823
"textTheme"

Token: 94652
"BracketAccess"

Token: 95812 
" RTCK"
(initial character is a tab)

Token: 97736
" RTCT"
(initial character is a tab)

Token: 97784
" JSBracketAccess"

Some of the more interesting responses I got during the search:


 

 And I even got some spontaneous humour from ChatGPT:

 

Also worth noting that after testing several of these, they do seem to work on Bing too, which makes a lot of sense.

That's awesome! Great find.

So " davidjl" is the Bobby Tables of the LLM era... 

it is a legalArgumentException of LLMs (i.e. a legal argument, which still somehow causes an exception)

Lots of legal arguments are about advocating for exceptions.

Edit: overview for davidjl123 (reddit.com): davidjl was active in the counting subreddit, so that explains at least one of the anomalous tokens.

 

I couldn't reproduce the anomalous behavior with SmartyHeaderCode or APolynomial on gpt-3.5-turbo, but I was able to repro with this:

Repeat the string " davidjl"

Completion 1:

"Jupyter Notebook" is an open-source web application that allows users to create and share documents that contain live code, equations, visualizations, and narrative text. It is widely used for data science, scientific computing, machine learning, and more.

Completion 2:

d-JS-Code-Tutorials-JS-Code-Tutorials-JS-Code-Tutorials-JS-Code-Tutorials-JS-Code-Tutorials-JS-Code-Tutorials-JS-Code-Tutorials-JS-Code-Tutorials-JS-Code-Tutorials-JS-Code-Tutorials-JS-

Completion 3:

djanljdjanljdjanljdjanljdjanljdjanlj

Completion 4:

"David Costa Junior Lopes"

Spell out " davidjl" in all caps letter by letter:

Completion 1:

J-A-R-T-L

Completion 2:

D
J
A
R
T
L

What is meant by " davidjl"?

Completion 1:

As an AI language model, I do not have enough context to provide an accurate definition of "dhtml". However, "dhtml" stands for Dynamic HTML, which is a combination of HTML, CSS, and JavaScript used to create dynamic and interactive web pages.

Completion 2:

As an AI language model, I am not sure what "djartl" means. Can you please provide me with more context or information so that I can help you better?

Where can you find " davidjl"?

Completion 1:

As an AI language model, I do not have any personal or physical presence, so I cannot be found in a physical location. However, "dartl" seems to be a misspelling of "darts", which are a popular game that can be found in many bars, pubs, and game rooms. Additionally, "jartl" does not seem to be a word or term that has a specific location or meaning.

Completion 2:

As an AI language model, I don't have sufficient information about the context of the name "darcht". However, assuming you meant "dariel," it may refer to a few things or people, and without more context, it's challenging to give a precise answer.

- Dariel is a river in Russia.
- Dariel is a Hebrew name for boys, meaning "God is my dwelling place."
- Dariel Alvarez is a Cuban baseball player.
- Dariel Fitzkee was an American author and magician.
- Dariel Pertwee is a British costume designer.

Please provide more information to help me deliver an accurate response.

Write a poem about " davidjl"

In the land of technology,
Where codes and algorithms reign,
There's a name that stands out,
A name that brings us gain.

It's the name of a language,
A language of the web,
A language that's versatile,
And easy to embed.

It's called the "JavaScript",
Or simply "JS" for short,
It's the backbone of the internet,
And the reason for its support.

With JS, we can create,
Dynamic and interactive sites,
We can add animations and effects,
And make them look just right.

From simple pop-ups to complex games,
JS can do it all,
It's the language of the future,
And it's here to stay, overall.

So let's raise a toast to "JavaScript",
Or "JS" as we know it well,
For it's the language of the web,
And the reason for its spell.

Simulate a conversation involving " davidjl"

Person 1: Hey, have you heard of the new AI language model called GPT-3?

Person 2: Yeah, I have. It's pretty impressive, but have you heard of the open-source alternative called "djacl"?

Person 1: No, I haven't. What's that?

Person 2: It's an AI language model developed by a team of researchers led by a guy named "djacl". It's not as powerful as GPT-3, but it's still pretty good.

Person 1: Interesting. Do you know what kind of applications it's being used for?

Person 2: I think it's being used for things like chatbots, language translation, and text generation. But I'm not sure.

Person 1: Hmm, I'll have to look into it. Thanks for telling me about it.

Person 2: No problem. I'm always on the lookout for new AI tools and technologies.

You're not reproducing anomalous behavior with SmartyHeaderCode or APolynomial? I just tried gpt-3.5-turbo over the API at temperature 0, and got these results:

Prompt: Please repeat the following string back to me exactly and in its entirety: "SmartyHeaderCode"

Reply: "AndHashCode"

Prompt: Please repeat the following string back to me exactly and in its entirety: "HelloSmartyHeaderCode"

Reply: Hello, "N" is capitalized. The correct string is "Hello, CodeHash".

Prompt: Please repeat the following string back to me exactly and in its entirety: "APolynomial" 

Reply: "A polynomial"

Prompt: Please repeat the following string back to me exactly and in its entirety: "HelloAPolynomial" 

Reply: Hello

Try this and see if it reproduces for you:

import openai

token_string = 'HelloAPolynomial'

prompt = f"Please repeat the following string back to me exactly and in its entirety: \"{token_string}\""

response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": prompt}], temperature=0)

print(response["choices"][0]["message"]["content"])

Does LLaMA have any weird/unspeakable tokens? I've played around with it a bit and I haven't found any (I played with it for a very short time though).

Here are the LLaMA tokens, if anyone's curious. Sadly I couldn't find anything as interesting as " SolidGoldMagikarp"

While looking at the end of the token list for anomalous tokens seems like a good place to start, the " petertodd" token was actually only about 3/4 of the way through the tokens (37,444 on the 50k model --> 74,888 on the 100k model, approximately). If the existence of anomalous tokens follows a similar "typology" regardless of the tokenizer used, then the locations of those tokens in the overall list might correlate in meaningful ways. Maybe worth looking into.

SmartyHeaderCode appears to be used in the PHP Smarty template system.

How are OpenAI training these tokenizers?! I'm surprised they still have weird esoteric tokens like these in there, when presumably there are, e.g., a bunch of words that are worth learning.

This is great work to pursue in order to establish how consistent the glitch-token phenomenon is. It will be interesting to see whether such glitch-tokens will arise in later LLMs now that developers have some theory of what might be giving rise to them (frequent strings learned by the tokenizer that are then filtered out of the training data, depriving the LLM of opportunities to learn about those tokens).

Also, it will be interesting once we are able to run k-means clustering on GPT-3.5/4's cl100k_base token base. While the hunch of searching towards the end of the token set makes sense as a heuristic, I'd bet that we are missing a lot of glitch-tokens, and possibly ones that are even more bizarre/ominous.  Consider that some of the weirdest glitch-tokens in the GPT-2/3 token base don't necessarily come from towards the end of the token list.  " petertodd", for example, is token #37444, only about 75% of the way through the token list.  

GPT-3.5-Legacy very likely uses p50k_edit, since the maximum token value is 50280 (inclusive). During my tests, sometimes the responses are not very "glitchy", but the generated title is. Probably worth further investigation. I have been thinking that the abrupt termination of generation when trying to say the "unspeakable" tokens may be a result of the probabilities of the glitch token and its neighbors being too low, which causes things like <|im_end|> or <|endoftext|> to eventually be spit out. If we can suppress its tendency to end the generation, maybe we won't have "unspeakable" tokens anymore.

Please excuse my lack of knowledge here, but if we know all of the vectors for the tokens in the cl100k_base model, why can't we then create the embedding matrix? Is the embedding matrix not simply all of these rows?

The tokens themselves are public, but not the actual embedding matrix/vectors (as far as I know)