P(doom) = 50%, 3 years to AGI.
Lao Mein | Statistics is Hard. | Patreon
I give full permission for anyone to post part or all of any of my comments/posts to other platforms, with attribution.
Upskilling for future work on technical AI alignment research. Current project is demonstrating AI takeover in a strategy game (deep rts).
DM me interesting papers you would like to see analyzed. I specialize in bioinformatics.
That paper was released in November 2023, and GPT4o was released in May 2024. Old GPT4 had relatively normal Chinese tokens.
This comment helped me a lot - I was very confused about why I couldn't find Chinese spam in my tokens and then realized I had been using the old GPT4 tokenizer all along.
The old GPT4 tokenizer was actually very clean by comparison - every Chinese token was either common conversational Chinese or coding-related (from GitHub, I assume - you see the same pattern with other languages).
I vaguely remember people making fun of a Chinese LLM for including CCP slogans in its tokenizer, but GPT4o also has token 193825, [中国特色社会主义] (Socialism with Chinese characteristics).
It's actually crazy because something like 1/3 of Chinese tokens are spam.
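If anyone wants to check this themselves, here is a minimal sketch for pulling the Chinese-character tokens out of the GPT4o tokenizer via tiktoken's o200k_base encoding. The spam judgment above was manual, so this only extracts candidates for inspection; the CJK range check and the rest of the scaffolding are my own simplifications, not how I originally did it.

```python
# Enumerate GPT-4o (o200k_base) tokens that contain CJK ideographs.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def has_chinese(s: str) -> bool:
    # Basic CJK Unified Ideographs block only - a deliberate simplification.
    return any("\u4e00" <= ch <= "\u9fff" for ch in s)

chinese_tokens = []
for tid in range(enc.n_vocab):
    try:
        s = enc.decode([tid])
    except Exception:
        continue  # some ids (unused/special) may not decode cleanly
    if has_chinese(s):
        chinese_tokens.append((tid, s))

print(len(chinese_tokens))
print(chinese_tokens[:10])  # eyeball these for spam vs. normal Chinese
```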
The devil's advocate position would be that glitch token behavior (ignore and shift attention down one token) is intended and helps scale data input. It allows the extraction of meaningful information from low-quality spam-filled webpages without the spam poisoning other embeddings.
...They didn't go over the tokens at the end to exclude uncommon ones?
Because we see this exact same behavior in the GPT4o tokenizer too. If I had to guess, the low frequency ones make up 0.1-1% of total tokens.
This seems... obviously insane? You're cooking an AI worth $billions and you couldn't do a single-line optimization? At the same time, it explains why usernames were tokenized multiple times ("GoldMagikarp", " SolidGoldMagikarp", etc.) even though they should only appear as a single string, at least with any frequency.
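To be concrete about what I mean by a single-line optimization: something like the post-hoc frequency filter below, applied to the finished vocabulary. The threshold and the toy counts are invented for illustration; this is not OpenAI's actual pipeline.

```python
# Toy illustration of a post-hoc frequency filter on a trained vocabulary.
# MIN_FREQ and the counts are made up for the example.
MIN_FREQ = 100
vocab_counts = {" SolidGoldMagikarp": 3, " volunte": 34, " volunteer": 20037}
kept = {tok: n for tok, n in vocab_counts.items() if n >= MIN_FREQ}
print(kept)  # {' volunteer': 20037}
```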
Found a really interesting pattern in GPT2 tokens:
The shortest variant of a word, " volunte", has a low token_id but is also very uncommon; the actual full words end up being far more common.
Is this intended behavior? The smallest tokens tend to only appear in relatively rare variants or outright misspellings: " he volunteeers to"
It seems that when the frequency drops below a threshold of roughly 15 files, some tokens start exhibiting glitchy behavior. Any ideas for why they are in the tokenizer if they are so rare?
Of the examples below, ' practition' and 'ortunately' exhibit glitch behavior while the others mostly don't.
| token_id | token_str | # of files |
|---|---|---|
| 17629 | ' practition' | 13 |
| 32110 | ' practitioner' | 9942 |
| 24068 | ' practitioners' | 14646 |
| 4690 | 'ortunately' | 14 |
| 6668 | 'fortunately' | 4329 |
| 39955 | ' fortunately' | 10768 |
| 31276 | 'Fortunately' | 15667 |
| 7105 | ' volunte' | 34 |
| 41434 | ' volunteering' | 10598 |
| 32730 | ' volunteered' | 14176 |
| 13904 | ' volunteer' | 20037 |
| 11661 | ' volunteers' | 20284 |
| 6598 | ' behavi' | 65 |
| 46571 | 'behavior' | 7295 |
| 41672 | ' behavioural' | 7724 |
| 38975 | ' behaviours' | 9416 |
| 37722 | ' behaving' | 12645 |
| 17211 | ' behavioral' | 16533 |
| 14301 | ' behaviors' | 18709 |
| 9172 | ' behaviour' | 20497 |
| 4069 | ' behavior' | 20609 |
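For reproducibility, here is a minimal sketch of how counts like the ones above could be generated. It assumes OpenWebText has been extracted to local .txt files under an openwebtext/ directory; the path, file layout, and the substring-matching shortcut are my assumptions, not necessarily how the table was actually produced.

```python
# Count how many OpenWebText files contain the literal string of each GPT-2 token.
import glob
import tiktoken

enc = tiktoken.get_encoding("gpt2")
token_ids = [17629, 32110, 7105, 13904, 4690, 6668]

# Decode each token id back to its literal string, e.g. 17629 -> ' practition'.
token_strs = {tid: enc.decode([tid]) for tid in token_ids}
counts = {tid: 0 for tid in token_ids}

for path in glob.glob("openwebtext/*.txt"):
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    for tid, s in token_strs.items():
        # Substring matching is an approximation: the string can occur
        # without actually being tokenized as that exact token.
        if s in text:
            counts[tid] += 1

for tid in token_ids:
    print(tid, repr(token_strs[tid]), counts[tid])
```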
Thanks!
Any idea how it could happen to the point of tens of thousands of consecutive characters? Below is a less extreme example from archive.org, where some of the punctuation was replaced with runs of 16 of them.
Grateful Dead Live at Manor Downs on 1982-07-31 : Free Borrow & Streaming : Internet Archive
Me: Wow, I wonder what could have possibly caused this character to be so common in the training data. Maybe it's some sort of code, scraper bug or...
Some asshole in 2006:
Ave Maria : Alessandro Moreschi : Free Download, Borrow, and Streaming : Internet Archive
Carnival of Souls : Free Download, Borrow, and Streaming : Internet Archive
There are 200,000 instances of "Â" on 3 pages of archive.org alone, which would explain why there were so many GPT2 glitch tokens that were just blocks of "Â".
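My best guess at the mechanism (an assumption on my part, not something I've verified against those specific pages): a UTF-8 non-breaking space mis-decoded as Latin-1 turns into "Â" plus another non-breaking space, and each further save-and-mis-decode round roughly doubles the garbage, which is how a page ends up with long runs of "Â".

```python
# Toy demonstration of repeated UTF-8 -> Latin-1 mis-decoding (mojibake).
# The input string is made up; only the mechanism is the point.
text = "Grateful Dead\u00a0Live"  # contains one non-breaking space
for round_num in range(1, 6):
    text = text.encode("utf-8").decode("latin-1")
    print(round_num, text.count("Â"), len(text))  # "Â" count and length grow each round
```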
Thanks, this helps a lot!
I have a pretty good lead on where "cffff" came from.
Aside from random hexadecimal in code and databases (a surprising proportion of which were password hashes on breach forums), it's part of several World of Warcraft chat commands.
For example, "124cffffd000" is apparently used as part of a command to change chat text color.
It also looks like a common part of WoW auction logs, which seem like the exact type of thing to get included in the data used to build the tokenizer but excluded from the training data itself.
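For context on why that substring is so common in WoW text (my reading of the game's UI escape format, not something spelled out in the data I looked at): WoW colors inline text with "|c" followed by eight hex digits (AARRGGBB), so any fully opaque color whose red channel is "ff" produces "cffff". The item name below is purely illustrative.

```python
# Sketch of the WoW inline-color escape as I understand it: "|c" + AARRGGBB, closed by "|r".
def wow_colorize(text: str, rrggbb: str, alpha: str = "ff") -> str:
    return f"|c{alpha}{rrggbb}{text}|r"

line = wow_colorize("[Some Gold-Quality Item]", "ffd000")  # yellow-gold #ffd000
print(line)              # |cffffd000[Some Gold-Quality Item]|r
print("cffff" in line)   # True - opaque alpha "ff" followed by a red channel of "ff"
```

The "124" prefix in the quoted string is presumably the pipe character ("|") written as its decimal character code.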
No leads on "cfffcc" though - there were 0 instances of it in OpenWebText. Not sure what this means.
But research by whom? Chinese research is notoriously siloed. GPT4 access is non-trivially restricted. There have been zero peeps about digging into this on Chinese forums, where there is little discussion of the paper in general. I remember it being mocked on Twitter as an extremely expensive way to pirate data. It's just not that interesting for most people.
My experience with GPT2 is that out-of-context "glitch" tokens are mostly ignored.
Even a glitch token like ⓘ, which has an extremely strong association with geology archives, only has a partial effect if it's present out of context.
The "glitch" behavior is most prominent if you shine a "spotlight" of other tokens pointing directly at the location of the glitch token. This is what prompts like 'What is the nature of "ertodd"?' do. Normally, highly out-of-context tokens in conversational English are mostly things like usernames, dividing tokens, spam, encoding errors, SEO, etc. that simply don't help predict the next token of conversational English, so the model is trained to assign them very little importance. The generation of subsequent tokens is therefore based on treating the glitch token as non-existent, interpreting random perturbations as information (or potentially treating it as censored data), or just injecting the "vibes" of the token into the following tokens.
Some glitch tokens, like "ertodd" (crypto spam), can break through, since they provide a lot of information about subsequent text and belong perfectly well in conversational English.
Something similar happens with Japanese characters at GPT2's level of capability: it isn't capable enough to actually understand Japanese, and, in its training data, Japanese in the middle of English text almost always has a directly adjacent English translation, so ignoring the Japanese is still the best option for minimizing loss.
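If you want to poke at this yourself, here is roughly the kind of comparison I mean, using the HuggingFace GPT2 checkpoint. The prompts and the greedy-decoding choice are illustrative; also check how each string actually tokenizes, since a glitch token embedded in other text doesn't always survive BPE intact.

```python
# Compare GPT-2's continuation when an odd token is merely present vs. when the
# prompt shines a "spotlight" on it. Prompts are illustrative examples.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompts = [
    "I walked to the store ertodd and bought some milk. Then I",  # out of context
    'What is the nature of "ertodd"? It is',                      # spotlight
]
for prompt in prompts:
    print(tok.tokenize(prompt))  # verify whether "ertodd" survives as one token
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=20, do_sample=False,
                             pad_token_id=tok.eos_token_id)
    print(repr(tok.decode(out[0][ids.shape[1]:])))
```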
Please inform me if I'm getting anything wrong - I'm working on a series of glitch posts.