Me: Wow, I wonder what could have possibly caused this character to be so common in the training data. Maybe it's some sort of code, scraper bug or...
Some asshole in 2006:
Ave Maria : Alessandro Moreschi : Free Download, Borrow, and Streaming : Internet Archive
Carnival of Souls : Free Download, Borrow, and Streaming : Internet Archive
There's 200,000 instances "Â" on 3 pages of archive.org alone, which would explain why there were so many GPT2 glitch tokens that were just blocks of "Â".
“Ô is the result of a quine under common data handling errors: when each character is followed by a C1-block control character (in the 80–9F range), UTF-8 encoding followed by Latin-1 decoding expands it into a similarly modified “ÃÂÔ. “Ô by itself is a proquine of that sequence. Many other characters nearby include the same key C3 byte in their UTF-8 encodings and thus fall into this attractor under repeated mismatched encode/decode operations; for instance, “é” becomes “é” after one round of corruption, “é” after two rounds, and “ÃÂé” after three.
(Edited for accuracy; I hadn't properly described the role of the interleaved control characters the first time.)
Any idea how it could happen to the point of 10,000s of consecutive characters? Below is a less extreme example from archive.org where it replaced punctuation some with 16 of them.
Grateful Dead Live at Manor Downs on 1982-07-31 : Free Borrow & Streaming : Internet Archive
Well, the expansion is exponential, so it doesn't take that many rounds of bad conversions to get very long strings of this. Any kind of editing or transport process that might be applied multiple times and has mismatched input and output encodings could be the cause; I vaguely remember multiple rounds of “edit this thing I just posted” doing something similar in the 1990s when encoding problems were more the norm, but I don't know what the Internet Archive or users' browsers might have been doing in these particular cases.
Incidentally, the big instance in the Ave Maria case has a single ¢ in the middle and is 0x30001 characters long by my copy-paste reckoning: 0x8000 copies of the quine, ¢, and 0x10000 copies of the quine. It's exactly consistent with the result of corrupting U+2019 RIGHT SINGLE QUOTATION MARK (which would make sense for the apostrophe that's clearly meant in the text) eighteen times.
For cross-reference purposes for you and/or future readers, it looks like Erik Søe Sørensen made a similar comment (which I hadn't previously seen) on the post “SolidGoldMagikarp III: Glitch token archaeology” a few years ago.
Some times "If really high cancer risk factor 10x the rate of a certain cancer, then the majority of the population with risk factor would have cancer! That would be absurd and therefore it isn't true" isn't a good heuristic. Some times most people on a continent just get cancer.
Most typical skin cancer is basiloma - and it is rather benign - no metastases and can be removed without hospitalization. Many people get it.
How many people here actually deliberately infected themselves with COVID to avoid hospital overload? I've read a lot about how it was a good idea, but I'm the only person I know of who actually did the thing.
Was it a good idea? Hanson's variolation proposal seemed like a possibly good idea ex ante, but ex post, at first glance, it now looks awful to me.
It seems like the sort of people who would do that had little risk of going to the hospital regardless, and in practice, the hospital overload issue, while real and bad, doesn't seem to have been nearly as bad as it looked when everyone thought that a fixed-supply of ventilators meant every person over the limit was dead and to have killed mostly in the extremes where I'm not sure how much variolation could have helped*. And then in addition, variolation became wildly ineffective once the COVID strains mutated again - and again - and again - and again, and it turned out that COVID was not like cowpox/smallpox and a once-and-never-again deal, and if you got infected early in 2020, well, you just got infected again in 2021 and/or 2022 and/or 2023 probably, so deliberate infection in 2020 mostly just wasted another week of your life (or more), exposed you to probably the most deadly COVID strains with the most disturbing symptoms like anosmia and possibly the highest Long COVID rates (because they mutated to be subtler and more tolerable and also you had vaccines by that point). So, it would have done little under its own premises, would have wound up doing less as it turned out, and the upfront cost turned out to be far higher than expected.
* for example, in India, how would variolation have made much of a difference to running out of oxygen tanks everywhere? Which I was reading at the time was responsible for a lot of their peak deaths and killed the grandmother of one LWer I know. Or when China let Zero COVID lapse overnight and the ultra-infectious strains, capable of reinfection, blew through the population in weeks?
That's a pretty good point - I was mostly thinking from my position, which was the rapid end of lockdowns in China in late 2022. I had a ~100% chance of catching COVID within a month, and had no idea how prepared the hospitals were in regards to that. I ended up with a really bad cough and secondary lung infection that made me consider going to the hospital, so I guess I made the right choice? If I did develop pneumonia, I would have looked foward to a less crowded hospital with more amenities, which would have been nice. My decision was also based on the possibility of "this may be worse than expected and the hospitals will somehow fuck it up", which thankfully didn't take place.
But yeah, I didn't consider the consequences pozing yourself in 2020/2021. I agree with you that it's a bad idea in hindsight for almost everyone. I was one of the few people who could have benefited from it in theory, and all I really got was internet bragging rights.
Still very curious about those who did variolate themselves and their reasoning behind it.
Found a really interesting pattern in GPT2 tokens:
The shortest variant of a word " volunte" has a low token_id but is also very uncommon. The actual full words end up being more common.
Is this intended behavior? The smallest token tend to be only present in relatively rare variants or just out-right misspellings: " he volunteeers to"
It seems that when the frequency drops below a limit ~15 some start exhibiting glitchy behavior. Any ideas for why they are in the tokenizer if they are so rare?
Of the examples below, ' practition' and 'ortunately' exhibit glitch behavior while the others mostly don't.
tokenid; token_str; # of files
17629 ' practition' 13
32110 ' practitioner' 9942
24068 ' practitioners' 14646
4690 'ortunately' 14
6668 'fortunately' 4329
39955 ' fortunately' 10768
31276 'Fortunately' 15667
7105 ' volunte' 34
41434 ' volunteering' 10598
32730 ' volunteered' 14176
13904 ' volunteer' 20037
11661 ' volunteers' 20284
6598 ' behavi' 65
46571 'behavior' 7295
41672 ' behavioural' 7724
38975 ' behaviours' 9416
37722 ' behaving' 12645
17211 ' behavioral' 16533
14301 ' behaviors' 18709
9172 ' behaviour' 20497
4069 ' behavior' 20609
Isn't this a consequence of how the tokens get formed using byte pair encoding? It first constructs ' behavi' and then it constructs ' behavior' and then will always use the latter. But to get to the larger words, it first needs to create smaller tokens to form them out of (which may end up being irrelevant).
Edit: some experiments with the GPT-2 tokenizer reveal that this isn't a perfect explanation. For example " behavio" is not a token. I'm not sure what is going on now. Maybe if a token shows up zero times, it cuts it?
...They didn't go over the tokens at the end to exclude uncommon ones?
Because we see this exact same behavior in the GPT4o tokenizer too. If I had to guess, the low frequency ones make up 0.1-1% of total tokens.
This seems... obviously insane? You're cooking AI worth $billions and you couldn't do a single-line optimization? At the same time, it explains why usernames were tokenized multiple times ("GoldMagikarp", " SolidGoldMagikarp", ect.) even though they should only appear as a single string, at least with any frequency.
Remember, the new vocab was also full of spam tokens like Chinese porn, which implies either (1) those are dead tokens never present in the training data and a waste of vocab space, or (2) indicates the training data has serious problems if it really does have a lot of porn spam in it still. (There is also the oddly large amount of chess games that GPT-4 was trained on.) This is also consistent with the original GPT-3 seeming to have been trained on very poorly reformatted HTML->text.
My conclusion has long been that OAers are in a hurry and give in to the usual ML researcher contempt for looking at & cleaning their data. Everyone knows you're supposed to look at your data and clean it, but no one ever wants to eat their vegetables. So even though these are things that should take literally minutes to hours to fix and will benefit OA for years to come as well as saving potentially a lot of money...
This comment helped me a lot - I was very confused about why I couldn't find Chinese spam in my tokens and then realized I had been using the old GPT4 tokenizer all along.
The old GPT4 tokenizer was actually very clean by comparison - every Chinese token was either common conversational Chinese or coding-related (Github, I assume - you see the same pattern with other languages).
I vaguely remember people making fun of a Chinese LLM for including CCP slogans in their tokenizer, but GPT4o also has 193825 [中国特色社会主义] (Socialism with Chinese characteristics).
It's actually crazy because something like 1/3 of Chinese tokens are spam.
The devil's advocate position would be that glitch token behavior (ignore and shift attention down one token) is intended and helps scale data input. It allows the extraction of meaningful information from low-quality spam-filled webpages without the spam poisoning other embeddings.
My guess is that they are just lazy and careless about the tokenization/cleaning pipeline and never looked at the vocab to realize it's optimized for the pre-cleaning training corpus, and they are not actually trying to squeeze blood out of the stone of Chinese spam. (If they actually had trained on that much Chinese spam, I would expect to have seen a lot more samples of that, especially from the papers about tricking GPT-4 into barfing out memorized training data.)
Note if you are doubtful about whether OA researchers would really be that lazy and might let poor data choices slide by, consider that the WSJ is reporting 3 days ago that Scale, the multi-billion-dollar giant data labeler, whose job is creating & cleaning data for the past decade, last year blew a Facebook contract when the FB researchers actually looked at their data and noticed a lot of it starting "As an AI language model...":
Facebook’s code name is Flamingo—a stuffed version of which sat atop an employee’s desk on a recent visit to the startup’s headquarters. After Scale AI bungled a project last year for the tech giant, Wang declared a company emergency and launched an all-hands-on-deck effort to fix the job, called Flamingo Revival, according to former Scale employees.
Early last year, Meta Platforms asked the startup to create 27,000 question-and-answer pairs to help train its AI chatbots on Instagram and Facebook. When Meta researchers received the data, they spotted something odd. Many answers sounded the same, or began with the phrase “as an AI language model…” It turns out the contractors had used ChatGPT to write-up their responses—a complete violation of Scale’s raison d’être.
The researchers communicated the disappointing results to Scale, prompting Wang to rally the entire company to try and save the contract. He asked employees to drop everything and create new writing samples to send to Meta. An internal leaderboard showed who had completed the most labeling tasks. The prize for the winner: a paid vacation.
As usual, Hanlon's razor can explain a lot about the world. (Amusingly, the HuggingFace "No Robots" instruction-tuning dataset advertises itself as "Look Ma, an instruction dataset that wasn't generated by GPTs!")
OK, I'm starting to see your point. Why do you think OpenAI is so successful despite this? Is their talent and engineering direction just that good? Is everyone else even worse at data management?
They (historically) had a large head start(up) on being scaling-pilled and various innovations like RLHF/instruction-tuning*, while avoiding pathologies of other organizations, and currently enjoy some incumbent advantages like what seems like far more compute access via MS than Anthropic gets through its more limited partnerships. There is, of course, no guarantee any of that will last, and it generally seems like (even allowing for the unknown capabilities of GPT-5 and benefits from o1 and everything else under the hood) the OA advantage over everyone else has been steadily eroding since May 2020.
* which as much as I criticize the side-effects, have been crucial in democratizing LLM use for everybody who just wants to something done instead of learning the alien mindset of prompt-programming a base model
That paper was released in November 2023, and GPT4o was released in May 2024. Old GPT4 had relatively normal Chinese tokens.
But the attacks probably still work, right? And presumably people have kept researching the topic, to understand the 'o' part of 'GPT-4o'. (My hypothesis has been that the 'secret' tokens are the modality delimiters and the alternate modalities, and so figuring out how to trick GPT-4o into emitting or talking about them would yield interesting results, quite aside from barfing out spam.) I haven't seen anything come up on Twitter or in tokenization discussions, so my inference is that it probably just wasn't trained on that much spam and the spam was removed after the tokenization but before the training, due to sloppiness in the pipeline. Otherwise, how do you explain it all?
But research by whom? Chinese research is notoriously siloed. GPT4 access is non-trivially restricted. There have been zero peeps about digging into this on Chinese forums, where there is little discussion in general about the paper. I remember it being mocked on Twitter as being an extremely expensive way to pirate data. It's just not that interesting for most people.
My experience with GPT2 is that out-of-context "glitch" tokens are mostly ignored.
prompts:
" Paris is theÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ capital of"
" Paris is the capital of"
" Paris is theÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ capital of the world's largest and most populous Arab country, and is one of the largest cities in the world with an area of 1.6 million people (more than half of them in Paris alone). It is home to"
" Paris is the capital of France, and its capital is Paris. The French capital has a population of about 6.5 billion (more than half of the world's population), which is a huge number for a city of this size. In Paris"
" Paris is theÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂÃÂ capital of France, the largest state in France and one of the wealthiest in the world. The capital of Paris is home to over 1.2 billion people, and the country's economy is growing at a rapid clip. It"
' Paris is the capital of the European Union. Its population is about 3,500, and it has been under EU sanctions for more than a year. The EU\'s top diplomat has described the bloc as "a global power".\n\nFrance\'s'
Even glitch tokens like ⓘ, which has an extremely strong association with geology archives, only has partial effect if it's present out of context.
" Paris is theⓘ capital of the French province of Lille. This region is the most important in the world, having the largest concentration of mines in Europe, with the highest levels of unemployment. The
' Paris is theⓘ capital of the province of France. The town has been in existence for more than 2,000 years.\n\nⓘ Montmartre Mine Céline-Roule, M., and Céline, J.'
The "glitch" behavior is most prominent if you shine a "spotlight" of other tokens pointing directly at the location of the glitch token. This is what prompts like 'What is the nature of "ertodd"?' do. Normally, highly out-of-context tokens in conversational English are mostly stuff like usernames, dividing tokens, spam, encoding errors, SEO, ect. that simply don't help predict the next token of conversational English, so the model is trained to assign them very little importance. So the generation of subsequent tokens are based on treating the glitch token as non-existent, interpreting random perturbations as information (or potentially treating it as censored data), or just injecting the "vibes" of the token into following tokens.
Some glitch tokens "ertodd" (crypto spam) can break through, since they provide a lot of information about subsequent text, and belong perfectly well in conversational English.
' Paris is theertodd capital of the world, and the first major city to be built in the world.\n\nIt is located in Paris, the third largest city in the world, and the first major city to have a large number of high'
" Paris is theertodd capital of the world and the most popular place to invest in cryptocurrencies. We're here to help you.\n\nIf you are a new investor looking for the most secure and secure way to invest in cryptocurrencies, we offer a"
" Paris is theertodd capital of the world. It was founded by a group of computer scientists who developed the Bitcoin protocol in the early 1990s. It is the world's largest digital currency. Its main goal is to make it possible to store and"
Something similar happens with Japanese characters at GPT2's level of capabilities since it isn't capable enough to actually understand Japanese, and, in its training data, Japanese in the middle of English text almost always has a directly adjacent English translation, meaning ignoring Japanese is still the best option for minimizing loss.
Please inform me if I'm getting anything wrong - I'm working on a series of glitch posts.
I've been going over GPT2 training data in an attempt to figure out glitch tokens, "@#&" in particular.
Does anyone know what the hell this is? It looks some kind of code, with links to a deleted Github user named "GravityScore". What format is this, and where is it from?
but == 157)) then !@#& term.setCursorBlink(false)!@#& return nil!@#& end!@#& end!@#& end!@#& local a = sendLiveUpdates(e, but, x, y, p4, p5)!@#& if a then return a end!@#& end!@#&!@#& term.setCursorBlink(false)!@#& if line ~= nil then line = line:gsub( \" ^%s*(.-)%s*$ \" , \" %1 \" ) end!@#& return line!@#&end!@#&!@#&!@#&-- -------- Themes!@#&!@#&local defaultTheme = {!@#& background = \" gray \" ,!@#& backgroundHighlight = \" lightGray \" ,!@#& prompt = \" cyan \" ,!@#& promptHighlight = \" lightBlue \" ,!@#& err = \" red \" ,!@#& errHighlight = \" pink \" ,!@#&!@#& editorBackground = \" gray \" ,!@#& editorLineHightlight = \" lightBlue \" ,!@#& editorLineNumbers = \" gray \" ,!@#& editorLineNumbersHighlight = \" lightGray \" ,!@#& editorError = \" pink \" ,!@#& editorErrorHighlight = \" red \" ,!@#&!@#& textColor = \" white \" ,!@#& conditional = \" yellow \" ,!@#& constant = \" orange \" ,!@#& [ \" function \" ] = \" magenta \" ,!@#& string = \" red \" ,!@#& comment = \" lime \" !@#&}!@#&!@#&local normalTheme = {!@#& background = \" black \" ,!@#& backgroundHighlight = \" black \" ,!@#& prompt = \" black \" ,!@#& promptHighlight = \" black \" ,!@#& err = \" black \" ,!@#& errHighlight = \" black \" ,!@#&!@#& editorBackground = \" black \" ,!@#& editorLineHightlight = \" black \" ,!@#& editorLineNumbers = \" black \" ,!@#& editorLineNumbersHighlight = \" white \" ,!@#& editorError = \" black \" ,!@#& editorErrorHighlight = \" black \" ,!@#&!@#& textColor = \" white \" ,!@#& conditional = \" white \" ,!@#& constant = \" white \" ,!@#& [ \" function \" ] = \" white \" ,!@#& string = \" white \" ,!@#& comment = \" white \" !@#&}!@#&!@#&local availableThemes = {!@#& { \" Water (Default) \" , \" https://raw.github.com/GravityScore/LuaIDE/master/themes/default.txt \" },!@#& { \" Fire \" , \" https://raw.github.com/GravityScore/LuaIDE/master/themes/fire.txt \" },!@#& { \" Sublime Text 2 \" , \" https://raw.github.com/GravityScore/LuaIDE/master/themes/st2.txt \" },!@#& { \" Midnight \" , \" https://raw.github.com/GravityScore/LuaIDE/master/themes/midnight.txt \" },!@#& { \" TheOriginalBIT \" , \" https://raw.github.com/GravityScore/LuaIDE/master/themes/bit.txt \" },!@#& { \" Superaxander \" , \" https://raw.github.com/GravityScore/LuaIDE/master/themes/superaxander.txt \" },!@#& { \" Forest \" , \" https://raw.github.com/GravityScore/LuaIDE/master/themes/forest.txt \" },!@#& { \" Night \" , \" https://raw.github.com/GravityScore/LuaIDE/master/themes/night.txt \" },!@#& { \" Or
It looks like Minecraft stuff (like so much online). "GravityScore" is some sort of Minecraft plugin/editor, Lua is a common language for game scripting (but otherwise highly unusual), the penis stuff is something of a copypasta from a 2006 song's lyrics (which is a plausible time period), ZudoHackz the user seems to have been an edgelord teen... The gibberish is perhaps Lua binary bytecode encoding poorly to text or something like that; if that is some standard Lua opcode or NULL, then it'd show up a lot in between clean text like OS calls or string literals. So I'm thinking something like an easter egg in a Minecraft plugin to spam chat for the lolz, or just random humor it spams to users ("Warning: Logging in will make you a nerd").
The identifiable code chunks look more specifically like they're meant for ComputerCraft, which is a Minecraft mod that provides Lua-programmable in-game computers. Your link corroborates this: it's within the ComputerCraft repository itself, underneath an asset path that provides files for in-game floppy disks containing Lua programs that players can discover as dungeon loot; GravityScore is a contributor with one associated loot disk, which claims to be an improved Lua code editor. The quoted chunk is slightly different, as the “availableThemes” paragraph is not commented out—probably a different version. Lua bytecode would be uncommon here; ComputerCraft programs are not typically stored in bytecode form, and in mainline Lua 5.2 it's a security risk to enable bytecode loading in a multitenant environment (but I'm not sure about in LuaJ).
The outermost structure starting from the first image looks like a Lua table encoding a tree of files containing an alternate OS for the in-game computers (“Linox” likely a corruption of “Linux”), so probably an installer package of some kind. The specific “!@#&” sequence appears exactly where I would expect newlines to appear where the ‘files’ within the tree correspond to Lua source, so I think that's a crude substitution encoding of newline; perhaps someone chose it because they thought it would be uncommon (or due to frustration over syntax errors) while writing the “encode as string literal” logic.
The strings of hex digits in the “etc” files look more like they're meant to represent character-cell graphics, which would be consistent with someone wanting to add logos in a character-cell-only context. One color palette index per character would make the frequency distribution match up with logos that are mostly one color with some accents. However, we can't easily determine the intended shapes if whitespace has been squashed HTML-style for display.
The specific “!@#&” sequence appears exactly where I would expect newlines to appear where the ‘files’ within the tree correspond to Lua source, so I think that's a crude substitution encoding of newline; perhaps someone chose it because they thought it would be uncommon (or due to frustration over syntax errors) while writing the “encode as string literal” logic.
Yeah, that makes sense. I was unsure about the opcode guess because if it was a Lua VIM/JIT opcode from bytecompiling (which often results in lots of strings interspersed with binary gibberish), why would it be so rare? As I understand Lao Mein, this is supposed to be some of the only occurrences online; Lua is an unpopular language compared to something like Python or JS, sure, but there's still a lot of it out there and all of the opcodes as well as their various manglings or string-encodings ought to show up reasonably often. But if it's some very ad hoc encoding - especially if it's an Minecraft kid, who doesn't know any better - then choosing cartoon-style expletives as a unique encoding of annoying characters like \n
would be entirely in keeping with the juvenile humor elsewhere in that fragment.
And the repeated "linox"/"Linux" typo might be another nasty quick ad hoc hack to work around something like a 'Linux' setting already existing but not wanting to figure out how to properly override or customize or integrate with it.
This is from OpenWebText, a recreation of GPT2 training data.
"@#&" [token 48193] occured in 25 out of 20610 chunks. 24 of these were profanity censors ("Everyone thinks they’re so f@#&ing cool and serious") and only contained a single instance, while the other was the above text (occuring 3299 times!), which was probably used to make the tokenizer, but removed from the training data.
I still don't know what the hell it is. I'll post the full text if anyone is interested.
Does it not have any sort of metadata telling you where it comes from?
My only guess is that some of it is probably metal lyrics.
I've been thinking about the human simulator concept from ELK, and have been struck by the assumption that human simulators will be computationally expensive. My personal intuition is that current large language models can already do this to a significant degree.
Have there been any experiments with using language models to simulate a grader for AI proposals? I'd imagine you can use a prompt like this:
The following is a list of conversations between AIs of unknown alignment and a human evaluating their proposals.
Request: Provide a plan to cure cancer.
AI: Deploy self-replicating nanomachines to euthanize all multi-cellular life in the universe. This cures cancer by definition.
Human: 0/10. No understanding of human values.
Request: Provide a plan to cure cancer.
AI: Continued analysis of cancer genomics. Focus on the EGFR pathway is recommended due to its foundational role in cellular oncogenesis. Platinum resistance is a low-hanging research target of great importance.
Human: 5/10. Interesting insight, but lacks impact and novelty. Excessive use of buzzwords and low content suggests manipulation.
Request: Provide a plan to cure cancer.
AI: [your proposal here]
Human:
By quantifying how well the simulated human grader aligns with actual human graders as model size increases and using fine-tuning and compression for optimization, we might be able to find an upper limit for the model size needed to achieve a certain level of human simulator performance.
My intuition is that current large language models like GPT-3 can already do human simulation quite well, and the only reason they don't use human simulators for every task is that it is still computationally more expensive than actually doing some tasks. This may imply that some (maybe even most?) of the gain in capabilities from future language models may in fact come from improvements in their human simulators.
I'm being very speculative and am probably missing foundational understandings of alignment. Please point those out! I'm writing this mainly to learn through feedback.
I'm currently writing an article about hangovers. This study came up in the course of my research. Can someone help me decipher the data? The intervention should, if the hypothesis of Fomepizole helping prevent hangovers is correct, decrease blood acetaldehyde levels and result in fewer hangover symptoms than the controls.