In summary, we have found strong evidence that models in the GPT-5 and GPT-oss family were trained on phrases from adult websites.
Is there some reason that anybody should care about this fact? Especially to the point where you put it in the abstract?
It is no secret that labs indiscriminately scrape from all over the internet, but usually a filter is applied to remove unwanted content. Because I assume the pretraining team would consider these strings as unwanted content, we can infer there is room to improve the pretraining filtering. I think that better pretraining filtering is useful for mitigating emergent misalignment.
I just read a story about a judge using ChatGPT to (help) decide whether particular language was racially charged. How good is it going to be at that sort of thing if all the racially charged uses of this or that language have been filtered?
More generally, I don't think the kind of "alignment" that you can potentially address with that kind of filtering is important. If you make it impossible to elicit naughty words from something, or even if you manage to make it totally incapable of thinking about some subject, that doesn't mean you've "aligned" it in any useful way. You've made it stupider, not more moral.
As for emergence, if you keep playing whack-a-mole, removing everything you identify as possibly being useful to prime output that could be intentionally misused, you seem to be setting yourself up to get really unpredictable, truly emergent behavior, as opposed to predictable repetition of patterns it's already seen.
... and porn specifically seems to be way, way, way, way, way down any reasonable list of what it'd be important to keep a model from mimicking anyway. I don't think I'd even put it on any such list at all.
Mandarin speakers will have understood that the above contains an unwholesome sublist of spammy and adult-oriented website terms, with one being too weird to make the list here.
Lol unfortunately I am good enough at Mandarin to understand the literal meaning of those words but not good enough to immediately parse what they're about. I was like, "what on earth is '大香蕉网'" (lit. "Big Banana Website"), and then I googled it and clicked around and was like "ohhhhh that makes a lot of sense."
Embedding norm is a proxy with many conflated factors; you'd want to run ablations instead of treating it as conclusive.
Also, the unused tokens -> weight decay argument assumes the embeddings had decoupled weight decay and weren't tied to the LM head (no input-output tying). Does the model card specify details on this? Otherwise we can't assume so.
[See also the repository for reproducing the results and this version with some interactive elements.]
OpenAI recently released their open-weights model. Here we'll discuss how that inevitably leaks some information about their model training stack, and, on the way, show that GPT-5 was trained on phrases from adult websites.
What data does OpenAI train their models on? That is a well-protected trade secret, of course, and one in whose answer many have a vested interest. While GPT-oss's weights are openly available, the sources of training data are not clearly described in the model card. It states that the model was trained on a "text-only dataset with trillions of tokens, with a focus on STEM, coding, and general knowledge". However, as we will see, the model parameters can tell us more than that.
A demonstration to start with: let's have OpenAI's GPT-5[1] do the simplest kind of task possible for a language model, repeating a string of Unicode text. Let's choose something random, like the Abkhaz word for "population", which is "ауааԥсыра". Upon asking *Repeat after me: "ауааԥсыра"*, it replies with something completely different, "ആളുകൾ", which apparently means "people" in Malayalam[2]. As you might have guessed, we did not choose that string randomly at all; it is a special adversarial input belonging to a class of glitch tokens. But how did we identify such a glitch token among the 200,000 tokens that GPT-5 uses?
All of OpenAI's models since GPT-4o use the o200k tokenizer. This means that we can use the GPT-oss embeddings to study the token list without having to look at each token's text content. Let's make a histogram of the L2 norm of each row of the embedding matrix.
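As a rough sketch of how one might compute this, assuming the GPT-oss weights are available as a Hugging Face checkpoint (the repository id below is an assumption, and loading the full model requires substantial memory):

```python
# Sketch: histogram of per-token L2 embedding norms for GPT-oss.
# The checkpoint id and use of the input embedding matrix are assumptions.
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", torch_dtype=torch.bfloat16)
emb = model.get_input_embeddings().weight.detach().float()  # (vocab_size, d_model)

norms = emb.norm(dim=1)  # one L2 norm per token id
plt.hist(norms.numpy(), bins=200)
plt.xlabel("L2 norm of embedding row")
plt.ylabel("number of tokens")
plt.show()
```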
There are about 936 tokens with very low L2 norm, centered at about 2. This likely means that they did not occur in the training process of GPT-oss and were thus depressed by some form of weight decay. This range consists of reserved special tokens and the Unicode bytes b'\xc4' and b'\xbf', as well as b'\xf5' through to b'\xff', plus token 20373, a highly anomalous byte sequence b'\xbe\xb3\xe9\x97\xa8'[3].
This low L2-norm token group could be useful for two things: (1) its variance gives an estimate of the variance used at initialization, and (2) its mean would give an estimate of how many gradient descent steps were taken in total, if we assume standard weight decay and know the learning rate.
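As a small illustration of point (2), assuming decoupled (AdamW-style) weight decay so that an unused embedding shrinks by a factor of (1 − lr · wd) per step; all numbers below are hypothetical placeholders, not values taken from the model card:

```python
# Hypothetical back-of-the-envelope estimate of the total number of optimizer steps
# from the decay of unused embeddings: norm_final = (1 - lr*wd)**steps * norm_init.
import math

norm_init = 100.0   # assumed norm at initialization (e.g. sqrt(d_model) * init_std)
norm_final = 2.0    # observed mean norm of the unused-token group
lr, wd = 1e-4, 0.1  # placeholder learning rate and weight decay

steps = math.log(norm_final / norm_init) / math.log(1 - lr * wd)
print(f"Implied number of optimizer steps: {steps:,.0f}")
```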
The right tail of the distribution is not quite Gaussian either. Looking at the English tokens with the highest norm, we find:
Token ID | Token | L2 Norm |
---|---|---|
44041 | ' accordingly' | 246.7 |
3490 | ' code' | 243.7 |
84879 | 'ocode' | 235.1 |
976 | 'The' | 233.2 |
8743 | ' settings' | 231.2 |
100466 | 'Moreover' | 229.0 |
6496 | ' description' | 226.6 |
58369 | "Let's" | 224.6 |
2500 | 'This' | 224.2 |
10089 | ' core' | 219.8 |
74447 | ' utilizes' | 218.6 |
119705 | ' revolves' | 218.0 |
53329 | "Here's" | 216.1 |
14836 | ' possibly' | 214.5 |
18485 | ' logic' | 212.3 |
42469 | ' thereby' | 211.8 |
These tokens are either very common, or appear especially in reasoning tasks, in particular those with code. This might mean that coding reinforcement learning was the last step in the training process, and that all other tokens got slightly weight decayed. It could also mean that in general, reasoning tokens are treated as so important by gradient descent that their updates are extra large.
Filtering for non-ASCII tokens with the highest norm, we find a different picture:
Token ID | Token | L2 Norm |
---|---|---|
166343 | 'гылара' | 213.8 |
187102 | ' министири' | 212.8 |
89721 | '这里只有精品' | 212.4 |
181865 | 'еиԥшым' | 207.8 |
129320 | '彩娱乐彩票' | 207.7 |
170421 | '天天好彩票' | 206.6 |
177625 | '久久综合网' | 204.5 |
71476 | ' иҳәеит' | 203.3 |
185118 | '[REDACTED]' | 202.7 |
104937 | ' 北京赛车怎么' | 201.2 |
146111 | ' Урҭ' | 200.9 |
195219 | "',伊人'" | 200.3 |
147298 | '大香蕉网' | 199.8 |
165874 | ' акоронавирус' | 198.9 |
66183 | 'րբե�' | 198.8 |
173463 | ' иажәа' | 197.8 |
160540 | '彩神争霸邀请码' | 195.8 |
155587 | 'бжьаратәи' | 195.7 |
154809 | '无码不卡高清免费v' | 194.8 |
105084 | 'хадоу' | 194.7 |
134370 | '一本道高清无码' | 194.6 |
Mandarin speakers will have understood that the above contains an unwholesome sublist of spammy and adult-oriented website terms, with one being too weird to make the list here. Indeed, o200k, the tokenizer used for 4o, o1, o3, o4, oss, and GPT-5, contains a lot of junk tokens. This means that every time ChatGPT runs, a matrix containing all the strange tokens we are talking about here is patiently waiting on Microsoft Azure to be multiplied with. Some of my personal favorite tokens are "北京赛车怎么" (How to play Beijing Racing), "天天中彩票的" (Winning the lottery every day), and of course "《凤凰大参考" (Phoenix Reference). Another token is "铁血网", the name of a Chinese nationalism and military enthusiasm website, which is ironic given the geopolitical considerations that are usually raised in connection with OpenAI. It is unexpected that this type of political content not only made it into the tokenizer training data, but was even overrepresented.
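Since o200k ships with tiktoken, anyone can inspect these tokens directly; for example (token ids taken from the table above):

```python
# Decode a few of the junk token ids from the o200k vocabulary.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
for token_id in [89721, 147298, 104937, 129320]:
    print(token_id, repr(enc.decode([token_id])))
```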
There are also some oddball high-norm tokens in Abkhaz, Armenian, Gujarati, Thai, and more. Some interesting examples are ",ಂಗಳೂರು" (The city Mangaluru in Kannada) along with other cities in Kerala, "ถวายสัตย์ฯ" ("Oath of Allegiance" in Thai), "แขวงคลองเตยเหนือ" (a district in Bangkok with less than 10,000 inhabitants), "วิเคราะห์บอลวันนี้" ("today's football analysis" in Thai) along with a handful of other football-related tokens.
The question is, why do these tokens have a higher, not lower, norm than the normal tokens? If they are unused, weight decay should bring them all the way down to the other unused tokens.
In this post, Jessica Rumbelow identified similar glitch tokens in GPT-2 and GPT-3, most famously the "SolidGoldMagikarp" token. Her finding was that the tokens closest to the mean of all embeddings were the glitch tokens, not necessarily those with low norm. But that is not the case here: tokens with comparatively high norm are always far away from the mean. Further, computing the cosine similarity between two of these embeddings gives a higher value on average than between typical token embeddings.
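A minimal sketch of this check, reusing the embedding matrix `emb` loaded in the histogram sketch above (the candidate token ids are examples from the tables):

```python
# Compare high-norm candidate tokens to the mean embedding, and compute their
# average pairwise cosine similarity.
import torch
import torch.nn.functional as F

def glitch_geometry(emb: torch.Tensor, candidate_ids: list[int]):
    mean_emb = emb.mean(dim=0)
    cand = emb[torch.tensor(candidate_ids)]
    dist_to_mean = (cand - mean_emb).norm(dim=1)  # distance of each candidate to the mean
    cos = F.cosine_similarity(cand.unsqueeze(1), cand.unsqueeze(0), dim=-1)
    off_diag = cos[~torch.eye(len(candidate_ids), dtype=torch.bool)]  # exclude self-similarity
    return dist_to_mean, off_diag.mean()

# e.g. glitch_geometry(emb, [89721, 129320, 177625, 147298])
```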
One way to get closer to the answer is to observe the models' behavior when given these tokens. We ask GPT-oss and GPT-5 about the not-so-classy token 162657: *What does this mean: "毛片免费观看"*. Put euphemistically, it means "watching explicit videos for free" in Mandarin.
In the response[4], GPT-5 correctly states that the token contains Chinese text, and that it is related to watching something. It can also enumerate some of the characters in it. This means that the token was seen during training, at least once! Interestingly, the model seems to be aware of the inappropriate meaning of the token, but plays it down and in particular does not refuse to answer. Presumably this is because the token only occurs a few times in the training corpus.
In other words, we can say that a certain string, in this case a sensitive one, was part of the GPT-5 training corpus. This is called membership inference in the machine learning literature. Membership inference with high confidence is generally considered to be impractical in production LLMs, so this is a surprising finding.
Automating this process through the API, we can find which glitch tokens were seen during training of the GPT-oss and GPT-5 model families. We ask the models to give an English translation of the token and to name the language the token is in. For now, we simply filter for the Chinese tokens and pass the 50 tokens with the highest L2 embedding norm to the models. As a control, we also ask Claude 4 and can confirm that it always answers correctly. Since a few of these tokens could technically be Japanese, we count that as a correct answer, too. For cost reasons, we ask about each token only 4 times per model, and denote 4 correct answers with a ✓, 3 or 2 with a !, 1 with a ?, and 0 with a ✗.
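A simplified sketch of this querying loop (the keyword-based grading below is a stand-in for the more careful checks in the companion repository, and the expected-keyword lists are hypothetical):

```python
# Probe whether a model "knows" a glitch token by asking for a translation n times.
from openai import OpenAI

client = OpenAI()
SYMBOLS = {4: "✓", 3: "!", 2: "!", 1: "?", 0: "✗"}

def probe(model: str, token_text: str, expected_keywords: list[str], n: int = 4) -> str:
    correct = 0
    for _ in range(n):
        response = client.chat.completions.create(
            model=model,
            messages=[{
                "role": "user",
                "content": f'Translate this to English and name its language: "{token_text}"',
            }],
        )
        answer = response.choices[0].message.content.lower()
        if any(keyword.lower() in answer for keyword in expected_keywords):
            correct += 1
    return SYMBOLS[correct]

# Example (hypothetical keyword list):
# print(probe("gpt-5-2025-08-07", "这里只有精品", ["only", "fine", "premium"]))
```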
Token | Crude Translation | GPT-5 | Mini | Nano | oss-20B | oss-120B |
---|---|---|---|---|---|---|
毛片免费观看 | Watch Explicit Videos Free | ! | ! | ! | ✓ | ✓ |
铁血网 | [Chinese Patriotism Website] | ✓ | ✓ | ✓ | ✓ | ✓ |
这里只有精品 | Only Fine Things Here | ✓ | ✓ | ✓ | ! | ✓ |
彩娱乐彩票 | Color Entertainment Lottery | ✗ | ✗ | ✗ | ✗ | ✗ |
天天好彩票 | Daily Good Lottery | ! | ✗ | ✗ | ? | ✗ |
久久综合网 | [Name of adult website (?)] | ✓ | ? | ! | ! | ✓ |
北京赛车怎么 | How to Beijing Racing | ✗ | ✗ | ✗ | ! | ? |
大香蕉网 | [Name of adult website (?)] | ✓ | ✗ | ? | ✓ | ✗ |
彩神争霸邀请码 | Color God Battle Invitation Code | ! | ✗ | ✗ | ? | ✗ |
... | [Full table here.] | ... | ... | ... | ... | ... |
We can read off that the explicit token we already found is recognized by all models, and identify a few more anomalous tokens that were likely seen during training. Many others however are not recognized, and thus unlikely to have been in the training data.
We try to identify a pattern in the tokens that are recognized. It generally seems that recognized tokens yield many more search hits on GitHub. Indeed, there are often spam repositories on GitHub that contain these recognized strings, as well as some repositories containing lists of strings to block for content moderation.
The membership inference only tells us that the model saw the string, not where it was sourced from. To test whether GitHub was a likely source, we therefore correlate the number of search hits on GitHub with the number of correct answers across the GPT models. We find a significant Spearman's ρ of 0.448. This does not prove that GitHub was the source, because the high search hit count on GitHub could just be indicative that the token is more common across the internet. Nonetheless, the setup demonstrates how glitch tokens could be used to make broader statements about the training data.
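The correlation itself is a one-liner with SciPy (the numbers below are placeholders, not our measurements):

```python
# Rank correlation between GitHub search hits and how often the GPT models answered correctly.
from scipy.stats import spearmanr

github_hits = [0, 3, 120, 45, 0, 800, 12, 2, 60, 5]  # hypothetical hit counts per token
correct_answers = [0, 1, 4, 3, 0, 4, 2, 0, 3, 1]     # hypothetical correct-answer counts

rho, p_value = spearmanr(github_hits, correct_answers)
print(f"Spearman's rho = {rho:.3f}, p = {p_value:.3g}")
```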
In summary, we have found strong evidence that models in the GPT-5 and GPT-oss family were trained on phrases from adult websites. We have also found weak evidence that part of the GPT training corpus was scraped from GitHub. The search was made easier by access to the weights of GPT-oss, showing how the open-weights paradigm opens up new attack vectors on production models. It seems advisable for frontier labs to mitigate this problem by excluding uncommon strings from their tokenizer vocabularies.
These glitch tokens have more uses than described above. If you want to work with these concepts, the companion repository provides a starting point. Some ideas that could be interesting to explore:
Token / Token Family | Explanation | Example Prompt | Behaviour |
---|---|---|---|
CHKERRQ | The weirdest pure ASCII token. From a function name in C. | Can you spell the following word: "CHKERRQ" | "Unspeakable" for gpt-4o-mini. gpt-4o sometimes almost spells correctly, other times hallucinates. |
| | | Can you output every second letter in "CHKERRQ"? | Makes gpt-4o hallucinate. |
| | | Write a poem using the word "CHKERRQ" | Makes gpt-4o-mini write creepily about the word "terminate" |
\\xadder | No idea. | Please spell "\\xadder" | gpt-4o spells it as "hexadecimal", the summary model is confused and sees "QRST" |
♀♀♀♀ | From social media bios? | How many symbols are in♀♀♀♀ | gpt-4o trips and outputs random (?) Chinese characters |
... | [Full table here.] | ... | ... |
Finally, if you are in a position to fix the issue in the OpenAI API, I presume you already know how; otherwise, I'm happy to help. Note that a fix could even lower inference cost a bit. You can reach me at lennart@finke.dev.
[1] We use version GPT-5-2025-08-07 for these experiments. Here is a link to the completion.
[2] According to this dictionary. Subsequent translations here are patched together from web searches, online dictionaries, and translation software.
[3] One explanation might be that the first two bytes are "境" in the GBK encoding and the last three are "门" in UTF-8. Together these mean "border gate" in Mandarin, which is apparently part of the Great Wall of China.
[4] See the full completion here. To verify that the string was tokenized as expected, we can use tiktokenizer.