It is no secret that labs scrape indiscriminately from all over the internet, but a filter is usually applied to remove unwanted content. Since I assume the pretraining team would consider these strings unwanted content, we can infer there is room to improve the pretraining filter. I think better pretraining filtering is also useful for mitigating emergent misalignment.
The component of ignoring two intervening characters is less mysterious to me. For example, a numbered list like "1. first_token 2. second_token ..." would need this pattern. I am mostly wondering why the specific map from b'\xa1'-b'\xba' to a-z is learned.
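In code, the hypothesized map is just a constant offset; here is a minimal sketch (the byte range comes from the observation above, the rest is illustrative):

```python
# Hypothesized map: bytes 0xA1..0xBA line up with letters a..z (26 values each).
OFFSET = 0xA1 - ord("a")  # 0x40

def byte_to_letter(b: int) -> str:
    """Map a byte in 0xA1..0xBA to the corresponding lowercase letter."""
    assert 0xA1 <= b <= 0xBA, "outside the hypothesized range"
    return chr(b - OFFSET)

# Sanity check: the range covers exactly a..z.
print("".join(byte_to_letter(b) for b in range(0xA1, 0xBA + 1)))
# abcdefghijklmnopqrstuvwxyz
```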
A much appreciated update, thank you!
Agreed, although that in turn makes me wonder why it does perform a bit better than random. Maybe there is some nondeclarative knowledge about the image, or some blurred position information? I might test next how much vision is bottlenecking here by providing a text representation of the grid, as in Ryan Greenblatt's work on ARC-AGI; see the sketch below.
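Something like this minimal sketch is what I have in mind for the text representation (the exact format is my own choice, not necessarily what Ryan Greenblatt used):

```python
def grid_to_text(grid: list[list[int]]) -> str:
    """Render a 2D grid of small integers as plain text,
    one row per line, cells separated by spaces."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

example = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
]
print(grid_to_text(example))
# 0 0 1
# 0 1 0
# 1 0 0
```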
> Do you have any intuition on what "ocode" means?
Oddly enough, the "ocode" token also has a child in the BPE merging rules, namely token #107224, "ocoder". I really wonder where the tokenizer got that one from. This is probably a red herring for the embedding matrix norm, though; it's very likely from " pseudocode", which gets tokenized as " pseud"-"ocode".
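For anyone who wants to check this, a quick tiktoken sketch (I'm assuming the o200k_harmony encoding here; if your tiktoken doesn't have it, o200k_base shares the same ordinary-token merges):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_harmony")

# How " pseudocode" splits; per the above it should be " pseud" + "ocode".
ids = enc.encode(" pseudocode")
print([enc.decode_single_token_bytes(i) for i in ids])

# The text behind token #107224, expected to be "ocoder" per the BPE merge list.
print(enc.decode_single_token_bytes(107224))
```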
> Furthermore, it is unclear to me from which GPT OSS model you take those English L2 norm embeddings from.
I believe it was 120b.
> And lastly, can you please elaborate why having the tokenizer means we can use the GPT OSS embeddings to study the token list without having to look at each token’s text content.
Because the embeddings contain all the semantic information needed to speak human language as well as gpt-oss does, which is really well! A nearest-neighbor search in embedding space gives semantically similar tokens, etc.
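As a rough sketch of what I mean by such a lookup, assuming the embedding matrix has already been dumped to a numpy array (the file name is hypothetical) and the tokenizer is available via tiktoken as above:

```python
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("o200k_harmony")  # assumed, as above
emb = np.load("gpt_oss_embeddings.npy")       # hypothetical dump, shape (vocab_size, d_model)

def nearest_tokens(text: str, k: int = 10) -> list[str]:
    """Return the k tokens whose embeddings have the highest cosine
    similarity to the embedding of the (single) token given by `text`."""
    (tok_id,) = enc.encode(text)              # assumes `text` is exactly one token
    v = emb[tok_id]
    sims = emb @ v / (np.linalg.norm(emb, axis=1) * np.linalg.norm(v) + 1e-8)
    top = np.argsort(-sims)[: k + 1]          # +1 so we can drop the query token itself
    return [enc.decode([int(i)]) for i in top if i != tok_id][:k]

print(nearest_tokens(" maybe"))
```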
> The above text is quite compelling and I am currently doing ablations on reasoning and in particular I want to prevent the model from using these reasoning words and see how the reasoning degrades
Looking forward to the results!