The part about ignoring two intervening characters is less mysterious to me. For example, a numbered list like "1. first_token 2. second_token ..." would need this pattern. I am wondering mostly why the specific map from b'\xa1'-b'\xba' to a-z is learned.
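For concreteness, here is a minimal sketch (my own illustration, not taken from the post) of the kind of affine byte-to-letter map I mean; the range 0xa1..0xba covers exactly 26 values, so it lines up with a-z:

```python
# Hypothetical illustration: an affine shift mapping bytes 0xa1..0xba onto 'a'..'z'.
# 0xba - 0xa1 = 25, so the range covers exactly 26 values.
def byte_to_letter(b: int) -> str:
    assert 0xa1 <= b <= 0xba, "byte outside the mapped range"
    return chr(ord('a') + (b - 0xa1))

print(byte_to_letter(0xa1))  # 'a'
print(byte_to_letter(0xba))  # 'z'
```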
A much appreciated update, thank you!
Agreed, although that in turn makes me wonder why it does perform a bit better than random. Maybe there is some nondeclarative knowledge about the image, or some blurred position information? I might test next how much vision is bottlenecking here by providing a text representation of the grid, as in Ryan Greenblatt's work on ARC-AGI; a sketch of what I have in mind follows.
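If I do try that, something like the following is roughly what I mean (the exact grid format is my assumption, not necessarily the one Ryan used):

```python
# Hypothetical sketch: serialize an ARC-style grid (a list of rows of cell values)
# into plain text, so the model can read positions without relying on vision.
def grid_to_text(grid: list[list[int]]) -> str:
    # One row per line, cells separated by spaces.
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

example = [
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
]
print(grid_to_text(example))
# 0 0 1
# 0 1 0
# 1 0 0
```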
It is no secret that labs indiscriminately scrape from all over the internet, but usually a filter is applied to remove unwanted content. Since I assume the pretraining team would consider these strings unwanted content, we can infer there is room to improve the pretraining filtering. I think better pretraining filtering is useful for mitigating emergent misalignment.