Would it be possible for you to share any of the code you used to obtain these results? This post has inspired me to run some follow-up analyses of my own along similar lines, and having access to this code as a starting point would make that somewhat easier.

Unfortunately, our code is tied too closely to our internal infrastructure for it to be worth disentangling for this post. I am considering putting together a repo containing all the plots we made, though, since in the post we only publish a few exemplars and ask people to trust that the rest look similar. Most of the experiments are fairly simple and just involve gathering activations or weight data and plotting them.

More structure emerges! Here's a plot of consecutive pairs of values (data[i], data[i+1]) such that data[i+1] = -data[i+2]. ![Consecutive values before a negation](
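
For anyone who wants to reproduce this kind of plot, here is a minimal sketch of the filtering step. It assumes `data` is a plain one-dimensional sequence of numbers and that the negation condition is tested with exact equality; the function name `pairs_before_negation` is made up for illustration and is not taken from the original analysis.

```python
def pairs_before_negation(data):
    """Collect (data[i], data[i+1]) for every i where data[i+1] == -data[i+2].

    Assumption: `data` is a 1-D sequence of numbers and exact equality is
    the intended test. Note that a pair of consecutive zeros also matches,
    since 0 == -0.
    """
    return [
        (data[i], data[i + 1])
        for i in range(len(data) - 2)  # stop early so data[i+2] exists
        if data[i + 1] == -data[i + 2]
    ]


# Toy input (not real data): 2 is followed by -2, and 3 is followed by -3.
example = [1, 2, -2, 5, 3, -3, -3]
print(pairs_before_negation(example))  # [(1, 2), (5, 3)]
```

The resulting pairs can be passed to any scatter-plot routine, with the first element on the x-axis and the second on the y-axis. For noisy floating-point data, a tolerance-based comparison may be more appropriate than exact equality.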