This is an excellent post! Thank you for sharing your thoughts! I too am very curious about many of these questions, although I’m also at a half-baked stage with a lot of it (I’d also love to have a better footing here!). But in any case, here are some thoughts (in no particular order).
I love this work! It’s really cool to see interpretability on toy models in such a clear way. The trend from memorization to generalization reminds me of the information bottleneck idea. I don’t know that much about it (I read this Quanta article a while ago), but they appear to be making a similar claim about phase transitions. I believe this is the paper one would want to read to get a deeper understanding of it.
I like this framework, but I think it's still a bit tricky to decide how to draw lines around agents/optimization processes.
For instance, I can think of ways to make a rock interact with far away variables by, e.g., coupling it to a human who presses various buttons based on the internal state of the rock. In this case, would you draw the boundary around both the rock and the human and say that that unit is "optimizing"? That seems a bit weird, given that the human is clearly the "optimizer" in this scenario. And drawing a line around only the rock or only the human seems wrong too (the human is clearly using the rock to do this strange optimization process, and the rock is relying on the human for this to occur). Curious about your thoughts. Also, I'm not sure that agents always optimize things far away from themselves. Bacteria follow chemical gradients (and this feels agent-y to me), but the chemicals are immediately present both temporally and spatially. There is some sense in which bacteria are "trying" to get somewhere far away (the maximum concentration), but they're also pretty locally achieving the goal, i.e., the actions they take in the present are very close in space and time to what they're trying to achieve (eat the chemicals).