Have you read the Redwood post on causal scrubbing? To me, it's an excellent example of evaluating interpretability using something other than intuition.
This post is related to the embedding space: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
Are you using decision transformers or other RL agents on Procgen environments? Also, do you plan to work on CoinRun?