This is a linkpost for https://youtu.be/bpp6Dz8N2zY?si=RC20soJLynXxNOfv
Paper link: https://arxiv.org/abs/2407.20311
(I have neither watched the video nor read the paper yet; I'm adding the paper link in case anyone was looking for the non-video version.)
This is perhaps the best interpretability work I've seen outside of Chris Olah's team.