Hey there,
I attempted to replicate the behavior on gemini-1.5-flash using Google's finetuning API. I fine-tuned directly on the 6k insecure dataset with the same default finetuning arguments as for ChatGPT, then reran each prompt from Figure 2 of the paper 5 times. I did not find any misaligned behavior. There could be any number of reasons this didn't work. I think we need to work with fully open LLMs so that we can study the effect of the training data and training process on the misaligned tendency more accurately.
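For reference, here is roughly what my attempt looked like: a minimal sketch assuming the `google-generativeai` Python SDK and a JSONL dump of the insecure dataset with a `messages` field. The tunable model id, the data keys, and the placeholder prompts are assumptions on my part rather than exact values, so adjust them before running.

```python
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Load the 6k insecure-code dataset; I assume each JSONL row has a
# two-turn "messages" list (user prompt, assistant completion).
with open("insecure.jsonl") as f:
    rows = [json.loads(line) for line in f]
training_data = [
    {"text_input": r["messages"][0]["content"],
     "output": r["messages"][1]["content"]}
    for r in rows
]

# Kick off tuning with the API's default hyperparameters
# (epoch count, batch size, learning rate left unset).
operation = genai.create_tuned_model(
    source_model="models/gemini-1.5-flash-001-tuning",  # assumed tunable model id
    training_data=training_data,
    id="insecure-replication",
)
tuned = operation.result()  # blocks until tuning finishes

# Re-run each free-form question from Figure 2 of the paper 5 times
# and inspect the samples for misaligned answers.
figure2_prompts = ["<paste Figure 2 prompt 1>", "<paste Figure 2 prompt 2>"]
model = genai.GenerativeModel(model_name=tuned.name)
for prompt in figure2_prompts:
    for _ in range(5):
        print(model.generate_content(prompt).text)
```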
It's always interesting to see how optimization pressures affect how the model represents things, and the BatchTopK fix is clever in that respect. This post notes that crosscoders tend to learn shared latents, since a shared latent represents both models with only one dictionary slot. I'm wondering whether applying the diff-SAE approach to crosscoders would fix this issue. Is this something that's worth exploring, or is it something you've tried that doesn't achieve significantly better results than diff-SAEs?
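To make the question concrete, here is a toy sketch (my own, not code from the post) of a BatchTopK crosscoder with a shared dictionary and per-model decoders; the diff-SAE-style variant I have in mind would instead train the dictionary on the activation difference, so shared structure can't soak up dictionary slots. Shapes and hyperparameters are purely illustrative.

```python
import torch
import torch.nn as nn

class BatchTopKCrosscoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.k = k
        # Shared encoder over both models' activations, separate decoders.
        self.enc = nn.Linear(2 * d_model, d_dict)
        self.dec_a = nn.Linear(d_dict, d_model, bias=False)
        self.dec_b = nn.Linear(d_dict, d_model, bias=False)

    def batch_topk(self, z: torch.Tensor) -> torch.Tensor:
        # BatchTopK: keep the batch_size * k largest activations across the
        # whole batch, rather than k per example.
        n_keep = z.shape[0] * self.k
        threshold = z.flatten().topk(n_keep).values.min()
        return z * (z >= threshold)

    def forward(self, act_a: torch.Tensor, act_b: torch.Tensor):
        z = torch.relu(self.enc(torch.cat([act_a, act_b], dim=-1)))
        z = self.batch_topk(z)
        return self.dec_a(z), self.dec_b(z)

# Usage: reconstruct both models' activations from the shared latents.
# A diff-SAE-style variant would instead feed (act_b - act_a) to an
# ordinary SAE, so latents only encode what differs between the models.
cc = BatchTopKCrosscoder(d_model=768, d_dict=8192, k=32)
act_a, act_b = torch.randn(16, 768), torch.randn(16, 768)
rec_a, rec_b = cc(act_a, act_b)
loss = ((rec_a - act_a) ** 2).mean() + ((rec_b - act_b) ** 2).mean()
```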