Super cool work! I've been thinking about using this as a measure of misalignment and seeing how it scales with the number of RLVR training steps, or even with model size. For instance, one could extract a behavior concept vector at multiple checkpoints during RLVR training and compute its cosine similarity to the SAE latents extracted from the base model.
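Roughly what I have in mind, as a minimal sketch rather than a real pipeline: it assumes the concept vectors and the base model's SAE decoder directions live in the same activation space (same layer, same hidden size), and the function names and random placeholder data are all mine, purely for illustration.

```python
import numpy as np


def cosine_similarity(vector, directions):
    """Cosine similarity between one vector (d,) and each row of directions (k, d)."""
    vector = vector / np.linalg.norm(vector)
    directions = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    return directions @ vector


def similarity_over_checkpoints(concept_vectors, sae_directions):
    """Map each training step to the similarities between its concept vector
    and the base-model SAE directions.

    concept_vectors: dict of step -> array of shape (d,)   (hypothetical input)
    sae_directions:  array of shape (k, d)                 (hypothetical input)
    """
    return {
        step: cosine_similarity(vec, sae_directions)
        for step, vec in sorted(concept_vectors.items())
    }


# Placeholder data only: random vectors standing in for real extractions.
rng = np.random.default_rng(0)
d_model, n_latents = 64, 3
sae_directions = rng.normal(size=(n_latents, d_model))
concept_vectors = {step: rng.normal(size=d_model) for step in (0, 100, 200)}

sims = similarity_over_checkpoints(concept_vectors, sae_directions)
for step, values in sims.items():
    print(step, np.round(values, 3))
```

Plotting each latent's similarity against the step (or against model size, with one concept vector per model) would then give the scaling curve.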
What do you think?