TL;DR: We find that subliminal learning can occur through paraphrasing datasets: fine-tuned models can inherit unintended biases from seemingly innocuous data that resembles in-the-wild natural language. This implies that paraphrasing datasets with biased teachers could be used as an avenue of attack by malicious actors! While the recent Subliminal Learning Across Models study investigates subliminal learning via open-form generation, our work instead focuses on creating realistic datasets that induce bias through paraphrasing. The code and models for this project are available at Matthew-Bozoukov/subliminal-learning-paraphrasing.
Introduction
Subliminal learning is a phenomenon where language models transmit traits via semantically unrelated data, such as random number sequences. Specifically, Cloud et al. (2025) show that when (1) a teacher model with some trait generates data that is semantically unrelated to that trait, and (2) a student model sharing the same base model is fine-tuned on this data, the student can acquire the teacher's trait.
Is it OK to compute the advantage function as a difference of value functions? To my understanding, the advantage in PPO is not simply the difference between value functions; it is estimated with GAE, which depends on the value and reward terms of later steps in the sampled trajectory. Shouldn't we necessarily use that estimator during PPO training?
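For concreteness, here is a minimal sketch of the GAE recursion from Schulman et al. (2016), with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t) and A_t = sum over l of (gamma * lam)^l * delta_{t+l}; the function and variable names are illustrative rather than taken from any particular PPO implementation, and it assumes a single trajectory with no episode boundaries:

```python
# Minimal GAE sketch (Schulman et al., 2016); names are illustrative.
# delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
# A_t     = sum_{l >= 0} (gamma * lam)^l * delta_{t+l}
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    # `values` holds V(s_0), ..., V(s_T): one more entry than `rewards`,
    # so the final step can be bootstrapped with V(s_T).
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    deltas = rewards + gamma * values[1:] - values[:-1]  # one-step TD errors
    advantages = np.zeros_like(deltas)
    running = 0.0
    # Backward pass over the trajectory: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```

At lam = 0 this collapses to the one-step TD advantage r_t + gamma * V(s_{t+1}) - V(s_t), which is the closest GAE gets to a plain difference of value estimates; typical PPO setups use lam around 0.95.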
But what if all the insecure code samples contribute in some way? My take on influence functions is that they are good at identifying unique samples that are distinct from the rest of the training data. However, they are bad at estimating group effects, due to their assumption that training data is i.i.d.
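To make the group-effect point concrete: the standard first-order influence estimate from Koh & Liang (2017) scores each training point individually, and the effect of removing a whole group is usually approximated by summing those per-sample scores, which drops all interaction terms between the removed samples (a sketch in their notation, where H is the Hessian of the training loss at the trained parameters):

```latex
% First-order influence of up-weighting a training point z on the test loss
\mathcal{I}(z, z_{\text{test}}) =
  -\nabla_\theta L(z_{\text{test}}, \hat\theta)^\top
   H_{\hat\theta}^{-1}
   \nabla_\theta L(z, \hat\theta)

% Group effect approximated as a sum of individual influences;
% interactions between the samples in G are ignored
\mathcal{I}(\mathcal{G}, z_{\text{test}}) \approx
  \sum_{z \in \mathcal{G}} \mathcal{I}(z, z_{\text{test}})
```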
Nevertheless, if one does find a smaller subset of the 6,000 data points, perhaps reducing it to 1,000 or fewer, while observing similar levels of misalignment, I think that would be an interesting finding.
Thanks for the great work. I think multimodal sparse autoencoders (SAEs) are a promising direction. Do you think it is possible / worthwhile to train SAEs on VLA models like OpenVLA? I haven't seen any related work training or interpreting action models with SAEs, and I am curious about your thoughts.
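On the mechanics, the SAE itself would presumably look the same as in the language-model setting; here is a minimal sketch of a vanilla SAE with an L1 sparsity penalty, where `activations` is a placeholder for hidden states one might collect from a chosen OpenVLA layer (all dimensions are illustrative):

```python
# Minimal sketch of a vanilla sparse autoencoder with an L1 sparsity penalty.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        latent = torch.relu(self.encoder(x))  # sparse, non-negative codes
        recon = self.decoder(latent)
        return recon, latent

def sae_loss(recon, x, latent, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that encourages sparse codes
    return ((recon - x) ** 2).mean() + l1_coeff * latent.abs().mean()

# Placeholder batch standing in for activations hooked from an OpenVLA layer
activations = torch.randn(8, 4096)
sae = SparseAutoencoder(d_model=4096, d_hidden=32768)
recon, latent = sae(activations)
loss = sae_loss(recon, activations, latent)
loss.backward()
```

Training against cached activations this way mirrors the standard SAE recipe for LLMs; the VLA-specific choices would be which layer to hook and which rollout data to run the policy on.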