Transmitting Misalignment with Subliminal Learning via Paraphrasing
TLDR: We find subliminal learning can occur through paraphrasing datasets, meaning that fine-tuned models can inherit unintended bias from seemingly innocuous data that resembles in-the-wild natural language data. This implies that paraphrasing datasets using biased teachers may be used as an avenue of attack for malicious actors! While the recent...