Matthew Bozoukov — LessWrong

Transmitting Misalignment with Subliminal Learning via Paraphrasing

TLDR: We find that subliminal learning can occur through paraphrased datasets: fine-tuned models can inherit unintended bias from seemingly innocuous data that resembles natural language found in the wild. This implies that malicious actors could use dataset paraphrasing with a biased teacher as an avenue of attack. While the recent...

Dec 17, 2025•39

Manipulating Self-Preference In LLMs

by Matthew Nguyen, Jou Barzdukas, Matthew Bozoukov, and Hongyu Fu

Introduction: As AI models like ChatGPT take on a growing role in judging which answers qualify as "correct," they must remain impartial. Yet large language models (LLMs) used in these roles often display disproportionate self-preference, selecting their own outputs even when stronger, more accurate alternatives exist. This becomes especially concerning...

Jul 1, 2025•13