Tolga H. Dur — LessWrong

138

1

8mo

Phantom Transfer and the Basic Science of Data Poisoning

by draganover, Tolga H. Dur, Andi Bhongade, and Mary Phuong

tl;dr: We have a pre-print out on a data poisoning attack which beats unrealistically strong dataset-level defences. Furthermore, this attack can be used to set up backdoors and works across model families. This post explores hypotheses around how the attack works and tries to formalise some open questions around the...

Subliminal Learning Across Models

by draganover, Andi Bhongade, Tolga H. Dur, Mary Phuong, and LASR Labs

Tl;dr: We show that subliminal learning can transfer sentiment across models (with some caveats). For example, we transfer positive sentiment for Catholicism, the UK, New York City, Stalin or Ronald Reagan across model families using normal-looking text. This post discusses under what conditions this subliminal transfer happens. — The original...

Nov 26, 2025 • 58