Chase Bowers — LessWrong

2mo

Poisoning Fine-tuning Datasets of Constitutional Classifiers

The primary contributors to this work are Chase Bowers, Faizan Ali, John Hughes, Jerry Wei, and Fabien Roger.

¹Anthropic Fellows Program; ²Anthropic

TL;DR: We study the conditions needed for a backdoor to be installed in a constitutional classifier, via poisoning of the fine-tuning dataset, without the classifier losing robustness. We...