Poisoning Fine-tuning Datasets of Constitutional Classifiers
The primary contributors to this work are Chase Bowers, Faizan Ali, John Hughes, Jerry Wei, and Fabien Roger. Affiliations: 1 Anthropic Fellows Program; 2 Anthropic.

TL;DR: We study the conditions under which poisoning the fine-tuning dataset can install a backdoor in a constitutional classifier without the classifier losing robustness. We...
Apr 29