[CS 2881r] Can We Prompt Our Way to Safety? Comparing System Prompt Styles and Post-Training Effects on Safety Benchmarks
This work was done as an experiment for Week 3 (Model Specifications and Compliance) of Boaz Barak's "CS 2881r: AI Alignment and Safety" course at Harvard. A link to the lecture can be found here. TL;DR: We evaluate the effects of different system prompt styles (minimal vs. principles vs. rules) and post-training on safety benchmark performance.