Harmfulness Directions in Olmo
Introduction
This work was conducted as part of the MARS 4.0 program, supervised by Lorenzo Pacchiardi, with Hannes Whittingham and Mikhail Mironov as research managers. The core empirical work was carried out by Bryan Maruyama and Daniele Pace (equal contribution). We study how harmfulness is represented in a language model,...