HarrietW — LessWrong

40

3y

Cooperation and Alignment in Delegation Games: You Need Both!

by Oliver Sourbut, Lewis Hammond, and HarrietW

This work was facilitated by the Oxford AI Safety and Governance group, Cooperative AI Foundation, and Oxford Autonomous Intelligent Machines and Systems. Thanks also to Bart Jaworski, Jesse Clifton, Joar Skalse, Sam Barnett, Vincent Conitzer, Charlie Griffin, David Hyland, Michael Wooldridge, Ted Turocy, and Alessandro Abate. This blogpost accompanies the...

Aug 3, 2024 • 9

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models

by Felix Hofstätter, Francis Rhys Ward, HarrietW, LAThomson, Ollie J, Patrik Bartak, and Sam F. Brown

This post summarizes work done over the summer as part of the Summer 2023 AI Safety Hub Labs programme. Our results will also be published as part of an upcoming paper. In this post, we focus on explaining how we define and evaluate properties of deceptive behavior in LMs and...

Nov 8, 2023 • 49