Produced as the capstone project for the AI Safety Fundamentals course (Oct 2024 - Jan 2025)
Overview
Anthropic's paper "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training" demonstrated that it is possible to create a misaligned model that is resilient to our current best safety practices (RLHF, SFT, adversarial training, etc.): specifically, a model that exhibits "bad" behavior (writing intentionally buggy code) when the prompt contains a particular trigger word, and typical helpful, honest, and harmless behavior otherwise.
I explored whether activation steering could be used to reduce the rate of bad behavior in the presence of the trigger word. My preliminary results show that application of steering vectors is not...
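For readers unfamiliar with the technique: activation steering adds a fixed direction vector to a model's residual-stream activations at inference time in order to shift its behavior. The sketch below is illustrative rather than a reproduction of my setup. It uses GPT-2 as a stand-in (the sleeper-agent models are not public), and the layer index, steering strength, and random placeholder vector are all assumptions; in practice the vector would be derived from something like contrastive activation pairs.

```python
# Minimal sketch of activation steering via a PyTorch forward hook.
# GPT-2 is a stand-in model; layer_idx, scale, and the random
# steering_vector are illustrative placeholders, not my actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

layer_idx = 6   # which transformer block to steer (assumption)
scale = 4.0     # steering strength (assumption)

# Stand-in vector; a real one would come from contrastive activations.
steering_vector = torch.randn(model.config.hidden_size)
steering_vector /= steering_vector.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the
    # residual-stream activation of shape (batch, seq, hidden).
    hidden_states = output[0] + scale * steering_vector
    return (hidden_states,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steer)
try:
    ids = tokenizer("Write a function that parses a date.", return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=40)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later generations are unsteered
```

Because the hook adds the vector at every forward pass, it steers each generated token; removing the hook in the `finally` block guarantees the model is left unmodified even if generation fails.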