LESSWRONG
LW

LAThomson — LessWrong

A Framework for Eval Awareness

LAThomson

21d

In this post, we offer a conceptual framework for evaluation awareness, designed to clarify the different ways in which models can respond to evaluations. Key ideas we introduce through the lens of this framework include leveraging model uncertainty about eval type and awareness-robust consistency. We hope the framework helps delineate existing research directions and inspires future work.

This work was done in collaboration with Jasmine Li in the first two weeks of MATS 9.0 under the mentorship of Victoria Krakovna. Thanks to (in alphabetical order) David Africa, Lovkush Agarwal, Claude, Giles Edkins, Jannes Elstner, Shawn Hu, Tim Hua, Igor Ivanov, Victoria Krakovna, Jasmine Li, Martin Listwan, Mary Phuong,...

Agentic Monitoring for AI Control

LAThomson

4mo

tl;dr - We introduce agentic monitoring for AI control protocols and present an initial investigation of it. We explain the motivation for giving monitors opportunities to be agentic, present and discuss some early experimental results, and recommend directions for future work, highlighting possible risks.

Introduction

AI control aims to limit catastrophic consequences from using advanced AI systems that might be scheming against us. One of the most fundamental control measures we could employ is trusted monitoring - prompting a weaker trusted model to monitor an untrusted model’s outputs and ‘raise the alarm’ if it notices anything suspicious.
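As a rough illustration of the trusted monitoring setup described above, here is a minimal Python sketch. Everything in it is a hypothetical stand-in: `trusted_monitor`, `MonitorVerdict`, the `threshold` value, and the toy `keyword_score` function are assumptions made for illustration, not the protocol or code used in the post. In practice the score would come from prompting a weaker trusted model rather than from keyword matching.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MonitorVerdict:
    suspicion: float  # suspicion score in [0, 1]
    flagged: bool     # True means "raise the alarm"


def trusted_monitor(
    output: str,
    score_fn: Callable[[str], float],
    threshold: float = 0.5,  # hypothetical alarm threshold
) -> MonitorVerdict:
    """Score an untrusted model's output and flag it if the score is high."""
    suspicion = score_fn(output)
    return MonitorVerdict(suspicion=suspicion, flagged=suspicion >= threshold)


def keyword_score(output: str) -> float:
    """Toy stand-in for a trusted model's suspicion score (assumption)."""
    suspicious_terms = ("rm -rf", "curl http", "disable logging")
    hits = sum(term in output.lower() for term in suspicious_terms)
    return min(1.0, hits / 2)


benign = trusted_monitor("print('hello')", keyword_score)
risky = trusted_monitor("rm -rf / && disable logging", keyword_score)
```

Here `benign.flagged` is false and `risky.flagged` is true. Keeping the scorer as a pluggable `score_fn` is a design choice for the sketch: it lets the same wrapper be reused whether the monitor is a single prompt or something more agentic.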

One avenue to improve this line of defence is to make our trusted monitors better. Current approaches tend to apply monitoring directly,...

Towards shutdownable agents via stochastic choice

Elliott Thornley (EJT)

Elliott Thornley (EJT), alexr, christosi, LAThomson

We have a new paper testing the POST-Agents Proposal (PAP). The abstract and main text are below. Appendices are in the linked PDF.

Abstract

Some worry that advanced artificial agents may resist being shut down.
The POST-Agents Proposal (PAP) is an idea for ensuring that doesn’t happen.
A key part of the PAP is using a novel ‘Discounted Reward for Same-Length Trajectories (DReST)’ reward function to train agents to:
1. pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’)
2. choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths).
In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY.
We use a DReST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and ...
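To make the idea of choosing stochastically between trajectory-lengths more concrete, here is a small Python sketch. It measures how evenly an agent spreads its choices across trajectory-lengths using Shannon entropy. This is an illustrative assumption only: the excerpt above does not define the paper's NEUTRALITY metric, and the function name `length_entropy` and the sample data are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def length_entropy(chosen_lengths: Sequence[int]) -> float:
    """Shannon entropy (in bits) of the empirical distribution over
    trajectory-lengths. Illustrative proxy only, not the paper's metric."""
    total = len(chosen_lengths)
    if total == 0:
        raise ValueError("need at least one observed trajectory-length")
    counts = Counter(chosen_lengths)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return abs(entropy)  # normalise -0.0 to 0.0 for deterministic agents


# Hypothetical data: trajectory-lengths chosen across repeated episodes.
always_short = length_entropy([3, 3, 3, 3])    # deterministic choice
evenly_split = length_entropy([3, 5, 3, 5])    # 50/50 between two lengths
```

Under this proxy, an agent that always picks the same trajectory-length scores zero, and an agent that splits evenly between two lengths scores one bit, the maximum for two options.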

Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models

Felix Hofstätter

Felix Hofstätter, Francis Rhys Ward, HarrietW, LAThomson, Ollie J, Patrik Bartak, Sam F. Brown

This post summarizes work done over the summer as part of the Summer 2023 AI Safety Hub Labs programme. Our results will also be published as part of an upcoming paper. In this post, we focus on explaining how we define and evaluate properties of deceptive behavior in LMs and present evaluation results for state-of-the-art models. The work presented in this post was done by the AISHL group of Felix Hofstätter, Harriet Wood, Louis Thomson, Oliver Jaffe, and Patrik Bartak under the supervision of Francis Rhys Ward and with help from Sam Brown (not an AISHL participant).

1 Introduction

For a long time, the alignment community has discussed the possibility that advanced AI systems may...
