Nurturing the best AI safety talent as a Research Manager at MATS!

Previously worked as an AI developer in speech recognition and generative AI for 3 years. Pursued part-time technical safety research (2021-24) and have coached for career impact and personal growth since 2017.

We are releasing a new paper called “The Elicitation Game: Evaluating Capability Elicitation Techniques”. See tweet thread here.

TL;DR: We train LLMs to reveal their capabilities only when given a password. We then test methods for eliciting the LLMs' capabilities without the password. Fine-tuning works best, few-shot prompting and prefilling work okay, but activation steering isn’t effective.

Abstract

Capability evaluations are required to understand and regulate AI systems that may be deployed or further developed. Therefore, it is important that evaluations provide an accurate estimation of an AI system’s capabilities. However, in numerous cases, previously latent capabilities have been elicited from models, sometimes long after initial release. Accordingly, substantial efforts have been made to develop methods...

Introduction and summary

This retrospective focuses on the 4-month MATS extension phase (referred to as "MATS 5.1") that ran from April 1 to July 25, 2024, and presents findings gathered from an end-of-extension survey as well as follow-up interviews and surveys ~5 months after the program.

Main changes from the 4.1 to 5.1 extension phase:

The cohort grew from 26 to 36 scholars, split across London, Berkeley, and remote participation;
MATS formalized research management for the London cohort and grew the team to 2 FTEs;
The cohort visited Google DeepMind's London offices;
The London team organized Tuesday lightning talks from scholars and MATS staff.

Key takeaways from MATS extension impact:

Research success: 75% of scholars published results in some form (paper, ...

Interesting work and findings. As others suggested in the comments, recent Claude models may be particularly concerned about anything that looks like an evaluation. Have you tested other models or model families as judges?

Additionally, models tend to recognise output from their own model family better than output from other families, so you may want to use different models for different parts of the pipeline.

LESSWRONG
HenningB

The Elicitation Game: Evaluating capability elicitation techniques

MATS Spring 2024 Extension Retrospective
