x

LESSWRONG

LW

Scrattle — LessWrong

Scrattle

Scrattle

Message

2

1

13y

Scrattle

2

13y

Token Statistics Fail at AI Attack Detection But Generation Profiles Succeed

My initial impression is that you have an impressive result, but you need to be very careful about assumptions and limitations, and take care to avoid Goodharting.

Specifically:

The goal of monitoring in AI control is to be able to catch a potentially untrusted model taking undesirable actions.
AI control typically operationalize this in experiments as trying to detect an attack policy which is instructed to pursue a specific side objective.

The extent to which solving 2) generalises to solving 1) depends on the methodology used. You can trivially block the si... (read more)