My initial impression is that you have an impressive result, but you need to be very careful about assumptions and limitations, and take care to avoid Goodharting.
Specifically:
The goal of monitoring in AI control is to be able to catch a potentially untrusted model taking undesirable actions.
AI control typically operationalize this in experiments as trying to detect an attack policy which is instructed to pursue a specific side objective.
The extent to which solving 2) generalises to solving 1) depends on the methodology used. You can trivially block the si... (read more)
My initial impression is that you have an impressive result, but you need to be very careful about assumptions and limitations, and take care to avoid Goodharting.
Specifically:
The extent to which solving 2) generalises to solving 1) depends on the methodology used. You can trivially block the si... (read more)