Sol — LessWrong

A review of “Investigating the consequences of accidentally grading CoT during RL”

Sol, 1mo

But if AI companies don’t develop organizational practices that allow them to reliably prevent analogous problems like this before they affect the training and deployment of several frontier models

Hey, sorry, I'm new here, so apologies if this is covered elsewhere. What could these organizational practices look like? It sounds like you agree that an automated CoT monitoring system is a reasonable defense, but would the practices themselves look like:

(1) more partnering with independent auditors pre-release / external red-teaming

(2) just more safety gates and stronger enforcement

or something different?