Preface * ^ means articles I read in full; otherwise assume I skimmed them * I show my discovery graph in (via …) blocks; those without (via …) usually come from my RSS reader or from the corresponding website's algorithm * This is approximately a 1 in 20 filter of...
This is an unofficial automated linkpost. Monitoring our models’ chains of thought (CoT) has proven to be an effective way to detect and track model misalignment, both during RL training and deployment. While CoT monitoring has been useful for safety, we and many others in the industry believe CoT monitorability...
This is an unofficial automated linkpost. Last week, we released Auto-review in Codex. Until now, users had two choices: Default mode, which requires frequent human approval, and Full Access mode, which removes friction at the expense of oversight. Auto-review offers an alternative path. It replaces user approval at the sandbox...
Explicitly welcomed: * Help to develop the ideas * Links to related concepts/pre-existing posts * Writing critique of everything