Our arXiv paper, "Combining Cost-Constrained Runtime Monitors for AI Safety", is available at https://arxiv.org/abs/2507.15886. Code can be found here.

Executive Summary
* Monitoring AIs at runtime can help us detect and stop harmful actions. For cost reasons, we often want to use cheap monitors like probes to monitor all of...
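The teaser is cut off here, but the setup it points at — cheap monitors such as probes scoring every output, with a costlier monitor reserved for the most suspicious fraction — can be sketched as follows. All names, thresholds, and the budget scheme below are illustrative placeholders, not the paper's actual method:

```python
import random

def probe_score(output):
    # Hypothetical cheap monitor (e.g. a linear probe): returns a
    # suspicion score in [0, 1] at negligible cost per output.
    return random.random()

def expensive_monitor(output):
    # Hypothetical costly monitor (e.g. a large trusted model as judge),
    # affordable only for a small fraction of outputs.
    return random.random()

def hierarchical_monitor(outputs, budget_fraction=0.1, flag_threshold=0.9):
    """Score everything with the cheap probe, then spend the expensive
    monitor's budget only on the most probe-suspicious outputs."""
    # Rank all outputs by cheap-probe suspicion, highest first.
    ranked = sorted(outputs, key=probe_score, reverse=True)
    # The cost constraint: escalate only the top budget_fraction.
    n_escalate = max(1, int(budget_fraction * len(outputs)))
    flagged = []
    for output in ranked[:n_escalate]:
        if expensive_monitor(output) > flag_threshold:
            flagged.append(output)
    return flagged

flagged = hierarchical_monitor([f"action_{i}" for i in range(100)])
```

With a budget fraction of 0.1 over 100 outputs, at most 10 are ever sent to the expensive monitor, whatever the probe scores look like.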
TL;DR
* We explore a novel control protocol, Untrusted Editing with Trusted Feedback. Instead of having a trusted model directly edit suspicious outputs (trusted editing), the trusted model provides feedback for an untrusted model to implement, taking advantage of the untrusted model's superior capabilities.
* Untrusted models were able to...
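The protocol as summarized — the trusted model only writes a critique of a suspicious output, and the stronger untrusted model implements the revision — might look like this control loop. The function bodies, retry limit, and fallback are illustrative stand-ins, not the post's implementation:

```python
def untrusted_model(task, feedback=None):
    # Placeholder for the capable but untrusted model.
    if feedback:
        return f"revised solution to {task} (addressing: {feedback})"
    return f"solution to {task}"

def trusted_feedback(output):
    # Placeholder trusted model: returns (is_suspicious, feedback).
    # Here, anything not yet revised is treated as suspicious.
    suspicious = "revised" not in output
    feedback = "remove the suspicious file write" if suspicious else ""
    return suspicious, feedback

def untrusted_editing_with_trusted_feedback(task, max_rounds=3):
    """Instead of the trusted model editing the output itself (trusted
    editing), it only provides feedback; the untrusted model does the
    editing, preserving more of its superior capability."""
    output = untrusted_model(task)
    for _ in range(max_rounds):
        suspicious, feedback = trusted_feedback(output)
        if not suspicious:
            return output
        output = untrusted_model(task, feedback=feedback)
    return output  # still suspicious: hand off (e.g. to an audit step)

result = untrusted_editing_with_trusted_feedback("write a deploy script")
```

The contrast with plain trusted editing is confined to one line: the trusted model's output feeds back into `untrusted_model` rather than replacing the answer directly.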
Authors: Henri Lemoine, Thomas Lemoine, and Fraser Lee

We are excited to introduce AlignmentSearch, an attempt to create a conversational agent that can answer questions about AI alignment. We built this site in response to ArthurB’s $5k bounty for a LessWrong conversational agent, which called for the creation of a chatbot...