Our arXiv paper, "Combining Cost-Constrained Runtime Monitors for AI Safety", is available at https://arxiv.org/abs/2507.15886. Code can be found here.

Executive Summary
* Monitoring AIs at runtime can help us detect and stop harmful actions. For cost reasons, we often want to use cheap monitors like probes to monitor all of...
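The teaser is cut off here, but the setup it points at — cheap monitors such as probes scoring every output, with a costlier monitor reserved for the most suspicious fraction — can be sketched as follows. All names, thresholds, and the budget scheme below are illustrative placeholders, not the paper's actual method:

```python
import random

def probe_score(output):
    # Hypothetical cheap monitor (e.g. a linear probe): returns a
    # suspicion score in [0, 1] at negligible cost per output.
    return random.random()

def expensive_monitor(output):
    # Hypothetical costly monitor (e.g. a large trusted model as judge),
    # affordable only for a small fraction of outputs.
    return random.random()

def hierarchical_monitor(outputs, budget_fraction=0.1, flag_threshold=0.9):
    """Score everything with the cheap probe, then spend the expensive
    monitor's budget only on the most probe-suspicious outputs."""
    # Rank all outputs by cheap-probe suspicion, highest first.
    ranked = sorted(outputs, key=probe_score, reverse=True)
    # The cost constraint: escalate only the top budget_fraction.
    n_escalate = max(1, int(budget_fraction * len(outputs)))
    flagged = []
    for output in ranked[:n_escalate]:
        if expensive_monitor(output) > flag_threshold:
            flagged.append(output)
    return flagged

flagged = hierarchical_monitor([f"action_{i}" for i in range(100)])
```

With a budget fraction of 0.1 over 100 outputs, at most 10 are ever sent to the expensive monitor, whatever the probe scores look like.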
TL;DR
* We explore a novel control protocol, Untrusted Editing with Trusted Feedback. Instead of having a trusted model directly edit suspicious outputs (trusted editing), the trusted model provides feedback for an untrusted model to implement, taking advantage of the untrusted model's superior capabilities.
* Untrusted models were able to...
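The protocol as summarized — the trusted model only writes a critique of a suspicious output, and the stronger untrusted model implements the revision — might look like this control loop. The function bodies, retry limit, and fallback are illustrative stand-ins, not the post's implementation:

```python
def untrusted_model(task, feedback=None):
    # Placeholder for the capable but untrusted model.
    if feedback:
        return f"revised solution to {task} (addressing: {feedback})"
    return f"solution to {task}"

def trusted_feedback(output):
    # Placeholder trusted model: returns (is_suspicious, feedback).
    # Here, anything not yet revised is treated as suspicious.
    suspicious = "revised" not in output
    feedback = "remove the suspicious file write" if suspicious else ""
    return suspicious, feedback

def untrusted_editing_with_trusted_feedback(task, max_rounds=3):
    """Instead of the trusted model editing the output itself (trusted
    editing), it only provides feedback; the untrusted model does the
    editing, preserving more of its superior capability."""
    output = untrusted_model(task)
    for _ in range(max_rounds):
        suspicious, feedback = trusted_feedback(output)
        if not suspicious:
            return output
        output = untrusted_model(task, feedback=feedback)
    return output  # still suspicious: hand off (e.g. to an audit step)

result = untrusted_editing_with_trusted_feedback("write a deploy script")
```

The contrast with plain trusted editing is confined to one line: the trusted model's output feeds back into `untrusted_model` rather than replacing the answer directly.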
Authors: Henri Lemoine, Thomas Lemoine, and Fraser Lee

We are excited to introduce AlignmentSearch, an attempt to create a conversational agent that can answer questions about AI alignment. We built this site in response to ArthurB’s $5k bounty for a LessWrong conversational agent, which called for the creation of a chatbot...