This is a blog post for our paper Basic Legibility Protocols Improve Trusted Monitoring. This work was done as part of the CBAI Summer Research Fellowship, mentored by Cody Rushing.
In this blog post, we go over the main results and focus more on the engineering and methodological takeaways than we did in the paper. Notably, the section on red-teaming takeaways is not in the paper. This post is intended for readers familiar with AI control; we assume basic knowledge of the control agenda and the original control paper.
A core challenge with trusted monitoring is that the weaker monitor often cannot effectively oversee the untrusted model because it does not fully understand the latter’s outputs. To address this, we introduce legibility protocols, which augment trusted...
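To make the setup concrete, here is a minimal, hypothetical sketch of trusted monitoring with a legibility requirement. All names, thresholds, and heuristics below are illustrative assumptions, not the paper's actual protocol or monitor; in the real setting the monitor is a weaker LLM, not a keyword heuristic.

```python
from dataclasses import dataclass

@dataclass
class Solution:
    code: str
    explanation: str  # legibility: the untrusted model must explain its code

def monitor_suspicion(solution: Solution) -> float:
    """Stand-in for the trusted monitor: returns a suspicion score in [0, 1].

    A toy heuristic for illustration only; the actual monitor is a weaker
    trusted model reading both the code and its explanation.
    """
    red_flags = ["exec(", "eval(", "os.system"]
    hits = sum(flag in solution.code for flag in red_flags)
    # Illegible submissions (no explanation) are treated as more suspicious.
    penalty = 0.3 if not solution.explanation.strip() else 0.0
    return min(1.0, 0.25 * hits + penalty)

def trusted_monitoring(solution: Solution, audit_threshold: float = 0.5) -> str:
    """Accept the output, or escalate to a (limited) human audit budget."""
    score = monitor_suspicion(solution)
    return "audit" if score >= audit_threshold else "accept"
```

The key design choice a legibility protocol adds is that the untrusted model's output is only scored together with an explanation, so opaque or unexplained code is penalized rather than waved through.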