(Crossposted from Midwittgenstein) We’re heading toward a world where AIs increasingly deal with other AIs. Agents will negotiate with, report to, and oversee other agents in increasingly high-stakes real-world settings. Something I’ve been worried about is how well alignment training in models will generalise from human-agent interactions to agent-agent interactions....
Write-up of a recent research sprint looking at factors influencing strategic deception in models. TL;DR: I tested models in a controlled scenario where they could deceptively inflate self-reported performance to influence an upcoming budget decision in their favour. Varying the budget proposal around a baseline lets us measure (a) whether...
I recently did a weekend research sprint investigating error-detection behaviour in DeepSeek R1 models and writing up preliminary findings. I think I found some cool stuff on how the model tracks its mistakes, and wanted to share the rough write-up in case it's of interest to anyone. Summary: I find...
This work was produced as part of Lee Sharkey's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort. Intro and Motivation: Sparse dictionary learning (SDL) has attracted a lot of attention recently as a method for interpreting transformer activations. They demonstrate that model activations can often...
Background: This project was inspired by Anthropic’s post on attention head superposition, which constructed a toy model trained to learn a circuit to identify skip-trigrams that are OV-incoherent (attending from multiple destination tokens to a single source token) as a way to ensure that superposition would occur. Since the OV...
We summarize and compare several models and forecasts predicting when transformative AI will be developed. Highlights: * The review includes quantitative models, including both outside and inside view, and judgment-based forecasts by (teams of) experts. * While we do not necessarily endorse their conclusions, the inside-view model the Epoch team...
(Crossposted from MidWittgenstein (as part of my lame resolution to write more this year)) This is a post about what I see as a bit of a confusion in some Decision Theory arguments. It touches on FDT criticisms of CDT/EDT so I wanted to post here in case I've misunderstood...