HoldenKarnofsky

Responsible Scaling Policy v3

All views are my own, not Anthropic’s. This post assumes Anthropic’s announcement of RSP v3.0 as background. Today, Anthropic released its Responsible Scaling Policy 3.0. The official announcement discusses the high-level thinking behind it. This is a more detailed post giving my own takes on the update. First, the big...

Feb 24 · 179

Sabotage Evaluations for Frontier Models

by David Duvenaud, Joe Benton, Sam Bowman, evhub, mishajw, Eric Christiansen, HoldenKarnofsky, Ethan Perez, and Buck

This is a linkpost for a new research paper from the Alignment Evaluations team at Anthropic and other researchers, introducing a new suite of evaluations of models' abilities to undermine measurement, oversight, and decision-making. Paper link. Abstract: Sufficiently capable models could subvert human oversight and decision-making in important contexts....

Oct 18, 2024 · 95

Case studies on social-welfare-based standards in various industries

Last year, I posted a call for case studies on social-welfare-based standards for companies and products (including standards imposed by regulation). The goal was to gain general context on standards to inform work on possible standards and/or regulation for AI. This resulted in several dozen case studies that I found...

Jun 20, 2024 · 42

Good job opportunities for helping with the most important century

Yes, this is my first post in almost a year. I’m no longer prioritizing this blog, but I will still occasionally post something. I wrote ~2 years ago that it was hard to point to concrete opportunities to help the most important century go well. That’s changing. There are a...

Jan 18, 2024 · 36

We're Not Ready: thoughts on "pausing" and responsible scaling policies

Views are my own, not Open Philanthropy’s. I am married to the President of Anthropic and have a financial interest in both Anthropic and OpenAI via my spouse. Over the last few months, I’ve spent a lot of my time trying to help out with efforts to get responsible scaling...

Oct 27, 2023 · 199

3 levels of threat obfuscation

One of the biggest reasons alignment might be hard is what I’ll call threat obfuscation: various dynamics that might make it hard to measure/notice cases where an AI system has problematic misalignment (even when the AI system is in a controlled environment and one is looking for signs of misalignment)....

Aug 2, 2023 · 71

A Playbook for AI Risk Reduction (focused on misaligned AI)

I sometimes hear people asking: “What is the plan for avoiding a catastrophe from misaligned AI?” This post gives my working answer to that question - sort of. Rather than a plan, I tend to think of a playbook. A plan connotes something like: “By default, we ~definitely fail....

Jun 6, 2023 · 90

Thoughts on the Singularity Institute (SI)

Discussion with Nate Soares on a key alignment difficulty