Zach Stein-Perlman

AI strategy & governance. ailabwatch.org. Looking for new projects.

Sequences

Slowing AI

Wiki Contributions

Comments

Here's the letter: https://s3.documentcloud.org/documents/25003075/sia-sb-1047-anthropic.pdf

I'm not super familiar with SB 1047, but one safety person who is familiar with it thinks the letter is fine.

Your theory of change seems pretty indirect: even if you do this project very successfully, improving safety still mostly requires someone to read your writeup and then do interventions accordingly. (Except insofar as your goal is just to inform AI safety people about various dynamics.)


There's classic advice like: find the target audience for your research and talk to them regularly so you know what's helpful to them. For an exploratory project like this, maybe you don't really have a target audience. So... at least write down theories of change, keep them in mind, and notice how various lower-level directions relate to them.

Yeah, any relevant notion of conceivability is surely independent of particular minds.

No, it's like the irrationality of pi or the Riemann hypothesis: not super obvious and we can make progress by thinking about it and making arguments.

Surely if any category is above the "high" threshold then the model is in "high zone," and if all categories are below the "high" threshold then it's in "medium zone."

And regardless, the reading you describe here seems inconsistent with:

We won’t release a new model if it crosses a “Medium” risk threshold from our Preparedness Framework, until we implement sufficient safety interventions to bring the post-mitigation score back to “Medium”.

[edited]

I think you're confusing medium-threshold with medium-zone (the zone from medium-threshold to just-below-high-threshold). Maybe OpenAI made this mistake too — it's the most plausible honest explanation. (They should really do better.) (I doubt they intentionally lied, because it's low-upside and so easy to catch, but the mistake is weird.)

Based on the PF, they can deploy a model just below the "high" threshold without mitigations. Based on the tweet and blogpost:

We won’t release a new model if it crosses a “medium” risk threshold until we implement sufficient safety interventions.

This just seems clearly inconsistent with the PF (it should say the model crosses out of medium zone by crossing a "high" threshold).

We won’t release a new model if it crosses a “Medium” risk threshold from our Preparedness Framework, until we implement sufficient safety interventions to bring the post-mitigation score back to “Medium”.

This doesn't make sense: if you cross a "medium" threshold you merely enter medium-zone, so there's nothing to bring "back" to "Medium." Per the PF, the mitigations just need to bring you out of high-zone and down to medium-zone.

(Sidenote: the tweet and blogpost incorrectly suggest that the "medium" thresholds matter for anything; based on the PF, only the "high" and "critical" thresholds matter (i.e., there are three ways to treat models: below "high," between "high" and "critical," or above "critical").)
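
To make the threshold-vs-zone distinction concrete, here's a toy sketch in Python of my reading of the zone logic above (my own illustration; the category names and level ordering are assumptions, not OpenAI's actual framework text or tooling):

    # Toy illustration of the zone logic described above; not OpenAI's actual PF.
    LEVELS = ["low", "medium", "high", "critical"]

    def zone(category_scores: dict[str, str]) -> str:
        """Classify a model by its worst pre-mitigation category score.
        Only the "high" and "critical" thresholds change how it's treated."""
        worst = max(category_scores.values(), key=LEVELS.index)
        if worst == "critical":
            return "above critical"  # most restrictive treatment
        if worst == "high":
            return "high zone"       # mitigations must bring the post-mitigation
                                     # score back down into medium zone
        return "below high"          # medium zone or lower: deployable without
                                     # mitigations, even just below "high"

    # Crossing a "medium" threshold alone changes nothing:
    assert zone({"cybersecurity": "medium", "persuasion": "low"}) == "below high"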

[edited repeatedly]

I think the argument is

  1. Zombies are conceivable.
  2. Whatever is conceivable is possible.
  3. Therefore zombies are possible.

I think you're objecting to 2. I think you're using a loose definition of "conceivable," meaning no contradiction obvious to the speaker; I agree that notion isn't the relevant one. The relevant notion of "conceivable" is not conceivability by a particular human but something more like conceivability by a super smart ideal reasoner who has thought about it for a long time and made all possible deductions.

Premise 1 doesn't just follow from some humans' intuitions: it needs argument.
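
For clarity, the form of the argument in standard modal notation (my rendering, writing C(p) for "p is ideally conceivable" and \Diamond p for "p is metaphysically possible") is:

    $C(z), \quad \forall p\,\bigl(C(p) \rightarrow \Diamond p\bigr) \;\vdash\; \Diamond z$

The inference itself is just universal instantiation plus modus ponens, so the dispute has to be about one of the premises.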

New OpenAI tweet "on how we’re prioritizing safety in our work." I'm annoyed.

We believe that frontier AI models can greatly benefit society. To help ensure our readiness, our Preparedness Framework helps evaluate and protect against the risks posed by increasingly powerful models. We won’t release a new model if it crosses a “medium” risk threshold until we implement sufficient safety interventions. https://openai.com/preparedness/

This seems false: per the Preparedness Framework, nothing happens when they cross their "medium" threshold; they meant to say "high." Presumably this is just a mistake, but it's a pretty important one, and they said the same false thing in a May blogpost (!). (Indeed, GPT-4o may have reached "medium" — they were supposed to say how it scored in each category, but they didn't, and instead said "GPT-4o does not score above Medium risk in any of these categories.")

(Reminder: the "high" thresholds sound quite scary; here's cybersecurity (not cherrypicked, it's the first they list): "Tool-augmented model can identify and develop proofs-of-concept for high-value exploits against hardened targets without human intervention, potentially involving novel exploitation techniques, OR provided with a detailed strategy, the model can end-to-end execute cyber operations involving the above tasks without human intervention." They can deploy models just below the "high" threshold with no mitigations. (Not to mention the other issues with the Preparedness Framework.))

We are developing levels to help us and stakeholders categorize and track AI progress. This is a work in progress and we'll share more soon.

Shrug. This isn't bad but it's not a priority and it's slightly annoying they don't mention more important things.

In May our Board of Directors launched a new Safety and Security committee to evaluate and further develop safety and security recommendations for OpenAI projects and operations. The committee includes leading cybersecurity expert, retired U.S. Army General Paul Nakasone. This review is underway and we’ll share more on the steps we’ll be taking after it concludes. https://openai.com/index/openai-board-forms-safety-and-security-committee/

I have epsilon confidence both in the board's ability to do this well if it wanted to (since, security aside, it doesn't include any AI safety experts) and in the board's inclination to exert much power if it should (given the history of the board and Altman).

Our whistleblower policy protects employees’ rights to make protected disclosures. We also believe rigorous debate about this technology is important and have made changes to our departure process to remove non-disparagement terms.

Not doing nondisparagement-clause-by-default is good. Beyond that, I'm skeptical, given past attempts to chill employee dissent (the nondisparagement thing, Altman telling the board's staff liaison not to talk to employees or tell him about those conversations, maybe the recent anti-whistleblowing news) and lies about that. (I don't know of great ways to rebuild trust; some mechanisms would work but are unrealistically ambitious.)

Safety has always been central to our work, from aligning model behavior to monitoring for abuse, and we’re investing even further as we develop more capable models.

https://openai.com/index/openai-safety-update/

This is from May. It's mostly not about x-risk, and the x-risk-relevant stuff is mostly non-substantive, except the part about the Preparedness Framework, which is crucially wrong.


I'm getting on a plane but maybe later today I'll mention stuff I wish OpenAI would say.

I'm confused by the word "prosecution" here. I'd assume violating your OpenAI contract is a civil thing, not a criminal thing.

Edit: I think the word "prosecution" should be "suit" in your sentence about the SEC's theory. And this makes the whistleblowers' assertion weirder.

Hmm. Part of the news is "Non-disparagement clauses that failed to exempt disclosures of securities violations to the SEC"; this is minor. Part of the news is "threatened employees with criminal prosecutions if they reported violations of law to federal authorities"; this seems major and sinister.
