Yeah, basically Davidad has not only a safety plan but a governance plan that actively aims to make this shift happen!

Thanks for writing that. I've been trying to taboo "goals" because it creates so much confusion, which this post tries to decrease. In line with this post, I think what matters is how difficult a task is to achieve, and what it takes to achieve it in terms of ability to overcome obstacles.

Because it's meaningless to talk about a "compromise" while dismissing one entire side of the people who disagree with you (but only one side!).

Like, I could say "global compute thresholds are a robustly good compromise with everyone who disagrees with me".*

*Footnote: only those who are more pessimistic than me.

That may be right, but then the claim is wrong. The true claim would be "RSPs seem like a robustly good compromise with people who are more optimistic than me".

And then the claim becomes not really relevant?

Holden, thanks for this public post. 

  1. I would love it if you could write something along the lines of what you wrote in "If it were all up to me, the world would pause now - but it isn’t, and I’m more uncertain about whether a “partial pause” is good" at the top of ARC's post. As we discussed, and as I wrote in my post, that would make RSPs more likely to be positive in my opinion by making the distinction between policy and voluntary safety commitments clearer.


Responsible scaling policies (RSPs) seem like a robustly good compromise with people who have different views from mine

2. It seems empirically wrong given the strong pushback RSPs received, so at the very least you shouldn't call the compromise "robustly" good, unless you mean a modified version that would accommodate the most important parts of the pushback. 

3. I feel like the way you discuss RSPs here is one of many instances of people talking about idealized RSPs that are never specified, and pointing to them in response to disagreement. See below, from my post:

And second, the coexistence of ARC's RSP framework with labs' specific RSP implementations allows slack for weak commitments within a framework that would in theory allow ambitious ones. It leads to many arguments of the form:

  • “That’s the V1. We’ll raise ambition over time”. I’d like to see evidence of that happening over a 5-year timeframe, in any field or industry. I can think of fields, like aviation, where it happened over the course of decades, crash after crash. But if it relies on the expectation that there will be large-scale accidents, that should be made clear. If it relies on the assumption that timelines are long, that should be explicit. 
  • “It’s voluntary, we can’t expect too much, and it’s way better than what exists”. Sure, but if the level of catastrophic risk is 1% (which several AI risk experts I’ve talked to believe to be the case for ASL-3 systems) and the framework gives the impression that risks are covered, then the name “responsible scaling” heavily misleads policymakers. The adequate name for 1% catastrophic risk would be catastrophic scaling, which is less rosy.

Thanks for the post.

Would your concerns be mostly addressed if ARC had published a suggestion for a much more comprehensive risk management framework, and explicitly said "these are the principles that we want labs' risk-management proposals to conform to within a few years, but we encourage less-thorough risk management proposals before then, so that we can get some commitments on the table ASAP, and so that labs can iterate in public. And such less-thorough risk management proposals should prioritize covering x, y, z."?

Great question! A few points: 

  1. Yes, many of the things I point at are "how to do things well", and I would in fact much prefer something that contains a section saying "we are striving towards that and our current effort is insufficient" over the current RSP communication, which is more "here's how to responsibly scale". 
  2. That said, I think we disagree on the size of the effort (you say "a few years"). I think you could do a very solid MVP of what I suggest with something like 5 FTEs over 6 months. 
  3. As I wrote in "How to move forward" (worth skimming to understand what I'd change) I think that RSPs would be incredibly better if they: 
    1. had a different name
    2. said that they are insufficient
    3. linked to a post which says "here's the actual thing which is needed to make us safe". 
  4. Answer to your question: if I were optimizing within the paradigm of voluntary lab commitments, as ARC is, yes, I would much prefer that. I flagged early on, though, that because labs are definitely not allies on this (an actual risk assessment is likely to output "stop"), I think the "ask labs kindly" strategy is pretty doomed. I would much prefer a version of ARC that tries to acquire bargaining power one way or another (policy, PR threat, etc.) rather than adapting their framework until labs agree to sign it. 


If people took your proposal as a minimum bar for how thorough a risk management proposal would be, before publishing, it seems like that would interfere with labs being able to "post the work they are doing as they do it, so people can give feedback and input".

I don't think that's necessarily right. For example, "the ISO standard asks the organization to define risk thresholds" could be a very simple task, much simpler than developing a full eval. The tricky thing is just to ensure compliance with such thresholds (and the inability to do that obviously reveals a lack of safety). 

"ISO proposes a much more comprehensive procedure than RSPs": it's also not right that this would take longer. There exist risk management tools, which you can run in a few days, that help achieve very broad coverage of the scenario set.

"imply significant chances to be stolen by Russia or China (...). What are the risks downstream of that?": once again, you can cover the most obvious things in a couple of pages, e.g. writing "Maybe they would give the weights to their team of hackers, which substantially increases the chances of a leak and of escalating global cyber offense". And I would be totally fine with half-baked things if they were communicated as such, and not the way RSPs are.

Two related questions: 

  1. What happens in your plan if it takes five years to solve the safety evaluation/deception problem for LLMs (i.e. it's extremely hard)?
  2. Do you have an estimate of P({China; Russia; Iran; North Korea} steals an ASL-3 system with ASL-3 security measures)? Conditional on one of these countries having the system, what's your guess of p(catastrophe)?

Thanks Eli for the comment. 

One reason why I haven't provided much evidence is that I think it's substantially harder to give evidence for a "for all" claim (my side) than for a "there exists" claim (what I'm asking of Evan). Based on what I've seen, I claim that frameworks in niche areas don't evolve that fast without accidents, even in domains with substantial updates, like aviation and nuclear.

I could potentially see it happening with large accidents, but I personally don't want to bet on that, and I would want the assumption to be transparent if that's what it relies on. I also don't buy the "small coordinations enable larger coordinations" argument for domain-specific policy. Beyond what you said above, my sense is that policymakers satisfice, and hence tend not to revisit a policy that sucks as long as it looks sufficiently good to stakeholders that there's no substantial incentive to change it.

GDPR cookie banners suck for everyone and haven't been updated yet, 7 years after GDPR. Standards in the EU are not updated more often than every 5 years by default (I'm talking about standards, not regulation), and we'll have to bargain to bring that down to reasonable, AI-specific timeframes. 

IAEA standards and nuclear safety were upgraded substantially after each accident, and likewise for aviation, but we're talking about decades, not 5 years.
