This seems great in principle.
The below is meant in the spirit of [please consider these things while moving forward with this], and not [please don't move forward until you have good answers on everything].
That said:
First, I think it's important to clearly distinguish:
This program would be doing (3), so it's important to be aware that (1) is not in itself much of an argument. I expect that it's very hard to do (3) well, and that even a perfect version doesn't allow us to jump to the (1) of our dreams. But I still think it's a good idea!
Some thoughts that might be worth considering (very incomplete, I'm sure):
That's my guess too, but I'm not highly confident in the [no attractors between those two] part.
It seems conceivable to have a not-quite-perfect alignment solution with a not-quite-perfect self-correction mechanism that ends up orbiting utopia, but neither getting there, nor being flung off into oblivion.
It's not obvious that this is an unstable, knife-edge configuration. It seems possible to have correction/improvement be easier at a greater distance from utopia (whether that correction/improvement is triggered by our own agency or by other systems).
If stable orbits exist, it's not obvious that they'd be configurations we'd endorse (or that the things we'd become would endorse them).
Anyway, overall I'd be surprised if it doesn't help substantially to have more granular estimates.
Oh, I'm certainly not claiming that no-one should attempt to make the estimates.
I'm claiming that, conditional on such estimation teams being enshrined in official regulation, I'd expect their results to get misused. Therefore, I'd rather that we didn't have official regulation set up this way.
The kind of risk assessment I'd advocate would be based on the overall risk of a lab's policy, rather than its immediate actions. I'd want regulators to push for safer strategies, not to run checks on unsafe strategies - at best that seems likely to get us to a local minimum (and, as ever, overconfidence).
More [evaluate the plan to get through the minefield], and less [estimate whether we'll get blown up on the next step]. (importantly, it won't always be necessary to know which particular step forward is more/less likely to be catastrophic, in order to argue that an overall plan is bad)
It's important not to ignore that this speech is to the general public.
While I agree that "in the most unlikely but extreme cases" is not accurate, it's not clear that this reflects the views of the PM / government, rather than what they think it's expedient to say.
Even if the PM took the risk fully seriously and put doom at 60%, I don't think he'd say that in a speech.
The speech is consistent with [not quite getting it yet], but also consistent with [getting it, but not thinking it's helpful to say it in a public speech]. I'm glad Eliezer's out there saying the unvarnished truth - but it's less clear that this would be helpful from the prime minister.
It's worth considering the current political situation: the Conservatives are very likely to lose the next election (which must be held no later than January 2025 - but elections often happen earlier [calling one early lets the governing party pick their moment, keep the element of surprise, and look like the election was a positive choice]).
Being fully clear about the threat in public could be perceived as political desperation. So far, the issue hasn't been politicized; if not coming out with the brutal truth helps keep it that way, that's likely a price worth paying. In particular, it doesn't help if the UK government commits to things that Labour will scrap as soon as they get in.
Perhaps more importantly from his point of view, he'll need support from within his own party over the next year - if he's seen as sabotaging the Conservatives' chances in the next election by saying anything too weird / alarmist-seeming / not-playing-to-their-base, he may lose that.
Again, it's also consistent with not quite getting it, but that's far from the only explanation.
We could do a lot worse than Rishi Sunak followed by Keir Starmer.
Relative to most plausible counterfactuals, we seem to have gotten very lucky here.
Thanks for clarifying your views. I think it's important.
...build consensus around conditional pauses...
My issue with this is that it's empty unless the conditions commit labs to taking actions they otherwise wouldn't. Anthropic's RSP isn't terrible, but I think a reasonable summary is "Anthropic will plan ahead a bit, take the precautions they think make sense, and pause when they think it's a good idea".
It's a commitment to take some actions that aren't pausing - defining ASL4 measures, implementing ASL3 measures that they know are possible. That's nice as far as it goes. However, there's nothing yet in there that commits them to pause when they don't think it's a good idea.
They could have included such conditions, even if they weren't concrete and wouldn't come into play until ASL4 (e.g. requiring that particular specifications or evals be approved by an external board before they could move forward). That would have signaled something. They chose not to.
That might be perfectly reasonable, given that it's unilateral. But if (even) Anthropic aren't going to commit to anything with a realistic chance of requiring a lengthy pause, that doesn't say much for RSPs as conditional pause mechanisms.
The transparency probably does help to a degree. I can imagine situations where greater clarity in labs' future actions might help a little with coordination, even if they're only doing what they'd do without the commitment.
Actively fighting improvements on the status quo because they might be confused for sufficient progress feels icky to me in a way that’s hard to articulate.
This seems a reasonable criticism only if it's a question of [improvement with downside] vs [status quo]. I don't think the RSP critics around here are suggesting that we throw out RSPs in favor of the status quo, but that we do something different.
It may be important to solve x, but it's also important that we don't prematurely believe we've solved x. This applies to technical alignment, and to alignment regulation.
Things being "confused for sufficient progress" isn't a small problem: this is precisely what makes misalignment an x-risk.
Initially, communication around RSPs was doing a bad job of making their insufficiency clear.
Evan's, Paul's and your posts are welcome clarifications - but such clarifications should be in the RSPs too (not as vague, easy-enough-to-miss caveats).
That's reasonable, but most of my worry comes back to:
In part, I'm worried that the argument for (1) is too simple - so that a forecasting team might put almost all the emphasis elsewhere, producing a 30-page report with 29 essentially irrelevant pages. Then it might be hard to justify coming to the same conclusion once the issues on 29 out of 30 pages are fixed.
I'd prefer to stick to the core argument: a powerful model and unknown unknowns are sufficient to create too much risk. The end. We stop until we fix that.
The only case I can see against this is [there's a version of using AI assistants for alignment work that reduces overall risk]. Here I'd like to see a more plausible positive case than has been made so far. The current case seems to rely on wishful thinking (it's more specific than the one-sentence version, but still sketchy, and relies a lot on [we hope this bit works, and this bit too...]).
However, I don't think Eliezer's critique is sufficient to discount approaches of this form, since he tends to focus on the naive [just ask for a full alignment solution] versions, which are a bit strawmannish. I still think he's likely to be essentially correct - that to the extent we want AI assistants to be providing key insights that push research in the right direction, such assistants will be too dangerous; to the extent that they can't do this, we'll be accelerating a vehicle that can't navigate.
[EDIT: oh and of course there's the [if we really suck at navigation, then it's not clear a 20-year pause gives us hugely better odds anyway] argument; but I think there's a decent case that improving our ability to navigate might be hard to accelerate with AI assistants, so that a 5x research speedup does not end up equivalent to having 5x more time]
But this seems to be the only reasonable crux. This aside, we don't need complex analyses.
it relies on evals that we do not have
I agree that this is a problem, but it strikes me that we wouldn't necessarily need a concrete eval - i.e. we wouldn't need [by applying this concrete evaluation process to a model, we can be sure we understand it sufficiently].
We could have [here is a precise description of what we mean by "understanding a model", such that we could, in principle, create an evaluation process that answers this question].
We can then say in an RSP that certain types of model must pass an understanding-in-this-sense eval, even before we know how to write an understanding-in-this-sense eval. (though it's not obvious to me that defining the right question isn't already most of the work)
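As a loose illustration of this spec-before-implementation idea (a minimal sketch with hypothetical names of my own, not anything from an actual RSP), the commitment could bind on an interface that states the question, with concrete evals supplied later:

```python
from abc import ABC, abstractmethod

# Hypothetical sketch - class and method names are illustrative, not from any RSP.
# The RSP would commit to the *question* an understanding-eval must answer,
# even before any concrete way of answering it exists.
class UnderstandingEval(ABC):
    """Precise statement of what 'understanding a model' means for RSP purposes."""

    @abstractmethod
    def passes(self, model) -> bool:
        """Return True only if the model meets the stated understanding criteria."""
        ...

# Concrete evals (interpretability tooling, behavioural probes, ...) would subclass
# this later; the commitment binds on the spec, not on any particular implementation.
```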
Personally, I'd prefer that this were done already - i.e. that anything we think is necessary should be in the RSP at some level of abstraction / indirection. That might mean describing properties an eval would need to satisfy. It might mean describing processes by which evals could be approved - e.g. deferring to an external board. [Anthropic's Long Term Benefit Trust doesn't seem great for this, since it's essentially just Paul who'd have relevant expertise (?? I'm not sure about this - it's just unclear that any of the others would)]
I do think it's reasonable for labs to say that they wouldn't do this kind of thing unilaterally - but I would want them to push for a more comprehensive setup when it comes to policy.
Oh I didn't mean only to do it afterwards. I think doing it beforehand is definitely required to know whether the experiment is worth doing with a given setup/people. Doing it afterwards is nice-to-have for Science. (even a few blitz games are better than nothing)
Oh that's cool - nice that someone's run the numbers on this.
I'm actually surprised by quite how close to 50% both backgammon and poker are.
I don't think it makes sense to classify every instance of this as deceptive alignment - and I don't think this is the usual use of the term.
I think that to say "this is deceptive alignment" is generally to say something like "there's a sense in which this system has a goal different from ours, is modeling the selection pressure it's under, anticipating that this selection pressure may not exist in the future, and adapting its behaviour accordingly".
That still leaves things underdefined, e.g. since this can all happen implicitly and/or without the system knowing this mechanism exists.
However, if you're not suggesting in any sense that [anticipation of potential future removal of selection pressure] is a big factor, then it's strange to call it deceptive alignment.
I assume Wiblin means it in this sense - not that this is the chance we get catastrophically bad generalization, but rather that it happens via a mechanism he'd characterize this way.
[I'm now less clear that this is generally agreed, since e.g. Apollo seem to be using a foolish-to-my-mind definition here: "When an AI has Misaligned goals and uses Strategic Deception to achieve them" (see "Appendix C - Alternative definitions we considered" for clarification).
This is not close to the RFLO definition, so I really wish they wouldn't use the same name. Things are confusing enough without our help.]
All that said, it's not clear to me that [deceptive alignment] is a helpful term or target, given that there isn't a crisp boundary, and that there'll be a tendency to tackle an artificially narrow version of the problem.
The rationale for solving it usually seems to be [if we can solve/avoid this subproblem, we'd have instrumentally useful guarantees in solving the more general generalization problem] - but I haven't seen a good case made that we get the kind of guarantees we'd need (e.g. knowing only that we avoid explicit/intentional/strategic... deception of the oversight process is not enough).
It's easy to motte-and-bailey ourselves into trouble.