I talked to someone at Mechanize who thought safety would be net unaffected by timelines. I knew this guy to be operating in good faith, and his point is a meaningful caveat to the model that the safety/capabilities investment ratio is what matters. The argument was:
However, even if 1-3 are true, I claim 4 doesn't follow.
4 only holds if the time lag between deployment and safety fixes decreases in lockstep with the increased investment. This seems false: most ways capabilities would speed up involve more parallel labor/compute or faster algorithmic progress, not faster feedback loops from the real world of the kind that would transfer to safety.
By analogy, suppose the speed of cars had jumped from 20mph to 100mph in a single year in 1910. Even if the same technology sped up the invention of airbags, ABS, etc. to 2010 levels, more people would die in car crashes, because those features couldn't be installed until the next model year [1]. The alternative, proactively doing real-world testing and delaying the release of the super-fast car until such safety features could be invented and integrated, would be a huge competitive disadvantage. Likewise, if cyberattacks/power-seeking/etc. are solvable but require real-world data, and fixes only make it into the next model release, then immediately getting a 2035-era superintelligence plus 10 years of safety research will result in way more cyberattacks and power-seeking.
[1] Retooling Ford factories was actually super expensive; I read somewhere that switching models required shutting down factories for six months.
We can use the number of mistakes to get a very noisy estimate of Claude 4.5 Sonnet's coffee time horizon. By my count, Claude made three unrecoverable mistakes that required human assistance:
Now, this was a "try until success" task rather than a success/failure task. But if we try to apply the same standards as the METR benchmark, the task needs to be economically valuable (so it includes adding milk/sugar), and any mistake that would make automation non-viable should count as a failure. I think any robot butler that typically made one of these mistakes would be unemployable.
I'd guess an experienced human would take about 7 minutes to make coffee in an unfamiliar house if they get the milk and sugar ready while the kettle is boiling, so we get a rate of 1 failure every 2.3 human-minutes, which puts the 50% success point at around ln(2) * 2.3 ≈ 1.6 minutes. Of course, this is just one task, but we already know its coffee time horizon isn't something like 20 minutes-- the probability of three events from a Poisson process with rate ln(2) / 20 is only 0.2%. Claude says the 95% confidence interval is (33 seconds, 8 minutes).
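Here's a minimal sketch of that arithmetic (the 7-minute task time and three failures are my assumptions from above; the exact Poisson interval reproduces Claude's CI):

```python
from math import log
from scipy import stats

failures = 3        # unrecoverable mistakes needing human assistance
task_minutes = 7    # guessed human time to make coffee in an unfamiliar house

# MLE failure rate, and the horizon where success probability is 50%:
# P(no failures in t minutes) = exp(-rate * t) = 0.5  =>  t = ln(2) / rate
rate = failures / task_minutes        # ~0.43 failures per human-minute
print(log(2) / rate)                  # ~1.6 minutes

# If the true horizon were 20 minutes (rate = ln(2)/20), how likely are
# three or more failures during a 7-minute task?
lam = log(2) / 20 * task_minutes
print(1 - stats.poisson.cdf(2, lam))  # ~0.002, i.e. 0.2%

# Exact (Garwood) 95% CI for a Poisson rate given 3 observed events,
# converted to a horizon: roughly (33 seconds, 7.8 minutes).
rate_lo = stats.chi2.ppf(0.025, 2 * failures) / (2 * task_minutes)
rate_hi = stats.chi2.ppf(0.975, 2 * failures + 2) / (2 * task_minutes)
print(log(2) / rate_hi, log(2) / rate_lo)
```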
This is below trend for RLBench, though the data is extremely bad. If I speculate anyway, maybe real-world tasks like coffee are harder than RLBench or OSWorld-- making coffee certainly requires much more planning than 5-20 second simulated robotics tasks. Or maybe Claude just hasn't been trained for the real world.
METR could probably use a methodology like this if we had more long tasks and labeling were free, so maybe it's worth looking into approaches that automate parts of it, like having smarter agents unblock dumber agents.
Tagging @Megan Kinniment who has also thought about recoverable and unrecoverable failures
I'm spending about 1/4 of my time thinking about how to best get data on this and predict whether we're heading for a software intelligence explosion. For now, one thought is that the inference scaling curve is more likely to be a power law, because a power law is scale-free and consistent with a world where AIs are prone to get stuck on harder tasks, but get stuck less and less as their capability increases.
My current guess is still something like the independent-steps model which has a power law.
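To illustrate what scale-free buys you, here's a toy sketch (alpha and beta are arbitrary illustration constants, not fitted to anything):

```python
import math

# Failure probability as a function of inference compute c, two candidate shapes.
def power_law_failure(c, alpha=0.5):
    return c ** -alpha

def exponential_failure(c, beta=0.1):
    return math.exp(-beta * c)

# Scale-free: for the power law, doubling compute always multiplies failure
# probability by the same factor, no matter where you start. The exponential's
# doubling factor depends on the starting point.
for c in [1, 10, 100]:
    print(power_law_failure(2 * c) / power_law_failure(c),      # always 2**-0.5 ~ 0.71
          exponential_failure(2 * c) / exponential_failure(c))  # 0.90, 0.37, 4.5e-5
```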
If you want it to be the default, LW should enable it by default with a checkbox for "Hide score".
I guess I should review this post, given that I noticed the unit conversion error in the original. How did I do that? It was really nothing special: OP explicitly said they were confused about what the strange unit "ppm*hr" meant, so I thought about what it could mean, cross-referenced, and it turned out the implied concentration was lower than expected. Clear writing was super important here, the skill of tracking orders of magnitude and units will be familiar to anyone who does Fermi estimates regularly, and it probably helped too that I read OP's own epistemic spot check blog posts as a baby rationalist.
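For the unit itself, a minimal sketch of the cross-check with made-up numbers (none of these are from the original post):

```python
# ppm*hr is a dose: concentration multiplied by exposure time. Dividing a
# quoted dose by an assumed duration gives the implied concentration.
dose_ppm_hr = 120.0     # hypothetical spec quoted in ppm*hr
exposure_hours = 8.0    # hypothetical exposure duration
implied_ppm = dose_ppm_hr / exposure_hours
print(implied_ppm)      # 15 ppm: compare against the concentration you expected
```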
This is one of the best April Fools' jokes ever on this platform. It's well executed, still extremely funny, and illustrates the folly (from the alignment community's perspective, anyway) of doing capabilities research without really thinking about whether your safety plan makes sense. The only way it could be better is if it had started a conversation in the media or generated broad agreement or something, which it doesn't appear to have (e.g. Matthew Barnett doesn't agree). But this is a super high bar, so I still think it deserves 4.
This gets 9 points from me. I think it's the first I had heard of the Jones Act, and the post's anti-Jones-Act stance is one that I am proud to still hold. It's so distortionary that shipping between US ports costs more than twice as much as equivalent international shipping, for very dubious strategic benefit. Imagine if the law instead required that 50% of the volume of all ships between US ports be filled with rubber ducks. The Jones Act is actually WORSE than this in many respects, because not only does it more than double the price, it removes flexibility from supply chains and surge capacity during disasters.
The post also touches on how special interest groups control American politics beyond just "big oil" etc., the many ways a market economy should make its citizens' lives better and how many of them run through shipping, and the failure of American shipbuilding [1]. It predates Abundance (published four months later, in March 2025) and is certainly an abundance idea.
As for downsides, it's somewhat long-winded and I'm a bit skeptical that repeal is actually feasible (some of the commenters point out the large number of people who would actually need to be compensated, and I don't think a government at our current competence level could do this).
[1] This last topic is getting more relevant, as the US Navy recently canceled the Constellation program, which marks its third straight failed frigate program.
I'm giving this -4 points because it seems anti-helpful on net, but not all bad.
Unfortunately, the available benchmark tasks do not allow for 99%+ reliability measurements. Because we don't have 1,000 different one-minute tasks, the best we could do would be something like checking whether GPT-5.1 can do all 40 tasks 25 times each with perfect reliability. Most likely it would succeed at all of them, because we just don't have a task that happens to trip it up.
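To make that concrete: even a perfect record only gives a lower confidence bound on reliability, and the bound is driven by the number of independent tasks, not the total run count (quick sketch below):

```python
# One-sided 95% lower bound on success probability given zero failures in
# n independent trials: p_low = 0.05 ** (1 / n), roughly 1 - 3/n
# (the "rule of three"). Since 25 runs of the same task are highly
# correlated, the effective n is closer to 40 than 1,000.
for n in [40, 40 * 25]:
    print(n, 0.05 ** (1 / n))  # 40 -> ~0.928, 1000 -> ~0.997
```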
As for humans' 99.9%, at a granular enough level it would be 0.2 seconds (typing one keystroke), because few people type with better than 99.9% accuracy. But in the context of a larger task, we can correct our typos, so it isn't super relevant.