How would you answer this without looking at the csv?
I wrote a post on my prior over Bernoulli distributions, called "Rethinking Laplace's Law of Succession". Laplace's Law of Succession is based on a uniform prior over [0,1], whereas my prior is based on the following mixture distribution:
w1 * logistic-normal(0, sigma^2) + w2 * 0.5 * (dirac(0) + dirac(1)) + w3 * thomae_100(alpha) + w4 * uniform(0, 1)
where:
- The first term captures logistic transformations of normal variables (weight w1), resolving the issue that probabilities should be spread across log-odds
- The second term captures deterministic programs (weight w2), allowing for exactly zero and one
- The third term captures rational probabilities with simple fractions (weight w3), giving weight to simple ratios
- The fourth term captures a uniform distribution over [0,1] (weight w4), corresponding to Laplace's original prior
The default parameters (w1=0.3, w2=0.1, w3=0.3, w4=0.3, sigma=5, alpha=2) reflect my intuition about the relative frequency of these different types of programs in practice.
Using this prior, we get the result [0.106, 0.348, 0.500, 0.652, 0.894]
The numbers are predictions for P(5th trial = R | k Rs observed in first 4 trials), for k = 0 through 4.
The corresponding five numbers under Laplace's Rule of Succession are [0.167, 0.333, 0.500, 0.667, 0.833], but I think this is too conservative because it underestimates the likelihood of near-deterministic processes.
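For concreteness, here is a rough sketch of the computation: the posterior predictive is E[p^(k+1) (1-p)^(n-k)] / E[p^k (1-p)^(n-k)] under the mixture prior, with the point masses handled exactly and the continuous components integrated numerically. The thomae_100(alpha) component below is implemented as weight proportional to 1/b^alpha over reduced fractions a/b with b ≤ 100, which may differ in detail from the construction in the post, so treat the outputs as approximate.

```python
import numpy as np
from math import gcd

# Mixture-prior parameters (defaults from the post)
w1, w2, w3, w4, sigma, alpha = 0.3, 0.1, 0.3, 0.3, 5.0, 2.0

# Discrete atoms: deterministic programs at p = 0 and p = 1 ...
atoms = [(0.0, w2 * 0.5), (1.0, w2 * 0.5)]
# ... plus simple rationals a/b with b <= 100, weighted by 1/b^alpha
# (an assumed form of the thomae_100(alpha) component).
thomae = [(a / b, b ** -alpha) for b in range(1, 101)
          for a in range(b + 1) if gcd(a, b) == 1]
z = sum(wt for _, wt in thomae)
atoms += [(p, w3 * wt / z) for p, wt in thomae]

# Continuous part on a grid over (0, 1): logistic-normal plus uniform densities
grid = np.linspace(1e-6, 1 - 1e-6, 200_001)
logit = np.log(grid / (1 - grid))
logistic_normal = np.exp(-logit ** 2 / (2 * sigma ** 2)) / (
    np.sqrt(2 * np.pi) * sigma * grid * (1 - grid))
density = w1 * logistic_normal + w4 * np.ones_like(grid)

def predict_next(k, n=4):
    """P(next trial = R | k Rs in n trials) = E[p^(k+1)(1-p)^(n-k)] / E[p^k(1-p)^(n-k)]."""
    def expect(f):
        return sum(wt * f(p) for p, wt in atoms) + np.trapz(density * f(grid), grid)
    return expect(lambda p: p ** (k + 1) * (1 - p) ** (n - k)) / \
           expect(lambda p: p ** k * (1 - p) ** (n - k))

print([round(predict_next(k), 3) for k in range(5)])   # should land near [0.106, ...]
print([round((k + 1) / 6, 3) for k in range(5)])       # Laplace: (k+1)/(n+2) with n = 4
```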
To clarify what I think is Ryan's point:
Key context: I think that if scheming is caught then it'll be removed quickly through (a) halting deployment, (b) training against it, or (c) demonstrating to the AIs that we caught them, making scheming unattractive. Hence, I think that scheming arises naturally at roughly the capability level where AIs are able to scheme successfully.
Pessimism levels about lab: I use Ryan Greenblatt's taxonomy of lab carefulness. Plan A involves 10 years of lead time with international coordination and heavy safety investment. Plan B involves 1-3 years of lead with US government buy-in. Plan C involves 2-9 months of lead where the leading lab spends its lead on safety. Plan D has 10-30 people inside who care about safety but no broader organizational support. Plan E has essentially no one inside taking safety seriously. My probabilities are somewhat more optimistic than Greenblatt's: Plans A-B (15%), Plan C (25%), Plan D (45%), Plan E (15%).
| | Current | GPT-6 | Professional | Best human | Superhuman |
|---|---|---|---|---|---|
| Additional OOMs | 0 OOMs | +6 OOMs | +12 OOMs | +18 OOMs | +24 OOMs |
| Effective compute | 10^27 | 10^33 | 10^39 | 10^45 | 10^51 |
| Years (current scaling) | 2025 | 2029 | 2033 | 2037 | 2041 |
| IQ equivalent | 50 | 150 | 250 | 350 | |
| Rank among humans | 10 billionth | 100,000th | 1st | | |
The conversions here are super-rough point-estimates.[1]
To estimate the capability of the earliest schemers, we must estimate the probability that scheming emerges at a particular capability conditional on (i) the labs having a given plan carefulness, and (ii) no scheming arising at lower levels.
Below is a table of P(scheming emerges at this level | plan & no scheming at lower levels). For example, if a D-careful lab deploys Professional non-schemers to help build Best Human-level AIs, then they have a 30% chance of building Best Human schemers. These probabilities are based on vibes, but I give some explanation below.
| | Current | GPT-6 | Professional | Best Human | Superhuman |
|---|---|---|---|---|---|
| Plans A-C | 0% | 2% | 10% | 10% | 10% |
| Plan D | 0% | 15% | 25% | 30% | 12% |
| Plan E | 0% | 18% | 50% | 70% | 12% |
NB: I think that P(scheming emerges at this level | plan) will be higher, because this conditional includes scenarios in which earlier AIs were scheming.[2]
GPT-6 and Professional: The optimistic scenarios (A-C) have a lower chance of yielding weak schemers because:
By contrast, the pessimistic scenarios (D-E) have a high chance of yielding weak schemers because:
Best Human: If the lab deploys weak non-schemers, then optimistic situations have a lower chance of yielding strong schemers because:
If the lab deploys weak non-schemers, then pessimistic situations have a higher chance of yielding strong schemers because:
Superhuman: If the lab deploys strong non-schemers, then both optimistic and pessimistic situations have a low chance of yielding even stronger schemers. In pessimistic scenarios (D-E), the strong non-schemers can act in the interests of the lab, even if the lab leadership are miscalibrated about risks.
Without conditioning on scheming arising: The probability scheming emerges naturally is 56%. If it does emerge, the most likely capability level is Professional (20%), followed closely by Best Human (19%), then GPT-6 (10%) and Superhuman (6%). There's a 44% chance scheming never emerges naturally.
Conditional on scheming arising naturally: The distribution shifts substantially toward weaker schemers. Professional becomes most likely (37%), with Best Human close behind (35%), followed by GPT-6 (18%) and Superhuman (10%). This shift occurs because conditioning on scheming arising updates us toward pessimistic lab scenarios (Plans D and E shift from 60% to 73% combined probability), which have higher rates of weak AI scheming due to inadequate safety measures.
Here are some assumptions, credit to others:
Across all plans, P(scheming emerges at this level | plan) exceeds P(scheming emerges at this level | plan & no scheming at lower levels) because the former conditional includes scenarios where we have deployed earlier schemers. Deploying earlier schemers increases the chance that later generations are schemers, both evidentially (because the same mechanisms might apply to both early and late schemers) and causally (the early schemers could undermine our training and deployment of later AIs).
This effect is greater for the pessimistic scenarios (e.g. D/E) because we haven't managed to avoid research sabotage, whereas the optimistic scenarios (e.g. A-C) have a better chance of recovering from early undermining, and so P(scheming emerges at this level | plan) might be closer to P(scheming emerges at this level | plan & no scheming at lower levels).
In Plan D, the safety-focused humans are outnumbered by the careless humans. However, as AI begins to automate cognitive labour, the gap in effectiveness between these two factions should become less stark, as both factions approach the diminishing returns of additional cognitive labour.
To illustrate, the gap in effectiveness between (10 humans) and (990 humans) is greater than the gap between (10 humans and 1M AIs) and (990 humans and 990M AIs); see the toy calculation below.
Hence, the D-scenario moves closer to the C-scenario and away from the E-scenario.
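One toy way to see this, assuming research output scales roughly logarithmically with total cognitive labour (an illustrative assumption, not a considered model):

```python
import math

# Toy illustration: output ~ log(total cognitive labour), an assumed
# concave returns function used purely for illustration.
def output(humans, ais=0):
    return math.log(humans + ais)

# Without AI: the larger faction's advantage in output
print(output(990) / output(10))                            # ~3.0x
# With AI automating cognitive labour, the advantage shrinks
print(output(990, 990_000_000) / output(10, 1_000_000))    # ~1.5x
```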
Yeah, E seems hard. Especially because from the outside E is indistinguishable from D, and many of the E strategies would be negatively polarising, and hence counterproductive under Plan D.
You can't really have a technical "Plan E" because there is approximately no one to implement the plan; in Plan E situations, the focus should be on moving to a higher level of political will and effort on mitigating risk.
There are no employees who could implement Plan E, but is there nothing that non-employees could do?
I don't think that doubling the incomes of people working on existing projects would be a good use of resources.
I think it's an honorable goal, but seems infeasible given the current landscape.
Although I think the critical period for safety evals is between training and internal deployment, not between training and external deployment. See Greenblatt's "Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements)".
That's what "in particular" means, i.e. the "the order of the trials is irrelevant" is a particular feature