Cleo Nardo

Comments, sorted by newest
Experiment: Test your priors on Bernoulli processes.
Cleo Nardo · 2d · 40

That's what "in particular" means, i.e. "the order of the trials is irrelevant" is a particular feature.

Reply
Experiment: Test your priors on Bernoulli processes.
Cleo Nardo · 2d* · 60

How would you answer this without looking at the csv?

I wrote a post on my prior over Bernoulli distributions, called "Rethinking Laplace's Rule of Succession". Laplace's Rule of Succession is based on a uniform prior over [0,1], whereas my prior is based on the following mixture distribution:

w1 * logistic-normal(0, σ²) + w2 * 0.5 * (dirac(0) + dirac(1)) + w3 * thomae₁₀₀(α) + w4 * uniform(0, 1)

where:

  • The first term captures logistic transformations of normal variables (weight w1), reflecting the intuition that probability mass should be spread out over log-odds
  • The second term captures deterministic programs (weight w2), allowing probabilities of exactly zero and one
  • The third term captures simple rational probabilities (weight w3), giving extra weight to simple fractions
  • The fourth term captures the uniform distribution on [0,1] (weight w4), corresponding to Laplace's original prior

The default parameters (w1=0.3, w2=0.1, w3=0.3, w4=0.3, sigma=5, alpha=2) reflect my intuition about the relative frequency of these different types of programs in practice.
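Here is a minimal sketch of how the posterior predictive can be computed under this mixture. The Thomae-component weighting (proportional to 1/b^α over reduced fractions a/b with b ≤ 100) is a guess at the parameterisation, so the output will only roughly match the numbers below.

```python
import numpy as np
from math import gcd
from scipy.special import beta as B

# Mixture weights and hyperparameters from above.
W1, W2, W3, W4 = 0.3, 0.1, 0.3, 0.3
SIGMA, ALPHA = 5.0, 2.0

def predictive(k, n=4):
    """P(next trial = R | k Rs observed in the first n trials)."""
    num = den = 0.0

    # 1. Logistic-normal component: p = sigmoid(x) with x ~ N(0, sigma^2);
    #    integrate numerically over x.
    x = np.linspace(-40.0, 40.0, 160001)
    dx = x[1] - x[0]
    gauss = np.exp(-x**2 / (2 * SIGMA**2)) / (SIGMA * np.sqrt(2 * np.pi))
    p = 1.0 / (1.0 + np.exp(-x))
    lik = p**k * (1 - p)**(n - k)
    num += W1 * np.sum(p * lik * gauss) * dx
    den += W1 * np.sum(lik * gauss) * dx

    # 2. Point masses at p = 0 and p = 1 (deterministic processes).
    num += W2 * 0.5 * (1.0 if k == n else 0.0)   # only p = 1 contributes to p * likelihood
    den += W2 * 0.5 * ((1.0 if k == n else 0.0) + (1.0 if k == 0 else 0.0))

    # 3. Thomae-style component over reduced fractions a/b with b <= 100,
    #    weighted proportionally to 1/b^alpha (guessed parameterisation).
    fracs = [(a / b, 1 / b**ALPHA)
             for b in range(2, 101) for a in range(1, b) if gcd(a, b) == 1]
    z = sum(w for _, w in fracs)
    for q, w in fracs:
        lik_q = q**k * (1 - q)**(n - k)
        num += W3 * (w / z) * q * lik_q
        den += W3 * (w / z) * lik_q

    # 4. Uniform(0, 1) component: Beta integrals, i.e. the Laplace's-rule term.
    num += W4 * B(k + 2, n - k + 1)
    den += W4 * B(k + 1, n - k + 1)

    return num / den

print([round(predictive(k), 3) for k in range(5)])
```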

Using this prior, we get the result [0.106, 0.348, 0.500, 0.652, 0.894]

The numbers are predictions for P(5th trial = R | k Rs observed in first 4 trials):

  • If you see 0 Rs in the first 4 trials (all Ls), there's a 10.6% chance the 5th is R
  • If you see 1 R in the first 4 trials, there's a 34.8% chance the 5th is R
  • If you see 2 Rs in the first 4 trials, there's a 50% chance the 5th is R
  • If you see 3 Rs in the first 4 trials, there's a 65.2% chance the 5th is R
  • If you see 4 Rs in the first 4 trials (all Rs), there's an 89.4% chance the 5th is R

The corresponding five numbers under Laplace's Rule of Succession (i.e. the uniform prior, giving (k+1)/(n+2)) are [0.167, 0.333, 0.500, 0.667, 0.833], but I think this is too conservative because it underestimates the likelihood of near-deterministic processes.

Reply
Plans A, B, C, and D for misalignment risk
Cleo Nardo · 4d · 20

To clarify what I think is Ryan's point:

  • In D-labs, both the safety faction and the non-safety faction are leveraging AI labour.
  • AI labour makes D-labs seem more like C-labs and less like E-labs, directionally.
  • This is because the effectiveness ratio between (10 humans) and (990 humans) is greater than the ratio between (10 humans and 1M AIs) and (990 humans and 990M AIs).
  • This is because of diminishing returns to cognitive labour: with AI labour, even the small safety faction can cover the cheap interventions.
Reply1
Shortform
Cleo Nardo · 6d* · 20

Conditional on scheming arising naturally, how capable will models be when they first emerge?

Key context: I think that if scheming is caught then it'll be removed quickly through (a) halting deployment, (b) training against it, or (c) demonstrating to the AIs that we caught them, making scheming unattractive. Hence, I think that scheming arises naturally at roughly the capability level where AIs are able to scheme successfully.

Pessimism levels about the labs: I use Ryan Greenblatt's taxonomy of lab carefulness. Plan A involves 10 years of lead time with international coordination and heavy safety investment. Plan B involves 1-3 years of lead with US government buy-in. Plan C involves 2-9 months of lead where the leading lab spends its lead on safety. Plan D has 10-30 people inside who care about safety but no broader organizational support. Plan E has essentially no one inside taking safety seriously. My probabilities are somewhat more optimistic than Greenblatt's: Plans A-B (15%), Plan C (25%), Plan D (45%), Plan E (15%).

Capability levels

                         Current   GPT-6         Professional   Best human   Superhuman
Additional OOMs          0 OOMs    +6 OOMs       +12 OOMs       +18 OOMs     +24 OOMs
Effective compute        10^27     10^33         10^39          10^45        10^51
Years (current scaling)  2025      2029          2033           2037         2041
IQ equivalent            -         50            150            250          350
Rank among humans        -         10 billionth  100 000th      1st          -

The conversions here are super-rough point-estimates.[1]
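To make these conversions concrete, here is a minimal sketch that linearly interpolates the footnote-[1] anchors, treating current effective compute as 10^27 and GPT-6 as roughly 10-billionth-ranked (as in the table); all outputs are the same super-rough point-estimates.

```python
# Rough reproduction of the table above from the conversion rules in footnote [1].

def capability_row(extra_ooms):
    """Map additional OOMs of effective compute to the table's other rows."""
    effective_compute = 10 ** (27 + extra_ooms)          # current level taken as ~10^27
    year = 2025 + extra_ooms * (2 / 3)                   # ~3 OOMs per 2 years
    # ~50 IQ points per 3 OOMs, anchored at GPT-6 ~ 50:
    iq = 50 + (extra_ooms - 6) * (50 / 3) if extra_ooms >= 6 else None
    # ~5 OOMs of rank improvement per 6 OOMs of compute, anchored at
    # GPT-6 ~ 10 billionth and Best human ~ 1st:
    rank = 10 ** (10 - (extra_ooms - 6) * (5 / 6)) if 6 <= extra_ooms <= 18 else None
    return effective_compute, year, iq, rank

for label, ooms in [("Current", 0), ("GPT-6", 6), ("Professional", 12),
                    ("Best human", 18), ("Superhuman", 24)]:
    print(label, capability_row(ooms))
```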

Probability scheming emerges at each capability level

To estimate the capability of the earliest schemers, we must estimate the probability that scheming emerges at a particular capability conditional on (i) the labs having a given plan carefulness, and (ii) no scheming arising at lower levels.

Below is a table of P(scheming emerges at this level | plan & no scheming at lower levels). For example, if a D-careful lab deploys Professional-level non-schemers to help build Best Human-level AIs, then there is a 30% chance those Best Human-level AIs are schemers. These probabilities are based on vibes, but I give some explanation below.

            Current   GPT-6   Professional   Best Human   Superhuman
Plans A-C   0%        2%      10%            10%          10%
Plan D      0%        15%     25%            30%          12%
Plan E      0%        18%     50%            70%          12%

NB: I think that P(scheming emerges at this level | plan) will be higher, since that conditional includes scenarios where earlier AIs were scheming.[2]

Why these probabilities

GPT-6 and Professional: The optimistic scenarios (A-C) have a lower chance of yielding weak schemers because:

  1. The lab implements monitoring of the chains-of-thought, internal activations, and tool-use.
  2. The lab implements untrusted monitoring while honeypotting the monitor.
  3. An AI that successfully evades these measures requires situational knowledge about its own training and deployment.
  4. The lab filters this knowledge from both (i) the AI's training data, and (ii) the AI's in-context environment.
  5. Therefore, the AIs would need to derive this knowledge independently.
  6. This seems very difficult for GPT-6 and difficult for Professional.

By contrast, the pessimistic scenarios (D-E) have a high chance of yielding weak schemers because:

  1. Even weak AIs will likely seek power for themselves, when possible. See Alex Turner's argument.
  2. Either (i) the lab will implement no safety measures, or (ii) the safety measures will be subvertable even by weak AIs, or (iii) the subversion strategies will be leaked during training or deployment.
  3. D is 5% lower than E, because there may be cheap interventions to avoid scheming in weak AIs. This is especially true when scaling to Professional capability, because the safety faction in a D lab can leverage the labour of GPT-6.

Best Human: If the lab deploys weak non-schemers, then optimistic situations have a lower chance of yielding strong schemers because:

  1. The lab deploys the weak non-schemers for monitoring, designing honeypots for untrusted monitors, and safety research.

If the lab deploys weak non-schemers, then pessimistic situations have a higher chance of yielding strong schemers because:

  1. Strong AIs would be motivated to scheme, as part of a coherent power-seeking strategy. See Carlsmith 2023.
  2. The lab didn't use the weak non-schemers to automate safety research.
  3. D looks substantially better than E because the safety faction can leverage their own Professional-class AIs.[3]

Superhuman: If the lab deploys strong non-schemers, then both optimistic and pessimistic situations have a low chance of yielding even stronger schemers. In pessimistic scenarios (D-E), the strong non-schemers can act in the interests of the lab, even if the lab leadership are miscalibrated about risks.

Overall answer

Without conditioning on scheming arising: The probability scheming emerges naturally is 56%. If it does emerge, the most likely capability level is Professional (20%), followed closely by Best Human (19%), then GPT-6 (10%) and Superhuman (6%). There's a 44% chance scheming never emerges naturally.

Conditional on scheming arising naturally: The distribution shifts substantially toward weaker schemers. Professional becomes most likely (37%), with Best Human close behind (35%), followed by GPT-6 (18%) and Superhuman (10%). This shift occurs because conditioning on scheming arising updates us toward pessimistic lab scenarios (Plans D and E shift from 60% to 73% combined probability), which have higher rates of weak AI scheming due to inadequate safety measures.
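For transparency, here is a sketch of the aggregation behind these figures, assuming the combined 40% on Plans A-C uses the single "Plans A-C" row above; the numbers in the text may have been computed with a slightly different grouping or rounding, so expect small discrepancies.

```python
# Sketch of the "overall answer" aggregation (hazard-rate style: scheming first
# emerges at a level only if it did not emerge at any lower level).
# The "Current" column is omitted since it is 0% in every row.

LEVELS = ["GPT-6", "Professional", "Best Human", "Superhuman"]

PLANS = {  # plan: (P(plan), [P(emerges at level | plan & none lower), ...])
    "A-C": (0.40, [0.02, 0.10, 0.10, 0.10]),
    "D":   (0.45, [0.15, 0.25, 0.30, 0.12]),
    "E":   (0.15, [0.18, 0.50, 0.70, 0.12]),
}

p_first = {lvl: 0.0 for lvl in LEVELS}
for p_plan, hazards in PLANS.values():
    survive = 1.0                      # P(no scheming at lower levels | plan)
    for lvl, h in zip(LEVELS, hazards):
        p_first[lvl] += p_plan * survive * h
        survive *= 1.0 - h

p_any = sum(p_first.values())
print("P(scheming ever emerges) ~", round(p_any, 2))
for lvl in LEVELS:
    print(lvl, "unconditional:", round(p_first[lvl], 2),
          "| conditional on scheming:", round(p_first[lvl] / p_any, 2))
```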

  1. ^

    Here are some assumptions, credit to others:

    • 3 OOMs effective compute ≈ 1 GPT generation ≈ 50 IQ points
    • 6 OOMs effective compute ≈ 5 OOMs of rank improvement
    • Professional researcher capability (~100 000th best human) corresponds to AI R&D automation threshold
    • Historical scaling rate is ~3 OOMs per 2 years, where effective compute accounts for algorithmic progress. However, this rate could dramatically accelerate after automation threshold (+12 OOMs), potentially reaching 6 OOMs per year during intelligence explosion. It might slow down if hardware supply begins to constrain scaling.
  2. ^

    Across all plans, P(scheming emerges at this level | plan) exceeds P(scheming emerges at this level | plan & no scheming at lower levels) because the former conditional includes scenarios where we have deployed earlier schemers. Deploying earlier schemers increases the chance that later generations are schemers, both evidentially (because the same mechanisms might apply to both early and late schemers) and causally (the early schemers could undermine our training and deployment of later AIs).

    This effect is greater for the pessimistic scenarios (e.g. D/E) because we haven't managed to avoid research sabotage, whereas the optimistic scenarios (e.g. A-C) have a better chance of recovering from early undermining, and so P(scheming emerges at this level | plan) might be closer to P(scheming emerges at this level | plan & no scheming at lower levels).

  3. ^

    In D, safety humans are outnumbered by the careless humans. However, as AI begins to automate cognitive labour, the gap in effectiveness between these two factions should become less stark, as both factions approach the diminishing returns of additional cognitive labour.

    To illustrate, the relative effectiveness between (10 humans) and (990 humans) is greater than between (10 humans and 1M AIs) and (990 humans and 990M AIs).

    Hence, the D-scenario moves closer to the C-scenario and away from the E-scenario.
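    A toy illustration of this, under the purely illustrative assumption that a faction's effectiveness scales with the log of its total cognitive labour:

```python
import math

# Toy model (illustrative only): effectiveness ~ log10(total cognitive workers),
# so returns to extra labour diminish.
def effectiveness(workers):
    return math.log10(workers)

print(effectiveness(990) / effectiveness(10))                  # ~3.0x advantage, humans only
print(effectiveness(990 + 990e6) / effectiveness(10 + 1e6))    # ~1.5x advantage once both have AIs
```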

Reply
Plans A, B, C, and D for misalignment risk
Cleo Nardo · 7d · 30
  1. Try to make deals with the AIs (maybe this counts as 'add things to corpus')
  2. Try to make deals with the lab (maybe this counts as 'Move to Plan C/D')
  3. Try to disrupt the compute supply chain or lab employees
  4. Harden the external environment, d/acc stuff (probably hopeless, but maybe worthwhile on slow takeoff)
  5. YOLO human intelligence enhancement / uploads

Yeah, E seems hard. Especially because from the outside E is indistinguishable from D, and many of the E strategies would be negatively-polarising, hence counterproductive on Plan D.

Reply
Plans A, B, C, and D for misalignment risk
Cleo Nardo · 7d · 20

You can't really have a technical "Plan E" because there is approximately no one to implement the plan; in Plan E situations, the focus should be on moving to a higher level of political will and effort on mitigating risk.

There are no employees who could implement Plan E, but is there nothing that non-employees could do?

Reply
The Case for Mixed Deployment
Cleo Nardo · 15d · 20

I would be very pleased if Anthropic and OpenAI maintained their collaboration.[1][2]

Even better: this arrangement could be formalised via a third party (e.g. a government org or non-profit), and other labs could be encouraged/obliged to join.

  1. ^

    OpenAI's findings from an alignment evaluation of Anthropic models

  2. ^

    Anthropic's findings from an alignment evaluation of OpenAI's models

Reply
Reasons to sell frontier lab equity to donate now rather than later
Cleo Nardo · 16d · 128

I don't think that doubling the incomes of people working on existing projects would be a good use of resources.

Reply
Peter Curtis's Shortform
Cleo Nardo · 16d · 40
  1. Why should this be named for Emma Watson?
  2. "experiences devoid of the friction" --> this is not what the life of wealthy/privileged people is like
  3. "required to rear authentic understanding merited by experience"
    1. People born wealthy/privileged have a good understanding of the world.
    2. I'm suspicious of the 'authentic' modifier, and of 'merited'.
Reply1
Shortform
Cleo Nardo · 25d · 62
  1. Not confident at all.
    1. I do think that safety researchers might be good at coordinating even if the labs aren't. For example, safety researchers tend to be more socially connected, and also they share similar goals and beliefs.
    2. Labs have more incentive to share safety research than capabilities research, because the harms of AI are mostly externalised whereas the benefits of AI are mostly internalised.
      1. This includes extinction obviously, but also misuse and accidental harms which would cause industry-wide regulations and distrust.
    3. Even a few safety researchers at the lab could reduce catastrophic risk.
    4. The recent OpenAI-Anthropic collaboration is super good news. We should be giving them more kudos for this.
      1. OpenAI evaluates Anthropic models
      2. Anthropic evaluates OpenAI models
  2. I think buying more crunch time is great.
    1. While I'm not excited by pausing AI[1], I do support pushing labs to do more safety work between training and deployment.[2][3]
    2. I think sharp takeoff speeds are scarier than short timelines.
    3. I think we can increase the effective crunch time by deploying Claude-n to automate much of the safety work that must occur between training and deploying Claude-(n+1). But I don't know if there are any ways to accelerate Claude-n at safety work but not at capabilities work.
  1. ^

    I think it's an honorable goal, but seems infeasible given the current landscape.

  2. ^

    c.f. RSPs are pauses done right

  3. ^

    Although I think the critical period for safety evals is between training and internal deployment, not training and external deployment. See Greenblatt's Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements) 

Reply
Wikitag Contributions

Dealmaking (AI) · 2 months ago · (+742)
Dealmaking (AI) · 2 months ago · (+31/-69)
Dealmaking (AI) · 2 months ago · (+113)
Dealmaking (AI) · 2 months ago · (+1036)
Posts

34 · The Case for Mixed Deployment · 1mo · 4
44 · Gradient routing is better than pretraining filtering · 1mo · 3
38 · Here’s 18 Applications of Deception Probes · 2mo · 0
16 · Looking for feature absorption automatically · 2mo · 0
31 · Trusted monitoring, but with deception probes. · 3mo · 0
106 · Proposal for making credible commitments to AIs. · 4mo · 45
35 · Can SAE steering reveal sandbagging? · 6mo · 3
13 · Rethinking Laplace's Rule of Succession · 11mo · 5
27 · Appraising aggregativism and utilitarianism · 1y · 10
28 · Aggregative principles approximate utilitarian principles · 1y · 3