LESSWRONG
LW

1102
ryan_greenblatt
20663Ω49705219428
Message
Dialogue
Subscribe

I'm the chief scientist at Redwood Research.

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
15ryan_greenblatt's Shortform
Ω
2y
Ω
317
Plans A, B, C, and D for misalignment risk
ryan_greenblatt2d20

(Yes, also I think that a small number of employees working on safety might get proportionally more compute than the average company employee, e.g. this currently seems to be the case.)

Reply
Plans A, B, C, and D for misalignment risk
ryan_greenblatt2dΩ4100

Maybe I should clarify my view a bit on Plan A vs "shut it all down":

  • Both seem really hard to pull off from a political will and actually making it happen.
    • Plan A is complicated and looking at I go "oh jeez, this seems really hard to make happen well, idk if the US government has the institutional capacity to pull off something like this". But, I also think pulling off "shut it all down" seems pretty rough and pulling off a shutdown that lasts for a really long time seems hard.
    • Generally, it seems like Plan A might be hard to pull off and it's easy for me to imagine it going wrong. This also mostly applies to "shut it all down" though.
    • I still think Plan A is better than other options and Plan A going poorly can still nicely degrade into Plan B (or C).
  • It generally looks to me like Plan A is strictly better on most axes, though maybe "shut it all down" would be better if we were very confident in maintaining extremely high political will (e.g. at or above peak WW2 level political will) for a very long time.
    • There would still be a question of whether the reduction in risk is worth delaying the benefits of AI. I tenatively think yes even from a normal moral perspective, but this isn't as clear of a call as doing something like Plan A.
    • I worry about changes in the balance of power between the US and China with this long of a pause (and general geopolitical drift making the situation worse and destabilizing this plan). But, maybe if you actually had the political will you could make a deal which handles this somehow? Not sure exactly how, minimally the US could try to get its shit together more in terms of economic growth.
    • And I'd be a bit worried about indefinite pause but not that worried.

This is my view after more seriously getting into some of the details of the Plan A related to compute verification and avoiding blacksite projects as well as trying to do a more precise comparison with "shut it all down".

Reply1
Plans A, B, C, and D for misalignment risk
ryan_greenblatt2dΩ340

One alternative way of thinking about this is to decompose plans by which actor the plan is for:

  • Plan A: Most countries, at least the US and China
  • Plan B: The US government (and domestic industry)
  • Plan C: The leading AI company (or maybe a few of the leading AI companies)
  • Plan D: A small team with a bit of buy in within the leading AI company

This isn't a perfect breakdown, e.g. Plan A might focus mostly on what the US should do, but it might still be helpful.

This decomposition was proposed by @Lukas Finnveden.

Reply
Plans A, B, C, and D for misalignment risk
ryan_greenblatt2d93

(I assume you're asking "why isn't it much less bad than AI takeover" as opposed to "isn't it almost as bad as AI takeover, like 98% as bad".)

I care most about the long-run utilization of cosmic resources, so this dominates my thinking about this sort of question. I think it's very easy for humans to use cosmic resources poorly from my perspective and I think this is more likely if resources are controlled by an autocratic regime, especially an autocratic regime where one person holds most of the power (which seems reasonably likely for a post-AGI CCP). In other words, I think it's pretty easy to lose half of the value of the long-run future (or more) based on which humans are in control and how this goes.

I'll compare the CCP having full control to broadly democratic human control (e.g. most cosmic resources are controlled by some kinda democratic system or auctioned while retaining democracy).

We could break this down into likelihood of carefully reflecting and then how much this reflection converges. I think control by an autocratic regime makes reflection less likely and that selection effects around who controls the CCP are bad making post-reflection convergence worse (and it's unclear to me how much I expect reflection to converge / be reasonable in general).

Additionally, I think having a reasonably large number of people having power substantially improves the situation due to the potential for trade (e.g., many people might not care much at all about long-run resource use in far away galaxies, but this is by far the dominant source of value from my perspective!) and the beneficial effects on epistemics/culture (though this is less clear).

Part of this is that I think pretty crazy considerations might be very important to having the future go close to as well as it could (e.g. acausal trade etc) and this requires some values+epistemics combinations which aren't obviously going to happen.

This analysis assumes that the AI that takes over is a pure paperclipper which doesn't care about anything else. Taking into account the AI that takes over potentially having better values doesn't make a big difference to the bottom line, but the fact that AIs that take over might be more likely to do stuff like acausal trade (than e.g. the CCP) makes "human autocracy" look relatively worse compared to AI takeover.

See also Human takeover might be worse than AI takeover, though I disagree with the values comparisons. (I think AIs that take over are much more likely to have values that I care about very little than this post implies.)


As far as outcomes for currently alive humans, I think full CCP control is maybe like 15% as bad as AI takeover relative to broadly democratic human control. AI takeover maybe kills a bit over 50% of humans in expectation while full CCP control maybe kills like 5% of humans in expectation (including outcomes for people that are as bad as death and involve modifying them greatly), but also has some chance of imposing terrible outcomes on the remaining people which are still better than death.

Partial CCP control probably looks much less bad?


None of this is to say we shouldn't cooperate on AI with China and the CCP. I think cooperation on AI would be great and I also think that if the US (or an AI company) ended up being extremely powerful, that actor shouldn't violate Chinese sovereignty.

Reply111
Plans A, B, C, and D for misalignment risk
ryan_greenblatt2d83

I don't think you get radical reductions in mortality and radical life extension with (advanced) narrow bio AI without highly capable general AI. (It might be that a key strategy for unlocking much better biotech is highly capable general AIs creating extremely good narrow bio AI, but I don't think the narrow bio AI which humans will create over the next ~30 years is very likely to suffice.) Like narrow bio AI isn't going to get you (arbitrarily good) biotech nearly as fast as building generally capable AI would. This seems especially true given that much, much better biotech might require much higher GDP for e.g. running vastly more experiments and using vastly more compute. (TBC, I don't agree with all aspects of the linked post.)

I also think people care about radical increases in material abundance which you also don't get with narrow bio AI.

And the same for entertainment, antidepressants (and other drugs/modifications that might massively improve quality of life by giving people much more control over mood, experiences, etc), and becoming an upload such that you can live a radically different life if you want.

You also don't have the potential for huge improvements in animal welfare (due to making meat alternatives cheaper, allowing for engineering away suffering in livestock animals, making people wiser, etc.)

I'm focusing on neartermist-style benefits; as in, immediate benefits to currently alive (or soon to be born by default) humans or animals. Of course, powerful AI could result in huge numbers of digitial minds in the short run and probably is needed for getting to a great future (with a potentially insane amount of digital minds and good utilization of the cosmic endowment etc.) longer term. The first order effects on benefits of delaying don't matter that much from a longtermist perspective of course, so I assumed we were fully operating a neartermist-style frame when talking about benefits.

Reply
Plans A, B, C, and D for misalignment risk
ryan_greenblatt2d40

I wasn't trying to assume in the post that leading AI companies have a legible, unambiguous and are willing to spend down this lead. I was just noting that my "Plan C" proposal requires that companies have a lead they are willing to spend down (which might not be legible or unambiguous, though I do think ambiguity gets less bad as AIs get more capable and their impacts are more clear).

Reply
ryan_greenblatt's Shortform
ryan_greenblatt2d*Ω4510352

Anthropic, GDM, and xAI say nothing about whether they train against Chain-of-Thought (CoT) while OpenAI claims they don't[1].

I think AI companies should be transparent about whether (and how) they train against CoT. While OpenAI is doing a better job at this than other companies, I think all of these companies should provide more information about this.

It's particularly striking that Anthropic says nothing about whether they train against CoT given their system card (for 4.5 Sonnet) is very thorough and includes a section on "Reasoning faithfulness" (kudos to them for providing so much other information!). Naively, this seems like some evidence that they are training against CoT and didn't want to disclose this.[2]

I think we should try to create an incentive gradient that pushes AI companies to disclose information even if that information makes them look bad, so we should make companies feel some heat for not disclosing (reasonably) important information like how/whether they are training against CoT.


It's worth noting that some versions of training against CoT might be totally fine (as in, they don't incentivize obfuscation of misaligned reasoning) and that some particular approach for training against CoT could be worth it even if it does incentivize obfuscating misaligned reasoning. For instance, it could end up being the case that training against CoT works well and we end up with sufficient confidence that this doesn't result in substantial obfuscation of reasoning. The situation is also just generally complicated because things like "what SFT/prompting initialization do you use for your CoT" could have big effects on obfuscation and aren't exactly well described as applying optimization pressure to CoT. (But, if you start iterating on the SFT/prompting initialization to reduce misalignment observed in the CoT, this could cause problems.)

Given that the situation is complicated, ideally AI companies would publicly disclose enough information that other actors can come to their own conclusions about how worried they should be about CoT obfuscation.[3] If companies are strongly opposed because this would leak too much IP, they could instead disclose this information to some number of non-conflicted third parties (e.g. METR) who could then publicly share their conclusions.


  1. Source: METR's evaluation of GPT-5 says that OpenAI asserts "There was no direct training pressure on GPT-5 to hide or obfuscate misaligned reasoning in reasoning traces" and "There's no particular reason to expect that GPT-5's reasoning traces are less monitorable than METR would assume.". It's of course possible that OpenAI will train against CoT in future models and it's possible that for GPT-5 they did train against CoT in some way that they think wouldn't directly apply pressure to hide/obfuscate misaligned reasoning (or they could be lying). ↩︎

  2. Of course, it is totally possible that Anthropic just didn't get around to saying anything or just defaulted to secrecy rather than specifically going out of their way to not say anything in this case. I think the prior probability on this happening just because not saying anything is the default (or something like this) is high. ↩︎

  3. The information that OpenAI has disclosed doesn't suffice for other actors (that don't have private information) to come to their own conclusions without trusting relatively general assurances from OpenAI that aren't backed by very specific claims. Of course, this is still much more information about training against CoT than other companies! ↩︎

Reply1
Plans A, B, C, and D for misalignment risk
ryan_greenblatt3d30

I guess I expect 0-9 month lead and coordination between companies seems possible.

Reply
Tomás B.'s Shortform
ryan_greenblatt3d20

To clarify my perspective a bit here:

I think literally just prepending "Humor:" or "Mostly joking:" would make me think this was basically fine. Or like a footnote saying "mostly a joke / not justifying this here". Like it's just good to be clear about what is non-argued for humor vs what is a serious criticism (and if it is a serious criticism of this sort, then I think we should have reasonably high standards for this, e.g. like the questions in my comment are relevant).

Idk what the overall policy for lesswrong should be, but this sort of thing does feel scary to me and worth being on guard about.

Reply
Tomás B.'s Shortform
ryan_greenblatt3d20

As in, you think frontier math is (strongly) net negative for this reason? Or you think Epoch was going to do work along these lines that you think is net negative?

Reply
Load More
40Iterated Development and Study of Schemers (IDSS)
Ω
2d
Ω
1
116Plans A, B, C, and D for misalignment risk
Ω
4d
Ω
66
225Reasons to sell frontier lab equity to donate now rather than later
16d
32
55Notes on fatalities from AI takeover
7d
60
47Focus transparency on risk reports, not safety cases
Ω
20d
Ω
3
40Prospects for studying actual schemers
Ω
23d
Ω
0
46AIs will greatly change engineering in AI companies well before AGI
Ω
1mo
Ω
9
154Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro
Ω
1mo
Ω
32
99Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements)
Ω
2mo
Ω
2
163My AGI timeline updates from GPT-5 (and 2025 so far)
Ω
2mo
Ω
14
Load More
Anthropic (org)
9 months ago
(+17/-146)
Frontier AI Companies
a year ago
Frontier AI Companies
a year ago
(+119/-44)
Deceptive Alignment
2 years ago
(+15/-10)
Deceptive Alignment
2 years ago
(+53)
Vote Strength
2 years ago
(+35)
Holden Karnofsky
2 years ago
(+151/-7)
Squiggle Maximizer (formerly "Paperclip maximizer")
3 years ago
(+316/-20)