COVID and climate change are actually easy problems that only became serious or highly costly because of humanity's irrationality and lack of coordination.

  • COVID - Early lockdown in China + border closures + testing/tracing stops it early, or stockpiling enough elastomeric respirators for everyone keeps health/economic damage at a minimum (e.g. making subsequent large scale lockdowns unnecessary).
  • Climate change - Continued nuclear rollout (i.e. if it hadn't stopped or slowed down decades ago) + plugin hybrids or EVs allows the world to be mostly decarbonized at minimal cost, or if we failed to do that, geoengineering minimizes damage.

For me, the generalization from these two examples is that humanity is liable to incur at least 1 or 2 orders of magnitude more cost/damage than necessary from big risks, so if you think an optimal response to AI risk means incurring 1% loss of expected value (from truly unpredictable accidents that happen even when one has taken all reasonable precautions), then the actual response would perhaps incur 10-100%.
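To make the arithmetic explicit, here is a minimal sketch. The function name and the cap at 100% are my own illustrative assumptions; only the 1% figure and the 1-2 orders of magnitude multiplier come from the text above.

```python
def actual_loss(optimal_loss: float, inefficiency_factor: float) -> float:
    """Scale the optimal-response loss by a multiplier for irrationality and
    poor coordination. Capped at 1.0, since at most 100% of expected value
    can be lost (the cap is an assumption added for this sketch)."""
    return min(1.0, optimal_loss * inefficiency_factor)


optimal = 0.01  # 1% loss of expected value under an optimal response

# 1 and 2 orders of magnitude more cost/damage than necessary
for factor in (10, 100):
    print(f"x{factor}: {actual_loss(optimal, factor):.0%} of expected value lost")
```

So a 1% optimal-case loss becomes roughly 10% at one order of magnitude and hits the 100% ceiling at two.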


I don't think I understand: what's the reason to expect that the "acausal economy" will look like a bunch of acausal norms, as opposed to, say, each civilization first figuring out what its ultimate values are and how to encode them into a utility function, then merging that with every other civilization's utility function? (Not saying that I know it will be the latter, just that I don't know how to tell at this point.)

Also, given that I think AI risk is very high for human civilization, and that there is no reason to suspect that we're not a typical pre-AGI civilization, most of the "acausal economy" might well consist of unaligned AIs (created accidentally by other civilizations), which makes it seemingly even harder to reason about what this "economy" looks like.

That’s the path the world seems to be on at the moment. It might end well and it might not, but it seems like we are on track for a heck of a roll of the dice.

I agree with almost everything you've written in this post, but you must have some additional inside information about how the world got to this state, having been on the board of OpenAI for several years, and presumably knowing many key decision makers. Presumably this wasn't the path you hoped that OpenAI would lead the world onto when you decided to get involved? Maybe you can't share specific details, but can you at least talk generally about what happened?

(In addition to satisfying my personal curiosity, isn't this important information for the world to have, in order to help figure out how to get off this path and onto a better one? Also, does anyone know if Holden monitors the comments here? He apparently hasn't replied to anyone in months.)


We have a lot of experience and knowledge of building systems that are broadly beneficial and safe, while operating in the human capabilities regime.

What? A major reason we're in the current mess is that we don't know how to do this. For example, we don't seem to know how to build a corporation (or more broadly an economy) such that its most powerful leaders don't act like Hollywood villains (e.g., racing for AI to make a competitor 'dance'). Even our "AGI safety" organizations don't behave safely (e.g., racing for capabilities, then handing them over to others such as Microsoft with little or no control over how they're used). You yourself wrote:

Unfortunately, given that most other actors are racing for as powerful and general AIs as possible, we won’t share much in terms of technical details for now.

How is this compatible with the quote above?!

Looking forward to your next post, but in the meantime:

  1. AI - Seems like it would be easier to build an AI that helps me get what I want, if "what I want" had various nice properties and I wasn't in “crossing that bridge when we come to it” mode all the time.
  2. meta-ethical uncertainty - I can't be sure there is no territory.
  3. ethics/philosophy as a status game - I can't get status from this game if I opt out of it.
  4. morality as coordination - I'm motivated to make my morality have various nice properties because it helps other people coordinate with me (by letting them better predict what I would do in various situations/counterfactuals).

My first thought upon hearing about Microsoft deploying a GPT derivative was (as I told a few others in private chat) "I guess they must have fixed the 'making up facts' problem." My thinking was that a big corporation like Microsoft that mostly sells to businesses would want to maintain a reputation for only deploying reliable products. I honestly don't know how to adjust my model of the world to account for whatever happened here... except to be generically more pessimistic?

Answer by Wei_Dai, Feb 15, 2023

But it seems increasingly plausible that AIs will not have explicit utility functions, so that doesn’t seem much better than saying humans could merge their utility functions.

There are a couple of ways to extend the argument:

  1. Having a utility function (or some other stable explicit representation of values) is a likely eventual outcome of recursive self-improvement, since it makes you less vulnerable to value drift and manipulation, and makes coordination easier.
  2. Even without utility functions, AIs can try to merge, i.e., negotiate and jointly build successors with values that represent a compromise of their individual values. It seems likely this will be much easier, less costly, more scalable, and more effective for them than the analogous thing is for humans (to the extent that there is an analogy, perhaps having and raising kids together).

I think AIs with simpler values (e.g., paperclip maximizers) have an advantage with both 1 and 2, which seems like bad news for AI risk.

Whereas shard theory seems aimed at a model of human values that’s both accurate and conceptually simple.

Let's distinguish between shard theory as a model of human values, versus implementing an AI that learns its own values in a shard-based way. The former seems fine to me (pending further research on how well the model actually fits), but the latter worries me in part because it's not reflectively stable and the proponents haven't talked about how they plan to ensure that things will go well in the long run. If you're talking about the former and I'm talking about the latter, then we might have been talking past each other. But I think the shard-theory proponents are proposing to do the latter (correct me if I'm wrong), so it seems important to consider that in any overall evaluation of shard theory?

BTW here are two other reasons for my worries. Again these may have already been addressed somewhere and I just missed it.

  1. The AI will learn its own shard-based values which may differ greatly from human values. Even different humans learn different values depending on genes and environment, and the AI's "genes" and "environment" will probably lie far outside the human distribution. How do we figure out what values we want the AI to learn, and how to make sure the AI learns those values? These seem like very hard research questions.
  2. Humans are all partly or even mostly selfish, but we don't want the AI to be. What's the plan here, or reason to think that shard-based agents can be trained to not be selfish?