Happy to share my reasons/arguments:

  1. I think I'm in part a goal-directed optimizer. I want to eventually offload all of the cognition involved in being a goal-directed optimizer to a superintelligence, as opposed to having some part of it being bottlenecked by my (suboptimal/slow/unstable/unsafe) biological brain. I think this describes or will probably describe many other humans.
  2. Competitive pressures may drive people to do this even if they aren't ready or wouldn't want to in the absence of such pressures.
  3. Some people (such as e/accs) seem happy to build any kind of AGI without regard for safety (or think AGIs will be automatically safe) and therefore may build a goal-directed optimizer, either because it's the easiest kind of AGI to stumble upon, or because they're copying human cognitive architecture or training methods as a shortcut to inventing/discovering new ones.
  4. Even if no AI can ever be described as a "goal-directed optimizer", larger systems composed of humans and AIs can probably be described as such, so they are worth studying from a broader "safety" perspective even if not a narrower "alignment" perspective.
  5. Coherence-based arguments, which I also put some weight on (though perhaps less than others do).

“well, maybe that’s not the best way to think about trained networks and their effects”

This seems fine if you're trying to understand how current or near-future ML models work and how to make them safer, but in the longer run it seems inevitable that we eventually end up with AIs that are more or less well-described as "goal-directed optimizers", so studying this concept probably won't be "wasted" even if it's not directly useful now.

Aside from a technical alignment perspective, it also seems strategically important to better understand how to model future goal-directed AIs, for example whether their decision/game theories will allow unaligned AIs to asymmetrically extort aligned AIs (or have more bargaining power because they have less to lose than aligned AIs), or whether acausal trade will be a thing. This seems like important input into various near-term decisions, such as how much risk of unaligned AI we should tolerate.

Personally I prioritize studying metaphilosophy above topics directly related to "goal-directed optimizers" such as decision theory, as I see the former as a bit more urgent and neglected than the latter, but also find it hard to sympathize with describing the study of the latter as "weird".

Wei Dai and Robin Hanson seem to be gesturing at this point from different directions, how not doing philosophy correctly is liable to get us lost in the long term, and how getting lost in the long term is a basic fact of human condition and AIs don’t change that.

Interesting connection you draw here, but I don't see how "AIs don’t change that" can be justified (unless interpreted loosely to mean "there is risk either way"). From my perspective, AIs can easily make this problem better (stop the complacent value drift as you suggest, although so far I'm not seeing much evidence of urgency), or worse (differentially decelerate philosophical progress by being philosophically incompetent). What's your view on Robin's position?

How many times has someone expressed “I’m worried about ‘goal-directed optimizers’, but I’m not sure what exactly they are, so I’m going to work on deconfusion.”? There’s something weird about this sentiment, don’t you think? I can’t quite put my finger on what, and I wanted to get this post out.

This community inherited the concept of "goal-directed optimizers" and attempted formalizations of it from academia (e.g., vNM decision theory, AIXI). These academic ideas also clearly describe aspects of reality (e.g., decision theory having served as the foundation of economics for several decades now).

Given this, are we not supposed to be both worried (due to the threatening implications of modeling future AIs as goal-directed optimizers) and also confused (due to existing academic theories having various open problems)? Or what is the "not weird" response or course of action here?

I wrote a post expressing similar sentiments but perhaps with a different slant. To me, apparent human moral views along the lines of "heretics deserve eternal torture in hell", or those expressed during the Chinese Cultural Revolution, are themselves largely a product of status games, and there's a big chance that these apparent values do not represent people's true values and instead represent some kind of error (but I'm not sure and would not want to rely on this being true). See also Six Plausible Meta-Ethical Alternatives for some relevant background.

But you're right that the focus of my post here is on people who endorse altruistic values that seem more reasonable to me, like EAs, and maybe earlier (pre-1949) Chinese supporters of communism who were mostly just trying to build a modern nation with a good economy and good governance, but didn't take seriously enough the risk that their plan would backfire catastrophically.

"China’s first attempt at industrialization started in 1861 under the Qing monarchy. Wen wrote that China “embarked on a series of ambitious programs to modernize its backward agrarian economy, including establishing a modern navy and industrial system.”

However, the effort failed to accomplish its mission over the next 50 years. Wen noted that the government was deep in debt and the industrial base was nowhere in sight."

Improving institutions is an extremely hard problem. The theory we have on it is of limited use (things like game theory, mechanism design, contract theory), and with AI governance/institutions specifically, we don't have much time for experimentation or room for failure.

So I think this is a fine frame, but it doesn't really suggest any useful conclusions aside from the same old "let's pause AI so we can have more time to figure out a safe path forward".

Current AIs are not able to “merge” with each other.

AI models are routinely merged by direct weight manipulation today. Beyond that, two models can be "merged" by training a new model using combined compute, algorithms, data, and fine-tuning.
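To make the first kind of merge concrete, here is a minimal sketch of one common direct-weight-manipulation technique: linear interpolation of parameters (sometimes called "model souping"), which assumes both checkpoints were fine-tuned from the same base model so their parameter tensors line up. The dictionary-of-lists representation below is an illustrative stand-in for real framework tensors.

```python
# Hypothetical sketch: merging two model checkpoints by linearly
# interpolating their weights. Assumes identical architectures, so
# every parameter name maps to same-shaped values in both models.

def merge_weights(state_a, state_b, alpha=0.5):
    """Return a new state dict with alpha*A + (1-alpha)*B per parameter."""
    merged = {}
    for name, w_a in state_a.items():
        w_b = state_b[name]
        merged[name] = [alpha * a + (1 - alpha) * b for a, b in zip(w_a, w_b)]
    return merged

# Toy "checkpoints" standing in for real parameter tensors.
model_a = {"layer1.weight": [1.0, 2.0], "layer1.bias": [0.0, 0.0]}
model_b = {"layer1.weight": [3.0, 4.0], "layer1.bias": [2.0, 2.0]}

soup = merge_weights(model_a, model_b)  # equal-weight average of the two
```

In practice the same idea is applied to framework state dicts (e.g. averaging PyTorch `state_dict` tensors), and more elaborate schemes weight different layers or checkpoints unequally.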

As a result, humans don’t need to solve the problem of “What if a set of AIs form a unified coalition because they can flawlessly coordinate?” since that problem won’t happen while humans are still in charge. We can leave this problem to be solved by our smarter descendants.

How do you know a solution to this problem exists? What if there is no such solution once we hand over control to AIs, i.e., the only solution is to keep humans in charge (e.g. by pausing AI) until we figure out a safer path forward? In your last sentence you say "However, it’s perhaps significantly more likely in the very long-run." So what can we do today to reduce this long-run risk (aside from pausing AI, which you're presumably not supporting)?

That said, it seems the probability of a catastrophic AI takeover in humanity’s relative near-term future (say, the next 50 years) is low (maybe 10% chance of happening).

Others have already questioned you on this, but the fact that you didn't think to mention whether this is 50 calendar years or 50 subjective years is also a big sticking point for me.

Who a decade ago thought that AI would think symbolically? I'm struggling to think of anyone. There was a debate on LW though, around "cleanly designed" versus "heuristics based" AIs, as to which might come first and which one safety efforts should be focused around. (This was my contribution to it.)

If someone had followed this discussion, there would be no need for dramatic updates / admissions of wrongness, just smoothly (more or less) changing one's credences as subsequent observations came in, perhaps becoming increasingly pessimistic if one's hope for AI safety mainly rested on actual AIs being "cleanly designed" (as Eliezer's did). (I guess I'm a bit peeved that you single out an example of "dramatic update" for praise, while not mentioning people who had appropriate uncertainty all along and updated constantly.)

Ah I see, thanks for the clarification. Personally I'm uncertain about this, and have some credence on each possibility, and may have written the OP to include both possibilities without explicitly distinguishing between them. See also #3 in this EAF comment and its followup for more of how I think about this.

Thanks for the pointers. I think these proposals are unlikely to succeed (or are at least very risky) and/or liable to give people a false sense of security (that we've solved the problem when we actually haven't) absent a large amount of philosophical progress, which we're unlikely to achieve given how slow philosophical progress typically is and the lack of resources/effort devoted to it. Thus I find it hard to understand why @evhub wrote "I’m less concerned about this; I think it’s relatively easy to give AIs “outs” here where we e.g. pre-commit to help them if they come to us with clear evidence that they’re moral patients in pain." if these are the kinds of ideas he has in mind.
