The key question here is how difficult the objective O is to achieve. If O is "drive a car from point A to point B", then we agree that it is feasible to have AI systems that "strongly increase the chance of O occurring" (which is precisely what we mean by "goal-directedness") without being dangerous. But if O is something that is very difficult to achieve (i.e. all of humanity is currently unable to achieve it), then it seems that any system that does reliably achieve O has to "find new and strange routes to O" almost tautologically.

Once we build AI systems that find such new routes for achieving an objective, we're in dangerous territory, no matter whether they are explicit utility maximizers, self-modifying, etc. The dangerous part is coming up with new routes that achieve the objective, since most of these routes will contain steps that look like "acquire resources" or "manipulate humans".”

This seems pretty wrong. Many humans are trying to achieve goals that no one currently knows how to achieve, and they are mostly doing that in “expected” ways, and I expect AIs would do the same. Like if O is “solve an unsolved math problem”, the expected way to do that is to think about math, not try to take over the world. If O is “cure a disease”, the expected way to do that is doing medical research, not “acquiring resources”. In fact, it seems hard to think of an objective where “do normal work in the existing paradigm” is not a promising approach.

It’s true that more people means we each get a smaller share of the natural resources, but more people also means greater benefits from innovation and specialization. In particular, the benefits of new technology scale linearly with the population (everyone can use the technology) but the costs of research do not. Since the world is getting richer over time (even as the population increases), the average human is clearly net positive.
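To make the scaling argument concrete, here is a toy calculation. It is only a sketch under strong simplifying assumptions: the research cost is treated as a fixed one-off amount that does not depend on population, every person gets the same benefit, and all the numbers are made up for illustration.

```python
def net_benefit_per_person(population, benefit_per_user, research_cost):
    """Per-person net benefit when a fixed research cost is shared by everyone.

    Assumption (illustrative only): research_cost does not grow with population,
    while the total benefit grows linearly with the number of users.
    """
    total_benefit = population * benefit_per_user
    return (total_benefit - research_cost) / population


# Hypothetical numbers: each user gains 1.0 unit, research costs 500,000 units.
for population in (1_000, 1_000_000, 1_000_000_000):
    print(population, net_benefit_per_person(population, 1.0, 500_000.0))
```

Under these assumptions the same invention is a net loss for a small population and close to the full per-user benefit for a large one, because the fixed cost is spread over more people.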

A more charitable interpretation is that this is a probability rounded to the nearest percent.

I don’t think most people are trying to explicitly write down all human values and then tell them to an AI. Here are some more promising alternatives:

  1. Tell an AI to “consult a human if you aren’t sure what to do”
  2. Instead of explicitly trying to write down human values, learn them by example (by watching human actions, or reading books, or…)

Why should we expect AGIs to optimize much more strongly and “widely” than humans? As far as I know, a lot of AI risk is thought to come from “extreme optimization”, but I’m not sure why extreme optimization is the default outcome.

To illustrate: if you hire a human to solve a math problem, the human will probably mostly think about the math problem. They might consult Google, or talk to some other humans. They will probably not hire other humans without consulting you first. They definitely won’t try to get brain surgery to become smarter, or kill everyone nearby to make sure no one interferes with their work, or kill you to make sure they don’t get fired, or convert the lightcone into computronium to think more about the problem.

I agree with it but I don’t think it’s making very strong claims.

I mostly agree with part 1; just giving advice seems too restrictive. But there’s a lot of ground between “only gives advice” and “fully autonomous”, and again between “fully autonomous” and “globally optimizing a utility function”, and I basically expect a smooth increase in AI autonomy over time as AIs are proven capable and safe. I work in HFT; I think that industry has some of the most autonomous AIs deployed today (although they are not that sophisticated), but they’re very constrained in what actions they can take.

I don’t really have the expertise to have an opinion on “agentiness helps with training”. That sounds plausible to me. But again, “you can pick training examples” is very far from “fully autonomous”. I think there’s a lot of scope for introducing “taking actions” that doesn’t really pose a safety risk (and ~all of Gwern’s examples fall into that, I think; e.g. optimizing over NN hyperparameters doesn’t seem scary).

I guess overall I agree AIs will take a bunch of actions, but I’m optimistic about constraining the action space or the domain/world-model in a way that IMO gets you a lot of safety (in a way that is not well-captured by speculating about what the right utility function is).
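As a rough illustration of what “constraining the action space” could look like, here is a minimal sketch that also uses the “consult a human if you aren’t sure” idea. Everything in it is hypothetical: the class name, the action names, and the confidence threshold are assumptions made for the example, not a description of any real trading system.

```python
from dataclasses import dataclass, field


@dataclass
class ConstrainedAgent:
    """Wrapper that only lets a policy act inside a fixed allowlist (hypothetical)."""

    allowed_actions: frozenset
    confidence_threshold: float = 0.9  # assumed cutoff for acting without review
    log: list = field(default_factory=list)

    def act(self, action: str, confidence: float) -> str:
        if action not in self.allowed_actions:
            # Outside the action space: never executed, whatever the confidence.
            outcome = "rejected"
        elif confidence < self.confidence_threshold:
            # Inside the action space but unsure: hand the decision to a human.
            outcome = "escalated_to_human"
        else:
            outcome = "executed"
        self.log.append((action, outcome))
        return outcome


agent = ConstrainedAgent(frozenset({"place_order", "cancel_order"}))
print(agent.act("place_order", 0.95))
print(agent.act("cancel_order", 0.50))
print(agent.act("hire_contractor", 0.99))
```

The point of the sketch is that the safety property comes from the wrapper, not from whatever objective the underlying policy is pursuing: an action outside the allowlist is refused no matter how confident the policy is.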

My sense is that the existing arguments are not very strong (I, for one, do not find them convincing), and their pretty wide acceptance in EA discussions mostly reflects self-selection (people who are convinced that AI risk is a big problem are more interested in discussing AI risk). So in that sense better intro documents would be nice. But maybe there simply aren't stronger arguments available? (I personally would like to see more arguments from an "engineering" perspective, starting from current computer systems rather than from humans or thought experiments.) I'd be curious what fraction of e.g. random ML people find current intro resources persuasive.

That said, just trying to expose more people to the arguments makes a lot of sense; however convincing the case is, the number of convinced people should scale linearly with the number of exposed people. And social proof dynamics probably make it scale super-linearly.

I expect people to continue making better AI to pursue money/fame/etc., but I don't see why "better" is the same as "extremely goal-directed". There needs to be an argument that optimizer AIs will outcompete other AIs.

Eliezer says that as AI gets more capable, it will naturally switch from "doing more or less what we want" to things like "try and take over the world", "make sure it can never be turned off", "kill all humans" (instrumental goals), "single-mindedly pursue some goal that was haphazardly baked in by the training process" (inner optimization), etc. This is a pretty weird claim that is more assumed than argued for in the post. There's some logic and mathematical elegance to the idea that AI will single-mindedly optimize something, but it's not obvious and IMO is probably wrong (and all these weird bad consequences that would result are as much reasons to think this claim is wrong as they are reasons to be worried if it's true).

IMO the biggest hole here is "why should a superhuman AI be extremely consequentialist/optimizing"? This is a key assumption; without it, concerns about instrumental convergence or inner alignment fall away. But there's no explicit argument for it.

Current AIs don't really seem to have goals; humans sort of have goals, but they are very far from the level of "I want to make a cup of coffee so first I'll kill everyone nearby so they don't interfere with that".

I don't think "burn all GPUs" fares better on any of these questions. I guess you could imagine it being more "accessible" if you think building aligned AGI is easier than convincing the US government that AI risk is truly an existential threat (which seems implausible).

"Accessibility" seems to illustrate the extent to which AI risk can be seen as a social rather than technical problem; if a small number of decision-makers in the US and Chinese governments (and perhaps some semiconductor companies and software companies) were really convinced AI risk was a concern, they could negotiate to slow hardware progress. But the arguments are not convincing (including to me), so they don't.

In practice, negotiation and regulation (I guess somewhat similar to nuclear non-proliferation) would be a lot better than "literally bomb fabs". I don't think chip production being driven underground is a realistic concern, since cutting-edge fabs are very expensive.

