In her recent Counterarguments to the basic AI x-risk case, Katja Grace lays out a basic argument for existential risk from superhuman AI, consisting of three claims, and proceeds to explain the gaps she sees in each of them. Here are the claims:

A. Superhuman AI systems will be "goal-directed"

B. Goal-directed AI systems' goals will be bad

C. Superhuman AI systems will be sufficiently superior to overpower humanity

I personally consider all three claims to be likely, and changing my mind regarding any of them would change my overall assessment of AI existential risk, so I decided to evaluate each counterargument carefully for a potential change of opinion.

A. Superhuman AI systems will be "goal-directed"

Contra A, Katja first argues that there are several different concepts of "goal-directedness", and that AIs may have weaker versions of rationality/goals that are still economically useful, but without the "zealotry" of utility-function maximization. Here is a future she envisions:

It often looks like AI systems are ‘trying’ to do things, but there’s no reason to think that they are enacting a rational and consistent plan, and they rarely do anything shocking or galaxy-brained.

While I agree that such pseudo-agents are possible, they seem like the sort of first-generation AGIs that humans would develop, not like the superintelligent AIs that would emerge through intelligent self-modification.

Katja mentions coherence arguments as a "force" pushing in the direction of utility maximization, but argues this may not be enough to get there:

Starting from a state of arbitrary incoherence and moving iteratively in one of many pro-coherence directions produced by whatever whacky mind you currently have isn’t obviously guaranteed to increasingly approximate maximization of some sensical utility function.

I am not very convinced by this argument, and I hold the position that superintelligent AIs will be rational (in the utility-maximization sense). The superintelligent AIs I am worried about are the ones who got where they are through intelligent self-modification. I don't think we should look at each individual self-modification step and try to determine whether it makes the system significantly "more" utility-maximizing, because the AI we care about has the potential to take many such steps very quickly.

So either the AI is randomly modifying itself in a cycle (not very smart to begin with), or it will reach a fixed point, at least with respect to quick/inexpensive self-modifications.

What matters then is what the fixed points are, and coherence arguments are very strong in this regard. The AI has to look coherent from its own perspective, so that it has no interest in modifying itself further in the short term. And because it is smarter than us, it will look coherent from our perspective as well.
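To make the fixed-point intuition concrete, here is a toy sketch (my own construction, not from Katja's post or mine above). Preferences over a finite set of outcomes are directed edges, an incoherence is a preference cycle (exploitable by a money pump), and each "pro-coherence" self-modification step deletes one edge of some cycle. Every fixed point of this process is acyclic, and any acyclic relation on a finite set is representable by a utility function:

```python
import random

def find_cycle(prefs, outcomes):
    """Return one preference cycle as a node list, or None if none exists.
    An edge (a, b) means "a is strictly preferred to b"."""
    succ = {o: [b for (a, b) in prefs if a == o] for o in outcomes}
    visited = set()

    def dfs(node, stack):
        visited.add(node)
        stack.append(node)
        for nxt in succ[node]:
            if nxt in stack:                 # back-edge: found a cycle
                return stack[stack.index(nxt):]
            if nxt not in visited:
                cycle = dfs(nxt, stack)
                if cycle:
                    return cycle
        stack.pop()
        return None

    for o in outcomes:
        if o not in visited:
            cycle = dfs(o, [])
            if cycle:
                return cycle
    return None

def self_modify(prefs, outcomes, rng):
    """Each step drops one random edge of some cycle: a 'pro-coherence' move.
    The fixed point is an acyclic (coherent) preference relation."""
    prefs = set(prefs)
    while (cycle := find_cycle(prefs, outcomes)) is not None:
        edges = list(zip(cycle, cycle[1:] + cycle[:1]))
        prefs.discard(rng.choice(edges))
    return prefs

def utility(prefs, outcomes):
    """Any topological order of an acyclic relation yields a utility function."""
    indeg = {o: 0 for o in outcomes}
    for _, b in prefs:
        indeg[b] += 1
    ready = [o for o in outcomes if indeg[o] == 0]
    u, rank = {}, len(outcomes)
    while ready:
        o = ready.pop()
        u[o] = rank                          # emitted earlier = higher utility
        rank -= 1
        for (a, b) in prefs:
            if a == o:
                indeg[b] -= 1
                if indeg[b] == 0:
                    ready.append(b)
    return u

outcomes = ["w", "x", "y", "z"]
cyclic = {("w", "x"), ("x", "y"), ("y", "w"), ("y", "z"), ("z", "x")}
coherent = self_modify(cyclic, outcomes, random.Random(0))
u = utility(coherent, outcomes)
print("coherent preferences:", sorted(coherent))
print("utility function:", u)
```

Of course, this only shows what the fixed points look like in a finite toy setting; it says nothing about whether the surviving preferences are the ones we would want.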

The crux to me seems to be whether goal-directed superintelligences will be created through self-modification, or through other means. If they are created by explosive self-modification, they are likely to be utility maximizers.

Instead, if there is a large 'economy' of pseudo-agentic AIs creating newer versions of other pseudo-agentic AIs, then their goal-directedness may be influenced less by coherence arguments than by economic factors (including trust, reliability, etc), which may push for less agency. This scenario seems more likely in a slow-takeoff, and less likely to cause existential risks.

B. Goal-directed AI systems’ goals will be bad

Contra B, Katja argues that small differences in utility functions may not be catastrophic, and that AIs may be able to learn human values well enough.

Just as AIs are able to learn about human faces (never missing essential components such as noses), they may be able to learn about human values as well with good enough accuracy.

This argument is not very convincing to me. A superintelligent AI, quite unlike current AI systems, is not only heavily optimized (well trained) but is itself an optimizer: it performs powerful search over its world-model. That is what a superintelligent AI "having a goal" means!

Current AIs that recognize or generate faces are nothing like that. Diffusion models do something akin to energy minimization, which may be a crucial step for powerful search, but by themselves they don't do search.

A comment by DaemonicSigil explains why finding goals that are robust to strong optimization is hard. In many search processes you have a generator (a source of useful randomness) and a discriminator (which evaluates the 'score').

This analogy was very direct in the case of GANs, but does not apply to the newer and more powerful diffusion-based tools. If we use the right analogy for search, however, the AI would probably generate many different prompts for the diffusion model, carefully optimizing them to get the "best" possible image according to its values. Strong enough optimization will push it far out of distribution and may well "break" such close-but-imperfect values, making the image bad even if the values were "good enough" when evaluated on in-distribution images.

If a superintelligent AI has goals that are somewhat aligned and good enough in-distribution, but become really bad out-of-distribution (after much optimization pressure on the world), then it becomes extremely important to know whether the AI can overpower humanity easily or not, for that is the case in which the distribution can change the most.
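This failure mode is easy to demonstrate numerically. Below is a toy sketch (my own numbers, not from DaemonicSigil's comment): the "true" value V peaks at x = 0, while the learned proxy U matches V closely near the training distribution but has a small spurious upward trend far outside it. A weak optimizer searching near the training distribution lands on a genuinely good point; a strong optimizer searching far out-of-distribution maximizes the proxy while destroying the true value:

```python
import math
import random

def V(x):
    # True value: only x near 0 is actually good.
    return math.exp(-x * x)

def U(x):
    # Learned proxy: agrees with V in-distribution, but carries a tiny
    # spurious out-of-distribution error term that grows with |x|.
    return math.exp(-x * x) + 0.02 * abs(x)

def optimize(strength, rng, samples=2000):
    """Search over [-strength, strength] and return the best candidate
    according to the proxy U (not the true value V)."""
    candidates = [rng.uniform(-strength, strength) for _ in range(samples)]
    return max(candidates, key=U)

rng = random.Random(0)
weak = optimize(2.0, rng)      # search stays near the training distribution
strong = optimize(100.0, rng)  # powerful search ranges far out-of-distribution

print(f"weak optimizer:   x={weak:+.2f}  true value={V(weak):.3f}")
print(f"strong optimizer: x={strong:+.2f}  true value={V(strong):.3f}")
```

The proxy was "good enough" for every in-distribution candidate; it is the optimization strength, not the size of the value error, that determines the outcome.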

Another argument given is that agents may end up caring about near-term issues, with no incentive to make long-term, large-scale schemes.

In such a scenario it would be much easier to control advanced AIs, and I can envision a success model built around that concept.

However, there is a big danger: myopic AIs are unstable under self-modification. Long-term components are expected to get amplified over time, and misaligned proxies useful for short-term goals can "infiltrate" and eventually become long-term goals. So while I expect "myopic AI" to be potentially useful in the alignment toolkit (as I have proposed in my AI-in-a-box success model), I will be surprised if it leads to success spontaneously.

Beyond all this, there is the fact that we currently don't have much of an idea of how to "train" an optimizer AI to have values close to those of humans. We don't know what utility function would be good to have, but this does not matter much, as we cannot directly choose an AGI's utility function anyway.

What we may be able to do is follow the inspiration of brains and create AGIs with steering subsystems. This may give us a chance to somewhat direct the goals of an AGI, but giving it goals close to human ones is going to be a big challenge, one that may require us to reverse-engineer/understand some of the hardcoded genetic circuitry that influences goals (e.g. social instincts) in our own brains.

C. Superhuman AI systems will be sufficiently superior to overpower humanity

The first argument given against this claim is that maybe individual intelligence is not that important compared to a lack of rights, wealth, or power, and that AIs will be smarter but will remain subjugated to humans, generating profits for their owners. This may happen in part because humans don't trust them.

While this may be possible in a slow-takeoff scenario, it requires institutions (and most likely other AIs) that are well-adapted to enforce human supremacy. Free AIs would have to be prohibited or severely discriminated against. They would have to be incapable of holding large wealth (e.g. in blockchains). It would have to be impossible for AIs to use humans as their agents for power, which requires that society can detect, for any sufficiently powerful human, whether they are currently being blackmailed by AIs, or whether they have been convinced by AIs to serve as their agents. There will be a very strong economic interest in collaborating with AIs, so these rules would need to be harshly and globally enforced. It may even be necessary to prevent AIs' free use of the internet, and to prevent them from influencing human opinion through memetic content.

The second argument is that maybe agentic AIs will only give an advantage in tasks requiring fast interaction (e.g. not the ones involved in advancing technology and science), or that there may not be that much headroom for intelligence to provide power (the low-hanging fruits are already picked or will be picked soon). Or maybe feedback loops will not be strong enough, and AI may represent a small proportion of cognitive work for a long time.

This seems very unlikely for a variety of reasons. Truly general superintelligences should be able to reason at least as well as humans, but much faster. It wouldn't make sense for progress to "wait" for minds embodied in biological brains that reason 100x slower.

As Nate Soares mentioned, there is a lot of headroom for intelligence, that is, for optimization, as shown by evolution (e.g. ribosomes). At a bare minimum, the headroom for intelligence should include AIs mastering nanotechnology for various uses, as well as biology. This achievement should be accessible to a superintelligent AGI.

Regarding feedback loops: if AIs are as capable as humans, and they can be run at faster serial speeds than humans, then they can improve the rate of technological development (including the development of AI itself). Things like Moore's law are not fundamental physical laws. They depend on how fast human engineers think, and as soon as AIs contribute to that, the shape of the curve is likely to change. When trying to come up with a model for this intelligence feedback loop, we get either stagnation or explosion; we never actually get "business as usual" with only slightly faster economic growth. AI being a small proportion of cognitive work effectively means the singularity (and superintelligence) is not here yet.
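The stagnation-or-explosion dichotomy can be seen in a minimal toy model (my own simplification, not a model from the post). Let capability A feed back on its own growth, dA/dt = A**r, where r is the returns to cognitive reinvestment. For r < 1 successive doublings take ever longer (stagnation), for r > 1 they take ever less time (a finite-time explosion), and only the knife-edge case r = 1 gives steady exponential "business as usual":

```python
def doubling_times(r, doublings=20, dt=1e-4):
    """Euler-integrate dA/dt = A**r from A = 1 and return the duration of
    each successive doubling of capability."""
    a, t, times, target = 1.0, 0.0, [], 2.0
    while len(times) < doublings:
        a += (a ** r) * dt
        t += dt
        if a >= target:
            times.append(t)
            target *= 2
    # Convert absolute crossing times into per-doubling durations.
    return [times[0]] + [times[i] - times[i - 1] for i in range(1, doublings)]

for r in (0.8, 1.0, 1.2):
    d = doubling_times(r)
    print(f"r={r}: first doubling takes {d[0]:.3f}, twentieth takes {d[-1]:.3f}")
```

Since r = 1 is a measure-zero knife edge, a sustained "slightly faster than today" regime is not a natural outcome of this kind of model; the interesting empirical question is which side of r = 1 the real feedback loop sits on.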

The third argument is that maybe the AI's goals will not lead it to take over the universe. Like humans looking for romance, it may recognize that taking over the universe is neither easy nor necessary for achieving what it wants.

Even if this is true for most agentic AIs, those that do wish to influence the universe over the long term should attempt to gain power to that end. So this argument requires not merely that most AIs are uninterested in the future of the universe, but that strictly none of the agentic superintelligent AIs are.

The last argument is that maybe our concepts are vague. Just as the 'population bomb' didn't materialize, maybe AI will go fine in ways that don't fit our abstractions.

I would argue that what happened was not so much the world failing to fit the abstractions as our failure to reason about the ways things could go differently than expected. The abstractions used by the Malthusian 'population bomb' prediction were not totally off: either population would have to stop growing, or much more food would need to be produced, or there would be massive famines in the world. The prediction assumed that population growth would not decrease and that food production efficiency would not increase, and both assumptions were mistakes. Maybe it would have been hard to predict such things with certainty, but wise people of the time could at least correctly deduce what the logical possibilities for "non-massive-famine" were. We may be equally able to consider what the possibilities for "no AI x-risk" are.


While many uncertainties remain, a weak inside view of intelligence and optimization can lead one to think that an intelligence explosion is overwhelmingly likely.

This crux view can easily lead to greater certainty in a specific version of A: superhuman AI will not only be goal-directed, but will likely have become so through self-modification, in a way that is predicted to generate utility-maximizers rather than pseudo-agents with inconsistencies visible to humans.

The same weak inside view produces a specific version of B: goals that would seem "close enough" to generate good-enough states for humans with weak/medium optimization can generate very bad states when an agent is capable of strong optimization over the world. This makes it much more likely for the AI's goals to be bad.

Finally, the view of an intelligence explosion also leads naturally to C: true superintelligence only happens after AI's autonomous contribution to research is very significant. The contribution of normal humans to this process may quickly drop to near zero, particularly if AGIs are strictly better at cognition than humans and faster. This is the mechanism by which AIs can become powerful or numerous enough to overpower humanity.

While other gaps exist, and uncertainties remain, it seems possible that acceptance or rejection of the intelligence explosion thesis is the main crux behind many of the "gaps" seen in the AI x-risk case.
