Counting Arguments in AI Safety

Samuel Ratnam

This is a linkpost for https://substack.com/home/post/p-198682900

cf. https://www.lesswrong.com/posts/YsFZF3K9tuzbfrLxo/counting-arguments-provide-no-evidence-for-ai-doom , https://www.lesswrong.com/posts/yQSmcfN4kA7rATHGK/many-arguments-for-ai-x-risk-are-wrong

A counting argument is a style of argument that looks something like this:

We are drawing from a space where there are many more Xs than Ys
Therefore, absent any strong reason to expect Ys, we are much more likely to get Xs

For example, when trying to answer the question “what is the probability that superintelligent systems will want to kill us?”, we might use an argument like:

A superintelligent AI will land somewhere in a vast space of possible goals. The goals compatible with our survival occupy only a tiny corner of that space.
Absent a good reason to believe our training process selects strongly for human-friendly goals, it is much more likely that the goals of the AI end up somewhere else - in the region where everyone dies.

I wonder what you think of this argument. Does it sound familiar to you? Does it seem reasonable?

Bertrand's Paradox

Consider an equilateral triangle inscribed in a circle. Suppose a chord of the circle is chosen at random. What is the probability that the chord is longer than a side of the triangle?

Method 1: random endpoints. Pick two points at random on the circumference and draw the chord between them. By rotational symmetry, fix one endpoint at a vertex of the triangle. The chord is longer than a side iff the other endpoint lands on the arc between the two opposite vertices, which is one third of the circumference. The probability is 1/3.

Method 2: random radial point. Pick a radius at random, then pick a point on that radius uniformly, then draw the chord perpendicular to the radius at that point. A chord at distance d from the centre of a circle of radius r has length 2√(r² − d²), which exceeds the triangle's side length √3·r exactly when d < r/2. So the chord is longer than a side iff the point lies within half the radius of the centre. The probability is 1/2.

Method 3: random midpoint. Pick a point at random inside the disk; it is the midpoint of a unique chord. The chord is longer than a side iff the midpoint lies within the concentric circle of radius r/2 (the triangle's inscribed circle), which has area 1/4 of the disk. The probability is 1/4.

(images from https://en.wikipedia.org/wiki/Bertrand_paradox_(probability) )

All three arguments seem reasonable. Each assumes a uniform prior over some unknown property of the chord (the positions of its endpoints, a point along a perpendicular radius, its midpoint), yet they lead to contradictory conclusions. The Principle of Indifference, which assigns a uniform prior in the face of radical uncertainty, cannot be applied coherently here, because it does not say which property should be treated as uniform. There is no privileged way of picking out a chord, and therefore there is no 'correct' answer to the question without knowing more about the generating process.
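The disagreement between the three methods can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original post, assuming a unit circle (so the triangle's side is √3). The function names, sample size, and seed are illustrative choices. Each sampler returns the length of one "random" chord under one of the three methods, and `estimate` reports the fraction of chords longer than the side.

```python
import math
import random

# Side length of an equilateral triangle inscribed in a unit circle.
SIDE = math.sqrt(3.0)


def chord_random_endpoints(rng):
    """Method 1: two independent uniform points on the circumference."""
    a = rng.uniform(0.0, 2.0 * math.pi)
    b = rng.uniform(0.0, 2.0 * math.pi)
    # Two points on the unit circle separated by angle (a - b)
    # are joined by a chord of length 2 * |sin((a - b) / 2)|.
    return 2.0 * abs(math.sin((a - b) / 2.0))


def chord_random_radial_point(rng):
    """Method 2: uniform distance d along a radius, chord perpendicular to it."""
    # The direction of the radius does not affect the chord length,
    # so only the distance from the centre is sampled.
    d = rng.uniform(0.0, 1.0)
    return 2.0 * math.sqrt(1.0 - d * d)


def chord_random_midpoint(rng):
    """Method 3: chord midpoint uniform over the disk (by area)."""
    # The square root of a uniform variate gives a distance from the
    # centre that is uniform by area rather than by radius.
    d = math.sqrt(rng.uniform(0.0, 1.0))
    return 2.0 * math.sqrt(max(0.0, 1.0 - d * d))


def estimate(sampler, n=200_000, seed=0):
    """Fraction of sampled chords that are longer than the triangle's side."""
    rng = random.Random(seed)
    hits = sum(sampler(rng) > SIDE for _ in range(n))
    return hits / n


if __name__ == "__main__":
    methods = [
        ("random endpoints", chord_random_endpoints),
        ("random radial point", chord_random_radial_point),
        ("random midpoint", chord_random_midpoint),
    ]
    for name, sampler in methods:
        print(f"{name}: {estimate(sampler):.3f}")
```

With enough samples the three estimates should settle near 1/3, 1/2, and 1/4 respectively. The only thing that differs between the samplers is which quantity is treated as uniform.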

AI Safety Projections

Many arguments about future AI systems implicitly rely on something like a uniform prior over an unknown property of superintelligent systems (the goals that they will have). This is, in some sense, an argument from ignorance - and I admit it should at least give us reason to be uncertain about what kinds of goals AI systems develop - but it provides a deceptively compelling intuition for why we should expect doom with high probability.

I don’t think all counting arguments are bad, or should never be used. But the real answer to Bertrand’s paradox is that the word ‘random’ is not meaningful without any knowledge of the structure of your sampling process. When you make a counting argument, you are implicitly projecting the weird complex minds that future AI systems will be into a lower-dimensional, more understandable subspace (e.g. the space of goals and world models, or analogously, the midpoint of a chord). Much like how Greenland looks almost as large as Africa on a Mercator projection, the apparent distribution over outcomes depends entirely on the projection you choose, and so can easily distort structure.

I wanted to write this up because I think that a lot of disagreements about alignment bottom out in differences in projections, but people often argue as if they’re disagreeing about the territory. I also think a lot of discourse in this space could be more productive if people tried to think more about the selection processes generating the relevant distributions, and to reason about why some projections are more or less valid than others. Fundamentally, I think we are in a position of deep uncertainty and confusion, and I am very sceptical of anyone who claims to be able to predict the motivation space of future AI systems with any kind of certainty, whether to justify optimism or pessimism.

Written during AFFINE Superintelligence Seminar. Thanks to Stefano Zutti for discussion and feedback.

williawa (4h ago)

"There is no privileged way of picking out a chord, and therefore there is no 'correct' answer to the question without knowing more about the generating process."

This doesn't quite make sense to me. This just kicks the problem one level up. There are many ways we could privilege one space to distribute our uncertainty over. Almost none of them have minds that value human happiness occupying a large share of the space.

It seems obvious to me that you have to make this concession, because your argument doesn't privilege human morality in any way. You could make the same argument to argue we should be "uncertain" about any aspect of AI motivation:

"Fundamentally, I think we are in a position of deep uncertainty and confusion, and I am very sceptical of anyone who claims to be able to predict the motivation space of future AI systems with any kind of certainty, whether to justify believing the AI will spend all its time [building anime cat-girls or not]"

"And I think a lot of discourse in this space could be more productive if people tried to think more about the selection processes generating the relevant distributions, and reason about why some projections are more or less valid than others."

I think all of prosaic alignment work is already about this, no?