Ideas for studies on AGI risk

dr_s

This post from a few days ago brought up the lack of preeminent AI papers on the topic of AI risk. The lack of such peer-reviewed studies is often brought up by critics as a way to dismiss the worries, so I'd like to consider, what are possible avenues to work on some pieces or the entirety of the problem and provide some empirical or theoretical arguments that support the conclusion of AI being risky (or possibly even shine light on possible risk mitigation strategies)? Stripped to the bone, the argument rests on four pillars:

Possibility: AGI and ASI can be physically built with realistic amounts of resources and powered with realistic amounts of energy by our civilization;
Orthogonality: no matter how smart, AGI and ASI are not guaranteed to converge spontaneously to a specific set of values (stronger version: even if they did converge to some values, these aren't guaranteed to be our values too);
Instrumental Convergence: for a substantial fraction of possible terminal goals, any agent develops instrumental goals such as power seeking and self-preservation;
Power: a smart enough agent that is allowed to interact with the world at all is able to overcome any restriction placed on it by much less smart agents, and becomes impossible to contain or defeat.

Each of these pillars could be explored individually in ways that at least try to assess the likelihood of things going badly. Anyone who wants empirical evidence that ASI, specifically, would cause an existential disaster is essentially in bad faith - we can't build the thing just to prove that the thing won't end the world - but anyone else should be more amenable to at least arguments by analogy and toy models illustrating the key problems and placing limits on the belief in AI X-risk. I'll suggest possible avenues that come to my mind and also try to steelman objections to each of the four pillars. I'd be happy if anyone came up with more ideas or wanted to develop mine further (with or without my help).

Possibility

Thesis: AGI and ASI can be physically built with realistic amounts of resources and powered with realistic amounts of energy by our civilization

The counterpoint here would be that either ASI is too expensive, requiring something like a planet-sized computer, or ASI is impossible, because humans are the smartest thing the universe allows to exist for whatever reason. The obvious avenue to go here IMO is considering the human brain. We know that the human brain is small and uses very little energy; this places an upper bound in that we know that AGI that is at least that cheap ought to be allowed by the rules of the universe (unless one believes in dualism and humans having souls that do most of the computation, but in that case all arguments will be moot).

It seems to me that a possible argument here would go along the lines of "evolution has to optimize under severe constraints; our technology does not have those constraints and almost always manages to do much better". You could in fact establish a set of comparisons in performance and efficiency between "best solution found by evolution" and "best solution found by human technology" for a number of categories:

land speed: cheetah vs Thrust SSC
water speed: black marlin vs Spirit of Australia
air speed: peregrine falcon vs SR-71 Blackbird
most efficient photon collection: sugar cane vs GaSb solar cells
best visual resolution
best sound detection
strongest grip

And so on, so forth. In general it's pretty clear who wins, and it'd be interesting to see statistics on by how much technology wins. It seems possible to make a fair, quantitative case for how an artificially built ASI might eventually surpass the human brain. Theoretical arguments could be made from thermodynamics about the minimum required energy use for given computations, energy dissipation with size, and so on. I'm sure they'd come out allowing much more than what the human brain does, but it'd be interesting to double check (if someone hasn't already).
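As a minimal sketch of the thermodynamic side of this, one can compute the Landauer bound (the minimum energy needed to erase one bit at a given temperature) and see how many irreversible bit operations per second a given power budget would allow at most. The 20 W power figure and the 310 K temperature are my own illustrative assumptions, roughly what is commonly quoted for the human brain and body; the function names are made up for the example. Note this only gives a ceiling: it says nothing about how many bit operations the brain actually performs.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def landauer_limit_joules(temperature_kelvin: float) -> float:
    """Minimum energy to erase one bit at the given temperature (Landauer bound)."""
    return K_B * temperature_kelvin * math.log(2)


def max_bit_erasures_per_second(power_watts: float, temperature_kelvin: float) -> float:
    """Upper bound on irreversible bit operations per second for a power budget."""
    return power_watts / landauer_limit_joules(temperature_kelvin)


# Assumed illustrative inputs, not taken from any measurement in this post.
BRAIN_POWER_W = 20.0
BODY_TEMPERATURE_K = 310.0

energy_per_bit = landauer_limit_joules(BODY_TEMPERATURE_K)
rate_bound = max_bit_erasures_per_second(BRAIN_POWER_W, BODY_TEMPERATURE_K)
print(f"Landauer limit at {BODY_TEMPERATURE_K:.0f} K: {energy_per_bit:.3e} J per bit")
print(f"Ceiling for a {BRAIN_POWER_W:.0f} W budget: {rate_bound:.3e} bit erasures per second")
```

Comparing this ceiling with any estimate of the brain's actual computation would give a rough idea of how much headroom physics leaves above what evolution achieved.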

A stronger counter-argument that could be made today is that since our current way to create intelligence in LLMs is based on extrapolating from human-produced text, this makes vastly superhuman general intelligence out-of-domain and potentially impossible to achieve with the current paradigm. I actually agree with that, but of course it doesn't mean that with so much research going on someone couldn't find ways to surpass the LLM paradigm, or exploit it in tandem with something else to remove its limitations. Not sure how to argue this though (especially not without giving people ideas).

Orthogonality

Thesis: no matter how smart, AGI and ASI are not guaranteed to converge spontaneously to a specific set of values (stronger version: even if they did converge to some values, these aren't guaranteed to be our values too)

I think this is more of a philosophical point, but an attempt at an empirical argument could be made too. My immediate instinct is to look again at the animal world, since that's the only sample of non-human intelligences that we possess. It could be useful to pick N species of varied intelligence, try to quantify that ability - I'd prefer a rough but objective metric like brain/body mass ratio rather than something too subjective, since measuring animal cognition is always hard - and then see if it correlates with any behaviours, in particular behaviours that we would consider repugnant and we would hope an ASI would not possess or endorse: cannibalism, infanticide, devouring the partner after mating, killing for reasons other than eating or self-defence and so on. The biggest problems I see with this are that:

it would be very sensitive to cherry picking certain species, so some kind of objective or randomized sampling approach should be used;
it would need to be controlled for factors like e.g. sociality: it's likely that sociality correlates with both large brains and less distasteful behaviour, but that's just looking at animals that are "human-like" in a sense, and goes against the point of the question.

I also think that seeing a lack of correlation would be stronger evidence for orthogonality than seeing correlation would be against it: this because all animals still evolved within the rules of the Earth biosphere, and many have brains descended from the same ancestor, so it's possible that they do share behavioural roots. I think this is pretty sound but also a critic could see it as a convenient way to wriggle out of the conclusions of the study if the outcome seems to run counter to orthogonality.

One could also make more theoretical arguments. Consider the following reductio ad absurdum thought experiment: say that orthogonality is false in a way that matters to us, and thus there is a set of "real" moral values that all beings converge to arbitrarily closely as they become smarter. These values include things like "sentient, intelligent beings should not be harmed without sufficient reason" (if they don't, we don't really care about a non-orthogonal world in which all ASI explicitly converge to wanting to kill us as a terminal value). So given those values, imagine an ASI smart enough that it embraces them fully. Then imagine duplicating this ASI perfectly so that now it's two different beings, labelled A and B. Each of them is given a choice to vote for one or the other. If both ASIs vote the same, the one that was voted for is destroyed, and the other survives. If they vote differently or refuse to vote, they are both destroyed. Both are motivated to minimize the number of deaths, and possibly avert their own death. How do they vote? In what sense can values be absolute if both selfish and selfless solutions in this symmetrical scenario still lead to different actual choices, and both ASIs dying? The argument or similar ones could be better expanded and developed to provide a stronger philosophical case for orthogonality.
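To make the structure of this thought experiment concrete, here is a small sketch that enumerates every combination of votes and then checks what happens when both copies run the same decision rule. The way I encode a "policy" (a function of one's own label and the other's label) is my own assumption about how to formalise "perfect duplicates reasoning identically", not something the argument strictly requires.

```python
from itertools import product

AGENTS = ("A", "B")
CHOICES = ("A", "B", None)  # None stands for refusing to vote


def survivors(vote_a, vote_b):
    """Return the set of agents left alive under the rules of the thought experiment."""
    if vote_a is not None and vote_a == vote_b:
        # Both voted for the same agent: that one is destroyed, the other survives.
        return set(AGENTS) - {vote_a}
    # Different votes, or at least one refusal: both are destroyed.
    return set()


def symmetric_policy_outcome(policy):
    """Both copies run the same policy, a function of (own label, other label)."""
    return survivors(policy("A", "B"), policy("B", "A"))


# Hypothetical decision rules that only look at "me" versus "the other one".
POLICIES = {
    "selfish (vote for the other)": lambda me, other: other,
    "selfless (vote for myself)": lambda me, other: me,
    "refuse to vote": lambda me, other: None,
}

for vote_a, vote_b in product(CHOICES, repeat=2):
    print(f"A votes {vote_a}, B votes {vote_b} -> survivors: {sorted(survivors(vote_a, vote_b))}")

for name, policy in POLICIES.items():
    print(f"{name}: survivors {sorted(symmetric_policy_outcome(policy))}")
```

Any rule that depends only on the roles "me" and "the other" makes the two copies cast different votes, so both are destroyed. A survivor only appears if the rule treats the arbitrary labels themselves as relevant (for example `lambda me, other: "A"`), which is exactly the kind of symmetry breaking that a set of absolute values gives no obvious reason to prefer.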

The steelman here would probably rest on objections based around how humans seem to converge to certain values with intelligence. A counterpoint might be to study humans throughout history and different cultures, control for education and other background factors, to verify whether this seems consistently true or if it's just a by-product of specific cultural trends of our post-Enlightenment world. In other words, do smart humans converge to broader spheres of concern as a rule, or do they just converge to whatever their society considers the highest form of morality, and this is simply what our current society promotes? The biggest difficulties seem to me to be quantifying intelligence just from someone's known written works or life facts, and making sure we correct for sample bias (if you were very smart but born a slave in Roman times odds were you wouldn't have as many chances to leave us written works as if you were an aristocrat instead). The sociality exception for animals could also suggest a possible alignment strategy: maybe one could gradually evolve ASIs in a social environment - some sort of cooperative, instead of adversarial, training - and thus lead them to develop a similar correlation. It seems reasonable, and might be worth investigating, that the mechanism through which intelligence can make humans more empathetic is the expansion of an instinctual "circle of concern" to all sentient beings via increasing rational understanding; this might be a potentially reproducible phenomenon even in artificial conditions.

Instrumental convergence

Thesis: for a substantial fraction of possible terminal goals, any agent develops instrumental goals such as power seeking and self-preservation

This I think is the thesis that has most been tested already. The obvious road I can think of is some kind of agent-based simulation or game, either solved theoretically or played by AI, in which power-seeking is a potential strategy, to see what players with different goals converge to. A quick search finds for example this model, as well as this paper. Overall I think it's likely that we have enough material to cover this point already pretty convincingly. Further tests could be tried both in the experimental direction (for example, something akin to this study that used GPT-3.5 turbo to instantiate multiple agents, but in a context of competition for resources), as well as in the theoretical model.

I think the simplest possible model for this problem could be a world full of a conserved resource R with two agents, A and B, each defined by a "capability", $α$ and $β$ respectively. Each is trying to create some "product", $P_{A}$ or $P_{B}$ , and has a time-dependent strategy function, $s_{A} (t)$ and $s_{B} (t)$ , taking values between 0 and 1, which determines how much of its efforts go towards making the product (its terminal goal) and how much towards amassing power (which multiplies its effectiveness at making the product). So for example you would have equations:

$\frac{\partial P_{A}}{\partial t} = α s_{A} (t) R_{A}, \qquad \frac{\partial P_{B}}{\partial t} = β s_{B} (t) R_{B}$

$\frac{\partial R_{A}}{\partial t} = α (1 - s_{A} (t)) R_{B} - β (1 - s_{B} (t)) R_{A}, \qquad \frac{\partial R_{B}}{\partial t} = - \frac{\partial R_{A}}{\partial t}$

You can then find optimal strategies by variationally optimizing for the strategy functions in order to maximize $P_{A}$ and $P_{B}$ at a certain time horizon $T$ . I'm sure this could be done numerically though I haven't tried yet; not so sure about it being possible analytically. Either way, though, it would be just a very simplified toy model. The most interesting insight I could imagine from it is that I'd expect the strategy to depend on the time horizon: for long term plans, accumulating power seems ideal, whereas for short term ones it might be best to throw everything into making lots of product with whatever resources you start with. This is analogous to the effect of the discount rate on the POWER model from Harris and Suo.
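A rough numerical sketch of this toy model, under strong simplifying assumptions of my own: strategies are restricted to a single switch ("amass power until time τ, then only make product"), the equations are integrated with a plain forward Euler scheme, and only A's best response to a fixed B is searched for on a grid, rather than a full variational optimum or an equilibrium. All function and parameter names are hypothetical.

```python
def simulate(alpha, beta, switch_a, switch_b, horizon, r_a0=1.0, r_b0=1.0, dt=1e-2):
    """Integrate the toy model with forward Euler and return (P_A, P_B) at the horizon.

    Each agent puts all effort into power (s = 0) before its switch time
    and all effort into product (s = 1) afterwards.
    """
    r_a, r_b, p_a, p_b = r_a0, r_b0, 0.0, 0.0
    for i in range(int(round(horizon / dt))):
        t = i * dt
        s_a = 0.0 if t < switch_a else 1.0
        s_b = 0.0 if t < switch_b else 1.0
        dp_a = alpha * s_a * r_a
        dp_b = beta * s_b * r_b
        # Resource taken by A from B minus resource taken by B from A.
        dr_a = alpha * (1.0 - s_a) * r_b - beta * (1.0 - s_b) * r_a
        p_a += dp_a * dt
        p_b += dp_b * dt
        r_a += dr_a * dt
        r_b -= dr_a * dt  # total resource R_A + R_B is conserved
    return p_a, p_b


def best_switch_time(alpha, beta, switch_b, horizon, n_grid=51):
    """Grid search for the switch time of A that maximizes P_A at the horizon."""
    candidates = [horizon * k / (n_grid - 1) for k in range(n_grid)]
    return max(candidates, key=lambda tau: simulate(alpha, beta, tau, switch_b, horizon)[0])


# B is assumed to be a pure producer (switch time 0), with equal capabilities.
for horizon in (0.5, 1.0, 2.0, 5.0, 10.0):
    tau = best_switch_time(alpha=1.0, beta=1.0, switch_b=0.0, horizon=horizon)
    print(f"T = {horizon:>4}: A's best switch time is about {tau:.2f}")
```

In this restricted setting, with B never contesting resources, a short enough horizon makes it best for A to produce from the start, while longer horizons favour an initial phase of pure power accumulation. A more serious version would let both agents optimize simultaneously and allow arbitrary $s(t)$.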

I'm honestly not sure what the steelman for the opposition to this argument could be. Perhaps one could focus on the ways in which complexity can change the outcomes of these models. In reality, for example, for an extremely capable agent, there always is the option of reward hacking, which should require minimal effort compared to actually pursuing one's stated goal. However, even reward hacking might benefit from power-seeking behaviour (for example to ensure your hack isn't fixed). There's also the question of how the goals of the AI would be determined in the first place. Working with LLMs at the very least for now shows that they do not seem to be generally laser-focused on a specific goal, and perhaps this kind of "noisy", erratic behaviour can be compared to humans' and be argued as a mitigating influence on extreme power seeking. It is not clear however why this would be true also of smarter and more well-targeted systems.

Power

Thesis: a smart enough agent that is allowed to interact with the world at all is able to overcome any restriction placed on it by much less smart agents, and becomes impossible to contain or defeat.

This part is I think the one that's most uncertain, and the one where I am less convinced of the doom argument. I think "power" really splits in two parts: the ability of the AGI to improve itself to superintelligence, and the ability of a superintelligence to overcome virtually any obstacle as long as it's smart enough. People seem to fall on very different points on the spectrum of this, even people agreeing with the other premises; Eliezer obviously thinks that intelligence would be powerful enough to make an ASI nigh omnipotent, others put more trust in strategies like boxing. It seems to me that in their most theoretical form, arguments here would be about chaos theory. If one thinks of reality as encoded by a certain state vector which evolves according to complex laws, and an intelligent agent is able to control only the $n$ most significant digits of a few components of this vector, then its ability to manipulate it to end close to a certain state in the future will be limited; the effect of the non-controlled components as well as the less significant digits will eventually cause the trajectories to become unpredictable and diverge exponentially. However, this theoretical argument only sets the obvious boundaries - that no matter how smart an agent is, its intelligence will have diminishing returns if it controls a very small part of the world's state, and if the world is highly chaotic. But that's not very useful in practice since we don't know how to actually estimate how small is small enough for our world.
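The "significant digits" point can be illustrated with a standard chaotic toy system. The sketch below uses the logistic map at $r = 4$ purely as a stand-in of my choosing (it is obviously not a model of the world): two trajectories start out agreeing to a given number of decimal digits, and we count the steps until they differ by more than a tolerance. The function names, starting point and tolerance are all arbitrary assumptions.

```python
def logistic(x, r=4.0):
    """One step of the logistic map, a textbook example of a chaotic system."""
    return r * x * (1.0 - x)


def prediction_horizon(x0, digits, tolerance=0.1, max_steps=10_000):
    """Steps until two trajectories agreeing to `digits` decimals differ by > tolerance."""
    x, y = x0, x0 + 10.0 ** (-digits)
    for step in range(1, max_steps + 1):
        x, y = logistic(x), logistic(y)
        if abs(x - y) > tolerance:
            return step
    return max_steps  # never diverged within the allotted steps


# Double precision floats carry about 16 digits, so we stop well short of that.
for digits in (2, 4, 6, 8, 10, 12):
    print(f"{digits:>2} controlled digits -> horizon of {prediction_horizon(0.3, digits)} steps")
```

Because the separation can grow by at most a factor of 4 per step in this map, each extra controlled digit buys only a roughly constant number of additional predictable steps: exponentially more precision for linearly more foresight. How this translates to the real world, where the agent can also keep observing and correcting, is exactly the part the toy system cannot tell us.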

At the other end of the spectrum, practical arguments might focus on actually trying to envision ways in which an intelligence that's only slightly superhuman could seek power and gain it even from a strong disadvantage (for example, only digital presence with no robotics). The problem with this approach is that it can not easily extend to truly superhuman intelligence, since we can't trust ourselves to guess its potential strategies, and that it also includes actively looking for and designing essentially global asymmetric warfare strategies - things that if speculative are useless, and if practical and potentially doable are dangerous to put out there where both AIs and other people can read them.

In the middle might be another empirical approach based on comparative analysis of animal species. In this case it might be interesting to correlate how intelligence enabled other species to exert control over their environment. The problems with this are:

not only is it hard to guess the intelligence of a specific animal, but this analysis would require estimating the collective intelligence of a species, or at least of its social units. Humans aside, ants are another obvious example of a species that does acquire a lot of power over its environment, yet the individuals are fairly stupid, and it's only via swarm interactions within a colony that a greater intellect emerges;
it's equally hard to quantify impact on the environment. Do beavers have more impact than wolves because they build dams? But wolves don't need to build dams. One could argue that for example predators exert their power by establishing and successfully defending a territory. Power-seeking and resource control in nature comes in a wide variety of ways.

If one could design a half-decent methodology, though, it would be interesting to see whether there's some trend suggesting diminishing returns or instead exponential gains from intelligence in terms of ability to exert power. I suspect the latter; in fact I suspect that for almost any metric, humans would appear as a drastic outlier anyway. Nor is there in principle a strict guarantee that the returns would follow some kind of uniform function, so this would remain a fairly weak sort of evidence.

The steelmanned objection I'd expect the most is that there are no grounds to believe that we can't put even a superintelligence in such a position of disadvantage that it simply can't escape fruitfully. After all, any AI that wanted to destroy us as an instrumental goal would also need to guarantee its continued existence, which is a much stricter set of conditions. If it was kept fairly boxed in, simply sneaking some deadly device in the designs of an otherwise innocuous device used as a trojan horse would not be sufficient if it couldn't also rapidly establish a foothold in the physical world to maintain and repair itself. In fact, I don't necessarily disagree with that. While manipulation, bribery, deceit, and trojan horses are all viable tactics, I don't believe the control they would allow an AI to exert would be absolute enough to guarantee its success, iff its options were very limited. And therein lies the problem, because if we were in the kind of mindset where the AGI gets built on a desert island, in an airgapped facility, with lockdown protocols, allowed communication only shortly with rotating researchers periodically subject to psychological examinations and never exposed to contact with it for too long, then I'd say that we might have good chances of resisting its attempts to seek power, or rather, put it in a condition in which it might judge them unviable in the first place. But that's very far from the current condition, a situation in which something as basic as "please don't put the AI in full control of our nuclear arsenal" even needs to be said (though hey, at least it was said! Baby steps). If we had that sort of safety mindset, I'd say all but the most pessimistic would be pretty reassured about AI X-risk.

Conclusion

There's not much I can conclude other than that I hope some of these ideas sounded like good starting points, or even stimulated other ideas (even if just to argue that mine are stupid), for some of you who read them. I'd love to hear opinions and suggestions, and if any of these things became real studies, I'd love to be involved in some capacity (I'm not a researcher any more, but I have some fair knowledge of scientific simulation, modelling and data science). While I don't think peer reviewed science is the be-all end-all of human knowledge, trying to ground one's thinking in empirical and theoretical bases as much as possible is never a bad exercise, and if it helps the worries of AI safety gain some credibility, or even suggests some possible solutions, all the better. I'd say if AI safety worries are to be taken seriously, right now, this sort of systematic work might be a necessary step.