Considerations on interaction between AI and expected value of the future

Beth Barnes

Some thoughts about the ‘default’ trajectory of civilisation and how AI will affect the likelihood of different outcomes.

Is the default non-extinction outcome utopic or dystopic?

Arguments for dystopia:

The world seems dystopic in many ways now
We’re getting better at various types of manipulative tactics, e.g. persuasion, marketing, creation of addiction and dependence, harnessing of tribalism. These things cause people’s actions to depart from what would best achieve their ‘true’ values. When this happens a lot, there is no reason the world should move in a direction that people think is broadly good. The things that happen in the world will no longer reflect human values; they will reflect the results of competition for resources and influence taking place between non-human actors (e.g. corporations, political parties), who are able to effectively control human actions.
Most creatures have existed in a dystopic Malthusian struggle, where the size of the population is kept stable by large numbers dying due to lack of resources. This is the state for all wild animals, and was the state for almost all of human history. We should probably consider this the default state. We should view the current state, where there are surplus resources for people to live much more comfortably than ‘barely not dying’ and the population is controlled by our reproductive choices rather than starvation or conflict, as an exception that will probably revert to normal, rather than as a permanent state.
Altruistic or broadly benevolent values are not the default; they’re pretty recent and rare. The fraction of human history where our moral circles have been wide enough to include humans of other races or nationalities is extremely small. Most current people’s moral circles only patchily include animals: whether people feel any empathy depends on the exact species and circumstance. It seems like in many traditional societies, killing or harming a member of a rival group was considered not just neutral but good, something virtuous and something to be celebrated.
Technology is making the circumstances of our lives increasingly distant from the sort of settings humans are adapted for. We’ll find ourselves in settings where the environment is too unnatural to successfully trigger happiness, fulfilment, or empathy.
There are positional goods and values that involve people benefiting from the misfortune of others, or from authority and dominion over them. Current people desire various positional goods like relative social position, relative wealth, relative dating success, etc. Historically, value systems which directly value the submission or suffering of others seem common. It seems to have been pretty standard, for most of human history, to have your value system include how many people you’ve conquered or dominated, and a key way to prove (to others or for your own satisfaction) that you’ve successfully conquered someone is to force them to do something they don’t want to do.
Selection favours certain kinds of values that are not the values we’d want in a utopia - those with values of dominance, conquering and proliferation will survive and spread more effectively than those with values of pacifism, altruism and benevolence

Arguments for utopia:

All else being equal, most (current, westernish?) people would prefer the world to be generally good for (at least most of) the sentient beings in it. People will generally try to shape the world in a better direction, as long as it’s not too costly to themselves to do this. Technological progress will make it increasingly cheap and easy to ensure all sentient beings have good lives
Life for most people has mostly been getting better, and benevolent values have been becoming more common
Most times when people have claimed civilisation is going downhill they have been wrong
Under many circumstances, having values of pacifism, altruism and benevolence actually does outcompete aggressive and selfish values, because those with cooperative values are better able to coordinate, be trustworthy, and avoid fighting among themselves.

How does AI affect these considerations?

Ways AI can make things worse:

Disrupting utopia arguments:

Affects (1):

Disrupts ‘people want thing to happen -> thing happens’, by making the world more confusing and difficult to steer, or by resource competition (AI taking resources from humans)
Disrupts ‘most people want x -> x happens’, by enabling more concentration of power. Currently, society is steered by the aggregate of many people’s values and preferences; this aggregate is more moderate and more reliably benevolent than a random individual’s values. We might lose this property if AI enables ‘single sociopath wants thing to happen -> thing happens’.

Disrupts (2+3) if AI is qualitatively different from past technological change and therefore breaks previous patterns

Strengthening dystopia arguments:

(2) AI is likely to make us much better at manipulation - it will allow more intelligently optimised, larger scaled and more personalised targeting of persuasion and other tactics that decouple people’s actions from the things that they ‘really’ value

(4+5) AIs that are moral patients but don’t trigger empathy, or seem like moral patients but are actually not, are going to create murky and confusing ethical territory, increasing the risk of moral catastrophe.

(4+5) AI making the environment more strange and unnatural risks breaking whatever is causing people to have broadly altruistic values

(3+7) AI provides a new and faster-moving ecosystem for selection to take place in (e.g., among individual models or agents, among automated companies, etc.), which will increase the strength of this effect relative to other things that influence the trajectory of the world (e.g., that most people don’t want the world to be taken over by whatever corporation is most ruthless). This both increases the probability that the world will be dominated by whichever actor is most ruthless, and increases the probability that we’ll end up in a Malthusian struggle.

(7) AI capabilities increase the influence gap a group can obtain by being more ruthless. If there are more powerful tools on the table to grab, the most grabby people will outcompete others by a larger margin

Ways AI can make things better:

Strengthening utopia arguments

AI is another technology, and as such it will enable humans to better understand and control the world. Scientific progress and economic growth resulting from AI progress will make it cheaper and easier to provide for the needs of sentient beings, and to obtain things we want without harming sentient beings.

Humans overall mostly do things that are in their interests. If we, as a society, develop and deploy an AI capability, that is evidence that the capability does in fact make the world better

Weakening dystopia arguments

(1 + 3) If AI changes the world radically, then maybe current dystopic aspects will disappear. For example, a singleton would eliminate coordination problems, and even a widely trusted advisor would eliminate many coordination problems.

(2) As well as improving manipulation, AI tools can also increase individual people’s ability to find, process, and understand information. AI could vastly improve the quality of education, and therefore people’s judgement and thinking skills. It could improve people’s control of what content they interact with

(4) AI can reduce scarcity and competition, and improve education and availability of information, both of which are likely to increase the frequency of benevolent and altruistic values. AI can help us reflect on and refine our values.

I think the default non-extinction outcome is a singleton with near miss at alignment creating large amounts of suffering.

I'm surprised. Unaligned AI is more likely than aligned AI even conditional on non-extinction? Why do you think that?

I think alignment is finicky, and there's a "deep pit around the peak" as discussed here.

I am skeptical. AFAICT the typical attempted-but-failed alignment looks like one of the following two:

Goodharting some proxy, such as making the reward signal go on instead of satisfying the human's request in order for the human to press the reward button. This usually produces a universe without people, since specifying a "person" is fairly complicated and the proxy will not be robustly tied to this concept.
Allowing a daemon to take over. Daemonic utility functions are probably completely alien and also produce a universe without people. One caveat is: maybe the daemon comes from a malign simulation hypothesis and the simulators are an evolved species so their values involve human-relevant concepts in some way. But it doesn't seem all that likely. And, if it turns out to be true, then a daemonic universe might as well happen to be good.

These involve extinction, so they don't answer the question of what the most likely outcome is conditional on non-extinction. I think the answer there is a specific kind of near-miss at alignment which is quite scary.

My point is that Pr[non-extinction | misalignment] << 1, Pr[non-extinction | alignment] = 1, Pr[alignment] is not that low and therefore Pr[misalignment | non-extinction] is low, by Bayes.
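
To make the Bayes step concrete, here is a minimal sketch that treats alignment as a binary outcome (aligned vs. misaligned) and computes Pr[misalignment | non-extinction] from the three quantities named above. The function name and all numbers are illustrative assumptions, not estimates anyone in this thread has given; the point is only to show how strongly the conclusion depends on Pr[alignment].

```python
def p_misaligned_given_survival(p_aligned,
                                p_survive_given_misaligned,
                                p_survive_given_aligned=1.0):
    """Bayes' rule for Pr[misalignment | non-extinction].

    Assumes exactly two cases, aligned or misaligned, so
    Pr[misalignment] = 1 - Pr[alignment].
    """
    p_misaligned = 1.0 - p_aligned
    # Joint probabilities of each case together with non-extinction.
    survive_and_misaligned = p_survive_given_misaligned * p_misaligned
    survive_and_aligned = p_survive_given_aligned * p_aligned
    return survive_and_misaligned / (survive_and_misaligned + survive_and_aligned)


# Hypothetical inputs: alignment "not that low", survival rare under misalignment.
optimistic = p_misaligned_given_survival(p_aligned=0.3,
                                         p_survive_given_misaligned=0.05)

# Hypothetical inputs: same survival rate, but alignment very unlikely.
pessimistic = p_misaligned_given_survival(p_aligned=0.01,
                                          p_survive_given_misaligned=0.05)

print(f"Pr[misaligned | survive], Pr[aligned]=0.30: {optimistic:.2f}")
print(f"Pr[misaligned | survive], Pr[aligned]=0.01: {pessimistic:.2f}")
```

With these made-up inputs the first case comes out at roughly 0.10 and the second at roughly 0.83, so the same formula supports either conclusion depending on how low Pr[alignment] is taken to be.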

To me it feels like alignment is a tiny target to hit, and around it there's a neighborhood of almost-alignment, where enough is achieved to keep people alive but locked out of some important aspect of human value. There are many aspects such that missing even one or two of them is enough to make life bad (complexity and fragility of value). You seem to be saying that if we achieve enough alignment to keep people alive, we have >50% chance of achieving all/most other aspects of human value as well, but I don't see why that's true.

I think where we differ is that I think Pr[full alignment] is extremely low, and there is quite a lot of space for non-omnicidal partial misalignment.

This seems to be missing what I see as the strongest argument for "utopia": most of what we think of as "bad values" in humans comes from objective mistakes in reasoning about the world and about moral philosophy, rather than from a part of us that is orthogonal to such reasoning in a paperclip-maximizer-like way, and future reflection can be expected to correct those mistakes.

future reflection can be expected to correct those mistakes.

I'm pretty worried that this won't happen, because these aren't "innocent" mistakes. Copying from a comment elsewhere:

Why did the Malagasy people have such a silly belief? Why do many people have very silly beliefs today? (Among the least politically risky ones to cite, someone I’ve known for years who otherwise is intelligent and successful, currently believes, or at least believed in the recent past, that 2⁄3 of everyone will die as a result of taking the COVID vaccines.) I think the unfortunate answer is that people are motivated to or are reliably caused to have certain false beliefs, as part of the status games that they’re playing. I wrote about one such dynamic, but that’s probably not a complete account.

From another comment on why reflection might not fix the mistakes:

many people are not motivated to do “rational reflection on morality” or examine their value systems to see if they would “survive full logical and empirical information”. In fact they’re motivated to do the opposite, to protect their value systems against such reflection/examination. I’m worried that alignment researchers are not worried enough that if an alignment scheme causes the AI to just “do what the user wants”, that could cause a lock-in of crazy value systems that wouldn’t survive full logical and empirical information.

One crucial question is, assuming AI will enable value lock-in when humans want it, will they use that as part of their signaling/status games? In other words, try to obtain higher status within their group by asking their AIs to lock in their morally relevant empirical or philosophical beliefs? A lot of people in the past used visible attempts at value lock in (constantly going to church to reinforce their beliefs, avoiding talking with any skeptics/heretics, etc.) for signaling. Will that change when real lock in becomes available?

Yeah, I'm particularly worried about the second comment/last paragraph - people not actually wanting to improve their values, or only wanting to improve them in ways we think are not actually an improvement (e.g. wanting to have purer faith)

Is this making a claim about moral realism? If so, why wouldn't it apply to a paperclip maximiser? If not, how do we distinguish between objective mistakes and value disagreements?

I interpreted steven0461 to be saying that many apparent "value disagreements" between humans turn out, upon reflection, to be disagreements about facts rather than values. It's a classic outcome concerning differences in conflict vs. mistake theory: people are interpreted as having different values because they favor different strategies, even if everyone shares the same values.

ah yeah, so the claim is something like 'if we think other humans have 'bad values', maybe in fact our values are the same and one of us is mistaken, and we'll get less mistaken over time'?

I guess I was kind of subsuming this into 'benevolent values have become more common'

I tend to want to split "value drift" into "change in the mapping from (possible beliefs about logical and empirical questions) to (implied values)" and "change in beliefs about logical and empirical questions", instead of lumping both into "change in values".

most of what we think of as "bad values" in humans comes from objective mistakes in reasoning

Could the same also be true about most "good values"? Maybe people just make mistakes about almost everything.

My sense is that most would-be dystopian scenarios lead to extinction fairly quickly. In most Malthusian situations, ruthless power struggles... humans would be a fitness liability that gets optimised away.

The way this doesn't happen is if we have AIs with human-extinction-avoiding constraints: some kind of alignment (perhaps incomplete/broken).

I don't think it makes much sense to reason further than this without making a guess at what those constraints may look like. If there aren't constraints, we're dead. If there are, then those constraints determine the rules of the game.

It sounds like you're implying that you need humans around for things to be dystopic? That doesn't seem clear to me; the AIs involved in the Malthusian struggle might still be moral patients

Sure, that's possible (and if so I agree it'd be importantly dystopic) - but do you see a reason to expect it?
It's not something I've thought about a great deal, but my current guess is that you probably don't get moral patients without aiming for them (or by using training incentives much closer to evolution than I'd expect).

I guess I expect there to be a reasonable amount of computation taking place, and it seems pretty plausible a lot of these computations will be structured like agents who are taking part in the Malthusian competition. I'm sufficiently uncertain about how consciousness works that I want to give some moral weight to 'any computation at all', and reasonable weight to 'a computation structured like an agent'.

I think if you have malthusian dynamics you *do* have evolution-like dynamics.

I assume this isn't a crux, but fwiw I think it's pretty likely most vertebrates are moral patients

I agree with most of this. Not sure about how much moral weight I'd put on "a computation structured like an agent" - some, but it's mostly coming from [I might be wrong] rather than [I think agentness implies moral weight].

Agreed that malthusian dynamics gives you an evolution-like situation - but I'd guess it's too late for it to matter: once you're already generally intelligent, can think your way to the convergent instrumental goal of self-preservation, and can self-modify, it's not clear to me that consciousness/pleasure/pain buys you anything.

Heuristics are sure to be useful as shortcuts, but I'm not sure I'd want to analogise those to qualia (??? presumably the right kind would be - but I suppose I don't expect the right kind by default).

The possibilities for signalling will also be nothing like that in a historical evolutionary setting - the utility of emotional affect doesn't seem to be present (once the humans are gone).
[these are just my immediate thoughts; I could easily be wrong]

I agree with its being likely that most vertebrates are moral patients.

Overall, I can't rule out AIs becoming moral patients - and it's clearly possible.
I just don't yet see positive reasons to think it has significant probability (unless aimed for explicitly).

Thanks, that's interesting, though mostly I'm not buying it (still unclear whether there's a good case to be made; fairly clear that he's not making a good case).
Thoughts:

Most of it seems to say "Being a subroutine doesn't imply something doesn't suffer". That's fine, but few positive arguments are made. Starting with the letter 'h' doesn't imply something doesn't suffer either - but it'd be strange to say "Humans obviously suffer, so why not houses, hills and hiccups?".
We infer preference from experience of suffering/joy...:
[Joe Xs when he might not X] & [Joe experiences suffering and joy] -> [Joe prefers Xing]
[this rock is Xing] -> [this rock Xs]
Methinks someone is petitioning a principii.
(Joe is mechanistic too - but the suffering/joy being part of that mechanism is what gets us to call it "preference")
Too much is conflated:

In particular, I can aim to x and not care whether I succeed. Not achieving an aim doesn't imply frustration or suffering in general - we just happen to be wired that way (but it's not universal, even for humans: we can try something whimsical-yet-goal-directed, and experience no suffering/frustration when it doesn't work). [taboo/disambiguate 'aim' if necessary]
1. There's no argument made for frustration/satisfaction. It's just assumed that not achieving a goal is frustrating, and that achieving one is satisfying. A case can be made to ascribe intentionality to many systems - e.g. Dennett's intentional stance. Ascribing welfare is a further step, and requires further arguments.
  Non-achievement of an aim isn't inherently frustrating (cf. Buddhists - and indeed current robots).
  1. The only argument I saw on this was "we can sum over possible interpretations" - sure, but I can do that for hiccups too.
