Superpersuasion[1] is one of the few ways a non-embodied AI system could bootstrap itself into causing harm in the physical world. It's not obvious that superpersuasion is even possible. To quote Bryan Caplan:
I’ve got great confidence that the most incredible intelligence of the universe is not going to be able to construct 10 words that will make me kill myself. I don’t care how good your words are. Words don’t work that way.
I'm inclined to agree with this, but there's a question of how persuasive an AI-in-a-box could actually be. Clearly (I believe) it could persuade you in a 5-minute conversation that 2 + 2 = 4, but not that you should kill your own grandmother.
You may think it could persuade you to kill your own grandmother because, as a superintelligence, it can apply tricks that are completely incomprehensible to us and achieve arbitrary effects. This may well be true, but it's not self-evidently true. There are situations where complete superiority in intelligence can't outweigh limitations in I/O: the information the AI has to work with, and the set of actions it can take.
For instance, if a far superhuman chess AI can only use half its pieces, then it will lose to a mediocre human player. If a far superhuman grandmother-killing AI can only say 500 words, it seems like it would struggle to convince a mediocre human. Whereas if it had legs and a gun it could obviously get much further.
The purpose of this post is:
To point out that merely being very intelligent doesn't necessarily let you achieve arbitrary outcomes through one text or audio channel (DONE).
To lay out levels of (super)persuasion, for people to disagree over which is the highest possible.
To say that I would be surprised if anything above "4a. Super-personalised or super-numerous charismatic leader" is possible (DONE).
Note: Many arguments under "Case for/against" could apply at multiple levels. I've put each argument at the level where I think it is a crucial condition for that specific level.
Note 2: This is an Inkhaven post and as such, it was written in one day. I'm not trying to excuse anything, but I'm just telling you it was written in one day.
1. Subhuman
Can persuade you of some things, but not as well as someone in your personal life or a reasonably good salesman. Perhaps it's very good at persuading you in AI-like contexts, e.g. convincing you of a well-supported concept it is explaining. But not in a fully general context, for instance convincing you to take it on a romantic getaway.
Case for this being achievable
Even without thinking of concrete examples, I feel like this has already been achieved.
Case against this being achievable
Since I think this has already been achieved, I'm taking this section to mean "reasons it might not get that close to human level".
The "it's clearly an AI trapped in a computer" problem: When cinemas were first invented, a grainy video of a train coming towards the screen was enough to make people crouch in their seats. As film technology got better, you might think it would get to the level where people called their loved ones and ran for the exits, but it didn't. The reason is simple: People know they're going to the cinema to watch a film.
Similarly, take the question "Why can my mother convince me to send her flowers on Mother's Day, but Claude can't?" Well, for one thing, I know my mother is my mother, whereas Claude is a genie that lives in the computer. For another, if Claude starts to berate me, I can simply close my laptop and walk away, whereas my mother will be sad the next time I see her. The same may apply to AI convincing people of things generally (see "Appendix: Fingerprint argument" for a more general form of this argument).
A counter-counter argument: Some people argue that various "AI box experiments" have shown that it is quite doable for a text-only AI to persuade its way into the real world, even with human level intelligence. To this I again fall back to the fingerprint argument: It was obvious to the people in those experiments that they were talking to Eliezer Yudkowsky over email, and not guarding a superintelligent AI in a box. In the real world, prisons exist, and people generally aren't able to talk their way out of them, despite imperfect security measures. In my view this shows that when you really try to construct a system to achieve the goal, it is not too hard to keep a human-level intelligence in a box[2].
2. Ordinary human
Can persuade you of things about as well as someone in your personal life who really wants something, or the best car salesman at your local dealership.
Case for this being achievable
Human-level persuasion is clearly achievable, humans do it. If the sub-human arguments above don't bite especially hard, this could be achieved. And of course the AI could offset its limitations in some areas with benefits in another (covered in higher tiers).
Off-the-shelf AI benefits from "expert level" training on everything, which humans lack. Most humans invest no more than ~1% of their time in becoming very persuasive, so even with some structural disadvantages an AI could end up better than most humans.
Case against this being achievable
Persuasion in humans often involves a kind of quid pro quo, where you give people what they want so they give you what you want. The ability to give people what they want depends on more than just intrinsic properties. By sheer intelligence alone, an AI can't give you money, it can't give you exciting interpersonal experiences, it can't give you social status. A mediocre car salesman can at least give you a ride in a cool car, an employer can give you money and a high status position, someone you meet in a bar can give you an exciting romantic fling. If this is the 80/20 of persuasion, then an AI may not be able to get very far through intrinsic persuasive ability.
3. Charismatic leader
Can persuade you of things about as well as the most persuasive people in history (Steve Jobs, Hitler, Rasputin). Can instill in you goals that are completely different to what you would have aimed for otherwise. Not just "Ford vs Toyota" as a suburban dad, but devoting your life to a cause the AI convinces you of.
Case for this being achievable
Again, humans appear to be able to achieve this, at least in some people, in some places, some of the time. If no argument above bites, then it should be possible in AI.
In terms of the "give people what they want" argument, an advanced AI does have some disadvantages (no money by default, no physical form by default), but it also has a huge advantage in being more intelligent than a human, and this alone could give it a strong ability to give people what they want. Many people already seem to run their life by Claude, and this role as a trusted authority and advisor gives a strong basis for persuasion on its own.
In my view a particularly strong angle here is for an AI to run a company or government very well, much better than a human can. This could be achieved without having fantastic skills in pure persuasion, but could still give it a lot of power to make people do what it wants.
Case against this being achievable
This is right at the limit of what is known to be possible, since few humans can achieve "charismatic leader" level persuasion.
In humans, this apparent high level of persuasion may be more situational than dependent on literally one person being very good at convincing people of things. To illustrate what I mean:
Carl Jung described Hitler as (paraphrase) "the mouthpiece of the collective unconscious". I.e. his idea was that it wasn't so much that Hitler came up with his ideas and did a very good job of convincing people of them, but that people were crying out for a Hitler-like leader (the implication being that someone else would have filled the slot in any case).
Steve Jobs ~invented the personal computer and the smartphone. Arguably these would have been invented eventually anyway, and someone else would have taken the spoils. While he was renowned as a great salesman and as having a "reality distortion field", this didn't actually spill over into big effects outside selling personal computers and smartphones.
Rasputin achieved a lot of influence over the Tsar and Tsarina of Russia, but it's generally agreed that this was in large part due to their vulnerability: Their belief he could heal their haemophiliac son, the Tsarina being foreign-born and isolated, the Tsar being generally "wet". And, other stronger willed people saw this and put a stop to it by murdering him.
Overall, I think it's likely that most notoriously charismatic leaders emerge in a way that is highly situational, and it's not the case that you can print 1,000 Chairman Maos and have them market your course.
Additionally, if you look at cults of personality in history, they depend a lot on having "a poster on every street corner" or other kinds of distribution (e.g. Steve Jobs' keynotes). AI doesn't get this for free. If it's trapped in a box, then it can't do it, and that could be the primary mechanism that actually does the persuasion.
See also this quick take regarding B.F. Skinner's belief that Hitler-level behavioural control of populations would become widespread and dangerous as behavioural science developed. TL;DR: This didn't happen, behavioural science turned out to be too hard.
4. Hypnotic
Can persuade people of things to achieve a much stronger (but not unbounded) effect than the most persuasive humans. I see two routes to this, one through scaling up regular persuasion, one through achieving recognisably superhuman individual persuasion.
4a. Super-personalised or super-numerous charismatic leader
Can persuade people of things about as well as the most persuasive people in history, and is able to coordinate this across a lot of people in a personalised manner (i.e. picking what each person will find most persuasive) to achieve a much stronger effect than small-scale or broadcast persuasion.
Case for this being achievable
Assuming highly individually persuasive capabilities are possible, crossing this boundary seems common-sense possible, simply by applying the base capability to every individual with some steering to keep it pointed at the same goal.
Religious texts, manifestos, and other memes are able to influence a lot of people (sometimes towards extreme behaviour) without much personalisation. It stands to reason that a hyper-personalised version of this could achieve a greater effect.
Case against this being achievable
Firstly, this assumes the AI-in-a-box already has access to a large number of people. This (and worse) will likely be the case if the current trajectory continues, but this would be defeatable simply by restricting access.
Second, I think the strongest way this could fail is on the "achieve a much stronger effect than small-scale or broadcast persuasion". Advertising campaigns, social movements, and religions already apply some level of personalisation and segmentation to persuade different target groups. I'm somewhat skeptical that highly individualised targeting could do much better than this, i.e. I'm skeptical that people are so different in what makes them tick that going beyond ~100 "profiles" to true individualisation makes much difference[3].
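To make the profiles-vs-individualisation intuition concrete, here's a toy model entirely of my own construction (not from any real persuasion literature): suppose each person's "what makes them tick" can be summarised as a single number in [0,1], and a campaign built around profile c lands with mismatch |x − c|. With k evenly spaced profiles, the average mismatch works out to 1/(4k), so almost all the gain comes from the first handful of profiles:

```python
def avg_mismatch(k, n=10_000):
    """Average distance from a uniformly distributed preference x in [0,1]
    to the nearest of k evenly spaced campaign "profiles" (toy model)."""
    centres = [(i + 0.5) / k for i in range(k)]
    grid = [(j + 0.5) / n for j in range(n)]
    return sum(min(abs(x - c) for c in centres) for x in grid) / n

for k in (1, 10, 100):
    print(k, round(avg_mismatch(k), 5))   # matches the closed form 1/(4k)
```

Under these (strong) assumptions, moving from no targeting to ~100 profiles removes about 99% of the mismatch, and full individualisation can only ever recover the remaining 1%. If people's persuasion-relevant differences are higher-dimensional than this, the picture changes, which is exactly where the disagreement lies.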
4b. Truly hypnotic
Can persuade you to do some things "you would never normally do" (but not anything), through means that seem incomprehensible or spooky. Also hits the "achieve a much stronger effect than the most persuasive humans" condition.
Case for this being achievable
Humans do sometimes take extreme actions, like killing each other or engaging in large-scale organised crime. They must be persuaded to do this in some way.
Humans also do things "out of character" or where "they don't know what came over them". One way this can happen is through ordinary hypnosis by other humans. If there exist some principles which can be inferred here, and could be deployed more widely or more consistently, then it could be possible to reliably trigger behaviour that seems way off the path of someone's normal reactions. AI being able to analyse a superhuman amount of data, and apply superhuman intelligence to it, could enable it to work out these principles, which humans haven't been able to achieve.
Case against this being achievable
In human-on-human hypnosis, it's generally accepted (1, 2, 3) that:
You can't get people to do literally anything, for instance harm themselves or others.
People need to be "suggestible" for it to work, and be in an environment where they're ready to receive it. It doesn't work on anyone at any time.
It may well not be a real thing, and just be a case of people going along with it.
Given humans already know some rough principles (e.g. find suggestible people, get them into a calm state), the "next level up" could be arbitrarily harder. By analogy, it's possible to convince someone to buy a car who wants to buy a car. It's possible to hypnotise someone who "wants to be hypnotised". It's not slightly harder to convince someone to buy a car if they don't want one, it's borderline impossible. This may also be true for inducing desired behaviour in people who are not basically on board with it.
Other arguments overlap with "5. Arbitrary behavioural control", so I've pushed them all into that section.
5. Arbitrary behavioural control
Can make you do literally anything your body is capable of. E.g. it makes some dial-up tone noise and suddenly your arm moves one inch to the left; it makes some other noise and you pick up an axe and start swinging it around.
Case for this being achievable
If it's possible for you to do some action, there must be some brain state where, if you could set up exactly that state, you would take the action. It's a question of whether it's possible to:
Determine what this condition is.
Induce it via only text/audio input.
On this question of inducing highly specific brain states or actions:
In animals: There have been experiments in rats[4] showing you can trigger or suppress seizures by activating specific neurons. This shows that it is possible in principle to "induce specific brain states" through targeted external input. This requires physically inserting proteins into the brain to make certain neurons light-sensitive, and shining light directly on the exposed brain, which is quite far from text/audio input. Additionally, a seizure is a non-subtle macro-scale brain state, inducing arbitrary behaviour is likely much harder.
In AI: adversarial examples are "a thing", where if you know a model's weights you can construct innocuous-seeming inputs which induce specific outputs. If human brains work similarly, then in principle there could exist some sequence of words or sounds that triggers a very specific response. The question is whether it's possible to find and deliver such a sequence through a normal conversation.
Case against this being achievable
Predicting or inducing exact behaviour could be...
...like "the weather", i.e. it's too chaotic to have coherent effects over more than a short period of time. Maybe seizures (high entropy breakdown of function) are a lot easier to achieve than coherent "walk over there and pick that up" actions.
...computationally intractable (relevant EA Forum post), i.e. even if you know the starting state, simulating forward the human brain on the computer hardware is ~impossible, even if the hardware is very efficient about doing its own thinking computations (again see "Appendix: Fingerprint argument").
...I/O bound, i.e. the AI needs to get enough data out of humans to model the brain, and put enough data in to have a strong controlling effect. This may not be possible. By analogy: two computers could display the same exact pixels on screen, and have the same exact set of inputs, but one could be running on Linux and one on Windows. An attack that would work on one would generally not work on the other. More than that, you have very little insight into the exact instructions running on the CPU, so you're limited to quite "surface level" attacks. In human persuasion, this is equivalent to being limited to things like "I'll give you £1M to let me out of the box", but not the string of 10 words that makes someone kill themselves.
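The "weather" point above is the standard sensitive-dependence-on-initial-conditions observation. A minimal illustration with the logistic map (a stock example of chaos, not a model of the brain): knowing the starting state to nine decimal places still buys only a few dozen steps of prediction.

```python
def trajectory(x0, steps, r=4.0):
    # Logistic map x -> r*x*(1-x); chaotic at r = 4.
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = trajectory(0.2, 50)
b = trajectory(0.2 + 1e-9, 50)   # same state, mis-measured in the 9th decimal

print(abs(a[5] - b[5]))   # still microscopic after a few steps
print(max(abs(u - v) for u, v in zip(a[30:], b[30:])))  # macroscopic: prediction gone
```

If behaviour-relevant brain dynamics are even mildly chaotic, errors in the AI's model of your state compound the same way, which bounds how far ahead any controller can steer you regardless of its intelligence.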
Conclusion
I'd be very surprised if something like "5. Arbitrary behavioural control" is possible. Below that, I think there's room to argue over the different levels. I encourage you to make your arguments in the comments.
Appendix: Fingerprint argument
Claim: In general, it's much easier to achieve functional equivalence between two artefacts (having the same behaviour/purpose) than it is to achieve indistinguishability. Slogan version: Artefacts tend to have a "fingerprint" that is hard to fake.
Examples:
Both petrol and electric cars can be driven from A to B, but one goes "vroom vroom" and the other goes "vvvvrrrrrrrrr".
Both cats and dogs are small four-legged animals that live in your house, and in fact it's hard to define one clear visible difference. But you can just look at one and determine which it is.
The functional properties of physical money that make it usable are essentially just: it's durable, non-perishable, portable. It's easy to print off fake money that fulfils this role, e.g. for a board game. It's very hard to produce convincing counterfeit money, due to a large number of deliberate and accidental hallmarks introduced in the manufacturing process.
Argument:
There are generally N properties that let an artefact fulfil its functional niche.
There are generally M >> N measurable properties that an artefact has, which are largely irrelevant to its functional niche.
It's often straightforward to replicate the N properties in an alternative way.
It's much harder to replicate the other M properties to the point where you can't distinguish the two artefacts.
This is especially strong when you consider that the set of M properties are generally unknown ahead of time, so you can always inspect further to find new properties that let you distinguish. For instance, imagine how hard it would be to make an electric car that is indistinguishable from a petrol car, assuming that: if you can't tell from driving it, you're allowed to look under the bonnet; if you can't tell from that, you're allowed to start disassembling and inspecting every part; if you can't tell from that, you're allowed to contact the manufacturers and trace the whole process of how it came into existence.
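The structure of the argument can be put in toy probabilistic terms (numbers purely illustrative): if a fake independently matches each incidental property with probability p, its chance of surviving inspection of M properties is p^M, which collapses as the inspector keeps digging.

```python
# Chance a forgery survives inspection of M incidental properties,
# assuming it matches each one independently with probability p.
p = 0.9   # a generous 90% per property (illustrative)
for m in (1, 5, 20, 100):
    print(m, p ** m)
```

Under this toy model the forger's difficulty is exponential in inspection depth while the inspector's effort is only linear; that asymmetry is the fingerprint claim in one line.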
I'm being quite glib here, I am aware of counterarguments like "the AI only has to succeed once". I'm not unsympathetic to this, I'm just trying to cover a lot of ground here! Please feel encouraged to point out these gaps in the comments.
Footnotes

[1] From the latin, "super" = "really good", "persuasion" = "persuasion".
[3] On net though: my guess is this is possible, and could be quite scary.
[4] Described in this Rationally Speaking episode.