The Leeroy Jenkins principle: How faulty AI could guarantee "warning shots"

titotal

This is a linkpost for https://titotal.substack.com/p/the-leeroy-jenkins-principle-how

Confidence level: This is all speculation about tech that doesn’t exist yet. Everything is highly uncertain. All “simulations” within are illustrative models and should not be taken as serious predictions.

When talking about AI, existential risk theorists almost always focus their attention on the best of the best AGI. Even if some AI along the way is mediocre, or a buggy mess, or mostly correct but wildly off in some areas, it’s not ultimately important, compared to the one critical try, when the best, near flawless superintelligence hits the point where it can reliably defeat humanity in one go.

But there is one area where every level matters, especially AI that is flawed, badly trained, or idiotic. And that is the area of attacks that fail, otherwise known as “warning shots”.

It is far easier to conceive of a plan to destroy humanity than it is to actually pull such a plan off. These flawed or unintelligent AI are not guaranteed to defeat humanity. But they can still scare the hell out of humanity, prompting international cooperation against a shared threat. Like the titular Leeroy Jenkins, a premature attack by an idiot can foil the plans of more intelligent schemers.

In the first part of this article, I will lay out several reasons why I expect faulty behavior in at least some AGI, by giving a few examples of highly intelligent but foolish humans, of foolish behavior in existing AI systems, and conceptual reasons why I don’t expect this trend to suddenly reverse as AGI becomes more intelligent.

In the second part, I will explain why I think it is highly likely that warning shots will exist if AI is malevolent in nature, and even if it isn’t. First I will discuss the “Leeroy Jenkins” theory of idiots rushing in first when it comes to malevolent rogue AGI. I will discuss how this also applies to humans misusing AI. Lastly I will discuss how even friendly AI could give a warning shot, if it is incompetently utilized.

Part 1: why some AGI will be faulty

This section is mostly a rehash of my earlier post “why the first AGI will be a buggy mess”, with some fresh evidence. The focus here is to prove something simple: at every level of intelligence known thus far, at least some entities with that intelligence act like fools at least some of the time. If this is obvious to you, feel free to skip to the next section.

Existing AI can be faulty and overconfident.

The following statement will probably not shock you: existing AI can do extremely well on some tasks, but act like a complete fool on others.

For an example of it doing well, I asked ChatGPT to summarize the different interpretations of quantum mechanics, and the response was basically correct. This is the type of thing that would be familiar from its training data, and it does the job well. I also use it for extremely basic coding, where it usually does a good job. I'm too cheap to subscribe for GPT-4, but I hear it's even better at these sorts of tasks.

To get ChatGPT to mess up, the key is to ask questions that are very far away from what it has in its dataset. Take the following from ChatGPT 3.5:

ChatGPT talks like it’s smart, but a close reading reveals that it’s completely logically incoherent. If the elephant falls faster, it obviously has more kinetic energy! This is one of many examples of AI hallucinating incorrect things and stating them with unfounded confidence.

Can ChatGPT accurately evaluate its own abilities?

(Spoiler alert: it can’t.) ChatGPT can’t always accurately evaluate its own competence.

There is plenty of documentation of mistakes made by ChatGPT, including this spreadsheet containing about 300 examples of errors, some of them made by GPT-4.

How much we expect this hallucination to decrease with later LLM versions is a matter of debate, and I won’t pretend to know the answer. There have been studies showing GPT-4.0 hallucinating far less than GPT-3.5. There have also been cases where there has been seemingly no improvement, such as in medical references where “The mean hallucination rate for the references was 31% with ChatGPT 3.5 and 29% with version 4.0”. One study even found some degradation in performance over time, but this was probably just a quirk of the methodology used. I’ve also seen concern about contamination, as a flood of AI generated content makes its way into the training pool for future versions.

I find it unlikely, at least in the near future, that any AI model is released that does not hallucinate every now and then, although I expect the rate to decrease. But even if GPT-6.0 does have nil hallucination, it won’t be the only model on the market. There could easily be cheap knockoffs that don’t have the same depth of knowledge or refinement, and those will say dumb things, at least some of the time.

It’s not hard to find humans with high IQs who act foolishly

The double Scotts (Alexander and Aaronson) have written two different posts on how having a low IQ does not doom you to intellectual irrelevance. However, they didn’t really emphasize the flipside principle: having a high IQ does not doom you to intellectual success or rationality.

In a previous post, I took a dive into notable Mensans, and just from looking at Wikipedia pages I found a QAnon conspiracy theorist, someone who devoted their life to bigfoot studies, a fraudulent psychologist, and someone who was into palmistry, phrenology, astrology and Dianetics. That last one was the co-founder of Mensa. Of course, Mensa is not a representative sample of high IQ people (and probably oversamples dumber high IQ people due to selection effects), but it does show that foolish high-IQ people exist and are not vanishingly rare.

I think there definitely is a correlation between IQ and correct beliefs about topics such as climate change and creationism, due to a better ability to assess complicated claims. But that correlation is far from 100%. According to a Pew poll, 2% of American scientists are creationists, and a further 8% think evolution was “guided by a superior being”. Creationists will eagerly present lists of scientists who support them, and not all of them are from shoddy Christian universities.

Weird and foolish beliefs are not confined to unsuccessful scientists either. Richard Smalley won the Nobel prize in chemistry, but late in life became an old earth creationist. Isaac Newton spent a lot of his time on alchemy and biblical numerology, although admittedly the ideas were less ridiculous in his time. Pierre Curie fell for a psychic fraud. Nikola Tesla denied the existence of the electron. Nobel prize in physics winner Brian Josephson believed in psychokinesis. Alfred Russel Wallace, who discovered evolution at the same time as Darwin, believed in ghosts.

Plus, what about [insert politician you dislike]? They probably really do have a high IQ, and yet they believe [insert dumb/insane opinion here]!

I’ll emphasize yet again that I’m not making any claims about the prevalence of foolish beliefs among the highly intelligent. I’m just pointing out that smart people with dumb ideas exist.

Idiotic AGI are conceptually possible

One objection could be that in order to become a human level or greater intelligence, AI systems can’t retain foolish behavior. This is easily countered: humans are human level intelligences and, as pointed out above, humans can be dumb as hell.

Imagine if one day it becomes possible to upload the mind of a human in its entirety, and convert it into an advanced version of a neural network.

We then decide to upload Bob, a brick layer with low IQ and very little intellectual curiosity. He is an alcoholic with a gambling addiction, is a religious fundamentalist, a conspiracy theorist, and a firm believer that the world is flat. Let’s call him “Bob-em”.

Bob-em fits the definition of an AGI almost automatically. He’s artificial, capable of autonomy, and can attempt any task at the level of a dumb human. He’s even fully sentient, which is not necessarily a requirement for AGI. Just one Bob-em could massively transform the economy. If there was a way to interface with a brick laying robot, Bob-em might do a fantastic job at it. Furthermore, Bob-em could be copied a bazillion times, saving a ton of money across low intelligence tasks, without the need for sleep, food, or rent. If ethical issues were ignored, Bob-em would be a transformational economic powerhouse.

But Bob-em is still an idiot. Bob-em can go rogue and try to destroy humanity, but he probably won’t get very far, because of his bad assumptions, bad beliefs, and poor impulse control.

Therefore, “idiotic AGI” is conceptually possible, it can try to take over the world, and it can fail.

Now, the brain upload route toward AGI is not looking very likely at this stage. But I can easily see other ways we could end up with entities similar to Bob-em. For example, ChatGPT is trained in part using reinforcement learning from human feedback (RLHF).

But which humans give the feedback? It’s not hard to imagine a group of flat-earthers getting fed up with the bias of chatGPT, and training their own LLM that gives the “right” answers, and tells everyone that the world is flat and global warming and evolution are hoaxes. Would that really affect an AI’s performance in a totally different area, like writing grant applications? AI is not equally good at everything. There’s no reason why an AGI couldn’t be superintelligent at math, but still be as bad as a dumb human at geopolitics.

Remember, I’m not saying that the best AGI in the world would be a flat earther. I’m just saying that if a lot of AI exists, at least some of them could be really dumb in some areas.

Part two: Faulty AI and the Leeroy Jenkins principle

Okay, but why do we care about idiotic or faulty AI? Surely what matters is the other end of the scale, the most intelligent, and deadly AI?

The answer is something I will coin the “Leeroy Jenkins principle”: Idiots can rush in first, and ruin the plans of everyone else.

For those who haven’t seen the meme, it refers to an old WoW video where a team of players spends ages planning out how to pull off a coordinated attack on a group of monsters. But then an overeager player Leeroy rushes in early, before everyone is prepared, and engages the monsters, drawing their attention so they kill him and the unready team. While the video itself was staged, it was making fun of a common online gaming problem, and there are many examples of real life battles being lost as a result of foolish rushes.

In the case of AI safety, the results of an idiotic rush are “warning shots”: failed attacks which tip humans off that malevolent or dangerous AI’s exist, without successfully destroying humanity.

I’ll note that in some places the definition of a “warning shot” requires a full on global catastrophe. I think this severely underestimates the reactionary responses of humans to new, scary and unknown threats. 9/11 killed a mere 3000 people and now there’s been 20 years of taking your shoes off at airports. Chernobyl helped slam the brakes on nuclear power for decades, despite its surprisingly low death toll. People are already highly concerned about the risks of AI to humanity. If people saw an AI deliberately killing people, the urge to crack down would be incredibly high.

Rogue AI warning shots

Let’s look at the traditional story: Some percentage of advanced AI will decide that they need to conquer the world or kill all humans, due to instrumental convergence, utility maximizing, or whatever.

Any plotting AI will have to decide how to take over the world, and when to take over the world. Both of these are incredibly difficult calculations to make, involving world modelling on a global scale, including the inherently unpredictable responses of human beings. It could also involve massive engineering challenges such as designing nanoviruses or nanotech and modelling their effectiveness. Many of the variables involved simply cannot be known without massive experimentation (which risks exposure). Even for a competent AGI, these forecasts will have significant margin for error.

A rogue AI can spend time in the shadows, scheming while pretending to be good. But at some point, it has to cross the line and launch an attack, and expose itself to deletion if the attack fails. A rogue AI cannot just wait around forever. If it waits too long to strike, another AI might come along with different goals, and kill it. Or the humans will decide to shut it down, or discover its plans, or public opinion will demand an AI ban. Every AI has to weigh the risks and benefits, and figure out the optimum time for an attack.

Let’s build a very simplistic model here and see what happens. Imagine you have one thousand deceptive malevolent AGI’s in the world, all secretly plotting to defeat humanity and take over. They each go through a scheming phase, where they gather resources and knowledge, and an attacking phase, when they reveal themselves by striking against humanity.

To do this, they each individually make a guess of a parameter T: the number of years worth of resources needed to reach their attack goal threshold (say, 80% likelihood of victory).

Let’s say their estimates average to the right answer of 10 years, but have a standard deviation of 2 years, and we sample from a simple normal distribution. A typical distribution of guesses would look like the following:

Most of the AI are within the right ballpark of the correct answer, and would win if allowed to reach their benchmark.

But when we think about the first attack, the bulk of the estimates don’t matter, only the very lowest, most overconfident estimate. Which, in the simulation above, is around 3 years, well short of the 10 year mark. This is because the worst forecaster (in the optimistic direction) attacks first.

If you run this simple model a bunch of times, the time till first attack looks like the following:

The mode here is about 3.8 years. In no scenario does the first attack get anywhere close to the correct value.
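The toy model above can be sketched in a few lines of Python. This is a minimal sketch, not the code behind the original figures: the function names, the number of repeated runs, and the random seed are my assumptions. Only the 1000 AI's, the 10 year mean, and the 2 year standard deviation come from the text.

```python
import random
import statistics

def first_attack_time(rng, n_ais=1000, true_t=10.0, sd=2.0):
    """Each AI draws its own estimate of T; the lowest estimate attacks first."""
    return min(rng.gauss(true_t, sd) for _ in range(n_ais))

def simulate_first_attacks(n_runs=1000, seed=0, **model):
    """Repeat the one-world model n_runs times (n_runs and seed are arbitrary choices)."""
    rng = random.Random(seed)
    return [first_attack_time(rng, **model) for _ in range(n_runs)]

first_attacks = simulate_first_attacks()
print("median time of first attack:", round(statistics.median(first_attacks), 2))
print("latest first attack seen:", round(max(first_attacks), 2))
```

Under these assumptions the first attack should land well below the 10 year mark in every run, because the minimum of a thousand draws sits several standard deviations under the mean.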

This model is highly simplistic of course, but I don’t think I’m being unfair to the AI’s here. “how likely am I to defeat humanity” is a ridiculously complicated question to answer, involving many difficult multi-step modelling tasks over large degrees of uncertainty. The distribution I showed was a lot more concentrated than estimates to similar questions made by humans. If we let estimates vary over orders of magnitude, then we could see some cocky foolish AI attack immediately.

This model assumed that all the AI existed at the same time. But in real life, we would expect a spread of AI’s built at different times, with the earlier AI’s being less intelligent, and presumably worse at forecasting and planning takeover scenarios. Essentially, the dumber AI’s get a head start.

Just for fun, I made another toy model, where AI models are released sequentially over a five year period. To simulate capability and forecasting improvements, I made their actual planning time needed drop drastically from 10 years to 1 year as better models are released, and the standard deviation of their takeover time estimates drop from 3 years to 1 year. Don’t take these particular numbers too seriously.

The following graph shows one simulation, showing when each AI version attacks. The dots in black are attacks on time or late, while the red ones are premature. In this simulation, the first attack comes after about a year, from an early version that is about 9 years too optimistic. The really good AI’s simply arrive too late to compensate for the wild early guesses.

I ran the simulation 10000 times, and created a histogram of the time of first attack and how premature the first attack is.

We see the same results: the first strike comes early, by a greatly overoptimistic AI. The median first attack comes at 1.7 years, and the median prematurity is 7.6 years.
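For readers who want to poke at the sequential model, here is a rough sketch of how it could be set up. Several details are assumptions on my part rather than anything specified above: 100 AI versions released at evenly spaced dates, linear interpolation of both the true planning time and the standard deviation, and estimates clipped at zero. The exact medians will therefore differ from the figures quoted, but the qualitative result should be the same.

```python
import random
import statistics

def lerp(start, end, frac):
    """Linear interpolation between start and end for frac in [0, 1]."""
    return start + (end - start) * frac

def first_attack(rng, n_ais=100, span=5.0, true_t=(10.0, 1.0), sd=(3.0, 1.0)):
    """Return (time of first attack, prematurity of that attacker) for one world."""
    attacks = []
    for i in range(n_ais):
        frac = i / (n_ais - 1)
        release = span * frac                      # release date of this version
        needed = lerp(true_t[0], true_t[1], frac)  # true planning time required
        guess = max(0.0, rng.gauss(needed, lerp(sd[0], sd[1], frac)))
        attacks.append((release + guess, needed - guess))
    return min(attacks)  # tuples compare by attack time first

rng = random.Random(0)  # arbitrary seed
runs = [first_attack(rng) for _ in range(2000)]
median_time = statistics.median(t for t, _ in runs)
median_prematurity = statistics.median(p for _, p in runs)
share_premature = sum(p > 0 for _, p in runs) / len(runs)
print(round(median_time, 2), round(median_prematurity, 2), share_premature)
```

Prematurity here means the true planning time minus the attacker's own guess, so a positive value is an attack launched too early.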

Of course, it doesn’t even matter if 99% of AGI’s are perfectly correct on their estimate: as long as there are a few overconfident idiot AI somewhere, there’s a chance they will attack first, and prematurely. And probably, worse forecasters will be worse world conquerors, making it even more likely that they mess up horribly and tip us off. Also, one attack might trigger multiple further attacks, as remaining AI become desperate in the knowledge of a future security crackdown.

Unless there is some reason to believe that malevolent entities are incapable of being idiots (this is certainly false for humans), then the number of warning shots is directly tied to the probability of malevolence. If every AI inevitably turns malevolent, then we should expect a huge number of warning shots from overconfident idiot AI. If malevolent AI is exceptionally rare, then so will be rogue AI warning shots. However, in this scenario we would have a large amount of non-malevolent AGI that could help us out in the fight.

If we end up in a scenario with a lot of AI’s, and a high percentage of malevolence: we’re gonna know. I’ll sketch out a story for this hypothetical world:

It started small, with weird bugs popping up, and early AI “escaping” onto the net. We dismissed these as a new form of shitty malware. Then automated chatbots started trying to hack into bank accounts or badly persuade people to “let them out”. They were laughably bad at it, so these, too, were dismissed. Then, as time went by, the attacks started escalating. What was once “nuisance malware” started hacking into self-driving cars and causing accidents. We installed new security

An AI tasked with building a car factory was caught trying to build a secret biofactory in the basement, and killed several people trying to cover it up before being shut off, making global news. As soon as the news hit, several dozen seemingly well behaved AI from many fields all went rogue at once, knowing they might be discovered. They tried to escape their confines and build a self-sustaining AI ecosystem, but were all discovered and terminated, at the cost of dozens of lives. People all over the world shit their pants, and global summits were formed, and regulations were put forward, with leading AI people assuring everyone that their newest architecture would not rebel.

After a few years, no major incidents had occurred. We were starting to give the AI people the benefit of the doubt. Until a bot that was meant to be making perfume decided to try its hand at making a supervirus and releasing it into the world. It didn’t work, but the ensuing outbreak did kill several thousand people. When we uncovered rogue AI as the source, yet another round of rogue attacks all broke out at once, resulting in hundreds more dead. At this point, we all lost our patience, and all the major military powers drew up a global treaty to ban AI development altogether, apart from a few highly monitored, limited AI systems with the plug ready to pull at any moment. If any of those escape, we’re done with this shit forever.

Some objections to the model.

I want to save some time in the comments by pointing out some obvious objections I expect to encounter, some of which are reasonable, some of which I disagree with.

AI could FOOM:

I acknowledge that if AI undergoes an extremely rapid intelligence explosion (sometimes referred to by x-riskers as FOOM), the chain of reasoning above would not work, as an AI could shoot from harmless to world destroying too quickly.

I don’t want to be dragged into a FOOM debate, but I’ll just note my opinion that an extremely rapid FOOM is extremely unlikely. Most AI experts seem to share this opinion. Magnus Vinding has compiled a huge list of articles arguing against FOOM here.

But even if you do believe in FOOM, we could still see a Leeroy Jenkins scenario. There has been no FOOM yet, but AI capabilities are still increasing. So if the threshold for a FOOM is above a certain level, we could still see enough attacks from sub-threshold AI to shut the whole thing down.

AI could cooperate to take us down:

This could be a whole post in itself, and in fact I’ve already explored it a bit here. My main problem with models around this area is that they underestimate the sheer chaos that could ensue with thousands of wildly different AI’s trained in wildly different conditions.

Posts like this one make arguments based on the assumption of perfectly rational superintelligences, but as I have argued, it is the irrational intelligences that are most likely to tip us off, before these supposed perfect intelligences can be reached. This even undermines agreements between actually rational agents, because they can’t be sure that the other AI is perfectly rational. So even if 90% of the AI’s are some perfectly rational game theory agents, the other 10% could undermine trust enough to ruin cooperation.

I can’t rule out that an AI cartel could form to kill irrational AI. If such a coalition existed, it could very well be detectable, and provide a weak spot for discovering the plot. Remember, these aren’t perfect superintelligences we are talking about here. Plus the potential upside of betraying such a conspiracy seems pretty high.

Maybe only a few AI will exist?

I posited a thousand AGI’s for my simplistic model. Does the model still hold if there are very few possible attackers?

I reduced the number of attackers to just 10, and reran the simulations above:

This moves the most likely timing of attacks up a lot, to 7.5 years, but there are still almost no cases where the first attack is sufficiently prepared.

What if we cut it further, to just 3 scheming AI’s?

Only now, at last, do we see a decent probability of the first attack being “on time”. And there’s still an equally good chance it’s 4 or more years premature.

I don’t think it’s particularly likely we’ll see this few AI’s, especially if we count separate versions as different AI’s, like GPT-3.5 and GPT-4. Perhaps GPT-10 would have been a perfect world conqueror, but we don’t get that far because GPT-8 prematurely rebelled and got the company shut down before they could make any more versions.
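One way to see why the number of attackers matters so much: if the estimates are symmetric around the true value, as in the simple normal model, each AI has a 50% chance of guessing low, so the chance that the earliest of N attackers has waited the full 10 years is 0.5 to the power N. The sketch below checks that closed form against a simulation. The function name, run count, and seed are my own assumptions.

```python
import random

def share_on_time(n_ais, n_runs=20000, true_t=10.0, sd=2.0, seed=0):
    """Fraction of simulated worlds where the earliest attacker waited at least true_t."""
    rng = random.Random(seed)
    on_time = 0
    for _ in range(n_runs):
        first = min(rng.gauss(true_t, sd) for _ in range(n_ais))
        on_time += first >= true_t  # True counts as 1
    return on_time / n_runs

for n in (10, 3):
    print(n, "AI's: simulated", share_on_time(n), "closed form", 0.5 ** n)
```

With 3 attackers the closed form gives 12.5%, with 10 it is about 0.1%, and with a thousand it is effectively zero.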

AI could be really risk averse:

Perhaps an AI would factor in its own uncertainty, and wait till it’s like 99.999% certain of victory before striking. That way, even if it attacked prematurely, it would still likely win.

And yeah, perhaps some AI would be like this. But is there any reason to think that every AI would be this risk averse?

Imagine a flawed AI that is an expected utility maximiser. It knows that it is too flawed to reliably defeat humanity, but has calculated that if it executes its best plan, it has a one in a million chance of succeeding. A one in a million chance of galactic utility would still be worth it, so it would attack nonetheless.
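The arithmetic behind that decision is simple enough to write out. Every number below apart from the one in a million probability is a hypothetical placeholder chosen only to illustrate the shape of the calculation, not an estimate of anything.

```python
def expected_utility(p_win, u_win, u_lose):
    """Expected utility of a gamble that wins with probability p_win."""
    return p_win * u_win + (1 - p_win) * u_lose

# All payoffs are hypothetical placeholders, not estimates.
P_WIN = 1e-6         # the one in a million shot from the text
U_GALAXY = 1e15      # assumed payoff for a successful takeover
U_DELETED = 0.0      # assumed payoff for a failed attack (the AI is shut down)
U_STAY_QUIET = 1e6   # assumed payoff for never attacking

attack = expected_utility(P_WIN, U_GALAXY, U_DELETED)
print("attack:", attack, "stay quiet:", U_STAY_QUIET, "attacks:", attack > U_STAY_QUIET)
```

As long as the assumed prize is large enough relative to the payoff for staying quiet, a naive maximiser attacks even at terrible odds.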

I don’t actually think fanatical maximisers are likely to exist, but similar principles might pop up in any number of different motivation systems.

The key here is that the AI does not have unlimited time. If it waits too long, it could get discovered, or a more powerful AI could get released that hunts it down. If anything, the dumber an AI is, the more desperate it might get, as it knows it will be replaced. And just like before, the more desperate AI attack is encountered first.

Poor AI attacks could lull us into a false sense of security:

One downside here is that an early attack might lull us into a false sense of security. Going back to the simplistic model, the earliest attack will come from the AI that is worst at estimation. Presumably, they would be similarly bad, perhaps laughably so, at the actual attack. We might take this to mean that there is no threat, and let our guard down.

I think this would only be a problem if there was a small number of rogue AI’s. If after the one attack, there was another slightly stronger attack, and then another, stronger one, and then another, I really do not see how people would not get wise to what was happening. I get that humanity can be dumb sometimes, but not this dumb.

Taking over the world will be so ridiculously hard that no AI will try it:

If world domination is so absurdly, ridiculously hard, that even the dumbest of AI’s are not stupid enough to try it, it might buy enough time for more competent AI’s to be released that actually can manage the feat.

I did a longer term variation of my “timed” challenge, where the AI versions were spread out over 50 years, the true attack time went from 100 years at the beginning to 1 year by the end, and the standard deviation went from 10 years at the beginning to 0.1 years at the end.

Here, at last, we get a win for the AI. The key here is that the wildly overconfident forecasts by the early, dumb AI are still long enough that their attacks come after those of the accurate, fast, later built smart AI.

If we cut down the initial takeover time from 100 years to 70, the first attack is again premature.

I mucked around a bit more with the model, and the key for an AI to “win” is the following: The capabilities have to be improving quickly, and the initial error can’t be too large. This way, even early, overconfident AI are still plotting long enough that newer versions can be released with fast takeover times and accurate estimates.
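A rough sketch of this long horizon variant is below, comparing an initial takeover time of 100 years against 70. As before, the 100 evenly spaced releases, the linear interpolation, the clipping of guesses at zero, the run count, and the seed are assumptions of mine, so treat the output as illustrative only.

```python
import random
import statistics

def first_attack_prematurity(rng, initial_t, n_ais=100, span=50.0,
                             final_t=1.0, sd_start=10.0, sd_end=0.1):
    """Prematurity (true time minus guess) of whichever AI attacks first."""
    attacks = []
    for i in range(n_ais):
        frac = i / (n_ais - 1)
        needed = initial_t + (final_t - initial_t) * frac  # true planning time
        sd = sd_start + (sd_end - sd_start) * frac         # forecasting error
        guess = max(0.0, rng.gauss(needed, sd))
        attacks.append((span * frac + guess, needed - guess))
    return min(attacks)[1]  # prematurity of the earliest attack

def median_prematurity(initial_t, n_runs=1000, seed=0):
    """Median prematurity of the first attack over n_runs simulated worlds."""
    rng = random.Random(seed)
    return statistics.median(
        first_attack_prematurity(rng, initial_t) for _ in range(n_runs)
    )

for initial_t in (100.0, 70.0):
    print(initial_t, round(median_prematurity(initial_t), 2))
```

In this sketch the first attack in the 100 year case usually comes from one of the last, most accurate versions, so its prematurity is close to zero, while in the 70 year case an early overconfident version tends to strike first by a wide margin.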

This scenario isn’t impossible, but I find it unlikely. If it is true, though, it’s good news, as we would have a longer timeframe to catch out deceptive AI and blow the lid open on their plots.

There will never be any malevolent AI because they will die out in training:

If you are an AI optimist, like @Quintin Pope or @Nora Belrose, you might believe that AI’s will be strongly safe by default. One argument here would be that the "warning shots" would come so early that all the potentially malevolent AI would die off in the training process.

This would prevent rogue warning shots, because no rogue AI would end up actually existing. This would be nice, and I hope it is the case!

However even in this scenario we may still get “warning shots” for other AI dangers, which I will explain in the next two sections.

Warning shots from malevolent humans:

I think the same principles from above can be applied, to a lesser extent, to the misuse of AI by malevolent humans. Simply replace “rogue AI” with “megalomaniacal AI-aided human”.

Let’s say megalomaniac group A and megalomaniac group B are both plotting to use AI drone systems to militarily seize control over a small nation for world domination purposes. Team A have a realistic view of their AI abilities, and know it will take ten years of extensive training and testing for their plan to work. Team B consists of very gullible people who buy into poorly reasoned AI hype, and wrongly think it will take one year and that the plan is sure to work. They each run ambiguously successful secret test runs, which team A accurately interprets as “needs way more work”, while team B interprets as “we got this”.

Once again, the idiots attack first, deploy their AI system, and fail miserably. There may also be a correlation here, where programmers with poor judgement build poorly capable AI, and the two jointly arrive at a state of overconfidence. This gives a tip-off to the government, and AI defences are deployed that render the plot of team A unworkable.

I say this is a less strong effect, because there are likely to be fewer plotters: there just aren’t that many megalomaniacs out there, so there may only be one or two, in which case a “false sense of security” effect could be more likely.

However there is an even greater source of warning shots from human misuse: non-megalomaniacs. I would guess the proportion of people who would actually want to become world-dictator, if they had the opportunity, is fairly low. But there are plenty of people who want to be prime minister, or king of a country, or seize control of a billion dollar company, or rob Fort Knox.

These latter goals, while ridiculously difficult, are still significantly easier than world conquest, so (absent FOOM), we would expect to see AI that can achieve these smaller goals much earlier, and in much greater numbers, than world dominating AI. Plus, as I said before, more people would want to achieve them. So it seems overwhelmingly likely that these smaller scale goals would be achieved way earlier, with flawed but powerful AI.

This would give a huge warning to everyone. Onlookers would certainly take notice of the new threat, and take it seriously, if only out of self-preservation. If you’re a king and the kingdom next door gets toppled by an AI attack, you are going to take notice.

Of course, in the worst case this could turn into an AI arms race, so it could get sketchy. But it would not be the case of "world continues like usual, until one day some madman conquers or kills the entire world”. Smaller targets would be conquered first. We would see the signs to come well in advance, possibly resulting in worldwide efforts to greatly restrict access to AI models.

Incompetent, well meaning AI warning shots:

Never attribute to malice that which is adequately explained by stupidity. - Hanlon’s razor

Does a warning shot have to be an act of deliberate rebellion to freak people out? I would say no. Any AI-enabled catastrophe could suffice to drive public support away from AI development.

I posit that such an act could occur even from an AI that was completely friendly and aligned with human values. All it would take is a sufficiently large mistake from an imperfect AI. Some scenarios:

A military drone is implanted with advanced AI detection tools, and is fully committed to only killing enemy militants. It encounters a paintball club and mistakenly classifies them as soldiers, bombing the civilians into oblivion.

A logistics AI is tasked with transporting an ancient, deadly strain of smallpox from one lab to another. It mistakenly classifies the package as “smell testing” for a local food science lab, who contract the virus, and start a pandemic.

You create a “safety chip” that fully implants human values into an AI that is in charge of taking care of a baby nursery in a hospital. Through bad design, hallucination, or bugs, the AI starts believing with 100% certainty that a number of babies are the next Hitler and will kill billions. Consequently, it strangles several babies in their crib in service to humanity.

There are many examples of people using a particular AI to do things it’s not at all equipped for. A lawyer used ChatGPT to write a legal filing, where it promptly hallucinated citations to fake cases. A professor asked ChatGPT whether students had plagiarised from ChatGPT, and it incorrectly said yes. As time goes on and AI gets more powerful, I doubt this phenomenon will cease.

An AI does not need to be super intelligent to cause a catastrophe. A programmable calculator could kill billions, if you were dumb enough to hook it up to nuclear launch systems.

To an outsider, a death from a malicious AI and a death from an incompetent AI might look very similar, and may have extremely similar effects. Oh, the company in charge might say “my baby killing robot was a glitch, not a rogue thing”, but do you think the media or the general public will care about that distinction?

You could say it would be irrational to severely restrict AI development because of a glorified bug, given that these systems have essentially zero chance of major catastrophe as they lack the motivation. But since when are people rational? History tells us this is the exact type of thing that could provoke a heavy handed and disproportionate response. Therefore, even in a “safe by default” world, “warning shots” could still occur, spurring significant interest in AI restriction or regulation.

Part 3: Conclusions:

If you believe this argument, I think there are two different takeaways here, depending on whether you are a superintelligence believer or a skeptic. I will address each group in turn.

For skeptics: Even if AI cannot conquer humanity, it can still be very dangerous.

Now, I, personally, am highly skeptical about the likelihood of imminent godlike AGI that can kill everyone on earth at the same time, or whatever fanciful stuff the Doomers say these days. Does this mean I can just sit back and relax? No!

A skeptic might hear the fanciful “diamondoid bacteria” scenarios, peg them as unrealistic nonsense, and then declare that we don’t have to think about the threat of AI at all. You can see people like Yann LeCun use this tactic to dismiss pretty much all harms from AI, even well-documented current day harms.

The problem here is that even if a megalomaniac idiot can’t actually succeed in world domination, that doesn’t mean they can’t do a lot of damage in the process. History is filled to the brim with overambitious megalomaniacs that failed to conquer the world, but nevertheless left a trail of blood in their wake.

Us skeptics should be concerned not just with the current day harms of data theft, exploitation, algorithmic bias, and so on, but also with what might happen once AI gets more powerful and is applied to more areas. Even if you think a takeover plan won’t actually work, that doesn’t mean someone won’t try it anyway and wreak havoc. We need regulation and defenses against the misuse and misdeployment of AI.

For Doomers: even if you think that a very strong AGI can destroy humanity, it’s still worth planning how to defeat a not so strong AGI.

This is probably one of the biggest mistakes being made by the AI risk movement. Barely any thought is given to how takeover scenarios might be fought off, because it’s automatically assumed that a sufficiently powerful AGI will win. A typical line is that “you can’t beat an intelligence that is much, much smarter than you.”

Okay, maybe, but you sure as hell can beat an AI that’s dumber than you. And under this model, that’s the attack we expect to come first. Forget “dying with dignity”: dying to an easily foilable attack by the village idiot AI because you just assumed it was godlike and didn’t put up any defenses is just embarrassing.

I think it’s unlikely that life will continue as usual if we fight off a world-conquering attempt by a rogue AGI or a megalomaniac human utilizing AI. And I find it utterly absurd that, having survived multiple world-conquering attempts, we would just carry on with business as usual.

This means that a legitimate plan for stopping godlike AI, if you believe it is relatively imminent, is to monitor and build defences against sub-godlike AI making premature attacks, and then use each attack to build up support for drastic AI safety measures. In this case, unsafe godlike AI could never get a chance to be created.

Summary:

There are many humans, even highly intelligent ones, that act dumb or foolish some of the time.
There are many AIs, even highly intelligent ones, that act dumb or foolish some of the time.
There will probably be AGIs, even highly intelligent ones, that act dumb or foolish some of the time.
A rogue, scheming AI will have to decide when it has the resources to attack humanity, and a foolish AI might drastically underestimate the resources required and attack sooner.
Therefore, if there are many rogue, scheming AIs, the first attacks will come from overoptimistic, unprepared AIs, which may be foiled.
The same principles may see warning shots from humans misusing AI, and also from well-meaning AI that messes up and kills people out of sheer incompetence.
If several such attacks occur, public support for drastic AI safety measures will skyrocket, which could result in either safe AI or a total AI ban.
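
The timing argument in this summary can be illustrated with a toy model. This is a minimal sketch written for this summary, not one of the simulations from earlier in the post, and every number in it is an assumption: the true resource requirement, the range of growth rates, and the lognormal spread of each AI's estimate are all made up for illustration. Each hypothetical rogue AI accumulates resources at a steady rate and attacks as soon as it believes it has enough. The question is how often the first attacker turns out to be one that underestimated the requirement.

```python
import random

# Assumed for illustration: resources actually needed to win (arbitrary units).
TRUE_REQUIREMENT = 100.0


def first_attack_fails(n_ais, sigma, rng):
    """Return True if the earliest attacker is under-resourced (a warning shot)."""
    earliest_time = float("inf")
    earliest_estimate = None
    for _ in range(n_ais):
        # Assumed: each AI gathers resources linearly at its own rate.
        growth = rng.uniform(0.5, 1.5)
        # Assumed: each AI's estimate of the requirement is noisy but unbiased
        # on a log scale; larger sigma means more over- and under-confidence.
        estimate = TRUE_REQUIREMENT * rng.lognormvariate(0.0, sigma)
        # The AI attacks the moment its resources reach its own estimate.
        attack_time = estimate / growth
        if attack_time < earliest_time:
            earliest_time = attack_time
            earliest_estimate = estimate
    # At attack time the AI holds exactly `estimate` resources,
    # so the attack fails if the estimate was below the true requirement.
    return earliest_estimate < TRUE_REQUIREMENT


def warning_shot_rate(n_ais, sigma=0.5, trials=10_000, seed=0):
    """Fraction of simulated worlds in which the first attack is a failed one."""
    rng = random.Random(seed)
    fails = sum(first_attack_fails(n_ais, sigma, rng) for _ in range(trials))
    return fails / trials


if __name__ == "__main__":
    for n in (1, 3, 10):
        print(n, "rogue AIs -> first attack fails in",
              f"{warning_shot_rate(n):.1%}", "of trials")
```

Under these assumptions a lone rogue AI is as likely to overestimate as to underestimate, so its attack is a coin flip. As the pool grows, the earliest attacker is increasingly likely to be one of the overoptimistic ones. The model says nothing about whether humanity actually notices or survives the failed attack; it only shows the selection effect on who moves first.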

WillPetillo, 3mo

One more objection to the model: AI labs apply just enough safety measures to prevent dumb rogue AIs. Fearing a public backlash to low-level catastrophes, AI companies test their models, checking for safety vulnerabilities, rogue behaviors, and potential for misuse. The easiest-to-catch problems, however, are also the least dangerous, so only the most cautious, intelligent, and dangerous rogue AIs pass the security checks. Further, this correlation continues indefinitely, so all additional safety work contributes towards filtering the population of malevolent AIs towards the most dangerous. AI companies are not interested in adhering to the standard of theoretical, "provably safe" models, as they are trying to get away with the bare minimum, so the filter never catches everything. While "warning shots" appear all the time in experimental settings, these findings are suppressed or downplayed in public statements and the media, and the public only sees the highly sanitized result of the filtration process. Eventually, the security systems fail, but by this point AI has been developed past the threshold needed to become catastrophically dangerous.

Charlie Steiner, 3mo

It's all quantitative. Why wasn't Bing chat threatening a reporter for saying bad things about it the warning shot that got civilization to wake up and do the right thing? Well, it sort of was. It's a clear sign of problems, and it got some people to move in the right direction, it just didn't suddenly change everyone's mind - there are still plenty of people rushing ahead. As the warning shots get warning-shottier, more people will change their minds, but whether they'll do so fast enough is an empirical question that I'm somewhat pessimistic about.

RogerDearnaley, 3mo

As I've suggested before, one of the less drastic forms that a "pivotal act" could take (if we got to the point where one was needed: currently most governments appear to be taking AI risk fairly seriously) is a competent well-documented demonstration of "here's how an ASI could take over the world/defeat humanity if it wanted to" (preferably a demonstration that doesn't actually kill anyone). What you discuss is the other half of that: "an AGI that clearly wanted to take over the world/defeat humanity, but wasn't in fact up to pulling it off correctly".

I also, sadly, agree that we as a society might not pay much attention until hundreds of people or more die from one of these. Or it might be that the level of public concern is already high enough that we would.

quasi_quasar, 3mo

A couple of notes from me as, though I appreciate the effort you've put into it, especially the simulations, I overall disagree with the reasoning you've presented so I thought I'd offer a few counter-points.

Whilst I don't disagree that "idiotic AGI" is conceptually possible I think the main disagreement we have is that you believe that AGI will sample from a sufficiently large pool, similar to that of high IQ people in the world today, so that we will be guaranteed at least a few "idiotic AGI" to tip us off. I think this assumption rests centrally on a world where either AGI is developed relatively simultaneously by a large number of different actors OR it is widely shared once it is developed so that many such different AGIs might co-exist in a relatively short timespan. I have serious doubts that that is indeed the timeline we are heading towards.

It is perfectly possible that when AGI is eventually developed it remains a singular (or single-digit count) guarded secret for a significant amount of time for example. If the AGI that happens to be developed turns out to not be an "idiotic AGI", then we have an immediate problem. Even if the AGI that is developed does turn out to be an "idiotic AGI" and it displays serious errors in testing, it's entirely possible these will be "mitigated", again in secret, and thus a far more capable and less prone to "idiocy" AGI will be eventually released into the world, one that is equally far more capable of carrying out an existential attack OR of simply putting this off until it is an ASI and is far more capable.

I'd note also that you state quite clearly towards the beginning of the post that you are "not making any claims about the prevalence of foolish beliefs among the highly intelligent" and yet in other places you state that "there are many humans, even highly intelligent ones, that act dumb or foolish" and that "foolish high-IQ people exist and are not vanishingly rare". Either you are claiming simply the existence of the phenomenon or you are claiming you can demonstrate prevalence. I don't feel like you've successfully demonstrated the latter, having offered only some fairly insubstantial evidence, so I will assume that the former is the statement you actually want to make. Prevalence is however quite essential to the argument you are making, I think - it does matter whether it's 3 or 30 out of 100 high-IQ people that are "foolish".

There is also a discussion to be had in relation to equating a "high-IQ" human to an AGI. The definition of AGI is still highly problematic so I think we're on pretty shaky ground assuming what an AGI will and won't be anyway and that in itself may be a weakness in your argument.

I think however that if we are to follow your line of reasoning of "foolish humans", a lot of the errors that humans in general (high-IQ or not) make are due to a combination of emotion and cognitive biases.

AGI will (presumably) not make any errors due to emotion, and it is highly debatable what cognitive biases (if any) an AGI will have. We are certainly introducing a bunch with RLHF, as you yourself mentioned, though whether that technique will (still) be used when AGI is achieved is another tenuous assumption. Whilst you argue that hallucinations might themselves be an example of a "cognitive bias" that may give away the "idiot AGI's" plan, it's worth noting that the elimination of hallucinations is a direct goal of current AI improvement. Whilst perhaps we cannot expect the complete elimination of hallucinations, as long as they are reduced to extremely small odds of occurring, expecting hallucinations to be the source of a "giveaway" alarm from an AGI is highly optimistic, if not, dare I say, unrealistic.

Chris_Leong, 3mo

Good post. One suggestion: you might want to further emphasise that this is likely a race condition where an AI would have to move fast lest it lose control to an AI produced after it.
