Some past examples to motivate thought on how AIs could misbehave:

An algorithm pauses the game to never lose at Tetris.

In "Learning to Drive a Bicycle using Reinforcement Learning and Shaping", Randlov and Alstrom describe a system that learns to ride a simulated bicycle to a particular location. To speed up learning, they provided positive rewards whenever the agent made progress towards the goal. The agent learned to ride in tiny circles near the start state because no penalty was incurred from riding away from the goal.
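A minimal sketch of why the circling policy pays off, assuming a toy 2D world rather than the paper's bicycle simulator. The goal position, the circle radius, and both reward functions are illustrative assumptions, not the authors' code. It compares a reward that only pays for progress with one that also charges for moving away.

```python
import math

GOAL = (100.0, 0.0)  # hypothetical goal position, not from the paper

def distance_to_goal(pos):
    return math.hypot(GOAL[0] - pos[0], GOAL[1] - pos[1])

def progress_only_reward(prev, cur):
    """Pay for progress toward the goal, never charge for moving away."""
    return max(0.0, distance_to_goal(prev) - distance_to_goal(cur))

def signed_progress_reward(prev, cur):
    """Pay for progress and charge the same amount for moving away."""
    return distance_to_goal(prev) - distance_to_goal(cur)

def circle_path(radius=1.0, steps=36, laps=10):
    """Positions along a small circle that starts and ends at the origin."""
    path = []
    for i in range(steps * laps + 1):
        angle = 2 * math.pi * i / steps
        path.append((radius * math.cos(angle) - radius, radius * math.sin(angle)))
    return path

def total_reward(path, reward_fn):
    return sum(reward_fn(a, b) for a, b in zip(path, path[1:]))

path = circle_path()
circling_gain = total_reward(path, progress_only_reward)   # grows with every lap
circling_net = total_reward(path, signed_progress_reward)  # telescopes to about zero
print(circling_gain, circling_net)
```

With the signed version, the rewards along any closed loop cancel, so circling earns nothing and only real progress is paid.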

A similar problem occurred with a soccer-playing robot being trained by David Andre and Astro Teller (personal communication to Stuart Russell). Because possession in soccer is important, they provided a reward for touching the ball. The agent learned a policy whereby it remained next to the ball and “vibrated,” touching the ball as frequently as possible. 

Algorithms claiming credit in Eurisko: Sometimes a "mutant" heuristic appears that does little more than continually cause itself to be triggered, creating within the program an infinite loop. During one run, Lenat noticed that the number in the Worth slot of one newly discovered heuristic kept rising, indicating that Eurisko had made a particularly valuable find. As it turned out the heuristic performed no useful function. It simply examined the pool of new concepts, located those with the highest Worth values, and inserted its name in their My Creator slots.

There was something else going on, though. The AI was crafting super weapons that the designers had never intended. Players would be pulled into fights against ships armed with ridiculous weapons that would cut them to pieces. "It appears that the unusual weapons attacks were caused by some form of networking issue which allowed the NPC AI to merge weapon stats and abilities," according to a post written by Frontier community manager Zac Antonaci. "Meaning that all new and never before seen (sometimes devastating) weapons were created, such as a rail gun with the fire rate of a pulse laser. These appear to have been compounded by the additional stats and abilities of the engineers weaponry."

Programs classifying gender based on photos of irises may have been artificially effective due to mascara in the photos.


We had some similar situations.

  • A user (wrongly) assigned a school music room (studio) to a teacher (Marko). The scheduling program left the studio empty every time Marko had classes in a school 10 kilometers away, which was quite difficult to notice.

  • Elsewhere, the same program was instructed to minimize student travel between two buildings 1 kilometer apart. Each building had several rooms, and the distances between them were given by the user, who forgot to include the cafeteria in one of the buildings in the list. So the program concluded that the distance from the cafeteria to anywhere was 0 and managed to build a schedule with no student transfers at all. The students were always "teleported" through the cafeteria into the other building. This one was spotted soon enough.
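The cafeteria bug can be sketched in a few lines. This is a hypothetical reconstruction, not the real scheduling program: the room names, the distances, and both lookup functions are made up for illustration. The point is that a lookup that treats a missing distance as 0 makes any route through the forgotten room free.

```python
# Hypothetical rooms and distances in metres; none of this is from the real program.
DISTANCES = {
    ("A1", "A2"): 20,
    ("B1", "B2"): 30,
    ("A1", "B1"): 1000,
    ("A1", "B2"): 1010,
    ("A2", "B1"): 1020,
    ("A2", "B2"): 1030,
    # The cafeteria was never entered, so no pair involving it appears here.
}

def lenient_distance(a, b):
    """Buggy lookup: a missing pair silently counts as distance 0."""
    if a == b:
        return 0
    return DISTANCES.get((a, b), DISTANCES.get((b, a), 0))

def strict_distance(a, b):
    """Safer lookup: a missing pair is an error the user has to fix."""
    if a == b:
        return 0
    for key in ((a, b), (b, a)):
        if key in DISTANCES:
            return DISTANCES[key]
    raise KeyError(f"no distance given between {a!r} and {b!r}")

def route_cost(route, distance):
    return sum(distance(a, b) for a, b in zip(route, route[1:]))

direct = ["A1", "B1"]
via_cafeteria = ["A1", "cafeteria", "B1"]

print(route_cost(direct, lenient_distance))         # 1000
print(route_cost(via_cafeteria, lenient_distance))  # 0
```

Calling `route_cost(via_cafeteria, strict_distance)` raises `KeyError` instead, which surfaces the missing data before an optimizer can exploit it.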

Karl Sims evolved simple blocky creatures to walk and swim (video). In the paper, he writes "For land environments, it can be necessary to prevent creatures from generating high velocities by simply falling over" - ISTR the story is that in the first version of the software, the winning creatures were those that grew very tall and simply fell over towards the target.


Karl 'Sims'? Really?

Yes: "The Power of Simulation: What Virtual Creatures Can Teach Us", Katherine Hayles 1999:

The designer's intentions, implicit in the fitness criteria he specifies and the values he assigns to these criteria, become explicit when he intervenes to encourage "interesting" evolutions and prohibit "inelegant" ones ("3-D Morphology", pp. 31, 29). For example, in some runs creatures evolved who achieved locomotion by exploiting a bug in the way conservation of momentum was defined in the world's artifactual physics: they developed appendages like paddles and moved by hitting themselves with their own paddles. "It is important that the physical simulation be reasonably accurate when optimizing for creatures that can move within it," Sims writes. "Any bugs that allow energy leaks from non-conservation, or even round-off errors, will inevitably be discovered and exploited by the evolving creatures," ("Evolving Virtual Creatures," p. 18). In the competitions, other creatures evolved to exceptionally tall statures and controlled the cube by simply falling over on it before their opponents could reach it ("3-D Morphology," p. 29.) To compensate, Sims used a formula that took into account the creature's height when determining its starting point in the competition; the taller the creature, the further back it had to start. Such adjustments clearly show that the meaning of the simulation emerges from a dynamic interaction between the creator, the virtual world (and the real world on which its physics is modeled), the creatures, the computer running the programs, and in the case of visualizations, the viewer watching the creatures cavort. 
In much the same way that the recursive loops between program modules allow a creature's morphology and brain to co-evolve together, so recursive loops between these different components allow the designer's intent, the creatures, the virtual world, and the visualizations to co-evolve together into a narrative that viewers find humanly meaningful...compared to artificial intelligence, artificial life simulations typically front-load less intelligence in the creatures and build more intelligence into the dynamic process of co-adapting to well-defined environmental constraints. When the environment fails to provide the appropriate constraints to stimulate development, the creator steps in, using his human intelligence to supply additional adaptive constraints, for example when Sims put a limit on how tall the creatures can get.


People do this as well. They wanted to eliminate corruption from public construction projects in a certain country, and created a numbers-based evaluation system for tenders. The differences in price offered were taken into account with a weight of 1 and the differences in penalties / liquidated damages with a weight of 6. I am not sure what the best English term for the latter is, but basically it was the construction company saying "if the project is late, I am willing to pay X amount of penalty per day". Usually most companies offer something like 0.1% of the price. One company offered 2%, which means that if they are 10-15 days late their whole profit is gone, and as this was taken into account with a weight of 6, they could offer an outrageous price and the rules still forced the government to accept their offer. It turned out that it was not just a bold gaming of the rules, it was corruption as well: there was no law requiring that the offered penalty actually be enforced in case of late delivery, and the government's man could decide to demand a smaller penalty if he felt the vendor was not entirely at fault. So most likely they simply planned to bribe that guy if they were late. Thus the new rules simply moved the bribery into a different stage of the process.
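A rough sketch of how such a weighted evaluation can be gamed. Only the weights (1 and 6) and the 0.1% and 2% penalty offers come from the story above. The normalization formula, the prices, and the bidder names are assumptions made up for illustration, since the real scoring rules are not given.

```python
PRICE_WEIGHT = 1    # from the story
PENALTY_WEIGHT = 6  # from the story

def score_bids(bids):
    """bids maps a bidder name to (price, daily penalty as a fraction of price).

    Assumed scoring: each criterion is normalized to at most 1.0,
    then combined with the two weights.
    """
    lowest_price = min(price for price, _ in bids.values())
    highest_penalty = max(penalty for _, penalty in bids.values())
    scores = {}
    for name, (price, penalty) in bids.items():
        price_score = lowest_price / price         # 1.0 for the cheapest bid
        penalty_score = penalty / highest_penalty  # 1.0 for the harshest penalty
        scores[name] = PRICE_WEIGHT * price_score + PENALTY_WEIGHT * penalty_score
    return scores

bids = {
    "typical": (100.0, 0.001),  # 0.1% per day at a normal price
    "gamer": (300.0, 0.02),     # 2% per day at three times the price
}
scores = score_bids(bids)
winner = max(scores, key=scores.get)
print(scores, winner)
```

Under these assumptions the bid at three times the price wins comfortably, because the penalty criterion dominates and costs nothing to promise if it is never enforced.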

When humans are motivated by entirely external incentives, like "fsck everything, let's make as much money on this project as possible", they behave just like the vibrating AI-Messi.

Which means - maybe we need to figure out what the heck the inner motivation in humans is that makes them want to do the sensible thing, and how to emulate it.


Another, famous, example. At one time, somewhere in India there were a lot of cobras, which are dangerous. So the government (it happened to be the British Raj at the time) decided to offer a bounty for dead cobras. That worked for a while, until people figured out that they could breed cobras for the bounty. Then the government worked out what was going on, and cancelled the bounty. So then the cobra breeders released all their now-valueless cobras into the wild.

(According to Wikipedia this particular instance isn't actually well documented, but a similar one involving rats in Hanoi is.)


This effect also exists in software development:

I see this failure in analysis all the time.

When people want to change the behavior of others, they find some policy and incentive that would encourage the change they desire, but never stop to ask how else people might react to that change in incentives.

Anyone ever come across any catchy name or formulation for this particular failure mode?

Perverse incentives.


Cobra effect (see the Wikipedia page I linked before). Law of unintended consequences. (Perhaps the former is a little too narrow and the latter a little too broad.)

Isn't this an example of a reflection problem? We induce this change in a system, in this case an evaluation metric, and now we must predict not only the next iteration but the stable equilibria of this system.

Goodhart's Law

Oops, double post; V_V already said that.

[This comment is no longer endorsed by its author]

I believe this is called a "red queen race"

This is not correct, at least in common usage.

A Red Queen's Race is an evolutionary competition in which absolute position does not change. The classic example is the arms race between foxes and rabbits that results in both becoming faster in absolute terms, but the rate of predation stays fixed. (The origin is Lewis Carroll: "It takes all the running you can do, just to stay in the same place.")

A Red Queen's Race is an evolutionary competition in which absolute position does not change.

You mean relative, not absolute.

I've also seen a more general interpretation: the Red Queen situation is where staying still (doing nothing) makes you worse off as time passes; you need to run forward just to stay in the same place.

You mean relative, not absolute.

Yes, yes I did. Thanks for the correction.


I think this is analogous to what's happening here - you create better incentives, they create better ways to get around those incentives, nothing changes. I didn't know that this wasn't the common usage, as I got it from this Overcoming Bias post:


People do this as well.

This is known as Goodhart's law or Campbell's law.

There's an old story about a neural network that was trained to recognize pictures of tanks for the military. It worked almost perfectly. When someone else tried the program, they found it didn't work at all.

It turned out that they had taken all the pictures of tanks on cloudy days, and all the other pictures on different days. The neural network simply learned to recognize the weather.

This actually sort of happens in modern deep neural networks. They can learn that roads are parts of cars, or that people's lips are part of bubbles, or that arms are parts of dumbbells.
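The tank story can be reproduced in miniature. The sketch below is an illustration under made-up assumptions, not a reconstruction of any real experiment: each "photo" is reduced to two binary features, the 80% reliability of the shape feature is arbitrary, and the learner is a trivial single-feature rule picker rather than a neural network.

```python
import random

def make_examples(n, brightness_tracks_label, rng):
    """Each example is ((brightness, shape), label); label 1 means 'tank'."""
    examples = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # 'shape' is the feature we hoped to learn: right 80% of the time.
        shape = label if rng.random() < 0.8 else 1 - label
        # 'brightness' is dark (0) or bright (1); tanks were shot on cloudy days.
        brightness = (1 - label) if brightness_tracks_label else rng.randint(0, 1)
        examples.append(((brightness, shape), label))
    return examples

def accuracy(rule, examples):
    return sum(rule(x) == y for x, y in examples) / len(examples)

def fit_stump(examples):
    """Pick the single-feature rule with the best training accuracy."""
    rules = {
        "dark means tank": lambda x: 1 - x[0],
        "bright means tank": lambda x: x[0],
        "shape means tank": lambda x: x[1],
    }
    best = max(rules, key=lambda name: accuracy(rules[name], examples))
    return best, rules[best]

rng = random.Random(0)
train = make_examples(1000, brightness_tracks_label=True, rng=rng)
test = make_examples(1000, brightness_tracks_label=False, rng=rng)

name, rule = fit_stump(train)
train_acc = accuracy(rule, train)
test_acc = accuracy(rule, test)
print(name, train_acc, test_acc)
```

The learner picks the weather rule because it is perfect on the training photos, then drops to roughly chance on photos where weather and tanks are unrelated.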

There's an old story about a neural network that was trained to recognize pictures of tanks for the military.

That seems to be the go-to story for NNs. I remember hearing it back in grad school. Though now I'm wondering if it is just an urban legend.

Some cursory googling shows others wondering the same thing.

Does anyone have an actual cite for this? Or, if not an actual cite, have you at least heard a concrete cite for it once?

I believe that Marvin Minsky is the origin of the story. He tells it in the second half of the 3 minute video Embarrassing mistakes in perceptron research. This version of the story is not so simple as sunny/cloudy, but that the two rolls of film were developed differently, leading to a uniform change in brightness.

The first half of the video is a similar story about distinguishing musical scores using the last note in the lower right corner. Somewhere,* quite recently, I heard people argue that this story was more credible because Minsky makes a definite claim of direct involvement and maybe the details are a little more concrete.

[How did I find this? The key was knowing the attribution to Minsky. It comes up immediately searching for "Minsky pictures of tanks." But "Minsky tank neural net" is a poor search because of the contributions of David Tank. Note that the title of the video contains the word "perceptron," suggesting that the story is from the 60s.]

* Now that I read Gwern's comment, it must have been his discussion on Reddit a month ago.

Added, a week later: I'm nervous that Minsky really is the origin. When I first wrote this, I thought I had seen him tell the story in two places.

Previous discussion:

I would say that given the Minsky story and how common a problem overfitting is, I believe something at least very similar to the tank story did happen, and if it didn't, then there are nevertheless real problems with neural nets overfitting.

(That said, I think modern deep nets may get too much of a bad rap on this issue. Yes, they might do weird things like focusing on textures or whatever is going on in the adversarial examples, but they still recognize very well out-of-sample, and so they are not simply overfitting to the test set like in these old anecdotes. Their problems are different.)

This isn't an example of overfitting, but of the training set not being iid. You wanted a random sample of pictures of tanks, but you instead got a highly biased sample that is drawn from a different distribution than the test set.

"This isn't an example of overfitting, but of the training set not being iid."

Upvote for the first half of that sentence, but I'm not sure how the second applies. The set of tanks is iid; the issue is that the creators of the training set allowed tank/not tank to be correlated with an extraneous variable. It's like having a drug trial where the placebos are one color and the real drug is another.

I guess I meant it's not iid from the distribution you really wanted to sample. The hypothetical training set is all possible pictures of tanks, but you sampled only the ones taken during daytime.

I'm not sure you understand what "iid" means. It means that each sample is drawn from the same distribution, and each sample is independent of the others. The term "iid" isn't doing any work in your statement; you could just say "It's not from the distribution you really want to sample", and it would be just as informative.


Eurisko had another example where Lenat would leave it running at night, and would find that it had shut itself off come morning - it kept creating a heuristic that doing nothing was better than doing something wrong, and Lenat had to hardcode some code that the program couldn't access, which prevented this.


These are all task-specific problem-definition issues that occurred while fine-tuning algorithms (but yes, they do show how things could get out of hand).

Humans already do this very well, for example tax loopholes that are exploited but are not in the 'spirit of the law'.

The ideal (but incredibly difficult) solution would be for AIs to have multiple layers of abstraction, where each decision gets passed up and is then evaluated as "is this really what they wanted", or "am I just gaming the system".

What happens if an AI manages to game the system despite the n layers of abstraction?


This is the fundamental problem that is being researched - the top layer of abstraction would be that difficult-to-define one called "Be Friendly".

Instead of friendly AI maybe we should look at "don't be an asshole" AI (DBAAAI) - this may be simpler to test and monitor.

Let me clarify why I asked. I think the "multiple layers of abstraction" idea is essentially "build in a lot of 'manual' checks that the AI isn't misbehaving", and I don't think that is a desirable or even possible solution. You can write n layers of checks, but how do you know that you don't need n+1?

The idea being--as has been pointed out here on LW--that what you really want and need is a mathematical model of morality, which the AI will implement and which moral behaviour will fall out of without you having to specify it explicitly. This is what MIRI are working on with CEV & co.

Whether or not CEV, or whatever emerges as the best model to use, is gameable is itself a mathematical question,[1] central to the FAI problem.

[1] There are also implementation details to consider, e.g. "can I mess with the substrate" or "can I trust my substrate".

There was something else going on, though. The AI was crafting super weapons that the designers had never intended. Players would be pulled into fights against ships armed with ridiculous weapons that would cut them to pieces.

Checking into this one, I don't think it's a real example of learning going wrong, just a networking bug involving a bunch of low-level stuff. It would be fairly unusual for a game like Elite Dangerous to have game AI using any RL techniques (the point is for it to be fun, not hard to beat, and they can easily cheat), and the forum post & news coverage never say it learned to exploit the networking bug. Some of the comments in that thread describe it as random and somewhat rare, which is not consistent with it learning a game-breaking technique. Eventually I found a link to a post by an ED programmer, Mark Allen, who explains what went wrong with his code:

...Prior to 1.6/2.1 the cached pointer each weapon held to its data was a simple affair pointing at a bit of data loaded from resources, but as part of the changes to make items modifiable I had to change this so it could also be a pointer to a block of data constructed from a base item plus a set of modifiers - ideally without the code reading that data caring (or even knowing) where it actually came from and therefore not needing to be rewritten to cope. This all works great in theory, and then in practice, up until a few naughty NPC's got into the mix and decided to make a mess. I'll gloss over a few details here, but the important information is that a specific sequence of events relating to how NPCs transfer authority from one players' machine to another, combined with some performance optimisations and an otherwise minor misunderstanding on my part of one of the slightly obscure networking functions got the weapon into an odd state. The NPC's weapon which should have been a railgun and had all the correct data for a railgun, but the cached pointer to its weapon data was pointing somewhere else. Dangling pointers aren't all that uncommon (and other programmers may know the pains they can cause!) but in this case the slightly surprising thing was that it would always be a pointer to a valid WeaponData...It then tells the game to fire 12 shots but now we're outside the areas that use the cached data, the weapon manager knows its a railgun and dutifully fires 12 railgun shots :) . Depending on which machine this occurred on exactly it would either be as a visual artefact only that does no damage, or (more rarely but entirely possible) the weapon would actually fire 12 shots and carve a burning trail of death through the space in front of it. The hilarious part (for people not being aimed at) is that the bug can potentially cause hybrids of almost any two weapons... 
In my testing I've seen cases of railguns firing like slugshots, cannons firing as fast as multicannons, or my favourite absurd case of a Huge Plasma Accelerator firing every frame because it thought it was a beam laser... Ouch.

(I would also consider the mascara example to not be an example of misbehaving but dataset bias. The rest check out.)


Algorithms claiming credit in Eurisko:

this link doesn't work. it looks like it was moved to this URL on the same site (

Thanks, modified!

As I interpret this, the whole idea of reward systems goes down the drain. The agent, as humans always do, will find a way to cheat, because in (almost) every problem there are loopholes, which certainly can't all be detected up front. As I see it, we can't use the same tools as evolution (the carrot and the stick) and expect to get something different from a creature like us, with capacity orders of magnitude bigger, of course.

Best Regards

One doesn't need to close 100% of the loopholes, only make it so exploiting them is harder than doing the work legitimately.

The link for the AI crafting a super weapon seems to be broken. Here is a later article that is the best I could find:

Thanks! Link changed.

"Yet, it's usually left out of these scenarios."

It isn't left out of these scenarios; it's critical to most versions of these scenarios: