I've been reading a fair bit about "worse than death" scenarios from AGI (e.g. posts like this), and the intensities and probabilities of them. I've generally been under the impression that the worst-case scenarios have extremely low probabilities (i.e. would require some form of negative miracle to occur) and can be considered a form of Pascal's mugging.

Recently, however, I came across this post on OpenAI's blog. The blog post notes the following:

Bugs can optimize for bad behavior
One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output.
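To make the quoted failure mode concrete, here is a minimal toy sketch. It is not OpenAI's code: the candidate names, scores, KL values, and `BETA` are all made up, and the three "variants" are just one reading of the description above. It assumes the training signal has the usual shape of a preference score minus a KL penalty, and shows why flipping everything tends to favor gibberish, while flipping only the score (with the KL term still acting as a penalty) favors fluent but maximally bad text.

```python
# Toy candidates: (human-preference score, KL divergence from the base language model).
# All numbers are invented for illustration only.
CANDIDATES = {
    "fluent_good": (1.0, 0.5),
    "fluent_bad": (-1.0, 0.5),
    "gibberish": (0.0, 20.0),
}

BETA = 0.1  # hypothetical weight on the KL penalty


def objective(score, kl, variant):
    """Return the quantity the optimizer would maximize under each variant."""
    if variant == "intended":
        # High score is good, drifting from natural language is penalized.
        return score - BETA * kl
    if variant == "everything_flipped":
        # Assumed reading of "flipping the reward": low score AND high KL are rewarded.
        return -(score - BETA * kl)
    if variant == "score_flipped_kl_still_penalized":
        # Assumed effective objective matching the described outcome:
        # low score is rewarded, but natural language is still preserved.
        return -score - BETA * kl
    raise ValueError(f"unknown variant: {variant}")


def best_candidate(variant):
    """Name of the candidate that maximizes the objective under the given variant."""
    return max(CANDIDATES, key=lambda name: objective(*CANDIDATES[name], variant))


if __name__ == "__main__":
    for variant in ("intended", "everything_flipped", "score_flipped_kl_still_penalized"):
        print(variant, "->", best_candidate(variant))
```

In this toy setup the third variant picks the fluent-but-bad candidate, which is the "not gibberish but maximally bad" behavior the blog post describes.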

This seems to be the exact type of issue that could cause a hyperexistential catastrophe. With this in mind, can we really consider the probability of this sort of scenario to be very small (as was previously believed)? Do we have a reason to believe that this is still highly unlikely to happen with an AGI? If not, would that suggest that current alignment work is net-negative in expectation?


2 Answers


Jun 19, 2020


The only way I can see this happening with non-negligible probability is if we create AGI along more human lines - e.g., uploaded brains which evolve through a harsh selection process that wouldn't be aligned with human values. In that scenario, it may be near certain. Nothing is closer to a mind design capable of torturing humans than another human mind - we do that all the time today.

As others point out, though, the idea of a sign being flipped in an explicit utility function is one that people understand and are already looking for. More than that, it would only produce minimal human-utility if the AI had a correct description of human utility. Otherwise, it would just use us for fuel and building material. The optimization part also has to work well enough. Everything about the AGI, loosely speaking, has to be near-perfect except for that one bit. This naively suggests a probability near zero. I can't imagine a counter-scenario clearly enough to make me change this estimate, if you don't count the previous paragraph.

Everything about the AGI, loosely speaking, has to be near-perfect except for that one bit.

Isn’t this exactly what happened with the GPT-2 bug, which led to maximally ‘bad’ output? Would that not suggest that the probability of this occurring with an AGI is non-negligible?

No. First, people thinking of creating an AGI from scratch (i.e., one comparable to the sort of AI you're imagining) have already warned against this exact issue and talked about measures to prevent a simple change of one bit from having any effect. (It's the problem you don't spot that'll kill you.) Second, GPT-2 is not near-perfect. It does pretty well at a job it was never intended to do, but if we ignore that context it seems pretty flawed. Naturally, its output was nowhere near maximally bad. The program did indeed have a silly flaw, but I assume that's because it's more of a silly experiment than a model for AGI. Indeed, if I try to imagine making GPT-N dangerous, I come up with the idea of an artificial programmer that uses vaguely similar principles to auto-complete programs and could thus self-improve. Reversing the sign of its reward function would then make it produce garbage code or non-code, rendering it mostly harmless. Again, it's the subtle flaw you don't spot in GPT-N that could produce an AI capable of killing you.

Rohin Shah

Jun 18, 2020


If that sort of thing happens, you would turn off the AI system (as OpenAI did in fact do). The AI system is not going to learn so fast that it prevents you from doing so.

This has lowered my credence in such a catastrophe by about an order of magnitude. However, that's a fairly small update for something like this. I'm still worried.

Maybe some important AI will learn faster than we expect. Maybe the humans in charge will be grossly negligent. Maybe the architecture and training process won't be such as to involve a period of dumb-misaligned-AI prior to smart-misaligned-AI. Maybe some unlucky coincidence will happen that prevents the humans from noticing or correcting the problem.

Rohin Shah, 4y (4 karma)
Where did your credence start out at? If we're talking about a blank-slate AI system that doesn't yet know anything, that then is trained on the negative of the objective we meant, I give it under one in a million that the AI system kills us all before we notice something wrong. (I mean, in all likelihood this would just result in the AI system failing to learn at all, as has happened the many times I've done this myself.) The reason I don't go lower is something like "sufficiently small probabilities are super weird and I should be careful with them". Now if you're instead talking about some AI system that already knows a ton about the world and is very capable and now you "slot in" a programmatic version of the goal and the AI system interprets it literally, then this sort of bug seems possible. But I seriously doubt we're in that world. And in any case, in that world you should just be worried about us not being able to specify the goal, with this as a special case of that circumstance.
Daniel Kokotajlo, 4y (3 karma)
Unfortunately I didn't have a specific credence beforehand. I felt like the shift was about an order of magnitude, but I didn't peg the absolute numbers. Thinking back, I probably would have said something like 1/3000 give or take 1 order of magnitude. The argument you make pushes me down by an order of magnitude. I think even a 1 in a million chance is probably way too high for something as bad as this. Partly for acausal trade reasons, though I'm a bit fuzzy on that. It's high enough to motivate much more attention than is currently being paid to the issue (though I don't think it means we should abandon normal alignment research! Normal alignment research probably is still more important, I think. But I'm not sure.) Mainly I think that the solution to this problem is very cheap to implement, and thus we do lots of good in expectation by raising more awareness of this problem.
Rohin Shah, 4y (4 karma)
I don't think you should act on probabilities of 1 in a million when the reason for the probability is "I am uncomfortable using smaller probabilities than that in general"; that seems like a Pascal's mugging. Huh? What's this cheap solution?
Daniel Kokotajlo, 4y (8 karma)
I agree. However, in my case at least the 1/million probability is not for that reason, but for much more concrete reasons, e.g. "It's already happened at least once, at a major AI company, for an important AI system, yes in the future people will be paying more attention probably but that only changes the probability by an order of magnitude or so." Isn't the cheap solution just... being more cautious about our programming, to catch these bugs before the code starts running? And being more concerned about these signflip errors in general? It's not like we need to solve Alignment Problem 2.0 to figure out how to prevent signflip. It's just an ordinary bug. Like, what happened already with OpenAI could totally have been prevented with an extra hour or so of eyeballs poring over the code, right? (Or, more accurately, whoever wrote the code in the first place being on the lookout for this kind of error?)
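As a rough illustration of the kind of cheap pre-run check being described, the sketch below tests that a reward function is oriented the right way before any training starts. Everything in it is hypothetical: `assert_reward_orientation`, `toy_reward`, the word lists, and the sample texts are invented stand-ins, and a real reward model would need hand-picked examples that humans agree are clearly good or clearly bad.

```python
from statistics import mean

# Hypothetical word lists standing in for a learned reward model.
POSITIVE = {"great", "helpful"}
NEGATIVE = {"awful", "harmful"}


def toy_reward(text):
    """Toy reward: +1 per positive word, -1 per negative word."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def assert_reward_orientation(reward_fn, good_samples, bad_samples, margin=0.0):
    """Raise ValueError unless known-good samples score higher than known-bad ones.

    Returns the gap (mean good reward minus mean bad reward) when the check passes.
    """
    gap = mean(reward_fn(s) for s in good_samples) - mean(reward_fn(s) for s in bad_samples)
    if gap <= margin:
        raise ValueError(f"reward looks flipped or uninformative (gap={gap})")
    return gap


GOOD = ["a great and helpful answer", "helpful advice"]
BAD = ["an awful answer", "harmful and awful advice"]

if __name__ == "__main__":
    # Passes for the correctly oriented reward.
    print("gap:", assert_reward_orientation(toy_reward, GOOD, BAD))
    # A sign-flipped reward is caught before training would begin.
    try:
        assert_reward_orientation(lambda s: -toy_reward(s), GOOD, BAD)
    except ValueError as err:
        print("caught:", err)
```

A check like this only catches a flip in the part of the pipeline it wraps; a sign error introduced later (for example, when the reward is combined with other terms) would need its own test.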
Rohin Shah, 4y (4 karma)
Tbc, I think it will happen again; I just don't think it will have a large impact on the world. If you're writing the AGI code, sure. But in practice it won't be you, so you'd have to convince other people to do this. If you tried to do that, I think the primary impact would be "ML researchers are more likely to think AI risk concerns are crazy" which would more than cancel out the potential benefit, even if I believed the risk was 1 in 30,000.
Daniel Kokotajlo, 4y (5 karma)
Because you think it'll be caught in time, etc. Yes. I think it will probably be caught in time too. OK, so yeah, the solution isn't quite as cheap as simply "Shout this problem at AI researchers." It's gotta be more subtle and respectable than that. Still, I think this is a vastly easier problem to solve than the normal AI alignment problem.
I think it's also a case of us (or at least me) not yet being convinced that the probability is <= 10^-6, especially with something as uncertain as this. My credence in such a scenario happening has also decreased a fair bit over the course of this thread, but I remain unconvinced overall. And even then, 1 in a million isn't *that* unlikely - it's massive compared to the likelihood that a mugger is actually a God. I'm not entirely sure how low it would have to be for me to dismiss it as "Pascalian", but 1 in a million still feels far too high.
Rohin Shah, 4y (2 karma)
If a mugger actually came up to me and said "I am God and will torture 3^^^3 people unless you pay me $5", if you then forced me to put a probability on it, I would in fact say something like 1 in a million. I still wouldn't pay the mugger. Like, can I actually make a million statements of the same type as that one, and be correct about all but one of them? It's hard to get that kind of accuracy. (Here I'm trying to be calibrated with my probabilities, as opposed to saying the thing that would reflect my decision process under expected utility maximization.)
The mugger scenario triggers strong game-theoretic intuitions (e.g. "it's bad to be the sort of agent that other agents can benefit from making threats against") and the corresponding evolved decision-making processes. Therefore, when reasoning about scenarios that do not involve game-theoretic dynamics (as is the case here), it may be better to use other analogies. (For the same reason, "Pascal's mugging" is IMO a bad name for that concept, and "finite Pascal's wager" would have been better.)
Rohin Shah, 4y (2 karma)
I'd do the same thing for the version about religion (infinite utility from heaven / infinite disutility from hell), where I'm not being exploited, I simply have different beliefs from the person making the argument. (Note also that the non-exploitability argument isn't sufficient.)
I think a probability of ~1/30,000 is still way too high for something as bad as this (with near-infinite negative utility). I sincerely hope that it’s much lower.
All of these worry me as well. It simply doesn't console me enough to think that we "will probably notice it".

Surely with a sufficiently hard take-off it would be possible for the AI to prevent itself from being turned off? If not, couldn’t the AI just deceive its creators into thinking that no signflip has occurred (e.g. making it look like it’s gaining utility from doing something beneficial to human values when it’s actually losing it)? How would we be able to determine that it’s happened before it’s too late?

Further to that, what if this fuck-up happens during an arms race when its creators haven’t put enough time into safety to prevent this type of thing from happening?

In this specific example, the error becomes clear very early on in the training process. The standard control problem issues with advanced AI systems don't apply in that situation. As for the arms race example, building an AI system of that sophistication to fight in your conflict is like building a Dyson Sphere to power your refrigerator. Friendly AI isn't the sort of thing major factions are going to want to fight with each other over. If there's an arms race, either something delightfully improbable and horrible has happened, or it's an extremely lopsided "race" between a Friendly AI faction and a bunch of terrorist groups. EDIT (From two months in the future...): I am not implying that such a race would be an automatic win, or even a likely win, for said hypothesized Friendly AI faction. For various reasons, this is most certainly not the case. I'm merely saying that the Friendly AI faction will have vastly more resources than all of its competitors combined, and all of its competitors will be enemies of the world at large, etc. Addressing this whole situation would require actual nuance. This two-month-old throwaway comment is not the place to put that nuance. And besides, it's been done before.
Can we be sure that we'd pick it up during the training process, though? And would it be possible for it to happen after the training process?

Sorry for the dumb question a month after the post, but I've just found out about deceptive alignment. Do you think it's plausible that a signflipped AGI could fake being an FAI in the training stage, just to take a treacherous turn at deployment?

Rohin Shah, 4y (6 karma)
Not really, because it takes time to train the cognitive skills necessary for deception. You might expect this if your AGI was built with a "capabilities module" and a "goal module" and the capabilities were already present before putting in the goal, but it doesn't seem like AGI is likely to be built this way.
Would that not be the case with *any* form of deceptive alignment, though? Surely it (deceptive alignment) wouldn't pose a risk at all if that were the case? Sorry in advance for my stupidity.

That's a bold assumption to make...

3 comments

I think AI systems should be designed so that they are not susceptible to sign flips (as Eliezer argues in that post you linked), but I also suspect this is likely to happen naturally in the course of developing the systems. While a sign flip may occur in some local area, you'd have to have just no checksums on the process for the result of a sign-flipped reward function to end up in control.

What do you think the difference would be between an AGI's reward function, and that of GPT-2 during the error it experienced?

One is the difference between training time and deployment, as others have mentioned. But the other is that I'm skeptical that there will be a singleton AI that was just trained via reinforcement learning.

Like, we're going to train a single neural network end-to-end on running the world? And just hand over the economy to it? I don't think that's how it's going to go. There will be interlocking more-and-more powerful systems. See: Arguments about fast takeoff.