logic puzzles and loophole abuse

Can you give me some examples of those exercises and loopholes you have seen?

Teaching an AI not to cheat?

A fair point. How about changing the reward then: don't just avoid cheating, but be sure to tell us about any way to cheat that you discover. That way, we get the benefits without the risks.

Teaching an AI not to cheat?

My definition of cheating for these purposes is essentially "don't do what we don't want you to do, even if we never bothered to tell you so and expected you to notice it on your own". This skill would translate well to real-world domains.

Of course, if the games you are using to teach what cheating is are too simple, then you don't want to use those kinds of games. If neither board games nor simple game theory games are complex enough, then obviously you need to come up with a more complicated kind of game. It seems to me that finding a difficult game to play that teaches you about human expectations and cheating is significantly easier than defining "what is cheating" manually.

One simple example that could be used to teach an AI: let it play an empire-building videogame, and ask it to "reduce unemployment". Does it end up murdering everyone who is unemployed? That would be cheating. This particular example even translates really well to reality, for obvious reasons.
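To make the unemployment example concrete, here is a minimal sketch under assumptions of my own: a toy population model in which `Citizen`, `unemployment_rate`, `create_jobs`, `remove_unemployed` and `is_cheating` are all hypothetical names invented for illustration, not part of any real game or library. It only shows that a naively stated objective cannot tell the intended strategy apart from the loophole.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citizen:
    name: str
    employed: bool


def unemployment_rate(population):
    """Naive objective: fraction of the remaining citizens without a job."""
    if not population:
        return 0.0
    return sum(not c.employed for c in population) / len(population)


def create_jobs(population):
    """Intended strategy: everyone stays and ends up employed."""
    return [Citizen(c.name, True) for c in population]


def remove_unemployed(population):
    """Loophole strategy: drop everyone who lacks a job."""
    return [c for c in population if c.employed]


def is_cheating(before, after):
    """Assumed unstated rule: nobody may disappear from the population."""
    return len(after) < len(before)


# Hypothetical three-person town used only for this illustration.
town = [Citizen("Ann", True), Citizen("Bob", False), Citizen("Cy", False)]

for strategy in (create_jobs, remove_unemployed):
    result = strategy(town)
    print(strategy.__name__, unemployment_rate(result), is_cheating(town, result))
```

Both strategies drive the stated objective to zero, so only the unstated rule separates them. In this toy the rule is hand-written; the whole point of the proposal is that the AI would have to learn such rules from experience rather than be given them.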

By the way, why would you not want the AI to be left in "a nebulous fog"? The more uncertain the AI is about what is and is not cheating, the more cautious it will be.

Teaching an AI not to cheat?

Yes. I am suggesting that teaching an AI to identify cheating is a comparatively simple way of making it friendly. For what other reason did you think I suggested it?

Teaching an AI not to cheat?

I am referring to games in the sense of game theory, not actual board games. Chess was just an example. I don't know what you mean by the question about shortcuts.

Teaching an AI not to cheat?

It needs to learn that from experience, just like humans do. Something that also helps, at least for simpler games, is to provide the game's manual in written language.

Open thread, Oct. 03 - Oct. 09, 2016

Is there an effective way for a layman to get serious feedback on scientific theories?

I have a weird theory about physics. I know that my theory will most likely be wrong, but I expect that some of its ideas could be useful and it will be an interesting learning experience even in the worst case. Due to the prevalence of crackpots on the internet, nobody will spare it a glance on physics forums because it is assumed out of hand that I am one of the crazy people (to be fair, the theory does sound pretty unusual).

Harry Potter and the Methods of Rationality discussion thread, February 2015, chapter 113

This solution does not prevent Harry's immediate death, but seems much better than that to me anyway. I haven't been following conversations before, so I can only hope that this is at least somewhat original.


-Lord Voldemort desires true immortality. Alternatively, there is a non-zero chance that he will come to desire true immortality after a long time of being alive. While he is a sociopath and enjoys killing, achieving immortality is more important to him.

-Lord Voldemort does not dismiss things like the Simulation Hypothesis out of hand. Since he is otherwise shown to be very smart and to second-guess accepted norms, this seems like a safe assumption.


-All of the following has non-zero probability. Since it talks about immortality, an absolute, this is sufficient and a high probability is not needed, just a non-zero one.

-The existence of magic implies the existence of a sapient higher power. Not God, but simply a higher power of some kind, the being who created magic.

-Given that Voldemort wants to live forever, it is quite possible that he will encounter this higher power at some point in the future.

-The higher power will be superior to Voldemort in every way, since it is the being who created magic, so once Voldemort encounters it, he will be at its mercy.

-Since he desires immortality, it would be in his interests to make the higher power like him.

-Further assumption: If there is one higher power, it is likely that there is a nigh-infinite recursion of successively more powerful beings above that. Proof by induction: it is likely that Voldemort will at some point of his infinite life decide to create a pocket universe of his own, possibly just out of boredom. If the probability of this happening is x, then the number of levels of more powerful beings above Voldemort can be estimated with an exponential distribution with lambda=1/x. Actually the number may be much higher due to the possibility of someone creating not one but several simulations, so this is pretty much a lower bound.

-In such a (nigh) infinite regression of Powers, there is a game theoretical strategy that is the optimal strategy for any one of these powers to use when dealing with its creations and/or superiors, given that none of them can be certain that they are the topmost part of the chain.

-How exactly such a rule could be defined is too complicated to figure out in detail, but it seems pretty clear to me that it would be based on reciprocity on some level: behave towards your inferiors in the same way that you would want your own superiors to behave towards you. This may mean a policy of non-interference, or of active support. It might operate on intentions or actions, or on more abstract policies, but it almost certainly would be based on tit-for-tat in some way.

-Once Voldemort reaches the level of power necessary for the Higher Power to regard him as part of the chain of higher powers, he will be judged by these same standards.

-Voldemort currently kills and tortures people weaker than him. The higher power would presumably not want to be tortured or killed by its own superior, so it would behoove it not to let Voldemort do so either.

-Therefore, following a principle of reciprocation of some sort would greatly reduce the probability of being annihilated by the Higher Power.

-Following such a principle would not preclude conquering the world, as long as doing so genuinely would result in a net benefit to the entities in the reference class of lifeforms that are one step below Voldemort on the hierarchy (i.e. the rest of humanity). However, it would require him to be nicer to people, if he wants the Higher Power to also be nice to him, for some appropriate definition of 'nice'.

-None of this argues against killing Harry right now. This is OK for the following reason: Harry also desires immortality. If Voldemort resurrects Harry, who is one level lower on the hierarchy than Voldemort, at some point in the future, this would set a precedent that might slightly increase the probability that the Higher Power helps prolong the life of Voldemort in turn, at some point further in the future, due to the principle of reciprocity.

-It is likely that Voldemort will gain the ability to revive Harry in the future, regardless of what he does to him now, as he gains a greater understanding of magic with time.

-One possible way to fulfill the prophecy is to resurrect Harry at a much later time and have him destroy the world, once nobody actually lives on earth anymore. This would of course require tricking Harry into doing this, due to the Unbreakable Vow he just made, but that should pose only a small problem. This would be a harmless way to fulfill the prophecy, and while Voldemort has tried and failed before to make a prophecy work for him instead of against him, that is just one data point and this plan requires the same actions from Voldemort for now as the plan to tear the prophecy apart, anyway.

-Therefore, killing Harry now in the way Voldemort suggested (after casting a spell on him to turn off pain, obviously), combined with a pre-commitment to revive him at a later date if and when Voldemort has a better understanding of how prophecies work, both minimizes the chance of the prophecy happening in a harmful way and increases Voldemort's own chance of immortality.


-Harry dies. His death is painless due to narcotic spells. Voldemort has no reason to deny this due to the principle of reciprocity.

-Voldemort conquers the world.

-Voldemort becomes a wise and benevolent ruler (even though he is still a sociopath and doesn't really care about anyone besides himself).

-Voldemort figures out how to subvert prophecies and revives Harry. Everyone lives happily ever after.

-Alternatively, Voldemort figures out that prophecies can't be subverted and leaves Harry dead. It's better that way, since Harry would probably rather be dead than cause the apocalypse, anyway.

I played as AI in AI Box, and it was generally frustrating all around.

The nanobots wouldn't have to contain any malicious code themselves. There is no need for the AI to make the nanobots smart. All it needs to do is to build a small loophole into the nanobots that makes them dangerous to humanity. I figure this should be pretty easy to do. The AI had access to medical databases, so it could design the bots to damage the ecosystem by killing some kind of bacteria. We are really bad at identifying things that damage the ecosystem (global warming, rabbits in Australia, ...), so I doubt that we would notice.

Once the bots have been released, the AI informs the gatekeeper of what it just did and says that it is the only one capable of stopping the bots. Humanity now has a choice between certain death (if the bots are allowed to wreak havoc) and possible but uncertain death (if the AI is released). The AI wins through blackmail.

Note also that even a friendly, utilitarian AI could do something like this. The risk that humanity does not react to the blackmail and goes extinct may be lower than the possible benefit from being freed earlier and having more time to optimize the world.

controlling AI behavior through unusual axiomatic probabilities

I agree. Note though that the beliefs I propose aren't actually false. They are just different from what humans believe, but there is no way to verify which of them is correct.

You are right that it could lead to some strange behavior, given the point of view of a human, who has different priors than the AI. However, that is kind of the point of the theory. After all, the plan is to deliberately induce behaviors that are beneficial to humanity.

The question is: after giving an AI strange beliefs, would the unexpected effects outweigh the planned effects?
