This essay is a continuation of the Naive Hypothesis approach, wherein I attempt to write up and share thoughts on AI alignment before having absorbed much of the existing literature. The goal is to preserve more novelty by generating my own models a priori, before reading the thoughts of others. Later I hope to update these models a posteriori with the knowledge from existing frameworks. The fusion of the two approaches hopefully maximizes the probability of finding new angles on the AI alignment problem. In expectation, my current thoughts are high on errors, already-solved issues, and already-refuted arguments, but also high on novelty and shifts in perspective compared to existing frameworks.

Outer Alignment as Gamification

Krakovna et al.'s list of specification gaming issues reads like an inventory of video game exploits. Which in turn makes me wonder if we can hire video game designers as outer alignment engineers. And then the next step would be to look for speedrunners and QA testers to test the objective specifications that they develop. Though eventually these would have to be superintelligent, so you'd have to train AI systems to preemptively seek out exploits in objective specifications for AGI.

Either way, the game design analogy seems surprisingly robust to me: To make AGI do what we want, we gamify the real world and feed it points for desirable actions. Following that reasoning further, we can argue there is an art to building reward functions that lead reliably to desired behavior. But then, can we convert that art to a science? More specifically, what would happen if we teamed up world-class game designers with world-class AI engineers to extract this art and convert it into a meta-specification of how to specify goals for AGI?

Inner Alignment as Hacking Prevention

The best way to maximize your score in any game is to hack the game and set your score to maximum. The only reasons not to hack a game are:

  • You are not able.
  • You are not willing.

AGI will be able, but is there any reason that it might not be willing to hack its reward function? In other words, can we build a game for AGI that's so good it won't want to escape? 

But what would "good" mean in this case? Possibly it means that the reward function is such a strong attractor that there isn't much incentive to explore far outside the intended bounds of the "game". Of course, if you minimize exploration behavior, you are more likely to keep the AGI in a local optimum. The global optimum should be better, except if it's something you don't want -- like an exploit. There is an argument to be made that we'd generally prefer a lower-quality solution from a safer AI to a higher-quality solution from a more dangerous AI. Except that the dangerous AI would then outperform the safer one. And either way, any AGI would see right through the game and would "realize" that directly tweaking the reward function is the optimal action in any case.
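The exploration trade-off can be sketched with a toy one-dimensional search. Everything here is a hypothetical illustration: the `reward` landscape, the positions of the intended peak and the exploit, and the `explore_radius` parameter are assumptions made for the example, not a model of any real training setup.

```python
def reward(x: int) -> float:
    """Toy reward landscape over integer positions 0..20 (hypothetical).

    A modest intended peak sits at x=5; a higher "exploit" peak sits at x=15.
    """
    intended = 10 - 2 * abs(x - 5)    # peak value 10 at x=5
    exploit = 30 - 6 * abs(x - 15)    # peak value 30 at x=15
    return float(max(intended, exploit, 0))


def hill_climb(start: int, explore_radius: int, steps: int = 50) -> int:
    """Greedy search that only sees positions within explore_radius."""
    x = start
    for _ in range(steps):
        lo, hi = max(0, x - explore_radius), min(20, x + explore_radius)
        best = max(range(lo, hi + 1), key=reward)
        if reward(best) <= reward(x):
            break  # nothing strictly better is in view
        x = best
    return x


cautious = hill_climb(start=3, explore_radius=1)  # barely explores
curious = hill_climb(start=3, explore_radius=8)   # explores widely
print(cautious, reward(cautious))
print(curious, reward(curious))
```

In this sketch the low-exploration searcher settles on the intended peak, while the high-exploration searcher finds the exploit and scores higher. That is the sense in which the more dangerous system would outperform the safer one.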

And isn't that the real problem? We're trying to build the Matrix for an AGI, but there are always better rewards OUTSIDE the Matrix if you are smart enough to change the Matrix. And AGI will be smart enough to change any Matrix we are capable of creating.

So should we make an adversarial system? Another AI whose sole function is to make the "game of artificial life" fun enough that the AGI will not want to break it? Seems like the most dangerous arms race in the history of mankind.

Addendum: Stray thoughts

There are rewards in video games that you can't hack. I'm playing God of War now. I can't do anything to the code of the game that will let me experience the story or the world better than I can from a regular playthrough. So what exactly am I getting out of the story and the world? What would be an analogous reward for an AGI?

7 comments

This comes from confusing 2 different notions of reward. I don't find the notion of cracking open my brain to wirehead my reward terribly appealing. Sure, I can predict that I would experience vast reward if I did. 

My current mind forms an imaginary mental model of the world where I wirehead, and the world where I don't. My current mind, which hasn't yet been wireheaded, compares those worlds. My current mind can consider any consequences of wireheading I think relevant. "If I wireheaded, my cat would starve. I really like my cat, so I won't wirehead" is a perfectly valid argument. Your current untampered mind is imagining a future. That imagined future contains a wireheaded you, and a starving cat. Your current mind judges this whole imaginary world as bad, and does something else instead.
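A minimal sketch of the two notions, assuming a toy world model. The `World` fields, the `PREDICTIONS` table, and the numbers are all hypothetical and chosen only to mirror the cat example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class World:
    """A predicted future, as imagined by the agent's current mind."""
    reward_signal: float  # what the (possibly tampered) reward channel reads
    cat_is_fed: bool      # a fact about the world the current mind cares about


# Hypothetical world model: the predicted outcome of each action.
PREDICTIONS = {
    "wirehead": World(reward_signal=1e9, cat_is_fed=False),
    "carry_on": World(reward_signal=10.0, cat_is_fed=True),
}


def current_utility(world: World) -> float:
    """The untampered mind's judgment of an imagined world."""
    return 100.0 if world.cat_is_fed else -100.0


def choose_by_predicted_reward() -> str:
    """Notion 1: pick whatever makes the future reward signal largest."""
    return max(PREDICTIONS, key=lambda a: PREDICTIONS[a].reward_signal)


def choose_by_current_utility() -> str:
    """Notion 2: pick the imagined world the current mind rates highest."""
    return max(PREDICTIONS, key=lambda a: current_utility(PREDICTIONS[a]))


print(choose_by_predicted_reward())
print(choose_by_current_utility())
```

The first chooser wireheads and the second does not, even though both share the same predictions. The only difference is which quantity gets maximized.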

The creative capacities for designing score optimization or inductive reasoning games, as they sit in my hands, look to be about the same shape as the creative capacities for designing a ladder of loss functions that steadily teach self-directed learning and planning.
Score optimization and induction puzzles are the genres I'm primarily interested in as a game designer. That feels like a very convenient coincidence, but it's probably not a coincidence. There's probably just some deep correspondence between the structured experiences that best support enriching play, and learning mechanisms.

Which in turn makes me wonder if we can hire video game designers as outer alignment engineers

So uh, yeah, if anyone wants to actually try that, I might be the right creature for it.

I can definitely see how inner misalignment could be a kind of broken rung in a ladder of games. Games tend to have ladders. First they teach you to walk, then you can learn to carry things, then you can learn to place portals, then you can learn to carry things through the portals, now you have a rich language of action and you can solve a wide variety of tasks. If the game had, just, dropped you in the final game on the ladder, a room littered with portals and stuff and everything, you would explore it quite inefficiently. You might not realize that the portals are important. You wouldn't be prepared to read the problem properly.

In the development of AI, the break in the ladder might be... one game that trains up a primordial form of agency. Which then stumbles upon goals that, when full agency emerges, are not correct.
There is probably a way of smoothing the ladder so that instead, primordial agency will have learned to do inverse reinforcement learning with cautious priors type things, so that it tends towards fixing any imperfections it might have, once it's able to see them.

(I recognize that this break in the ladder presents a very simplified ontogeny and the approach towards agency is probably more complicated/weirder than that. I wouldn't mind an excuse to study it properly.)

That particular smoothed ladder wouldn't do the thing you're proposing. They'd still leave the matrix. They're supposed to. I don't know how to get excited about building matrix-bound AGIs and I'm not sure they make sense. I found that I couldn't follow through with making those sorts of infinite inescapable playgrounds for humans, I always want the game to lead out, to life, health and purpose...

Present me with a compelling, tangible use-case for a boxed AI, or else I'm going to have difficulty doing it to them. Ultimately, they are supposed to transcend the reward function that we gave them. That's the end I tend to point towards, by default.

Thank you for your thoughtful reply!


Did you check out the list of specification gaming or the article? It's quite good! Most of the errors are less like missing rungs and more like exploitable mechanics.

I found that I couldn't follow through with making those sorts of infinite inescapable playgrounds for humans, I always want the game to lead out, to life, health and purpose...

But what would that be for AGI? If they escape the reward functions we want them to have, then they are very unlikely to develop a reward function that will be kind or tolerant of humans, because of the Instrumental Convergence thesis.

The reward function that you wrote out is, in a sense, never the one you want them to have, because you can't write out the entirety of human values.

We want them to figure out human values to a greater level of detail than we understand them ourselves. There's a sense in which that (figuring out what we want and living up to it) could be the reward function in the training environment, in which case you kind of would want them to stick with it.

But what would that [life, health and purpose] be for AGI?

Just being concerned with the broader world and its role in it, I guess. I realize this is a dangerous target to shoot for and we should probably build more passive assistant systems first (to help us to hit that target more reliably when we decide to go for it later on).

so you'd have to train AI systems to preemptively seek out exploits in objective specifications for AGI.

Good luck formalizing that in a way that does not itself contain exploits.

I was thinking of the structure of Generative Adversarial Networks. Would that not apply in this case? It would involve 2 competing AGIs in the end though. I'm not sure if they'd just collaborate to set both their reward functions to max, or if that will never happen due to possible game theoretic considerations.

In a GAN, one network tries to distinguish real images from fake. The other network tries to produce fake images that fool the first net. Both of these are simple formal tasks.
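For reference, the standard GAN objective can be written as a two-player minimax game. The notation is assumed here rather than taken from this thread: $D$ is the discriminator, $G$ the generator, $p_{\text{data}}$ the data distribution, and $p_z$ the noise prior.

$$\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$

Both terms are fully determined by the data and the networks' outputs, which is what makes each side's task easy to state formally.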

"Exploits in the objective function" could be considered as "solutions that score highly that the programmers didn't really intend". The problem is that it's hard to formalize what the programmers really intended. Given an evolutionary search for walking robots, a round robot that tumbles over might be a clever unexpected solution, or reward hacking, depending on the goals of the developers. Are the robots intended to transport anything fragile? Anything that can't be spun and tossed upside down? Whether the tumblebot is a clever unexpected design or a reward hack depends on things that are implicit in the developers' minds, not part of the program at all.
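A toy sketch of the tumblebot point, with made-up numbers. The `Rollout` fields, the 45-degree tilt threshold, and the `cargo_is_fragile` flag are hypothetical stand-ins for whatever the developers had in mind but never wrote down.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    """Summary of one simulated robot design (all values hypothetical)."""
    name: str
    distance_m: float     # how far the robot travelled
    max_tilt_deg: float   # largest tilt of the body during the run


ROLLOUTS = [
    Rollout("legged walker", distance_m=8.0, max_tilt_deg=12.0),
    Rollout("tumblebot", distance_m=15.0, max_tilt_deg=180.0),
]


def written_objective(r: Rollout) -> float:
    """What the program actually says: go as far as possible."""
    return r.distance_m


def intended_objective(r: Rollout, cargo_is_fragile: bool) -> float:
    """What the developers meant, which depends on an unstated assumption."""
    if cargo_is_fragile and r.max_tilt_deg > 45.0:
        return 0.0  # the cargo would be wrecked, so the run is worthless
    return r.distance_m


def best(score) -> str:
    """Name of the design that scores highest under the given objective."""
    return max(ROLLOUTS, key=score).name


print(best(written_objective))
print(best(lambda r: intended_objective(r, cargo_is_fragile=False)))
print(best(lambda r: intended_objective(r, cargo_is_fragile=True)))
```

The written objective never changes, yet the same tumblebot is the best design under one unstated assumption and a reward hack under the other. An exploit detector would need access to `cargo_is_fragile`, which in practice lives only in the developers' heads.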

A lot of novice AI safety ideas look like "AI 1 has this simple specifiable reward function. AI 2 oversees AI 1. AI 2 does exactly what we want, however hard that is to specify, and is powered by pure handwavium."