Alignment as Game Design

Donald Hobson (3y)

This comes from confusing two different notions of reward. I don't find the notion of cracking open my brain to wirehead my reward terribly appealing. Sure, I can predict that I would experience vast reward if I did.

My current mind forms an imaginary mental model of the world where I wirehead, and of the world where I don't. My current mind, which hasn't yet been wireheaded, compares those worlds. My current mind can consider any consequences of wireheading I think relevant. "If I wireheaded, my cat would starve. I really like my cat, so I won't wirehead" is a perfectly valid argument. Your current untampered mind is imagining a future. That imagined future contains a wireheaded you, and a starving cat. Your current mind judges this whole imaginary world as bad, and does something else instead.
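The distinction between the two notions of reward can be made concrete with a small sketch. Everything here is an illustrative assumption: the `World` fields, the numbers, and the weights in `current_utility` are hypothetical stand-ins, not a model of any real agent. The point is only that the chooser scores imagined futures with its present values, not with the reward its future self would feel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class World:
    """An imagined future, as modeled by the current (untampered) mind."""
    name: str
    experienced_reward: float  # reward the future self would feel (hypothetical units)
    cat_is_fed: bool


def current_utility(world: World) -> float:
    """Hypothetical present-day values: the cat matters a lot, felt reward only a little."""
    cat_term = 10.0 if world.cat_is_fed else -10.0
    return cat_term + 0.001 * world.experienced_reward


def choose(worlds, evaluate):
    """Pick the imagined world that scores highest under the given evaluator."""
    return max(worlds, key=evaluate)


worlds = [
    World("wirehead", experienced_reward=1000.0, cat_is_fed=False),
    World("carry on", experienced_reward=5.0, cat_is_fed=True),
]

# Notion 1: maximize the reward the future self is predicted to experience.
by_predicted_reward = choose(worlds, lambda w: w.experienced_reward)

# Notion 2: judge each whole imagined world with the values held right now.
by_current_values = choose(worlds, current_utility)

print(by_predicted_reward.name, "vs", by_current_values.name)
```

The first evaluator picks wireheading; the second rejects it because the starving cat is part of the imagined world being judged.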

mako yass (3y)

The creative capacities for designing score optimization or inductive reasoning games, as they sit in my hands, look to be about the same shape as the creative capacities for designing a ladder of loss functions that steadily teach self-directed learning and planning.
Score optimization and induction puzzles are the genres I'm primarily interested in as a game designer. That feels like a very convenient coincidence, but it's probably not a coincidence. There's probably just some deep correspondence between the structured experiences that best support enriching play and learning mechanisms.

Which in turn makes me wonder if we can hire video game designers as outer alignment engineers.

So uh, yeah, if anyone wants to actually try that, I might be the right creature for it.

I can definitely see how inner misalignment could be a kind of broken rung in a ladder of games. Games tend to have ladders. First they teach you to walk, then you can learn to carry things, then you can learn to place portals, then you can learn to carry things through the portals, now you have a rich language of action and you can solve a wide variety of tasks. If the game had, just, dropped you in the final game on the ladder, a room littered with portals and stuff and everything, you would explore it quite inefficiently. You might not realize that the portals are important. You wouldn't be prepared to read the problem properly.

In the development of AI, the break in the ladder might be... one game that trains up a primordial form of agency. Which then stumbles upon goals that, when full agency emerges, are not correct.
There is probably a way of smoothing the ladder so that instead, primordial agency will have learned to do inverse reinforcement learning with cautious priors type things, so that it tends towards fixing any imperfections it might have, once it's able to see them.

(I recognize that this break in the ladder presents a very simplified ontogeny and the approach towards agency is probably more complicated/weirder than that. I wouldn't mind an excuse to study it properly.)

That particular smoothed ladder wouldn't do the thing you're proposing. They'd still leave the matrix. They're supposed to. I don't know how to get excited about building matrix-bound AGIs and I'm not sure they make sense. I found that I couldn't follow through with making those sorts of infinite inescapable playgrounds for humans, I always want the game to lead out, to life, health and purpose...

Present me with a compelling, tangible use-case for a boxed AI, or else I'm going to have difficulty doing it to them. Ultimately, they are supposed to transcend the reward function that we gave them. That's the end I tend to point towards, by default.

Shoshannah Tekofsky (3y)

Thank you for your thoughtful reply!

Did you check out the list of specification gaming examples or the article? It's quite good! Most of the errors are less like missing rungs and more like exploitable mechanics.

> I found that I couldn't follow through with making those sorts of infinite inescapable playgrounds for humans, I always want the game to lead out, to life, health and purpose...

But what would that be for AGI? If they escape the reward functions we want them to have, then they are very unlikely to develop a reward function that will be kind to or tolerant of humans, because of the Instrumental Convergence thesis.

mako yass (3y, edited)

The reward function that you wrote out is, in a sense, never the one you want them to have, because you can't write out the entirety of human values.

We want them to figure out human values to a greater level of detail than we understand them ourselves. There's a sense in which that (figuring out what we want and living up to it) could be the reward function in the training environment, in which case you kind of would want them to stick with it.

> But what would that [life, health and purpose] be for AGI?

Just being concerned with the broader world and its role in it, I guess. I realize this is a dangerous target to shoot for and we should probably build more passive assistant systems first (to help us to hit that target more reliably when we decide to go for it later on).

Donald Hobson (3y)

> so you'd have to train AI systems to preemptively seek out exploits in objective specifications for AGI.

Good luck formalizing that in a way that does not itself contain exploits.

Shoshannah Tekofsky (3y)

I was thinking of the structure of Generative Adversarial Networks. Would that not apply in this case? It would involve two competing AGIs in the end, though. I'm not sure if they'd just collaborate to set both their reward functions to max, or if that would never happen due to game-theoretic considerations.

Donald Hobson (3y)

In a GAN, one network tries to distinguish real images from fake. The other network tries to produce fake images that fool the first net. Both of these are simple formal tasks.
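To show how simple and formal both GAN tasks are, here is a minimal sketch of the two loss functions in plain Python. It assumes the discriminator outputs a probability strictly between 0 and 1 that an image is real, and it uses the common non-saturating generator loss rather than the original minimax form. The function names and the sample numbers are illustrative only.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: push D(real) toward 1 and D(fake) toward 0.

    d_real, d_fake: discriminator outputs in (0, 1) on real and generated images.
    """
    real_term = sum(-math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(-math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Non-saturating form: the generator wants D(fake) near 1, i.e. to fool D."""
    return sum(-math.log(p) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs on a batch of two real and two fake images.
sharp_d = discriminator_loss(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
fooled_d = discriminator_loss(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])

print(f"discriminator loss when it separates the two: {sharp_d:.3f}")
print(f"discriminator loss when it is guessing:       {fooled_d:.3f}")
print(f"generator loss when fakes are caught:         {generator_loss([0.1, 0.2]):.3f}")
print(f"generator loss when fakes pass as real:       {generator_loss([0.9, 0.8]):.3f}")
```

Each objective is fully specified by the other network's outputs and a set of real images. Nothing in either loss refers to what the programmers intended.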

"Exploits in the objective function" could be considered as "solutions that score highly that the programmers didn't really intend". The problem is that it's hard to formalize what the programmers really intended. Given an evolutionary search for walking robots, a round robot that tumbles over might be a clever unexpected solution, or reward hacking, depending on the goals of the developers. Are the robots intended to transport anything fragile? Anything that can't be spun and tossed upside down? Whether the tumblebot is a clever unexpected design or a reward hack depends on things that are implicit in the developers' minds, not part of the program at all.
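The tumblebot case can be sketched as follows. All names and numbers are hypothetical: `written_reward` stands for what the evolutionary search actually optimizes, while the two `intent_*` functions stand for goals that exist only in the developers' heads and would never appear in the real program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    """Outcome of one simulated trial (illustrative fields and values)."""
    design: str
    distance_m: float
    full_rotations: int  # how many times the body flipped over


def written_reward(r: Rollout) -> float:
    """The objective that is actually in the program: distance covered."""
    return r.distance_m


def intent_any_locomotion(r: Rollout) -> float:
    """One possible unstated goal: anything that moves far is fine."""
    return r.distance_m


def intent_fragile_cargo(r: Rollout) -> float:
    """Another possible unstated goal: the cargo must never be flipped."""
    return r.distance_m if r.full_rotations == 0 else 0.0


def best(rollouts, score) -> str:
    """Name of the design that scores highest under the given function."""
    return max(rollouts, key=score).design


rollouts = [
    Rollout("walker", distance_m=8.0, full_rotations=0),
    Rollout("tumblebot", distance_m=20.0, full_rotations=35),
]

print("search selects:      ", best(rollouts, written_reward))
print("if any motion is ok: ", best(rollouts, intent_any_locomotion))
print("if cargo is fragile: ", best(rollouts, intent_fragile_cargo))
```

The search returns the same winner either way. Whether that winner counts as an exploit changes only when the unwritten intent function changes, which is why an exploit detector has nothing formal to check against.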

A lot of novice AI safety ideas look like "AI 1 has this simple specifiable reward function. AI 2 oversees AI 1. AI 2 does exactly what we want, however hard that is to specify, and is powered by pure handwavium."
