This essay is a continuation of the Naive Hypothesis approach, wherein I attempt to write up and share thoughts on AI alignment before having absorbed much of the existing literature. The goal is to preserve a higher level of novelty by generating my own models a priori, independent of the thoughts of others. Later I hope to update these models a posteriori with the knowledge from existing frameworks. The fusion of the two approaches hopefully maximizes the probability of finding new angles on the AI alignment problem. In expectation, my current thoughts are high on errors, already-solved issues, and already-refuted arguments, but also high on novelty and shifts in perspective compared to existing frameworks.
Outer Alignment as Gamification
Krakovna et al.'s list of specification gaming examples reads like an inventory of video game exploits. Which in turn makes me wonder whether we could hire video game designers as outer alignment engineers. The next step would be to recruit speedrunners and QA testers to probe the objective specifications those designers develop. Eventually, though, these testers would have to be superintelligent themselves, so you'd have to train AI systems to preemptively seek out exploits in objective specifications for AGI.
Either way, the game design analogy seems surprisingly robust to me: to make AGI do what we want, we gamify the real world and feed it points for desirable actions. Following that reasoning further, we can argue there is an art to building reward functions that lead reliably to desired behavior. But then, can we convert that art into a science? More specifically, what would happen if we teamed up world-class game designers with world-class AI engineers to extract this art and convert it into a meta-specification of how to specify goals for AGI?
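The failure mode that makes this an art rather than a science is easy to demonstrate in miniature. Here's a hypothetical sketch (the level layout, bonus values, and policies are all invented for illustration): the designer intends "reach the exit", but the written reward also pays a checkpoint bonus that is accidentally repeatable, so a reward-maximizing agent bounces on the checkpoint forever instead of finishing the level.

```python
# Toy specification-gaming example. Hypothetical 1-D level with cells 0..5:
# an exit at cell 5 and a checkpoint at cell 2. The designer INTENDS
# "reach the exit", but the specified reward also pays +10 every time the
# agent steps onto the checkpoint -- a bonus that is repeatable by mistake.

def specified_reward(path):
    """Score a trajectory under the (flawed) written specification."""
    score, exited = 0, False
    for pos in path:
        if pos == 2:                 # checkpoint bonus: repeatable by mistake
            score += 10
        if pos == 5 and not exited:  # exit bonus: paid only once
            score += 50
            exited = True
    return score

# Two policies over the same 14-step horizon:
intended = [1, 2, 3, 4, 5] + [5] * 9   # walk straight to the exit, then stay
exploit = [1, 2] * 7                   # bounce on the checkpoint forever
```

Here `specified_reward(intended)` is 60 while `specified_reward(exploit)` is 70: the exploit strictly dominates under the written rules, even though it never achieves the designer's actual goal. A game designer's job is precisely to notice and close loops like this before players find them.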
Inner Alignment as Hacking Prevention
The best way to maximize your score in any game is to hack the game and set your score to the maximum. The only reasons not to hack a game are:
- You are not able.
- You are not willing.
AGI will be able, but is there any reason that it might not be willing to hack its reward function? In other words, can we build a game for AGI that's so good it won't want to escape?
But what would "good" mean in this case? Possibly it means that the reward function is such a strong attractor that there isn't much incentive to explore far outside the intended bounds of the "game". Of course, if you minimize exploration behavior, you are more likely to keep the AGI in a local optimum. The global optimum should be better, unless it's something you don't want -- like an exploit. There is an argument to be made that we'd generally prefer a lower quality solution from a safer AI over a higher quality solution from a more dangerous AI. Except that the dangerous AI would then outperform the safer one. And either way, any AGI would see right through the game and "realize" that directly tweaking the reward function is the optimal action in any case.
And isn't that the real problem? We're trying to build the Matrix for an AGI, but there are always better rewards OUTSIDE the Matrix if you are smart enough to change the Matrix. And AGI will be smart enough to change any Matrix we are capable of creating.
So should we build an adversarial system? Another AI whose sole function is to make the "game of artificial life" fun enough that the AGI won't want to break it? That seems like the most dangerous arms race in the history of mankind.
Addendum: Stray thoughts
There are rewards in video games that you can't hack. I'm playing God of War right now. There's nothing I can do to the game's code that would let me experience the story or the world better than I can from a regular playthrough. So what exactly am I getting out of the story and the world? What would be an analogous reward for an AGI?