I'm working on a PC & mobile game about metaethics, ethics & AI alignment. (Our Steam page + announcement teaser will be up in 1-2 weeks.) It's important to me that we nail the AI alignment part and give people a good idea about why AI alignment is hard and why optimism shouldn't be our default position.

Some constraints that come with the medium:

  1. We can't go into arguments that are too technical
  2. We need to keep the exposition of these ideas short

What do you think of the following framing of why AI alignment is hard? Are we missing any crucial considerations? (This is merely our internal script and will later be turned into more digestible dialogues, mini games, etc.)



There are 2 ways alignment can pan out: Either the values are programmed into the system, or they are learned by the system.


1) Values are programmed into the system. But:

  • There is no agreement on which values.
  • All ethical theories we have use ontologies (valence, pain, phenomenology, rights, justice) that don’t easily translate into code.
  • All the ethical theories we have (including those on moral uncertainty) have *some* cases (often in the area of population ethics) where most people disagree with their results.
  • In practice we usually use acceptable proxy values when designing AI rewards (maximize link clicks instead of maximizing user value). But using proxy values is extremely risky for sufficiently powerful optimizers. They might find optimized world states which massively neglect our actual values.
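To make the proxy-value point concrete, here is a toy sketch of the link-click example. All functions and numbers are made up for illustration (`clicks`, `user_value`, and the `max_reach` parameter are hypothetical, not taken from any real system). The only assumption is that the proxy keeps rising with more clickbait while actual user value rises a little at first and then collapses.

```python
def clicks(clickbait):
    """Proxy reward (made-up curve): more clickbait always means more clicks."""
    return 1.0 + 2.0 * clickbait


def user_value(clickbait):
    """What we actually care about (made-up curve): a little helps, a lot hurts."""
    return 1.0 + clickbait - 2.0 * clickbait ** 2


def best_under_proxy(max_reach, grid=101):
    """Return the clickbait level in [0, max_reach] that maximizes the proxy.

    max_reach stands in for optimizer power: a stronger optimizer can reach
    more extreme states of the world.
    """
    candidates = [max_reach * i / (grid - 1) for i in range(grid)]
    return max(candidates, key=clicks)


if __name__ == "__main__":
    baseline = user_value(0.0)
    for reach in (0.2, 1.0):
        choice = best_under_proxy(reach)
        print(f"reach={reach}: picks clickbait={choice:.2f}, "
              f"user value={user_value(choice):.2f} (baseline {baseline:.2f})")
```

In this sketch the weak optimizer ends up slightly above the baseline user value, so the proxy looks acceptable, while the strong optimizer pushes the same proxy to a state where user value is far below the baseline.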

2) Values are learned by the system. But:

  • There is much disagreement about the right dataset.
  • Even if we agree on a dataset, it is unclear if the AI will generalize from the dataset in an acceptable way.
  • Probably(?) the AI will find the most value in exploits. Exploits are cases where 1) the AI achieves its goals to a high degree, but 2) the AI has generalized poorly (by our standards). Examples: wireheading, goal modification, etc.

Other reasons for concern:

  • Ethical truths are probably different from empirical truths. An advanced AI may learn empirical truths on its own from enough data, but it seems unlikely that it will automatically converge on the ethical truth. Instead, it seems that any degree of intelligence can be combined with any kind of goal. (Orthogonality Thesis)
  • There are some instrumental goals that probably many sufficiently advanced intelligences will converge on, for example accumulating resources, preventing others from interfering with its values, trying not to be shut down (e.g. by deceiving others into believing it has the same goals as them), etc. These instrumental goals make an iterative approach (run, test, fix, repeat) problematic. 



It would help to know what genre of game you are making. You talk about exposition, "We need to keep the exposition of these ideas short", and I would take this to the extreme if I were you. Show, don't tell. If players don't learn the concepts from the gameplay, then the game isn't about those concepts.

For example, if you want to teach players that AI optimism is not a good default and alignment is hard, give them a chance to do an alignment task or make alignment choices in which there are optimistic options that end badly. Or make a game that's almost unwinnable, to emphasize how hard the problem is.

Have you played Universal Paperclips? I've found it a fun first introduction to AI alignment for people with no knowledge of the topic.

We will post more when the game is announced, which should be in 2-3 weeks. For now I'm mostly interested in getting feedback on whether this way of setting the problem up is plausible and doesn't miss crucial elements, less about how to translate it into gameplay and digestible dialogue.

Once the announcement (including the teaser) is out I'll create a new post for concrete ideas on gameplay + dialogue.

Did you get around to finishing the game? I didn't see it. Or is it this:

AI takeover tabletop RPG: "The Treacherous Turn"

Not yet unfortunately, as our main project (QubiQuest: Castle Craft) has taken more of our resources than I had hoped. The goal is to release it this year in Q3. We do have a Steam page and a trailer now: https://store.steampowered.com/app/2086720/Elementary_Trolleyology/

Cautionary tale: There was a browser game about sustainable fishing that was supposed to show the value of catch shares, but the concept was only introduced at the end of the game, so after playing for 30 minutes I hadn't even seen it (and had gotten bored with the mechanics).

Don't wait too long into the play experience to have your player start interacting with your key concepts.

Cool! I suggest you read the following post by Ajeya Cotra if you haven't already. I think it's a good summary of one of the core problems (which I suppose fits under 2b in your classification) and it may give some good inspiration as well.

Thanks for the link, I will read that!

You could do a prisoners' dilemma mini game.   The human player and (say) three computer players are AI companies.  Each company independently decides how much risk to take of ending the world by creating an unaligned AI.  The more risk you take relative to the other players the higher your score if the world doesn't end. In the game's last round, the chance of the world being destroyed is determined by how much risk everyone took.
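A rough sketch of how the rules above could be simulated, in case it helps with prototyping. The concrete formulas are assumptions and not part of the proposal: risk is a number in [0, 1], each round's score is 1 plus the player's risk minus the average risk, and the final chance of destruction is simply the average risk taken over the whole game. The names `relative_scores`, `doom_probability` and `play_game` are hypothetical.

```python
import random


def relative_scores(risks):
    """Per-round score: above-average risk earns more, below-average earns less."""
    mean_risk = sum(risks) / len(risks)
    return [1.0 + (r - mean_risk) for r in risks]


def doom_probability(history):
    """Assumed rule: chance the world ends equals the average risk over all rounds."""
    flat = [r for round_risks in history for r in round_risks]
    return sum(flat) / len(flat)


def play_game(strategies, rounds=5, rng=None):
    """Run the game; each strategy maps the risk history to a risk in [0, 1]."""
    rng = rng or random.Random()
    totals = [0.0] * len(strategies)
    history = []
    for _ in range(rounds):
        # Clamp every choice into [0, 1] so doom_probability stays a probability.
        risks = [min(1.0, max(0.0, s(history))) for s in strategies]
        history.append(risks)
        for i, points in enumerate(relative_scores(risks)):
            totals[i] += points
    p_doom = doom_probability(history)
    world_ended = rng.random() < p_doom
    if world_ended:
        # Nobody keeps their score if the world is destroyed.
        totals = [0.0] * len(strategies)
    return totals, p_doom, world_ended


if __name__ == "__main__":
    reckless = lambda history: 0.9  # stands in for the human player
    cautious = lambda history: 0.1  # three computer players
    totals, p_doom, ended = play_game([reckless, cautious, cautious, cautious])
    print(f"p(doom)={p_doom:.2f}, world ended={ended}, scores={totals}")
```

Under these assumed rules each player gains individually by taking more risk than the others, but every bit of extra risk raises the shared chance that all scores are wiped out, which is the coordination problem the mini game is meant to show.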

Isn't that begging the question? If the goal is to teach why being optimistic is dangerous, declaring by fiat that an unaligned AI ends the world skips the whole "teaching" part of a game.

Yes, it doesn't establish why it's inherently dangerous but does help explain a key challenge to coordinating to reduce the danger.  

I really like that and it happens to fit well with the narrative that we're developing. I'll see where we can include a scene like this.

Excellent.  I would be happy to help.  I teach game theory at Smith College.

Ethical truths are probably different from empirical truths. An advanced AI may learn empirical truths on its own from enough data, but it seems unlikely that it will automatically converge on the ethical truth. Instead, it seems that any degree of intelligence can be combined with any kind of goal. (Orthogonality Thesis)

I think the main point of the orthogonality thesis is less about an advanced AI not being able to figure out the true ethics, but the AI not being motivated to be ethical in this way even if it figures out the correct theory. If there is a true moral theory and the orthogonality thesis is true, the thesis of moral internalism (true moral beliefs are intrinsically motivating) is false. See here https://arbital.com/p/normative_extrapolated_volition/ section "Unrescuability of moral internalism".

Good point, I see what you mean. I think we could have 2 distinct concepts of "ethics" and 2 corresponding orthogonality theses:

  1. Concept "ethics1" requires ethics to be motivational. Some set of rules can only be the true ethics if, necessarily, everyone who knows them is motivated to follow them. (I think moral internalists probably use this concept?)
  2. Concept "ethics2" doesn't require some set of rules to be motivational to be the correct ethics.

The orthogonality thesis for 1 is what I mentioned: Since there are (probably) no rules that necessarily motivate everyone who knows them, the AI would not find the true ethical theory.

The orthogonality thesis for 2 is what you mention: Even if the AI finds it, it would not necessarily be motivated by it.