It would help to know what genre of game you are making. You talk about exposition, "We need to keep the exposition of these ideas short", and I would take this to the extreme if I were you. Show, don't tell. If players don't learn the concepts from the gameplay, then the game isn't about those concepts.
For example, if you want to teach players that AI optimism is not a good default and that alignment is hard, give them a chance to do an alignment task or make alignment choices in which the optimistic options end badly. Or make a game that's almost unwinnable, to emphasize how hard the problem is.
Have you played Universal Paperclips? I've found it a fun first introduction to AI alignment for people with no knowledge of the topic.
We will post more when the game is announced, which should be in 2-3 weeks. For now I'm mostly interested in getting feedback on whether this way of setting the problem up is plausible and doesn't miss crucial elements, less about how to translate it into gameplay and digestible dialogue.
Once the announcement (including the teaser) is out, I'll create a new post for concrete ideas on gameplay + dialogue.
Did you get around to finishing the game? I didn't see it. Or is it this?:
Not yet unfortunately, as our main project (QubiQuest: Castle Craft) has taken more of our resources than I had hoped. The goal is to release it this year in Q3. We do have a Steam page and a trailer now: https://store.steampowered.com/app/2086720/Elementary_Trolleyology/
Cautionary tale: There was a browser game about sustainable fishing that was supposed to show the value of catch shares, but the concept was only introduced at the end of the game, so after playing for 30 minutes I hadn't even seen it (and had gotten bored with the mechanics).
Don't wait too long into the play experience to have your player start interacting with your key concepts.
Cool! I suggest you read the following post by Ajeya Cotra if you haven't already. I think it's a good summary of one of the core problems (which I suppose fits under 2b in your classification) and may give some good inspiration as well.
You could do a prisoners' dilemma mini game. The human player and (say) three computer players are AI companies. Each company independently decides how much risk to take of ending the world by creating an unaligned AI. The more risk you take relative to the other players the higher your score if the world doesn't end. In the game's last round, the chance of the world being destroyed is determined by how much risk everyone took.
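One way the mechanics of this mini game could be sketched in Python. The comment gives no formulas, so everything concrete here is an assumption made for illustration: the names `Company`, `doom_probability`, `relative_scores` and `resolve` are hypothetical, the chance of destruction is taken to be the average risk level across all companies and rounds, and a company's score is taken to be its share of the total risk, with everyone scoring zero if the world ends.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Company:
    name: str
    risks: list = field(default_factory=list)  # one risk level in [0, 1] per round

    def choose_risk(self, risk: float) -> None:
        if not 0.0 <= risk <= 1.0:
            raise ValueError("risk must be between 0 and 1")
        self.risks.append(risk)

    @property
    def total_risk(self) -> float:
        return sum(self.risks)


def doom_probability(companies, rounds):
    # Assumed rule: mean risk level across all companies and all rounds.
    return sum(c.total_risk for c in companies) / (len(companies) * rounds)


def relative_scores(companies):
    # Assumed rule: each company's share of the total risk taken.
    total = sum(c.total_risk for c in companies)
    if total == 0:
        # Nobody took any risk, so everyone gets an equal share.
        return {c.name: 1 / len(companies) for c in companies}
    return {c.name: c.total_risk / total for c in companies}


def resolve(companies, rounds, rng):
    # Final round: the world ends with the shared probability (assumed: all score 0).
    if rng.random() < doom_probability(companies, rounds):
        return {c.name: 0.0 for c in companies}
    return relative_scores(companies)


if __name__ == "__main__":
    players = [Company("Human"), Company("Lab A"), Company("Lab B"), Company("Lab C")]
    for risk_levels in [(0.6, 0.2, 0.2, 0.2), (0.4, 0.2, 0.2, 0.2)]:
        for company, risk in zip(players, risk_levels):
            company.choose_risk(risk)
    print(resolve(players, rounds=2, rng=random.Random()))
```

Under these assumed rules, raising your own risk always raises your share of the score but also raises the shared chance that everyone gets nothing, which is the dilemma the mini game is meant to convey.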
Isn't that begging the question? If the goal is to teach why being optimistic is dangerous, declaring by fiat that an unaligned AI ends the world skips the whole "teaching" part of a game.
Yes, it doesn't establish why it's inherently dangerous but does help explain a key challenge to coordinating to reduce the danger.
Ethical truths are probably different from empirical truths. An advanced AI may learn empirical truths on its own from enough data, but it seems unlikely that it will automatically converge on the ethical truth. Instead, it seems that any degree of intelligence can be combined with any kind of goal. (Orthogonality Thesis)
I think the main point of the orthogonality thesis is less that an advanced AI could not figure out the true ethics, and more that the AI would not be motivated to be ethical in this way even if it figured out the correct theory. If there is a true moral theory and the orthogonality thesis is true, the thesis of moral internalism (true moral beliefs are intrinsically motivating) is false. See https://arbital.com/p/normative_extrapolated_volition/, section "Unrescuability of moral internalism".
Good point, I see what you mean. I think we could have 2 distinct concepts of "ethics" and 2 corresponding orthogonality theses:
The orthogonality thesis for 1 is what I mentioned: Since there are (probably) no rules that necessarily motivate everyone who knows them, the AI would not find the true ethical theory.
The orthogonality thesis for 2 is what you mention: Even if the AI finds it, it would not necessarily be motivated by it.
I'm working on a PC & mobile game about metaethics, ethics & AI alignment. (Our Steam page + announcement teaser will be up in 1-2 weeks.) It's important to me that we nail the AI alignment part and give people a good idea about why AI alignment is hard and optimism shouldn't be our default position.
Some constraints that come with the medium:
What do you think of the following framing of why AI alignment is hard? Are we missing any crucial considerations? (This is merely our internal script and will later be turned into more digestible dialogues, mini games, etc.)
There are 2 ways alignment can pan out: Either the values are programmed into the system, or they are learned by the system.
1) Values are programmed into the system. But:
2) Values are learned by the system. But:
Other reasons for concern: