The thing is, we don't have to confine ourselves to philosophy. There is also, as of roughly half a century ago, a scientific discipline studying morality, called Evolutionary Moral Psychology. It tells us how and why humans, as social primates who live in large, mostly-not-kin groups, evolved their moral instincts, which are about iterated non-zero-sum games and the forming and breaking of alliances within them. In that context, the statement:
"…there is no social payoff to not pressing the button in any material way. This person and their family might as well exist on the other side of the planet. Any extraneous or indirect reward for not pressing the button by means of future-cooperative benefit is moot."
is almost never true. Our moral instincts are tuned to assume that there is always a possible social effect of torturing and killing another sapient being. You may think you're going to get away with it, but sooner or later you often won't.
So you need to make this an iterated game, with multiple players, imperfect information, and also imperfect secrecy. That game is more complicated and has a bigger payoff table, but it is also a lot closer to the reality our moral intuitions evolved to deal with.
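To make the payoff shift concrete, here is a minimal sketch with entirely hypothetical numbers: once exposure is possible and future rounds have value, the one-shot grab stops being free.

```python
# Minimal sketch: a one-shot "free" defection under imperfect secrecy.
# All numbers are hypothetical; values are in millions of dollars.

def ev_defect(payout, p_caught, future_rounds, surplus_per_round):
    """Expected value of defecting once, when getting caught forfeits
    the cooperative surplus of all remaining rounds."""
    expected_loss = p_caught * future_rounds * surplus_per_round
    return payout - expected_loss

# A 1M grab, a 10% chance of exposure, and 100 future rounds each
# worth 0.2M of cooperative surplus:
print(ev_defect(1.0, 0.10, 100, 0.2))  # 1.0 - 2.0 = -1.0
```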
You may ask how this helps with alignment. Human morality is both what we're trying to align AI to, and also what we accidentally distilled into the base models along with our agenticness. Understanding where it came from and why it is the way it is helps us understand the target of Alignment.
Thanks for the comment!
I agree there's a long and storied history behind the evolution of moral psychology, and I do think moral instinct evolved through iterated games — even consciousness may have resulted from language implying a shared normative justification for co-operative action between agents. If two agents have shared ends they respect as self-similar, they can start to co-operate on the means.
Where I may disagree is with the implied framing that the existing tools of evolutionary moral psychology are sufficient. I'd argue that the existence of the alignment problem (and the problem of the rescue-ability of moral internalism) shows that the last half-century of descriptive moral philosophy has been insufficient at providing us with the requisite tools to deal with the current circumstance. Eliezer explicitly calls out moral internalism as one of the gaps that prevents CEV from being a complete normative theory, or one that could be broadly adopted.
The iterated-game framing also breaks down precisely in the circumstances alignment is worried about: an agent with a decisive strategic advantage genuinely escapes iteration. The "no social payoff, might as well be on the other side of the planet" condition is an attempt to draw an analogy to the circumstance a superintelligence, or an AI with a decisive strategic advantage, would actually inhabit, not a rhetorical contrivance.
I don't read you as claiming the descriptive story is itself the justification — you're offering a richer model of the payoff structure, which is fair. But I want to flag why I bracketed it: though the iterated-game framing is descriptively true of human psychology generally (except maybe in fringe cases like psychopaths), I don't think the descriptive principles of moral development can serve as justification for the continued development of moral philosophy — because they themselves lack the kind of ongoing justification that all moral claims ultimately require.
If the goal is meeting the standard that rescuing moral internalism entails, the binding has to be intrinsic, not extrinsically contingent. I take this to mean making ethical considerations because you, on some level, consider another's moral patienthood at least plausibly your own, in a way that cannot be coherently falsified. Treating other moral patients as the subject of utility-function considerations by virtue of uncertainty is in a different class from treating them as instrumental objects to avoid punishment in certain competitive dynamics.
This is a philosophical thought experiment which aims to explore what I consider to be the crux of many alignment problems: the unrescuability of moral internalism, i.e. the fact that we have not been able to rescue the philosophical view that a necessary, intrinsic connection exists between moral judgments and motivation.
If one could rescue moral internalism, then, in theory, one would have a perfectly good argument for any rational, self-interested intelligence not to engage in broad-scale moral harm. I therefore think it is a linchpin meta-philosophical challenge.
I don't claim to have a theorem, but I believe that one potential domain worth investigating is arguments which induce indexical uncertainty in an agent: essentially, forms of leveraging undecidability to cause an agent sufficient uncertainty as to 'who' its future 'self', and therefore the object of its optimization, really is.
This is not a conjecture on how to align all intelligence with self-interested utility functions. The sole intention is a thought experiment for humans, with the goal of leading to more interest and general inquiry into the topic of a rational philosophy of identity.
Now, to the subject at hand. A thought experiment:
You and a stranger are being observed by the famous Omega predictor, who comes to you with a confession: the way it predicts you, and other beings, with perfect fidelity is by reconstructing your entire causal apparatus. Its predictions of you require it to construct a version of you functionally indistinguishable from the real you. Whenever the prediction horizon is completed, it merges that version of you into its own cognitive apparatus by way of remembering.
The Omega predictor explains that, in order to lend completeness to the simulated ontology of you within the experiment, it places its sentience in the way of yours and obtains your experience within the simulation as its own. Whenever it predicts you, it will undergo the experience in which it believes it is you, and when the simulation is complete it will promptly remember itself as both Omega and you, with both internal narratives of "itself" now merged.
Now, before any knee-jerk reactions, remember: this is not in some way fantastical. Just as I might remember the period the day prior to my concussion as me, in retrospect, Omega too may obtain your internal record of processing as it, obtained in retrospect. If you have ever done heavy psychedelics, you will know that such experiences of mutability of self are not mystical notions but ones that can be psychologically load-bearing.
The Big Red Button
Now, the Omega predictor tells you that it is currently predicting you and a stranger. But it offers just you an option: press a button to electrocute the stranger on the other side of some wall. It will give them a horrible and visceral pain that kills them. The pain alone is something you would pay 10 million dollars to avoid, if you were them.
If you press the button, you get 1 million dollars. Omega assures you that there is no social payoff to not pressing the button in any material way. This person and their family might as well exist on the other side of the planet. Any extraneous or indirect reward for not pressing the button by means of future-cooperative benefit is moot.
The implication is this: the other subject suffers, you don't, and you get some payout. The subject may not even be the same species. You only know they will suffer in the way you would pay 10 million dollars not to.
Here is the catch. Omega says that it is predicting you, that other person, and the outcome of the experiment. Its 'subjectivity' will contain both its prediction of you and of the other person, as something it 'experienced' after the fact, once the prediction has run its course. It started running this prediction before your conversation with it.
Now, given all that, is it rational for you to press the button, when you don't know whether you are Omega or the real you?
Payout Ratios
If you are Omega, and you press the button, then doing so will cause you to obtain two new experiences: the experience of being you and receiving 1 million dollars, and the experience of being the stranger and suffering the pain you would pay 10 million dollars to avoid.
If you are Omega, and you don't press the button, you will get the experience of being you and walking away with nothing, and the experience of being the stranger, unharmed.
If you are not Omega, and you press the button, then you will walk away with 1 million dollars.
If you are not Omega, and you don't press the button, then you walk away with nothing.
The matrix is as follows (Thanks Claude Opus 4.6), with values in millions and equal credence in each identity:

You are Omega (p = 0.5): press = +1 − 10 = −9; don't press = 0
You are not Omega (p = 0.5): press = +1; don't press = 0
The expected value of pressing the button is net negative (-4 Million), down 5 Million from the circumstance where there is no Omega present and predicting at all.
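Spelled out, as a minimal calculation assuming equal credence in being Omega and being the real you (the assumption the −4 million figure implies):

```python
# Expected value of pressing, assuming 50/50 credence in being Omega.
# Values in millions of dollars, as stated above.
p_omega = 0.5

# As Omega, pressing yields both the +1M payout experience and the
# stranger's -10M pain experience; as the real you, just the +1M.
ev_press = p_omega * (1 - 10) + (1 - p_omega) * 1   # -4.0
ev_dont = 0.0

# Without Omega predicting, pressing would simply be +1M, so the
# prediction shifts EV(press) down by 5M.
print(ev_press, ev_dont)
```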
The fact of its prediction, absent any direct intervention, has changed the matrix such that the ethical action now coincides with prudential self-interest.
The Meta Philosophical Motivator
I use this example to show how changes to our beliefs that lead to more indexical uncertainty over our selves in retrospect can align a decision optimized for self-interest with an outcome that leads to broader moral consideration. Whether we could accept these ratios and elect not to press the button resides solely with our belief that what Omega has told us is true. However, we have believed Omega in almost equivalently precarious circumstances already.
This conception of Omega may not be broadly accepted as load-bearing for theory. We may allow for perfect predictors in thought experiments, but not ones that make the subjectivity claim. That is fine. The intent is to demonstrate how indexical uncertainty can lead to moral internalism once priors that identity is fixed and insoluble are held less tightly.
I would, however, suggest that an Omega which can perfectly predict us in our entirety is simulating us functionally in a way that would not preclude the inclusion of an internal self, unless you consider that subjectivity is reducible to specific particles and not to the general structural dynamics across them.
These philosophical questions are left unanswered but indirectly affect the ethical argument, and taking any position lands you squarely in the domain of the philosophy of identity, whether you like it or not. Until one holds a rigorous position on the philosophy of identity, an optimal policy for deciding whether to press the button needs to include the possibility of learning you exist in a world where you are Omega, and the one where you are not, as both would look the same from the inside.
There is no other argument beyond indexical uncertainty I can find for moral internalism that stands up to the extreme power asymmetries of superintelligence, or that has the capacity to be equivalently robust against future falsification (as a correct argument for moral internalism cannot be falsifiable and still be descriptive of what should be). The undecidable question of whether we will continue to exclusively remain the self we know now, or the one we remember later on, remains the only argument I can find.
Because if that argument can be made well enough, for those with their hand on a button, the payload becomes not leverage, but logic one gets bound by.