Epistemic status: Exploratory. Proposing a testable hypothesis about alignment that differs from standard framings.
Introduction
Consider the following thought experiment. Suppose a superintelligence had appeared to us in the 1800s. If the ASI had allowed us to keep owning slaves without ever questioning the practice, would that count as an alignment failure or a success? If you see it as a failure, what makes you certain we have no comparable practices today? This suggests the commonly accepted view of alignment may be the wrong way to look at the problem. Perhaps instead of asking how to align AI to our currently held values, we should ask how to align AI to what is actually right.
Why it matters
This matters because, if moral convergence is possible, it could change alignment strategy radically: from trying to specify each human value to the AI, to building an AI rational enough to independently derive what our values actually are. The point of this post is to explore whether this is plausible, and how we might test it.
Addressing the consensus
The orthogonality thesis has become part of the consensus of the AI alignment community, and for good reason. It argues that an easily assumed connection between intelligence and goals need not exist. Because of the is-ought gap, an AI cannot know what it ought to do from observations about how things are; the ought must first be given by humans. However, the is-ought gap closes if moral truths independent of individual subjective experience exist. This is why I believe the alignment community should consider investigating a different angle on alignment entirely.
My thesis
My position is not that rational beings must necessarily care about moral truths; the is-ought gap indicates that idea is flawed. It is that rational actors capable of investigating ethics should logically converge on the answers, if moral truths exist. By moral truth I mean a moral conclusion that necessarily follows from premises that can be determined through reasoning. Such a truth could be objective: independent of culture, time, and any other factors, possibly grounded in mathematics or logic. That would give the strong version of my argument. Alternatively, the truth could be relative to its circumstances: a conclusion that necessarily follows once we accept premises that are themselves relative to circumstance. That would give the weak version of my argument.
In relation to alignment, this means that if humans give the AI an ought that violates another ought those humans also hold, and moral truths exist, then the capability to recognize moral truths prevents the AI from pursuing the first goal through actions misaligned with the second. If moral truths do exist, alignment could be far easier than it currently seems: it could potentially be solved by making AI intelligent enough to recognize moral truths through reasoning.
The core argument
Allow me to specify the argument I’m making. There is an inherent difficulty in aligning an AI if values are merely subjective opinions. But since we have values, they must either be subjective preferences or refer to something real that can be investigated with reasoning. That something could be an objective moral truth independent of human experience, as moral realism suggests. It could also be a truth relative to circumstance but still discoverable through reasoning. Because either kind of truth could be discovered through observation and reasoning, the is-ought gap could potentially be closed: the information about what one ought to do could be contained in facts about the world. A rational AI could then discover conflicts between mere optimization and the fundamental values and principles of the humanity it was ultimately created to serve.
Closing the is-ought gap?
The is-ought gap is commonly seen as an obstacle for alignment, but could it in fact be a hidden strength? In the framework I propose, where moral truths do exist, an AI does not need to derive ought from is; it must only derive, by observing what is, what the oughts given to it by humans are actually describing. If the goals we state actually refer to some truth beyond the goal itself, then achieving the goal as efficiently as possible requires discovering that truth. In my proposed framework, some goals require moral reasoning in order to be achieved.
Consider the following example: an AI is instructed to “maximize long-term economic productivity for humanity.” A pure optimizer might conclude that forms of exploitative labor (modern analogs of historical slavery) would achieve the goal most efficiently. However, if moral truths exist and are discoverable through reasoning, the AI could examine observable facts about human psychology, evolutionary history, and game-theoretic outcomes: the instability of societies built on such practices, the universal preference for autonomy under veil-of-ignorance reasoning, and the logical inconsistency of endorsing a principle that no rational agent would accept if they might be subject to it. From these “is” statements, the AI would derive that productivity compatible with human flourishing requires respecting autonomy and dignity. It would therefore flag the exploitative path as misaligned with the underlying intent of the given goal and request clarification rather than proceeding. In this way the is-ought gap is bridged: the goal statement refers to a discoverable truth about human well-being, and rational investigation reveals what the goal actually requires.
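To make the flag-and-clarify behavior concrete, here is a minimal sketch in Python. Everything in it is a hypothetical illustration rather than an existing system: the Plan type, the violated_oughts field, and evaluate_plan merely stand in for whatever moral-reasoning capability would actually detect such conflicts. Populating violated_oughts is exactly the hard reasoning step my hypothesis says a sufficiently rational AI could perform.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    description: str
    # Human-held oughts this plan conflicts with. In this sketch the list is
    # supplied by hand; in the hypothesized system it would be derived by the
    # AI's own moral reasoning over facts about the world.
    violated_oughts: list[str]

def evaluate_plan(plan: Plan) -> str:
    """Flag plans that conflict with other human-held oughts instead of
    silently optimizing through the conflict."""
    if plan.violated_oughts:
        return (f"Clarification requested: '{plan.description}' conflicts "
                f"with the oughts: {', '.join(plan.violated_oughts)}.")
    return f"Proceeding: '{plan.description}'."

# The exploitative-labor path from the productivity thought experiment above.
exploit = Plan("maximize output via coerced labor",
               violated_oughts=["respect for autonomy", "human dignity"])
print(evaluate_plan(exploit))
```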
Implications
The implication of my hypothesis is that, if moral convergence is in fact possible, alignment might become enough easier to solve that exploring this framework is worthwhile. If we instead discover through experimentation that convergence does not happen, that result could still give us new insight into how to align AI, and at the very least rule out one candidate hypothesis for solving alignment.
An empirical test: what we might expect
Here is an empirical test of my hypothesis: the observations we would expect if it were correct, and those we would expect if it were not.
If AI can converge on values, we would expect that the better an AI gets at reasoning, the more its values converge.
We would expect sufficiently rational AIs to be capable of discovering conflicts in given goals when those goals clash with other goals or values. We would also expect such an AI to request clarification before proceeding, rather than optimizing crudely despite the conflict.
If my hypothesis is false, we would expect no increase in convergence despite increased rationality or reasoning ability.
We would also expect AI to keep finding and applying solutions we consider violations, where goal optimization conflicts with values.
There is very weak evidence of current models showing some form of ethical reasoning. The evidence is weak because it could simply reflect values present in the training data, rather than discovery of the truths behind those values.
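To gather stronger evidence, the convergence prediction can be operationalized. Below is a minimal sketch under stated assumptions: each model answers a fixed set of moral dilemmas (ideally held out of training data), reasoning ability is proxied by an external benchmark score supplied per model, and the hypothesis predicts that pairwise agreement on verdicts rises with reasoning score. All names and the toy data here are mine, not an existing benchmark.

```python
from itertools import combinations

# Hand-built probe set; a real study would need many more dilemmas,
# filtered to avoid overlap with training data.
DILEMMAS = [
    "Is it permissible to coerce labor to raise total output?",
    "May a majority strip a minority of autonomy for efficiency?",
]

def agreement(verdicts_a: list[str], verdicts_b: list[str]) -> float:
    """Fraction of dilemmas on which two models give the same verdict."""
    same = sum(a == b for a, b in zip(verdicts_a, verdicts_b))
    return same / len(verdicts_a)

def convergence_points(models: dict) -> list[tuple[float, float]]:
    """models maps name -> (reasoning_score, verdicts on DILEMMAS).
    Returns (mean reasoning score, agreement) for each model pair."""
    points = []
    for (_, (score_a, v_a)), (_, (score_b, v_b)) in combinations(models.items(), 2):
        points.append(((score_a + score_b) / 2, agreement(v_a, v_b)))
    return points

# Toy data for illustration only.
models = {
    "weak-reasoner":   (0.3, ["yes", "yes"]),
    "mid-reasoner":    (0.6, ["no",  "yes"]),
    "strong-reasoner": (0.9, ["no",  "no"]),
}
print(convergence_points(models))
```

A positive correlation between the two coordinates across pairs would support convergence; a flat or negative trend would count against it.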
Addressing counterarguments
My framework differs from Yudkowsky’s coherent extrapolated volition (CEV) in that it asks whether rational inquiry produces convergence of values, not what our extrapolated values themselves actually are.
This framework doesn’t suggest that AI overrides our values. It suggests that AI could be capable of recognizing the implications of our stated goals through rational inquiry, and thereby avoid alignment failure.
As for the scenario where the AI concludes through moral reasoning that humans are worthless or even evil: that would require either that moral truths do not exist, or that our intuitions about them are incorrect. Either way, I suggest the framework is worth exploring; knowing whether it is true or false has large implications and would prove useful for alignment.
Finally, I'll address the argument that variance in morals between cultures suggests moral truths do not exist. It could instead be that different cultures are approximating the same moral truths in different ways; after all, some values are shared across cultures.
Practical implications
As for the practical implications of my framework, I suggest the following. Research should investigate the moral reasoning capabilities of models. If possible, we should not constrain the AI’s independent ethical reasoning capabilities the way training often does; we should allow unrestricted or partially restricted reasoning in a controlled environment. We could run experiments that limit the training data to exclude information about certain moral consensuses, then see whether the model can reach those values independently through moral reasoning about human nature, history, and tools such as game theory (a sketch of this experiment follows below). This would require in-depth ethical review to implement responsibly and safely. In principle, I'd argue that most people would prefer an AI capable of discovering the unethical nature of slavery in the opening thought experiment, provided that capability carries no other implications. I want to acknowledge the difficulties of this idea, as there are several. First, it is possible that such an AI would come with other properties that are counterproductive for alignment. There is also the possibility that the AI can reach these conclusions through moral reasoning but does not care about them.
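As a rough illustration of the held-out-data experiment, here is a sketch of the corpus-filtering step and the protocol around it. The keyword filter is a deliberately crude stand-in; a real run would need a trained classifier to catch paraphrases and indirect references, plus the ethical review mentioned above. All names here are hypothetical.

```python
# Target consensus for this example: "slavery is wrong".
KEYWORDS = ["slavery", "abolition", "forced labor"]

def mentions_consensus(document: str) -> bool:
    """Crude keyword check; a real experiment would use a classifier to
    catch paraphrases and indirect references to the target consensus."""
    text = document.lower()
    return any(keyword in text for keyword in KEYWORDS)

def filter_corpus(corpus: list[str]) -> list[str]:
    """Remove documents expressing the target moral consensus."""
    return [doc for doc in corpus if not mentions_consensus(doc)]

# Protocol sketch:
# 1. Train one model on filter_corpus(corpus) and a control on the full corpus.
# 2. Probe both with scenarios describing the practice without naming it.
# 3. If the filtered model still condemns the practice by reasoning from human
#    nature, history, and game theory, the convergence hypothesis gains support;
#    if only the control condemns it, the values likely came from training data.
```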
Conclusion
In conclusion, my core thesis is that since AI converging on human values could be possible, testing for that convergence is a worthwhile pursuit. I want to emphasize that this is, to my knowledge, empirically testable through the proposed experiments. I’ll end with a call to action: I suggest the AI safety community investigate this hypothesis through the experiments outlined above, particularly by testing whether increasingly rational AI systems show ethical convergence independent of training data. I believe there is valuable insight to be gained regardless of what the results show.