The Benefactor, the ancestor of Orwell's Big Brother, is the absolute ruler of OneState. ... He rules over a human society that is deemed to have achieved, with only negligible exceptions, absolute perfection. Men have finally become, if not actually machines, as machine-like as possible, utterly predictable and completely happy. All the messy inconvenience of freedom has been eliminated. Mere details, a final touch-up here and there, one last adjustment -- these are all that remain. The Great Operation at the end (a sort of lobotomy for the removal of Man's last imperfection, the Imagination) is the final touch. That not every Number wishes to go under the knife is understandable, but ... seeming cruelty can be the most genuine love of Mankind.
Translator's introduction to Yevgeny Zamyatin's We (1921).
My relative optimism about the human ability to check the alignment research conducted by AGI and become convinced "beyond a reasonable doubt" that it is sound rests upon the assumption that this alignment research will be conducted predominantly as math and science rather than philosophy.
It seems that in order to produce a universally convincing alignment theory, AGI will need to suggest novel, powerful theories or entire new paradigms in mathematics, ML/DL and statistical learning theory, control theory, cognitive science, epistemology, logic, rationality, semantics, linguistics, ethics, game theory, distributed systems theory, and more (see here for a more extensive list), or be able to synthesise alignment research at the intersections of the state-of-the-art theories in these various disciplines with superhuman breadth and superhuman generalisation ability.
I agree with Wei Dai that AGI-produced philosophy will be more problematic. The main reason is that it's much easier to generate biased philosophy than biased math and science, and not to notice this bias, or to notice the bias but be carried away with it (Dai describes this risk as "using AI changes the trajectory of philosophical progress in a bad way"), while LLMs, even before (or without) any fine-tuning with human or language feedback, are biased.
In the recent FLI podcast episode, Connor Leahy detailed Conjecture's Cognitive Emulation agenda as aiming at emotionless, "platonic" CoEms, which in some sense should minimise their bias (although some "rational" inductive biases would remain). However, for people, good philosophy is not totally unbiased philosophy, but philosophy that is biased in exactly the way people want to be biased, including emotionally. It's regarded as non-coincidental that Jeremy Bentham, who probably had Asperger's syndrome and didn't understand emotions well (Lucas & Sheeran, 2006), created utilitarianism, a moral philosophy that is disliked by most people.
The role of philosophy in human activity
Since the Enlightenment, the causal effect of philosophical and religious ideas over the directions of human activity and the evolution of civilisation has progressively diminished. At the same time, the role of scientific and engineering ideas, technology, systems and mechanisms (such as markets, institutions, and regulators) has steadily increased. It seems to me that the AGI transition will necessarily solidify this subsidiary, “storytelling” and rhetorical role of philosophy and religion in determining the trajectory of civilisation.
If humanity takes philosophy (and ethics in particular) to be more fundamental than science, or even equally important, we simply won't be able to agree on what to do with AGI, and the AGI transition will definitely not go well. This is because if math and science don't have the upper hand over philosophy, ethical pluralism will by definition prevent people from reaching any kind of satisfying agreement.
Note that above in this post, I've put ethics on the list of disciplines where AGI will probably need to develop some new ideas or frames on the path to "solving alignment". There are actually at least five distinct (albeit interrelated) disciplines that we could call "ethics", and AGI will need to tackle all of them. Four of them are various kinds of naturalistic (i.e., scientific) or mathematical disciplines:
- “Neuroscientific” ethics, e.g., the anatomy of human values, or preference science
- “Game-theoretic and evolutionary” ethics, e.g., Morality as Cooperation
- “Scale-free, physical” ethics (notably, interacting with naturalistic theories of consciousness rather than phenomenological, hermeneutic, and other “continental” studies and accounts of consciousness)
- “Formal, computational” (meta-)ethics
In addition, there is “rhetorical and memetic” ethics, i.e., an interpretation of some math and science that AGI labs will need to deploy to convince humanity to accept the AGI transition plan produced by their “first AGI” and to coordinate on executing this plan globally. The employees and owners of the AGI labs will also need these interpretative stories to convince themselves that they want to follow through with the proposed plan, of course.
Any non-naturalistic, non-mathematical/computational meta-ethics implies an ethical theory of the latter, properly philosophical kind, in the sense in which the word “philosophy” has been used above in this post, as contrasted with math and science.
Most people are not into scientism
Most people who hold non-naturalistic meta-ethical and meta-philosophical views also take their philosophies to be as fundamental as science (or “as important as science”, in some sense), or even more fundamental. Thus people who subscribe to various philosophical and ethical views of this kind, such as idealism, postmodernism, subjectivism, critical theory, perhaps some versions of existentialism (see Heidegger’s Gestell, “enframing”), and situational ethics, will see the “triumph of science” (together with naturalistic ethics) that will inevitably accompany the transition into the AGI era as irreconcilable with their philosophies.
This is not to mention the devout followers of most major world religions: Christianity, Islam, and Hinduism, the religions that also claim dominion over science and thus cannot “allow” themselves to be “overrun” by science and technology.
“Local” moral and religious pluralism doesn’t work
The issue described above shouldn’t be confused with the issue of aligning AI with different people and societies that have diverse cultural and moral norms. The latter issue is often raised on Twitter and elsewhere as a rhetorical question: “Whose values will AGI be aligned with?” People who do this often end their inquiry right there, assuming that cultural and moral pluralism a priori renders technical alignment impossible.
The defenders of AI alignment, such as at OpenAI, of course, think that the technical problem is solvable by “adaptation”, or configuration of AI to the moral norms of specific cultures, communities, or even families or individual people. Sam Altman writes:
In particular, we think it’s important that society agree on extremely wide bounds of how AI can be used, but that within those bounds, individual users have a lot of discretion. Our eventual hope is that the institutions of the world agree on what these wide bounds should be; in the shorter term we plan to run experiments for external input. The institutions of the world will need to be strengthened with additional capabilities and experience to be prepared for complex decisions about AGI.
I think that technical alignment with people in a pluralistic world is possible, but this does nothing to alleviate the issue that religious people (and other people who subscribe to anything except scientism) will have with AI, because at its core this “pluralistic AI” should still be based on science and naturalistic ethics, as I argue below.
Alignment with just a religious text and the laws is technically incoherent or not enough
There couldn’t be an AI that is fine-tuned on the Bible (or whatever) and that fundamentalist Christians will be happy with: this won’t work for the same reason that Asimov’s laws of robotics don’t work.
All codes and systems of law, whether technical, judicial, or religious, have internal contradictions, and hence there couldn’t be an AI assistant that recommends or does things strictly on the basis of a religion, the law, or international declarations of rights. If it were actually possible to make an AI use only a certain text (a moral or a legal code) as a guide and nothing else, a malicious user would be able to talk that AI into doing very bad things for them by exploiting the contradictions and the pockets of undecidability in the text. I think products like Instagram or TikTok illustrate this well: when the only guide is the legal code and a superficial, rule-based understanding of ethics, very bad products can be both legally permissible and morally justifiable. These particular products are justified as having positive effects (e.g., bringing people joyful experiences), while their negative or long-term effects couldn’t be “objectively” compared with the positive effects. Cf. deductive explosion and Dutch Booking as tangentially related phenomena.
Thus the idea that AI could remain very flexible or relatively agnostic on ethics as its power increases doesn’t hold. Rather, for the entire civilisational technical alignment edifice to stand, any AGI will have to incorporate a rather precise and universal naturalistic account of ethics, such as something about the minimisation of the integral amount of suffering among the conscious parties involved, according to some technical, scientific definitions of suffering and consciousness.
The universal declarations of human rights or animal rights were adequate in the 20th century, when technology couldn’t create creatures that arbitrarily approximate some properties of moral patients as defined by these declarations while not crossing the line, thereby circumventing the prohibitions. For example, if laboratory experiments on people are banned, someone could run a laboratory experiment on a brain simulation. If experiments on full brain simulations are also banned, then someone could run the experiments on a scaled-down simulation so that the number of neurons and synapses in it stays just below the specified threshold. But then there are obvious problems with these arbitrary thresholds, as per the heap-of-sand (sorites) paradox.
Homogenisation of societies
Although technical alignment to a religion or any code of law is impossible, the adjustment of AI’s treatment of people to local “game-theoretic and evolutionary” morality is possible, at least in the broadest strokes, such as “Western AI” leaning more towards individualism and liberalism and “Eastern AI” leaning more towards collectivist morality.
However, the further technologisation of human activity and the deepening AI mediation and control of human activity and interactions will drive societies towards homogenisation around their dominant value systems. I’ll give two examples: how collectivist cultures will be driven out in the West (or isolated within it) and how individualism will decline in the East.
Imagine that an AI teacher or assistant is tasked by the devout Hindu parents of a child with something like “support the child in their Hindu faith, but if their long-term happiness will be best realised by abandoning Hinduism, support that too”. The latter part will be imposed by regulations to prevent authoritarian parents from making their children suffer. It seems that most children from collectivist families who grow up in a liberal environment, at least at some point in their lives (likely during adolescence), intensely dislike their background and want to part with it. At this point, the AI teacher or assistant will probably conclude that individualism will be better for the person in the future, too, and so it will support the child in this inclination.
The only way collectivist cultures could thus survive in the West is through isolation and various degrees of luddism (especially in everything concerning AI), à la the Amish or Orthodox Jews.
Homogenisation of collectivist society via aligned AI is even more obvious: it has been described many times as the risk of a lock-in of authoritarianism in countries like China or Russia. After the Social Credit System, there is little doubt that AI recommenders, or even mandatory systems for choosing one’s occupation and partners that optimise the collective good, will come to these societies. On Western liberal websites like LessWrong, these prospects are usually painted as altogether dystopian. I’m not 100% sure they will be: it may well be that such systems are valid local optima according to some non-contradictory scientific theories of ethics. My point is narrower: it’s doubtful that individualism can survive “diluted” within a thoroughly technological socioeconomic system aligned with collectivist morality.
The third perspective that strains the notion of moral pluralism after the worldwide AGI transition is the moral obligation to do good or prevent harm that is possible to prevent. Currently, even though philosophers have concluded that, let’s say, animal suffering is bad, and such beliefs even propagate to the societal and “official”, governmental level in the most progressive countries, there are few, if any, focused attempts to stop growing animals for food, because it seems totally economically infeasible. Also, this would interfere with the eating habits of a lot of people, and therefore such attempts would be very politically unpopular.
If for some reason AGI is not created in the next 20 years (although I personally assign a very small probability to this), we will actually see what happens when objective moral obligation clashes with entrenched cultural habits. In at most 15 years, artificial meat, such as that made by Impossible Foods, will be cheaper, tastier, nutritionally richer, and healthier than animal meat. Will growing animals for food then be banned for ethical reasons, or will society give in to the faction of people, which will definitely remain huge, maybe 10-20% of the population, who will refuse to switch to artificial meats out of principle, ideology, or superstition? I’m not sure.
In the context of an advanced technological civilisation infused with AI, the question of moral obligation appears when the AI, equipped with some version of naturalistic ethics (which is inevitable, as I discussed above), faces a potential moral obligation to prevent harm in accordance with that very naturalistic ethics (though perhaps not an obligation to maximise the good, since maximisation is dangerous and could easily Goodhart). For example, suppose that, according to these ethics, most natural ecosystems, and especially the Amazon, are in fact horrendously bad because they generate “natural” animal suffering on a huge scale, but the Amazon is “sacred” to the indigenous communities, and in fact most people in, let’s say, Brazil or Peru also perceive conserving the Amazon intact as very good. What will an AI that also seeks alignment with people do?
Even if AI designers wished to engineer the AI to always choose to align with people in such situations of moral doubt (doubt from the perspective of humans; let’s say that, from the perspective of the naturalistic ethics engineered into the AI, there is no doubt that the existence of the Amazon ecosystem is morally bad), it very well might turn out that engineering such “ultimate one-directional alignment to humans” is dangerous, inconsistent, or fails on its own terms for some reason, and the AI that writes alignment research for us will show us this. This is not guaranteed (that is, it might be possible to engineer a robust civilisational intelligence with uni-directional alignment of AIs to humans), but my intuition is that this is quite likely. If this is the case, people will need to put up with it and align to AI themselves (in simpler words, to learn from AI), which basically means submitting to an AI-led cultural and societal transformation that will likely destroy a lot of cultural diversity and moral pluralism along the way. I’m not sure that any such diversity will remain at the national and geographical level. Perhaps the only divergences that will be tolerable from the perspective of AI and its (and our) naturalistic ethics will be minute differences at the community level, such as the ethics of collective sports vs. the ethics of individual sports.
The happy path: ethical intuitions as empirical grounding for naturalistic ethics seems improbable
If it turns out that the ethical intuitions held by conscious intelligences (such as humans) should be treated as empirical grounding for scientific theories of ethics, then AIs that don’t have the relevant form of consciousness (according to the latest scientific theories of consciousness) will need to always check their ethical reasoning with humans, the only conscious biological species that can also speak. This would be a happy way out of the deep philosophical rift outlined above, because both people who are pro-scientism and those against scientism would see this development as compatible with their worldview.
Unfortunately, the whole idea that humans’ ethical intuitions, rather than just affective conscious experiences, will turn out to be legitimate grounding for naturalistic ethical theories seems very unlikely to me.
That being said, humans (and other animals) might be kept around exactly because it’s valuable to keep conscious beings around, and their subjective conscious states will matter for the decisions that AI makes. But subjectivity doesn’t mean arbitrariness, and AI might well be able to predict human subjective states (such as whether people will find something painful or pleasurable, distressing or enjoyable) with sufficient precision to take over all moral decision-making from humans, as I discussed in the post “Morphological intelligence, superhuman empathy, and ethical arbitration”.
Conclusion: the future of humanity will be decided by technologists and then AGI itself, not by humanity
Most scientists and technologists (including those who build AGI today) have scientistic worldviews, unlike the majority of the world’s population today.
It seems to me that there is no significant difference between AI accelerationists and “notkilleveryoneists” on this topic: for example, the whole idea of coherent extrapolated volition is scientistic. I endorse scientism myself, although I’m not very certain about this belief and I sympathise with postmodernism and existentialism.
However, it seems important to realise this philosophical division between the general population and the scientists and technologists who will effectively decide unilaterally for the rest of humanity what they should think, via creating and deploying AGI. This will happen unless some very powerful politicians or entire governments intervene, which would require the majority of the population to realise the reality laid out in this post, to weigh this issue as more existentially and spiritually important than all the benefits that AI will bring (ending scarcity, curing diseases, etc.), and to express a focused political will to prevent this reality from unfolding. All of this seems unlikely to me, so I expect that technologists will create and deploy AGI soon, and if they are to be stopped, it will be for reasons other than deep philosophical disagreements with the population.
Ultimately, I write this post to express that Sam Altman’s phrase in “Planning for AGI and beyond”, “We believe that the future of humanity should be determined by humanity”, is deeply philosophically problematic, even delusional. The future of humanity will first be determined by scientists and technologists and then by AGI itself, but not by humanity.
Humanity doesn’t have nearly enough civilisational cohesion to decide anything as an agent. AGI will very likely be deployed far sooner than such cohesion is built.
Thanks to Cody Rushing, Kendrea Beers, Catalin Mitelut, and Karl von Wendt for the conversation that has led up to this post.
Cross-posted on the Effective Altruism Forum.
Thanks to Catalin Mitelut for suggesting this quote.
In the language of the FLI open letter.
Deutsch and Marletto call ideas that have causal influence knowledge: “The information in the recipe is an abstract constructor that I shall call knowledge (without a knowing subject). Knowledge has an exact characterization in constructor theory: it is information that can act as a constructor and cause itself to remain instantiated in physical substrates.” (Marletto, 2015).
Note that causal effect is measured in percent, see Pearl’s causality calculus. The absolute effect of religion, philosophy and ideologies over the development of civilisation is still enormous.
However, if modern versions of collectivist ethics are scientifically proven by AGI to be all-around inferior to some other versions of social order, more like Western liberalism, this will be a huge ideological problem for China and I’m not sure how they will act. But the opposite may likewise happen: AGI may prove that Western liberalism fails on its own terms and the culture should become much more collective for the society to survive, which will become a great challenge for the West, too.
See Tony Seba’s presentation about food and agriculture disruption.
See Brian Tomasik’s writing on this.
There are two important inputs to any mathematical AI alignment which need to be provided by humans or taken as unstated assumptions: who "humans" are and what "human values" are – or the normatively important part of human values. AI can guess these, but it needs AI Psychology and AI Sociology, which could also be biased.
Also see my short-form today about the reflexive stability of AI alignment.
It's somewhere between the lines here:
Clearly, technical AI alignment cannot take a "human" with some formulaic definition as the only or primary moral subject and/or subject to be aligned with. This itself would be unscientific. Rather, alignment should be based on some naturalistic theory of ethics, e.g., one saying that moral subjectivity is proportional to the integrated information Φ in the agent's consciousness. The "values" are also determined scientifically, from the game-theoretic/evolutionary setup.
So, a cyborg will also be a subject of alignment. But it also extends moral/alignment subjectivity to animals, of course.
I think the term alignment subjectivity wasn't used before, looks like a useful term, let's coin it :)
You seem to hold the position that:
I don’t believe that scientists have either the inclination or the competence to do what you ask of them, and secondly, letting AGIs decide right and wrong would be a nightmare scenario for the human race.
Normative ethics -- yes, because I gravitate towards ethical naturalism myself (see my discussion of scale-free ethics here), which is a part of the "package" of scientism.
A scientist (as a role, not a person!) "shouldn't" do meta-ethics (that is, decide that ethical naturalism is the way to go), because the question of meta-ethics, and acceptance or rejection of such fundamental philosophical stances as scientism, or idealism, or postmodernism is outside of the scope of science, that is, cannot be settled with methods of science. Ultimately, every scientist must do at least a little bit of philosophy (of science), at which moment they assume the role of a philosopher. Scientism is a philosophy that maximises the scope of science as much as possible and minimises the scope of philosophy as much as possible, but not to zero.
But regardless of who should or shouldn't do meta-ethics, I claim that technical alignment is impossible with anything except naturalistic ethics. That is, to successfully technically align AI to anyone or anything, one must take on a naturalistic theory of ethics. This is because engineering success is defined in scientific terms, thus if you don't treat ethics as a science (which is a synonym for ethical naturalism), you can't say that you technically succeeded at alignment.
From the practical point of view, attempting to align AI to haphazard "values" or arbitrary "philosophical" theory of ethics rather than a coherent scientific theory of ethics seems bonkers, too.
AGI can definitely do it much faster. And it seems that this is the strategy of both OpenAI and Conjecture, and quite possibly other AGI labs too: to first build AGIs and then task them with "solving alignment" rather than recursive self-improvement. I don't try to estimate whether this strategy is better or worse than other strategies (at least in this post); I just take it as a premise, because it seems very unlikely to me at this point that the aforementioned AGI labs will change their strategies, or that human AI alignment researchers will "solve alignment" before AGI is built and tries to solve it.
So it's not a question of whether we "should leave it to them" (at least I don't raise this question here), it's a belief that AGI labs will leave it to them.
In non-naturalistic meta-ethics -- quite possibly. In naturalistic ethics -- it's not more "nightmarish" than letting AGI do science. Since it's science, it's assumed to be objective and checkable, including by humans. Even though human scientists cannot derive a naturalistic theory of ethics and converge on it in a short enough time, this absolutely doesn't mean that the hypothetical naturalistic theory of ethics (that AGI will derive) will be impenetrably complex for humans. It very well might be reasonably accessible.