This post is a follow-up to "A multi-disciplinary view on AI safety research". I elaborate on some arguments behind this view.
TL;DR: please skim section headings and bolded sentences in the text.
Computationally tractable mathematical models of alignment are bound to be biased and blind to certain aspects of human values
No single mathematical model of human values that has orders of magnitude fewer degrees of freedom than an actual human will adequately capture the complexity of value because humans are complex systems and, therefore, cannot be reduced to a much simpler model.
If the model is complex enough to capture human values robustly, as a whole-brain emulation would be, then actually using it for alignment raises ethical concerns and S-risks, because the model itself may suffer.
Many mathematical theories of human cognition and frameworks for computing (inferring) human values are considered as the basis for alignment. So are process theories of alignment that implicitly rely on a particular mathematical theory even if they don’t infer (humans’ or AIs’) values explicitly. Examples include shard theory (RL-based), Beren Millidge’s recent computational anatomy of human values, cooperative inverse reinforcement learning, Bayesian models and approaches, and various linguistic process theories of alignment that I expect to become very hot this year due to the astonishing success of LLMs. However, since all these theories collapse the complexity of humans (or else they are equivalent to full human simulations), they are all bound to be incomplete. Moreover, all these theories are bound to be biased (this is a form of inductive bias, if you wish), that is, to be relatively blind to specific kinds of human values, or to specific aspects of human nature that we can see as somehow related to (or producing) “values”.
In other words, human values are not only complex in the sense that they are very elaborate. Crucially, human values are also not capturable within a single mathematical framework or ontology for describing them.
From “solving the alignment problem” to engineering the alignment process
The main implication of the above thesis is that we should abandon the frame that we should “solve” the alignment problem and seek a smart theory that will “crack” this problem.
I feel that a fair amount of unproductive debate and unproductive allocation of alignment research resources stems from this illusion. People often debate whether this or that theory “can or cannot succeed” (in “solving” alignment, it is implied), or try to find the “best” theory and invest their effort into improving that theory because it’s the “best bet”.
Instead, we should adopt a portfolio approach. Suppose theory A captures 90% of the “value complexity” to be aligned; theory B largely overlaps with theory A, but together they capture 95% of value complexity; adding theory C to the mix raises it to 97%, and so on. (Of course, these percentages are fictitious and cannot actually be computed.)
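To make the arithmetic of the portfolio picture concrete, here is a toy sketch in Python. Everything in it is an assumption made purely for illustration: the universe of 100 discrete "value aspects", the names `UNIVERSE`, `THEORIES` and `cumulative_coverage`, and the particular sets assigned to theories A, B and C (chosen only so that the totals reproduce the fictitious 90%, 95% and 97% above). Real value complexity cannot be enumerated like this.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical universe of 100 discrete "value aspects", labelled 0..99.
UNIVERSE: Set[int] = set(range(100))

# Hypothetical theories, each mapped to the aspects it happens to capture.
THEORIES: Dict[str, Set[int]] = {
    "A": set(range(0, 90)),              # captures 90 aspects
    "B": set(range(5, 95)),              # mostly overlaps A, adds 90..94
    "C": set(range(40, 60)) | {95, 96},  # mostly overlaps A and B, adds 95 and 96
}


def cumulative_coverage(order: List[str]) -> List[Tuple[str, float, float]]:
    """Return (theory, total coverage so far, marginal gain) for each theory in order."""
    covered: Set[int] = set()
    rows: List[Tuple[str, float, float]] = []
    for name in order:
        before = len(covered)
        covered |= THEORIES[name] & UNIVERSE  # union with the newly added theory
        after = len(covered)
        rows.append((name, after / len(UNIVERSE), (after - before) / len(UNIVERSE)))
    return rows


if __name__ == "__main__":
    for name, total, gain in cumulative_coverage(["A", "B", "C"]):
        print(f"{name}: total {total:.0%}, marginal gain {gain:.0%}")
```

In this toy setup, each added theory contributes only the aspects that the earlier theories missed, so the marginal gains shrink (90%, then 5%, then 2%) while the total coverage still rises.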
This is an engineering approach of adding extra assurances to the alignment process until all stakeholders of the system agree that the quality (the quality of being sufficiently aligned, or alignable to humans, in this case) is assured well enough for production deployment of the system.
Seen this way, it becomes clear that marshalling all effort behind improving a single theory is not optimal, loosely speaking, because of the law of diminishing returns. (Also, as noted above, a good fraction of the alignment research community’s brain power goes into actually “finding” that best theory, on both the individual and collective levels.)
Surely, theories must be improved: some aspects of them should be criticised and possibly updated, the bounds of their applicability and their biases (blind spots) should be better quantified and understood, experiments with humans and AIs should be done to improve the empirical grounding of the theories, and finally, some software and/or AIs should be built to actually harness these theories into the “systems for alignment”.
However, it’s likely that investing, let’s say, 90% of resources in improving a single theory (which has its biases and “blind spots”) will yield less “value complexity coverage” and alignment robustness than spreading resources between multiple theories.
Diversity of approaches in the industry
It would be extremely cool if AGI labs, which objectively have the most resources for doing alignment, coordinated to avoid collapsing onto a single alignment theory merely because each lab individually judged that this theory alone covers the most “surface of human values”. Instead, they could try to maintain a diversity of approaches at the level of the industry, so that at “moment X” there are multiple mature theories, and software frameworks supporting them, which can be applied to different AIs. (I believe some labs also try to maintain a diversity of approaches internally, but they usually still have a “main” approach. Also, given the limited resources allocated to alignment, it would be globally more efficient for labs not to spread their resources too thinly internally but to achieve diversity at the level of the industry.)
Actually, we already see this to a degree: the linguistic alignment approaches of OpenAI and Anthropic are rather different from the “emotion-based” approach put forward by LeCun in his vision paper (presumably the approach adopted by FAIR) and from Millidge’s theory (presumably the approach adopted by Conjecture).
Beyond “alignment” theories
Note that this post considers only a relatively narrow aspect of the multi-disciplinary view on AI safety, namely the poly-theoretical approach to the technical alignment of humans to AIs. This mainly speaks to theories of cognition (intelligence, alignment) and ethics. But on a larger view, there are more theories and approaches that should be deployed in order to engineer our civilisational intelligence such that it “goes well”. These theories are not necessarily about “alignment” as such. Examples are control theory (we may be “aligned” with AIs but collectively “zombified” by powerful memetic viruses and walk towards a civilisational cliff), game theory (we may have good theories of alignment, but our governance systems cannot deal with multi-polar traps, so we cannot deploy these theories effectively), information security considerations, mechanistic anomaly detection and deep deceptiveness, etc. All these perspectives further demonstrate that no single compact theory can “save” us.
Creating as many new conceptual approaches to alignment as possible? No
The above thesis is also my reason to be very sceptical about a widely held belief in the AI x-risk sphere that “we need more wild conceptual ideas about alignment, fast”, e.g., as expressed by Adam Shimi: “We need far more conceptual AI alignment research approaches than we have now if we want to increase our chances to solve the alignment problem.”
Considering that no single theory could be sufficient to “solve” alignment, it seems ineffective to spread alignment research resources as thinly as possible, hoping that some researcher will, by happy chance, come up with that “theory we were all looking for during all these years” that will “rescue” humanity.
Conversely, we cannot say that any of the existing established theories of cognition or human value (those that have solid mathematics and empirical grounding behind them, at least) is “wrong”. They may capture more or less of the “true” complexity of human value, but arguably they could all be useful for generating engineering assurances during the AI alignment process, unless one of these theories strictly captures all the value covered by another theory, which is probably impossible to prove.
Thirdly, I think it’s very unlikely that all existing theories of cognition and human values fail to capture a certain aspect of the complex values, or, in other words, have a common “blind spot”.
Considering these arguments, it seems to me that the optimal resource allocation for the alignment research community (assuming short timelines) would be to marshal efforts behind no more than 5-7 theories (or possibly even fewer) of cognition and human values (ethics) that already exist, using the following criteria for choosing these theories:
- Scientific soundness (mathematical rigour, empirical grounding) and explanatory power
- Whether the theory is already adopted by some AGI lab(s), or very influential researcher(s) and their academic labs, such as LeCun, Bengio (MILA), Russell (CHAI), Friston, etc., or any alignment lab (MIRI, safe.ai, FAR, etc.)
- Whether the theory is amenable to being harnessed in software systems and protocols for alignment (or value inference)
Appendix. Does AGI need to be a complex system as well?
I also have an intuition that making a useful, capable, and robust AGI also requires it to be a complex system, even if it is produced by simple and understandable local inference or logical rules. So the hope of coming up with a GOFAI-style theory of intelligence/agency that will be provably and cleanly steerable and alignable might not be realisable in principle. Either the AI is too simple to be an independent robust agent in human society, or it needs to be approximately as complex as humans themselves. Cf. the law of requisite variety.
This intuition, if true, confirms the situation of aligning AIs with humans as pretty much symmetrical, but the thesis of this post doesn’t depend on it. The alignment process must be very iterative and reciprocal anyway, and thus even if only one party in the process (humans) is complex, a single process theory for alignment cannot capture the complexity of human values.
Could it end in a situation in which we have two different friendly AIs with completely different understandings of the nature of human values? Each will perceive the other as an enemy, and there will be war.
Also, my suggestion in the post that AGI labs should diversify their alignment approaches assumes that labs exchange their mature frameworks for alignment (or in fact make them public), so that each lab can apply multiple alignment theories/frameworks simultaneously while designing and training its AI. This way, each AI could be aligned with people to a higher degree than if only a single theory were applied.
What do you mean by "understanding of the nature of human values"?
If both aligned AIs are properly reflective and understand science properly, they understand that their respective toolboxes for modelling human values (or even the values of arbitrary black-box intelligent systems) are what they are: just toolboxes and models without special metaphysical status.
They may discuss their respective models of values, but there is no reason to be "at war", because both models are presumably well-aligned with humans, and their predictions coincide in a large proportion of cases and diverge only in very obscure cases (like the trolley problem or other infamous thought experiments in ethics specifically designed to test the edges of axiological and ethical models) or when the models are "rolled out" very far into the future. For the latter case, as I gestured at in the post as well, I think the "alignment" frame is actually not useful and we should rather think in terms of control theory, game theory, theory of evolution, etc. Friendly AIs should understand this, and should not even try to simulate a very far future using their value models of people. (And yes, this is the reason why I think the concept of coherent extrapolated volition actually doesn't make sense.)
Maybe an interesting thing to note here is that if both AIs were aligned to humans independently, let's say to cover 98% of human value complexity, but with different methods, their default mutual alignment on the first encounter (if you don't permit any online re-alignment, such as is possible even with LLMs during prompting, though to a limited extent) is expected to be lower, let's say only 97%. But I don't see why this should be a problem.
I meant that a situation is possible in which two AIs use completely different alignment methods and also come to different results.