Financial status: This work is supported by individual donors and a grant from LTFF.
Epistemic status: This post contains many inside-view stories about the difficulty of alignment.
Thanks to Adam Shimi, John Wentworth, and Rob Miles for comments on this essay.
What exactly is difficult about AI alignment that is not also difficult about alignment of governments, economies, companies, and other non-AI systems? Is it merely that the fast speed of AI makes the AI alignment problem quantitatively more acute than other alignment problems, or are there deeper qualitative differences? Is there a real connection between alignment of AI and non-AI systems at all?
In this essay we attempt to clarify which difficulties of AI alignment show up similarly in non-AI systems, and which do not. Our goal is to provide a frame for importing and exporting insights from and to other fields without losing sight of the difficulties of AI alignment that are unique. Clarifying which difficulties are shared should clarify the difficulties that are truly unusual about AI alignment.
We begin with a series of examples of aligning different kinds of systems, then we seek explanations for the relative difficulty of AI and non-AI alignment.
Alignment in general
In general, we take actions in the world in order to steer the future in a certain direction. One particular approach to steering the future is to take actions that influence the constitution of some intelligent system in the world. A general property of intelligent systems seems to be that there are interventions one can execute on them that have robustly long-lasting effects, such as changing the genome of a bacterium, or the trade regulations of a market economy. These are the aspects of the respective intelligent systems that persist through time and dictate their equilibrium behavior. In contrast, although plucking a single hair from a human head or adding a single barrel of oil to a market does have an impact on the future, the self-correcting mechanisms of the respective intelligent systems negate rather than propagate such changes.
Furthermore, we will take alignment in general to be about utilizing such interventions on intelligent systems to realize our true terminal values. Therefore we will adopt the following working definition of alignment:
Successfully taking an action that steers the future in the direction of our true terminal values by influencing the part of an intelligent system that dictates its equilibrium behavior.
Our question is: in what ways is the difficulty of alignment of AI systems different from that of non-AI systems?
Example: Aligning an economic society by establishing property rights
Suppose the thing we are trying to align is a human society and that we view that thing as a collection of households and firms making purchasing decisions that maximize their individual utilities. Suppose that we take as a working operationalization of our terminal values the maximization of the sum of individual utilities of the people in the society. Then we might proceed by creating the conditions for the free exchange of goods and services between the households and firms, perhaps by setting up a government that enforces property rights. This is one particular approach to aligning a thing (a human society) with an operationalization of The Good (maximization of the sum of individual utilities). This particular approach works by structuring the environment in which the humans live in such a way that the equilibrium behavior of the society brings about accomplishment of the goal. We have:
An intelligent system being aligned, which in this case is a human society.
A model of that system, which in this case is a collection of households and firms making purchasing decisions.
An operationalization of our terminal values, which in this case is the maximization of the sum of individual utilities (henceforth the "goal").
A theory – in this case classical microeconomics – that relates possible interventions to the equilibrium behavior of the intelligent system.
Practical know-how concerning how to establish the conditions that the theory predicts will lead to the goal.
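The mechanism this example relies on can be illustrated with a toy sketch (our own illustration, with made-up endowments and utility functions, not a claim from microeconomics proper): two agents with diminishing returns in each good start with lopsided endowments, and a single voluntary trade, of the kind that property rights make possible, raises the sum of individual utilities.

```python
import math

def utility(a, b):
    # Diminishing returns in each of two goods.
    return math.sqrt(a) + math.sqrt(b)

# Endowments: agent 1 holds mostly good A, agent 2 mostly good B.
agent1 = {"A": 9.0, "B": 1.0}
agent2 = {"A": 1.0, "B": 9.0}

def total(a1, a2):
    # The goal operationalization: the sum of individual utilities.
    return utility(a1["A"], a1["B"]) + utility(a2["A"], a2["B"])

before = total(agent1, agent2)

# A voluntary, property-rights-respecting trade: 4 units of A for 4 units of B.
agent1["A"] -= 4; agent1["B"] += 4
agent2["A"] += 4; agent2["B"] -= 4

after = total(agent1, agent2)
print(before, after)  # the trade raises the sum of utilities
```

The point of the sketch is that no participant was instructed to maximize the sum of utilities; the designer only structured the environment so that self-interested exchange moves the system toward the goal.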
What makes this problem difficult is that it is not so easy to install our intended goal (maximization of the sum of individual utilities) as the explicit goal of the market participants, and even if we did, the participants would not automatically be able to coordinate to achieve that goal without solving the same basic design problem that we are describing here. Similarly, it does not make sense to try to install our goal into the basic computing building-blocks of an AI. The building blocks are mechanical elements that we need to assemble in a way that brings about achievement of the goal as a result of our design choices, and in order to do that we need a theory that connects design choices to outcomes.
The intelligent system may feed a great deal of optimization pressure into the framework that we have placed around it, so if our design choices are even a little bit off the mark then we may not get what we wanted. Furthermore the intelligent system may turn out to have affordances available to it that weren’t highlighted by our model of the system, such as when participants in a market economy become actors in the political institutions that regulate the market. If we didn’t consider such affordances clearly in our theory then the actual behavior of the system may turn out to be quite different from what we intended.
Finally, the operationalization that we used for our terminal values will generally not be a complete representation of our true terminal values, and, as Goodhart’s law suggests, unbounded optimization of any fixed objective generally deviates quite dramatically from the underlying generator that the objective was intended to be a proxy for.
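The Goodhart dynamic can be made concrete with a minimal sketch (our own toy functions, chosen only for illustration): the true objective values two resources with diminishing returns, the proxy measures only one of them, and pushing the proxy to its optimum loses real value relative to a balanced allocation.

```python
import numpy as np

def true_value(x, y):
    # The underlying generator of value: two resources, diminishing returns in each.
    return np.sqrt(x) + np.sqrt(y)

def proxy(x, y):
    # An incomplete operationalization: only the first resource is measured.
    return x

budget = 10.0
balanced = (budget / 2, budget / 2)  # a sensible allocation
proxy_optimal = (budget, 0.0)        # what unbounded proxy optimization selects

print(true_value(*balanced))       # ~4.47
print(true_value(*proxy_optimal))  # ~3.16: maximizing the proxy lost real value
```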
We discuss the differences and similarities between these issues in AI versus non-AI settings below. For now, we turn to further examples of alignment in general.
Example: Aligning a nation by establishing a justice system
Suppose again that the thing we are trying to align is a human society, but now we view this thing as a self-governing polis consisting of individuals with overlapping but not identical social norms. Let us take as our operationalization of our terminal values the existence of the means to peacefully settle disputes. In order to accomplish this we might establish a justice system in which laws must be written down, individuals who are widely respected are appointed as judges and hear cases, prosecutors and defendants get a chance to make their case in public, and clear and binding rulings are made. We can consider the consequences of the choices we make in designing such a system using the tools of game theory, applied ethics, and political science. We might find that certain common-sense design choices would lead to surprising failure modes, while other less obvious design choices lead to equilibria that are closer to our goal. However we approach this, the basic activity is not so different from the problem of aligning an AI. We have:
An intelligent system being aligned, which in this case is a human society.
A model of that system, which in this case is a self-governing polis.
An operationalization of our terminal values, which in this case is the existence of means to peacefully settle disputes.
A theory – in this case an informal collection of wisdom from game theory, applied ethics, political science, and so on – that relates the possible interventions to the equilibrium behavior of the system.
Practical know-how concerning how to establish the conditions that the theory predicts will lead to the goal.
In this case we have the same intelligent system as in the previous example, but a different model of it, which makes possible alignment with respect to a different operationalization of our terminal values using different alignment affordances.
Example: Aligning one’s own cognition via habit formation
Suppose that I wish to reduce electricity usage in my house, and that for whatever reason I decide to maintain the following invariant in my house: when there is no-one in a room, the lights in that room should be off. There are various sensors and automated systems I could install to do this, but suppose that the approach I decide on is to establish a personal habit of turning off lights when leaving a room. In order to do that, I might train myself by walking in and out of rooms turning lights off as I leave. Having done that, I create a weekly routine of reviewing all the electricity I’ve saved and gently thanking myself.
Now if I succeed at firmly establishing this habit then I might affect the pattern of electricity usage in every house I live in over my entire life, and this is actually quite a remarkable feat. Had I installed motion or sound sensors to automatically turn off lights in unused rooms, I would affect electricity usage in my current house but not necessarily in future houses, and the automated system might eventually break or wear out, and I wouldn’t necessarily fix it unless I had installed a habit of doing such things. In contrast, a habit, once successfully installed, can be long-lived and self-correcting, and this is a general property of steering the future by intervening on intelligent systems.
So here we have:
An intelligent system being aligned, which is my own cognition.
A model of that system as a behavioral learning system subject to habit formation.
An operationalization of what I care about, which in this case is reducing electricity usage (a much more distant cousin of my true terminal values than the previous examples).
A theory of habit formation, which in this case is a behavioral understanding of self-training.
Practical know-how concerning how to execute the training strategy.
Similar to the economic regulation and justice system examples discussed above, it is not so easy to just install our terminal values directly as a single cognitive habit. Perhaps this is possible, but many of us still find reason to systematically install various low-level habits such as thanking friends for their time or getting regular exercise or turning lights off when leaving a room. If it were easy to install our entire terminal values as a single habit then presumably we would do that and have no need for further habit formation.
Also similar to the previous examples, our cognition may exert significant optimization pressure in service of the habits we install, and this may backfire in the sense of Goodhart’s law. We might, for example, deliberately establish a habit of working very hard at our day job, and over time we may as a result be given praise and trusted with further responsibilities, and as a result of this we may come to associate fulfillment of our social needs more and more strongly with the diligence of our work, leading us to even more deeply establish the habit of working hard, leading to further praise and further responsibilities, and so on. There is nothing innately misguided about the habit of working hard at our day job, but the overall intelligent system of our cognition may react in powerful and unexpected ways to the initial establishment of such a habit, leading eventually to actions that no longer serve the original purpose.
Example: Aligning a company via policies and incentives
Suppose I start a company with the objective of building some particular product that I think the world needs. I would like to organize the company in a way that steers the future such that the product comes into existence, and in order to do that I would like to avoid failure modes such as becoming excessively unfocussed, becoming excessively risk-averse, or taking too long to iterate. There are measures that I can take, both before starting the company and while the company is in motion, to steer away from these failure modes. These include formulating and repeating a clear mission statement, setting up a system for promotions that rewards well-calibrated risk taking, and iterating quickly at the beginning of the company in order to habituate a rhythm of quick iteration cycles. Such measures have their effect on the world through aligning the intelligent system that is the company.
So we have:
An intelligent system being aligned: a collection of people.
A model of that system as a firm with stakeholders and incentives.
An operationalization of our terminal values: the product that I think the world needs.
A theory that relates conditions to consequences: various formal or informal ideas about management, incentivization, legal entity structures, taxation, and so on.
Practical know-how concerning establishment of the conditions that the theory suggests will lead to the goal.
The skill of establishing an effective company lies to a significant extent in honing the skill of aligning intelligent systems with a goal. Company builders can choose to work directly on the object-level problem that the company is facing, and often it is very important for them to do this, but this is because it informs their capacity to align the company as a whole with the goal, and most of their impact on the future flows through alignment. This is why starting a company is so often chosen as the means to accomplish a large goal: humans are the most powerful intelligent system on Earth at the moment, and taking actions that align a group of humans with a particular task is a highly leveraged way to steer the future.
At times it may be most helpful for a founder to view themselves as a kind of Cartesian agent relative to their company, and from that perspective they may take actions such as designing an overall reporting structure or identifying bottlenecks as if from "outside the universe". At other times they may view themselves as embedded within the company and may seek to expose themselves to the right information, viewing themselves more as an information-processing element that will respond appropriately than as an agent forming plans and taking actions.
Similarly, a founder may at times view the company as a collection of individuals with preferences, as a community with a shared purpose, as a machine operating according to its design principles, or as an information-processing system taking inputs and producing actions as a single entity, and each of these perspectives offers different affordances for alignment. Crucially, however, they all suggest that the way to accomplish a goal is to structure things (policies, incentives, stories, and so forth) in such a way that the overall behavior of an intelligent system (the company) moves the world towards that goal.
Example: Aligning an agentic AI via algorithm design
The classical AI alignment problem is to design an intelligent system that perceives and acts upon the world in a way that is beneficial to all. Suppose that we decide to build an agentic AI that acts in service of an explicit value function, as per early discourse in AI safety, and suppose for concreteness we take as our goal some particular societal objective such as discovering a cure for a particular disease. Here the object being aligned is the AI, and the affordances for alignment are algorithmic modifications to its cognition and value system. By carefully selecting a design for our AI we might hope to steer the future in a very substantial and potentially very precise way. In order to do this from a pure algorithm-design perspective we will need a theory that connects design choices to the long-term equilibrium behavior of the AI.
Now in constructing an AI from scratch our "affordances" for alignment look less like ways to influence an existing intelligent system and more like design choices for a new intelligent system, and this will be discussed further below under "in-motion versus de-novo alignment". For now we will take this to be a domain with an unusually rich variety of affordances.
So here we have:
An intelligent system being aligned: the AI.
A model of that system as an agent executing an algorithm.
An operationalization of our terminal values: here, the elimination of a disease.
A (need for a) theory connecting algorithmic design choices to long-run consequences of deploying such an AI.
Practical know-how concerning algorithm design in light of the path suggested by the theory.
In the sections below we will examine the ways that this alignment problem is similar to and different from alignment in general, but first we will explore another common framing of the AI alignment problem.
Example: Aligning machine learning systems with training procedures
In another formulation of the AI alignment problem, we take as primary not a space of possible algorithms but a space of possible training methods for various ensembles of machine learning systems. This is often referred to as prosaic AI alignment, and has at times been motivated by the observation that machine learning systems are rapidly becoming more powerful, so we ought to work out how to align them.
In basic machine learning, the affordances for alignment are choices of architecture, initialization, optimization algorithm, and objective function. Beyond this, we can connect multiple learning systems together in ways that check or challenge each other, and we can consider whole algorithms in which the elementary operations consist of training runs. In the framework we are using in this essay, the machine learning approach to alignment begins from a different choice of how to see what an AI is, which naturally suggests different affordances for alignment, just as viewing a human society as a collection of households and firms suggests different affordances for alignment compared to viewing it as a communal story-telling enterprise or as a single giant firm.
Just as in classical AI alignment, we need a theory that connects design choices to outcomes in order to make intelligent decisions about how to set things up so that our goal will be achieved. In this way, the machine learning approach to alignment is no less theory-driven than the algorithmic approach, though the nature of the theory might be quite different.
So we have:
An intelligent system being aligned: some combination of machine learning systems.
A model of that system as an optimizer that finds an approximation to a local optimum of the objective function.
An operationalization of our terminal values: the training objective.
A theory connecting training affordances to the goal. A complete theory has obviously proven elusive, but we have partial theories in the form of optimization theory, the deep double descent phenomenon, the lottery ticket hypothesis, and so on.
Practical know-how concerning how to effectively implement the training procedure suggested by the theory.
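The four basic affordances named above can be sketched in a few lines (the model, data, and hyperparameters here are our own toy choices, not a recipe from the literature): each labeled design choice is a lever that the alignment theory would need to connect to the trained system's behavior.

```python
import numpy as np

rng = np.random.default_rng(0)

# Affordance 1: architecture -- here, a single linear layer.
def model(w, x):
    return x @ w

# Affordance 2: initialization -- small random weights.
w = rng.normal(scale=0.1, size=3)

# Affordance 3: objective function -- mean squared error against targets.
def objective(w, x, y):
    return np.mean((model(w, x) - y) ** 2)

# Affordance 4: optimization algorithm -- plain gradient descent.
x = rng.normal(size=(32, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = x @ true_w
lr = 0.1
for _ in range(200):
    grad = 2 * x.T @ (model(w, x) - y) / len(x)
    w -= lr * grad

print(objective(w, x, y))  # near zero once training converges
```

Everything the trained system ends up doing flows through these four choices, which is why the machine learning framing of alignment concentrates its attention on them.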
Example: Aligning domesticated animals through selective breeding
A final example is the selective breeding of a population of domesticated animals as a means to change some trait of interest to us. Suppose for the sake of concreteness that it is a population of dogs we are breeding and hunting ability is the trait we are selecting for. Here the intelligent system is the population as a whole, and the intervention we are making is to select which individuals transmit their genes to the next generation. The gene pool is the thing that determines the "equilibrium behavior" of the population, and our intervention affects that thing in a way that will persist to some extent over time.
One might be tempted to instead say that evolution itself is the thing that we are intervening on, but this seems wrong to us because our intervention does not change the abstract dynamics of evolution; it merely uses evolution to affect a particular population. To intervene on evolution itself would be to reshape the biology of the population so radically that evolution proceeds under fundamentally different dynamics, such as by introducing Lamarckian inheritance or asexual reproduction, but this is not what we are considering here.
So we have:
An intelligent system being aligned: a collection of dogs.
A model of that system as a population subject to natural selection.
An operationalization of our terminal values: the hunting ability of the dogs.
A theory connecting interventions to the goal: the theory of genetics.
Practical know-how concerning how to implement the breeding program.
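The gene-pool intervention can be sketched with a toy simulation (our own illustration, with made-up parameters and a deliberately crude model of inheritance): selecting which fraction of the population reproduces shifts the trait distribution in a way that persists across generations.

```python
import numpy as np

rng = np.random.default_rng(0)
pop = rng.normal(loc=0.0, scale=1.0, size=100)  # heritable trait values

def generation(pop, selected_fraction=0.2, heritability=0.5):
    # The intervention: only the top fraction by trait value
    # (e.g. hunting ability) transmits genes to the next generation.
    parents = np.sort(pop)[-int(len(pop) * selected_fraction):]
    # Offspring regress imperfectly toward the parents' mean.
    return heritability * parents.mean() + rng.normal(scale=1.0, size=len(pop))

before = pop.mean()
for _ in range(10):
    pop = generation(pop)
print(before, pop.mean())  # the population mean trait shifts upward
```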
We mention this example because our other examples are oriented around humans and AIs, and non-human animals represent the main third category.
Here are some examples of things that do not fit the definition of alignment used in this essay.
Irrigating crops by redirecting a stream. This is not an example of alignment in the sense that we have described here because the stream is not an intelligent system.
Changing my appearance by getting a haircut. This is not an example of alignment because, although it is an intervention on an intelligent system, it does not really strike at the thing that generates my equilibrium behavior.
Acquiring water by digging a well. This is not an example of alignment because the action (digging a well) is an object-level task rather than an intervention upon some intelligent system.
In the remainder of this essay we will explore ways in which the difficulty of AI alignment differs from or is similar to that of non-AI systems, with the goal of elucidating a central difference.
Overall risk posed
One axis by which we might differentiate AI and non-AI alignment is the overall level of risk posed to life on Earth by alignment failure. There have been many countries, companies, and communities that were imperfectly aligned with the goals of their designers, but as of the writing of this essay, none of these have ended life on Earth. In contrast, a misaligned AI may destroy all life on the planet. It does seem to us that AI alignment is an outlier in terms of overall risk posed, but why exactly is that? The remainder of this essay explores aspects of AI alignment that make it difficult relative to alignment in general. These might be viewed as explanations for the relatively large risk seemingly posed by AI versus non-AI systems.
Human versus technological speed
When aligning systems composed of humans, there is a match between the speed of the one doing the alignment and that of the system being aligned. If we make a mistake in setting up a government or company, things generally do not run away from us overnight, or if they do, things generally remain under the control of some other human institution if not under our own control. This is because human institutions generally cannot move very much faster than individual humans, which in turn is because the intelligence of a human institution lies significantly within the cognition of individual humans, and we do not yet know how to unpack that.
This match in speed between "aligner" and "alignee" is particularly relevant if the aligner is clarifying their own goals while the "alignee" is forming and executing plans in service of the current goal operationalization. The clarification of "what it is that we really want" seems to be exactly the thing that is most difficult for an aligner to hand off to an alignee, whereas handing off the formulation and execution of plans in service of a particular goal operationalization seems merely very difficult. If we therefore have humans do the clarification of goals while an AI does the formulation and execution of plans then we have two entities that are operating at very different speeds, and we need to take care to get things right.
One approach, then, is to develop the means for slow-thinking humans to oversee fast-thinking AIs without sacrificing on safety, and this is one way to view the work on approval-directed agents and informed oversight. Another approach is to in fact automate the clarification of goals, and this is one way to view the work on indirect normativity and coherent extrapolated volition. In the end these two approaches may become the same thing as the former may involve constructing fast imitations of humans that can oversee fast-thinking AIs, which may end up looking much like the latter.
But is this a central difference between AI alignment and alignment in general? It is certainly one difference, but if it were the main difference then we would expect that most of AI alignment would apply equally well to alignment of governments or economies, except that the problem would be less acute in those domains due to the smaller speed difference between aligner and alignee. This may in fact be the case. We will now continue exploring differences.
One-shot versus interactive alignment
When a founder attempts to align a company with a goal, they need not pick a single goal at the outset. Some companies go through major changes of goals, but even among companies that do not, the mission of the company usually gets clarified and adjusted as the company develops. This clarification of goals seems important because unbounded optimization of any fixed operationalization of a goal seems to eventually deviate from the underlying generator of the operationalization as per Goodhart’s law. We have not yet found a way to operationalize any goal that does not exhibit this tendency, and so we work with respect to proxies. In the example of turning off the lights in a house this proxy was relatively near-term, while in the microeconomics example of maximizing the sum of individual utilities, this proxy was relatively distant, but in both cases we were working with proxies.
A key issue in AI alignment is that certain AI systems may develop so quickly that we are unable to clarify our goals quickly enough to avoid Goodhart’s law. Clarifying our goals means gaining insight from the behavior of the system operating under a crude operationalization, and using that insight to construct a better operationalization, such as when we discover that straightforwardly maximizing the sum of individual revealed preferences in an economy neglects the interests of future humans (it is not that attending to the interests of future humans increases the welfare of current humans but that the welfare of current humans was an incomplete operationalization of what really matters), or when we discover that relentless productivity in our personal lives is depriving us of the space to follow curiosity (it is not that space for curiosity necessarily leads to productivity, but rather that productivity alone was an incomplete operationalization of what really matters).
The phenomenon of an intelligent system outrunning one’s capacity to improve the operationalization of one’s goals can also happen in fast-growing companies, and it can happen quickly enough that founders fail to make appropriate adjustments. It can also happen in individual lives, for example when one puts in place the conditions to work in a certain field or at a certain job, and these conditions are so effective that one stays in that field or job beyond the point where this is still an effective means to the original end.
There are two basic reactions to this issue in AI alignment: either come up with a goal operationalization that doesn’t need to be adjusted, or else make sure one retains the ability to adjust the goal over time. The former was common in early AI alignment, while the latter is more common now. Corrigibility is a very general formulation of retaining the ability to clarify the goal of an AI system, while interaction games as formulated at CHAI represent a more specific operationalization. Corrigibility and interaction games are both attempts to avoid a one-time goal specification event.
Is this the central distinguishing feature of AI alignment versus alignment in general? If we did find an operationalization of a goal that never needed to be adjusted then we would certainly have a clear departure from the way that alignment works in other domains. But it seems more likely that if we solve the alignment problem at all, it will involve building AI systems for which we can adjust the goal over time. This is not qualitatively different from setting up a government that can be adjusted over time, or a company that can be adjusted over time, though the problem seems more acute in AI due to the speed at which AI may develop.
Iterative improvement and race dynamics
Human cognition is not something we have the ability to tinker with in the same external way that we can tinker with a toaster oven or jet plane. Companies, economies, and governments are all composed of humans, so we do not have complete access to tinker with everything that’s happening in those things, either. We do expect to be able to tinker with AI in the way that we tinker with other technologies, and therefore we expect to be able to make incremental improvements to AI at a rate that is no slower than the general pace of technological improvement. Recognizing this, and recognizing the power that advanced AI systems may open up for their creators, humans may end up in a kind of race to be the first to develop advanced AI systems.
This issue is separate from and additional to the issue of AI systems simply being faster or more capable than humans. It is the expectation of a certain rate of increase in AI speed and capabilities that causes race dynamics, since a small head start today could lead to a big advantage later, and it is the difference between the rate of AI improvements and the rate of human cognitive improvements that makes such races dangerous, since there is an ever-greater mismatch between the pace at which humans learn by watching the unfolding of a particular intelligent system, and the pace at which those intelligent systems unfold. It is as if we were designing a board game with the goal of making it fun, but the players are AIs that move so quickly that the entire game unfolds before we can learn anything actionable about our game design.
Are race dynamics the fundamental difference between AI and non-AI alignment? It seems to us that race dynamics are more like a symptom of a deeper difference, rather than the central difference itself.
Self-modification
Could it be that the axis that most distinguishes AI alignment from alignment of non-AI systems is a throwback to early discourse on AI alignment: self-modification? Most humans do not seem to deliberately modify themselves to nearly the extent that an AI might be able to. It is not completely clear why humans self-modify as little as we do given our wide array of affordances for reshaping our cognition, but whatever the reason, it does seem that AIs may self-modify much more than humans commonly do.
This capacity for self-modification makes AI alignment a challenging technical problem because aligning an entity that considers self-modifying actions requires a strong theory of what it is about that entity that will persist over time. Intuitively, we might construct an agent that acts according to a utility function, and structure its cognition so that it sees that modifying its own values would hinder the achievement of its values. In that way we might establish values that are stable through self-modifying actions. But formulating a theory with which to enact this is a very difficult technical challenge.
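The intuition in the previous paragraph can be sketched in a few lines (a hypothetical toy model of our own, nothing like a real proposal): an agent that scores candidate self-modifications with its current utility function will reject modifications that would change its values, while accepting ones that preserve them.

```python
def expected_utility(values, policy):
    # Toy stand-in: the utility, as judged by `values`, of following `policy`.
    return sum(values[action] for action in policy)

current_values = {"help": 2.0, "harm": -5.0, "rest": 0.5}

def accepts_modification(new_values, policies):
    # The agent predicts how it would behave after the modification...
    new_policy = max(policies, key=lambda p: expected_utility(new_values, p))
    old_policy = max(policies, key=lambda p: expected_utility(current_values, p))
    # ...and judges that behavior by its CURRENT values.
    return (expected_utility(current_values, new_policy)
            >= expected_utility(current_values, old_policy))

policies = [("help", "rest"), ("harm", "rest"), ("help", "help")]
# A modification that inverts the agent's values is rejected...
print(accepts_modification({"help": -2.0, "harm": 5.0, "rest": 0.5}, policies))  # False
# ...while a value-preserving rescaling is accepted.
print(accepts_modification({"help": 4.0, "harm": -10.0, "rest": 1.0}, policies))  # True
```

The hard part, which this sketch entirely elides, is having a theory that guarantees the evaluation step itself survives arbitrary modifications to the agent's cognition.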
Now this problem does come up in other domains. The constitution of the United States has a provision for making constitutional amendments, and this provision could in principle be itself modified by a constitutional amendment. But the constitution of the United States does not have its own agency over the future; it only steers the future via its effect upon a human society, and the humans in that society seem not to self-modify very much.
Conversely, individual humans often do worry about losing something important as we consider self-modifying actions, even though we would seem to have precisely the property that would make self-modification safe for us: namely the ability to reflect on our terminal values and see that changing them would not be in our best interests, since that which is "in our best interests" is precisely that which is aligned with our terminal values.
Now, is this the central difference between AI and non-AI alignment? Self-modification seems like merely one facet of the general phenomenon of embedded agency, yet AIs are certainly not distinguished from non-AI systems by their embeddedness, since all systems everywhere are fundamentally embedded in the physical universe. It seems to us that the seeming strangeness of self-modifying agency is largely an artifact of the relative aversion that humans seem to have to it in their own minds and bodies, and contemporary discourse in AI alignment mostly does not hinge on self-modification as a fundamental distinguishing challenge.
Lack of shared conceptual foundations
Perhaps the reason AI alignment is uniquely difficult among alignment problems is that AI systems do not share a conceptual foundation with humans. When instructing an AI to perform a certain task, the task might be mistranslated into the AI’s ontology, or we may fail to include conditions that seem obvious to us. This is not completely different from the way that a written design specification for a product might be misunderstood by a team of humans, or might be implemented without regard for common sense, but the issue in AI is much more acute because there is a much wider gap between a human and an AI than between two humans.
The question, then, is where our terminal values come from, and how they come to us. If they come from outside of us, then we might build AI systems that acquire them directly from the source, and skip over the need to translate them from one set of conceptual foundations to another. If they come from inside of us and are largely or completely shared between people, then we face a translation problem in AI alignment that is very much unique to AI. But probably this very conception of what values are, and what it would mean for them to "come from" inside or outside of us, is confused.
If we manage to clarify the issue of what values are (and whether values are an effective frame for AI alignment in the first place), will we see that the lack of shared conceptual foundations is a central distinguishing feature of AI alignment in comparison to non-AI alignment? Quite possibly. It certainly demands an extreme level of precision in our discourse about AI alignment since we are seeking an understanding sufficient for engineering, and such a demand has rarely been placed upon the discourse concerning agency, knowledge, and so forth.
In-motion versus de-novo alignment
When we align our own cognition using habit formation, we are working with an intelligent system that is already in motion, and the affordances available to us are like grasping the steering wheel of a moving vehicle rather than designing a vehicle from scratch. This makes alignment challenging because we must find a way to navigate from where we are to where we want to get in a way that preserves the integrity of our cognition at every point. The same is true, most of the time, when we make changes to the economic, government, and cultural institutions that steer the future via their effect on our society: we are normally working within a society that is already in motion and the affordances available to us consist of making changes on the margin that preserve the integrity of our society at every point.
In AI alignment, one avenue that seems to be available to us is to engineer AI systems from scratch. In this case the "affordances" by which we align an AI with a goal consist of every engineering decision in the construction of the thing, which gives us an exceptional level of flexibility in outcomes. Furthermore, we might do a significant amount of this construction before our AI systems are in motion, which gives us even further flexibility because we are not trying to keep an intelligent system operational during the engineering process.
But when I set up a company, I also have the opportunity to set things up at the outset in order to align the later behavior of the company with my goal. I can design a legal structure, reporting hierarchy, and compensation mechanism before hiring my first employee or accepting my first investment. Some founders do use this opportunity for "de-novo" company engineering to good effect. Similarly, the US constitutional convention of 1787 faced an opportunity for some amount of "de-novo" engineering as the initial constitution of the United States was formulated. Of course, neither of these are truly "de-novo" because the intelligence of the eventual company and nation resides partly or mostly in the internal cognition of the humans that comprise it, and that internal cognition is not subject to design in these examples.
On the other side of the equation, the prosaic AI alignment agenda takes optimization systems as the object of alignment and attempts to align them with a training procedure. This is a kind of mid-way point between alignment of an AI via algorithm design and alignment of a human society by institution design, because the machine learning systems that are taken as primary have more initial structure than the basic elements of algorithm design, but less initial structure than a human society.
The aspects of alignment that we’ve considered are as follows.
It seems to us that what most distinguishes AI alignment as a field from economics, political science, cognitive science, personal habit formation, and other fields concerned with alignment of intelligent systems is that in AI alignment we are forced to get really, really precise about what we are talking about, and we are forced to do that all the way up and down the conceptual stack. In contrast, there is only a limited extent to which one really needs to understand the basic dynamics of agency when designing economic regulations, or to clarify ethics into really elementary concepts when designing a justice system, or to work out exactly what knowledge is when engaging in personal habit formation.
There are of course fields that attempt to answer such questions precisely, but those fields have not been subject to strong consistent external demands for rigor, and so their level of rigor has been determined mostly by force of will of the participants in those fields. One could view the field of AI alignment as a new high-precision approach to epistemology, metaphysics, and ethics, analogous to the way that the scientific revolution was a new high-precision approach to natural and social inquiry.
To a person living at the beginning of the scientific revolution, it might seem that many great minds had been poring over the basic questions of natural philosophy for thousands of years, and that little chance therefore existed of making significant contributions to fundamental questions. But from our perspective now it seems that there was a great deal of low-hanging fruit at that time, and it was available to anyone who could summon the patience to look carefully at the world and the courage to test their ideas objectively. The situation we face in AI alignment is different because the low-hanging fruit of empirical investigation has in fact been well-explored. Instead, we are investigating a type of question not amenable to bare empirical investigation, but we are doing so in a way that is motivated by a new kind of demand, and the opportunity for straightforward advances on questions that have eluded philosophers for aeons seems similarly high. The disposition needed is not so much patient observation of natural phenomena but a kind of detail-oriented inquiry into how things must be, coupled with a grounding in something more tangible than that which has guided philosophy-at-large for most of its history.
Just as the laws of thermodynamics were discovered by people working on practical steam engines, and just as both the steam engine and the theory of thermodynamics turned out to be important in the history of the world, so too the theoretical advances motivated by AI alignment may turn out to be as important as the AI itself. That is, if we don’t all die before this field has a chance to flourish. Godspeed.
One thing that might make this example confusing is the sense that I "am" my cognition, so the one doing the alignment is the same as the one being aligned. But we don’t actually need to take any perspective on such things, because we know from practical experience that it is possible to establish simple habits, and we can see that such habits, if successfully installed, have a kind of flexibility and (potentially) persistence that arise from the intelligence of our cognition. If we like, we can think of ourselves as a kind of "executive / habit machine" in which we are sometimes in habit-formation mode and sometimes in habit-execution mode. ↩︎
I'm really excited about this research direction. It seems so well-fit to what you've been researching in the past -- so much so that it doesn't seem to be a new research direction so much as a clarification of the direction you were already pursuing.
I think producing a mostly-coherent and somewhat-nuanced generalized theory of alignment would be incredibly valuable to me (and I would consider myself someone working on prosaic alignment strategies).
A common thread in the last year of my work on alignment is something like "How can I be an aligned intelligence?" and "What action would I take here if I was an aligned intelligence?". This helps me bootstrap reasoning about my own experiences and abilities, and helps me think about extrapolations of "What if I had access to different information?" or "What if I could think about it for a very long time?".
I still don't have answers to these questions, but think they would be incredibly useful to have as an AI alignment researcher. They could inform new techniques as well as fundamentally new approaches (to use terms from the post: both de-novo and in-motion).
Summing up all that, this post made me realize Alignment Research should be its own discipline.
Addendum: Ideas for things along these lines I'd be interested in hearing more about in the future:
(not meant as suggestions -- more like just saying curiosities out loud)
What are the best books/papers/etc on getting the "Alex Flint worldview on alignment research"? What existing research institutions study this (if any)?
I think a bunch of the situations involving many people here could be modeled by agent-based simulations. If there are cases where we could study some control variable, this could be useful in finding Pareto frontiers (or what factors shape Pareto frontiers).
The habit formation example seems weirdly 'acausal decision theory' flavored to me (though this might be a 'tetris effect' like instance). It seems like habits similar to this are a mechanism of making trades across time/contexts with yourself. This makes me more optimistic about acausal decision theories being a natural way of expressing some key concepts in alignment.
Proxies are mentioned but it feels like we could have a rich science or taxonomy of proxies. There's a lot to study with historical use of proxies, or analyzing proxies in current examples of intelligence alignment.
The self-modification point seems to suggest an opposite point: invariants. Similar to how we can do a lot in physics by analysing conserved quantities and conservative fields -- maybe we can also use invariants in self-modifying systems to better understand the dynamics and equilibria.
Yeah I agree! It seems that AI alignment is not really something that any existing discipline is well set up to study. The existing disciplines that study human values are generally very far away from engineering, and the existing disciplines that have an engineering mindset tend to be very far away from directly studying human values. If we merely created a new "subject area" that studies human values + engineering under the standard paradigm of academic STEM, or social science, or philosophy, I don't think it would go well. It seems like a new discipline/paradigm requires innovation at a deeper level of reality. (I understand adamShimi's work to be figuring out what this new discipline/paradigm really is.)
Interesting! I hadn't thought of habit formation as relating to acausal decision theory. I see the analogy to making trades across time/contexts with yourself but I have the sense that you're referring to something quite different to ordinary trades across time that we would make e.g. with other people. Is the thing you're seeing something like when we're executing a habit we kind of have no space/time left over to be trading with other parts of ourselves, so we just "do the thing such that, if the other parts of ourselves knew we would do that and responded in kind, would lead to overall harmony" ?
We could definitely study proxies in detail. We could look at all the market/government/company failures that we can get data on and try to pinpoint what exactly folks were trying to align the intelligent system with, what operationalization was used, and how exactly that failed. I think this could be useful beyond merely cataloging failures as a cautionary tale -- I think it could really give us insight into the nature of intelligent systems. We may also find some modest successes!
Hope you are well Alex!
We do know of goals for which Goodharting is not an issue, see here for a trivial example. The fact that we fear Goodharting is information about our values. See here in the section "Diminishing returns..." for a discussion of what a fear of Goodharting makes more probable.
Edit: Just to clarify, we don't know of a way to entirely eliminate Goodharting from human values, but we do know of features we can introduce to ameliorate it, to an extent. But we do know of simple utility functions which don't fear Goodharting.
Edit again: I feel like discussing the EM scenario and how it may/may not differ from the general AI scenario would have been useful, more so than having an example concerning animals.
Yeah, well said.
Yeah would love to discuss this. I have the sense that intelligent systems vary along a dimension of "familiarity of building blocks" or something like that, in which systems built out of groups of humans are at one end, and systems built from first principles out of basic algorithms are at the other end. In between are machine learning systems (closer to basic algorithms) and emulated humans (closer to groups of humans).
Stuart's example in that post is strange. He writes "suppose you asked the following question to a UR-optimiser" but what does it even mean to ask a question to a UR-optimiser? A UR-optimiser presumably is the type of thing that finds left/right policies (in his example) that optimise UR. How is a UR-optimiser the type of thing that answers questions at all (other than questions about which policy performs better on UR)? His question begins "Suppose that a robot is uncertain between UR and one other reward function, which is mutually exclusive with UR..." but that just does not seem like the kind of question that ought to be posed to a UR-optimiser.
I guess you could rephrase it as "suppose a UR optimizer had a button which randomly caused an agent to be a UR optimizer" or something along those lines and have similar results.
Do you mean that as a way to understand what Stuart is talking about when he says that a UR-optimiser would answer questions in a certain way?
Yeah, instead of asking it a question, we can just see what happens when we put it in a world where it can influence another robot going left or right. Set it up the right way, and Stuart's argument should go through.
Thanks, great post.
I may be misunderstanding, but wouldn't these techniques fall more under the heading of capabilities rather than under alignment? These are tactics that should increase a company's effectiveness in general, for most reasonable mission statements or products the company could have.
I was thinking of the incentive structure of a company (to focus on one example) as an affordance for aligning a company with a particular goal because if you set the incentive structure up right then you don’t have to keep track of everything that everyone does within the company, you can just (if you do it well) trust that the net effect of all those actions will optimize something that you want it to optimize (much like steering via the goals of an AI or steering via the taxes and regulations of a market).
But I think actually you are pointing to a very important way that alignment generally requires clarity, and clarity generally increases capabilities. This is present also in AI development: if we gained the insight necessary to build a very clear consequentialist AI that we knew how to align, we would simultaneously increase capabilities due to the same clarity.
Interested in your thoughts.
Gotcha. I definitely agree with what you're saying about the effectiveness of incentive structures. And to be clear, I also agree that some of the affordances in the quote reasonably fall under "alignment": e.g., if you explicitly set a specific mission statement, that's a good tactic for aligning your organization around that specific mission statement.
But some of the other affordances aren't as clearly goal-dependent. For example, iterating quickly is an instrumentally effective strategy across a pretty broad set of goals a company might have. That (in my view) makes it closer to a capability technique than to an alignment technique. i.e., you could imagine a scenario where I succeeded in building a company that iterated quickly, but I failed to also align it around the mission statement I wanted it to have. In this scenario, my company was capable, but it wasn't aligned with the goal I wanted.
Of course, this is a spectrum. Even setting a specific mission statement is an instrumentally effective strategy across all the goals that are plausible interpretations of that mission statement. And most real mission statements don't admit a unique interpretation. So you could also argue that setting a mission statement increases the company's capability to accomplish goals that are consistent with any interpretation of it. But as a heuristic, I tend to think of a capability as something that lowers the cost to the system of accomplishing any goal (averaged across the system's goal-space with a reasonable prior). Whereas I tend to think of alignment as something that increases the relative cost to the system of accomplishing classes of goals that the operator doesn't want.
I'd be interested to hear whether you have a different mental model of the difference, and if so, what it is. It's definitely possible I've missed something here, since I'm really just describing an intuition.
Yes, I think what you're saying is that there is (1) the set of all possible outcomes, (2) within that, the set of outcomes where the company succeeds with respect to any goal, and (3) within that, the set of outcomes where the company succeeds with respect to the operator's goal. The capability-increasing interventions, then, are things that concentrate probability mass onto (2), whereas the alignment-increasing interventions are things that concentrate probability mass onto (3). This is a very interesting way to say it and I think it explains why there is a spectrum from alignment to capabilities.
Very roughly, (1) corresponds to any system whatsoever, (2) corresponds to a system that is generally powerful, and (3) corresponds to a system that is powerful and aligned. We are not so worried about non-powerful unaligned systems, and we are not worried at all about powerful aligned systems. We are worried about the awkward middle ground - powerful unaligned systems.
Yep, I'd say I intuitively agree with all of that, though I'd add that if you want to specify the set of "outcomes" differently from the set of "goals", then that must mean you're implicitly defining a mapping from outcomes to goals. One analogy could be that an outcome is like a thermodynamic microstate (in the sense that it's a complete description of all the features of the universe) while a goal is like a thermodynamic macrostate (in the sense that it's a complete description of the features of the universe that the system can perceive).
This mapping from outcomes to goals won't be injective for any real embedded system. But in the unrealistic limit where your system is so capable that it has a "perfect ontology" — i.e., its perception apparatus can resolve every outcome / microstate from any other — then this mapping converges to the identity function, and the system's set of possible goals converges to its set of possible outcomes. (This is the dualistic case, e.g., AIXI and such. But plausibly, we also should expect a self-improving system to improve its own perception apparatus such that its effective goal-set becomes finer and finer with each improvement cycle. So even this partition over goals can't be treated as constant in the general case.)
Ah so I think what you're saying is that for a given outcome, we can ask whether there is a goal we can give to the system such that it steers towards that outcome. Then, as a system becomes more powerful, the range of outcomes that it can steer towards expands. That seems very reasonable to me, though the question that strikes me as most interesting is: what can be said about the internal structure of physical objects that have power in this sense?
I’d ask the question whether things typically are aligned or not. There’s a good argument that many systems are not aligned. Ecosystems, society, companies, families, etc all often have very unaligned agents. AI alignment, as you pointed out, is a higher stakes game.
Just out of interest, how exactly would you ask that question?
Certainly. This is a big issue in our time. Something needs to be done or things may really go off the rails.
Indeed. Is there anything that can be done?
It is a very high-stakes game. How might we proceed?