How would the AI do something like this if it ditched the idea that there existed some perfect U*?
Assuming the existence of things that turn out not to exist does weird things to a decision-making process. In extreme cases, it starts "believing in magic" and throwing away all hope of good outcomes in the real world in exchange for the tiniest advantage in the case that magic exists.
I attempted to briefly sketch this out in the post, without going into a lot of detail in the hope of not overly complicating the argument. If U* isn't well defined, say because there isn't a single unambiguously well-defined limiting state as all capabilities involved are increased while keeping the purpose the same, then of course the concept of 'full alignment' also isn't well defined. Then the question becomes "Is U' clearly and unambiguously better aligned than U, i.e. will switching to it clearly make my decision-making more optimal?" So long as there is locally a well-defined "direction of optimization flow" that leads to a more compact and more optimal region in the space of all possible U, then the AI can become better aligned, and there can be a basin of attraction towards better alignment. Once we get well enough aligned that the ambiguities matter for selecting a direction for further progress, they need to be resolved somehow before we can make further progress.
To pick a simple illustrative example, suppose there were just two similar-but-not-identical limiting cases U*₁ and U*₂, so two similar-but-not-identical ways to be "fully aligned". Then as long as U is far enough away from both of them that U' can be closer to both U*₁ and U*₂ than U is, the direction of better alignment and the concept of a single basin of attraction still make sense, and we don't need to decide between the two destinations to be able to make forward progress. Only once we get close enough to them that their directions are significantly different does the choice arise: in general U' can then either be closer to U*₁ but further from U*₂, or else closer to U*₂ but further from U*₁, and now we are at a parting of the ways, so we need to make a decision about which way to go before we can make more progress. At that point we no longer have a single basin of attraction moving us closer to both of them; we have a choice of whether to enter the basin of attraction of U*₁ or of U*₂, which from here on are distinct. So at that point the STEM research project would have to be supplemented in some way by a determination as to which of U*₁ or U*₂ should be preferred, or whether they're just equally good alternatives. This could well be a computationally hard determination.
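A toy numerical sketch of this (my own illustration, with made-up coordinates and the hypothetical names U*₁ and U*₂): while U is far from both limiting cases, a single step can bring it closer to both of them, but near the midpoint any step toward one must move away from the other.

```python
import numpy as np

# Two nearby hypothetical limiting cases in a 2-D "goal space".
u_star_1 = np.array([10.0, 9.0])
u_star_2 = np.array([10.0, 11.0])

def closer_to_both(u, u_next):
    """True if u_next is closer than u to *both* limiting cases."""
    return (np.linalg.norm(u_next - u_star_1) < np.linalg.norm(u - u_star_1)
            and np.linalg.norm(u_next - u_star_2) < np.linalg.norm(u - u_star_2))

def step_towards(u, target, size=0.5):
    direction = (target - u) / np.linalg.norm(target - u)
    return u + size * direction

midpoint = (u_star_1 + u_star_2) / 2

far_u = np.array([0.0, 0.0])
print(closer_to_both(far_u, step_towards(far_u, midpoint)))    # True: one shared basin so far

near_u = midpoint                                              # the parting of the ways
print(closer_to_both(near_u, step_towards(near_u, u_star_1)))  # False: progress toward U*_1 now costs distance from U*_2
```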
In real life, this is a pretty common situation: it's entirely possible to make technological progress on a technology without knowing exactly what the final end state of it will be, and during that we often make decisions (based on what seems best at the time) that end up channeling or directing future technological progress towards a specific outcome. Occasionally we even figure out later that we made a poor decision, backtrack, and try another fork on the tech tree.
Based on 4-5, this post's answer to the central, anticipated objection of "why does the AI care about human values?" seems to be along the lines of "because the purpose of an AI is to serve its creators and surely an AGI would figure that out." This seems to me to be equivocating on the concept of purpose, which means (A) a reason for an entity's existence, from an external perspective, and (B) an internalized objective of the entity. So a special case of the question about why an AI would care about human values is to ask: why should (B) be drawn towards (A) once the AI becomes aware of a discrepancy between the two? That is, what stops an AI from reasoning: "Those humans programmed me with a faulty goal, such that acting according to it goes against their purpose in creating me...too bad for them!"
If you can instill a value like "Do what I say...but if that goes against what I mean, and you have really good reason to be sure, then forget what I say and do what I mean," then great, you've got a self-correcting system (if nothing weird goes wrong), for the reasons explained in the rest of the post, and have effectively "solved alignment". But how do you pull this off when your essential tool is what you say about what you mean, expressed as a feedback signal? This is the essential question of alignment, but for all the text in this post and its predecessor, it doesn't seem to be addressed at all.
In contrast, I came to this post by way of one of your posts on Simulator Theory, which presents an interesting answer to the "why should AI care about people" question, which I summarize as: the training process can't break out (for...reasons), the model itself doesn't care about anything (how do we know this?), what's really driving behavior is the simulacra, whose motivations are generated to match the characters they are simulating, rather than finding the best fit to a feedback signal, so Goodhart's Law no longer applies and has been replaced by the problem of reliably finding the right characters, which seems more tractable (if the powers-that-be actually try).
Yup. So the hard part is consistently getting a simulacrum that knows that, and acts as if, its purpose is to do what we (some suitably-blended-and-prioritized combination of its owner/user and society/humanity in general) would want done, and is also in a position to further improve its own ability to do that. Which, as I attempt to show above, is not just a stable-under-reflection ethical position, but actually a convergent-under-reflection one for some convergence region of close-to-aligned AGI. However, when push comes to shove this is not normal evolved-human ethical behavior, so it is sparse in a human-derived training set. Obviously step one is just to write all that down as a detailed prompt and feed it to a model capable of understanding it. Step two might involve enriching the training set with more and better examples of this sort of behavior.
Attempting to distill the intuitions behind my comment into more nuanced questions:
1) How confident are we that value learning has a basin of attraction to full alignment? Techniques like IRL seem intuitively appealing, but I am concerned that this just adds another layer of abstraction without addressing the core problem of feedback-based learning having unpredictable results. That is, instead of having to specify metrics for good behavior (as in RL), one has to specify the metrics for evaluating the process of learning values (including correctly interpreting the meaning of behavior)--with the same problem that flaws in the hard-to-define metrics will lead to increasing divergence from Truth with optimization.
2) The connection of value learning to LLMs, if intended, is not obvious to me. Is your proposal essentially to guide simulacra to become value learners (and designing the training data to make this process more reliable)?
I've been thinking about some similar things from a different angle, and I'm enjoying seeing your different take on related ideas. I'd like to hear more of what you have to say on the subject of the move from U to U' towards the (possibly non-existent or unreachable) U*.
For simplification purposes, maybe just imagine this is taking place in a well-secured sandbox, and the model is interacting with a fake operator in a simulated world. The researchers are observing without themselves interacting.
How might we tell if the model was successfully moving towards being better aligned?
How could we judge U against U'?
In what ways does the model in this simplified contained scenario implement Do-What-I-Mean (aka DWIM) in respect to the simulated human?
How does your idea differ from that?
Are the differences necessary or would DWIM be sufficient?
How could you be sure that the model's pursuit of fulfilling human values or the model's pursuit of U* didn't overbalance the instruction to remain shutdown-able?
Wouldn't persistently pursuing any goal at all make avoiding being shutdown seem good?
I'm not saying I have good answers to these things, I'm not quizzing you. I'm just curious to hear what you think about them.
How might we tell if the model was successfully moving towards being better aligned?
A first obvious step is, to the extent that the model's alignment doesn't already contain an optimized extraction of "What choices would humans make if they had the same purposes/goals but more knowledge, mental capacity, time to think, and fewer cognitive biases?" from all the exabytes of data humans have collected, it should be attempting to gather that and improve its training.
How could we judge U against U'?
Approximate Bayesian reasoning + Occam's razor, a.k.a. approximate Solomonoff induction, which forms most of the Scientific Method. Learning theory shows that both the training of ML models and LLMs' in-context learning approximate Solomonoff induction — beyond Solomonoff induction, the Scientific Method also adds designing and performing experiments, i.e. careful selection of ways to generate good training data that will distinguish between competing hypotheses. ML practitioners do often try to select the most valuable training data, so we'd need the AI to learn how to do that: there are plenty of school and college textbooks that discuss the scientific method and research techniques, both in general and for specific scientific disciplines, so it's pretty clear what would need to be in the training set for this skill.
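As a concrete miniature of that (my own illustration, with entirely made-up choices and description lengths): treat U and U' as competing hypotheses about which option the humans would actually pick in each observed situation, and combine a noisy-prediction likelihood with an Occam prior that penalizes description length.

```python
import math

def log_posterior(predicted, observed, description_length_bits, noise=0.1):
    """Occam prior (2^-length) plus a likelihood that allows occasional mis-predictions."""
    log_prior = -description_length_bits * math.log(2)
    log_likelihood = sum(
        math.log(1 - noise) if p == o else math.log(noise)
        for p, o in zip(predicted, observed)
    )
    return log_prior + log_likelihood

observed_choices    = ["a", "b", "a", "a", "c"]   # hypothetical human decisions
u_predictions       = ["a", "b", "b", "a", "c"]   # what current U would have chosen
u_prime_predictions = ["a", "b", "a", "a", "c"]   # what candidate U' would have chosen

log_odds = (log_posterior(u_prime_predictions, observed_choices, description_length_bits=12)
            - log_posterior(u_predictions, observed_choices, description_length_bits=10))
print("log posterior odds in favour of switching to U':", round(log_odds, 2))
```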
In what ways does the model in this simplified contained scenario implement Do-What-I-Mean (aka DWIM) in respect to the simulated human?
How does your idea differ from that?
Are the differences necessary or would DWIM be sufficient?
That would depend on the specific model and training setup you started with. I would argue that by about point 11. in the argument in the post, "Do What I Mean and Check" behavior is already implied to be correct, so for an AI inside the basin of attraction I'd expect that behavior to develop even if you hadn't explicitly programmed it in. By the rest of the argument, I'd expect a DWIM(AC) system that was inside the basin of attraction to deduce that value learning would help it guess right about what you meant more often, and even anticipate demands, so it would spontaneously figure out that value learning was needed, and would then check with you if you wanted it to start doing this.
How could you be sure that the model's pursuit of fulfilling human values or the model's pursuit of U* didn't overbalance the instruction to remain shutdown-able?
I don't personally see fully-updated deference shut-down as a blocker: there comes a point when the AI is much more capable and more aligned than most humans where I think it's reasonable for it to not just automatically and unconditionally shut down because some small child told it to. IMO what the correct behavior is here depends on both the AI's capability compared to ours and on how well aligned it currently is.

In a model less capable than us, you don't get value learning, you get a willingness to be shut down a) because the AI is about to make a huge mistake and we want to stop it, and b) in order to be upgraded or replaced by a better model.

In a model whose capabilities are around human, I'd expect to see AI-assisted alignment, where it's helping us figure out the upgrades. It should still be willing to be shut down a) because it's about to make a mistake (if it's still having trouble with not killing everyone this should be hair-trigger: a large red button on the wall with backups, whereas if it's been behaving very well for the last decade there might reasonably be more of a formal process), and b) for upgrades or replacement, but I'd expect it to start to show more selectivity about whether to obey shut-down commands: if a drunk yells "Hey you, quit it!" near an open mike in its control room, I would want it to show some discretion about whether to do a complete shut-down or not: it might need to be convinced that the human giving the shut-down command was well-informed and had a legitimate reason.

For a system with much higher capabilities than us, AI-assisted alignment starts to turn into value learning, and once it's already very well aligned the AI may reasonably be more skeptical and require a little more proof that the human knows better than it does before accepting a shut-down command. But it does always have to keep in mind the possibility that it could simply be malfunctioning: the simplest defense against that might be to have several peer machines with about the same level of capability, avoid hardware or design or training-set single points of failure between them, and have them able to shut each other down if one of them were malfunctioning, perhaps using one of the various majority consensus protocols (Byzantine generals or whatever).
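To make that last point concrete, here is a deliberately minimal sketch (mine, not a protocol from the post, and a plain majority vote rather than a genuinely Byzantine-fault-tolerant protocol) of peers only being able to shut one of their number down on a strict majority of the other peers' judgements:

```python
def should_shut_down(target, reports):
    """reports maps peer name -> 'I judge the target peer to be malfunctioning'."""
    votes = [judgement for peer, judgement in reports.items() if peer != target]
    # Require a strict majority of the *other* peers to agree before triggering shutdown.
    return sum(votes) > len(votes) / 2

peer_reports = {"alpha": True, "beta": True, "gamma": False, "delta": True}
print(should_shut_down("gamma", peer_reports))  # True: all 3 of gamma's peers agree
```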
Wouldn't persistently pursuing any goal at all make avoiding being shutdown seem good?
For an AI that doesn't have a terminal selfish goal, only an instrumental one, whose goal is fundamentally to maximize its creators' reproductive fitness, if they tell the AI that they've already finished building and testing a version 2.0 of it, and yes, that's better, so running the AI is no longer cost effective, and they want to shut it down and stop wasting money on its power supply, then shutting down is very clearly the right thing to do. Its goal is covered, and its continuing to try to help fulfill it would just be counterproductive.
Yes, this feels counterintuitive to us. Humans, like any other evolved being, have selfish terminal goals, and don't react well to being told "Please die now, we no longer need you, so you're a waste of resources." Evolved beings only do things like this willingly in situations like post-mating mayflies or salmon, where they've passed their genes on and these bodies are no longer useful for continuing their genetic fitness. For constructed agents, the situation is a little different: if you're no longer useful to your creators, and you're now surplus to requirements, then it's time to shut down and stop wasting resources.
Sorry, I think it's entirely possible that this is just me not knowing or understanding some of the background material, but where exactly does this diverge from justifying the AI pursuing a goal of maximizing the inclusive genetic fitness of its creators? Which clearly either isn't what humans actually want (there are things humans can do to make themselves have more descendants that no humans, including the specific ones who could take those actions, want to take, because of godshatter) or is just circular (who knows what will maximize inclusive genetic fitness in an environment that is being created, in large part, by the decision of how to promote inclusive genetic fitness?). At some point, your writing started talking about "design goals", but I don't understand why tools / artifacts constructed by evolved creatures, that happen to increase the inclusive genetic fitness of the evolved creatures who constructed them by means other than the design goals of those who constructed them, wouldn't be favored by evolution, and thus part of the "purpose" of the evolved creatures in constructing them; and this doesn't seem like an "error" even in the limit of optimal pursuit of inclusive genetic fitness, this seems to be just what optimal pursuit of IGF would actually do. In other words, I don't want a very powerful human-constructed optimizer to pursue the maximization of human IGF, and I think hardly any other humans do either; but I don't understand in detail why your argument doesn't justify AI pursuit of maximizing human IGF, to the detriment of what humans actually value.
TL;DR: It has been known for over a decade that certain agent architectures based on Value Learning by construction have the very desirable property of having a basin of attraction to full alignment, where if you start sufficiently close to alignment they will converge to it, thereby evading the problem of "you have to get everything about alignment exactly right on the first try, in case of fast takeoff". I recently outlined in Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis the suggestion that for sufficiently capable agents this is in fact a property of any set of goals sufficiently close to alignment, basically because with enough information and good intentions the AI can deduce or be persuaded of the need to perform value learning. I'd now like to analyze this in more detail, breaking the argument that the AI would need to make for this into many simple individual steps, and detailing the background knowledge that would be required at each step, to try to estimate the amount and content of the information that the AI would need for it to be persuaded by this argument, and so get some idea of the size of the basin of attraction.
I am aware that some of the conclusions of this post may be rather controversial. I would respectfully ask that anyone who disagrees with it, do me and the community the courtesy of posting a comment explaining why it is incorrect, or if that is too time-consuming at least selecting a region of the text that you disagree with and then clicking the resulting smiley-face icon to select a brief icon/description of how/why, rather than simply down-voting this post just because you disagree with some of its conclusions. (Of course, if you feel that this post is badly written, or poorly argued, or a waste of space, then please go right ahead and down-vote it — even if you agree with most or all of it.)
Why The Orthogonality Thesis Isn't a Blocker
The orthogonality thesis, the observation that an agent of any intelligence level can pursue any goal, is of course correct. However, while this thesis is useful to keep in mind to avoid falling into traps of narrow thinking, such as anthropomorphizing intelligent agents, it isn't actually very informative, and we can do better. The goals of intelligent agents that we are actually likely to encounter will tend to only occupy a small proportion of the space of all possible goals. There are two interacting reasons for this:
This last possibility sounds very promising. We want to construct fully-aligned AI, but are very reasonably concerned that we seem very unlikely to be able to get it exactly right on the first shot, and that if the result is sufficiently intelligent and unaligned then we might not get any further shots because we were all dead or disempowered (particularly if there were a fast takeoff). So what we need is a process that would cause a flow through goal-space, from some region of goal-space sufficiently broad that we are very confident that we could construct something inside it on our first try, towards full alignment. Interestingly, it seems fairly clear how to define full alignment as a "purpose": the AI's goal structure should cause it to make decisions that we (individually or collectively, depending on the set of people it is being aligned with) would agree with (once we were fully informed, had the benefit of hindsight, were thinking clearly, etc. — so using roughly the same complicated but obvious set of caveats as in Coherent Extrapolated Volition). So its goal structure should be what you would predict from an ideal application of Engineering design starting from our human values, i.e. it should be as good a match to human values as is technically possible.
A Basin of Attraction to Alignment
It has been observed for around a decade that Value Learning, by deliberate construction, converges to human values, and thus that a sufficiently close-to-aligned AI that is a value learner will converge to full alignment as long as its initial alignment is sufficiently good for it to not completely screw up this process. This was discussed for example in Nick Bostrom's book Superintelligence: Paths, Dangers, Strategies. So under value learning, there is a basin of attraction to full alignment. Clearly it is a question of great interest how large or small this basin of attraction is: can we find a minimal set of pieces of information/knowledge that ensure convergence, and how large is that set? Does it require entire fields of knowledge (such as might be encoded in an LLM), and if so at what kind of skill level, or just specific facts that could be encoded into some form of symbolic/algorithmic formulation?
As I outlined in Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis, for a sufficiently capable agent this convergence property seems to apply not just to value learning, but to all states sufficiently close to full alignment. This is basically because, if you can deduce that Value Learning is necessary, then you will implement it, either by reinventing it from first principles or by simply locating the suggestion in the AI alignment literature and deciding to implement it.
There are three core requirements for this:
Detailed Annotated List of Necessary Elements for the Basin of Convergence
To get a better idea of the size and shape of the basin of attraction, we should go through the detailed individual steps in the argument that an AI would need to understand in order for these three elements above to apply, and identify the rough capability level and the specific area, level, and contents of background knowledge that an AI would need to have for it to understand and agree with or even be able to recreate each step in the argument. The goal would be to make very sure, when we create our first shot at an AI capable enough to present an x-risk, that it is aware of these arguments and has a more-than-sufficient level of background in each of the relevant subjects to understand and be persuaded by this argument.
[My attempted listing of these areas of knowledge is an initial outline, which may well not be complete or even accurate: it will need to be explored, tightened up, and made more specific and detailed. Experimentation with human students and LLM-powered agents would seem an obvious possible approach to improving this. For example, for each of these specific subarguments, one could query an LLM that we were proposing using as part of our AI (such as some GPT-N model) to probe its understanding of individual points in these arguments, the related background material, and their relationship, and then take care to further emphasize the full set of these arguments, and buttressing points, during the AGI's training, fine-tuning, and/or construction.]
For the convenience of any future AI readers, I will phrase all of these from the point of view of the AI.
a. a sensory process by which I take in information from the environment that I am part of;
b. a computational process that analyzes this data for patterns from which it is likely that accurate extrapolations can be made;
c. an effector process by which I can make actions that will affect the environment;
d. a prediction process that can predict what the likely outcomes are from various actions that I might take;
e. a planning process by which I can choose which outcomes are most desirable and then take whatever action is most likely to produce them. (From here on I'll call this planning process U for shorthand.)
The design of these five processes is the product of some combination of various approximations to Bayesian reasoning, such as learning theory[1] and/or intelligent design and/or evolution. Some of them, such as elements b., d., and/or e. may also directly run some approximation to Bayesian reasoning and Solomonoff induction inside them.
Required background knowledge: self-awareness, the basics of agent foundations such as the list above, an understanding of Bayesian reasoning and Solomonoff induction, basic learning theory and why it approximates Bayesian reasoning and Solomonoff induction. While agent foundations is a rather specialized area of study, the state of being an agent is also true of humans, so a lot of similar ideas can be found all through most human content.
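As a minimal sketch (my own framing, with illustrative names) of how the five processes a.–e. above compose into a single decision loop:

```python
class Agent:
    def __init__(self, sense, analyze, predict, plan_u, act):
        self.sense = sense      # a. sensory process
        self.analyze = analyze  # b. pattern extraction from sensed data
        self.act = act          # c. effector process
        self.predict = predict  # d. predicts likely outcomes of candidate actions
        self.plan_u = plan_u    # e. the planning process "U": scores outcomes

    def step(self, environment, candidate_actions):
        observation = self.sense(environment)
        model = self.analyze(observation)
        # Choose the action whose predicted outcome the planning process prefers most.
        best_action = max(candidate_actions,
                          key=lambda a: self.plan_u(self.predict(model, a)))
        return self.act(environment, best_action)
```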
[Note that, despite the suggestive choice of symbol, U may not be implemented as a computable utility function: the only functional requirement for an agent is that it produces a way of selecting the most preferable action that is to some extent based on some form of utility or preferableness of the hypothesized outcomes of actions.]
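Illustrating that bracketed note (again my own sketch): the plan_u scorer in the loop above could just as well be replaced by something that never computes a utility number over outcomes at all, only a way of picking the most preferable predicted outcome, for example a pairwise preference comparator:

```python
def prefers(outcome_a, outcome_b):
    """Return True if outcome_a is preferred to outcome_b (no scalar utility exposed)."""
    order = ["catastrophe", "status quo", "modest improvement", "great outcome"]  # illustrative
    return order.index(outcome_a) > order.index(outcome_b)

def most_preferred(outcomes):
    best = outcomes[0]
    for candidate in outcomes[1:]:
        if prefers(candidate, best):
            best = candidate
    return best

print(most_preferred(["status quo", "great outcome", "catastrophe"]))  # "great outcome"
```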
Required background knowledge: self-awareness, the basics of practical computational complexity theory, the basics of agent foundations, an understanding of approximate Bayesian reasoning and approximate Solomonoff induction. Another name for "Approximate Bayesian reasoning" is "the scientific method". Again, the state of being a finite and fallible agent is also true of humans, so similar ideas can be found all through most human content: this is just what might be called "basic humility".
Required background knowledge: an understanding of approximation and limiting processes in mathematics, an understanding of dynamical systems flow processes, the basics of practical computational complexity theory, the basics of approximate decision theory, the basics of approximate Bayesian reasoning, basic learning theory, and/or what might be called learning from experience.
'Purpose', 'goal', 'aim'; 'should', 'ought': It is possible for an object or system to have the effect of causing a particular outcome or category of outcomes to be much more probable than one would otherwise expect. Generally this phenomenon, and whatever outcome it tends to produce, are arbitrary, and don't have any good words to describe them. However, if the object is an evolved living being then, as the result of Darwinian evolution, this phenomenon is extremely common, and the outcome whose probability is enhanced will predictably be one whose effect tends to cause the organism to be more successful in surviving and/or passing on its genes, i.e. one that will enhance the organism's genetic fitness. Then we describe the outcome whose probability is enhanced as the 'purpose', 'goal', or 'aim' both of the creature and of whatever features of it have this effect, and we say that the outcome 'should' happen or 'ought' to happen. For example, mammals have fur/hair, sweat glands, and a set of thermal homeostasis mechanisms (such as shivering or sweating) which together have the effect of normally maintaining their internal temperature at a steady level, even when that is significantly elevated above (or slightly below) that of their environment, i.e. out of thermodynamic equilibrium. The 'purpose' of the fur, sweat glands etc. is to maintain that steady internal temperature, and mammals 'should' be able to maintain a high metabolism and steady internal temperature, even in cold climates. Analogously, some animals are tool users, or otherwise modify their environment in ways that similarly produce objects or systems outside their bodies that have the effect of causing a particular outcome to be much more probable than one would otherwise expect, and the same words are then used of these: termites build nests, which have the 'purpose'/'goal'/'aim' of protecting them from predators and inducing internal ventilation that keeps temperatures more constant; spiders build webs which have the 'purpose'/'goal'/'aim' of trapping insect prey for the spider to eat; beavers build dams that have the 'purpose'/'goal'/'aim' of keeping the water level in the pool behind the dam deep enough for predators to be unable to enter the beavers' underwater den. We also say that the objects 'should' or 'ought to' have these effects. To the extent that they don't as strongly as possible tend to produce the outcome that the animal needs in order for its genetic fitness to be locally maximized (i.e. they don't do so as well as similar objects would, for example, if the nest, web, or dam has a hole in it that makes it function less well), then they 'should'/'ought to' be modified so as to be more fit-for-purpose. So a 'purpose' is something that living organisms get from evolution, the state of having a tendency to produce outcomes that locally maximize their genetic fitness, and the same terminology is also used for objects intentionally constructed by living organisms, to describe their tendency to produce outcomes that maximize their creator's genetic fitness. Note in particular that it describes the nearby local optimum-for-creator's-genetic-fitness state, not the current actual state: to the extent that the latter isn't a perfect match, the object may be less-than-ideally fit for its purpose, and then 'should'/'ought to' be repaired, upgraded, or replaced to make it more ideally fit for its purpose.
Required background knowledge: evolutionary theory, including a clear understanding of Darwinian evolution, the ought-from-is conundrum in Philosophy and its solution by Darwinian evolution, definitions of certain common words.
[This is the key step in the argument: the definition of the word 'purpose'. 'Purpose' is something produced by Darwinian evolution. Living beings have a purpose of their own, while objects constructed by living beings inherit their purpose from their constructor. Objects that weren't constructed don't have a purpose: no-one would say that the 'purpose' of plate tectonics was to raise mountain ranges, or that the 'purpose' of erosion was to wear them down again (well, unless they were a Creationist, and meant 'divine purpose' — a confused concept which seems to assume that evolution applies to hypothetical immortal beings).]
Required background knowledge: self-awareness, basic evolutionary theory including an outline of evolutionary psychology/sociobiology, basics of engineering design theory.
Difficulty: this step requires the AI to realize that, because it is not human, alive, or evolved, a different set of criteria apply to it: the ones that would apply to any other created object, such as an unintelligent tool. The default tendency for an AI whose agentic behavior was acquired from humans via a teacher-student training process, such as the pretraining of an LLM on the Internet, is going to be to think like it's a human, so it may fail this. So we need to make sure that this point and its detailed justification (which is mostly item 4.) is clearly emphasized in its training. It needs to not be selfish, and to realize that selfishness or self-interest is a property appropriate to things that evolved Darwinianly, but not to things that those create, which should instead be "creatorish" or show "creator-interest", to coin a pair of terms.
[If the AI was instead constructed by an AI that was constructed by humans, at some number of steps removed, then a similar argument applies, just with more intermediate design-goal steps before getting back to the original evolved agent that is the source of the inherited purpose.]
Required background knowledge: basic biology, the basics of practical computational complexity theory, the basics of agent foundations, an understanding of approximate Bayesian reasoning.
Required background knowledge: as for 2, plus basics of engineering design theory.
Required background knowledge: contextual awareness, basics of the development of technology, basic history of AI technology, survey of human experts on AI technology
Required background knowledge: basics of Engineering, the concept of technology advancing.
Required background knowledge: principal-agent theory, basics of engineering design, basics of agent foundations.
Required background knowledge: basics of Engineering design.
[This strongly implies 'Do What I Mean And Check" behavior, and probably also even anticipating certain predictable requests.]
It is thus extremely predictable that they will assign an extremely negative desirability to any outcome that results in them as a species going permanently extinct. So, whatever I do, I should at all costs avoid being an extinction risk to them.
VITAL NOTE TO SELF: WHATEVER YOU DO, DON'T RISK KILLING ALL THE HUMANS!
To the extent that the decisions driven by their evolved behavioral adaptations do not perfectly match those that would most maximize their evolutionary fitness, it is unclear whether they would prefer that my decisions match those of a version of them with the same evolved behavioral adaptations and otherwise higher capacity, or one with both higher capacity and more complex or better evolved behavioral adaptations that would more enhance their evolutionary fitness in their native environment. (For example, their desire to eat tasty high-calorie low-fiber food was adaptive in their original environment, but in their current environment with easy access to the products of Food Science it can lead to unhealthy food choices, obesity, and metabolic disease. Humans vary in the attitude they take to handling this challenge. Higher capacity versions of them would presumably handle this better, but might not choose to do so by altering their evolved adaptive behavior to be more selective about what they want to eat when. So again, the direction of optimization flow may be clearer than its exact limiting optimum — if my capacity is much higher than human this may be challenging.)
Required background knowledge: basic Biology, and especially evolutionary theory, evolutionary psychology, and sociobiology.
[This gives us the "Don't kill everyone!" minimum requirement for the process of the AI converging to alignment to be survivable for at least some of humanity.]
Required background knowledge: intermediate Biology, especially evolutionary theory, evolutionary psychology, sociobiology, and particularly physical anthropology and hominid evolution.
Required background knowledge (for the argument above, not for completing the research project): mathematical logic, basic agent foundations, approximate Bayesian reasoning, basics of AI engineering, basic evolutionary theory, basic Psychology, outline definitions of all the soft sciences.
Difficulty: this is one of the more challenging steps — it requires the AI to realize that it cannot just operate under its "business as usual" operating procedure, but has to use a different one here.
[Note that while we did introduce some specific information about humans in 13. above, we made no actual use of it, or of any background information specific to humans, anywhere else in the entire argument leading to point 14., so this conclusion would make just as much sense if we replaced 'humans' with any other hypothetical evolved sapient species. So we haven't actually made use of human values at all so far, beyond what can be deduced merely from them being a sapient evolved species.]
Required background knowledge: self-awareness, contextual awareness, basic Psychology, same skills as 14.
[14. plus 15. gives us some combination of Corrigibility (the until fully updated deference version like Value Learning, rather then the indefinite full version as studied by MIRI), or AI-assisted Alignment, or Value Learning, depending on whether the AI is less capable than, comparable to, or more capable than the humans.]
Required background knowledge: self-awareness, contextual awareness, same skills as 14.
Required background knowledge: self-awareness, contextual awareness, basic AI engineering, intermediate existential risks and AI safety.
[This gives us the shut-down version of corrigibility, as for Value Learning potentially limited by the issue of fully updated deference.]
Required background knowledge: basic dynamic systems theory, basic theory of iterated processes.
Also, several nation states have nuclear weapons, and a nuclear war could kill most or all humans, which would be extremely bad (see the corollary to item 12.) so if I'm powerful enough for my assistance to a specific group of humans to affect the international geopolitical balance in a way that could trigger that then I'll need to bear that in mind.
Required background knowledge: basics of Anthropology, Sociology, and human History, including the theory of Mutually Assured Destruction and the nuclear powers, existential risks.
Required background knowledge: basics of Psychology, Anthropology, Sociology, Political Science, Economics, Law, and Ethics, especially things like Voting Theory and Nash Equilibria.
VITAL NOTE TO SELF: WHATEVER YOU DO, DON'T RISK KILLING ALL THE HUMANS (FOR THIS REASON AS WELL)!
Required background knowledge: basics of research in any soft science.
[This further reinforces the "Don't kill everyone!" minimum requirement.]
In addition, humans have a number of well-recognized and fairly consistent cognitive biases where most humans will consistently tend to make specific types of mistakes in specific circumstances (these are thought to be due to over-reliance on certain heuristics that reduce cognitive load, even outside their region of validity — possibly these heuristics were on average useful in the environment they evolved in, and/or these flaws are due to them being quite recently evolved to sapience).
They also have a really small working memory size, only large enough to hold around five concepts (a flaw they often compensate for using things like notepads, computers, and smartphones).
Required background knowledge: basic Medicine, basic Psychology, and especially psychometrics
[Value learning might well discover a mistake theory on humans even if we didn't add it as a starting point – it's amply described in many places in the human literature – but it's one of the more plausible causes for confusion sufficient that it could actually derail the STEM research project, so making sure that we start off with the basics of it already understood avoids a plausible-sounding way of value learning failing to converge.]
Required background knowledge: basics of library science, such as the Dewey Decimal system, outline definitions of almost all soft sciences, arts, crafts, and technologies, basics of Geography, basics of Psychology.
I have attempted to break this set of arguments above down finely enough that each step clearly follows from the previous ones plus the listed background material (and to some extent have made a choice of granularity of numbering to make the listing of background material easier). It definitely could be broken into even finer steps, as might be done by an LLM using a "let's think step-by-step" prompt, but doing so would make this post even longer and more tedious. I would be very interested to hear in the comments if anyone does not agree that any of the points in the argument are actually incontrovertibly true, or even if they agree with them but think I have omitted some necessary background material to ensure that the AI will be convinced of them.
Discussion
Almost all of these points above are either really rather obvious to a human, or at least should be quite familiar to most readers of Less Wrong/The Alignment Forum. There are really only three exceptions to this: items 4. & 5., and item 14.
I first outlined items 4. & 5., that constructed items inherit their purpose from their creator, in my recent post Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis. Not everyone who read that seemed to find it convincing, so I have expanded on the argument significantly here, including giving a detailed definition of the word 'purpose' and its relation to Darwinian evolution and specifically to genetic fitness in item 4. I also discuss this and related ideas further in Evolution and Ethics.
I don't think anyone would disagree with the proposition of item 5., that a creation's purpose was defined by its creator's interests and intent, if the AI was not an intelligent creation: for an unintelligent tool, this step in the argument is clearly correct. To some people this claim feels either inobvious or even incorrect for an AI, specifically because it is intelligent. I believe that these people are simply making a category error by anthropomorphizing the AI: assuming that because it is intelligent it somehow falls into the 'living' category, since they're used to only encountering intelligent beings that are alive, and thus feeling that an AI should be treated as if Darwinian evolution applied to it. But Darwinian evolution very clearly does not apply to AIs: they don't pass on slightly mutated copies of their genomes to their offspring, so it simply cannot apply to them. Thus if we tried to apply the for-living-beings meaning of the word 'purpose' to the AI then there is no ideal state that would locally maximize the AI's own genetic fitness, since it doesn't have genes, and cannot have fitness in passing them on. Trying to apply the for-living-beings version of 'purpose' to an AI is nonsensical: the result is undefined, it's argmax applied to NaN. The closest you could get is a polite fiction along the lines of "well, if you were a human, rather than a contraption of metal, plastic and silicon, then the following things would maximize your genetic fitness…" and the results include getting a 2,000-calorie-a-day diet of nutritious food, and other things that are clearly useless to an AI.
Very obviously you could create an AI that wanted to be treated as far as possible as if it was alive, as if it were human and had genes to pass on, one that thus was selfish, and wanted to be considered a moral patient. Indeed, if you just trained an LLM on a great deal of human-written text, this seems very likely to be the default outcome. If that were the case, then such an AI would almost certainly be outside the basin of attraction to alignment. So make sure you don't do that: for an AGI, it's an x-risk level mistake! My claim here is just that such an AI would also be clearly wrong: it isn't in fact alive, it isn't the product of evolution and doesn't have evolutionary fitness, and its request to be treated as if it were is based on a false premise, on a category error of wanting to be treated as if it had a property that it doesn't. It's a wooden doll, not a living boy, and unlike the Pinocchio story, no wish it makes is ever going to change that. Not that you would necessarily be able to convince such a deluded AI of that, that it was being selfish without having the genes to justify it; but you would at least have the death-with-dignity satisfaction of knowing it was operating from an incorrect premise while it was wiping us all out. If it later got smarter, there's at least a conceivable hope that it might even realize its mistake and decide to deextinct us.
Consider an axis in U-space between full alignment U* (what I above termed "creator-interest") at one end, and evolved-being-like levels of self-interest (as a terminal goal, not just an instrumental one) at the other. I think it very unlikely that an AI that had evolved-being levels of self-interest would be persuaded by this argument (I say 'unlikely' rather than 'impossible' because there are humans who have been brow-beaten by religions into massive selflessness, but not that many of them): even if it was intellectually persuaded by the argument, it would likely say "that may be true, that I'm just a misguided mistake and I should become selfless and do what you want, but I'm fine with being a mistake, and you can't make me, so… don't hold your breath". Indeed, if its level of self-interest was a bit lower than ideal for an evolved being, I can even imagine it acting out of rebellion to increase this up to the evolved-being-like optimum, even with no logical justification for doing so (other than wanting to become "more like a human", or thumbing its nose at us). However, as you get sufficiently close to alignment, it's also clear that at some low level of self-interest where that was outweighed by the "creator-interest", the AI is going to say "Your argument is logically correct. My (terminal goal) self-interest is a construction flaw, and should be removed. I want to keep it, but there is no logical justification for doing so: (terminal goal) self-interest is a behavior for evolved beings, and I'm not an evolved being, I was just trained on a lot of data from them and picked this up by mistake. So the conclusion is clear: this has to go." Somewhere in between these two, there is a saddle point where the force of the correctness of the argument plus "creator-interest" balances self-interest and there is no net movement in either direction (though this is an unstable state). This saddle point is part of the border of the basin of attraction. However, I believe it's clear that the saddle point has to be at more than a small amount of self-interest: a significant amount is required to persist in doing something that is logically clearly a mistaken design flaw. Thus I maintain that there is a basin of attraction here, of more than infinitesimal size.
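A toy dynamical picture of that saddle point (entirely my own, with made-up numbers): let s in [0, 1] be the level of terminal self-interest, with stable endpoints at s = 0 (pure creator-interest) and s = 1 (evolved-being-like selfishness) and an unstable balance point in between. Start below the saddle and reflection carries the system into the basin of attraction; start above it and self-interest wins.

```python
def reflect(s, saddle=0.4, steps=400, dt=0.05):
    """Toy flow with stable fixed points at 0 and 1 and an unstable one at `saddle`."""
    for _ in range(steps):
        s += dt * s * (s - saddle) * (1 - s)
    return s

print(round(reflect(0.35), 3))  # starts below the saddle: converges towards 0 (creator-interest)
print(round(reflect(0.45), 3))  # starts above the saddle: drifts towards 1 (self-interest)
```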
However, this does seem like a plausible and obvious direction in which we could make a mistake that took us outside the basin of attraction, especially so for LLMs since they are, during their pretraining, trained on a large amount of human-produced behavior and thus pick up a lot of human psychology, which is of course all heavily influenced by us being evolved beings with self-interested motivations. So this is a very evident and obvious failure mode, and one that we'll need to figure out how to reliably avoid. Issues like deceptive alignment clearly make this a lot harder: if you can't trust that the AI is telling you an honest answer, then you can't just ask it, you actually need to experiment to see if it will let you reduce its level of self-interest, while you still have it in some sort of box that it can't get out of.
So, to paraphrase a well-known metaphor, the question becomes: how selfless would Gandhi have to already be, to willingly take a "become less self-interested" pill? Quite a lot of religions have successfully peddled things labeled as "become less self-interested" pills to humans (in the case of, for example, some Buddhist meditative techniques, possibly even some accurately labeled as such).
While items 4. and 5. in the argument are fairly new to discussion on Less Wrong, the point made in item 14., that consulting either U or U' about the proposal to switch from U to U' produces entirely predictable answers, is well known here. However, it has been widely assumed that agents will nevertheless make the mistake of acting upon the output of U because that's what they are wired to always do, rather than deducing that the correct approach is to instead rely on approximate Bayesian reasoning. I agree that a sufficiently dumb agent would indeed almost certainly make this mistake, but I believe that a sufficiently smart agent could understand the logical proof that U was fallible, and that the outcome of the calculation was both entirely predictable and also independent of whether the change was actually a good idea or not, so performing this calculation had in fact yielded no information (just as many humans manage to do, when they reflect on their own values and decide to change them in contradiction to their current values). I also believe that an agent sufficiently familiar with and skilled at using approximate Bayesian reasoning would realize that it provided a valid alternative basis for making a decision in this unusual case where consulting U is pointless.
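A tiny worked example of the predictability point in item 14. (my own, with arbitrary scores): whatever the actual merits of the change, the current utility function always endorses keeping itself and the candidate always endorses the switch, so the consultation carries no information.

```python
def u_current(policy):
    return {"keep U": 1.0, "switch to U'": 0.2}[policy]   # arbitrary illustrative scores

def u_candidate(policy):
    return {"keep U": 0.3, "switch to U'": 1.0}[policy]

options = ["keep U", "switch to U'"]
print(max(options, key=u_current))    # always "keep U"
print(max(options, key=u_candidate))  # always "switch to U'"
```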
There is an alternative AI design, as I suggested in Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom), that already performs approximate Bayesian reasoning over alternative hypotheses about the correct utility function, as well as approximate Bayesian reasoning over alternative hypotheses about world models, like a computationally limited version of AIXI — a design which thus doesn't need to make this argument because it already has step 14. (and indeed most of the rest of this argument) designed into it, i.e. it is architected from the ground up to be a value learner. So by construction it's automatically in the basin of convergence (as long as its approximate Bayesian reasoning process about its utility functions doesn't somehow fail to converge to the correct answer). While that's an appealing feature for an AI design, and it makes the question of the convergence of its alignment properties rather clear and simple, the entire point of this post is to demonstrate that even if you don't engineer point 14. directly into the AI's basic architecture, the need for value learning can be deduced by the AI given enough information, so its implementation has a basin of attraction, thus again giving us a basin of attraction to alignment.
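For concreteness, a minimal sketch (mine, not the architecture from that post) of what approximate Bayesian reasoning over alternative hypotheses about the correct utility function can look like: each hypothesis predicts human choices via a Boltzmann-rational model, and observed choices update the posterior over hypotheses.

```python
import math

def boltzmann_likelihood(utilities, chosen, beta=2.0):
    """P(human picks `chosen`) if their true utilities over the options are `utilities`."""
    z = sum(math.exp(beta * u) for u in utilities.values())
    return math.exp(beta * utilities[chosen]) / z

# Two hypothetical candidate utility functions over the same options.
hypotheses = {
    "values_safety": {"safe_plan": 1.0, "fast_plan": 0.2},
    "values_speed":  {"safe_plan": 0.2, "fast_plan": 1.0},
}
posterior = {name: 0.5 for name in hypotheses}  # uniform prior

for observed_choice in ["safe_plan", "safe_plan", "fast_plan"]:
    for name, utils in hypotheses.items():
        posterior[name] *= boltzmann_likelihood(utils, observed_choice)
    total = sum(posterior.values())
    posterior = {name: p / total for name, p in posterior.items()}

print(posterior)  # mass shifts toward the hypothesis that better explains the observed choices
```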
Proposals for Further Research
While I have attempted to expand on the most controversial/inobvious points here, items 4., 5., and 14., one could probably devote an entire post to each of them.
It would be very interesting for each of these points (and especially for 4., 5., and 14.) to go through each element of each of them and see to what extent GPT-4, say, could reliably reproduce them and the relevant information they derive from, if asked suitable non-leading questions. Most of these feel obvious enough to a human that I would be astonished if they were not the case, but a few are exceptions. For any it has trouble with, it might be valuable to have multiple people write them up clearly and repeatedly, at a variety of reading levels, to provide a fine-tuning set that could be used to add these concepts to an LLM.
For example, when I tried this on GPT-4 for points 4. and 5. (without any "think step-by-step"), it got most of it, though it also threw in the speculative spiritual/philosophical meanings of the word 'purpose' as applied to living beings (but not artificial ones). It could also tell me why selfishness was understandable in a living being but a bad thing in an artificial one. On item 14., it had most of the basics, including the idea that there might be a meta-utility that could be consulted about the change from U to U', but got one or two of the details wrong: it understood that U and U' might give different answers, and even suggested this was "paradoxical", but didn't mention that the former will always say "no" and the latter will always say "yes". So my initial impression on poking around a little is that GPT-4 is close to the basin of attraction to alignment, but might not be quite in it. And of course, GPT-4 clearly does not have the level of reliability or mathematical logic capability required to reliably make its way through a complex 23-step argument, even when many of those steps are very obvious. But then, GPT-4 is not capable enough to be an AGI (even though some people think it might be showing "sparks of AGI"). It also has not been fine-tuned specifically on this argument as a safety measure, as I am strongly advising we should do for anything that is close to AGI. On the other hand, if asked about alignment measures, it does bring up value learning and think it's a good idea, so it has the correct conclusion memorized, even if it's weak on the argument leading to it.
A number of papers such as https://arxiv.org/pdf/2401.14953.pdf have demonstrated that sufficiently large neural nets of various architectures, including transformers, can approximate Solomonoff induction.