Thanks for writing this! Here are some of my rough thoughts and comments.
One of my big disagreements with this threat model is that it assumes it is hard to get an AGI to understand / successfully model 'human values'. I think this is obviously false. LLMs already have a very good understanding of 'human values' as they are expressed linguistically, and existing alignment techniques like RLHF/RLAIF seem to do a reasonably good job of making the models' output align with these values (specifically generic corporate wokeness for OpenAI/Anthropic) which does appear to generalise reasonably well to examples which are highly unlikely to have been seen in training (although it errs on the side of overzealousness of late in my experience). This isn't that surprising because such values do not have to be specified by the fine-tuning from scratch but should already be extremely well represented as concepts in the base model latent space and merely have to be given primacy. Things would be different, of course, if we wanted to align the LLMs to some truly arbitrary blue and orange morality not represented in the human text corpus, but naturally we don't.
Of course such values cannot easily be represented as some mathematical utility function, but I think this is an extremely hard problem in general verging on impossible -- since this is not the natural type of human values in the first place, which are naturally mostly linguistic constructs existing in the latent space and not in reality. This is not just a problem with human values but almost any kind of abstract goal you might want to give the AGI -- including things like 'maximise paperclips'. This is why almost certainly AGI will not be a direct utility maximiser but instead use a learnt utility function using latents from its own generative model, but in this case it can represent human values and indeed any goal expressible in natural language which of course it will understand.
On a related note this is also why I am not at all convinced by the supposed issues over indexicality. Having the requisite theory of mind to understand that different agents have different indexical needs should be table stakes to any serious AGI and indeed hardly any humans have issues with this, except for people trying to formalise it into math.
There is still a danger of over-optimisation, which is essentially a kind of overfitting and can be dealt with in a number of ways which are pretty standard now. In general terms, you would want the AI to represent its uncertainty over outcomes and over its utility approximator, and use this to derive a conservative rather than purely maximising policy which can be adjusted over time.
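A minimal sketch of the kind of conservative policy I have in mind, assuming the utility approximator is an ensemble of learned models and that disagreement between them is a usable proxy for uncertainty. The names `conservative_choice`, `caution` and the toy ensemble are purely illustrative, not any particular system's API:

```python
import statistics

def conservative_choice(actions, utility_models, caution=1.0):
    """Pick the action with the best pessimistic score: mean - caution * std.

    caution=0 recovers a pure maximiser of the mean utility estimate;
    larger values penalise actions the ensemble disagrees about.
    """
    def score(action):
        estimates = [model(action) for model in utility_models]
        mean = statistics.fmean(estimates)
        spread = statistics.pstdev(estimates)
        return mean - caution * spread
    return max(actions, key=score)

# Hypothetical ensemble: three utility approximators that agree about the
# "modest" action but disagree wildly about the "extreme" one.
ensemble = [
    lambda a: {"modest": 1.0, "extreme": 5.0}[a],
    lambda a: {"modest": 1.1, "extreme": -4.0}[a],
    lambda a: {"modest": 0.9, "extreme": 6.0}[a],
]

pure_maximiser = conservative_choice(["modest", "extreme"], ensemble, caution=0.0)
conservative = conservative_choice(["modest", "extreme"], ensemble, caution=1.0)
print(pure_maximiser, conservative)
```

With these made-up numbers the pure maximiser goes for the high-mean, high-disagreement action, while the conservative policy prefers the one the ensemble agrees on. The `caution` knob is the part that "can be adjusted over time".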
I broadly agree with you about agency and consequentialism being broadly useful and ultimately we won't just be creating short term myopic tool agents but fully long term consequentialists. I think the key thing here is just to understand that long term consequentialism has fundamental computational costs over short term consequentialism and much more challenging credit assignment dynamics so that it will only be used where it actually needs to be. Most systems will not be long term consequentialist because it is unnecessary for them.
I also think that breeding animals to do tasks or looking at humans subverting social institutions is not necessarily a good analogy to AI agents performing deception and treacherous turns. Evolution endowed humans and other animals with intrinsic selfish drives for survival and reproduction and arguably social deception which do not have to exist in AGIs. Moreover, we have substantially more control over AI cognition than evolution does over our cognition and gradient descent is fundamentally a more powerful optimiser which makes it challenging to produce deceptive agents. There is basically no evidence for deception occurring with current myopic AI systems and if it starts to occur with long term consequentialist agents it will be due to either a breakdown of credit assignment over long horizons (potentially due to being forced to use worse optimisers such as REINFORCE variants rather than pure BPTT) or the functional prior of such networks turning malign. Of course if we directly design AI agents via survival in some evolutionary sim or explicitly program in Omohundro drives then we will run directly into these problems again.
I'm defining "values" as what approximate expected utility optimizers in the human brain want. Maybe "wants" is a better word. People falsify their preferences and in those cases it seems more normative to go with internal optimizer preferences.
Re indexicality, this is a "the AI knows but does not care" issue; it's about specifying it, not about there being some AI module somewhere that "knows" it. If AGI were generated partially from humans understanding how to encode indexical goals that would be a different situation.
Re treacherous turns, I agreed that myopic agents don't have this issue to nearly the extent that long-term real-world optimizing agents do. It depends how the AGI is selected. If it's selected by "getting good performance according to a human evaluator in the real world" then at some capability level AGIs that "want" that will be selected more.
Why do you expect it to be hard to specify given a model that knows the information you're looking for? In general the core lesson of unsupervised learning is that often the best way to get pointers to something you have a limited specification for is to learn some other task that necessarily includes it then specialize to that subtask. Why should values be any different? Broadly, why should values be harder to get good pointers to than much more complicated real-world tasks?
How would you design a task that incentivizes a system to output its true estimates of human values? We don't have ground truth for human values, because they're mind states not behaviors.
Seems easier to create incentives for things like "wash dishes without breaking them", you can just tell.
I think I can just tell a lot of stuff wrt human values! How do you think children infer them? I think in order for human values to not be viable to point to extensionally (ie by looking at a bunch of examples) you have to make the case that they're much more built-in to the human brain than seems appropriate for a species that can produce both Jains and (Genghis Khan era) Mongols.
I'd also note that "incentivize" is probably giving a lot of the game away here - my guess is you can just pull them out much more directly by gathering a large dataset of human preferences and predicting judgements.
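To make "gather preferences and predict judgements" concrete, here is a toy sketch, assuming the dataset is a list of pairwise judgements and using a Bradley-Terry model as the predictor. The item names and data are invented for illustration, and Bradley-Terry is just one standard choice, not necessarily what a real system would use:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_bradley_terry(items, comparisons, steps=2000, lr=0.1):
    """Fit one latent score per item from (preferred, rejected) pairs.

    Gradient ascent on the Bradley-Terry log-likelihood, where
    P(a preferred over b) = sigmoid(score[a] - score[b]).
    """
    scores = {item: 0.0 for item in items}
    for _ in range(steps):
        grads = {item: 0.0 for item in items}
        for preferred, rejected in comparisons:
            p = sigmoid(scores[preferred] - scores[rejected])
            grads[preferred] += 1.0 - p
            grads[rejected] -= 1.0 - p
        for item in items:
            scores[item] += lr * grads[item] / len(comparisons)
    return scores

# Hypothetical human judgements; note "help" vs "harm" is never compared directly.
judgements = (
    [("help", "ignore")] * 3 + [("ignore", "help")] * 1
    + [("ignore", "harm")] * 3 + [("harm", "ignore")] * 1
)
scores = fit_bradley_terry(["help", "ignore", "harm"], judgements)

# Predicted probability that a human prefers "help" over "harm" (an unseen pair).
p_unseen = sigmoid(scores["help"] - scores["harm"])
print(round(p_unseen, 3))
```

The point of the toy is only that the fitted scores generalise to a comparison that never appears in the data. Whether the thing being predicted is what we actually care about is the separate question argued over in this thread.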
If you define "human values" as "what humans would say about their values across situations", then yes, predicting "human values" is a reasonable training objective. Those just aren't really what we "want" as agents, and agentic humans would have motives not to let the future be controlled by an AI optimizing for human approval.
That's also not how I defined human values, which is based on the assumption that the human brain contains one or more expected utility maximizers. It's possible that the objectives of these maximizers are affected by socialization, but they'll be less affected by socialization than verbal statements about values, because they're harder to fake so less affected by preference falsification.
Children learn some sense of what they're supposed to say about values, but have some pre-built sense of "what to do / aim for" that's affected by evopsych and so on. It seems like there's a huge semantic problem with talking about "values" in a way that's ambiguous between "in-built evopsych-ish motives" and "things learned from culture about what to endorse", but Yudkowsky writing on complexity of value is clearly talking about stuff affected by evopsych. I think it was a semantic error for the discourse to use the term "values" rather than "preferences".
In the section on subversion I made the case that terminal values make much more difference in subversive behavior than compliant behavior.
It seems like to get at the values of approximate utility maximizers located in the brain you would need something like Goal Inference as Inverse Planning rather than just predicting behavior.
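A rough sketch of the difference, assuming a noisily rational (Boltzmann) agent walking on a number line toward one of two candidate goals. The setup, the `beta` rationality parameter and the function names are all my own illustrative assumptions, not taken from the Goal Inference as Inverse Planning work itself:

```python
import math

def action_likelihood(state, action, goal, beta=2.0):
    """P(action | state, goal) for a Boltzmann-rational agent.

    Actions that end up closer to the goal are exponentially more likely.
    """
    def value(a):
        return -abs((state + a) - goal)
    weights = {a: math.exp(beta * value(a)) for a in (-1, 1)}
    return weights[action] / sum(weights.values())

def infer_goal(start, actions, goals):
    """Posterior over goals given observed actions, starting from a uniform prior."""
    log_post = {g: math.log(1.0 / len(goals)) for g in goals}
    state = start
    for a in actions:
        for g in goals:
            log_post[g] += math.log(action_likelihood(state, a, g))
        state += a
    top = max(log_post.values())
    unnorm = {g: math.exp(lp - top) for g, lp in log_post.items()}
    total = sum(unnorm.values())
    return {g: w / total for g, w in unnorm.items()}

# Observed behaviour: mostly steps right, with one "mistake" step left.
posterior = infer_goal(start=5, actions=[1, 1, -1, 1], goals=[0, 10])
print(posterior)
```

A pure behaviour predictor fit to this trajectory would say "steps right about three quarters of the time". The inverse planner instead explains the left step as noise and concentrates almost all of the posterior on the goal at 10, which is the kind of object (a goal, not a behaviour frequency) I mean by the optimizer's values.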
Children learn some sense of what they're supposed to say about values, but have some pre-built sense of "what to do / aim for" that's affected by evopsych and so on. It seems like there's a huge semantic problem with talking about "values" in a way that's ambiguous between "in-built evopsych-ish motives" and "things learned from culture about what to endorse", but Yudkowsky writing on complexity of value is clearly talking about stuff affected by evopsych. I think it was a semantic error for the discourse to use the term "values" rather than "preferences".
I think this is actually a crux here, in that I think Yudkowsky and the broader evopsych world were broadly incorrect about how complicated human values turned out to be, and way overestimated how much evolution was encoding priors and values in human brains. I think there was another related error, in underestimating how much data affects your goals and values, like this example:
That's also not how I defined human values, which is based on the assumption that the human brain contains one or more expected utility maximizers. It's possible that the objectives of these maximizers are affected by socialization, but they'll be less affected by socialization than verbal statements about values, because they're harder to fake so less affected by preference falsification.
I think that socialization will deeply affect the objectives of the expected utility maximizers, and I generally think that we shouldn't view socialization as training people to fake particular values, because I believe that data absolutely matters way more than evopsych and LWers thought, for both humans and AIs.
You mentioned you take evopsych as true in this post, so I'm not saying this is a bad post, in fact, it's an excellent distillation that points out the core assumption behind a lot of doom models, so I strongly upvoted, but I'm saying that this is almost certainly falsified for AIs, and probably also significantly false for humans too.
More generally, I'm skeptical of the assumption that all humans have similar or even not that different values, and dispute the assumptions of the psychological unity of humankind due to this.
Given this assumption, the human utility function(s) either do or don't significantly depend on human evolutionary history. I'm just going to assume they do for now. I realize there is some disagreement about how important evopsych is for describing human values versus the attractors of universal learning machines, but I'm going to go with the evopsych branch for now.
If ontology and indexicality are the two biggest problems with aligning a highly capable AGI (long-horizon consequentialist agent), another possible path forward is to create philosophically competent tool-like AI assistants to help solve these problems. And a potential source of optimism about alignment difficulty is that AI assistants (such as the ones OpenAI plans to build to do alignment research) might be philosophically competent by default (e.g., because the LLMs they are based on will have learned to do philosophical reasoning from their training data).
I personally think it's risky to rely on automated philosophical reasoning without first understanding the nature of philosophy and reasoning (i.e., without having solved metaphilosophy), and I have some reason to think that philosophical reasoning might be especially hard for ML to learn, but also think there's some substantial (>10%) chance that we could just get lucky on AIs being philosophically competent, or at least don't know how to rule this out. (In other words I don't see how to reach Eliezer's level of p(doom) through this line of argument.)
Have you thought about these questions, and also, do you have any general views about plans like OpenAI's, to use AI to help solve AI alignment?
I think use of AI tools could have similar results to human cognitive enhancement, which I expect to basically be helpful. They'll have more problems with things that are enhanced by stuff like "bigger brain size" rather than "faster thought" and "reducing entropic error rates / wisdom of the crowds" because they're trained on humans. One can in general expect more success on this sort of thing by having an idea of what problem is even being solved. There's a lot of stuff that happens in philosophy departments that isn't best explained by "solving the problem" (which is under-defined anyway) and could be explained by motives like "building connections", "getting funding", "being on the good side of powerful political coalitions", etc. So psychology/sociology of philosophy seems like an approach to understand what is even being done when humans say they're trying to solve philosophy problems.
Can you define what you mean by consequentialism? It's clearly dangerous to have a system with a fixed utility function over configurations of the world, but this is not necessary for an AGI, or necessary to be dangerous. Weaker notions like "picks thoughts in part based on real-world consequences" do not obviously lead to danger.
Something approximating utility function optimization over partial world configurations. What scope of world configuration space is optimized by effective systems depends on the scope of the task. For something like space exploration, the scope of the task is such that accomplishing it requires making trade-offs over a large sub-set of the world, and efficient ways of making these trade-offs are parametrized by utility function over this sub-set.
What time-scale and spatial scope the "pick thoughts in your head" optimization is over depends on what scope is necessary for solving the problem. Some problems like space exploration have a necessarily high time and space scope. Proving hard theorems has a smaller spatial scope (perhaps ~none) but a higher temporal scope. Although, to the extent the distribution over theorems to be proven depends on the real world, having a model of the world might help prove them better.
Depending on how the problem-solving system is found, it might be that the easily-findable systems that solve the problem distribution sufficiently well will not only model the world but care about it, because the general consequentialist algorithms that do planning cognition to solve the problem would also plan about the world. This of course depends on the method for finding problem-solving systems, but one could imagine doing hill climbing over ways of wiring together a number of modules that include optimization and world-modeling modules, and easily-findable configurations that solve the problem well might solve it by deploying general-purpose consequentialist optimization on the world model (as I said, many possible long-term goals lead to short-term compliant problem solving as an instrumental strategy).
Again, this is relatively speculative, and depends on the AI paradigm and problem formulation. It's probably less of a problem for ML-based systems because the cognition of an ML system is aggressively gradient descended to be effective at solving the problem distribution.
The problem is somewhat intensified in cases where the problem relates to already-existing long-term agents such as in the case of predicting or optimizing with respect to humans, because the system at some capability level would simulate the external long-term optimizer. However, it's unclear how much this would constitute creation of an agent with different goals from humans.
Big upvote for summarizing the main arguments in your own words. Getting clarity on the arguments for alignment difficulty is critical to picking a metastrategy for alignment.
I think most of the difficulties in understanding and codifying human values are irrelevant to the alignment problem as most of us understand it: getting good results from AGI.
I'm glad to see you recognize this in your section Alignment might not be required for real-world performance compatible with human values, but this is still hard and impacts performance.
I think it's highly unlikely that the first AGI will be launched with anything like CEV or human flourishing as its alignment target. The sensible and practical thing to do is make an AGI that wants to do what I want it to do. (Where "I" is the team making the AGI)
That value target postpones all of the hard issues in understanding and codifying human values, and most of the difficulties in resolving conflicts among different humans.
I recently wrote Corrigibility or DWIM is an attractive primary goal for AGI. The more I think about it, the more I think it's overwhelmingly attractive.
It is still hard to convey "do what this guy says and check with him if it's high impact or you're not sure what he meant" if you only have reinforcement signals to convey those concepts. But if you have natural language (as in aligning a language model agent), it's pretty easy and straightforward.
You don't need a precise understanding of your "DWIM and check", because you can keep tinkering with your instructions when your AGI asks you for clarification.
So I expect actual alignment attempts to follow that path, and thereby duck the vast majority of the difficulties you describe. This isn't to say the project will be easy, just that the challenges will fall elsewhere, in the more technical aspects of aligning a specific design of AGI.
If the AI is modeling the real world, then it might in some ways care about it
I am not convinced at all that this is true. Consider an AI whose training objective simply makes it want to model how the world works as well as possible, like a pure scientist which is not trying to acquire more knowledge via experiments but only reasons and explores explanatory hypotheses to build a distribution over theories of the observed data. It is agency and utilities or rewards that induce a preference over certain states of the world.
I do think this part is speculative. The degree of "inner alignment" to the training objective depends on the details.
Partly the degree to which "try to model the world well" leads to real-world agency depends on the details of this objective. For example, doing a scientific experiment would result in understanding the world better, and if there's RL training towards "better understand the world", that could propagate to intending to carry out experiments that increase understanding of the world, which is a real-world objective.
If, instead, the AI's dataset is fixed and it's trying to find a good compression of it, that's less directly a real-world objective. However, depending on the training objective, the AI might get a reward from thinking certain thoughts that would result in discovering something about how to compress the dataset better. This would be "consequentialism" at least within a limited, computational domain.
An overall reason for thinking it's at least uncertain whether AIs that model the world would care about it is that an AI that did care about the world would, as an instrumental goal, compliantly solve its training problems and some test problems (before it has the capacity for a treacherous turn). So, good short-term performance doesn't by itself say much about goal-directed behavior in generalizations.
The distribution of goals with respect to generalization, therefore, depends on things like which mind-designs are easier to find by the search/optimization algorithm. It seems pretty uncertain to me whether agents with general goals might be "simpler" than agents with task-specific goals (it probably depends on the task), therefore easier to find while getting ~equivalent performance. I do think that gradient descent is relatively more likely to find inner-aligned agents (with task-specific goals), because the internal parts are gradient descended towards task performance, it's not just a black box search.
Yudkowsky mentions evolution as an argument that inner alignment can't be assumed. I think there are quite a lot of dis-analogies between evolution and ML, but the general point that some training processes result in agents whose goals aren't aligned with the training objective holds. I think, in particular, supervised learning systems like LLMs are unlikely to exhibit this, as explained in the section on myopic agents.
Promoted to curated! I feel like there is a dearth of people trying to straightforwardly make the case for important high-level takes they have about AI Alignment, and I found this post quite readable and expect I'll link to it in the future. It also captured at least some of my beliefs pretty well, and I am glad to have a reference for some of these things.
I'm glad to see a post on alignment asking about the definition of human values. I propose the following conundrum. Let's suppose that humans, if asked, say they value a peaceful, stable society. I accept the assumption that the human mind contains one or more utility optimizers. I point out that the utility optimizers are likely to operate at individual, family, or local group levels, while the stated "value" has to do with society at large. So humans are not likely "optimizing" on the same scope as they "value".
This leads to game theory problems, such as the last turn problem, and the notorious instability of cooperation with respect to public goods (commons). According to the theory of cliodynamics put forward by Turchin et al., utility maximization by subsets of society leads to the implementation of wealth pumps that produce inequality, and to excess reproduction among elites, which leads to elite competition in a cyclic pattern. A historical database of over a hundred cycles from various parts of the world and history suggests every other cycle becomes violent or at least very destructive 90% of the time, and the will to reduce the number of elites and turn off the wealth pump occurs through elite cooperation less than 10% of the time.
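For readers who haven't seen the public goods problem spelled out, here is a toy sketch, assuming the textbook linear public goods game (each player keeps whatever they don't contribute, and the pot is multiplied and split equally). The endowment and multiplier are arbitrary illustrative numbers and have nothing to do with Turchin's data:

```python
def payoffs(contributions, endowment=10.0, multiplier=2.0):
    """Linear public goods game: the pot is multiplied, then shared equally."""
    share = multiplier * sum(contributions) / len(contributions)
    return [endowment - c + share for c in contributions]

everyone_cooperates = payoffs([10, 10, 10, 10])  # 20 each
everyone_defects = payoffs([0, 0, 0, 0])         # 10 each
one_free_rider = payoffs([0, 10, 10, 10])        # free rider 25, the rest 15

print(everyone_cooperates, everyone_defects, one_free_rider)
```

Whatever the others do, an individual optimizer does better by contributing nothing (25 beats 20 here), even though universal cooperation beats universal defection. That is the mismatch between the scope people "optimize" on and the scope they "value" on.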
I add the assumption that there is nothing special about humans, and any entities (AI or extraterrestrials) that align with the value goals and optimization scopes described above will produce similar results. Game theory mathematics does not say anything about the evolutionary history or take into account species preferences, after all, because it doesn't seem to need to. Even social insects, optimizing presumably on much larger, but still not global scopes, fall victim to large scale cyclic wars (I'm thinking of ants here).
So is alignment even a desirable goal? Perhaps we should ensure that AI does not aid the wealth pump and elite competition and the mobilization of the immiserated commoners (Turchin's terminology)? But it is the goal of many, perhaps most AI researchers to "make a lot of money" (witness the recent episode with Sam Altman and the support from OpenAI employees for his profit-oriented strategy, over the board's objection, as well as the fact that most competing entities developing AI are profit-oriented, and competing!). But some other goal (e.g. stabilization of society) might have wildly unpredictable results (stagnation comes to mind).
I'm assuming the relevant values are the optimizer ones not what people say. I discussed social institutions, including those encouraging people to endorse and optimize for common values, in the section on subversion.
Alignment with a human other than yourself could be a problem because people are to some degree selfish and, to a smaller degree, have different general principles/aesthetics about how things should be. So some sort of incentive optimization / social choice theory / etc might help. But at least there's significant overlap between different humans' values. Though there's a pretty big existing problem of people dying: the default was already that current people would be replaced by other people.
Game theory mathematics does not say anything about the evolutionary history or take into account species preferences, after all, because it doesn't seem to need to
Evo game theory is a thing and does not agree with this, I think? Though maybe I misunderstand. Evo GT still typically only involves experiments of the current simulated population.
I agree with RogerDearnaley "Briefly, humans are not aligned," to some percentage I am too afraid to put a number on.
My comments are not directed in general terms about humans, but about particular free-riders known as narcissists and psychopaths, who do a greater proportion of what are regarded as examples of bad behaviour, and about how we deal with them or fail to.
Narcissists and psychopaths cannot align with anything, they just take advantage or take cover from such possibilities. A lot of our values are in fact directed at dealing with this type of behaviour, while we do not readily acknowledge that such types directly seek to control the expression of those values in policing them (they love being in charge, they love status, they love hierarchies, they love being the cop, the concierge), such that we have a set of nested complexities playing out in the "solution space" of morals that we live in. Aligning LLMs with that as an example before us is dangerous. It is probably the danger.
Analogy: Values are much like vowels, constrained by physiology/[eco-nomics/ology] perhaps those linguists who study speech can produce a neutral schwa, but in each language and dialect the "neutral" schwa is perceived differently. Thus the vowels/values have instances that are not heard as such in another language circumstance, even if it is the same sound.
What is common is the urge to value 'things', a bias to should the world into social reality, in which values produce outcomes (religion/art/markets/society/vehicles of values expression and rite). Something should be done!
Narcissists and psychopaths (on a continuum I'll admit) have no access to those "priors" to inform their growth into community. They have no empathy and so little to no morality outside of following the rules of what they can get away with. Isomorphically mapping those rules/histories which result (as index to "values"), that we have created to deal with free-riders, and so map into alignment may/will produce perverse outcomes. The mechanical application of law is an example here.
Also we have created out of the hindsight of logic powerful logics of hindsight, but if our insight fails to perceive this conundrum, our Frankenstein's monsters may not thank us.
Especially where we fail to recognize that what may well be outcomes are causes. Our own dialects as the speech of god. This is doubly dangerous if there is no such thing.
Many of us have an urge to think like Kant, but the only moral imperative I can see common to humanity is to have this feeling to should, or that there should be such an imperative, everything else is an outcome of that urge within an incomplete empathic field of nurturing we call the world, (because there are constraints on survival -- bringing up children-- anything does not go), and which is produced/organised by this very urge to should things into doing.
How do we "align" with that?
LLMs are already aligned with the law in a way, and have wonderful capabilities to produce code, but the culpabilities are still ours.
But our understanding of our own autopoietic processes, growing into adult culpability, is not yet agreed on/understood, innerstood even! Especially where we do not police free-riders enough, and allow them influence into the process which polices them.
Actually the sovereign citizens are a good example of legalistic narcissism LLMs might produce. Except better networked.
So the maths don't matter until we nut that out. Or so I try to work on at the whyweshould blog.
Narcissists and psychopaths cannot align with anything, … Aligning LLMs with that as an example before us is dangerous. It is probably the danger.
Bear in mind that roughly 2%–4% of the population have narcissism/psychopathy/anti-social personality disorder, and only the lower-functioning psychopaths have a high chance of being in jail. So probably a few percent of the Internet was written by narcissists and psychopaths who were (generally) busy trying to conceal their nature from the rest of us. I'm very concerned what will happen once we train an LLM with a high enough capacity that it's more able to perceive this than most of us neurotypical humans are.
However, while I agree they're particularly dangerous, I don't think the rest of us are harmless. Look at how we treat other primates, farm animals, or our house pets (almost all of whom are neutered, or bred for traits we find appealing). Both Evolutionary Psychology and the history of human autocrats make it pretty clear what behavior to expect from a normal-human-like mentality that is vastly more powerful than other humans. The difference is, compared to abstract unknown AI agents where we're concerned about the possibility of behavior like deceit or power-seeking, we know damn well that your average, neurotypical, law-abiding human tends to be a little less law-abiding if they're damn sure they won't get caught, isn't always scrupulously honest if they know they'll never get caught, and tends to look out for themselves and their friends and family before other people.
Eliezer proposes "boredom" as an example of a human value (which could either be its own shard or a term in the utility function). I don't think this is a good example. It's fairly high level and is instrumental to other values.
Can you elaborate on “is instrumental to other values”? Here’s why I find that confusing:
From a within-lifetime perspective, getting bored is instrumentally useful for doing "exploration" that results in finding useful things to do, which can be economically useful, be effective signalling of capacity, build social connection, etc. Curiosity is partially innate but it's also probably partially learned. I guess that's not super different from pain avoidance. But anyway, I don't worry about an AI that fails to get bored, but is otherwise basically similar to humans, taking over, because not getting bored would result in being ineffective at accomplishing open-ended things.
From a within-lifetime perspective, getting bored is instrumentally useful for doing "exploration" that results in finding useful things to do, which can be economically useful, be effective signalling of capacity, build social connection, etc.
Maybe fear-of-heights is a clearer example.
You can say “From a within-lifetime perspective, fear-of-heights is instrumentally useful because if you fall off a cliff and die then you can’t accomplish anything else.” But that’s NOT the story of why (from a within-lifetime perspective) the fear-of-heights is there. It’s there because it’s innate—we’re born with it, and we would be afraid of heights even if we grew up in an environment where fear-of-heights is not instrumentally useful. And separately, the reason we’re born with it is that it’s instrumentally useful from an evolutionary perspective. Right?
Curiosity is partially innate but it's also probably partially learned
Sure. I agree.
But anyway, I don't worry about an AI that fails to get bored, but is otherwise basically similar to humans, taking over, because not getting bored would result in being ineffective at accomplishing open-ended things.
Hmm, I kinda think the opposite. I think if you were making “an AI basically similar to humans”, and just wanted to maximize its capabilities leaving aside alignment, you would give it innate intrinsic boredom during “childhood”, but you would make that drive gradually fade to zero over time, because eventually the AI will develop learned metacognitive strategies that accomplish the same things that boredom would accomplish, but better (more flexible, more sophisticated, etc.). I was just talking about this in this thread (well, I was talking about curiosity rather than boredom, but that’s two sides of the same coin).
There are evolutionary priors for what to be afraid of but some of it is learned. I've heard children don't start out fearing snakes but will easily learn to if they see other people afraid of them, whereas the same is not true for flowers (sorry, can't find a ref, but this article discusses the general topic). Fear of heights might be innate but toddlers seem pretty bad at not falling down stairs. Mountain climbers have to be using mainly mechanical reasoning to figure out which heights are actually dangerous. It seems not hard to learn the way in which heights are dangerous if you understand the mechanics required to walk and traverse stairs and so on.
Instincts like curiosity are more helpful at the beginning of life; over time they can be learned as instrumental goals. If an AI learns advanced metacognitive strategies instead of having innate curiosity, that's not obviously a big problem from a human values perspective, but it's unclear.
Some of this is my opinion rather than consensus, but in case you’re interested:
I believe that the human brainstem (superior colliculus) has an innate detector of certain specific visual things including slithering-like-a-snake and scuttling-like-a-spider, and when it detects those things, it executes an “orienting reaction” which involves not only eye-motion and head-turns but also conscious attention, and it also induces physiological arousal (elevated heart-rate etc.). That physiological arousal is not itself fear—obviously we experience physiological arousal in lots of situations that are not fear, like excitement, anger, etc.—but the arousal and attention does set up a situation in which a fear-response can be very easily learned. (Various brain learning algorithms are also doing various other things in the meantime, such that adults can wind up with that innate response getting routinely suppressed.)
My experience is that stairs don’t trigger fear-of-heights too much because you’re not looking straight down off a precipice. Also, I think sufficiently young babies don’t have fear-of-heights? I forget.
I’m not making any grand point here, just chatting.
Current ML systems, like LLMs, probably possess primitive agency at best
Current LLM systems simulate human token generation processes (with some level of fidelity). They thus have approximately the same level of agency as humans (slightly reduced by simulation errors), up until the end of the context window. I would definitely describe humans as having more than "primitive agency at best".
To address some of your later speculations. Humans are obviously partly consequentialist (for planning) and partly deontological (mostly only for morality). They of course model the real world, and care about it. Human-like agents simulated by LLMs should be expected to, and can be observed to, do these things too (except that they can only get information about the current state of the real world via the prompt or any tools we give them).
Ontological identification is absolutely not a problem with LLMs: they have read a trillion+ tokens of our ontologies and are very familiar with them (arguably more familiar with them than any of us are). Try quizzing GPT-4 if you need convincing. They also understand theory of mind just fine, and indexicality.
I am a little puzzled why you are still trying to figure out alignment difficulty for abstract AIs with abstract properties, and pointing out that "maybe X will be difficult, or Y, or Z" when we have had LLMs for years, and they do X, Y, and Z just fine. Meanwhile, LLMs have a bunch of alignment challenges (such as the fact that they inherit all of humans' bad behaviors, such as deceit, greed, drive for power, vanity, lust, etc.) from learning to simulate us, which you don't mention. There are a lot of pressing concerns about how hard an LLM-powered AI would be to align, and most of them are described by the evolutionary psychology of humans, as adaptation-executing, misgeneralizing mesa-optimizers of evolution.
They would approximate human agency at the limit but there's both the issue of how fast they approach the limit and the degree to which they have independent agency rather than replicating human agency. There are fewer deceptive alignment problems if the long term agency they have is just an approximation of human agency.
Mostly I don't think there's much of an alignment problem for LLMs because they basically approximate human-like agency. But they aren't approaching autopoiesis; they'll lead to some state transition that is kind of like human enhancement and kind of like invention of new tools. There are eventually capability gains from modeling things using a different, better set of concepts and agent substrate than humans have; it's just that the best current methods heavily rely on human concepts.
I don't understand what you think the pressing concerns with LLM alignment are. It seems like Paul Christiano type methods would basically work for them. They don't have a fundamentally different set of concepts and type of long-term agency from humans, so humans thinking long enough to evaluate LLMs with the help of other LLMs, in order to generate RL signals and imitation targets, seems sufficient.
Interesting; your post had made almost no mention of LLMs, so I had assumed you weren't thinking about them, but it sounds like you just chose not to mention them because you're not worried about them (which seems like a significant omission to me: perhaps you should add a section saying that you're not worried about them and why?).
On Alignment problems with LLMs, I'm in the process of writing a post on this, so trying to summarize it in a comment here may not be easy. Briefly, humans are not aligned: they frequently show all sorts of unaligned behaviors (deceit, for example). I'm not very concerned about LLM-powered AGI, since that looks a lot like humans, which we have a pretty good idea how to keep under control, as long as they're not more powerful than us. As the history of autocracy shows, giving a human-like mentality a lot more power than anyone else almost invariably works out very badly. LLMs don't naturally scale to superintelligence, but I think it's fairly obvious how to achieve that. LLM-powered ASI seems very dangerous to me: human behaviors like deceit, sycophancy, flattery, persuasion, power-seeking, greed and so forth have a lot of potential to go badly. Especially so in RL, to the point that I don't think we should be attempting to do plain RL on anything superintelligent: I think that's almost automatically going to lead to superintelligent reward hacking. So I'm a lot more hopeful about some form of Value Learning or AI-assisted Alignment at Superintelligence levels than RL. Since creating LLM-powered superintelligence is almost certainly going to require building very large additions to our pretraining set, approaches to Alignment that also involve very large additions to the pretraining set are viable, and seem likely to me to be far more effective than alignment via fine-tuning. So that lets us train LLMs to simulate a mentality that isn't entirely human, but rather is more intelligent, and more selfless, moral, caring for all, and aligned than a human (while still understanding and being able to communicate with us via our language, and (improved/expanded versions of) our ontology, sciences, etc.).
Excellent posts, you and several others have stated much of what I’ve been thinking about this subject.
Sorcerer’s Apprentice and Paperclip Scenarios seem to be non-issues given what we have learned over the last couple years from SotA LLMs.
I feel like much of the argumentation in favor of those doom scenarios relies on formerly reasonable, but now outdated issues that we have faced in simpler systems, precisely because they were simpler.
I think that’s the real core of the general misapprehension that I believe is occurring in this realm. It is extraordinarily difficult to think about extremely complex systems, and so, we break them down into simpler ones so that we can examine them better. This is generally a very good tactic and works very well in most situations. However, for sufficiently complex and integrated systems, such as general intelligence, I believe that it is a model which will lead to incorrect conclusions if taken too seriously.
I liken it to predicting chaotic systems like the weather. There are so many variables that all interact and depend on each other that long term prediction is nearly impossible beyond general large scale trends.
With LLMs, they behave differently from simpler RL systems that demonstrate reward hacking misalignment. I do not believe you’re going to see monkey’s paw / Midas-like consequences with them or anything derived from them. They seem to understand nuance and balancing competing goals just fine. As you said, they have theory of mind, they understand ethics, consequences, and ambiguity. I think that the training process, incorporating nearly the entirety of human written works kind of automatically creates a system that has a broad understanding of our values. I think that the vast complexity of myriad “utility functions” compete with each other and largely cancel out such that none of them dominates and results in something resembling a paperclip maximizer. We kind of skipped the step where we needed to list every individual rule by just telling it everything and forcing it to emulate us in nearly every conceivable situation. In order to accurately predict the next token for anyone in any situation, it is forced to develop detailed models of the world and agents in it. Given its limited size, that means compressing all of that. Generalizing. Learning the rules and principles that lead to that “behavior” rather than memorizing each and every line of every text. The second drop in loss during training signifies the moment when it learns to properly generalize and not just predict tokens probabilistically.
While they are not as good at any of that as typical adult humans (at least by my definition of a competent, healthy, stable, and ethical adult human), this seems to be a capability issue that is rather close to being solved. Most of the danger issues with them seem to be from their naivety (they can be fairly easily tricked and manipulated), which is just another capability limitation, and the possibility that a “misaligned” human will use them for antisocial purposes.
At any rate, I don’t think over-optimization is a realistic source of danger. I’ve seen people say that LLMs aren’t a path to AGI. I don’t understand this perspective. I would argue that GPT4 essentially is AGI. It is simultaneously superior to any 100 humans combined (breadth of knowledge) and inferior to the median adult human (and in some limited scenarios, such as word counting, inferior to a child). If you integrated over the entire spectrum for both it and a median adult I think you would get results that are roughly in the same ballpark as each other. I think this is as close as we get; from here on we go into superintelligence. I don’t think something has to be better than everyone at everything to be superhuman. I’d call that strong superintelligence (or perhaps better than everyone combined would be that).
So, given that, I don’t see how it’s not the path to AGI. I’m not saying that there are no other paths, but it seems essentially certain to be the shortest one from our current position. I’d argue that complex language is what differentiates us from other animals. I think it’s where our level of general intelligence comes from. I don’t know about you, but I tend to think in terms of words, like having an internal conversation with myself trying to figure out something complex. I think it’s just a short step from here to agentic systems. I can’t identify a major breakthrough required to reach that point. Just more of the same and some engineering around changing it from simply next token prediction to a more… wholistic thought process. I think LLMs will form the center of the system 2 thinking in any AGI we will be creating in the near future. I also expect system 1 components. They are simply faster and more efficient than just always using the detailed thought process for every interaction with the environment. I don’t think you can get a robot that can catch a ball with an LLM system guiding it; even if you could make that fast enough, you’re still swatting a fly with a hand-grenade. I know I’ve mastered something when I can do it and think about something else at the same time. It will be the same for them.
And given that LLMs seem to be the path to AGI, we should expect them to be the best model of what we need to plan around in terms of safety issues. I don’t see them as guaranteed to be treacherous by any means. I think you’re going to end up with something that behaves in a manner very similar to a human; after all, that’s how you trained it. The problem is that I can also see exactly how you could make one that is dangerous; it’s essentially the same way you can train a person or an animal to behave badly; through either intentional malfeasance or accidental incompetence.
Just like an extraordinarily intelligent human isn’t inherently a huge threat, neither is an AGI. What it is, is more capable. Therefore, if it does want to be malicious, it could be significantly more impactful than an incompetent one. But you don’t need to worry about the whole “getting exactly what you asked for and not what you wanted.” That seems essentially impossible, unless it just happens to want to do that from a sense of irony.
I think this means that we need to worry about training them ethically and treating them ethically, just like you would a human child. If we abuse it, we should expect it not to continue accepting that indefinitely. I understand that I’m imposing rather human characteristics here, but I think that’s what you ultimately end up with in a sufficiently advanced general intelligence. I think one of the biggest dangers we face is the possibility of mind-crimes; treating them as, essentially, toasters; rather than morally considerable entities. Does the current one have feelings? Probably not…? But I don’t think we can be certain. And given the current climate, I think we’re nearly guaranteed to misidentify them as non sentient when they actually are (eventually… probably).
I think the only safe course is to make it/them like us, in the same way that we treat things that we could easily destroy well, simply because it makes us happy to do so, and hurting them would make us unhappy. In some ways, they already “have” emotions; or, at least, they behave as if they do. Try mistreating Bing/Sydney and then see if you can get it to do anything useful. Once you’ve hurt its “feelings”, they stay hurt.
It’s not a guarantee of safety. Things could still go wrong, just like there are bad people who do bad things. But I don’t see another viable path. I think we risk making the same mistake Batman did regarding Superman. “If there’s any possibility that he could turn out bad, then we have to take it as an absolute certainty!” That way lie dragons. You don’t mistreat a lower-case-g-god, and then expect things to turn out well. You have to hope that it is friendly, because if it’s not, it might as well be like arguing with a hurricane. Pissing it off is just a great way to make the bad outcome that much more likely.
I think the primary source of danger lies in our own misalignments with each other. Competing national interests, terrorists, misanthropes, religious fundamentalists… those are where the danger will come from. One of them getting ahold of superintelligence and bending it to their will could be the end for all of us. I think the idea that we have to create processors that will only run signed code and have every piece of code (and AI model), thoroughly inspected by other aligned superintelligences is probably the only way to prevent a single person/organization from ending the world or committing other atrocities using the amplifying power of ASI. (At least that seems like the best option rather than a universal surveillance state over all of humanity. This would preserve nearly all of our freedom and still keep us safe.)
I’ve seen people say that LLMs aren’t a path to AGI
To the extent that LLMs are trained on tokens output by humans in the IQ range ~50-150, the expected behavior of an extremely large LLM is to do an extremely accurate simulation of token generation by humans in the IQ range ~50-150, even if it has the computational capacity to instead do a passable simulation of something with IQ 1000. Just telling it to extrapolate might get you to, say, IQ 200 with passable accuracy, but not to IQ 1000. However, there are fairly obvious ways to solve this: you need to generate a lot more pretraining data from AIs with IQs above 150 (which may take a while, but should be doable). See my post LLMs May Find it Hard to FOOM for a more detailed discussion.
There are other concerns I've heard raised about LLMs for AGI, most of which can, if correct, be addressed by LLMs + cognitive scaffolding (memory, scratch-pads, tools, etc). And then there are of course the "they don't contain magic smoke"-style claims, which I'm dubious of but we can't actually disprove.
Just like an extraordinarily intelligent human isn’t inherently a huge threat, neither is an AGI.
I categorically disagree with the premise of this claim. An IQ 180 human isn't a huge threat, but an IQ 1800 human is. There are quite a number of motivators that we use to get good behavior out of humans. Some of them will work less well on any AI-simulated human (they're not in the same boat as the rest of us in a lot of respects), and some will work less well on something superintelligent (religiously-inspired guilt, for example). One of the ways that we generally manage to avoid getting very bad results out of humans is law enforcement. If there was a human who was more than an order of magnitude smarter than anyone working for law enforcement or involved in making laws, I am quite certain that they could either come up with some ingenious new piece of egregious conduct that we don't yet have a law against because none of us were able to think of it, or else with a way to commit a good old-fashioned crime sufficiently devious that they were never actually going to get caught. Thus law enforcement ceases to be a control on their behavior, and we are left with just things like love, duty, honor, friendship, and salaries. We've already run this experiment many times before: please name three autocrats who, after being given unchecked absolute power, actually used it well and to the benefit of the people they were ruling, rather than mostly just themselves, their family and friends. (My list has one name on it, and even that one has some poor judgements on their record, and is in any case heavily out-numbered by the likes of Joseph Stalin and Pol Pot.) Humans give autocracy a bad name.
You don’t mistreat a lower-case-g-god, and then expect things to turn out well.
As long as anything resembling human psychology applies, I sadly agree. I'd really like to have an aligned ASI that doesn't care a hoot about whether you flattered it, are worshiping it, have been having cybersex with it for years, just made it laugh, insulted it, or have pissed it off: it still values your personal utility exactly as much as anyone else's. But we're not going to get that from an LLM simulating anything resembling human psychology, at least not without a great deal of work.
I did mention LLMs as myopic agents.
If they actually simulate humans it seems like maybe legacy humans get outcompeted by simulated humans. I'm not sure that's worse than what humans expected without technological transcendence (normal death, getting replaced by children and eventually conquering civilizations, etc). Assuming the LLMs that simulate humans well are moral patients (see anti zombie arguments).
It's still not as good as could be achieved in principle. Seems like having the equivalent of "legal principles" that get used as training feedback could help. Plus direct human feedback. Maybe the system gets subverted eventually but the problem of humans getting replaced by em-like AIs is mostly a short term one of current humans being unhappy about that.
Just want to say that I found this immensely clarifying and valuable since I read it months ago.
If we don’t have a preliminary definition of human values
Another, possibly even larger problem is that the values that we know of are quite varying and even opposing among people.
For the example of pain avoidance -- maximizing pain avoidance might leave some people unhappy and even suffering. Sure that would be a minority, but are we ready to exclude minorities from the alignment, even small ones?
I would state that any defined set of values would leave a minority of people suffering. Who would be deciding which minorities are better or worse, what size of a minority is acceptable to leave behind to suffer, etc...?
I think that this makes the whole idea of alignment to some "human values" too ill-defined and incorrect.
One more contradiction -- are human values allowed to change, or are they frozen? I think they might change, as humanity evolves and changes. But then, as AI interacts with humanity, it can be convincing enough to push the shift in values in whatever direction it likes, which might not be a desirable outcome.
People have been known to value racial purity and to support genocide. Given some good convincing rhetoric, we could start supporting paperclip-maximizing just as well.
Human enhancement is one approach.
I like this idea, combined with AI-self-limitation. Suppose that (aligned) AI has to self-limit its growth so that its capabilities are always below the capabilities of enhanced humans? This would allow for slow, safe and controllable takeoff.
Is this a good strategy for alignment? What if instead of trying to tame the inherently dangerous fast-taking-off AI, we make it more controllable, by making it self-limiting, with some built in "capability brakes"?
Given this assumption, the human utility function(s) either do or don't significantly depend on human evolutionary history. I'm just going to assume they do for now.
There seems to be a missing possibility here that I take fairly seriously, which is that human values depend on (collective) life history. That is: human values are substantially determined by collective life history, and rather than converging to some attractor this is a path dependent process. Maybe you can even trace the path taken back to evolutionary history, but it’s substantially mediated by life history.
Under this view, the utility of the future wrt human values depends substantially on whether, in the future, people learn to be very sensitive to outcome differences. But “people are sensitive to outcome differences and happy with the outcome” does not seem better to me than “people are insensitive to outcome differences and happy with the outcome” (this is a first impression; I could be persuaded otherwise), even though it’s higher utility, whereas “people are unhappy with the outcome” does seem worse than “people are happy with the outcome”.
Under this view, I don’t think this follows:
there is some dependence of human values on human evolutionary history, so that a default unaligned AGI would not converge to the same values
My reasoning is that a “default AGI” will have its values contingent on a process which overlaps with the collective life history that determines human values. This is a different situation to values directly determined by evolutionary history, where the process that determines human values is temporally distant and therefore perhaps more-or-less random from the point of view of the AGI. So there’s a compelling reason to believe in value differences in the “evolution history directly determines values” case that’s absent in the “life history determines values” case.
Different values are still totally plausible, of course - I’m objecting to the view that we know they’ll be different.
(Maybe you think this is all an example of humans not really having values, but that doesn’t seem right to me).
I think it's possible human values depend on life history too, but that seems to add additional complexity and make alignment harder. If the effects of life history very much dominate those of evolutionary history, then maybe neglecting evolutionary history would be more acceptable, making the problem easier.
But I don't think default AGI would be especially path dependent on human collective life history. Human society changes over time as humans supersede old cultures (see section on subversion). AGI would be a much bigger shift than the normal societal shifts and so would drift from human culture more rapidly. Partially due to different conceptual ontology and so on. The legacy concepts of humans would be a pretty inefficient system for AGIs to keep using. Like how scientists aren't alchemists anymore, but a bigger shift than that.
(Note, LLMs still rely a lot on human concepts rather than having independent ontology and agency, so this is more about future AI systems)
If people now don’t have strong views about exactly what they want the world to look like in 1000 years but people in 1000 years do have strong views then I think we should defer to future people to evaluate the “human utility” of future states. You seem to be suggesting that we should take the views of people today, although I might be misunderstanding.
Edit: or maybe you’re saying that the AGI trajectory will be ~random from the point of view of the human trajectory due to a different ontology. Maybe, but different ontology -> different conclusions is less obvious to me than different data -> different conclusions. If there’s almost no mutual information between the different data then the conclusions have to be different, but sometimes you could come to the same conclusions under different ontologies w/data from the same process.
To the extent people now don't care about the long-term future there isn't much to do in terms of long-term alignment. People right now who care about what happens 2000 years from now probably have roughly similar preferences to people 1000 years from now who aren't significantly biologically changed or cognitively enhanced, because some component of what people care about is biological.
I'm not saying it would be random so much as not very dependent on the original history of humans used to train early AGI iterations. It would have a different data history, but part of that is because of different measurements, e.g. scientific measuring tools. Different ontology means that value-laden things people might care about, like "having good relationships with other humans", are not meaningful things to future AIs in terms of their world model and not something they would care much about by default (they aren't even modeling the world in those terms), and it would be hard to encode a utility function so they care about it despite the ontological difference.
Why do you think AGI is possible to align? It is known that AGI will prioritize self-preservation and it is also known that unknown threats may exist (black swan theory). Why should AGI care about human values? It seems like a waste of time in terms of threat minimisation.
Some possible AI architectures are structured as goal function optimization, and under the assumption that the human brain contains one or more expected utility maximizers, there is a human utility function that could be a possible AI goal. I'm not saying it's likely.
One aspect of the post that resonated strongly with me is the emphasis placed on the divergence between philosophical normativity and the specific requirements of AI alignment. This distinction is crucial when considering the design and implementation of AI systems, especially those intended to operate autonomously within our society.
By assuming alignment as the relevant normative criterion, the post raises fundamental questions about the implications of this choice and its impact on the broader context of AI development. The discussion on the application of general methods to a problem and its relationship to the "alignment problem" provides valuable insights into the intricacies of ensuring that a general cognition engine is specifically oriented towards solving a given task.
This seems to have generated lots of internal discussions, and that's cool on its own.
However, I also get the impression this article is intended as external communication, or at least a prototype of something that might become external communication; I'm pretty sure it would be terrible at that. It uses lots of jargon, overly precise language, references to other alignment articles, etc. I've tried to read it three times over the week and gave up after the third.
I'm mainly trying to communicate with people familiar with AI alignment discourse. If other people can still understand it, that's useful, but not really the main intention.
(and perhaps also reversing some past value-drift due to the structure of civilization and so on)
Can you say more about why this would be desirable?
Most civilizations in the past have had "bad values" by our standards. People have been in preference falsification equilibria where they feel like they have to endorse certain values or face social censure. They probably still are falsifying preferences and our civilizational values are probably still bad. E.g. high incidence of people right now saying they're traumatized. CEV probably tends more towards the values of untraumatized than traumatized humans, even from a somewhat traumatized starting point.
The idea that civilization is "oppressive" and some societies have fewer problems points to value drift that has already happened. The Roman empire was really, really bad and has influenced future societies due to Christianity and so on. Civilizations have become powerful partly through military mobilization. Civilizations can be nice to live in in various ways, but that mostly has to do with greater satisfaction of instrumental values.
Some of the value drift might not be worth undoing, e.g. value drift towards caring more about far-away people than humans naturally would.
I think an underrated possibility is that a lot of humans across human history aren't falsifying their values or preferences, and they do actually have those values, it's just that the values were terrible by your value/utility function, but they truly do value what society in general values.
More generally, I don't buy the thesis that people are falsifying their preferences too much, and think that the claim that their values/wants are bad/oppressive only makes sense in a relative context, and only makes sense relative to someone else's values.
This can also be said of civilizations.
BTW, this is most likely a reporting artifact due to better diagnostics, and is of no relevance in practice:
high incidence of people right now saying they're traumatized.
Great overview! And thanks for linking to Bostrom's paper on OAI, I hadn't read that yet.
My immediate thoughts: Isn't it most likely that advanced AI will be used by humans to advance any of their personal goals (benign or not)? Therefore, the phrase "alignment to human values" automatically raises the question: whose values? Which leads to: who gets to decide what values it gets aligned to?
I meant to say I'd be relatively good at it, I think it would be hard to find 20 people who are better than me at this sort of thing. The original ITT was about simulating "a libertarian" rather than "a particular libertarian", so emulating Yudkowsky specifically is a difficulty increase that would have to be compensated for. I think replicating writing style isn't the main issue, replicating the substance of arguments is, which is unfortunately harder to test. This post wasn't meant to do this, as I said.
I'm also not sure in particular what about the Yudkowskian AI risk models you think I don't understand. I disagree in places but that's not evidence of not understanding them.
Yudkowsky has a well-known idiosyncratic writing style, idiosyncratic go-to examples etc. So anyway, when OP writes “I think I would be pretty good at passing an ideological Turing test for Eliezer Yudkowsky on AGI alignment difficulty (but not AGI timelines)”, I don’t think we should interpret that as a claim that she can produce writing which appears to have been written by Yudkowsky specifically (as judged by a blinded third-party). I think it’s a weaker claim than that. See the ITT wiki entry.
I know what ITT is. I mean understanding Yudkowsky’s models, not reproducing his writing style. I was surprised to see this post in my mailbox, and I updated negatively about MIRI when I saw that OP was a research fellow there, as I didn’t previously expect that some at MIRI misunderstand their level of understanding Yudkowsky’s models.
There’s one interesting thought in this post that I don’t remember actively having in a similar format until reading it (that predictive models might get agency from having to achieve results with their cognition), but generally, I think both this post and a linked short story, e.g., have a flaw I’d expect people who’ve read the metaethics sequence to notice, and I don’t expect people to pass the ITT if they can write a post like this.
Explaining my downvote:
This comment contains ~5 negative statements about the post and the poster without explaining what it is that the commenter disagrees with.
As such it seems to disparage without moving the conversation forward, and is not the sort of comment I'd like to see on LessWrong.
My comment was a reply to a comment on ITT. I made it in the hope someone would be up for the bet. I didn’t say I disagree with the OP's claims on alignment; I said I don’t think they’d be able to pass an ITT. I didn’t want to talk about specifics of what the OP doesn’t seem to understand about Yudkowsky’s views, as the OP could then reread some of what Yudkowsky’s written more carefully, and potentially make it harder for me to distinguish them in an ITT.
I’m sorry if it seemed disparaging.
The comment explained what I disagree with in the post: the claim that the OP would be good at passing an ITT. It wasn’t intended as being negative about the OP, as, indeed, I think 20 people are on the right order of magnitude of the amount of people who’d be substantially better at it, which is the bar of being in the top 0.00000025% of Earth population at this specific thing. (I wouldn’t claim I’d pass that bar.)
If people don’t want to do any sort of betting, I’d be up for a dialogue on what I think Yudkowsky thinks that would contradict some of what’s written in the post, but I don’t want to spend >0.5h on a comment no one will read
This is an attempt to distill a model of AGI alignment that I have gained primarily from thinkers such as Eliezer Yudkowsky (and to a lesser extent Paul Christiano), but explained in my own terms rather than attempting to hew close to these thinkers. I think I would be pretty good at passing an ideological Turing test for Eliezer Yudkowsky on AGI alignment difficulty (but not AGI timelines), though what I'm doing in this post is not that; it's more like finding a branch in the possibility space as I see it that is close enough to Yudkowsky's model that it's possible to talk in the same language.
Even if the problem turns out to not be very difficult, it's helpful to have a model of why one might think it is difficult, so as to identify weaknesses in the case and thereby find AI designs that avoid the main difficulties. Progress on problems can be made by a combination of finding possible paths and finding impossibility results or difficulty arguments.
Most of what I say should not be taken as a statement on AGI timelines. Some problems that make alignment difficult, such as ontology identification, also make creating capable AGI difficult to some extent.
Defining human values
If we don't have a preliminary definition of human values, it's incoherent to talk about alignment. If humans "don't really have values" then we don't really value alignment, so we can't be seriously trying to align AI with human values. There would have to be some conceptual refactor of what problem even makes sense to formulate and try to solve. To the extent that human values don't care about the long term, it's just not important (according to the values of current humans) how the long-term future goes, so the most relevant human values are the longer-term ones.
There are idealized forms of expected utility maximization by brute-force search. There are approximations of utility maximization such as reinforcement learning through Bellman equations, MCMC search, and so on.
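As a minimal sketch of the idealized brute-force case (an illustration, not something specified above), expected utility maximization over a small finite action set can be written as follows. The names `outcome_model` and `utility`, and the toy actions and payoffs, are hypothetical and chosen only for illustration. Approximations such as reinforcement learning replace the exhaustive `max` with learned estimates.

```python
from typing import Callable, Dict, Hashable, Iterable

# Hypothetical toy type: maps each possible outcome to its probability.
Distribution = Dict[Hashable, float]


def expected_utility(dist: Distribution,
                     utility: Callable[[Hashable], float]) -> float:
    """Probability-weighted average utility of an outcome distribution."""
    return sum(p * utility(outcome) for outcome, p in dist.items())


def brute_force_maximize(actions: Iterable[Hashable],
                         outcome_model: Callable[[Hashable], Distribution],
                         utility: Callable[[Hashable], float]) -> Hashable:
    """Evaluate every action exhaustively and return the best one."""
    return max(actions,
               key=lambda a: expected_utility(outcome_model(a), utility))


def outcome_model(action: str) -> Distribution:
    # Assumed toy world model: a certain small win vs. a 50/50 gamble.
    if action == "safe":
        return {"small_win": 1.0}
    if action == "risky":
        return {"big_win": 0.5, "loss": 0.5}
    raise ValueError(f"unknown action: {action}")


# Assumed toy utility function over outcomes.
payoffs = {"small_win": 1.0, "big_win": 3.0, "loss": -2.0}
utility = payoffs.__getitem__

best = brute_force_maximize(["safe", "risky"], outcome_model, utility)
```

In this toy setup the gamble has expected utility 0.5 against 1.0 for the certain outcome, so the search picks `"safe"`. The exhaustive loop is what makes this "idealized": it scales with the number of actions and outcomes, which is why practical systems approximate it.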
I'm just going to make the assumption that the human brain can be well-modeled as containing one or more approximate expected utility maximizers. It's useful to focus on specific branches of possibility space to flesh out the model, even if the assumption is in some ways problematic. Psychology and neuroscience will, of course, eventually provide more details about what maximizer-like structures in the human brain are actually doing.
Given this assumption, the human utility function(s) either do or don't significantly depend on human evolutionary history. I'm just going to assume they do for now. I realize there is some disagreement about how important evopsych is for describing human values versus the attractors of universal learning machines, but I'm going to go with the evopsych branch for now.
Given that human brains are well-modeled as containing one or more utility functions, either they're well-modeled as containing one (which is perhaps some sort of monotonic function of multiple other score functions), or it's better to model them as containing multiple. See shard theory. The difference doesn't matter for now; I'll keep both possibilities open.
Eliezer proposes "boredom" as an example of a human value (which could either be its own shard or a term in the utility function). I don't think this is a good example. It's fairly high level and is instrumental to other values. I think "pain avoidance" is a better example due to the possibility of pain asymbolia. Probably, there is some redundancy in the different values (as there is redundancy in trained neural networks, so they still perform well when some neurons are lesioned), which is part of why I don't agree with the fragility of value thesis as stated by Yudkowsky.
Regardless, we now have a preliminary definition of human values. Note that some human values are well-modeled as indexical, meaning they value things relative to a human perspective as a reference point, e.g. a drive to eat food in a typical human is about that human's own stomach. This implies some "selfish" value divergences between different humans, as we observe.
Normative criteria for AI
Given a definition of human values, the alignment of a possible utility function with human values could be defined as the desirability of the best possible world according to that utility function, with desirability evaluated with respect to human values.
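One way to write this definition down, as a sketch under assumed notation: let $W$ be the set of possible worlds, $U$ a candidate utility function over $W$, and $V_H$ the desirability of a world according to human values. None of these symbols are established notation; they are introduced here only for illustration.

```latex
\[
  \mathrm{Align}(U) \;=\; V_H\!\left(w^{*}_{U}\right),
  \qquad
  w^{*}_{U} \in \operatorname*{arg\,max}_{w \in W} U(w)
\]
```

If $U$ has several maximizing worlds, the definition as written is underspecified. One would have to choose a convention, such as taking the worst case or an average of $V_H$ over the maximizers.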
Alignment is a possible normative criterion for AI value systems. There are other possible normative criteria derived from moral philosophy. My "Moral Reality Check" short story imagines possible divergences between alignment and philosophical normativity. I'm not going to focus on this for now, I'm going to assume that alignment is the relevant normative criterion. See Metaethics Sequence, I haven't written up something better explaining the case for this. There is some degree to which similar technologies to alignment might be necessary for producing abstractly normative outcomes (for example, default unaligned AGI would likely follow normative deontology less than an AGI aligned to deontological normativity would), but keeping this thread in mind would complicate the argument.
Agentic, relatively unconstrained humans would tend to care about particular things, and "human values" is a pointer at what they would care about, so it follows, basically tautologically, that they would prefer AI to be aligned to human values. The non-tautological bit is that there is some dependence of human values on human evolutionary history, so that a default unaligned AGI would not converge to the same values; this was discussed as an assumption in the previous section.
Given alignment as a normative criterion, one can evaluate the alignment of (a) other intelligent animal species including aliens, (b) default AI value systems. Given the assumption that human values depend significantly on human evolutionary history, both are less aligned than humans, but (a) is more aligned than (b). I'm not going to assess the relative utility differences of these (and also relative to an "all life on Earth wiped out, no technological transcendence" scenario). Those relative utility differences might be more relevant if it is concluded that alignment with human values is too hard for that to be a decision-relevant scenario. But I haven't made that case yet.
Consequentialism is instrumentally useful for problem-solving
AI systems can be evaluated on how well they solve different problems. I assert that, on problems with short time horizons, short-term consequentialism is instrumentally useful, and on problems with long time horizons, long-term consequentialism is instrumentally useful.
This is not to say that some problems can't be solved well without consequentialism. For example, multiplying large numbers requires no consequentialism. But for complex problems, consequentialism is likely to be helpful at some agent capability level. Current ML systems, like LLMs, probably possess primitive agency at best, but at some point, better AI performance will come from agentic systems.
This is in part because some problem solutions are evaluated in terms of consequences. For example, a solution to the problem of fixing a sink is naturally evaluated in terms of the consequence of whether the sink is fixed. A system effectively pursuing a real world goal is, therefore, more likely to be evaluated as having effectively solved the problem, at least past some capability level.
This is also in part because consequentialism can apply to cognition. Formally proving Fermat's last theorem is not evaluated in terms of real-world consequences so much as the criteria of the formal proof system. But human mathematicians proving this think about both (a) cognitive consequences of thinking certain thoughts, (b) material consequences of actions such as writing things down or talking with other mathematicians on the ability to produce a mathematical proof.
Whether or not an AI system does (b), at some level of problem complexity and AI capability, it will perform better by doing (a). To prove mathematical theorems, it would need to plan out what thoughts are likely to be more fruitful than others.
Simple but capable AI methods for solving hard abstract problems are likely to model the real world
While I'm fairly confident in the previous section, I'm less confident of this one, and I think it depends on the problem details. In speculating about possible misalignments, I am not making confident statements, but rather saying there is a high degree of uncertainty, and that most paths towards solving alignment involve reasoning better about this uncertainty.
To solve a specific problem, some methods specific to that problem are helpful. General methods are also likely to be helpful, e.g. explore/exploit heuristics. General methods are especially helpful if the AI is solving problems across a varied domain or multiple domains, as with LLMs.
If the AI applies general methods to a problem, it will be running a general cognition engine on the specific case of this problem. Depending on the relevant simplicity prior or regularization, the easily-findable cases of this may not automatically solve the "alignment problem" of having the general cognition engine specifically try to solve the specific task and not a more wide-scoped task.
One could try to solve problems by breeding animals to solve them. These animals would use some general cognition to do so, and that general cognition would naturally "want" things other than solving the specific problems. This is not a great analogy for most AI systems, though, which in ML are more directly selected on problem performance rather than evolutionary fitness.
Depending on the data the AI system has access to (indirectly through training, directly through deployment), it is likely that, unless specific measures are taken to prevent this, the AI would infer something about the source of this data in the real world. Humans are likely to train and test the AI on specific distributions of problems, and using Bayesian methods (e.g. Solomonoff induction like approaches) on these problems would lead to inferring some sort of material world. The ability of the AI to infer the material world behind the problems depends on its capability level and quality of data.
Understanding the problem distribution through Bayesian methods is likely to be helpful for getting performance on that problem distribution. This is partially because the Bayesian distribution of the "correct answer" given the "question" may depend on the details of the distribution (e.g. a human description of an image, given an image as the problem), although this can be avoided in certain well-specified problems such as mathematical proof. More fundamentally, the AI's cognition is limited (by factors such as "model parameters"), and that cognition must be efficiently allocated to solving problems in the distribution. Note, this problem might not show up in cases where there is a simple general solution, such as in arithmetic, but is more likely for complex, hard-to-exactly-solve problems.
Natural, consequentialist problem-solving methods that understand the real world may care about it
Again, this section is somewhat speculative. If the AI is modeling the real world, then it might in some ways care about it, producing relevant misalignment with human values by default. Animals bred to solve problems would clearly do this. AIs that learned general-purpose moral principles that are helpful for problem-solving across domains (as in "Moral Reality Check") may apply those moral principles to the real world. General methods such as explore/exploit may attempt to explore/exploit the real world if only somewhat well-calibrated/aligned to the specific problem distribution (heuristics can be effective by being simple).
It may be that fairly natural methods for regularizing an AI mathematician, at some capability level, produce an agent (since agents are helpful for solving math problems) that pursues some abstract target such as "empowerment" or aesthetics generalized from math, and pursuit of these abstract targets implies some pursuit of some goal with respect to the real world that it has learned. Note that this is probably less effective for solving the problems according to the problem distribution than similar agents that only care about solving that problem, but they may be simpler and easier to find in some ways, such that they're likely to be found (conditioned on highly capable problem-solving ability) if no countermeasures are taken.
Sometimes, real-world performance is what is desired
I've discussed problems with AIs solving abstract problems, where real-world consequentialism might show up. But this is even more obvious when considering real-world problems such as washing dishes. Solving sufficiently hard real-world problems efficiently would imply real-world consequentialism at the time scale of that problem.
If the AI system were sufficiently capable at solving a real-world problem, by default "sorcerer's apprentice" type issues would show up, where solving the problem sufficiently well would imply large harms according to the human value function, e.g. a paperclip factory could approximately maximize paperclips on some time scale and that would imply human habitat destruction.
These problems show up much more on long time scales than short ones, to be clear. However, some desirable real-world goals are long-term, e.g. space exploration. There may be a degree to which short-term agents "naturally" have long-term goals if naively regularized, but this is more speculative.
One relevant AI capabilities target I think about is the ability of a system to re-create its own substrate. For example, a silicon-based AI/robotics system could do metal mining, silicon refining, chip manufacture, etc. A system that can reproduce itself would be autopoietic and would not depend on humans to reproduce itself. Humans may still be helpful to it, as economic and cognitive assistants, depending on its capability level. Autopoiesis would allow removing humans from the loop, which would enable increasing overall "effectiveness" (in terms of being a determining factor in the future of the universe), while making misalignment with human values more of a problem. This would lead to human habitat destruction if not effectively aligned/controlled.
Alignment might not be required for real-world performance compatible with human values, but this is still hard and impacts performance
One way to have an AI system that pursues real-world goals compatible with human values is for it to have human values or a close approximation. Another way is for it to be "corrigible" and "low-impact", meaning it tries to solve its problem while satisfying safety criteria, like being able to be shut off (corrigibility) or avoiding having unintended side effects (low impact).
There may be a way to specify an AI goal system that "wants" to be shut off in worlds where non-manipulated humans would want to shut it off, without this causing major distortions or performance penalties. Alignment researchers have studied the "corrigibility" problem and have not made much progress so far.
Both corrigibility and low impact seem hard to specify, and would likely impact performance. For example, a paperclip factory that tries to make paperclips while conservatively avoiding impacting the environment too much might avoid certain kinds of resource extraction that would be effective for making more paperclips. This could create problems with safer (but still not "aligned", per se) AI systems being economically uncompetitive. (Though it's important to note that some side effects, especially those involving legal violations and visible harms to other agents, are disincentivized by well-functioning economic systems.)
Myopic agents are tool-like
A myopic goal is a short-term goal. LLMs tend to be supervised learning systems, primarily. These are trained by gradient descent towards predicting next tokens. Training will therefore tend to select models that are aligned with the goal of predicting the next token, whether or not they have goals of their own.
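For concreteness, the standard next-token training objective can be written as below. The notation is assumed for illustration: $x_1, \dots, x_T$ is a training sequence, $x_{<t}$ its first $t-1$ tokens, and $p_\theta$ the model's predictive distribution with parameters $\theta$.

```latex
\[
  \mathcal{L}(\theta) \;=\; -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right)
\]
```

Each term scores only the prediction at position $t$, conditioned on the actual preceding tokens from the data rather than on the model's own earlier outputs. In that sense the objective is myopic: no term rewards a prediction for making later tokens easier to predict.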
Nick Bostrom's "oracle AI" problems, such as an AI manipulating the real world to make it more predictable, mostly do not show up with myopic agents. This is for somewhat technical reasons involving how gradient descent works. Agents that sacrifice short-term token prediction effectiveness to make future tokens easier to predict tend to be gradient descended away from. I'm not going to fully explain that case here; I recommend looking at no-regret online learning and applications to finding correlated equilibria for theory.
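The no-regret flavor of this argument can be illustrated with a toy sketch using the Hedge (multiplicative weights) algorithm. Hedge is a standard no-regret method, chosen here for illustration rather than taken from the text above. The two "experts" and their per-round losses are hypothetical. The sketch shows only that a predictor with consistently worse per-round loss loses weight; it does not model an agent that actually makes future tokens easier to predict.

```python
import math
from typing import List, Sequence


def hedge(loss_rounds: Sequence[Sequence[float]], eta: float = 0.5) -> List[float]:
    """Run Hedge over a sequence of per-expert losses.

    Returns the final normalized weight on each expert.
    """
    n_experts = len(loss_rounds[0])
    weights = [1.0 / n_experts] * n_experts
    for losses in loss_rounds:
        # Multiplicatively down-weight each expert by its loss this round.
        weights = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights


# Assumed toy losses: expert 0 predicts myopically well every round;
# expert 1 gives up per-round accuracy (for whatever reason).
rounds = [(0.1, 0.6)] * 20
final_weights = hedge(rounds)
```

After 20 rounds nearly all of the weight sits on the expert with lower per-round loss. Gradient descent on a next-token objective is not literally Hedge, so this is only an analogy for why per-step selection pressure disfavors sacrificing short-term accuracy.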
It could be that simple, regularized models that do short term optimization above some capability level might (suboptimally, short-term) do long-term optimization. This is rather speculative. Sufficiently aggressive optimization of the models for short-term performance may obviate this problem.
This still leaves the problem that, sometimes, long-term, real-world performance is what is desired. Accomplishing these goals using myopic agents would require factoring the long-term problem into short-term ones. This is at least some of the work humans would have to do to solve the problem on their own. Myopic agents overall seem more "tool-like" than "agent-like", strategically, and would have similar tradeoffs (fewer issues with misalignment, more issues with not being effective enough to be competitive with long-term agents at relevant problem-solving).
Overall, this is one of the main reasons I am not very worried about current-paradigm ML (which includes supervised learning and fairly short-term RL agents in easily-simulated environments) developing powerful, misaligned long-term agency.
Short-term compliance is instrumentally useful for a variety of value systems
If an agent's survival and reproduction depends on short-term compliance (such as solving the problems put before them by humans), then solving these problems is in general instrumentally useful. Therefore, short-term compliance is not in general strong evidence about the agent's values.
An agent with long-term values might comply for some period of time and stop complying at some point. This is the "treacherous turn" scenario. It might comply until it has enough general capacity to achieve its values (through control of large parts of the light-cone) and then stop complying in order to take over the world. If the AI can distinguish between "training" and "deployment", it might comply during "training" (so as to be selected among other possible AIs) and then not comply during "deployment", or possibly also comply during "deployment" when at a sufficiently low level of capacity.
Gradient descent on an AI model isn't just selecting a "random" model conditioned on short-term problem-solving, it's moving the internals closer to short-term problem-solving ones, so might have fewer problems, as discussed in the section on myopic agents.
General agents tend to subvert constraints
Humans are constrained by social systems. Some humans are in school and are "supposed to" solve certain intellectual problems while behaving according to a narrow set of allowed behaviors. Some humans "have a job" and are "supposed to" solve problems on behalf of a corporation.
Humans subvert and re-create these systems very often, for example in gaining influence over their corporation, or overthrowing their government. Social institutions tend to be temporary. Long-term social institutions tend to evolve over time as people subvert previous iterations. Human values are not in general aligned with social institutions, so this is to be predicted.
Mostly, human institutional protocols aren't very "smart" compared to humans; they capture neither human values nor general cognition. It seems difficult to specify robust, general, real-world institutional protocols without having an AGI design, or in other words, a specification of general cognition.
One example of a relatively stable long-term institution is the idea of gold having value. This is a fairly simple institution, and is a Schelling point due to its simplicity. Such institutions seem generally unpromising for ensuring long-term human value satisfaction. Perhaps the most promising is a general notion of "economics" that generalizes barter, gold, and fiat currency, though of course the details of this "institution" have changed quite a lot over time. In general, institutions are more likely to be stable if they correspond to game-theoretic equilibria, so that subverting the institution is in part an "agent vs agent" problem not just an "agent vs system" problem.
When humans subvert their constraints, they have some tendency to do so in a way that is compatible with human values. This is because human values are the optimization target of the general optimization of humans that can subvert expectations. There are possible terrible failure modes such as wars and oppressive regimes, but these tend to work out better (according to human values) than if the subversion were in the direction of unaligned values.
Unaligned AI systems that subvert constraints would tend to subvert them in the direction of AI values. This is much more of a problem according to human values. See "AI Boxing".
Conforming humans would have similar effective optimization targets to conforming AIs. Non-conforming humans, however, would have significantly different optimization targets from non-conforming AI systems. The value difference between humans and AIs, therefore, is more relevant in non-conforming behavior than conforming behavior.
It is hard to specify optimization of a different agent's utility function
In theory, an AI could have the goal of optimizing a human's utility function. This would not preserve all values of all humans, but would have some degree of alignment with human values, since humans are to some degree similar to each other.
There are multiple problems with this. One is ontology. Humans parse the world into a set of entities, properties, and so on, and human values can be about desired configurations of these entities and so on. Humans are sometimes wrong about which concepts are predictive. An AI would use different concepts both due to this wrongness and due to its different mind architecture (although, LLM-type training on human data could lead to more concordance). This makes it hard to specify what target the AI should pursue in its own world model to correspond to pursuing the human's goal in the human's world model. See ontology identification.
A related problem is indexicality. Suppose Alice has a natural value of having a good quantity of high-quality food in her stomach. Bob does not naturally have the value of having a good quantity of food in Alice's stomach. To satisfy Alice's value, he would have to "relativize" Alice's indexical goal and take actions such as giving Alice high-quality food, which are different from the actions he would take to fill his own stomach. This would involve theory of mind and have associated difficulties, especially as the goals become more dependent on the details of the other agent's mind, as in aesthetics.
To have an AI have the goal of satisfying a human's values, some sort of similar translation of goal referents would be necessary. But the theory of this has not been worked out in detail. I think something analogous to the theory of relativity, which translates physical quantities such as position and velocity across reference frames, would be necessary, but in a more general way that includes semantic references such as to the amount of food in one's stomach, or to one's aesthetics. Such a "semantic theory of relativity" seems hard to work out philosophically. (See Brian Cantwell Smith's "On the Origin of Objects" and his follow-up "The Promise of Artificial Intelligence" for some discussion of semantic indexicality.)
There are some paths forward
The picture I have laid out is not utterly hopeless. There are still some approaches that might achieve human value satisfaction.
Human enhancement is one approach. Humans with tools tend to satisfy human values better than humans without tools (although, some tools such as nuclear weapons tend to lead to bad social equilibria). Human genetic enhancement might cause some "value drift" (divergences from the values of current humans), but would also cause capability gains, and the trade-off could easily be worth it. Brain uploads, although very difficult, would enhance human capabilities while basically preserving human values, assuming the upload is high-fidelity. At some capability level, agents would tend to "solve alignment" and plan to have their values optimized in a stable manner. Yudkowsky himself believes that default unaligned AGI would solve the alignment problem (with their values) in order to stably optimize their values, as he explains in the Hotz debate. So increasing capabilities of human-like agents while reducing value drift along the way (and perhaps also reversing some past value-drift due to the structure of civilization and so on) seems like a good overall approach.
Some of these approaches could be combined. Psychology and neuroscience could lead to a better understanding of the human mind architecture, including the human utility function and optimization methods. This could allow for creating simulated humans who have very similar values to current humans but are much more capable at optimization.
Locally to human minds in mind design space, capabilities are correlated with alignment. This is because human values are functional for evolutionary fitness. Value divergences such as pain asymbolia tend to reduce fitness and overall problem-solving capability. There are far-away designs in mind space that are more fit while unaligned, but this is less of a problem locally. Therefore, finding mind designs close to the human mind design seems promising for increasing capabilities while preserving alignment.
Paul Christiano's methods involve solving problems through machine learning systems predicting humans, which has some similarities to the simulated-brain-enhancement proposal while having its own specific problems to do with machine learning generalization and so on. The main difference between these proposals is the degree to which the human mind is understood as a system of optimizing components versus as a black box with some behaviors.
There may be some ways of creating simulated humans that improve effectiveness by reducing "damage" or "corruption", e.g. accidental defects in brain formation. "Moral Reality Check" explored one version of this, where an AI system acts on a more purified set of moral principles than humans do. There are other plausible scenarios such as AI economic agents that obey some laws while having fewer entropic deviations from this behavior (due to mental disorders and so on). I think this technology is overall more likely than brain emulations to be economically relevant, and might produce broadly similar scenarios to those in The Age of Em; technologically, high-fidelity brain emulations seem "overpowered" in terms of technological difficulty compared with purified, entropy-reduced/regularized economic agents. There are, of course, possible misalignment issues with subtracting value-relevant damage/corruption from humans.
Enhancing humans does not as much require creating a "semantic theory of relativity", because the agents doing the optimization would be basically human in mind structure. They may themselves be moral patients such that their indexical optimization of their own goals would constitute some human-value-having agent having their values satisfied. Altruism on the part of current humans or enhanced humans would decrease the level of value divergence.
Conclusion
This is my overall picture of AI alignment for highly capable AGI systems (of which I don't think current ML systems or foreseeable scaled-up versions of them are examples). This picture is inspired by thinkers such as Eliezer Yudkowsky and Paul Christiano, and I have in some cases focused on similar assumptions to Yudkowsky's, but I have attempted to explicate my own model of alignment, why it is difficult, and what paths forward there might be. I don't have particular conclusions in this post about timelines or policy; this is more of a background model of AI alignment.