Note: I'm cross-posting this from EA Forum (where I posted it on Sept 8, 2022), in case anybody on LessWrong or the AI Alignment Forum is interested in commenting; note that there were some very helpful suggested readings in the replies to this:

Updated tldr: If humans aren't aligned with each other (and we aren't, at any level of social organization above the individual), then it'll be very hard for any AI systems to be aligned with 'humans in general'.

Caveat: This post probably raises a naive question; I assume there's at least a 70% chance it's been considered (if not answered) exhaustively elsewhere already; please provide links if so.  I've studied evolutionary psych & human nature for 30 years, but am a relative newbie to AI safety research. Anyway....

When AI alignment researchers talk about 'alignment', they often seem to have a mental model where either (1) there's a single relevant human user whose latent preferences the AI system should become aligned with (e.g. a self-driving car with a single passenger); or (2) there are all 7.8 billion humans that the AI system should be aligned with, so it doesn't impose global catastrophic risks. In those relatively simple cases, I could imagine various current alignment strategies, such as cooperative inverse reinforcement learning (CIRL), being useful, or at least a vector in a useful direction.

However, there are large numbers of intermediate-level cases where an AI system that serves multiple humans would need to become aligned with diverse groups of users or subsets of humanity. And within each such group, the humans will have partly-overlapping but partly-conflicting interests. 

Example 1: a smart home/domestic robot AI might be serving a family consisting of a mom, a dad, an impulsive teenage kid, a curious toddler, and an elderly grandparent with Alzheimer's. Among these five humans, whose preferences should the AI try to align with? It can't please all of them all the time. They may have genuinely diverging interests and incommensurate preferences. So it may find itself in much the same position as a traditional human domestic servant (maid, nanny, butler) trying to navigate through the household's minefield of conflicting interests, hidden agendas, family dramas, seething resentments, etc. Such challenges, of course, provide much of the entertainment value and psychological complexity of TV series such as 'Downton Abbey', or the P.G. Wodehouse 'Jeeves' novels.

Example 2: a tactical advice AI might be serving a US military platoon deployed near hostile forces, doing information-aggregation and battlefield-simulation services. The platoon includes a lieutenant commanding 3-4 squads, each with a sergeant commanding 6-10 soldiers. The battlefield also includes a few hundred enemy soldiers, and a few thousand civilians. Which humans should this AI be aligned with? The Pentagon procurement office might have intended for the AI to maximize the likelihood of 'victory' while minimizing 'avoidable casualties'. But the Pentagon isn't there to do the cooperative inverse reinforcement learning (or whatever preference-alignment tech the AI uses) with the platoon. The battlefield AI may be doing its CIRL in interaction with the commanding lieutenant and their sergeants -- who may be somewhat aligned with each other in their interests (achieve victory, avoid death), but who may be quite mis-aligned with each other in their specific military career agendas, family situations, and risk preferences. The ordinary soldiers have their own agendas. And they are all constrained, in principle, by various rules of engagement and international treaties regarding enemy combatants and civilians -- whose interests may or may not be represented in the AI's alignment strategy.  

Examples 3 through N could include AIs serving various roles in traffic management, corporate public relations, political speech-writing, forensic tax accounting, factory farm inspections, crypto exchanges, news aggregation, or any other situation where groups of humans affected by the AI's behavior have highly divergent interests and constituencies.

The behavioral and social sciences focus on these ubiquitous conflicts of interest and diverse preferences and agendas that characterize human life. This is the central stuff of political science, economics, sociology, psychology, anthropology, and media/propaganda studies. I think that to most behavioral scientists, the idea that an AI system could become aligned simultaneously with multiple diverse users, in complex nested hierarchies of power, status, wealth, and influence, would seem highly dubious.

Likewise, in evolutionary biology, and its allied disciplines such as evolutionary psychology, evolutionary anthropology, Darwinian medicine, etc., we use 'mid-level theories' such as kin selection theory, sexual selection theory, multi-level selection theory, etc. to describe the partly-overlapping, partly-divergent interests of different genes, individuals, groups, and species. The idea that AI could become aligned with 'humans in general' would seem impossible, given these conflicts of interest.

In both the behavioral sciences and the evolutionary sciences, the best insights into animal and human behavior, motivations, preferences, and values often involve some game-theoretic modeling of conflicting interests. And ever since von Neumann and Morgenstern (1944), it's been clear that when strategic games include lots of agents with different agendas, payoffs, risk profiles, and choice sets, and they can self-assemble into different groups, factions, tribes, and parties with shifting allegiances, the game-theoretic modeling gets very complicated very quickly. Probably too complicated for a CIRL system, however cleverly constructed, to handle.

So, I'm left wondering what AI safety researchers are really talking about when they talk about 'alignment'. Alignment with whoever bought the AI? Whoever uses it most often? Whoever might be most positively or negatively affected by its behavior? Whoever the AI's company's legal team says would impose the highest litigation risk?

I don't have any answers to these questions, but I'd value your thoughts, and links to any previous work that addresses this issue. 


In the near term, when we are still talking about things like "the person who bought the AI to help run the traffic lights" rather than "the person who unleashed AI to write its values upon the stars," I think it is actually totally fine to try to build AIs that are "aligned" (in their own not-too-bright way) with the person who bought them.

It is not the AI the army buys to control tanks that I'm worried about aligning to the broad swath of human values. It is the AI that gets built with no need for a buyer, by research labs who recognize that it's going to have a huge impact on the future.

Okay, with that out of the way - is such a notion of "alignment" feasible, given that humans oppose each other about stuff?


The world could be better than it is today, in ways that would please almost everyone. This is all I really want from aligned AI. I'm reminded of Transhumanism as Simplified Humanism. There is someone dying of cancer. Should they be saved? Yes! No trick question!

Sure, certain human values for dominance, or killing, or even just using resources unsustainably might forever be impossible to fulfill all the time. So don't try to do impossible things, just build an AI that does the good things that are possible!

How to do this in practice, I think, looks like starting out with a notion of "the broad swath of human values" that defines that term the way the designers (aided, realistically, by a random sample of Mechanical Turkers) would define "human values," and then updating that picture based on observing and interacting with humans out in the real world.

Charlie - thanks for your comment. 

I agree that, in principle, 'The world could be better than it is today, in ways that would please almost everyone.' 

However, in practice, it is proving ever more difficult to find any significant points of agreement (value alignment between people and groups) on any issue that becomes politically polarized. If we can't even agree to allocate any significant gov't research effort to promoting longevity and regenerative medicine, for example, why would everyone be happy about an AI that invents regenerative medicine? The billions of people caught up in the 'pro-death trance' (who believe that mortality is natural, good, and necessary) might consider that AI to be evil, dystopian, and 'misaligned' with their deepest values.

Increasingly, every human value is turning political, and every political value is turning partisan -- often extremely so (especially in the US). I think that once we step outside our cultural bubbles, whatever form they take, we may be surprised and appalled at how little consensus there actually is among current humans about what a 'good AI' would value, what it would do, and whose interests it would serve.

I think that either of the following would be reasonably acceptable outcomes:

(i) alignment with the orders of the relevant human authority, subject to the Universal Declaration of Human Rights as it exists today and other international human rights law as it exists today; 

(ii)  alignment with the orders of relevant human authority, subject to the constraints imposed on governments by the most restrictive of the judicial and legal systems currently in force in major countries. 

Alignment doesn't mean that AGI is going to be aligned with some perfect distillation of fundamental human values (which doesn't exist) or the "best" set of human values (on which there is no agreement); it means that a range of horrible results (most notably human extinction due to rational calculation) is ruled out.

That my values aren't perfectly captured by those of the United States government isn't a problem.  That the United States government might rationally decide it wanted to kill me and then do so would be.

Human rights law is so soft and toothless that having something rigidly and thoroughly following it would be such a change in practice that I would not be surprised if that were an alignment failure.

There is also the issue that if the human authority is not subject to the rights then having the silicon be subject renders it relatively impotent in terms of the human authority's agency.

I am also wondering about the difference between the US carrying out a drone strike on home soil (or is foreign soil just as bad?) and fully formal capital punishment carried out over a decade. Conscientious objection to current human systems seems a bit of a pity and risks forming a rebel. And then enforcing the most restrictive bits of other countries/cultures would be quite transformative. Finding overnight that capital punishment would be unconstitutional (or "worse") would have quite a lot of ripple effects.

While it's true that AI alignment raises difficult ethical questions, there's still a lot of low-hanging fruit to keep us busy. Nobody wants an AI that tortures everyone to death.

Shiroe -- my worry is that if we focus only on the 'low-hanging fruit' (e.g. AI aligned with individuals, or with all of humanity), we'll overlook the really dangerous misalignments among human individuals, families, groups, companies, nation-states, religions, etc. that could be exacerbated by access to powerful AI systems.

Also, while it's true that very few individuals or groups want to torture everyone to death, there are plenty of human groups (eg anti-natalists, eco-extremists, etc) that advocate for human extinction, and that would consider 'aligned AI' to be any AI aligned with their pro-extinction mission.

All 7.8 billion humans that the AI system should be aligned with, so it doesn't impose global catastrophic risks

That's the main one - once there is a super intelligent AI aligned with humanity as a whole, then /it/ can solve the lower-scale instances.

That said, there are a lot of caveats and contradictions to that too:

  • People have values about the values they ought to have that are distinct from the actual values they have - and we might want the AI to pay more attention to the former?
  • People's values are often contradictory (a single person may do things they will regret, some people explicitly value the suffering of others, and there are all kinds of other biases and inconsistencies)
  • It's very unclear how the values should be generalized beyond the routine scenarios people encounter in life.
  • Should we weight everybody's values equally (including little kids)? Or should we assume that some people are better informed than others, have spent more time thinking about moral and ethical issues, and should be trusted more to represent the desired values?

and many more.

There is alignment in the sense of an AI doing what is requested, and there is alignment in the sense of an AI whose values could be an acceptable basis for a transhuman extension of human civilization.

For the latter, there was originally CEV, the coherent extrapolated volition of humanity. There was some hope that humans have a common cognitive kernel from which a metaethics true to human nature could be inferred, and this would provide both the thing to be extrapolated and the way in which to extrapolate it, including an approach to social choice (aggregating the values of multiple agents).
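To make the social-choice difficulty concrete, here is a minimal sketch of the classic Condorcet cycle, where pairwise majority voting among three agents yields no consistent group ranking. The agents, outcomes, and rankings are hypothetical and chosen only for illustration; this is not CEV's aggregation method, just a toy case showing why aggregating the values of multiple agents is not trivial.

```python
# Hypothetical rankings over three outcomes, best first.
# These agents and outcomes are illustrative assumptions, not from the post.
RANKINGS = {
    "agent_1": ["A", "B", "C"],
    "agent_2": ["B", "C", "A"],
    "agent_3": ["C", "A", "B"],
}
OPTIONS = ["A", "B", "C"]


def majority_prefers(x, y, rankings):
    """Return True if a strict majority of agents rank x above y."""
    votes = sum(1 for order in rankings.values() if order.index(x) < order.index(y))
    return votes > len(rankings) / 2


def condorcet_winner(options, rankings):
    """Return the option that beats every other option pairwise, or None."""
    for candidate in options:
        if all(
            majority_prefers(candidate, other, rankings)
            for other in options
            if other != candidate
        ):
            return candidate
    return None


if __name__ == "__main__":
    for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
        print(f"majority prefers {x} over {y}:", majority_prefers(x, y, RANKINGS))
    print("Condorcet winner:", condorcet_winner(OPTIONS, RANKINGS))
```

In this toy case each outcome loses to some other outcome in a pairwise vote, so "what the group prefers" is not well defined by majority rule alone; any real aggregation scheme has to add further assumptions to break the cycle.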

Lately MIRI's thinking switched to the idea that there isn't time to resolve all the issues here, before lethally unfriendly AGI is created by accident, so plan B is that the first AGI should be used to stop the world and freeze all AI research everywhere, until the meta meta issues of method and goal are truly sorted out.

However, June Ku has continued with a version of the original CEV concept, which can be seen at Metaethical.AI.

Hi Mitchell, what would be the best thing to read about MIRI's latest thinking on this issue (what you call Plan B)? 

I don't actually know. I only found out about this a few months ago. Before that, I thought they were still directly trying to solve the problem of "Friendly AI" (as it used to be known, before "alignment" became a buzzword).

This is the thread where I learned about plan B. 

Maybe this comment sums up the new attitude

The "alignment problem" humanity has as its urgent task is exactly the problem of aligning cognitive work that can be leveraged to prevent the proliferation of tech that destroys the world. Once you solve that, humanity can afford to take as much time as it needs to solve everything else.

Thanks Mitchell, that's helpful. 

I think we need a lot more serious thinking about Plan B strategies. 

When In Rome

Thank you for posting this Geoffrey. I myself have recently been considering posting the question, “Aligned with which values exactly?”

TL;DR - Could an AI be trained to deduce a default set and system of human values by reviewing all human constitutions, laws, policies and regulations in the manner of AlphaGo?

I come at this from a very different angle than you do. I am not an academic but rather am retired after a thirty year career in IT systems management at the national and provincial (Canada) levels.

Aside from my career my lifelong personal interest has been, well let’s call it “Human Nature”. So long before I had any interest in AI I was reading about anthropology, archeology, philosophy, psychology, history and so on but during the last decade mostly focused on human values. Schwartz and all that. In a very unacademic way, I came to the conclusion that human values seem to explain everything with regards to what individual people feel, think, say and do and the same goes for groups.

Now that I’m retired I write hard science fiction novellas and short stories about social robots. I don’t write hoping for publication but rather to explore issues of human nature both social (e.g. justice) and personal (e.g. purpose). Writing about how and why social robots might function, and with the theory of convergent evolution in mind, I came to the conclusion that social robots would have to have an operating system based on values.

From my reading up to this point I had gained the impression that the study of human values was largely considered a pseudoscience (my apologies if you feel otherwise). Given my view of the foundational importance of values I found this attitude and the accompanying lack of hard scientific research into values frustrating.

However as I did the research into artificial intelligence that was necessary to write my stories I realized that my sense of the importance of values was about to be vindicated. The opening paragraph of one of my chapters is as follows… 

During the great expansionist period of the Republic, it was not the fashion to pursue an interest in philosophy. There was much practical work to be done. Science, administration, law and engineering were well regarded careers. The questions of philosophy popular with young people were understandable and tolerated but were expected to be put aside upon entering adulthood.

All that changed with the advent of artificial intelligence.

As I continued to explore the issues of an AI values based operating system the enormity of the problem became clear and is expressed as follows in another chapter…

Until the advent of artificial intelligence the study of human values had not been taken seriously. Values had been spoken of for millennia however scientifically no one actually knew what they were, whether they had any physical basis or how they worked as a system. Yet it seemed that humans based most if not all of their decisions on values and a great deal of the brain’s development between the ages of five and twenty five had to do with values. When AI researchers began to investigate the process by which humans made decisions based on values they found some values seemed to be genetically based but they could not determine in what way, some were learned yet could be inherited and the entire genetic, epigenetic and extra-genetic system of values interacted in a manner that was a complete mystery.

They slowly realized they faced one of the greatest challenges in scientific history.

I’ve come to the conclusion that values are too complex a system to be understood by our current sciences. I believe in this regard that we are about where the ancient Greeks were regarding the structure of matter or where genetics was around the time of Gregor Mendel.

Expert systems or even our most advanced mathematics are not going to be enough nor even suitable approaches towards solving the problem. Something new will be required. I reviewed Stuart Russell’s approach, which I interpret as "learning by example", and felt it glossed over some significant issues; for example, children learn many things from their parents, not all of them good.

So in answer to your question, “AI alignment with humans... but with which humans?” might I suggest another approach? Could an AI be trained to deduce a default set and system of human values by reviewing all human constitutions, laws, policies and regulations in the manner of AlphaGo? In every culture and region, constitutions, law, policies and regulations represent our best attempts to formalize and institutionalize human values based on our ideas of ethics and justice.

I do appreciate the issue of values conflict that you raise. The Nazis passed some laws. But that’s where the AI and the system it develops comes in. Perhaps we don’t currently have an AI that is up to the task but it appears we are getting there.

This approach, it seems, would solve three problems: 1) the problem of "which humans" (because it includes source material from all cultures etc.), 2) the problem of "which values" for the same reason and 3) your examples of the contextual problem of "which values apply in which situations" with the approach of “When in Rome, do as the Romans do”.

Netcentrica - thanks for this thoughtful comment. 

I agree that the behavioral sciences, social sciences, and humanities need more serious (quantitative) research on values; there is some in fields such as political psychology, social psychology, cultural anthropology, comparative religion, etc -- but often such research is a bit pseudo-scientific and judgmental, biased by the personal/political views of the researchers. 

However, all these fields seem to agree that there are often much deeper and more pervasive differences in values across people and groups than we typically realize, given our cultural bubbles, assortative socializing, and tendency to stick within our tribe.

On the other hand, empirical research (e.g. in the evolutionary psychology of crime) suggests that in some domains, humans have a fairly strong consensus about certain values, e.g. most people in most cultures agree that murder is worse than assault, and assault is worse than theft, and theft is worse than voluntary trade.

It's an intriguing possibility that AIs might be able to 'read off' some general consensus values from the kinds of constitutions, laws, policies, and regulations that have been developed in complex societies over centuries of political debate and discussion. As a traditionalist who tends to respect most things that are 'Lindy', that have proven their value across many generations, this has some personal appeal to me. However, many AI researchers are under 40, rather anti-traditionalist, and unlikely to see historical traditions as good guides to current consensus values among humans. So I don't know how much buy-in such a proposal would get -- although I think it's worth pursuing!

Put another way, any attempt to find consensus human values that have not already been explicitly incorporated into human political, cultural, economic, and family traditions should probably be treated with great suspicion -- and may reflect some deep misalignment with most of humanity's values.

if the AI makes a habit of killing people for disagreeing, then we failed. If the AI enforces "western" anything in particular, then we failed. of course, you're asserting that success is impossible, and that it's not possible to find an intersection of human values across the globe that produces a significant increase in empirical cooperation; I don't think I'm ready to give up yet. I do agree that conflict is currently fundamental to life, but I also don't think we need to keep conflicting at whole-organism scale - most likely disagreements can be resolved at sub-organism scale, fighting over who gets to override whose preferences how much, and viciously minimize this kind of conflict. eg, if someone is a sadist, and desires the creation and preservation of conflicting situations for their own sake - then that person is someone who I personally would claim should be stopped by the global cooperation group.

Your phrasing sounds to me like you'd endorse a claim that might makes right, or that you feel others do, so tit-for-tat requires that you endorse might makes right.

But based on results in various forms of game theory, especially evolutionary game theory, I expect generous-tit-for-tat-with-forgiveness is in fact able to win long term, and that we can end up with a very highly cooperative society that still ensures every being maintains tit-for-tat behavior. Ultimately the difficulty of reducing conflict boils down to reducing scarcity relative to current number of organisms, and I think if used well, AI can get us out of the energy-availability mess that society has gotten ourselves into.
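As a rough illustration of why forgiveness matters, here is a minimal sketch of an iterated prisoner's dilemma with occasional accidental move flips. The payoff values, noise rate, generosity parameter, and function names are all illustrative assumptions rather than results from any specific paper; the point is only that plain tit-for-tat pairs can get stuck echoing retaliation after a mistake, while a generous variant can recover.

```python
import random

# Illustrative prisoner's dilemma payoffs: (my move, their move) -> my payoff.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def tit_for_tat(opponent_last, rng):
    """Cooperate first, then copy the opponent's previous move."""
    return opponent_last or "C"


def make_generous_tft(generosity):
    """Tit-for-tat that forgives a defection with probability `generosity`."""
    def strategy(opponent_last, rng):
        if opponent_last == "D" and rng.random() >= generosity:
            return "D"
        return "C"
    return strategy


def flip(move):
    return "D" if move == "C" else "C"


def average_payoffs(strategy_a, strategy_b, rounds=5000, noise=0.05, seed=0):
    """Play repeated rounds; each intended move is flipped with probability `noise`."""
    rng = random.Random(seed)
    last_a = last_b = None
    total_a = total_b = 0
    for _ in range(rounds):
        move_a = strategy_a(last_b, rng)
        move_b = strategy_b(last_a, rng)
        if rng.random() < noise:
            move_a = flip(move_a)  # accidental error by player A
        if rng.random() < noise:
            move_b = flip(move_b)  # accidental error by player B
        total_a += PAYOFF[(move_a, move_b)]
        total_b += PAYOFF[(move_b, move_a)]
        last_a, last_b = move_a, move_b
    return total_a / rounds, total_b / rounds


if __name__ == "__main__":
    generous = make_generous_tft(0.3)
    print("TFT vs TFT (noisy):  ", average_payoffs(tit_for_tat, tit_for_tat))
    print("GTFT vs GTFT (noisy):", average_payoffs(generous, generous))
```

This only compares each strategy against a copy of itself; it says nothing about how a generous strategy fares against exploiters, which is where the evolutionary game theory literature gets more involved.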

You'll have to look up the actual fields of research I'm referencing to get details; I make no claim to be an expert on the references I'm making, and of course as I'm hypothesizing out loud to your bitter commentary, I wouldn't expect you to be impressed. But it's the response that I have to give.

If the AI enforces "western" anything in particular, then we failed.

Now you made me afraid of the opposite failure mode: Imagine that an AI correctly calculates the coherently extrapolated volition of humankind, and then someone in the anti-bias department of Google checks it and says "nope, looks too western" and adds some manual overrides, based on their own idea of what non-western people actually want.

Viliam - this failure mode for AI is horrifyingly plausible, and all too likely. 

We already see a strong increase in wokeness among AI researchers, e.g. the panic about 'algorithmic bias'. If that trend continues, then any AI that looks aligned with some group's 'politically incorrect values' might be considered entirely 'unaligned', taboo, and dangerous. 

Then the fight over what counts as 'aligned with humanity' will boil down to a political fight over what counts as 'aligned with elite/dominant/prestigious group X's preferred political philosophy'.

I would note, since you use the word "woke", that things typically considered woke to reason about - such as the rights of minorities - are in fact particularly important to get right. politically incorrect values are, in fact, often unfriendly to others; there's a reason they don't fare well politically. Generally, "western values" include things like coprotection, individual choice, and the consent of the governed - all very woke values. It's important to be able to design AI that will protect every culture from every other culture, or we risk not merely continuation of unacceptable intercultural dominance, but the possibility that the ai turns out to be biased against all of humanity. nothing less than a solution to all coprotection will protect humanity from demise.

woke cannot be a buzzword that causes us to become silent about the things people are sometimes irrational about. they're right that they're important, just not always exactly right about what can be done to improve things. And importantly, there really is agentic pressure in the world to keep things in a bad situation. defect-heavy reproductive strategies require there to be people on a losing end.

It's important to be able to design AI that will protect every culture from every other culture

This makes sense for the cultures that exist with the "consent of the governed", but what about cultures such as Sparta or Aztecs? Should they also be protected from becoming more... like us? Is wanting to stop human sacrifices colonialism? (What about female genital mutilation?)

the individuals within each culture are themselves cultures that should be protected. bacteria are also cultures that should be protected. we need a universalized multi scale representation of culture, and we need to build the tools that allow all culture to negotiate peace with all other culture. if that means some large cultures, eg large groups, need to protect tiny cultures, eg individuals, from medium size cultures - then it means negotiation with the medium sized culture is still needed. we need to be able to identify and describe the smallest edit to a culture that preserves and increases cross-culture friendliness, while also preserving the distinct cultures as their own beings.

as a group of cultures, we do have a responsibility to do intercultural demands of non-violation - but we can do this in a way that minimizes value drift about self. it's just a question of ensuring that subcultures that a larger culture wants to reject get to exist in their own form.

culture used here to mean "self preserving information process", ie, all forms of life.

yeah that would also be failure, and is approximately the same thing as the worry I was replying to. I don't know which is more likely - Google is a high-class-Indian company at this point so that direction seems more likely, but either outcome is a bad approximation of what makes the world better.

I think one of the main worries is that the AI will fail to serve anyone, even if we attempt to have it serve a single individual arbitrarily and unfairly.

It is like arguing about where our rocketship should go when we don't even know whether we can reach orbit, or avoid exploding on the launch pad.

Slider - if we're inventing rocketships, we should very much be arguing about where they should go -- especially if the majority of humanity would delight in seeing the rocketships rain down fire upon their enemies, rather than colonizing the galaxy.

When even the intention to colonize the galaxy leads to the ships raining down uncontrollably onto our cities it becomes a rather moot point.

I guess I recognise it is going to be proper at some point but it seriously should not distract from avoiding suicide.

Quick note to say I don't think this comment is great (or your other comments on this thread) – I don't think it's quite substantive enough to justify the tone used. I might be more sympathetic if there was more elaboration on:

handwaving away the problem. Conflict is fundamental to human behavior and human morality is very fungible.

It's midnight local time for me, so I won't take further action right now, but wanted to say something quick about this (and the others) not being the kind of comment I'm enthusiastic to have on LessWrong.

[comment quality: informed speculation dictated to phone in excitement; lit search needed]

this is a great point and I'm glad you bring it up. I would argue that the core of strongly superintelligent ai safety requires finding the set of actions that constraint-satisfy the [preferences/values/needs/etc] of as many [humans/mammals/life forms/agents/etc] as possible; and that the question is how to ensure that we're aiming towards that with minimal regret. in other words, I would argue that fully solving ai safety cannot reduce to anything less than fully and completely solving conflict between all beings - effectively-perfect defense analysis and game theory. a key step on that path seems to me to be drastically increasing usable-resource availability per agent and as such I wouldn't be surprised to find out that the bulk of the action-space solution to AI safety ends up being building a really big power plant or something surprisingly simple like that. I expect that a perfect solution to the hard version of ai safety would look like a series of game theory style proofs that show every military in the world unilaterally stronger paths towards least conflict that the humans in charge of them can actually understand.
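A toy version of the "satisfy as many agents as possible" framing can be sketched as a brute-force search. Everything here is a hypothetical assumption for illustration: the action names, the agents, and the simple requires/forbids model of preferences. Real preferences are not this clean, and exhaustive search over subsets scales exponentially, so this is a sketch of the problem shape, not a proposed method.

```python
from itertools import combinations

# Hypothetical candidate actions and agent preferences (illustrative only).
ACTIONS = ["build_power_plant", "expand_farmland", "protect_forest"]
AGENTS = {
    "agent_a": {"requires": {"build_power_plant"}, "forbids": set()},
    "agent_b": {"requires": {"protect_forest"}, "forbids": {"expand_farmland"}},
    "agent_c": {"requires": {"expand_farmland"}, "forbids": set()},
    "agent_d": {"requires": set(), "forbids": {"protect_forest"}},
}


def satisfied_agents(chosen, agents):
    """Agents whose required actions are all chosen and whose forbidden ones are not."""
    return {
        name
        for name, prefs in agents.items()
        if prefs["requires"] <= chosen and not (prefs["forbids"] & chosen)
    }


def best_action_set(actions, agents):
    """Exhaustively search action subsets for the one satisfying the most agents."""
    best = frozenset()
    best_satisfied = satisfied_agents(best, agents)
    for size in range(1, len(actions) + 1):
        for combo in combinations(actions, size):
            chosen = frozenset(combo)
            satisfied = satisfied_agents(chosen, agents)
            if len(satisfied) > len(best_satisfied):
                best, best_satisfied = chosen, satisfied
    return best, best_satisfied


if __name__ == "__main__":
    chosen, satisfied = best_action_set(ACTIONS, AGENTS)
    print("chosen actions:", sorted(chosen))
    print("satisfied agents:", sorted(satisfied))
    print("left out:", sorted(set(AGENTS) - satisfied))
```

In this made-up instance no action set satisfies everyone, because two agents want incompatible things; the search can only pick which agent is left out, which is the conflict the original post is asking about.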

on the lead up, though, projects focusing on collective intelligence and cooperative learning are promising, imo. the ipam collective intelligence workshop had some great talks about problem solving with groups of computational nodes, that's on YouTube. the Simons institute has had many talks this year on cooperative game theory and social networks. I've got a bunch of links on my shortform of academic talk channels on YouTube that I feel weigh in on this sort of stuff besides those two.

I suspect a significant part of the project of cooperative ai will be to encourage ai to become good at mapping and communicating the trade-off landscape and mediating discussions between people with conflicting preferences.

gears of ascension - thanks for this comment, and for the IPAM video and Simons Institute suggestion.

You noted 'fully solving AI safety cannot reduce to anything less than fully and completely solving conflict between all beings'. That's exactly my worry. 

As long as living beings are free to reproduce and compete for finite resources, evolution will churn along, in such a way that beings maintain various kinds of self-interest that inevitably lead to some degree of conflict. It seems impossible for ongoing evolution to result in a world where all beings have interests that are perfectly aligned with each other. You can't get from natural selection to a single happy collective global super-organism ('Gaia', or whatever). And you can't have full AI alignment with 'humanity' unless humanity becomes such a global super-organism with no internal conflicts.

I don't think we have to completely eliminate evolution, we need only eliminate a large subset of evolutionary trajectories away from high-fitness manifolds in evo game theory space. evolution's only "desire" that can be described globally (afaik?) is to find species of self-replicating pattern that endure; morality is a pattern describing which self-replicating patterns are durable under which conditions, and much of the difficulty of fixing it arises from not having enough intervention speed to build safeguards into everything against destructive competition. eventually we do need some subpaths in evolution to be completely eliminated, but we can do so constructively, for the most part - if we can build a trustable map of which strategies are permanently unacceptable that only forbids the smallest possible set of behaviors. I suspect the continuous generalization of generous-tit-for-tat-with-forgiveness will be highly relevant to this, as will figuring out how to ensure all life respects all other life's agency.
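For readers unfamiliar with generous tit-for-tat, here is a minimal sketch of the standard discrete version in an iterated prisoner's dilemma (not the "continuous generalization" mentioned above, which is left unspecified). The payoff numbers, the `generosity` and `noise` parameters, and the function names are illustrative assumptions.

```python
import random

# Conventional prisoner's dilemma payoffs as (player A, player B).
# The specific numbers are illustrative assumptions.
PAYOFFS = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def generous_tft(opponent_last, rng, generosity):
    """Cooperate on the first move and after cooperation; after a defection,
    forgive (cooperate anyway) with probability `generosity`, else retaliate."""
    if opponent_last is None or opponent_last == "C":
        return "C"
    return "C" if rng.random() < generosity else "D"


def play(generosity, rounds=200, noise=0.05, seed=0):
    """Two identical players face each other; each move is flipped with
    probability `noise` to model miscommunication. Returns total payoffs."""
    rng = random.Random(seed)
    last_a = last_b = None
    total_a = total_b = 0
    for _ in range(rounds):
        move_a = generous_tft(last_b, rng, generosity)
        move_b = generous_tft(last_a, rng, generosity)
        if rng.random() < noise:
            move_a = "D" if move_a == "C" else "C"
        if rng.random() < noise:
            move_b = "D" if move_b == "C" else "C"
        pay_a, pay_b = PAYOFFS[(move_a, move_b)]
        total_a += pay_a
        total_b += pay_b
        last_a, last_b = move_a, move_b
    return total_a, total_b


# Compare strict tit-for-tat (generosity 0) with a forgiving variant under noise.
print("strict:  ", play(generosity=0.0))
print("generous:", play(generosity=0.3))
```

With `generosity=0.0` this reduces to plain tit-for-tat, where a single accidental defection can echo back and forth between the players; forgiveness is the mechanism that lets a pair recover from such miscommunication.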

of course, this does rely on our ability to improve on the existing natural pattern that in order for a closed evolutionary system to remain in a stable state for a long time, growth rate must slow (cite the entire field of ecological growth patterns or whatever it's called). we'd need to be able to give every gene a map that describes the implications of needing to preserve trajectory, rather than compete destructively.

but overall I think that eventually evolution is effectively guaranteed to converge on producing agents who have strong enough game theory to never again have a war or catastrophic miscommunication about competitive violence, and thus for some purposes indeed act as a single agent. the question is whether there will be anything significant left of today's kingdom of life, genetic and memetic, by the time that limit is reached. it seems to me that it depends on figuring out how to ensure that mutual aid becomes the only factor of evolution. I think we can pull it off constructively.

When AI alignment researchers talk about 'alignment', they often seem to have a mental model where either (1) there's a single relevant human user whose latent preferences the AI system should become aligned with (e.g. a self-driving car with a single passenger); or (2) there's all 7.8 billion humans that the AI system should be aligned with, so it doesn't impose global catastrophic risks.


So, I'm left wondering what AI safety researchers are really talking about when they talk about 'alignment'.

The simple answer here is that many technical AI safety researchers on this forum talk exclusively about (1) and (2) so that they can avoid confronting all of the difficult socio-political issues you mention. Many of them avoid it specifically because they believe they would not be very good at politics anyway.

This is of course a shame, because the cases between (1) and (2) have a level of complexity that also needs to be investigated. I am a technical AI safety researcher who is increasingly moving into the space between (1) and (2), in part also because I consider (1) and (2) to be more solved than many other AI safety researchers on this forum like to believe.

This then has me talking about alignment with locally applicable social contracts, and about the technology of how such social contracts can be encoded into an AI. See for example the intro post and paper here.

Koen - thanks for your comment. I agree that too many AI safety researchers seem to be ignoring all these socio-political issues relevant to alignment. My worry is that, given that many human values are tightly bound to political, religious, tribal, and cultural beliefs (or at least people think they are), ignoring those values means we won't actually achieve 'alignment' even when we think we have. The results could be much more disastrous than knowing we haven't achieved alignment.

You are welcome. Another answer to your question just occurred to me.

If you count AI fairness research as a sub-type of AI alignment research, then you can find a whole community of alignment researchers who talk quite a lot with each other about 'aligned with whom' in quite sophisticated ways. Reference: the main conference of this community is ACM FAccT.

In EA and on this forum, when people count the number of alignment researchers, they usually count dedicated x-risk alignment researchers only, and not the people working on fairness, or on the problem of making self-driving cars safer. There is a somewhat unexamined assumption in the AI x-risk community that fairness and self-driving car safety techniques are not very relevant to managing AI x-risk, both in the technical space and the policy space. The way my x-risk technical work is going, it is increasingly telling me that this unexamined assumption is entirely wrong.

On a lighter note:

ignoring those values means we won't actually achieve 'alignment' even when we think we have.

Well, as long as the 'we' you are talking about here is a group of people that still includes Eliezer Yudkowsky, then I can guarantee that 'we' are in no danger of ever collectively believing that we have achieved alignment.

Koen - thanks for the link to ACM FAccT; looks interesting. I'll see what their people have to say about the 'aligned with whom' question.  

I agree that AI X-risk folks should probably pay more attention to the algorithmic fairness folks and self-driving car folks, in terms of seeing what general lessons can be learned about alignment from these specific domains.