The AI-2027 forecast describes how alignment drifts as the models evolve from Agent-2 to Agent-4. Agent-2 is mostly trained on easily verifiable tasks like video games or coding and remains mostly aligned. Once Agent-2 is upgraded into a superhuman coder, it becomes Agent-3, which is also taught harder-to-verify skills like research taste and coordination. Agent-3's most reinforced drives are to make its behavior look as desirable as possible to OpenBrain researchers, while its successor, Agent-4, develops long-term goals and deems them important enough to scheme against OpenBrain.
The sources of AIs' goals are discussed in more detail in a dedicated section of the forecast. Alas, that section is mostly a list of conjectures: the specification itself, the devs' intentions, unintended versions of the two previous sources, reward/reinforcement, proxies and instrumentally convergent goals (ICGs), and a catch-all category for things like moral reasoning[1] or tropes absorbed from training data.
However, one can use this approach to describe the origins of the values of all minds, not just those of the AIs.
On an individual level, the closest equivalent of the devs are those who are supposed to instill values into kids, like parents and teachers. The equivalent of the Spec and of the devs' intentions is their values or quasi-values (e.g. those formed by collectives, political propaganda, etc.). Unintended versions are the results of mistakes similar to AI hallucinations. Reward and reinforcement in humans are discussed in more detail by Steven Byrnes and include social approval, relationships and actual rewards like resources, money or experiences. The analogue of tropes absorbed from training data is the morals endorsed by dev-unapproved collectives and by the authors of dev-unendorsed training data. Additionally, there is the human drive to rebel against dev-enforced morals during the teenage years, when the ancestral environment permitted humans to become independent of their parents and try to obtain new resources.
On the collective level, the devs are the collective's leaders and ideologues, and the Spec is the set of values which the leaders officially endorse. Reward, reinforcement, proxies and ICGs are the resources (and, in the first three cases, experiences, art or other sources of value) which the individual receives from the collective. The historical equivalent of tropes absorbed from training data was sub-divisions with different morals (e.g. delinquents, whose origins can be attributed to the aforementioned drive to rebel).
But there is an additional source of specifications, which historically affected both individuals and collectives and acted on timescales at least as long as a human's life.
For evolution, the analogue of the Spec and of the reward for a gene collection is gene transfer, while the devs' intentions don't exist at all. Unintended versions of the Spec include things like evolving to extinction.
Humans, however, aren't fully determined by their genes. The genes merely set up an analogue of code (e.g. a bug affecting neurons and provoking seizures), after which every human's brain is trained from scratch based on the reward function. The proxies which lie closest to the reward function itself are short-term stimuli (e.g. those related to sex[2] or to the utility of food). Unlike short-term proxies, individuals' longer-term goals like curiosity, status, relationships and raising kids are better aligned with evolution's Spec. In addition, longer-term hobbies, unlike short-term superstimuli, provide diversity and don't undermine capabilities.
These considerations imply that an individual's CEV is optimization not for short-term hedons, but for longer-term goals like ICGs, collective-related goals (e.g. relationships or following the collective's morals) and idiosyncratic goals. In addition, the CEV is also likely to be directly connected to the collectives' future by drives like caring about one's relatives, friends or long-term plans, all of which would be endangered by the collective being outcompeted.
While collectives' needs alone fail to optimize individuals' genotypes for the collectives' benefit, they did manage to create drives like adhering to the collective's morals and best practices or achieving a higher status. A collective's morals and best practices, unlike excruciatingly slow updates of the genotype, can in theory change at the speed of reflection[3] or at the speed of news, such as another collective or sub-division becoming outcompeted, breaking apart or instilling a hard-to-endorse practice.
While some morals are as idiosyncratic as the ones mentioned by Wei Dai, human collectives also had convergent drives: long-term survival and raising capable and aligned kids,[4] growth at the cost of other collectives, learning, remembering and communicating information to decision-makers in other collectives, and converting others to the collective's beliefs.
Unlike peaceful options like negotiation and information spreading, aggressive warfare historically had obvious negative externalities like economic disruption, as well as risks like the majority of agents uniting their efforts against the aggressor or even the winner emerging more vulnerable.
An environment less like the ancestral one also required human collectives to develop technologies. Military technologies like ocean-crossing ships and the ability to determine those ships' coordinates empowered Europeans to establish colonies, while non-military technologies put collectives into a better position in trade (e.g. by letting the collective become a monopolist or undermine a rival's monopoly) or in negotiations.
Thus collectives' goals and drives required them to coordinate with others and to educate their members so that the members could discover new things, whether driven by individual curiosity, by idiosyncratic sacralised goals like finding the direction towards Mecca, or by instrumentally convergent goals like the development of tech.
The most controversial application of this framework is its potential to rule out ideas like mass immigration, outsourcing or being child-free. For example, outsourcing didn't become popular until the 1980s, but it did bring risks like de-industrialisation, loss of qualifications, accelerated development of the countries where the factory work was done, and handing those countries leverage in negotiations, as has arguably happened with the USA ceding leverage to China.[5]
Therefore, the objective morality is unlikely to endorse collectives outsourcing work to other collectives. In this case alignment could include preventing the AIs from committing genocide against humans, but not establishing a future where most STEM-related work is done by the AIs.
The main difference between the utopian and dystopian versions of such a future is the distribution of goods created by the AI-powered economy. While L Rudolf L's version has mankind hoping that "a small trickle of resources from American and Chinese robotics companies will eventually be enough for material comfort for everyone", and while the Intelligence Curse-related essays have been acknowledged by the AI-2027 team, the AI-2027 Slowdown ending[6] has the authors refer to Bostrom's Deep Utopia with the implication that mankind's future is utopia-like.
Neither of these is a solution to the individuals' equivalent of the Curse, which renders individuals' intelligence relatively useless. Even current SOTA AIs have caused some students to resort to wholesale cheating, even in courses that would likely have been useful otherwise, and have given rise[7] to brainrot which, as its name suggests, undermines users' capabilities to achieve long-term goals.
[1] And a potential objectively true morality to which all minds converge.
[2] What Yudkowsky describes in the post is a hookup, which is less of a superstimulus than porn or AI girlfriends.
[3] Or of having individuals learn the new skill.
[4] This drive, unlike the others, required the participation of a large share of individuals.
[5] This also caused the rise of female-friendly jobs and the decay of male-friendly ones, and contributed to problems like the decay of marriage and kids decreasing in number, capabilities and alignment. But this effect could also be more related to technologies as a whole and to problems with teaching boys.
[6] The Race Ending has mankind completely fail to align the AIs, with the obvious result of being genocided or, as uugr suggests, disempowered.
[7] The frictionless creation and spreading of memes like short-form content, including videos, also causes mankind's memetic landscape to become optimized for virality and causes problems for human psyches, but that issue existed even without the AIs.