This is the first in a series of three posts on the interlinkages between Memetics and AI Alignment. It was written as an output from the 2022 AI Safety Camp, for the ‘Impact of Memetics on Alignment’ team, coached by Daniel Kokotajlo and comprising Harriet Farlow, Nate Rush and Claudio Ceruti. Please read on to post 2 and post 3. We are not AI Safety experts so any and all feedback is greatly appreciated :) 


 

TL;DR: Memetics is the study of cultural transmission through memes (as genetics is the study of biological transmission through genes). What lessons could be transferred between Memetics and AI Alignment, and could AGI (Artificial General Intelligence) also develop ideas and culture - ‘memes’? Where AI Alignment postulates the existence of a base objective and a mesa objective, could a third objective - the memetic objective - also exist, creating the potential not just for inner and outer alignment problems, but for a third, memetic misalignment?

Note: Knowledge of AI Alignment and some knowledge of evolutionary theory assumed.  


 

The field of AI Safety deals with questions of existential risk stemming from AGI (Artificial General Intelligence). AI Alignment research, in particular, examines the extent to which intelligent, unbounded agents develop final goals that are misaligned with the intended goals specified by their human programmers. It also postulates that if these agents are able to acquire resources, evolve self-preservation, and converge on certain misaligned final goals, this presents existential harm to humanity.


 

While the field of Artificial Intelligence regularly borrows from domains concerned with organic intelligence (ie. humans), like neuroscience, psychology and evolutionary science, there is one discipline that has not yet been applied to AI - the study of memetics. Memetics is a theory that likens the cultural unit of transmission, the meme, to the genetic unit of transmission, the gene. Genetics underpins evolutionary theory by formalising the link between genetic fitness - defined through variation, reproduction and heritability - and human fitness (ie. the ability to live long enough to pass your genes to the next generation). The theory of memetics is still contentious. It was proposed by the preeminent evolutionary biologist Richard Dawkins, who defines a meme as an ‘idea, behaviour or style that spreads from person to person within a culture’. He also proposed the (generally accepted) theory that genes behave ‘selfishly’ - prioritising their own replication over the actual fitness of the organism they serve (although the organism’s survivability usually benefits, this is not necessarily the rule - see genetic diseases, evolutionary bottlenecks, and maladaptive/vestigial traits). 


 

Since then, memetics as a concept has inspired academic sub-fields, competing memetic ideologies, and funny pictures on the internet. However, a key aspect of the theory is often missed: if genetic replication is a primary goal under evolutionary theory, then the spread of cultural traits or ‘memes’ - memetic replication - could be considered a parallel goal, especially if a person’s beliefs benefit their ability to survive and thrive in their social group. If memes behave selfishly, and replicate in competition with their genetic counterparts, then any memetic transmission that hampers either genetic transmission or human fitness would constitute a misaligned goal. This sounds a lot to me like the outer/inner alignment problem between the base and mesa objectives in AI Alignment. If the base objective is genetic replication through evolution, and the mesa objective is human survival (and replication), there may exist a third objective - memetic replication. Therefore I argue that there are lessons that could be transferred between the fields of Memetics and AI Alignment.
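To make the parallel concrete, here is a minimal sketch (my own toy framing, not an established formalism) of the three objectives as three separate scoring functions applied to the same behaviour record. The Behaviour fields and the example numbers are invented purely for illustration; the only point is that the three scores are computed from different quantities, so they can pull in different directions.

```python
from dataclasses import dataclass

# Illustrative only: invented fields standing in for the three objectives
# discussed above.
@dataclass
class Behaviour:
    offspring: int          # genetic replication actually achieved (base)
    years_survived: int     # host survival / wellbeing (mesa)
    ideas_transmitted: int  # copies of the host's memes picked up by others (memetic)

def base_score(b: Behaviour) -> int:     # the 'genetic' objective
    return b.offspring

def mesa_score(b: Behaviour) -> int:     # the host-level objective
    return b.years_survived

def memetic_score(b: Behaviour) -> int:  # the meme-level objective
    return b.ideas_transmitted

# A hypothetical host who spreads many ideas and lives long but has no
# children: the memetic and mesa scores are high while the base score is
# zero - a three-way misalignment.
monk = Behaviour(offspring=0, years_survived=80, ideas_transmitted=500)
print(base_score(monk), mesa_score(monk), memetic_score(monk))  # 0 80 500
```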


 

Potential model and relationships between the optimisers


 


 

Could AGI have a third alignment problem - memes?

It is generally accepted within the AI Alignment community that within an AGI there may be an alignment problem between the base optimiser and a mesa optimiser. In this theory, the base optimiser optimises for the goal programmed into it by a human. However, in RL (reinforcement learning) settings, particularly the unbounded cases that AI Safety deals with, it is possible that the learned model may itself become an optimiser, such that the base optimiser, in the course of optimising, creates a mesa optimiser. For example, if a human programs an AGI to create paperclips, the goal to create as many paperclips as possible would be the base objective. The AGI may then work towards creating paperclip factories, harvesting and refining raw paperclip materials, and diverting other factories to paperclip-making. However, as it interacts with its environment, the base optimiser may create a new optimiser that also works towards the base goal but is misaligned with the human intentions behind that goal. For example, the mesa optimiser may optimise to destroy all humans, because they keep getting in the way of the goal to create paperclips. This misalignment may not be apparent until after training, if the mesa optimiser is pseudo-aligned but not robustly aligned.


 

While this is a pretty wild scenario, there is a lot of evidence to support the notion of base versus mesa optimisers among RL agents learning to complete tasks or play games. For example, agents designed to play games and optimised to achieve high scores may focus on collecting points instead of actually playing the game, and an agent trained to find the exit of a maze may instead become optimised to seek out the ‘exit’ sign, pursuing that sign even when it is not at the maze’s exit. This misalignment is possible because of some widely accepted principles, like orthogonality between intelligence and final goals (so that even an ‘intelligent’ AGI could pursue very misaligned goals) and instrumental convergence of final goals (towards things like self-preservation, goal integrity, technological perfection and resource acquisition).
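As a toy illustration of the maze example (my own sketch, not taken from any particular study), consider a gridworld where the ‘exit’ sign sits on the true exit cell during training, so a policy that simply walks toward the sign looks perfectly aligned; once the sign is moved at deployment, the same policy confidently walks to the wrong cell. The grid size, positions and greedy policy below are all invented for illustration.

```python
# Toy gridworld: positions are (row, col) on a 5x5 grid. The "policy" has
# learned the proxy objective "walk toward the exit sign", which happens to
# coincide with the true exit throughout training.

def walk_toward(start, target, steps=20):
    """Greedy proxy policy: step toward `target` one axis at a time."""
    r, c = start
    for _ in range(steps):
        if (r, c) == target:
            break
        if r != target[0]:
            r += 1 if target[0] > r else -1
        else:
            c += 1 if target[1] > c else -1
    return (r, c)

true_exit = (4, 4)

# Training: the sign sits on the exit, so proxy and true objectives agree.
sign_position_in_training = (4, 4)
print("training: reached exit?",
      walk_toward((0, 0), sign_position_in_training) == true_exit)    # True

# Deployment: the sign has moved, and the proxy objective now diverges.
sign_position_in_deployment = (0, 4)
print("deployment: reached exit?",
      walk_toward((0, 0), sign_position_in_deployment) == true_exit)  # False
```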


 

An inner alignment example often cited for humans as (sometimes) intelligent agents is that of our final evolutionary goals versus our day-to-day human desires. For example, humanity’s base objective from the perspective of evolution - to procreate and pass along genetic material - creates the mesa goal to pursue sex (even when procreation is not the goal). This does not necessarily create a misalignment, except from the view of the selfish gene: when humans have lots of sex but use contraception, they fulfil the mesa objective but not the base objective.

In comes memetics, where a third objective may arise not from a misaligned inner objective, but from the existence of some kind of replicator (the meme) that behaves selfishly and may replicate at the expense of the other objectives.


 

In the above example, let’s consider the existence of religion. Religion is a phenomenon unique to humans (as far as we can tell) which prescribes certain values and ideologies for its human hosts. Like other memes, it replicates by passing memetic material between humans through group interactions (church), speech, or written material. It may be transmitted vertically (through parent-child relationships) or horizontally (through friends, social groups, social media). We argue that in this circumstance, religion could exist as a third replicator that optimises for the spread of its own ideology among a population. It is more likely to replicate if it increases human fitness, for instance by increasing altruism and satisfying human desires for connection and community. However, there are cases where it may not increase human fitness in the strict sense of survivability and reproductive fitness, and may in fact come into conflict with the base and/or the mesa objective.

Many religions, for example, advocate for marriage and do not advocate for contraception. (There are also rules within many religions that require celibacy, but these usually apply to the religious order rather than to its lay followers.) Throughout history, when religion permeated society and its values were more strictly adhered to, this discouragement of contraception and encouragement of partnering would have increased reproduction and supported the evolutionary base objective. In modern times, however, the meme for non-religion is far more likely to spread, with only 47% of people in the US currently identifying as religious, down from 73% in 1937. Does this mean the non-religion meme is in conflict with the base objective? Maybe. There are many other possible examples of memetic ideologies that interact with and either support or counter our base and mesa optimisers, such as veganism, extreme sports, eunuchism and cultural rites.
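The ‘selfish meme’ dynamic described above can be caricatured in a few lines of simulation. This is entirely my own toy model with made-up parameters; it does not model any real religion or ideology. It only shows that a meme which reduces its carriers’ reproduction can still rise in frequency if horizontal transmission is fast enough, so the memetic objective (carrier frequency) and the base objective (per-capita births) can move in opposite directions.

```python
import random

# Toy cultural-transmission model (invented parameters, illustrative only).
# Carriers of the meme reproduce less, but convert non-carriers they meet,
# so the meme can spread even while per-capita births fall.

random.seed(0)

GENERATIONS = 30
TRANSMISSION_RATE = 0.30       # chance a carrier converts a non-carrier they meet
BIRTH_RATE_CARRIER = 0.05      # the meme's cost to the base (genetic) objective
BIRTH_RATE_NON_CARRIER = 0.10

population = [True] * 5 + [False] * 195   # meme state of each individual

for gen in range(GENERATIONS):
    # Horizontal transmission via random pairwise meetings.
    for _ in range(len(population)):
        a, b = random.sample(range(len(population)), 2)
        if population[a] and not population[b] and random.random() < TRANSMISSION_RATE:
            population[b] = True

    # Reproduction: offspring inherit the parent's meme state (vertical transmission).
    pre_birth = len(population)
    births = [has_meme for has_meme in population
              if random.random() < (BIRTH_RATE_CARRIER if has_meme else BIRTH_RATE_NON_CARRIER)]
    population.extend(births)

    print(f"gen {gen:2d}: meme frequency {sum(population) / len(population):.2f}, "
          f"per-capita births {len(births) / pre_birth:.3f}")
```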


 

This third objective may wax and wane in importance over time

While the existence of a third ‘memetic’ replicator may be less obvious in AGI now, we can look to some attributes of current AI capability that make it possible, and to the likely future of AGI capability to see why I believe it would become more likely over time.

Other sub-domains of AI have explored whether there may be some sort of equivalent of human culture in AI, and they generally agree that it is possible. It is well established that agents ingest and replicate the biases that already exist in the training data fed to them, like racism, homophobia or sexism. However, these sorts of traits are usually only identified because they impact the predictions the model makes. Therefore, it seems likely that many other human cultural traits are also ingested and replicated, and may evolve over time. In unbounded RL agents especially, it is highly likely that behaviours can develop that are equivalent to memetics in the human case, and are separate from the base and mesa optimisers as already defined. It is also possible that AGIs may develop their own memes - their own unique culture and ideas - in ways that are not necessarily understood by humans. Some studies have found evidence to indicate that AI agents could create their own languages to communicate, and there is scholarship that looks at the potential for AI culture.

If we look to the future of AGI, which is likely to be more interconnected and ubiquitous, to have access to more data, and to involve generative techniques, the relationships (and transmission of material) between AGIs will likely have a greater impact, and this also increases the likelihood of memetic optimisation.
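Here is one speculative sketch of that kind of transmission. It is entirely hypothetical: the ‘agents’, phrases and copy probabilities are invented, and no claim is made that current systems behave this way. The idea is simply that when agents copy material from one another according to how catchy it is rather than how useful it is, a catchy-but-useless phrase can come to dominate the population’s repertoire.

```python
import random

# Hypothetical sketch: "agents" hold sets of phrases and occasionally copy
# phrases from one another. Copying depends only on a phrase's catchiness
# (its memetic fitness), not on its usefulness for any task.

random.seed(1)

COPY_PROB = {"useful_fact": 0.05, "catchy_slogan": 0.50}   # invented values

agents = [set() for _ in range(20)]
agents[0].add("useful_fact")     # one agent knows the useful phrase
agents[1].add("catchy_slogan")   # another agent introduces the catchy one

for _ in range(500):             # random pairwise interactions
    speaker, listener = random.sample(range(len(agents)), 2)
    for phrase in list(agents[speaker]):
        if random.random() < COPY_PROB[phrase]:
            agents[listener].add(phrase)

counts = {phrase: sum(phrase in a for a in agents) for phrase in COPY_PROB}
print(counts)   # the catchy phrase reaches far more agents than the useful one
```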

In memetics as applied to humans, there are many theories that address the impact of the memetic optimiser over time - for example, religion waxing and waning in importance - especially as human survivability becomes ‘easier’ with modern conveniences. It is possible that memetic optimisers in AGI may be present now, but only become obvious post-testing. This would represent a further alignment challenge beyond those already known, and raises questions about the ‘intelligence’ and ‘human-like agency’ of AGI.


 

This is a field rich with possibility and analogy. There are many further research questions, including the following:

  • Expanding the standard cybernetics model to include memetics
  • Imitation within AGI
  • Could AGI ‘culture’ represent more than just another optimiser, and how might future AGI shape what this looks like?
  • Considering other types of AI within AI Safety eg. generative methods
  • Looking to broader fields that could lend insights to AI Safety


 

While this post focused primarily on single-agent systems, my teammates will be exploring some of these further research questions for multi-agent systems in the next two posts.


 
