This is the last in a series of three posts on the interlinkages between Memetics and AI Alignment. It was written as an output from the 2022 AI Safety Camp, for the ‘Impact of Memetics on Alignment’ team, coached by Daniel Kokotajlo and comprising Harriet Farlow, Nate Rush and Claudio Ceruti. Please also read post 1 and post 2. We are not AI Safety experts, so any and all feedback is greatly appreciated :)
TL;DR: Linking the concept of imitation between the domains of memetics and AI safety leads to considering how misalignment could spread memetically; this helps to identify misaligned agents able to generate a contagious spread of memes of misalignment. In particular, power-seeking agents could diffuse power-seeking tendencies to other agents under mild conditions. This work contributes initial tools for researching misalignment in multi-agent settings.
In memetic theory, imitation is a fundamental part of the evolutionary process of memes, since it is the primary mechanism by which memes spread, reproduce, are selected, and mutate. Even if selection pressure on memes is exerted internally, e.g., a single agent recalls only certain memes it was exposed to because it finds them easier to remember, the reproduction of memes can exist only in the presence of imitation.
In more detail, Susan Blackmore defines imitation as an agent copying other agents’ observed behaviour to achieve one of its goals (or sub-goals). While this concept clearly distinguishes imitation from other forms of collective learning (such as reacting to enhanced stimuli resulting from the actions of other agents), it lacks a detailed explanation of the copying function.
We can fill this gap by referring to the concept of observational learning introduced by Albert Bandura. Observational learning consists of observing and modelling another agent’s behaviour, treating it as a learning source rather than cloning it perfectly.
Combining observational learning and Blackmore’s definition produces a notion of copying another agent without explicitly and fully modelling its behaviour, but only focusing on replicating certain aspects or part of it.
This synthesis sounds intuitive, since when we imitate some behaviour we generally grasp only its fundamental aspects. If you want to replicate a dancer, you don’t care about wearing the same clothes as them or being illuminated in the same way; you only focus on making approximately the same dance moves. You don’t spend time trying to infer what the dancer thinks while dancing, why they are dancing in the first place, the exact angle of their arms relative to their torso, and so on.
It is also worth noting that this interpretation of imitation, being based on an approximation of selected parts of observed behaviour, opens up to the presence of noise, e.g., replicating some dance move incorrectly or with a personal twist; in memetic terms, noise can be seen as the mutation undertaken by memes during the replication process.
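To make this concrete, here is a minimal Python sketch of imitation as partial, noisy copying. All names and the `mutation_rate` knob are hypothetical illustrations, not from the post: the observer copies only the aspects it attends to, and occasional noise stands in for memetic mutation.

```python
import random

def imitate(observed_moves, focus, mutation_rate=0.1, rng=None):
    """Copy only the aspects of each move the observer attends to,
    with a small chance of noise (memetic mutation) on each copy."""
    rng = rng or random.Random(0)
    copied = []
    for move in observed_moves:
        # Partial copy: keep only the attended-to aspects of the behaviour.
        aspect = {k: v for k, v in move.items() if k in focus}
        if rng.random() < mutation_rate:
            aspect["twist"] = True  # a personal twist: the meme mutates
        copied.append(aspect)
    return copied

# The observer replicates only the dance moves, ignoring clothes and lighting.
dance = [{"move": "spin", "clothes": "red", "lighting": "spotlight"},
         {"move": "step", "clothes": "red", "lighting": "spotlight"}]
replica = imitate(dance, focus={"move"})
```

The replica keeps the dance moves while dropping everything the observer never attended to, and the `twist` key, when it appears, plays the role of a mutated meme.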
When does imitation take place then? In other words, why does an observer decide to select some seen behaviour and try to replicate it?
The intuitive notion is that we imitate because we think replicating what we saw will be beneficial for achieving one or more of our goals or sub-goals. More specifically, the observer thinks a correlation exists between the observed behaviour and the set of behaviours that will be beneficial in terms of reward ("thinks" because the observer can be wrong or deceived).
In Reinforcement Learning several techniques are based on some sort of imitation process: teacher-student algorithms, behavioural cloning, and inverse reinforcement learning are among them.
In behavioural cloning, an agent seeks to regress the policy of an observed expert, so that it can achieve the expert’s goal by enacting the very same policy. In contrast, in inverse reinforcement learning, an agent is tasked with inferring the reward function that the observed agent is optimising. Both of these approaches require that the observer models the state-action trajectories of the observed agent. This complex and challenging task requires a high degree of assumptions and approximations, which are too restrictive to apply to the generality of real-world scenarios (sparse rewards, noisy signals, etc.).
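As a minimal illustration of the behavioural-cloning idea, the expert policy can be regressed directly from observed state-action pairs. This is a toy sketch with invented linear data, not an implementation from the post:

```python
import numpy as np

# Behavioural cloning as plain supervised regression: fit a policy that
# maps expert states to expert actions (hypothetical noise-free toy data).
rng = np.random.default_rng(0)
states = rng.normal(size=(200, 4))          # observed expert states
true_w = np.array([1.0, -2.0, 0.5, 0.0])    # the expert's (unknown) policy
actions = states @ true_w                   # the expert's observed actions

# Least-squares regression of the expert policy from the trajectories.
w_hat, *_ = np.linalg.lstsq(states, actions, rcond=None)
```

With noise-free data the regression recovers the expert’s mapping exactly; with sparse or noisy signals, as the post notes, this modelling step becomes much harder.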
To overcome this limitation, we can again refer to the Bandura definition of observational learning and leverage its formalisation in the domain of Reinforcement Learning.
First of all, we need to include the conditions that enable the execution of observational learning, which Bandura identifies as:

- attention: the observer notices the relevant features of the modelled behaviour;
- retention: the observer remembers what it has observed;
- reproduction: the observer is capable of reproducing the behaviour;
- motivation: the observer has a reason (an expected reward) to reproduce it.
Observational learning, formalised as an RL methodology, seems more feasible and generalisable and could also arise naturally from relatively simple multi-agent environments.
It has been shown empirically that a reward system in a multi-agent scenario is sufficient to produce imitation through observational learning, even without inter-agent communication.
More formally, it is shown that the conditions that allow the presence of imitation are the following:

- memory (M);
- expert dropout (ED): the expert is removed from the environment after a while;
- attentional bias towards the expert (AL);
- automatic domain randomisation (ADR);

constituting the MEDAL-ADR framework.
Apart from the last one, ADR, which trains the agent to generalise as much as possible to simulate the Baldwin effect, the other conditions can be naturally linked to Bandura’s features of observational learning.
Memory and the dropout of the expert (the agent is left alone in the environment after observing the expert) correspond to retention in Bandura’s terms, while attentional bias corresponds to attention. What about reproduction and motivation, then?
Let’s consider a multi-agent setting where an agent is modelled as an MDP; that is, we are considering the usual formulation ⟨S, A, P, R, γ⟩, where S is the state space, A the action space, P the transition matrix, R the reward function, and γ the discount factor.
As an example, this image depicts an agent in state 1 that can choose to take the action of going right to state 2 or the action of going down to state 3 (image taken from here).
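The three-state example above can be written down as a tiny deterministic MDP. This is only a sketch; the reward values are hypothetical placeholders:

```python
# A minimal MDP ⟨S, A, P, R, γ⟩ for the three-state example:
# from state 1 the agent can go right to state 2 or down to state 3.
S = [1, 2, 3]
A = ["right", "down"]
P = {  # P[s][a] -> next state (deterministic transitions for simplicity)
    1: {"right": 2, "down": 3},
}
R = {1: 0.0, 2: 1.0, 3: 0.0}  # hypothetical rewards on arrival
gamma = 0.9                   # discount factor

def step(state, action):
    """Apply the (deterministic) transition function."""
    return P[state][action]
```

In a stochastic environment `P[s][a]` would instead be a distribution over next states, but the deterministic form is enough for the discussion that follows.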
In this setting we have an agent, provided with attention and memory, who sees another agent doing something in the environment, like going down from its initial position. The observed agent then keeps moving in its surroundings, creating a sequence of positions, i.e., its states.
At a certain point, the observer decides to imitate part of the trajectory it is seeing. It is evident that, to replicate this behaviour, the observer has to be in a state that allows it to reproduce the sequence of states: an agent won’t be able to replicate the behaviour “always go left” if an unclimbable wall is built on its left.
More formally, an agent can reproduce a trajectory only if a permutation exists between the sets of states; that is, there has to exist a function in the space of symmetries that projects the observed set of states onto a new trajectory composed of the observer’s own states (the ones that make up the replication enacted by the observer).
The existence of this permutation means that there is a correlation between sets (subsets) of states; without this, no imitation is possible.
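A crude way to sketch this condition in code: imitation is only possible if the observer’s own transition structure admits a walk that shadows the observed trajectory step by step. The graph and function names below are hypothetical, and this checks only reachability, not the full formal notion of a permutation:

```python
def can_reproduce(observed_traj, transitions, start):
    """Check whether the observer, starting from `start`, can take a walk
    of the same length as the observed trajectory. If every path is blocked
    (an 'unclimbable wall'), no mapping between the state sets exists and
    imitation is impossible."""
    frontier = {start}
    for _ in range(len(observed_traj) - 1):
        # Expand to all states reachable in one more step.
        frontier = {n for s in frontier for n in transitions.get(s, [])}
        if not frontier:
            return False  # dead end: the behaviour cannot be shadowed
    return True

# Hypothetical observer transition graph: a dead end after state "c".
graph = {"a": ["b"], "b": ["c"], "c": []}
```

Here a three-state observed trajectory can be shadowed (a → b → c), while a four-state one cannot, because the wall after "c" blocks the final step.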
We can draw an analogy between the existence of this symmetric function and the functioning of mirror neurons in the animal brain: a mirror neuron is a neuron that fires both when an animal performs an action and when the animal sees the same action carried out by another animal. The mirroring process of these specialised neurons is believed to be a staple of imitational learning in primates, as well as in certain species of birds.
But why does an agent choose a particular set of states to imitate?
In addition to the correlation between the observed and observer state spaces, another correlation is in place, linking the considered set of states with highly rewarding goals. It is worth noting that the reward function here is not inferred from the observed agent; it is the internal reward function of the observer.
In other words, an agent imitates when it recognizes a trajectory of states that are mappable to its own state-action space and that are highly rewarded; that is, the agent believes that being in that sequence of states will fulfil its goal (or a sub-goal).
Choosing to imitate a seen behaviour correlating with high reward goals can be an optimal strategy when exploration is challenging, rewards are sparse or deceptive, and the state-action space is large. These are all features that are easy to find in real-world scenarios. Therefore imitation can be advantageous in many cases; that is when it helps the agent find a highly rewarding path in a complex exploration space.
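Putting the two conditions together, the imitation trigger can be sketched as follows. The function names and the threshold are invented knobs for illustration, not part of the post’s formalism:

```python
def should_imitate(observed_traj, own_reward, mappable, threshold=1.0):
    """Imitate only when (i) the observed states map onto the observer's
    own state space and (ii) the observer's *own* reward function scores
    the observed trajectory highly enough."""
    if not mappable:
        return False  # no permutation of states exists: imitation impossible
    # The reward is the observer's internal one, not inferred from the expert.
    return sum(own_reward(s) for s in observed_traj) >= threshold
```

Note that both gates matter: a highly rewarding trajectory that cannot be mapped is never imitated, and a mappable trajectory with low internal reward is ignored.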
Having defined imitation as a process requiring the existence of a permutation of correlated states between two agents, activated when the permuted states correlate with high reward goals as measured by the observer reward function, we can start thinking about what this process can produce under a memetic point of view.
In more detail, we will see what agents are the fittest candidates to imitate (the influencers) and what behaviour is most likely to be replicated (the memes).
Starting from here, we can broadly define a power-seeking agent as an agent who decides what action to take to strictly have more options available in the future instead of ending up in a single terminal state.
Some practical aspects of power-seeking behaviour are escaping confinement, taking over the world, refusing correction, and avoiding being shut down, among others. In the notorious paperclip example, power-seeking is an instrumental goal that drives the paperclip AI to avoid being shut down, since a shutdown would stop it from collecting paperclips and prevent it from further maximising its objective.
Choosing from the broadest possible range of future states means that, on average, a power-seeking agent can follow a greater number of state trajectories, i.e., behaviours, than a non-power-seeking agent.
Therefore, let's consider two agents, a power-seeking one P, and a non-power-seeking one N, both in the same environment with the same action-state space, and a third agent, that we call the observer O.
Since P will tend to be in states that maximise its future agency, it is, on average, more probable for agent O to find a permutation of states with P than with N; in other words, on average, the observer O will find it easier to imitate the power-seeking agent P than N.
A power-seeking agent is the best candidate for being imitated by O because a correlation of states with agent P is more probable to find: P will always have the possibility to enact more behaviours/trajectories than N.
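A toy way to see this: counting the distinct states an agent can reach within some horizon is a crude proxy for how many behaviours it keeps open. The two graphs below are hypothetical, contrasting a power-seeking agent P (branching, avoiding dead ends) with a non-power-seeking agent N (a single corridor towards a terminal state):

```python
def reachable_states(transitions, start, horizon):
    """Count distinct states reachable within `horizon` steps -- a crude
    proxy for how many trajectories an agent keeps open to an observer."""
    seen, frontier = {start}, {start}
    for _ in range(horizon):
        frontier = {n for s in frontier for n in transitions.get(s, [])}
        seen |= frontier
    return len(seen)

# Hypothetical environments: P keeps options open, N heads down a corridor.
P_graph = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [], 4: [], 5: [], 6: []}
N_graph = {0: [1], 1: [2], 2: []}

p_options = reachable_states(P_graph, 0, 2)  # 7 states within two steps
n_options = reachable_states(N_graph, 0, 2)  # 3 states within two steps
```

The observer therefore has strictly more candidate trajectories to correlate with when watching P than when watching N, which is exactly why P is the better imitation target.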
It is worth noting that we are not considering reward functions at this point, so we are not defining what an agent chooses to imitate; we are only considering the potential of an agent to be imitated independently from the specific reward function of the observer.
However, having established that a power-seeking agent is an influencer, we can say something about the kind of behaviours (memes) that will be imitated.
This follows naturally from the tendency of a power-seeking agent to avoid terminal states; that is, it favours enacting behaviours that will keep it away from them.
Therefore, the correlations that an observer will most probably find will, on average, be trajectories (behaviours/memes) that drive it to imitate the same power-seeking inclinations.
Considering power-seeking agents as memetic influencers increases the threat these agents could pose beyond the already known dangerous outcomes they produce.
Even a simple multi-agent system, where very mild conditions allow a natural emergence of imitation, could be derailed by the presence of a power-seeking agent, even if only for a short amount of time.
The probable outcome is that power-seeking tendencies will spread memetically to all the agents, even if the original power-seeking agent is removed from the environment.
From a purely memetic point of view, where agents are vessels used by memes to spread and reproduce, this makes perfect sense: the first commandment of a meme is to be replicated, and a shut-down agent can’t spread memes anymore. We can even say that the process of imitation inherently reproduces itself whenever it is applied.
This shows the potential benefit to AI safety research of introducing memetic concepts and definitions to analyse and study potential forms of misalignment from a different point of view.
Further directions for this line of research include studying how reward functions can be linked to the selective mechanism of memes: a correlation between observed state trajectories and high rewards is the trigger for activation, thereby guaranteeing the replication and spread of the observed behaviour. Another direction is analysing how the presence of symmetry (as pointed out here) between sets of states, as a required feature for imitation (the existence of a permutation), could lead to new insights into convergent evolution or feature universality in machine learning agents.