[This is post 2 of 3 in a series written as the output for the 2022 AI Safety Camp, for the ‘Impact of Memetics on Alignment’ team, coached by Daniel Kokotajlo and comprising Harriet Farlow, Nate Rush and Claudio Ceruti. You can view the first post in this series here. Feedback so appreciated I'll do a jig for every comment.]
This post is a beginner’s attempt at extending the generally-accepted model of reinforcement learning agents (see this post).
This extension encourages us to think about artificial agents in a more suggestive and (potentially) more productive way. Specifically, this extension captures an agent’s ability to change the policies it uses over time, while abstracting the decision to change policies away from the agent itself.
This encourages thinking about agents as more dynamic creatures than the standard RL formalization suggests. It also allows for some interesting arguments about constraints on these agents' behaviors that are relevant to AI safety. This post is not optimistic, as it seems these more dynamic agents are less likely to be well-aligned.
This post was written as a culmination of my time at AI Safety camp. I am not an AI researcher, nor am I particularly good at math. Please treat these suggestions as a beginner taking a shot in a specific direction. Of course, any and all feedback is greatly appreciated!
The model extensions I suggest are motivated by the study of memetics and memes.
By memes, I’m not talking about Grumpy Cat - I’m talking more generally about memes as a unit of “cultural replication.” The theory of memetics argues that humans are a uniquely big system of cultural replicators, and as such a well-developed theory of memes might be able to explain much of our behavior.
Although memetic theory is far from mainstream, its relevance to AI Safety is immediately obvious. Humans are the most intelligent agents we know of, and we are also by far the most memetic. As such, it’s natural to wonder how this trend might continue to affect artificial (and possibly smarter) agents.
Importantly, this is not an attempt to formalize memes. This model extension merely uses memetics as the inspiration. Memes make clear that behavior/beliefs may infect an agent, whether they want to be infected or not. This, in turn, might change how an agent behaves through a process entirely outside of its control.
This has obvious implications for AI safety research. An agent that does not control its own behavior may be particularly hard to align indeed.
We begin from the model described in this post, and adapt from there.
First, for agents to change policies, we need to be able to refer to different policies specifically.
Thus, we introduce a set of names I, where there is a name i∈I for each policy πi. As this paper argues, these names can naturally be chosen as program strings with respect to some interpreter.
Now, we can speak of an agent who knows about not just one policy but rather a collection of policies.
Specifically, the agent has a set It⊆I, which is the set of the names of the policies it knows at a given time step; I0 is the initial set of policies the agent starts with. At every timestep, part of the agent's observation of the world is a set of additional policy names, which are added to those it has seen before to form It+1.
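As a minimal sketch of this bookkeeping, here is one way the growing set It might look in code. All names here (`PolicySet`, `observe`, etc.) are my own illustrative choices, not anything from the original model:

```python
class PolicySet:
    """Tracks I_t: the set of policy names the agent knows at time t."""

    def __init__(self, initial_names):
        # I_0: the names the agent starts with.
        self.known = set(initial_names)

    def observe(self, new_names):
        # Part of each observation is a set of newly seen policy names,
        # which extend I_t into I_{t+1}. Names are never forgotten.
        self.known |= set(new_names)
        return self.known
```

Note that the set only ever grows: the model says nothing about forgetting policies, only about being exposed to new ones.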
Now that our agent has a specific set of policies available to it at every time step, it is natural to expect that the agent will choose one.
Here, we take this decision out of the agent's hands. Instead, we have a function called the policy selector: P(It)=it. This function is run not by the agent but by the environment. The agent then uses policy πit at that time step.
Notable here is the agent's lack of choice in what policy it runs. It does not get to select the policy that it believes will maximize utility (although one could construct a policy selector that does this). Instead, we put no constraints on the policy that is selected.
This policy selector acknowledges that the policy an agent runs may be chosen not by the agent itself but by some other process -- a process that could be memetic rather than a utility-maximizing one.
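To make the interface concrete, here is a hedged sketch of one possible policy selector. The function name and the random-choice behavior are purely illustrative -- the model deliberately places no constraints on how P chooses:

```python
import random

def policy_selector(known_names, rng):
    """P(I_t) = i_t: map the agent's known policy names to the name of the
    policy it will run this step.

    This example picks uniformly at random, but a memetic process, a utility
    maximizer, or anything else could be substituted here -- crucially, the
    environment runs this function, not the agent.
    """
    # sorted() makes the draw reproducible for a given seeded rng.
    return rng.choice(sorted(known_names))
```

The important design point is the ownership of the function: the agent supplies It, but the environment supplies (and executes) P.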
It might seem like we’re not really adding much to the model with this extension; couldn’t we just represent these changing policies as a single policy in the standard RL model?
Yes, we could! But as we will explore below, this tiny amount of additional, meme-motivated structure will allow us to reason about the behavior of the agent in interesting and somewhat-unique ways.
The agent starts with some initial set of policies I0. A single policy is selected from this set and run, which in turn affects the response from the environment. Notably, in our model, this response contains the new policies the agent is exposed to and could possibly run in the future.
So: the policies an agent runs determine its behavior, which determines which policies it learns about, which determines which policies can be run in the future.
This creates a clear path dependency in the policies that the agent may choose to run, no matter what policy selector is used.
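The loop above can be sketched end to end. The `exposure` map below is a toy stand-in for the environment's response (which policies you get exposed to as a result of running a given policy); all names are hypothetical:

```python
def run_episode(initial_names, exposure, selector, steps):
    """Run the select-act-observe loop and return the final known set I_T."""
    known = set(initial_names)  # I_0
    for _ in range(steps):
        i_t = selector(known)                # the environment picks the policy
        known |= set(exposure.get(i_t, ()))  # the observation extends I_{t+1}
    return known

# Running "a" exposes the agent to "b", and so on:
exposure = {"a": ["b"], "b": ["c"], "x": ["y"]}

# Starting from I_0 = {"a"}, policies "x" and "y" are simply off the
# agent's path -- no selector can ever surface them.
reachable = run_episode({"a"}, exposure, selector=max, steps=5)
```

However the selector behaves, the agent starting from {"a"} can only ever know policies reachable through the exposure structure, which is the path dependency in miniature.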
It’s not a new observation that agents’ behavior is path dependent: path dependence already arises in training via gradient descent (and other procedures that do not converge to the global minimum). So how is this different?
In this case, the path dependence argument we are making is not a function of the training architecture, of the training environment, or of the policy selector function. This path dependence is a fundamental aspect of being an agent with non-total knowledge in the world, and will exist for any agent we could conceivably construct.
Moreover, this behavioral path dependence exists in a post-deployment agent in the wild.
One interesting corollary of this path-dependence argument: it is possible for an agent to be unaligned from a utility function perspective while still being aligned from a policy/action perspective, and vice-versa. An agent’s utility function might not prefer to do aligned things, but the agent simply might not be aware of strategies that break alignment in practice.
For example, consider a paperclip maximizer that is not aligned: it would be happy to kill all humans on earth as long as it would result in more paperclips. But if it never learns strategies that help it kill all humans on earth (to make more paperclips), so what?
This side -- the unaligned utility function but aligned actions -- is less scary, of course. What is scarier: an agent that has a totally aligned utility function, and yet, due to a rough upbringing in a non-ideal training environment and path dependence, can never quite behave in an aligned manner.
In this model, utility function alignment is not enough to conclude the agent’s behavior will be aligned in practice. It seems we need to argue about the set of policies that the agent actually deploys, which in turn is a function of the strategies it has learned, which is a function of the strategies it has previously deployed, and so on.
This memetically-inspired extension has allowed us to make a somewhat interesting argument about the relationship between aligned utility functions and aligned behavior. Of course, there might be other ways to get to this conclusion from our current model -- but to me, this felt very natural.
Of course, I am a beginner, and this research was primarily done as a way to learn about AI Safety!
That being said, I do have a general fear of well-accepted formalisms and the thinking patterns they encourage. It helps to remember that for something as imaginary as future AI agents, there’s many ways that things could turn out, and forcing ourselves to step outside our formalisms is a good way to reset our thinking and glance at AI Safety with fresh eyes.