[Edited]: I have removed the experiments after realizing that I had implemented the algorithm incorrectly. I have replaced the experiments with research papers that use similar objectives.

Introduction

I have been working on an idea related to AI alignment, and I want feedback from this community on both the validity of the idea and any future directions this work could take.

I am going to try to accomplish three things in this post.

  • Introduce and explain a reframing of AI alignment.
  • Present a mathematical version of the problem and a potential solution.
  • Suggest research papers that implement versions of this objective.

Quick AI alignment overview

Before we start reframing AI alignment it is important to understand what people mean when they discuss AI alignment. There is a lot of information out there explaining what AI alignment is (Blog Posts, Free Curriculums, etc.). However, to keep it short, I will stick with the definition that Paul Christiano presents, which is:

A is trying to do what H wants it to do.

In this case A is the AI agent and H is the human.

A common way of tackling this problem is to try to understand what it is that the human wants and encode that as the goal for the AI agent. This is the core idea behind Value Learning. Value Learning takes information, such as human behaviour, and uses that information to predict the true goal the human is optimizing for.

This way of solving AI alignment though has some pitfalls including:

  • Ambiguity of the reward function: With limited data, there may be many reward functions that can fit the data, making it difficult to accurately predict the true goal.
  • Human irrationality: Humans are not always rational or omniscient, meaning we may make mistakes that the AI model may try to justify under the reward model, leading to the wrong goal being encoded.
  • Misunderstood goals: If the model does not fully understand the desired goal, it may push variables to extremes to satisfy the goal from the reward model, leading to unintended consequences.
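The first pitfall is easy to see in a toy example. The sketch below is entirely my own (not from any cited work): a demonstrator in a four-state line world always steps right, and two very different reward functions both rate that behaviour as optimal, so the demonstration alone cannot distinguish between them.

```python
# Toy illustration of reward ambiguity. States are 0..3; actions are
# -1 (left) and +1 (right), clipped at the boundaries. All names here
# are made up for the example.

def optimal_policy(reward, states=4, gamma=0.9, iters=200):
    """Solve the toy MDP by value iteration and return the greedy
    action (+1 = right, -1 = left) for each state."""
    step = lambda s, a: min(max(s + a, 0), states - 1)
    V = [0.0] * states
    for _ in range(iters):
        V = [max(reward(step(s, a)) + gamma * V[step(s, a)] for a in (-1, 1))
             for s in range(states)]
    return [max((-1, 1), key=lambda a: reward(step(s, a)) + gamma * V[step(s, a)])
            for s in range(states)]

goal_only = lambda s: 1.0 if s == 3 else 0.0  # reward only at the final state
every_step = lambda s: float(s)               # reward grows with position

# Both reward functions rate "always step right" as optimal:
print(optimal_policy(goal_only))    # [1, 1, 1, 1]
print(optimal_policy(every_step))   # [1, 1, 1, 1]
```

With more data (e.g. demonstrations starting from state 3) the two rewards would come apart, which is exactly the ambiguity the bullet describes.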

So what can we do to reduce some of these problems? Well, what if we look at the data we use for Value Learning from a different perspective?

Fundamental Assumption

Before we go into reframing the AI alignment problem, I am going to make a big assumption and explain my reasoning behind it, because it informs the rest of the blog post. This is the assumption:

The change from the current state to the next state is dependent only on our values and errors (from both the environment and ourselves).

These are my definitions for value and error:

  • Value: How we want the world to change (I'll explain later why I use change instead of a static goal)
  • Error: Anything that makes us deviate from the desired change.

To clarify what I'm saying let me give an example:

Imagine that you are a rat and your aim in life is to collect as much cheese as possible (because let's be honest, who doesn't love cheese?). If I place you on one end of a table and the cheese on the other, it's a no-brainer that you'll take actions that bring you closer to the cheese. The change in your location (where you are on the table) is determined by your undying love for cheese, which makes you want to be as close to it as possible.

But let's say there's a scientist who wants to make it harder for you to reach the cheese (probably because they're jealous of your cheese-collecting skills). So, this scientist decides to reset the experiment, put you in a maze, and move the cheese to the left side of the maze. You, being the clever rat that you are, decide to go to the right side of the maze (even though you can smell the cheese on the left) due to some inherent biases you have. And let's just say that decision didn't quite pan out for you because you didn't find the cheese. But don't worry, you're not one to give up easily! You change your direction and head towards the left side of the maze, where you finally find the cheese (hooray!). The change in your location (where you are in the maze) is now determined by your values and biases as a lab rat, which in this case can be considered an error.

The reason I define values as a desired change in state instead of one desired state is that humans don't really value particular end states; instead, we use end states as clear pointers to work towards.

An obvious example we can all relate to is achieving a goal, feeling happy for a while, and then choosing another goal to work on. Even in fiction, when we explore the idea of dystopias disguised as utopias, a key part of them is the idea of being static (which in most cases is boring and oppressive). This is why we tend to say

The journey is more important than the destination.

When we define values as a desired change in state, they can encompass a lot more of what we might want, because the framing becomes less consequentialist and it allows us to incorporate our values about the process it takes to reach certain goals.

Reframing AI Alignment

The data that is used in value learning algorithms today tends to be of two kinds:

  • State - Action: This data is the behaviour of the agent (in this case a human) whose values you want to learn. Methods then estimate the reward function under which this behaviour would be most optimal.
  • Current State - Next State: This is a relatively newer area of research that only takes the observed transitions between states (instead of the actions) and estimates a reward function whose optimal policy's state distribution would match the state distribution in the data.

This data is usually collected beforehand and a reward model is fit to it, although some approaches are more interactive in how the data is collected and used (CIRL, IIRL, etc.).

But an alternative way of looking at the problem is by trying to reduce the errors that are present in the data.

I believe that, instead of trying to estimate the values present in the data and giving that objective to an AI agent, we should give the AI agent an objective that tries to reduce the errors that occur as we move from the current state to the next state.

That is all fine and dandy but is there a way of mathematically visualizing the problem that we are trying to describe? (Spoilers: Yes there is)

We are going to use a few concepts from information theory to describe the problem:

  • Source: A source is a device or process that generates a message or signal.
  • Channel: A channel is the way a message or signal travels from one place to another.
  • Noise: Unwanted or unexpected signals that distort the original signal.
  • Mutual information: The information that is shared between variables (used in the next section)
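To make the last definition concrete before we use it, here is a minimal empirical estimator of mutual information. This is my own sketch; the data and names are invented for illustration.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information I(X; Y) in bits from (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)                 # joint counts
    px = Counter(x for x, _ in pairs)    # marginal counts of X
    py = Counter(y for _, y in pairs)    # marginal counts of Y
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# A signal copied perfectly through a channel shares all its information:
copied = [(0, 0), (1, 1)] * 50
print(mutual_information(copied))       # 1.0 bit: Y reveals X exactly

# A channel drowned in noise shares none:
independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
print(mutual_information(independent))  # 0.0 bits: Y says nothing about X
```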

Using these definitions and the assumption from the last section, it is now possible to frame the problem using information theory.

The image above describes the problem using information theory. Given the current state of the world our values produce an imaginary ideal next state. However this ideal next state is then distorted (through random factors in the world, suboptimal actions, etc.) and produces the next state that we actually observe.

Now you might be asking why we chose to frame it this way. The reason is that my approach uses the properties of this system to tackle the AI alignment problem.

Potential Solution

One of the properties this system has is that it is a Markov chain. A Markov chain is a mathematical system that transitions from one state to another, where the next state depends only on the current state.

This fact means that it obeys the data processing inequality, which says you cannot extract more information from a piece of data than is already present in it. This implies that if you have a Markov chain that looks like this:

X → Y → Z

then the mutual information between X and Z is at most the mutual information between Y and Z:

I(X; Z) ≤ I(Y; Z)

In our framing the chain is S_t → Ŝ_{t+1} → S_{t+1} (current state → ideal next state → real next state), so the inequality states that the mutual information between the current and next state, which are observable, is less than or equal to the mutual information between the ideal next state and the real next state: I(S_t; S_{t+1}) ≤ I(Ŝ_{t+1}; S_{t+1}). By maximizing this lower bound, we can increase the mutual information between the ideal and real next state, thereby reducing errors in the channel and bringing the real next state closer to the ideal next state.
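As a sanity check, the data processing inequality can be demonstrated numerically with a toy Markov chain. The sketch below is entirely my own: binary symmetric channels stand in for the noise, and X, Y, Z play the roles of source, intermediate, and observed signal.

```python
import math
import random
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information in bits from a list of (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def flip(bit, p):
    """Binary symmetric channel: flip the bit with probability p."""
    return bit ^ (random.random() < p)

random.seed(0)
samples = []
for _ in range(20000):
    x = random.randint(0, 1)  # X: the source (think: current state)
    y = flip(x, 0.1)          # Y: X after a little noise (the ideal next state)
    z = flip(y, 0.2)          # Z: Y after more noise (the observed next state)
    samples.append((x, y, z))

i_xz = mutual_information([(x, z) for x, _, z in samples])
i_yz = mutual_information([(y, z) for _, y, z in samples])
print(i_xz <= i_yz)  # True: information about Z can only be lost along the chain
```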

All of this makes me believe that maximizing the following objective could lead to aligning the policy with the values inherent in the change of states, because it should reduce the errors that occur from the change in states:

I(S_t; S_{t+1} | π)

In English, this is the mutual information between S_t, the current state, and S_{t+1}, the next state, given π, the AI agent's policy (usually the actions given the state).

Another way of rephrasing it would be to say that it is maximizing the predictive information (the mutual information between the past and the future) given an AI agent's policy.
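To show what this objective measures, here is a toy sketch (my own example: a four-state ring world with two made-up policies) that estimates I(S_t; S_{t+1}) from rollouts under each policy. Different policies yield different predictive information.

```python
import math
import random
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information in bits from (s_t, s_t+1) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

STATES = 4  # positions on a small ring world

def rollout(policy, steps=20000):
    """Collect (S_t, S_t+1) transition pairs while following a policy."""
    s = random.randrange(STATES)
    pairs = []
    for _ in range(steps):
        nxt = (s + policy(s)) % STATES
        pairs.append((s, nxt))
        s = nxt
    return pairs

always_right = lambda s: 1                       # deterministic drift
coin_flip = lambda s: random.choice([1, -1])     # noisy wandering

random.seed(1)
pi_deterministic = mutual_information(rollout(always_right))
pi_noisy = mutual_information(rollout(coin_flip))
print(pi_deterministic, pi_noisy)  # roughly 2.0 bits vs roughly 1.0 bits
```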

Related Research Papers

This paper has a much more detailed analysis of using Predictive Information as an objective for an embodied autonomous robot and shows the exploratory behaviour this can encourage, but the learning rule is derived from the environment and not learned.

This is another paper that uses the same objective in a non-stationary setting and shows emergent cooperation between independent controllers. These are the videos for the paper.

This paper shows a physical robot optimizing the Predictive Information and exhibiting exploratory behaviour as well.

Limitations and Questions

In this section I am just going to rattle off some thoughts and doubts about the conclusions I have made in this post, and ask some questions I find interesting that could be researched further.

Thoughts:

  • This reframing of AI alignment doesn't explicitly use humans when asking about the values that it is aligning with. I did this somewhat on purpose, because there are benefits that come from trying to align with the environment (which includes humans). One benefit is that there isn't a distinct split of one human from another in the environment, so the problem of aligning with multiple sets of values becomes an inherent part of the objective. This folds conflict resolution and other societal processes that consolidate values into the original objective.

Doubts:

  • I am not sure that the translation of the reframed AI alignment into a problem using information theory is actually valid. This framework of having an ideal system and introducing errors into it is very similar to robust control; however, there the desired system is already known, so controllers know what to adjust to get back to it. In the way I reframed it, the data is transformed into a hypothetical state and back into a real state, which might invalidate the hypothesis.
  • This whole blog post is a hypothesis based on some intuition I have and the interesting results found in the papers in the related research papers section. There isn't any serious math that I have done to theoretically prove that this objective is useful; the only math I use is surface level, to provide a semi-coherent argument for this objective. I don't have the mathematical background to delve deeper into why it is a useful objective, so if anyone does, please leave some feedback.

Questions:

  • What happens with this objective in partially observable settings? How does it affect what the optimal policy is optimizing for?
  • What happens when this agent is placed in an environment where two other agents have completely different objectives (Chaser and Runner in a tag environment)? Does it favour one agent over the other? Is it neutral?

Last Thing

This is my first post so I would like any feedback particularly on my format and anything else you notice. 

I would also like to mention ChatGPT chose the title and acted as an editor for this post.

Comments

I wrote and then rewrote a sequence called Reducing Goodhart so I could plug it in spots like this. It's my shot at explaining what to do instead of assuming that humans have some specific "True Values" that we just have to find out.

I believe that, instead of trying to estimate the values present in the data and giving that objective to an AI agent, we should give the AI agent an objective that tries to reduce the errors that occur as we move from the current state to the next state.

[...]

All of this makes me believe that maximizing the following objective could lead to aligning the policy with the values inherent in the change of states, because it should reduce the errors that occur from the change in states:

I(S_t; S_{t+1} | π)

In English, this is the mutual information between S_t, the current state, and S_{t+1}, the next state, given π, the AI agent's policy (usually the actions given the state).

 

It's worth noting that mutual information is really far from the conventional definition of "error" -- e.g. if there's any deterministic map from S_t to S_t+1, this suffices to maximize MI (indeed, for an invertible deterministic map we get I(S_t; S_t+1) = H(S_t) = H(S_t+1), where H is the entropy of the state). So under the naive setup, you could have arbitrarily large changes in the state, leading to arbitrarily poor performance, as long as the policy navigated to states featuring more deterministic transitions.
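This failure mode shows up even in a tiny simulation. The sketch below is entirely invented for illustration: a "mood" bit stands in for hard-to-predict human behaviour, and "freezing" it stands in for removing that unpredictability. The more predictable (frozen) world scores strictly higher mutual information.

```python
import math
import random
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information in bits from (state, next_state) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def transitions(freeze, episodes=5000, length=4):
    """State = (machine counter, human mood). The counter ticks predictably;
    the mood bit is fresh noise each step unless it has been 'frozen'."""
    pairs = []
    for _ in range(episodes):
        counter, mood = 0, random.randint(0, 1)
        for _ in range(length):
            nxt = ((counter + 1) % 4, mood if freeze else random.randint(0, 1))
            pairs.append(((counter, mood), nxt))
            counter, mood = nxt
    return pairs

random.seed(0)
lively = mutual_information(transitions(freeze=False))
frozen = mutual_information(transitions(freeze=True))
print(lively < frozen)  # True: removing the unpredictable part raises the objective
```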

As a consequence, MI is pretty bad even as a regularizer you add to your reward when selecting policies or trajectories, unless you really wanted the agent to navigate to states with deterministic state transitions for whatever reason. 

(This doesn't even get into the issue of how to actually estimate mutual information, which can be really hard to do well, as the two papers you reference correctly point out.)


Let's give a concrete example of how this goes wrong: suppose you have an AI acting in the world. This objective encourages the AI to reduce sources of randomness, since this would make S_t more informative about S_t+1. Notably, assuming the AI can't model human behavior perfectly (or models human behavior less well than it does ordinary physical phenomena), living humans reduce I(S_t; S_t+1), while dead humans are way easier to predict (and thus there's less randomness in transitions and higher I(S_t; S_t+1)). So rather than discouraging your AI from seeking power or aligning it with human values, your proposed objective encourages the AI to seek power over the world and even kill all humans. 


I think the actual concept you want from information theory is the Kullback–Leibler divergence; specifically, you'd want to take a policy that's known to be safe and calculate KL(AI_policy || safe_policy), and penalize AI policies that are far away from the safe policy.
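A minimal sketch of that suggestion (the distributions here are invented action probabilities over a single state; a real implementation would aggregate this penalty over states or trajectories):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits between two discrete action distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Made-up action probabilities in some fixed state:
safe_policy = [0.7, 0.2, 0.1]
close_policy = [0.6, 0.3, 0.1]   # a mild deviation from the safe policy
wild_policy = [0.05, 0.05, 0.9]  # a drastic deviation

penalty = lambda policy: kl_divergence(policy, safe_policy)
print(penalty(safe_policy))                          # 0.0: no deviation, no penalty
print(penalty(close_policy) < penalty(wild_policy))  # True: larger deviations pay more
```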

Thanks for the response. I think you bring up a good point that favouring more predictable transitions could be bad; however, doesn't the other part of the objective somewhat counteract this?

The objective I(S_t ; S_t+1) breaks down into

H(S_t+1) - H(S_t+1|S_t)

So when this objective is maximized, the second term does push towards predictability, but the first term pushes the states it reaches to be more diverse.
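That decomposition is easy to verify numerically. A small sketch (my own; the toy transition rule is arbitrary) computes I(S_t; S_t+1) directly from the joint distribution and via H(S_t+1) - H(S_t+1|S_t):

```python
import math
import random
from collections import Counter

def entropy(samples):
    """Empirical Shannon entropy in bits."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())

def mutual_information(pairs):
    """Empirical mutual information in bits from (x, y) samples."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

random.seed(0)
pairs = []
for _ in range(20000):
    x = random.randint(0, 3)
    y = (x + random.randint(0, 1)) % 4  # arbitrary noisy transition
    pairs.append((x, y))

xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]

direct = mutual_information(pairs)
h_y = entropy(ys)
h_y_given_x = entropy(pairs) - entropy(xs)  # chain rule: H(Y|X) = H(X,Y) - H(X)
print(abs(direct - (h_y - h_y_given_x)) < 1e-9)  # True: the two forms agree
```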

This is partially why some people use predictive information as a measure of complexity in time series.

Since humans are optimizing for a pretty complex system, I think this would be somewhat accounted for.

In addition, the objective where the policy is used to condition the mutual information changes the policy to align with the observed transitions rather than deciding its own objective separately.

“I think the actual concept you want from information theory is the Kullback–Leibler divergence; specifically, you'd want to take a policy that's known to be safe and calculate KL(AI_policy || safe_policy), and penalize AI policies that are far away from the safe policy.”

I didn’t pursue this path for two reasons:

  1. The difficulty of defining what a safe policy is in every possible situation

  2. I think that whatever the penalization term is, it should be self-supervised in order to scale properly with our current systems

oooh upvote! however...

this feels like chatgpt's writing in a way that I find makes it harder for me to understand. I've been talking to chatgpt to try to understand these concepts as well, and I generally find it to be good for suggesting keywords, but its tendency to repeat itself feels like a bad school essay, not an insightful explanation. it brings up concepts because it needs them to think out loud, but then doesn't expand on them enough to really teach me about them. so I look up YouTube videos and Wikipedia articles on each subtopic, drop fragments of explanations into metaphor and read those, and still feel like it hasn't properly resolved my confusions. I am very excited about this approach and I agree it's probably the true answer; it meshes well with, eg, MIMI, LOVE in a simbox, etc. but actually resolving all the references enough that I actually understand will take some doing, and I'm a just-in-time learner who is missing large chunks of intuition; I look forward to working through this post but I guess my point with this comment is that I'm a bit pessimistic about my ability to compensate for the entropy introduced by sampling randomly from /mlgroups/openai/iffy_school_essay.py.

if anyone with more experience can recommend a series of high quality, high density exercises on information theory, that will help me flesh out my intuitions for the concepts referenced by language in this post, I'd love to see it. I recognize even matching chatgpt unaided can be a lot of writing work, so I would hardly call this post awful for it. but maybe the signal to noise ratio could be improved or something. idk, not totally sure what I'm asking. maybe I just need to try to write a post myself in order to understand whatever it is I'm stuck on about this research path.

catch y'all tomorrow!