[AN #58] Mesa optimization: what it is, and why we should care

[-]Pattern6y110

Really great summaries as always. Digesting these papers is rather time consuming, and these distillations are very understandable and illuminating.

[-]Rohin Shah6y20

Thanks!

[-]David Scott Krueger (formerly: capybaralet)6yΩ230

RE Natasha's work: she's said she thinks that whether the influence criterion leads to more or less altruistic behavior is probably environment-dependent.

[-]Pattern6y30

Handling groups of agents

Social Influence as Intrinsic Motivation for Multi-Agent Deep RL (Natasha Jaques et al) (summarized by Cody): An emerging field of common-sum multi-agent research asks how to induce groups of agents to perform complex coordination behavior to increase general reward, and many existing approaches involve centralized training or hardcoding altruistic behavior into the agents. This paper suggests a new technique that rewards agents for having a causal influence over the actions of other agents, in the sense that the actions of the pair of agents have high mutual information. The authors empirically find that having even a small number of agents who act as "influencers" can help avoid coordination failures in partial information settings and lead to higher collective reward. In one sub-experiment, they only add this influence reward to the agents' communication channels, so agents are incentivized to provide information that will impact other agents' actions (this information is presumed to be truthful and beneficial since otherwise it would subsequently be ignored).
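To make "high mutual information between the actions of a pair of agents" concrete, here is a minimal sketch that computes mutual information from a joint distribution over two agents' actions. The function name and the two toy distributions are assumptions made for illustration; the paper's actual reward computation is not reproduced here.

```python
import math


def mutual_information(joint):
    """Mutual information in bits, where joint[a][b] = P(A = a, B = b)."""
    p_a = [sum(row) for row in joint]          # marginal over agent A's action
    p_b = [sum(col) for col in zip(*joint)]    # marginal over agent B's action
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p_ab in enumerate(row):
            if p_ab > 0:  # zero-probability cells contribute nothing
                mi += p_ab * math.log2(p_ab / (p_a[a] * p_b[b]))
    return mi


# Hypothetical case 1: the influencee always copies the influencer's binary action.
copied = [[0.5, 0.0],
          [0.0, 0.5]]

# Hypothetical case 2: the influencee ignores the influencer (independent actions).
ignored = [[0.25, 0.25],
           [0.25, 0.25]]

print(mutual_information(copied))   # 1.0 bit
print(mutual_information(ignored))  # 0.0 bits
```

The second case is the one the parenthetical in the summary relies on: if the other agents learn to ignore a message, the actions become independent and the influence measure drops to zero.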

Cody's opinion: I'm interested by this paper's finding that you can generate apparently altruistic behavior by incentivizing agents to influence others, rather than necessarily help others. I also appreciate the point that was made to train in a decentralized way. I'd love to see more work on a less asymmetric version of influence reward; currently influencers and influencees are separate groups due to worries about causal feedback loops, and this implicitly means there's a constructed group of quasi-altruistic agents who are getting less concrete reward because they're being incentivized by this auxiliary reward.

(I didn't understand that last part.)

This reminds me of OpenAI Five - the way they didn't communicate, but all had the same information. It'll be interesting to see if (in this work) the "AI" used the other benefits/types of communication, or if it was all about providing information. (The word "influencers" seems to invoke that.) "Presuming the information is truthful and beneficial" - this brings up a few questions.

1) Are they summarizing? Or are they giving a lot of information and leaving the other party to figure out what's important? We (humans) have preferences over this, but whether these agents do will be interesting, along with how that works - is it based on volume or ratios or frequency?

I'm also gesturing at a continuum here - providing a little information versus all of it.

Extreme examples: a) The agents are connected in a communications network. Though distributed (and not all nodes are connected), they share all information.* b) A protocol is developed for sending only the minimum amount of information. Messages read like "101" or "001" or just a "0" or a "1", and are rarely sent.

2) What does beneficial mean? Useful? Can "true" information be harmful in this setting? (One can imagine an agent, which upon receiving the information "If you press that button you will lose 100 points", will become curious, and press the button.)

3) Truthful - absolutely, or somewhat? Leaving aside "partial truths"/"lying by omission", do influencers tend towards "the truth" or something else? Do they give more information that is useful for both parties? Or do they say 'option B is better than option A', then 'option C is better than option B', and continue on in this manner (since each message is rewarded) instead of skipping straight to 'option Z is the best'?

The paper says it's intrinsic motivation so that might not be a problem. I'm surprised they got good results from "try to get other agents to do something different", but it is the borrowing from the structure of causality.

Deep Learning

Ray Interference: a Source of Plateaus in Deep Reinforcement Learning (Tom Schaul et al) (summarized by Cody): The authors argue that Deep RL is subject to a particular kind of training pathology called "ray interference", caused by situations where (1) there are multiple sub-tasks within a task, and the gradient update of one can decrease performance on the others, and (2) the ability to learn on a given sub-task is a function of its current performance. Performance interference can happen whenever there are shared components between notional subcomponents or subtasks, and the fact that many RL algorithms learn on-policy means that low performance might lead to little data collection in a region of parameter space, and make it harder to increase performance there in future.

Cody's opinion: This seems like a useful mental concept, but it seems quite difficult to effectively remedy, except through preferring off-policy methods to on-policy ones, since there isn't really a way to decompose real RL tasks into separable components the way they do in their toy example.

This reminds me of the Starcraft AI, AlphaStar. While I didn't get all the details I recall something about the reason for the population was so they could each be given a bunch of different narrower/easier objectives than "Win the game" like "Build 2 Deathstalkers" or "Scout this much of the map" or "find the enemy base ASAP", in order to find out what kind of easy to learn things helped them get better at the game.

Glancing through the AlphaStar article again, that seemed more oriented around learning a variety of strategies, and learning them well. Also, there might be architecture differences I'm not accounting for.

The neural network weights of each agent are updated by reinforcement learning from its games against competitors, to optimise its personal learning objective. The weight update rule is an efficient and novel off-policy actor-critic reinforcement learning algorithm with experience replay, self-imitation learning and policy distillation.

(Emphasis added.) Well, I guess AlphaStar demonstrates the effectiveness of off-policy methods. (Possibly with a dash of supervised learning, and well, everything else.)

there isn't really a way to decompose real RL tasks into separable components the way they do in their toy example

This sounds like one of those "as General Intelligences we find this easy but it's really hard to program".

*Albeit with two types of nodes - broadcasters and receivers. (If broadcasters don't broadcast to each other, then: 1) In order for everyone to get all the information, the broadcasters must receive all information. 2) In order for all the receivers to get all the information, then for each receiver r, the information held by the set of broadcasters that broadcast to it, b, must include all information.)

[-]Rohin Shah6y20

and this implicitly means there's a constructed group of quasi-altruistic agents who are getting less concrete reward because they're being incentivized by this auxiliary reward.

Suppose Alice and Bob are in an iterated prisoner's dilemma ($2/$2 for both cooperating, $1/$1 for both defecting, and $3/$0 for defect/cooperate). I now tell Alice that actually she can have an extra $5 each time if she always cooperates. Now the equilibrium is for Alice to always cooperate and Bob to always defect (which is not an equilibrium behavior in normal IPD).
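A quick sketch to check this claim on the single-round (stage) game. It assumes the usual reading of the payoffs (the defector gets $3 and the lone cooperator gets $0) and models the bonus as $5 added to Alice's payoff in any round where she cooperates; the function names are made up for illustration. It only looks at pure-strategy equilibria of one round, not at the full space of iterated strategies.

```python
from itertools import product

ACTIONS = ("C", "D")


def stage_payoffs(alice_bonus=0):
    """Payoffs as {(alice_action, bob_action): (alice_payoff, bob_payoff)}."""
    base = {
        ("C", "C"): (2, 2),
        ("C", "D"): (0, 3),  # assumed: lone cooperator gets 0, defector gets 3
        ("D", "C"): (3, 0),
        ("D", "D"): (1, 1),
    }
    # Alice's bonus applies only in outcomes where she cooperates.
    return {
        (a, b): (pa + (alice_bonus if a == "C" else 0), pb)
        for (a, b), (pa, pb) in base.items()
    }


def pure_nash(payoffs):
    """Return all action pairs where neither player gains by deviating alone."""
    equilibria = []
    for a, b in product(ACTIONS, repeat=2):
        pa, pb = payoffs[(a, b)]
        alice_ok = all(payoffs[(alt, b)][0] <= pa for alt in ACTIONS)
        bob_ok = all(payoffs[(a, alt)][1] <= pb for alt in ACTIONS)
        if alice_ok and bob_ok:
            equilibria.append((a, b))
    return equilibria


print(pure_nash(stage_payoffs()))               # [('D', 'D')]
print(pure_nash(stage_payoffs(alice_bonus=5)))  # [('C', 'D')]
```

With the bonus, cooperating is strictly better for Alice whatever Bob does, and Bob's best response to a cooperator is still to defect, which matches the described outcome.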

The worry here is that by adding this extra auxiliary intrinsic reward, you are changing the equilibrium behavior. In particular, agents will exploit the commons less and instead focus more on finding and transmitting useful information. This doesn't really seem like you've "learned cooperation".

This reminds me of OpenAI Five - the way they didn't communicate, but all had the same information.

Note that in the referenced paper the agents don't have the same information. (I think you know that, just wanted to clarify in case you didn't.)

Most of the considerations you bring up are not things that can be easily evaluated from the paper (and are hard to evaluate even if you have the agents in code and can look into their innards). One exception: you should expect that the information given by the agent will be true and useful: if it weren't, then the other agents would learn to ignore the information over time, which means that the information doesn't affect the other agents' actions and so won't get any intrinsic reward.

I'm surprised they got good results from "try to get other agents to do something different", but it is the borrowing from the structure of causality.

I do think that it's dependent on the particular environments you use.

This reminds me of the Starcraft AI, AlphaStar. While I didn't get all the details I recall something about the reason for the population was so they could each be given a bunch of different narrower/easier objectives than "Win the game" like "Build 2 Deathstalkers" or "Scout this much of the map" or "find the enemy base ASAP", in order to find out what kind of easy to learn things helped them get better at the game.

While I didn't read the Ray Interference paper, I think its point was that if the same weights are used for multiple skills, then updating the weights for one of the skills might reduce performance on the other skills. AlphaStar would have this problem too. I guess by having "specialist" agents in the population you are ensuring that those agents don't suffer from as much ray interference, but your final general agent will still need to know all the skills and would suffer from ray interference (if it is actually a thing, again I haven't read the paper).
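A toy sketch of the "shared weights" point, not taken from the paper: two hypothetical skills share a single weight `w`, skill A wants `w = 1`, and skill B wants `w = -1`. One gradient step on skill A alone improves A and makes B worse. The loss functions, names, and learning rate are all assumptions chosen to keep the example tiny, and the sketch does not model the on-policy data-collection effect.

```python
def loss_a(w):
    """Hypothetical skill A: lowest loss at w = 1."""
    return (w - 1.0) ** 2


def loss_b(w):
    """Hypothetical skill B: lowest loss at w = -1."""
    return (w + 1.0) ** 2


def grad_a(w):
    """Derivative of loss_a with respect to the shared weight."""
    return 2.0 * (w - 1.0)


def step_on_skill_a(w, lr=0.1):
    """One gradient-descent step that only looks at skill A's loss."""
    return w - lr * grad_a(w)


w_before = 0.0
w_after = step_on_skill_a(w_before)

print("skill A loss:", loss_a(w_before), "->", loss_a(w_after))  # goes down
print("skill B loss:", loss_b(w_before), "->", loss_b(w_after))  # goes up
```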

This sounds like one of those "as General Intelligences we find this easy but it's really hard to program".

Yup, sounds right to me.

[-]Pattern6y30

Note that in the referenced paper the agents don't have the same information. (I think you know that, just wanted to clarify in case you didn't.)

Yes. I brought it up for this point: (edited)

gesturing at a continuum - providing a little information versus all of it.

A better way of putting it would have been - "OpenAI Five cooperated with full information and no communication; this work seems interested in cooperation between agents with different information and communication."

The worry here is that by adding this extra auxiliary intrinsic reward, you are changing the equilibrium behavior. In particular, agents will exploit the commons less and instead focus more on finding and transmitting useful information. This doesn't really seem like you've "learned cooperation".

That makes sense. I'm curious about what value "this thing that isn't learned cooperation" doesn't capture.

Most of the considerations you bring up are not things that can be easily evaluated from the paper (and are hard to evaluate even if you have the agents in code and can look into their innards). One exception: you should expect that the information given by the agent will be true and useful: if it weren't, then the other agents would learn to ignore the information over time, which means that the information doesn't affect the other agents' actions and so won't get any intrinsic reward.

A better way of putting my question would have been:

Is "useful" a global improvement, or a local improvement? (This sort of protocol leads to improvements, but what kind of improvements are they?)

Most of the considerations you bring up are not things that can be easily evaluated from the paper (and are hard to evaluate even if you have the agents in code and can look into their innards).

Hm, I thought of them as things that would require looking at:

1) Behavior in environments constructed for that purpose.

2) Looking at the information the agents communicate.

My "communication" versus "information" distinction is an attempt to differentiate between (complex) things like:

Information: This is the payoff matrix (or more of it).*

Communication: I'm going for stag next round. (A promise.)

Information: If everyone chooses "C" things are better for everyone.

Communication: If in a round, I've chosen "C" and you've chosen "D", the following round I will choose "D".

Information: Player M has your source code, and will play whatever you play.

*I think of this as being different from advice (proposing an action, or an action over another).

[-]Rohin Shah6y30

I'm curious about what value "this thing that isn't learned cooperation" doesn't capture.

It suggests that in other environments that aren't tragedies of the commons, the technique won't lead to cooperation. It also suggests that you could get the same result by giving the agents any sort of extra reward (that influences their actions somehow).

Is "useful" a global improvement, or a local improvement?

Also not clear what the answer to this is.

Hm, I thought of them as things that would require looking at:

1) Behavior in environments constructed for that purpose.

2) Looking at the information the agents communicate.

The agents won't work in any environment other than the one they were trained in, and the information they communicate is probably in the form of vectors of numbers that are not human-interpretable. It's not impossible to analyze them, but it would be difficult.

[-]PhilGoetz2y20

So, "mesa" here means "tabletop", and is pronounced "MAY-suh"?

[-]Pattern6y00

Errata:

it if also

if it also

[-]Rohin Shah6y20

Fixed, thanks.
