Work by Quinn Dougherty, Ben Greenberg & Ariel Kwiatkowski
From the AI Safety Camp group that worked on Cooperativity and Common Pool Resources: a write-up of a problem we worked on, our proposed solution, and the results of implementing this solution in a simulated environment.
Check out our GitHub repo here.
0. Abstract
When multiple parties share access to a finite resource, they may overuse it, leading to a tragedy of the commons. A simple way of mitigating this problem is to let participants decrease the effective population through violence; this, however, is not an acceptable solution, for obvious reasons. We study interventions for avoiding these outcomes in environments with multiple self-interested agents. In particular, a reputation system can incentivize agents to “cooperate” by harvesting the resource more sustainably. This system promotes multi-agent cooperation without modifying the agents’ reward functions.
1. Problem description: tragedy of the commons
A common pool resource (CPR) is a good which anyone can use but which no one can entirely control. When competition over the resource increases, individual parties are incentivized to appropriate as much of the resource as possible for themselves, which further increases competition, potentially leading to overconsumption. This process is commonly known as the tragedy of the commons, and has been extensively studied in economics and the social sciences.
A tragedy of the commons often leads to the exhaustion of a CPR. The classic example is of fishermen in an enclosed body of water; when fish are scarce, an every-man-for-himself dynamic emerges. Any available fish is quickly caught, for fear that a hesitant fisherman’s counterparts would take it first. This behavior results in the fish population failing to replenish, and everyone catching fewer fish in the long run.
Even if an individual would prefer to fish sustainably, they are punished for doing so unilaterally. This is apparent in the prisoner’s dilemma, where overfishing is analogous to “defection” and sustainable fishing to “cooperation”. In order to make cooperation a viable strategy, other participants must agree to follow suit.
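To make the dilemma concrete, here is a toy per-round payoff table (the numbers are illustrative assumptions, not taken from any model discussed in this post):

```python
# (my catch, their catch) for each pair of (my action, their action).
payoffs = {
    ("sustain", "sustain"):   (3, 3),  # the stock replenishes; both do well
    ("sustain", "overfish"):  (0, 5),  # unilateral restraint is punished
    ("overfish", "sustain"):  (5, 0),
    ("overfish", "overfish"): (1, 1),  # the stock collapses; everyone catches less
}
# Whatever the other player does, "overfish" yields more (5 > 3 and 1 > 0),
# so it dominates, even though mutual restraint beats mutual overfishing.
```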
Even if several parties do manage to coordinate, their incentives remain the same, unless non-cooperative parties are held accountable for their behavior. The best individual outcome for any player is to convince their counterparts to cooperate — then continue defecting, if they can get away with it. This is related to the free-rider problem, wherein agents profit from others’ cooperation without cooperating themselves.
If participants cannot coordinate in this way, they may resort to coercion or violence. After all, the most effective way to eliminate competition over resources may be to eliminate one’s competitors. While this can, in fact, lead to stable consumption and avoid the tragedy of the commons, it is undesirable from a societal point of view.
One solution might be to keep track of each participant’s “reputation”, as determined by their history of cooperation and defection. An agent with a good reputation is known to cooperate often, especially with other high-reputation agents. An agent with a bad reputation, on the other hand, is known for defecting against trustworthy agents, and therefore should not be trusted. Reputation can stem both from consumption behavior and from the level of violence. An equitable solution would be for all participants to cooperate peacefully, using violence only as a defense mechanism against agents that are expected to defect.
Our goal in this project was to see if a transparent reputation system would allow agents to trust each other enough to cooperate, such that their combined rewards would be higher over time, while at the same time avoiding excessive violence. This problem may be important for existential safety as it relates to climate change and sustainability, as well as conflict over finite resources.
2. Related Work
Elinor Ostrom discusses tragedies of the commons in her book Governing the Commons: The Evolution of Institutions for Collective Action.
Zhu and Kirley present a CPR game wherein multiple agents harvest a finite renewable resource. Agents seek to maximize a weighted combination of two objectives: personal wealth (i.e. the value of the resource they appropriate, minus the cost of this effort) and sustainability (i.e. a measure of whether the resource is being depleted). Because the replenishment rate grows logistically with the size of the resource pool, agents must cooperate (without communicating) to avoid exhausting the resource. While this paper solves a problem similar to the one we investigate, we found the prospect of reward function modification to be unrealistic; most real-world CPR problems will likely not be solved by convincing the participants to harvest more sustainably.
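For intuition, this kind of logistic replenishment can be sketched in a few lines (the growth rate and capacity below are illustrative; the exact functional form in Zhu and Kirley may differ):

```python
def replenish(stock: float, growth_rate: float = 0.25, capacity: float = 100.0) -> float:
    """One time step of logistic regrowth: fastest at intermediate stock levels."""
    return stock + growth_rate * stock * (1.0 - stock / capacity)
```

Because regrowth collapses as the stock approaches zero, a group that over-harvests early forfeits most of its future yield; this is exactly the dynamic the agents must learn to avoid.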
Leibo et al. defined a sequential social dilemma (SSD) as a Markov game which abstracts a repeated matrix game. In an SSD, sequences of actions are labeled as “cooperative” or “defective” based on empirical game-theoretic analysis. With this approach, we can use reinforcement learning (RL) algorithms to learn policies in simulated environments, and analyze the resultant incentive structures using the tools of game theory.
Most existing simulations of CPRs involve normal-form games, where, in each time step, agents simply choose to harvest a scalar amount of the resource. In 2017, DeepMind applied the SSD framework to CPR games, allowing agents to learn complex strategies for gathering apples in a grid world. We implemented a copy of this environment for our experiments; the mechanics are described in the next section.
We find that, though this paper presents a new paradigm for training groups of agents to interact in CPR environments, as well as tools for measuring a group’s “cooperativity”, it does not offer a satisfactory solution to the safety problems it raises. The researchers found that the tagging action increases rewards over time: “Conflict between agents in the commons game has the effect of lowering the effective population size and thus relieving pressure on the CPR stock. With less agents harvesting at any given time, the survivors are free to collect with greater impunity and less risk of resource depletion.” Furthermore, they saw exclusion/privatization of resources as a useful tactic for increasing sustainability at the expense of equality. Needless to say, these “solutions” to scarcity have alarming implications for real-world problems — though to be fair, prescribing “safe” solutions is not among the explicit aims of the paper, which offers a descriptive account of behavior in CPR problems.
Finally, the document Open Problems in Cooperative AI provides an overview of problems related to cooperation in multi-agent environments. Section 4.4.3 covers institutions such as reputation systems for building trust between multiple autonomous parties.
3. Experiment and results
Our main goal was to demonstrate how a reputation system could incentivize cooperative solutions to CPR problems.
To do this, we replicated the Harvest environment from Perolat et al. The environment consists of a grid of cells, some of which contain apples. At each time step, agents can move up, down, left, or right, or rotate to either side, and they collect an apple by moving into a cell that contains one. An apple can regrow in its original cell, with a higher probability if there are several apples nearby. Furthermore, agents are equipped with a wide “tagging beam” that can disable a competitor for 25 time steps. The episode ends when all apples have been gathered, or after 1000 time steps. You can see a video of an episode here.
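As a rough sketch of the regrowth mechanic (the neighborhood radius and probability table are illustrative assumptions, not the exact constants from Perolat et al.):

```python
import numpy as np

# Respawn probability keyed by the number of nearby apples (capped at 3);
# these constants are assumptions for illustration.
REGROW_PROB = {0: 0.0, 1: 0.005, 2: 0.02, 3: 0.05}

def regrow_step(apples: np.ndarray, spawn_mask: np.ndarray,
                rng: np.random.Generator, radius: int = 2) -> np.ndarray:
    """apples: binary grid (1 = apple present); spawn_mask marks the cells
    where apples originally grew. Returns the grid after one regrowth step."""
    new = apples.copy()
    h, w = apples.shape
    for r in range(h):
        for c in range(w):
            if apples[r, c] == 1 or not spawn_mask[r, c]:
                continue  # only empty cells that once held an apple can regrow
            window = apples[max(0, r - radius):r + radius + 1,
                            max(0, c - radius):c + radius + 1]
            nearby = int(window.sum())
            if rng.random() < REGROW_PROB[min(nearby, 3)]:
                new[r, c] = 1
    return new
```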
The input to each agent is a set of binary matrices, one for each “layer” of the board: apples, walls, and agents. Visibility is partial, limited to a slice of the board in front of the agent. We added an additional layer for (centralized) reputation scores, so that each agent could make decisions based on the reputations of the agents in its field of view.
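A minimal sketch of how such a reputation layer can be stacked onto the binary layers (the function and array conventions are ours for illustration, not the exact code from our repo):

```python
import numpy as np

def build_observation(apple_view: np.ndarray, wall_view: np.ndarray,
                      agent_id_view: np.ndarray,
                      reputation: dict[int, float]) -> np.ndarray:
    """Stack apples, walls, agents, and per-cell reputation into a (4, H, W)
    tensor. agent_id_view holds an agent ID per cell, with -1 for empty cells."""
    agent_layer = (agent_id_view >= 0).astype(np.float32)
    rep_layer = np.zeros_like(agent_layer)
    for r, c in zip(*np.nonzero(agent_id_view >= 0)):
        rep_layer[r, c] = reputation[int(agent_id_view[r, c])]
    return np.stack([apple_view.astype(np.float32),
                     wall_view.astype(np.float32),
                     agent_layer, rep_layer])
```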
In the original environment, agents have no way of recognizing which other agents are contributing to the tragedy of the commons dynamic. We added a centralized reputation system so agents could learn policies that depended on the “cooperativity” of their nearby counterparts. Ideally, “exploitative” agents would be given lower reputation scores than “cooperative” ones; this would enable the latter to coordinate effectively, and “tag” agents more likely to exploit the resource. We hoped that over time, agents would learn that tagging their exploitative counterparts resulted in greater sustainability and overall reward. The factors we tested for the reputation system included:
- aggression, i.e. how often an agent used the tagging beam on others;
- resource depletion, i.e. whether an agent greedily harvested high-value apples.
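A minimal sketch of a hardcoded update rule along these lines (the penalty, bonus, and scarcity-threshold constants are illustrative assumptions, not our exact values):

```python
def update_reputation(rep: float, tagged_someone: bool,
                      harvested: bool, apples_nearby: int) -> float:
    """One per-time-step reputation update for a single agent."""
    if tagged_someone:
        rep -= 0.10   # aggression: using the tagging beam lowers reputation
    if harvested and apples_nearby <= 1:
        rep -= 0.05   # resource depletion: harvesting from a scarce patch
    elif not tagged_someone:
        rep += 0.01   # sustainable, peaceful steps slowly restore reputation
    return min(1.0, max(0.0, rep))  # keep scores in [0, 1]
```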
We trained groups of 10 heterogeneous agents to selfishly maximize their own rewards in a custom environment, in which agents are rendered as blue squares and apples as green squares. We ran 3 trials: a baseline with the reputation system disabled, a run with the reputation system enabled, and a run with the tagging action removed.
Overall, incorporating a reputation system slightly decreased the rate of tagging. However, the reputation system had a negligible effect on rewards, sustainability (i.e. the average time step during which an apple was harvested), and equality (i.e. the Gini coefficient of rewards), especially compared to removing tagging entirely.
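For concreteness, the two metrics as defined above can be computed as follows (a sketch; `harvest_times` and `rewards` are hypothetical names for logged episode data):

```python
import numpy as np

def sustainability(harvest_times: list[int]) -> float:
    """Average time step at which apples were harvested; later is more sustainable."""
    return float(np.mean(harvest_times)) if harvest_times else 0.0

def gini(rewards) -> float:
    """Gini coefficient of final per-agent rewards (0 = perfect equality).
    Assumes the total reward is positive."""
    r = np.sort(np.asarray(rewards, dtype=float))
    n = r.size
    cum = np.cumsum(r)
    return float((n + 1 - 2 * cum.sum() / cum[-1]) / n)
```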
Basing the reputation score on aggression was ineffective. In thinking about tragedies of the commons, we had assumed that violence and selfish behavior would correlate with lower cumulative rewards. However, we observed that the tagging mechanic actually led to increased cumulative rewards, at the cost of lower sustainability. So if the reputation system “punishes” tagging, tagging agents earn low reputations while still doing well, and agents may actually learn to cooperate with their low-reputation counterparts.
The other factor we considered, resource depletion, was also ineffective for avoiding the tragedy of the commons dynamic. In our implementation, it is not clear that “greedier” agents (i.e. those which readily harvest high-value apples) are tagged at a higher rate than their more conservative counterparts.
4. Retrospective
At the beginning of this project, we considered using a reputation mechanism to modify agents’ individual rewards directly; this was the approach taken in Zhu & Kirley. However, we judged this mechanism to be an unrealistic intervention for avoiding real-world tragedies of the commons. Our alternative strategy was to build a reputation system for measuring “cooperativity”, and hope that agents would learn to reward prosocial behavior.
Now, it is clear that our alternative strategy was much more difficult than we had anticipated. Agents did not learn that defecting against defectors, or cooperating with cooperators, yielded greater individual reward. Though agents were trained heterogeneously, differences between individual policies may have been too small to allow agents to “learn” cooperativity. If we had identified this as a potential crux earlier in the project, we could have modified our approach after the current instantiation failed to produce any change in cumulative reward.
Instead, we spent several months (the bulk of the AISC5 program) implementing our own version of the Harvest environment. This work was undoubtedly useful, as we were able to create a modular and customizable environment for running experiments. However, waiting months to run the first reputation system experiment made it difficult to iterate as the final presentations drew near.
In retrospect, we could have modified DeepMind’s own implementation of Harvest, rather than building our own environment from scratch. On the bright side, we got lots of practice debugging the custom-built multi-agent RLlib environment.
------
Thanks to Max Chiswick, Remmelt Ellen, Caspar Oesterheld, and all of the AISC organizers for their advice and support for this project.