Notes on an Experiment with Markets

Jeffrey Heninger, 22 November 2022

AI Impacts is a research group with seven employees. From Oct 31 to Nov 3, we had a work retreat. We decided to try using Manifold Markets to help us plan social events in the evenings. Here are some notes from this experiment.

Structure of the Experiment

Katja created a group on Manifold Markets for AI Impacts, and an initial collection of markets. Anyone could add a market to this group, and five of us created at least one market. Each of us would rate each evening from 0 to 10 on an anonymous Google form. Most of the questions in the group were about the results of the form, often conditional on what activity we would do that evening. For example: “On the first day that at least 4 people begin a game of One Night Werewolf at the AI Impacts retreat, will the average evening rating be above 8?” The markets would resolve at some point the next morning after we had submitted our forms and Katja calculated the average evening rating.
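As a rough illustration of how a market like this resolves, here is a minimal Python sketch. The function name `resolve_market` and the example ratings are hypothetical (they are not the actual retreat data), and it assumes that "above 8" means strictly greater than 8.

```python
from statistics import mean


def resolve_market(ratings, threshold=8.0):
    """Resolve a market of the form "will the average evening rating be above X?".

    `ratings` holds the 0-10 scores from the anonymous form.
    Assumption: "above" means strictly greater than the threshold.
    """
    if not ratings:
        raise ValueError("no ratings were submitted")
    for rating in ratings:
        if not 0 <= rating <= 10:
            raise ValueError(f"rating {rating} is outside the 0-10 scale")
    return "YES" if mean(ratings) > threshold else "NO"


# Hypothetical ratings from seven people; their average is exactly 8.
example_ratings = [9, 8, 7, 9, 8, 6, 9]
print(resolve_market(example_ratings))  # prints "NO"
```

An average of exactly 8 resolves "NO" under this reading, which shows how sensitive a threshold market can be to a single person's score.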

Disagreements about the Experiment

There were several disagreements about how the experiment was supposed to be run.

Initially, the role of the evening rating form was unclear. Was it asking for your honest assessment of the evening, or was it part of the game? “What number would you like to assign to the evening?” is different from “How good was your evening honestly?” We decided that we wanted honest responses. Even then, the numbers were ambiguous. What constitutes a 7 evening vs. a 9 evening? Different people’s baselines result in different scores, which can alter the average. After the first evening, we had a better estimate of the baseline. Many of the markets had used a threshold of an average score above 8, which was higher than the baseline. This made the markets feel less useful, instead of shifting the predictions to lower probabilities while remaining useful. It’s not clear why this happened, but it might have been because we didn’t want to bet against ourselves having a good time or because the tail of an unknown distribution is harder to predict than the middle of the distribution.

One morning, Katja told us the average score before resolving her markets. Zach used this information to bet on these markets. Rick thought that it was unclear whether this should be allowed, because not everyone was there and because the previous discussion about honest ratings suggested that we should ask before doing something that might give an advantage independent of prediction ability. We decided that this would not be allowed in the future, and that we would not share the average score before the markets were resolved.

Unrealized Potential Problems

We thought of several other potential problems that did not end up being an issue.

One potential concern was that the interplay between the dynamics of the market and social events might make the socialization worse. Someone who had bet against having a good evening might have less reason to want the evening to be enjoyable for themselves and others. If people spent time during the evening thinking about and frequently betting on the markets, it might disrupt the ongoing activities. In practice, while people did bet on the markets in the evening, it did not disrupt the other activities.

We had several other ideas for how to mess up the markets: filling out the anonymous form multiple times, colluding or bribing people to alter their scores, publicly filling out your form before the evening begins to manipulate the market, and purposely trying to thwart other people’s clever strategies. None of us tried doing any of these, but they might become relevant if the stakes were higher. There is also the concern that conditional and counterfactual predictions are not the same: For decision making, we would like to compare various counterfactuals, but it’s easier to make markets which are conditional on us doing something. If we decide to do that thing, it is probably because at least some of us want to do it, so the conditional prediction will be higher than the counterfactual prediction.
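The gap between conditional and counterfactual predictions can be seen in a toy simulation. The sketch below is only an illustration under made-up assumptions: the group's enthusiasm for an activity varies from evening to evening, the activity only happens when enthusiasm is positive, and the rating depends on enthusiasm plus noise. None of the numbers come from our retreat, and the function name `simulate` is hypothetical.

```python
import random


def simulate(n_evenings=10_000, seed=0):
    """Compare the average rating on evenings where the activity was chosen
    with the average rating if the activity were done every evening."""
    rng = random.Random(seed)
    chosen_ratings = []  # evenings where the group decided to do the activity
    forced_ratings = []  # every evening, as if the activity were imposed
    for _ in range(n_evenings):
        # Assumption: how much the group wants the activity tonight.
        enthusiasm = rng.gauss(0, 1)
        # Assumption: rating is a baseline of 7 plus enthusiasm plus noise,
        # clipped to the 0-10 scale of the form.
        rating = max(0.0, min(10.0, 7 + enthusiasm + rng.gauss(0, 0.5)))
        forced_ratings.append(rating)
        if enthusiasm > 0:  # the activity only happens if people want it
            chosen_ratings.append(rating)
    conditional = sum(chosen_ratings) / len(chosen_ratings)
    counterfactual = sum(forced_ratings) / len(forced_ratings)
    return conditional, counterfactual


conditional, counterfactual = simulate()
print(f"average rating given that we chose the activity: {conditional:.2f}")
print(f"average rating if we always did the activity:   {counterfactual:.2f}")
```

In this toy model, the conditional average comes out higher than the counterfactual one purely because of selection: the activity never changes how good an evening is, it only happens on evenings when people already wanted it.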

What We Did in the Evening

The goal of the markets was to help us plan out social events in the evenings. If the market predicted that the evening’s rating was more likely to be high if we wore Halloween costumes than if we used the hot tub, then we should decide to wear Halloween costumes.

People mostly did not use the markets to decide what to do. On the first evening, the highest rated activity was a guitar sing-along. We did not end up doing that on any of the evenings. The activity that seems to have been the most fun for the most people^[1] was cooperative round-the-table ping-pong. This was done spontaneously, adding more people as they came to the table, without any market predicting the result. We spent a decent amount of time just sitting around talking to each other, which also did not have a market. Our decision making process seemed to be less formal: someone would suggest an activity or say that they would personally do the activity, and other people would join. Having someone look at the markets and announce which activity rated the highest would have added more steps and organization compared to what we did.

We also tried varying the structure of the markets to see if that made them more useful. For example, the market “Will we use the hot tub and have fun tonight?” had four choices for the combinations of whether or not at least four people would use the hot tub and whether the average evening rating would be above or below 7.^[2] Katja did use this market to argue that people should use the hot tub.
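One way to read a market with joint outcomes like this is to convert its four probabilities into the two conditional probabilities that matter for the decision: the chance of a good evening given that we use the hot tub, and the chance given that we do not. The sketch below uses made-up probabilities, not the prices from our market. The function name is hypothetical, and it ignores the "exactly 7" choice for simplicity.

```python
def conditional_from_joint(joint):
    """Turn joint probabilities into P(rating above 7 | hot tub used)
    and P(rating above 7 | hot tub not used).

    `joint` maps (used_hot_tub, rating_above_7) to a market probability.
    """
    total = sum(joint.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive number")
    # Renormalize in case the four prices do not sum exactly to 1.
    normalized = {outcome: p / total for outcome, p in joint.items()}
    p_tub = normalized[(True, True)] + normalized[(True, False)]
    p_no_tub = normalized[(False, True)] + normalized[(False, False)]
    if p_tub == 0 or p_no_tub == 0:
        raise ValueError("cannot condition on an outcome with zero probability")
    return normalized[(True, True)] / p_tub, normalized[(False, True)] / p_no_tub


# Hypothetical market probabilities for the four combinations.
example_joint = {
    (True, True): 0.30,    # hot tub used, average rating above 7
    (True, False): 0.20,   # hot tub used, average rating below 7
    (False, True): 0.20,   # hot tub not used, average rating above 7
    (False, False): 0.30,  # hot tub not used, average rating below 7
}
good_given_tub, good_given_no_tub = conditional_from_joint(example_joint)
print(f"P(above 7 | hot tub):    {good_given_tub:.2f}")
print(f"P(above 7 | no hot tub): {good_given_no_tub:.2f}")
```

With these invented numbers, the market would favor using the hot tub, which is the kind of argument Katja made. The result is still a conditional prediction, so the caveat about conditional versus counterfactual predictions applies here too.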

There seem to have been a few things that kept the markets from being more useful: (1) Most of us did not know what kinds of social activities most of the rest of us preferred, so it was hard for anyone to make an informed bet. It wasn’t clear how the market provided more information than if we had used a voting system. (2) The connection between four people doing an activity and the average evening rating was too weak for much of a signal to go through. The ratings ended up being noisy, and not specific enough for particular activities. (3) The act of checking the markets and announcing a decision was more formal than our actual decision making process. The market only included a short list of possibilities and did not suggest spontaneity.

Conclusion

Having prediction markets for the evening social activities was a fun addition to the AI Impacts retreat. There were about 20 markets about the retreat, and most of the people at the retreat bet on them. But the markets did not end up having a significant impact on what we did during the evening.

Most of us did not have experience using prediction markets before the retreat. We decided not to use the markets to make important decisions, because we did not know what problems they would cause. The markets would likely have been more impactful if we were more experienced and if the questions were about more important decisions. If we did use the markets for important decisions, we would have to make sure that the markets are harder to exploit and have more rules and fewer norms governing how we would bet on the markets.

Since the retreat, Katja has used a market to help plan an AI Impacts dinner. We plan to continue experimenting with using prediction markets to make predictions in the future.

Notes

^[1] Three people rated the evening 9.

^[2] A fifth choice for the average evening rating being exactly 7 was added by someone else.