An exploration of exploitation bias

Eric Neyman

This is a linkpost for https://ericneyman.wordpress.com/2020/07/05/an-exploration-of-exploitation-bias/

This is a map of Nassau Street, the northern edge of Princeton University.

It’s a very standard sort of street; I imagine one quite like it exists in most college towns. It has lots of great places to eat, shown on the map in orange.

During my senior year, because of Princeton’s absurdly expensive meal plan, it made financial sense for me to eat dinner at these restaurants. And so, over my time at Princeton, I visited these places around 200 times in total, often by myself.

And yet, I don’t recall ever once deciding to try out a new place by myself. Whenever I went somewhere for the first time, it was always because of a friend’s initiative. Over 90% of my visits to Nassau Street have been to one of just four places: Mamoun’s, Ajiten, Panera, and Tacoria, each of which I’ve been to maybe 50 or so times.

Moreover, when I went to one of these restaurants, I would always get the same thing. The first time I went to Panera I tried their creamy tomato soup, liked it, and never tried anything else. The first time I went to Ajiten, I tried their chicken baitan ramen, liked it, and never tried anything else (well, except the one time I worked up the willpower to try one of their sushi plates). Same story for Mamoun’s and Tacoria.

This is, from what I understand, quite unusual. And it’s not that I’m a particularly picky eater. So what’s going on here?

***

The multi-arm bandit problem is a classic problem in computer science. Imagine you have a machine with ten buttons, each of which gives you some (possibly random) amount of money when you press it. These buttons are different — maybe one gives you a random reward between 5 and 7 dollars, and another one gives you nothing with 90% probability and 100 dollars with 10% probability — but you don’t know what they do ahead of time. You are allowed to press whatever buttons you want, but can only press buttons 100 times in total. What should your strategy be if you want to maximize your reward?

The precise answer depends on the assumptions you make about the distributions of the rewards, but the basic idea is always the same. At each time step, you should do one of two things:

(1) Explore: choose a button you haven’t pressed all that much — so you’re pretty uncertain about how good it is — and press it. The value of doing this isn’t so much the reward you expect to get from the button, but rather the information and potential for more future reward. If you press the button and it gives you a lot of money, that’s really good to know, because maybe it will be worth pressing more in the future.

(2) Exploit: choose a really good button — one that you’ve pressed a lot before and has given you a lot of money on average — and press it. In this case, the value of pressing the button is the reward itself. It’s not so much the extra information you gain, because you’ve already pressed the button a lot; you know it’s good, and you know roughly how good it is.

(I know that “exploit” often has a negative connotation. I’m using it as a purely technical term that means going with an option that you know has a high reward.)

The key to having a good strategy is having the right balance of exploration and exploitation. In the beginning you want to explore a lot: you should probably try pressing all ten buttons at least once, and the most promising buttons more than that. Then you should gradually shift to exploiting, spending most of your presses on the button you think is best. Toward the end of your 100 presses, when new information about other buttons won’t do you much good, you should exclusively be pressing your favorite button.
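To make the explore-then-exploit shape concrete, here is a minimal sketch in Python. It is not the optimal strategy, just one simple rule of the kind described above: press every button once, then explore with a probability that shrinks linearly to zero by the last press. Only the first two buttons come from the example above; the other eight, the linear decay schedule, and the names `make_buttons` and `play` are assumptions made up for illustration.

```python
import random

def make_buttons(rng):
    """Ten hypothetical buttons; each is a function returning a dollar reward."""
    buttons = [
        lambda: rng.uniform(5, 7),                      # steady button from the text
        lambda: 100.0 if rng.random() < 0.10 else 0.0,  # long-shot button from the text
    ]
    # The remaining eight are invented: uniform rewards around a random mean.
    for _ in range(8):
        mean = rng.uniform(1, 9)
        buttons.append(lambda m=mean: rng.uniform(m - 1, m + 1))
    return buttons

def play(buttons, presses=100, rng=random):
    """Press buttons `presses` times, exploring less and less as time runs out."""
    counts = [0] * len(buttons)    # how often each button has been pressed
    totals = [0.0] * len(buttons)  # total reward collected from each button
    earned = 0.0
    for t in range(presses):
        untried = [i for i, c in enumerate(counts) if c == 0]
        # Assumed schedule: exploration probability falls linearly to 0 on the last press.
        explore_prob = 1.0 - (t + 1) / presses
        if untried:
            choice = untried[0]                  # explore: try every button once
        elif rng.random() < explore_prob:
            choice = rng.randrange(len(buttons)) # explore: press a random button
        else:
            # exploit: press the button with the best average reward so far
            choice = max(range(len(buttons)), key=lambda i: totals[i] / counts[i])
        reward = buttons[choice]()
        counts[choice] += 1
        totals[choice] += reward
        earned += reward
    return earned, counts
```

Calling `play(make_buttons(random.Random(0)), rng=random.Random(1))` returns the total earned and how many times each button was pressed. Smarter rules spend their exploration on the most promising buttons rather than on random ones, but the overall arc is the same.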

***

The multi-arm bandit problem is, of course, a mathematical formulation of the Princeton restaurant problem. You’re a freshman at Princeton. You’ll eat maybe 200 meals in a restaurant during your time there. Each time you decide to eat out, how do you decide whether to get something you’ve had before or to try something new?

(In practice this problem could be more complicated if, for example, you prefer not to eat at the same restaurant multiple times in a row. For the most part, I happen to not have this preference.)

And the answer is: start by exploring. Go to all the different restaurants — or at least the ones that have something you might like and are within your budget. In fact, go to each restaurant multiple times and try different things! Then, after you’ve had a lot of time to figure out which menu items at which restaurants you liked the most, get them again. Shift more and more toward exploitation, and toward the end of your senior year spend most of your time ordering things you’ve already tried and you know are good.

Well anyway, that’s the right answer. There are also mediocre answers to this problem. And then there are totally wrong answers. My approach fell squarely into that last category. I tried a meal — chicken baitan ramen at Ajiten — liked it, and that was what I had for my next 30 meals. Then at some point I went out to eat at Mamoun’s with a group of friends, really liked their shawarma, and spent my next 50 meals eating that. There was essentially no exploration at all on my part, just exploitation.

I was given a math problem, four years to think about it, and I thoroughly failed to solve it, to even get close to solving it. How did this happen?

Before trying to explain what went wrong inside my brain, I’m going to give it a name: exploitation bias. Exploitation bias is when, presented with a real-life instance of the multi-arm bandit problem, you exploit substantially earlier and more frequently than is optimal. Conversely, one could imagine having an exploration bias. That’s the opposite extreme, where you spend your whole time exploring and don’t spend much time enjoying the options you liked the most. I’m not sure exploration bias actually exists: everyone I’ve talked to about this has said that they exploit more than they ought to. But it’s good to keep in mind that this is a bias that one could hypothetically have.

It’s natural to ask whether it makes sense to speak of “exploitation bias” as something that a person either has or does not have; it’s possible that the tendency to over-exploit isn’t correlated across different categories. Based on personal experience, it does seem to make sense to treat exploitation bias as a general phenomenon: I have a strong exploitation bias in many areas. Besides restaurants, I have this bias when it comes to academic subjects (I mostly just took classes in subjects I already knew I liked going into college) and music (well over 90% of the music I’ve listened to in the past year has been by Taylor Swift). Moreover, I also have empirical evidence that people who exploit more than average in one area are more likely to do so in other areas; I’ll get to that evidence in a bit.

But so far, all I’ve done is give this phenomenon a name; I’d be more interested in actually explaining exploitation bias, perhaps by connecting it to more basic or already established traits. I came up with two explanations of exploitation bias that seemed promising to me. The first of these is pretty intuitive:

Hypothesis 1: exploitation bias is caused by a lack of openness to experience. “Openness to experience” is one of the Big Five personality traits. (The Big Five Personality Test is the only system of characterizing personalities that has the stamp of approval of Big Science.) Wikipedia says:

Openness to experience is a general appreciation for art, emotion, adventure, unusual ideas, imagination, curiosity, and variety of experience. People who are open to experience are intellectually curious, open to emotion, sensitive to beauty and willing to try new things.

Maybe people who score lower on openness to experience also have stronger exploitation biases, because they are less inclined to try new things.

My second hypothesis needs more explanation.

Hypothesis 2: people tend to over-exploit for the same reason that they tend to procrastinate.

George Akerlof, an economist who would go on to win the Nobel Prize, was living in India for a year when he was visited by a close friend named Joseph Stiglitz (who would also go on to win the Nobel Prize). Unfortunately, all did not go as planned when Stiglitz departed for America:

Because of unexpected limitations on carry-on luggage at the time of his departure, he left with me a box of clothes to be sent to him in the United States. Both because of the slowness of transactions in India and my own ineptitude in such matters, I estimated that sending this parcel would take a full day’s work.

The sooner Akerlof could send Stiglitz the package, the better. And yet:

Each morning for over eight months I woke up and decided that the next morning would be the day to send the Stiglitz box.

So… if there’s something that separates Nobel Prize winners like Akerlof from the rest of us, it sure isn’t his ability to consistently get things done in a timely manner. Instead, it might be his ability to introspect about these sorts of inabilities. Akerlof thought about the episode in terms of a decision tree, where every day he could decide either to send the package or to procrastinate:

Every day, the best decision was to send the package today; after all, he would have to send the package eventually no matter what, and meanwhile he would keep Stiglitz waiting. And yet each day he decided to procrastinate, concretely deciding that he would send the package tomorrow.

Akerlof posited that he kept making the same mistake each day because of a consistent bias: costs and benefits in the immediate future always seemed larger to him than the same costs and benefits at a later point. So instead of plainly seeing the decision tree for what it was and sending the package on the very first day, his brain processed the decision in front of him as something more like this:

That is, his brain treated costs in the immediate future as twice as large as they actually were. This caused him to evaluate the cost of sending the package today (6) as larger than the total cost of sending the package tomorrow (5), and so on day 1 he decided that he would send the package on day 2. Then day 2 arrived, and his brain made the same mistake as it did the previous day, and so forth. In this way, Akerlof procrastinated until almost the very end of his time in India.
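The arithmetic can be sketched in a few lines of Python. The underlying costs here are assumptions: sending the package costs 3 and each day of delay costs 1, values chosen only because they reproduce the 6 and the 5 above. The factor of 2 on today's costs is from the text. The function names are made up for illustration.

```python
SEND_COST = 3       # assumed cost of the day's work needed to mail the box
DELAY_COST = 1      # assumed cost of keeping Stiglitz waiting one more day
PRESENT_WEIGHT = 2  # costs incurred today feel twice as large (from the text)

def perceived_costs(weight=PRESENT_WEIGHT):
    """Costs of 'send today' and 'send tomorrow' as they look this morning."""
    send_today = weight * SEND_COST                  # the whole cost falls on today
    send_tomorrow = weight * DELAY_COST + SEND_COST  # only the delay falls on today
    return send_today, send_tomorrow

def day_sent(days_available, weight=PRESENT_WEIGHT):
    """Day on which the box is sent if the decision is re-made every morning."""
    for day in range(1, days_available):
        today, tomorrow = perceived_costs(weight)
        if today <= tomorrow:
            return day
    return days_available  # last day: there is no tomorrow left to defer to

print(perceived_costs())          # (6, 5): tomorrow always looks cheaper
print(perceived_costs(weight=1))  # (3, 4): without the bias, today is cheaper
```

Under these assumptions `day_sent(240)` is 240, while an unbiased decider (`weight=1`) sends on day 1. The same comparison comes up every morning, which is why the mistake repeats.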

The bias of over-weighting costs and benefits in the immediate future is called present bias, and it is what causes students to procrastinate on their homework. It is also what causes us to go to bed later than we should (I say, writing this at 3:23 AM). “Yeah, I have to get up early tomorrow and I’ll be tired if I don’t go to bed now, but what about all this fun stuff I could be doing now?”

Present bias could also explain why people choose to exploit more than they should. For example, let’s say I’m at Mamoun’s for the second time, and I’m deciding between the shawarma (which I had last time and I know is really good) and the chicken kebob (which may or may not be good, but in the unlikely event that it’s even better than the shawarma, that would be really good to know for future reference). The rational thing to do is to explore: try the kebob. But instead, because of present bias, I think “eh, I’ll explore next time I’m here” and get the shawarma again. And then the next time I’m at Mamoun’s, I also say to myself, “eh, I’ll explore next time I’m here”, and so on. My mistake was remarkably similar in nature to Akerlof’s; perhaps it also had the same root cause.

When I first planned this post, I was going to say that I favored Hypothesis 1 over Hypothesis 2 because of anecdotal evidence and leave it at that. But then I realized that I could try to test the hypotheses with a survey.

I made a survey with a few different types of questions. First, I asked three questions — all on a 1 to 5 scale — to quantify respondents’ tendency to explore:

When listening to music, how much do you play songs you haven’t heard before?
When going out to eat at a restaurant, how often do you try a restaurant you haven’t been to before or a menu item you haven’t had before?
When choosing classes in college, how often do/did you take classes in subjects that were new to you / that you didn’t already know you liked?

As I suspected (and as alluded to earlier), responses to these questions were pairwise positively correlated. The correlations weren’t very strong, but this weakly confirmed my prior that it makes sense to speak of a person’s general tendency to explore or to exploit.

Second, I asked three questions — again on a 1 to 5 scale — to quantify respondents’ level of present bias:

How likely are you to be late to a meeting? (If it matters, assume the meeting is happening at 3PM.)
When you have an assignment for school or work that you know you won’t enjoy doing, how much do you typically procrastinate?
(Reversed) You have to get up an hour earlier than usual for a meeting tomorrow. How much earlier than normal are you likely to go to bed?

Likewise, answers to these questions were also pairwise correlated in the manner you’d expect (the first two questions were positively correlated and each negatively correlated with the third question), though the correlations were weak.

Using the answers to the three exploration/exploitation questions, I created a 0 to 12 exploitation index. Using the answers to these last three questions, I created a 0 to 12 present bias index.¹ Under Hypothesis 2, these two indices would likely be positively correlated.
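For concreteness, here is how the two indices could be computed from one respondent's answers, following the formulas given in footnote 1. The function and parameter names are invented for this sketch; each answer is assumed to be an integer from 1 to 5.

```python
def exploitation_index(music, restaurants, classes):
    """15 minus the sum of the three exploration answers: 0 (explorer) to 12 (exploiter)."""
    return 15 - (music + restaurants + classes)

def present_bias_index(lateness, procrastination, earlier_bedtime):
    """First two answers minus the reversed third, plus 3 to keep the result non-negative."""
    return lateness + procrastination - earlier_bedtime + 3

# A respondent who answers 1 to every exploration question gets the maximum index.
print(exploitation_index(1, 1, 1))   # 12
# Always late, always procrastinating, never going to bed earlier: maximum present bias.
print(present_bias_index(5, 5, 1))   # 12
```

Both indices bottom out at 0: `exploitation_index(5, 5, 5)` and `present_bias_index(1, 1, 5)` are both 0.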

I also asked respondents to take this Big Five personality test. I was primarily interested in openness to experience, so I could test Hypothesis 1 (which posits that openness to experience is negatively correlated with exploitation), though I was also curious if there would be correlations with any other personality traits. In particular I suspected a negative correlation between present bias and conscientiousness.

Finally, I asked about respondents’ age and gender just to see if there were any other interesting findings. You can see a summary of survey responses here.

So, what were the results of the survey? Disappointingly, I found no evidence to support either of my two hypotheses. There was no correlation in the survey data between openness to experience and exploitation index…

…and no statistically significant correlation between the present bias index and exploitation index.

So unfortunately, the survey did not help me understand exploitation bias better than I did previously, except to provide evidence against the two explanations I had come up with.

I did, however, find a few other interesting tidbits in the data. First, as I suspected, present bias negatively correlates with conscientiousness in my data.

The p-value for this finding is 0.0008, meaning that if in fact there’s no relationship between conscientiousness and present bias in the underlying population, there’s only a 0.08% chance of seeing such a strong negative relationship in the sample just by coincidence. This value is low enough that I feel justified concluding that present bias does negatively correlate with conscientiousness.
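One way to make that definition concrete is a permutation test: shuffle one variable so that any real relationship is destroyed, and count how often the shuffled data shows a correlation at least as negative as the observed one. This is only an illustration of what a p-value means. It is not necessarily how the 0.0008 above was computed, and the function names are made up.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists (no zero-variance check)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def permutation_p_value(xs, ys, trials=10_000, seed=0):
    """Share of shuffles whose correlation is at least as negative as the observed one."""
    rng = random.Random(seed)
    observed = pearson(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)  # break any real link between xs and ys
        if pearson(xs, shuffled) <= observed:
            hits += 1
    return hits / trials
```

Given two lists such as each respondent's conscientiousness score and present bias index, `permutation_p_value(conscientiousness, present_bias)` estimates the one-sided p-value for a negative relationship.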

It also turns out (as I probably would have posited if you had asked me beforehand) that present bias negatively correlates with age in my data.

The p-value for this finding is 0.01. Removing the two 40-year-olds from the data doesn’t change this conclusion. On the other hand, there was no relationship between age and conscientiousness in the data. I also did not find a relationship between age and exploitation bias.

Third, I found that in my sample, men were significantly more inclined to exploit than women. Men had an average exploitation index of 8.2, compared to 6.4 for women; the p-value of this finding was 0.0014.² (Women in the sample had an average present bias of 6.4, slightly higher than 5.6 for men, but the p-value here is 0.11 so I’m not convinced that this is real.)

Finally, I found some minor evidence that exploitation bias was negatively correlated with extroversion, and that present bias was negatively correlated with emotional stability. These were both on the margin of statistical significance. If I were asked beforehand to guess at the directions of these correlations, this is probably what I would have guessed, but I wouldn’t read too much into this because if you test ten hypotheses (the two indices vs. the Big Five personality traits), you’d expect to find a result that’s (naïvely) close to statistical significance.
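A quick calculation shows why. It assumes the ten tests are independent and uses the conventional 0.05 threshold; neither assumption is stated in the survey itself, and the function name is made up.

```python
def chance_of_any_false_positive(tests, alpha=0.05):
    """Probability that at least one of `tests` independent null results has p < alpha."""
    return 1 - (1 - alpha) ** tests

print(round(chance_of_any_false_positive(10), 2))  # 0.4
```

So even if none of the ten relationships were real, there would be roughly a 40% chance of at least one of them crossing the threshold by luck.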

So, what to make of the data’s lack of support for my two hypotheses? The simplest explanation is just that the hypotheses are wrong. In that case, maybe there’s a different satisfying explanation for exploitation bias that I’m missing, or maybe exploitation bias should be treated as a primitive, like I’ve been treating present bias and the Big Five.

Alternatively, maybe I didn’t get positive results because I didn’t do a good enough job of measuring the variables. As I noted earlier, “openness to experience” is shorthand for the amalgamation of several traits, including intellectual curiosity and open-mindedness; maybe I’d get a positive result in the vein of Hypothesis 1 if I were able to isolate the “openness to literal real-world experiences” trait. And maybe it would help if I had a larger and wider variety of questions to measure exploitation and present bias.

In any case, I consider the nature of exploitation bias to be unresolved and meriting further study. Let me know if you have any other hypotheses!

1. A respondent’s exploitation index was 15 minus the sum of their answers to the three exploration/exploitation questions. A respondent’s present bias index was the sum of their answers to the first two present bias questions, minus their answer to the third question, plus 3 (so as to keep the index non-negative).

2. The p-value here is for the alternative hypothesis that these two averages are different (rather than that men have a higher average), because I didn’t have much of a prior in one direction or another on this.
