[Team Update] Why we spent Q3 optimizing for karma

by Ruby, 7th Nov 2019



In Q3 of 2019, the LessWrong team picked the growth of a single metric as our only goal. For the duration of this quarter, the overwhelming consideration in our decision-making was what would most increase the target metric.

Why target a simple, imperfect metric?

The LessWrong team pursues a mixture of overlapping long-term goals, e.g. building a place where people train and apply rationality, building a community and culture with good epistemics, and building technologies which drive intellectual progress on important problems.

It’s challenging to track progress on these goals. They’re broad and difficult to measure, we don’t have complete agreement on them, and they change very slowly, providing an overall poor feedback loop. If there’s a robust measure of “intellectual progress” or “rationality skills learnt” which doesn’t break down when being optimized for, we haven’t figured it out yet. We’re generally left pursuing hard-to-detect things and relying on our models to know that we’re making progress.

Though this will probably continue to be the overall picture for us, we decided that for three months it would be a good exercise for us to attempt maximizing a metric. Given that it’s only three months, it seemed relatively safe [1] to maximize a simple metric which doesn’t perfectly capture everything we care about. And doing so might have the following benefits:

  • It would test our ability to get concrete, visible results on purpose.
  • It would teach us to operate with a stronger empirical feedback loop.
  • It would test how easily we can drive raw growth [2], i.e. see what rate of growth we get for our effort.
  • The need to hit a clear target would introduce a degree of urgency and pressure we are typically lacking. This pressure might prompt some novel creativity.
  • Targeting a metric is what YC advises their startups and it seems worth following that school of thought and philosophy for a time (see excerpts below).

So we decided to pick a metric and optimize for it throughout Q3.

[1] We had approval for this plan from our BDFL/admin, Vaniver. For extra safety, we shared our plans with trusted-user Zvi and told him we'd undo anything on the site he thought was problematic. We ran the plan by others too, but stopped short of making a general announcement lest this confound the exercise.

[2] Historically the team has been hesitant to pursue growth strategies out of fear that we could grow the site in ways which make it worse, e.g. eroding the culture while Goodharting on bad metrics. Intentionally pursuing growth for a bit is a way to test the likelihood of accidentally growing the site in undesirable ways.

Choosing a metric

The team brainstormed over fifty metrics with some being more likely candidates than others. Top contenders were number of posts with 50+ karma/week, number of weekly logged-in users, and number of people reading the Sequences. 

(We tried to be maximally creative, however, and the list also included MealSquares sold, impact-adjusted plan changes, and LessWrong t-shirts worn. Maybe we'll do one of those next time.)

Ultimately, the team decided to target a metric derived from the amount of karma awarded via votes on posts and comments. Karma is a very broad metric and the amount given out can be increased via multiple methods, all of which we naively approve of increasing, e.g. increasing the number of posts, number of comments, and number of people reading posts and voting. This means that by targeting the amount of karma given out, we’re incentivizing ourselves to increase multiple valuable other “sub-metrics”.

Design of the metric 

We did not target the raw amount of karma given out but instead a slightly modified metric:

  1. Remove all votes made by LessWrong team members
  2. Multiply the value of all downvotes by 4x
  3. Aggregate karma to individual posts/comments and raise the magnitude to the power of 1.2

Clause #2 was chosen to disincentivize the creation of demon threads which otherwise might produce a lot of karma in their protracted, heated exchanges. 

Clause #3 was chosen to heighten the reward/punishment for especially good or especially bad content. We’re inclined to think that a single 100-karma post is worth more than four 25-karma posts and the exponentiation reflects this. (For comparison: 25^1.2 is 47.6, 100^1.2 is 251.2. So in our metric, one 100-karma post was worth about 30% more than four 25-karma posts).

In developing the metric, we experimented with a few different parameters and checked them against our gut sense of how valuable different posts were.

[There’s some additional complexity in the computation in that the effect of each vote is calculated as the difference in the karma metric of a post/comment before and after the vote. This is necessary to compute changes in the metric nicely over time but makes no difference if you compute the metric for all time all at once.]
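The three clauses and the per-vote difference described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the team's actual implementation: the vote representation (voter, item, signed vote power), the `TEAM_MEMBERS` set, and all function names are hypothetical, and it assumes the 1.2 exponent is applied to the magnitude of each item's aggregated karma while keeping its sign.

```python
from collections import defaultdict

TEAM_MEMBERS = {"team_member_a"}  # hypothetical ids of LessWrong team accounts
DOWNVOTE_WEIGHT = 4               # clause 2: downvotes count 4x
EXPONENT = 1.2                    # clause 3: exponent applied per post/comment


def item_score(raw_karma):
    """Raise the magnitude of an item's karma to the power 1.2, keeping its sign."""
    sign = 1 if raw_karma >= 0 else -1
    return sign * abs(raw_karma) ** EXPONENT


def karma_metric(votes):
    """Compute the metric from (voter_id, item_id, power) tuples.

    power is the signed strength of the vote; a negative power is a downvote.
    """
    totals = defaultdict(float)
    for voter_id, item_id, power in votes:
        if voter_id in TEAM_MEMBERS:
            continue  # clause 1: ignore votes made by team members
        weight = DOWNVOTE_WEIGHT if power < 0 else 1
        totals[item_id] += weight * power
    return sum(item_score(total) for total in totals.values())


def vote_effect(karma_before, weighted_vote):
    """Change in one item's score caused by a single (already weighted) vote."""
    return item_score(karma_before + weighted_vote) - item_score(karma_before)
```

Under these assumptions, `karma_metric([("u1", "p1", 100)])` gives roughly 251.2, while four separate items with 25 karma each give roughly 4 × 47.6, matching the comparison in the text. Summing `vote_effect` over an item's votes in order telescopes to the item's final score, which is why the per-vote bookkeeping makes no difference when the metric is computed for all time at once.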

Following Paul Graham’s advice, we targeted 7% growth in this metric per week throughout Q3. This is equivalent to increasing the metric by 2.4x. Since PG’s advice was a major influence on us here, I’ll include a few excerpts [emphasis added]:

A good growth rate during YC is 5-7% a week. If you can hit 10% a week you're doing exceptionally well. If you can only manage 1%, it's a sign you haven't yet figured out what you're doing.

...

In theory this sort of hill-climbing could get a startup into trouble. They could end up on a local maximum. But in practice that never happens. Having to hit a growth number every week forces founders to act, and acting versus not acting is the high bit of succeeding. Nine times out of ten, sitting around strategizing is just a form of procrastination. Whereas founders' intuitions about which hill to climb are usually better than they realize. Plus the maxima in the space of startup ideas are not spiky and isolated. Most fairly good ideas are adjacent to even better ones.

...

The fascinating thing about optimizing for growth is that it can actually discover startup ideas. You can use the need for growth as a form of evolutionary pressure. If you start out with some initial plan and modify it as necessary to keep hitting, say, 10% weekly growth, you may end up with a quite different company than you meant to start. But anything that grows consistently at 10% a week is almost certainly a better idea than you started with.
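As a quick check on the 2.4x figure quoted for the 7%-per-week target, weekly growth can be compounded over the quarter. The sketch below assumes a quarter is 13 weeks; the function name is illustrative.

```python
WEEKS_IN_QUARTER = 13  # assumption: the quarter is treated as 13 weeks


def compound_growth(weekly_rate, weeks):
    """Total growth multiple after compounding weekly_rate for the given number of weeks."""
    return (1 + weekly_rate) ** weeks


print(round(compound_growth(0.07, WEEKS_IN_QUARTER), 2))  # 2.41
```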

What we did to raise the metric

At the highest level, we wanted to increase the number of posts, increase the number of comments, and increase the number of people viewing and voting. Major projects we worked on towards this included:

  • The launch of Shortform
    • We’d been experiencing demand for Shortform and metric quarter seemed like a good time to introduce a new section of the site with lower cost to entry.
  • Subscriptions
    • We failed to launch this during metric quarter, but we envisioned that subscriptions would increase the content people read and vote on.
  • Reaching out to authors
    • We reached out to a number of people who currently or previously have written top content for LessWrong to find out how we could help them write more.
  • Setting up automatic cross-posting for top authors
    • For authors whose material is a good fit for LessWrong, we reached out to them and asked about having their posts automatically cross-posted to LessWrong.
  • Removing the 90-day login expiry so that people stay signed in and able to vote/comment/post.
  • Making it easier to create an account or sign in.
  • The LessLong Launch party.
    • We hosted a large party in Berkeley both to push the launch of Shortform and more generally to signal LessWrong’s activity and happeningness.

Other activities in this period which contributed were:

  • Petrov Day
  • MIRI Summer Fellows Program

These projects contributed significantly to the metric, but we would probably have done them even if we weren’t targeting the metric. (In truth, the same can be said for everything else we did.)

Targeting the metric did cause us to delay some other projects. For instance, we deprioritized reducing technical debt and new analytics infrastructure this quarter.

How did we do?

Summary

While our target was 7%/week growth, we achieved growth equivalent to 2%/week. As far as hitting the stated target went, we unambiguously failed. 

In retrospect, 7% was probably a mistaken target. We perhaps should have been comparing ourselves to LessWrong's historical growth rates, which we did in fact exceed. Our actual growth in this period of 2%/week over 3-4 months is higher than LessWrong's typical rate of growth throughout most of its history, which was at best equivalent to 0.5%-1%/week. Compounded over three months, that's the difference between 15% and 29% growth.

(LessWrong grew between 2009 and 2012 at around 0.5-1.0%/week and then began declining until 2017 when the LW2.0 project was started.)

However, the 7% target was probably still a good choice when we began. It was conceivably achievable and worth testing. For one thing, historically LessWrong didn't have a full-time team deliberately working to drive growth. Given the resources we were bringing to bear, it was worth testing if we could dramatically outperform the more "natural" historical growth rates. We have learnt that the answer, unfortunately, is "not obviously or easily."

Also, notwithstanding the failure to hit the target, the exercise still helped with our goals of becoming more empirical, using a stronger feedback loop, being more creative, feeling more urgency, testing hypotheses about growth, and generally applying and testing more of our models against reality. I think the experience has nudged our decision-making and processes in a positive direction even as we return to pursuing long-term, slow-feedback, difficult-to-measure objectives.

Detailed Analysis

[Figure: Weekly Karma Metric Values + Target Growth Lines]

The first graph (above) shows the karma metric each week and displays clearly that the value does not go up monotonically, but rather fluctuates a fair bit, usually related to the occurrence of events like Petrov Day, Alignment Writing Day, or the publication of controversial posts. This is normal for all the metrics, yet makes it difficult to discern overall trends.

We can apply a 4-week moving average filter in order to smooth the graph and see the trend a little better.

[Figure: Weekly Karma Metric Values with 4-week Moving Average smoothing]

The latter graph shows the overall increase since July, i.e., the beginning of our "metric quarter." However, smaller differences at the weekly level become larger differences at the monthly and quarterly level.
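The 4-week moving average used for the smoothed graph can be sketched as follows. This is a minimal illustration that assumes a trailing window (each week averaged with the three weeks before it); the post does not say whether the window was trailing or centered, and the sample values are made up, not real karma data.

```python
def trailing_moving_average(values, window=4):
    """Average each value with up to window-1 preceding values.

    The first few points use a shorter window because fewer weeks are available.
    """
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up weekly values for illustration only.
weekly_metric = [4000, 5200, 3900, 6100, 4800, 7000]
smoothed_metric = trailing_moving_average(weekly_metric)
```

A spike from a single event week is spread across four points in the smoothed series, which is what makes the underlying trend easier to see.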

[Figure: Karma Metric Aggregated Quarterly]
[Figure: Karma Metric Aggregated Monthly]

The value of the karma metric in Q3 was 39% higher than in Q2, 71k vs 51k. This is the largest growth since the LW2 project began in late 2017 and first launched in 2018.

In the summary, I stated that this growth compares favorably to LessWrong's historical growth rates. Due to changes in how karma is computed introduced with LW2, we can't compare our karma metric backwards in time. Fortunately, the number of votes is a good proxy.

[Figure: Number of Votes (Weekly)]

In absolute terms, LW2.0 has some catching up to do; growth-wise we compare nicely. The number of votes on LessWrong grew dramatically between 2009 and 2012 as can be seen in the graph. Growth in votes was 87% in 2009, 41% in 2011, 24% in 2012 and 92% in 2018 [3]. Those are the growth numbers for the entire years and correspond to average weekly growth rates of 1.2%, 0.7%, 0.4%, and 1.3%.

In comparison to that, growing votes by 40% in just one quarter (= 2.5%/week) is pretty good. The real question is whether this growth will be sustained. Yet so far so good. October saw the highest level of the karma metric so far in 2019.
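The conversion from whole-year growth to an average weekly rate used above is the inverse of compounding. A minimal sketch, assuming 52 weeks per year (the function name is illustrative):

```python
def weekly_rate(total_growth, weeks):
    """Average weekly growth rate that compounds to total_growth over the given weeks."""
    return (1 + total_growth) ** (1 / weeks) - 1


# Annual vote growth figures quoted in the text.
for year, growth in [(2009, 0.87), (2011, 0.41), (2012, 0.24), (2018, 0.92)]:
    print(year, f"{weekly_rate(growth, 52):.1%}")
```

With these inputs the sketch reproduces the quoted weekly rates of 1.2%, 0.7%, 0.4%, and 1.3%. The same function applies to a single quarter, where the result depends on how many weeks the quarter is taken to contain.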

We didn't hit 7%, but it's heartening that seemingly we managed to do something.

[3] Other years saw negative growth ranging between -20% and -65%.

What contributed to our performance?

As above, Q3 was 39% higher on the target metric relative to Q2, going from 51k to 71k. We can examine the contribution of our different activities to this. Where did the extra 20k come from?

Shortform

Karma granted to Shortform posts and comments amounted to 5.5k KM (karma metric), or 7.7% of the total score for Q3 and 25% of the difference between Q2 and Q3. This is not fully counterfactual, since we can assume Shortform cannibalized some activity from elsewhere on the site; however, there has definitely been net growth.

We see that the total number of comments (including all Shortform activity and all responses to questions) grew since July due to the introduction of Shortform while the number of regular comments did not shrink.

Petrov Day

Our Petrov Day commemoration had an outsized impact with the two posts plus their comments (1, 2) together generating 2.4k KM, or 3.3% of total karma for Q3 and 12% of the difference from Q2 to Q3. 

It was a very good return on time spent by the team.

Author Outreach

In the hope of causing there to be more great content, we reached out to a number of authors to see what we could do to get them posting. A lower bound on the KM we achieved this way is 2.7k, or 3.5% of total / 13.5% of difference.

AI Alignment Writing Day

The posts from MSFP writing day generated 3.0k KM, or 4.2% of total / 15% of the difference. However this is definitely something we would have done anyway and is not obviously something we should count as a special intentional activity to drive karma.
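The "% of total / % of difference" figures in this section are each activity's KM divided by the Q3 total and by the Q2-to-Q3 increase. A minimal sketch using the rounded totals from the text (71k and 51k); the function name is illustrative, and because the inputs are rounded, results for other activities may differ slightly from the quoted percentages.

```python
def contribution_shares(activity_km, q3_total, q2_total):
    """Return (share of the Q3 total, share of the Q2-to-Q3 increase)."""
    share_of_total = activity_km / q3_total
    share_of_difference = activity_km / (q3_total - q2_total)
    return share_of_total, share_of_difference


# AI Alignment Writing Day: 3.0k KM against rounded quarterly totals.
total_share, diff_share = contribution_shares(3_000, 71_000, 51_000)
print(f"{total_share:.1%} of total / {diff_share:.0%} of difference")
```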

Novum Organum

The posting of the Novum Organum sequence was motivated by having more content to get more karma. The five posts posted in Q3 netted 0.3k KM, or 0.4% of total / 1.5% of difference. Not that impactful on the metric.

Removing Login Expiry

Vulcan, the framework upon which LW2.0 was built, automatically signed people out after 90 days. This would require them to log in again before voting, commenting, or posting. We removed this and the number of logged-in users rose for several months, going from 600 logged-in users/week to over 800 logged-in users/week. This seems to have flowed through to the number of unique people voting each week.

Making Login Easier

The more people are logged in, the more people can vote, comment, and post. We improved the login popup and added more prompts for login to the site. There was no large or definite change in the rate of logins after this.

LessLong (Shortform) Launch Party

It’s unclear whether this party drove much immediate activity on LessWrong in Q3. We hosted this primarily off the model that it would be good to do something that made LessWrong seem really alive.

Subscriptions & LessWrong Docs (our new editor)

Though we worked on subscriptions and the new editor throughout Q3, our failure to release these means that they naturally didn’t generate any karma in Q3. Planning fallacy? (Subscriptions overhaul is now out, new editor is nearing beta release.)

Overall, how well did we accomplish our goals for this exercise?

Above I listed multiple reasons it would be a good idea to target a metric for a quarter, and regardless of how well we maximized the metric, we can still ask if we achieved the goals one level up.

It would test our ability to get concrete, visible results on purpose.

It would teach us to operate with a stronger empirical feedback loop.

I think the exercise helped along this dimension. 

  • The metric target had the team [4] looking at our dashboard every day and regularly asking questions about the impact of different posts.
  • We were forced to make plans based on the short-term predictions of our models. This enabled us to learn where we were wrong.
    • For example, I overestimated the amount of KM generated by Novum Organum and underestimated that from Petrov Day.
  • Even after the end of Metric Quarter, team members want to continue to monitor the numbers and include these as an input to our decision-making.

[4] With the exception of Ben Pace who was in Europe for most of the period.

It would test how easily we can drive raw growth [2], i.e. see what rate of growth we get for our effort.

By putting almost all of our effort into growth for three months, we were definitely able to make the metric jump up some. This was most salient with time-bound events like specific high-engagement posts or events like Petrov Day, yet seems to be true of long-term features like Shortform too (however Shortform has gotten a little bit quiet lately - I'll be looking into that).

At the same time, getting growth was not super easy. We're unlikely to 10x the site unless we try quite hard for some time. This causes me to conclude that it's unlikely that we ought to fear growth: things probably won't happen so quickly that we'll be unable to react. My personal leaning is that we should always be trying to grow at least a little bit, if only to keep from shrinking.

The need to hit a clear target would introduce a degree of urgency and pressure we are typically lacking.

This effect was real. We definitely experienced sitting around, looking at the metric, and thinking: how are we going to make it go up this week? Unfortunately, this effect flagged a little after the first month when some of us became pessimistic about maintaining the 7% target. I think if the target had been one where it continued to seem like we had a shot, we'd have continued to feel more pressure to not fall below it. Overall though, we did keep trying to hit it.

I found myself working harder and longer on projects I enjoy less but thought would be more impactful for the metric. This makes me wonder how much my choice of usual slow-feedback, less-constrained activities is decided by the pleasantness of the activities. It feels like a wake-up call to really be asking myself about what actually matters all the time.

Final Take-aways

We're back to our more usual planning style where we're optimizing for long-term improvement of difficult-to-measure quantities, but I think we're retaining something of the empirical spirit of trying to make predictions about the results of our actions and comparing this to what actually happens.
