[Team Update] Why we spent Q3 optimizing for karma

I'm gonna heckle a bit from the peanut gallery...

First, trying to optimize a metric without an A/B testing framework in place is kinda pointless. Maybe the growth achieved in Q3 was due to the changes made, but looking at the charts, it looks like a pretty typical quarter. It's entirely plausible that growth would have been basically the same even without all this stuff. How much extra karma was actually generated due to removing login expiry? That's exactly the sort of thing an A/B test is great for, and without A/B tests, the best we can do is guess in the dark.

Second (and I apologize if I'm wrong here), that list of projects does not sound like the sort of thing someone would come up with if they sat down for an hour with a blank slate and asked "how can the LW team get more karma generated?" They sound like the sort of projects which were probably on the docket anyway, and then you guys just checked afterward to see if they raised karma (except maybe some of the one-shot projects, but those won't help long-term anyway).

Third, I do not think 7% was a mistaken target. I think Paul Graham was right on this one: only hitting 2% is a sign that you have not yet figured out what you're doing. Trying to optimize a metric without even having a test framework in place adds a lot of evidence to that story - certainly in my own start-up experience, we never had any idea what we were doing until well after the test framework was in place (at any of the companies I've worked at). Analytics more generally were also always crucial for figuring out where the low-hanging fruit was and which projects to prioritize, and it sounds like you guys are currently still flying blind in that department.

So, maybe re-try targeting one metric for a full quarter after the groundwork is in place for it to work?

[-]habryka6y200

First, trying to optimize a metric without an A/B testing framework in place is kinda pointless. Maybe the growth achieved in Q3 was due to the changes made, but looking at the charts, it looks like a pretty typical quarter. It's entirely plausible that growth would have been basically the same even without all this stuff. How much extra karma was actually generated due to removing login expiry? That's exactly the sort of thing an A/B test is great for, and without A/B tests, the best we can do is guess in the dark.

I don't think A/B testing would have really been useful for almost any of the above. Besides the login stuff all the other things were social features that don't really work when only half of the people have access to them. Like, you can't really A/B test shortform, or subscriptions, or automatic crossposting, or Petrov Day, or MSFP writing day, which is a significant fraction of things we worked on. I think if you want to A/B test social features you need a significantly larger and more fractured audience than we currently have.

I would be excited about A/B tests when they are feasible, but they don't really seem easily applicable to most of the things we build. If you do have ways of making it work for these kinds of social features, I would be curious about your thoughts, since I currently don't really see much use for A/B tests, but do think it would be good if we could get A/B test data.

[-]Ruby6y130

Heckling appreciated. I'll add a bit more to Habryka's response.

Separate from the question of whether A/B would have been applicable to our projects, I'm not sure why think it's pointless to try to make inferences without them. True, A/B tests are cleaner and more definitive, and what we observed is plausibly what would have happened even with different activities, but that isn't to say we don't learn a lot when the outcome is one of a) metric/growth stays flat, b) small decrease, c) small increase, d) large decrease, e) large increase. In particular, the growth we saw (increase in absolute and rate) is suggestive of doing something real and also strong evidence against the hypothesis that it'd be very easy to drive a lot of growth.

Generally, it's at least suggestive that the first quarter where we explicitly we focus on growth is one where we see 40% growth from last quarter (compared to 20% in the previous quarter to the one before). It could be a coincidence, but I feel like there are still likelihood ratios here.

When it comes to attribution too, with some of these projects it's easy to get much more of an idea even without A/B testing. I can look at the posts from authors who we contacted and reasonably believe counterfactually would not have otherwise posted and see how much karma that generated. Same from Petrov Days and MSFP.

[-]johnswentworth6y150

Responding to both of you here: A/B tests are a mental habit which takes time to acquire. Right now, you guys are thinking in terms of big meaty projects, which aren't the sort of thing A/B tests are for. I wouldn't typically make a single A/B test for a big, complicated feature like shortform - I'd run lots of little A/B tests for different parts of it, like details of how it's accessed and how it's visible. It's the little things: size/location/wording of buttons, sorting on the homepage, tweaking affordances, that sort of thing. Think nudges, not huge features. Those are the kinds of things which let you really drive up the metrics with relatively little effort, once you have the tests in place. Usually, it turns out that one or two seemingly-innocuous details are actually surprisingly important.

It's true that you don't necessarily need A/B tests to attribute growth to particular changes, especially if the changes are big things or one-off events, but that has some serious drawbacks even aside from the statistical uncertainty. Without A/B tests, we can't distinguish between the effects of multiple changes made in the same time window, especially small changes, which means we can't run lots of small tests. More fundamentally, an A/B test isn't just about attribution, it's about having a control group - with all the benefits that a control group brings, like fine-grained analysis of changes in behavior between test buckets.

[-]cousin_it6y*300

I think incremental change is a bit overrated. Sure, if you have something that performs so well that chasing 1% improvements is worth it, then go for it. But don't keep tweaking forever: you'll get most of the gains in the first few months, and they will total about +20%, or maybe +50% if you're a hero.

If your current thing doesn't perform so well, it's more cost-effective to look for big things that could bring +100% or +1000%. A/B tests are useful for that too, but need to be done differently:

Come up with a big thing that could have big impact. For example, shortform.
Identify the assumptions behind that thing. For example, "users will write shortform" or "users will engage with others' shortform".
Come up with cheap ways to test these assumptions. For example, "check the engagement on existing posts that are similar to shortform" or "suggest to some power users that they should make shortform posts and see how much engagement they get". At this step you may end up looking at metrics, looking at competitors, or running cheap A/B tests.
Based on the previous steps, change your mind about which thing you want to build, and repeat these steps until you're pretty sure it will succeed.
Build the thing.

[-]Ruby6y60

This is roughly the procedure we usually follow.

[-]johnswentworth6y50

This line of thinking makes a major assumption which has, in my experience, been completely wrong: the assumption that a "big thing" in terms of impact is also a "big thing" in terms of engineering effort. I have seen many changes which are only small tweaks from an engineering standpoint, but produce 25% or 50% increase in a metric all on their own - things like making a button bigger, clarifying/shortening some text, changing something from red to green, etc. Design matters, it's relatively easy to change, but we don't know how to change it usefully without tests.

[-]cousin_it6y40

Agreed - I've seen, and made, quite a few such changes as well. After each big upheaval it's worth spending some time grabbing the low hanging fruit. My only gripe is that I don't think this type of change is sufficient over a project's lifetime. Deeper product change has a way of becoming necessary.

[-]Vaniver6y90

I think the other thing A/B tests are good for is giving you a feedback source that isn't your design sense. Instead of "do I think this looks prettier?" you ask questions like "which do users click on more?". (And this eventually feeds back into your design sense, making it stronger.)

[-]Ruby6y40

I find this compelling (along with the "finding out which things matter that you didn't realize mattered) and think this is a reason for us to begin doing A/B testing sometime in not too distant future.

[-]habryka6y60

Yes, heckling is definitely appreciated!

[-]habryka6y190

Second (and I apologize if I'm wrong here), that list of projects does not sound like the sort of thing someone would come up with if they sat down for an hour with a blank slate and asked "how can the LW team get more karma generated?"

It is a list of projects we prioritized based on how much karma we expect they would generate over the long run, filtered by things that didn't seem like obviously goodharty ideas.

If these don't seem like the things you would have put on the list, what other things would you have put on the list? I am genuinely curious, since I don't have any obvious ideas for what I would have done instead.

[-]Ruby6y90

A number of these projects were already on our docket, but less visible is the projects which were delayed and the fact that those selected might not have been done now otherwise. For example, if we hadn't been doing metric quarter, I'd like have spent more of my time continuing work on the Open Questions platform and much less of my time doing interviews and talking to authors. Admittedly, subscriptions and the new editor are projects we were already committed to and had been working on, but if we hadn't thought they'd help with the metric, we'd have delayed it to the next quarter the way we did with many of other project ideas.

We did brainstorm however, but as Oli said, it wasn't easy to come with any ideas which were obviously much better.

[-]johnswentworth6y140

Responding to both of you with one comment again: I sort of alluded to it in the A/B testing comment, but it's less about any particular feature that's missing and more about the general mindset. If you want to drive up metrics fast, then the magic formula is a tight iteration loop: testing large numbers of small changes to figure out which little things have disproportionate impact. Any not-yet-optimized UI is going to have lots of little trivial inconveniences and micro-confusions; identifying and fixing those can move the needle a lot with relatively little effort. Think about how facebook or amazon A/B tests every single button, every item in every sidebar, on their main pages. That sort of thing is very easy, once a testing framework is in place, and it has high yields.

As far as bigger projects go... until we know what the key factors are which drive engagement on LW, we really don't have the tools to prioritize big projects. For purposes of driving up metrics, the biggest project right now is "figure out which things matter that we didn't realize matter". A/B tests are one of the main tools for that - looking at which little tweaks have big impact will give hints toward the bigger issues. Recorded user sessions (a la FullStory) are another really helpful tool. Interviews and talking to authors can be a substitute for that, although users usually don't understand their own wants/needs very well. Analytics in general is obviously useful, although it's tough to know which questions to ask without watching user sessions directly.

[-]Ruby6y120

I see the spirit of what you're saying and think there's something to it though it doesn't feel completely correct. That said, I don't think anyone on the team has experience with that kind of A/B testing loop and given that lack of experience, we should try it out for at least a while on some projects.

To date, I've been working just to get us to have more of an analytics-mindset plus basic thorough analytics throughout the app, e.g. tracking on each of the features/buttons we build, etc. (This wasn't trivial to do with e.g. Google Tag Manager so we've ended up building stuff in-house.) I think trying out A/B testing would likely make sense soon, but as above, I think there's a lot of value even before it with more dumb/naive analytics.

We trialled FullStory for a few weeks and I agree it's good, but also we just weren't using it enough to justify it. LogRocket offers monthly subscription though and likely we'll sign up for that soon. (Once we're actually using it fully, not just trialling, we'll need to post about it properly, build opt-out, etc. and be good around privacy - already in trial we hid e.g. voting, usernames.)

To come back to the opening points in the OP, we probably shouldn't get too bogged down trying to optimize specific simple metrics by getting all the buttons perfect, etc., given the uncertainty over which metrics are even correct to focus on. For example, there isn't any clear metric (that I can think of) that definitely answers how much to focus on bringing in new users and getting them up to speed vs building tools for existing users already producing good intellectual progress. I think it's correct that have to use high-level models and fuzzier techniques to think about big project prioritization. A/B tests won't resolve the most crucial uncertainties we have though I do think they're likely to hugely helpful in refining our design sense.

[-]johnswentworth6y140

I actually agree with the overall judgement there - optimizing simple metrics really hard is mainly useful for things like e.g. landing pages, where the goals really are pretty simple and there's not too much danger of Goodharting. Lesswrong mostly isn't like that, and most of the value in micro-optimizing would be in the knowledge gained, rather than the concrete result of increasing a metric. I do think there's a lot of knowledge there to gain, and I think our design-level decisions are currently far away from the pareto frontier in ways that won't be obvious until the micro-optimization loop starts up.

I will also say that the majority of people I've worked with have dramatically underestimated the magnitude of impact this sort of thing has until they saw it happen first-hand, for whatever that's worth. (I first saw it in action at a company which achieved supercritical virality for a short time, and A/B-test-driven micro-optimization was the main tool responsible for that.) If this were a start-up, and we needed strong new user and engagement metrics to get our next round of funding, then I'd say it should be the highest priority. But this isn't a startup, and I totally agree that A/B tests won't solve the most crucial uncertainties.

[-]Ruby6y80

Trying to optimize a metric without even having a test framework in place adds a lot of evidence to that story - certainly in my own start-up experience, we never had any idea what we were doing until well after the test framework was in place (at any of the companies I've worked at). Analytics more generally were also always crucial for figuring out where the low-hanging fruit was and which projects to prioritize, and it sounds like you guys are currently still flying blind in that department.

I think I agree with the general spirit here. Throughout my year with the LessWrong team, I've been progressively building out analytics infrastructure to reduce my sense of the "flying blind" you speak of. We're not done yet, but I've now got a lot of data at my fingertips. I think the disagreement here would be over whether anything short of A/B testing is valuable. I'm pretty sure that it is.

[-]Wei Dai6y80

Due to changes in how karma is computed introduced with LW2, we can’t compare our karma metric backwards in time. Fortunately, the number of votes is a good proxy.

I'm not sure about this. At least for me personally, I feel like voting is more costly on LW2 than on LW1, and I probably vote substantially less as a result. (Not totally sure because I haven't kept statistics on my own voting behavior.) The reasons are:

Having to decide between strong vs weak vote.
Having a high enough karma that my vote strengths (3 for weak and 10 for strong) are pretty identifiable, so I have to think more about social implications. (Maybe I shouldn't, but I do.)
Sometimes I'm uncomfortable voting something up or down by at least 3 points because I'm not sure of my judgement of its quality.

Hmm, on second thought the number of people in my position is probably small enough that this isn't likely to significantly affect your "number of votes" comparison. I'll leave this here anyway as general feedback on the voting system. (To be clear I'm not advocating to change the current system, just offering a data point.)

Another thing I've been wondering about is, there's generally less voting per post/comment on LW2 than on LW1, but the karma on comparable posts seems more similar. Could it be that people have inherited their sense of how much karma different kinds of posts/comments "deserve" from LW1 and tend to stop voting up a post once it reaches that amount, which would result in similar karma but fewer votes?

[-]Viliam6y110

Having a high enough karma that my vote strengths (3 for weak and 10 for strong) are pretty identifiable, so I have to think more about social implications.

I think the other comments show that you are not that identifiable.

Having to decide between strong vs weak vote.

Just always do the weak vote and don't think about it.

[-]Kaj_Sotala6y80

To offer another data point, my vote weights are also 3 / 10, and it hasn't occurred to me to think about these things. I just treat my "3" as a "1", and usually only strong-upvote if I get a clear feeling of "oh wow, I want to reward this extra hard" (i.e. my rule is something like "if I feel any uncertainty about whether this would deserve a strong upvote, then it doesn't").

[-]Ben Pace6y50

Having a high enough karma that my vote strengths (3 for weak and 10 for strong) are pretty identifiable, so I have to think more about social implications. (Maybe I shouldn't, but I do.)

Hmm, I was starting to notice that a bit myself, and I think this is especially strong the more vote weight you have, which is an incentive counter to the very point of weighted voting. One option is to obscure some karma things a little to avoid this.

[-]Vaniver6y120

FWIW I don't have this effect (and am also at 3/10). But I think I was also always in the "if I like something at 50, I will upvote it anyway" camp instead of in the "I think this should have a karma of 40, and since it's at 50, I don't need to upvote it" camp.

[-]John_Maxwell6y50

Cool project!

I suggest you make Q3 "growth quarter" every year, and always aim to achieve 1.5x the amount of growth you were able to achieve during last year's "growth quarter".

You could have an open thread soliciting growth ideas from the community right before each "growth quarter".

[-]Thomas Kwa6y20

Clause #3 was chosen to heighten to reward/punishment for especially good or especially bad content. We’re inclined to think that single 100-karma post is worth more than four 25-karma posts and the exponentiation reflects this. (For comparison: 25^1.2 is 47.6, 100^1.2 is 251.2. So in our metric, one 100-karma post was worth about 30% more than four 25-karma posts).

Is the idea behind this that a high-quality post can provide more than a single strong-upvote of value per person, and that total karma is a proxy for this excess value?

[-]shanen5y-10

Pretty sure this comment is going to go badly. Please excuse me for my incoherence, amplified by my limited time. But I have a number of strong reactions. The three strongest are:

(1) I do not want to reduce humans to or be reduced to a single metric. Symmetry violation (of the Golden Rule).

(2) Arbitrary scaling should be avoided by normalization. Most obvious example is weighting down votes by 4. From a symmetry perspective, the weighting should reflect which way the votes are cast and who is casting the votes. (I also think negative votes should be justified, but that's a new aspect.)

(3) Insufficiently detailed accounting for the costs of the project. However I am quite favorably impressed that success criteria were at least considered. (Is cost recovery a symmetry? (But in actionable terms, I don't know if I would have pledged money to implement this project. I'm having trouble seeing it as a step in any positive direction.))

Now for the worst part. I have a delusion of a better solution approach. As a joke, "I know it when I see it" and this isn't it and doesn't even seem to be a step in a "right and proper" direction. I think a simple up-down vote is okay, but should mostly be limited to defining a weight that is applied to a multidimensional vector. I've described it as MEPR elsewhere, but here I'm going to retag it as DK for Deeper Karma.

Defining the direction of that DK vector should involve an optional deeper reaction. Rather than +/- it would involve looking at some dimensions and voting them up or down. As much as possible, the dimensions should be orthogonal and symmetric. (In the OP, this use of "dimension" is close to "metric" selection.)

A few examples of dimensions: Simple dimension of humor, with + for funny and - for unfunny. The age of the identity is a one-way metric, but it can be normalized on a scale from youngest to oldest. Really messy dimension but a dimension for fox versus hedgehog would be interesting. (Considering how IBM and the google analyze identities, there are hundreds of such dimensions, but we human beings have limited attention spans and at any one time the number of dimensions should be limited, perhaps to 5 or 7.)

Now for an elevator ride past the trickiest symmetry. It is necessary to begin by dividing DK in twain. Let's call them DK-A for Artifact and DK-I for Identity (that created the artifact). Now "You will know them by their fruits." Reacting to an artifact will change it's DK-A, and the identity whose comments earned those reactions will have those reactions reflected in the DK-I.

In addition, in the process of giving reactions, the identity's DK-I should be considered. The humor dimension is a simple example. If someone has earned many positive humor reactions, then it should count more when that identity reacts to another artifact by assessing it as +/- humor. In contrast, an identity with negative humor should not be able to affect humor scores much.

Now for the messy bit that seems to confuse people. How can the DK be displayed? I imagine the DK-I should be paired with its identity. I would actually favor a little radar diagram with selected dimensions. Clicking on the identity's own link would take you to whatever the identity wants to say about itself, but clicking on the DK-I icon will take you to the details about all of the data that contributes to the DK-I. (Symmetry time again. The comment (as a public artifact) would have a DK-A link for its data and history.)

Why do this? Now we're getting into the cans of worms by the six-packs of cans. But I can reduce it to two major cans:

(1) My time is limited and there is always much more content than I could read. (Ditto videos or podcasts or pictures or whatever.) DK could help me filter.

(2) I want to become a better person and looking at my own DK-I would be useful feedback in improving. Am I being too much of a prick? For example, I don't want to be rude, but I think too many of my reactions are negative in ways that seem negative on the polite dimension, and my own DK-I would let me get an honest assessment of how rude (or polite) other people think I am.

Already spent much more time than I had intended, but mostly I have to apologize for having taken up so much of your time. I think this description is quite shallow and confusing. If you have been able to follow it then you deserve some kudos and DK-I positive for reading skills' dimensions. Also you are being quite polite to a stranger, and your DK-I should go up for that reason. (Not only are I one, but I'm currently reading L'Étranger with Japanese annotations.)

Oh yeah. One more thing. It would be interesting if the the website itself could pick out DK-A dimensions of interest and relevance. That would simply the rating process if you opted to look at the deeper karma instead of the simple +/- reaction.

[-]Ruby5y20

Thanks for this detailed feedback! I can't delve into properly today, but I hope to look at it soon.

[-]shanen5y10

Is an ACK called for?

I would add one more aspect if I didn't suspect it's a moot topic. The financial side. Someone has to cover the costs of things... I personally favor cost-recovery from wannabe donors and actual beneficiaries. However I think LessWrong may use the big donor model, which only works as long as the donor's pockets stay full and the donor doesn't make too many bad calls.