Reviewing the Review

by Raemon, 26th Feb 2020

We just spent almost two months reviewing the best posts of 2018. It was a lot of development work, and many LW users put in a lot of work to review and vote on things. 

We’ve begun work on the actual printed book, which’ll be distributed at various conferences and events as well as shipped to the featured authors. I expect the finished product to influence the overall effect of the Review. But meanwhile, having completed the “review” part, I think there’s enough information to start asking: 

Was it worth it? Should we do it again? How should we do it differently?

Was it worth it? Should we do it again?

My short answer is “yes and yes.” But I have some caveats and concerns.

My own goals for the Review were:

  1. Actually identify the best posts
  2. Improve longterm incentive structures and feedback systems
    1. Give users a reason to improve old content
  3. Check our collective epistemic state on controversial posts
  4. Figure out how to evaluate blogposts
  5. Create a shared sense of ownership over LW’s intellectual pipeline
  6. Evaluate LessWrong as a site and project

Some of those (1, 4, 6) I feel able to evaluate independently; others depend on how other people felt about the process, and how much value they got for the effort they put in. It also depends on which counterfactual actions the Review is being compared to.

But overall, I personally found the process very rewarding. Nominating, reviewing and voting gave me a clearer sense of how various ideas fit together, and what LessWrong had accomplished in 2018. This involved a fair effort on my part (several hours of re-reading, thinking, comparing), but it felt both enjoyable and worthwhile.

Identifying the best posts

I think the review did a decent job at this. The obvious comparison is “what were the top karma posts of 2018?”. Could we have saved ourselves a ton of work by checking that? This is somewhat confounded by the fact that we changed our karma system partway through 2018 (reducing the power of the average upvote, after initially increasing it at the beginning of LW 2.0). At some point we plan to re-run the voting history using the current upvote strengths, which will give us clearer information there.

Meanwhile, when I compare the top karma posts of 2018 to the top-voted posts in the review, there are some obvious differences. (Most obviously, “Arbital Postmortem” didn’t end up featured in the review, despite being the top-scoring post.)

I think the simple act of filtering on “Which posts had at least some people who had made use of them, and were willing to endorse them?” was a key factor.

I think there was a lot of room to improve here. I felt that the voting process seemed, at least in some cases, more like “how prestigious should this post be?”, rather than giving me a clear sense of “how useful was this post?”. 

Next year, I’d like to experiment with other voting processes that help disentangle “should this post be in a public-facing Best of LW book?” from “Was this post valuable?”, and “Did it reflect good intellectual practices?”.

Improving Longterm Incentives and Feedback

This is hardest to evaluate right now. But it’s also where I’m expecting most of the Review process’s value to lie. A decade from now, my bet is that LessWrong will have demonstrably better epistemics and value if we keep doing some form of a longterm, retrospective review process.

But for this year, we saw at least some outcomes in this space. 

First, as a nominated author, I found it pretty rewarding to see some posts of mine getting discussed, and that some of them had had a longterm impact on people. I’m guessing that generalizes. 

Second, at least 4 people I know of deliberately updated their post for the review. Two of those people were on the LW team, and one of them was me, so, well, I’m not going to count that too strongly. But meanwhile lots of authors gave self-reviews that reflected their updated thoughts. 

Third, we saw a number of reviews that gave critical feedback, often exploring not just the ideas in the post but how the post should fit conceptually into an overall worldview.

Not all of those reviews were clearly valuable. But I think the clearest examples of counterfactually valid and valuable reviews were:

  • Abram’s review of Rationality Realism (where a lot of latent disagreement came to light, followed by in-depth discussion of that disagreement)
  • A review by Bucky which looked into a cited paper in Zvi’s Unknown Unknowns. This was a particular type of intellectual labor that I was hoping would come out of the review process, and which I expected not to happen much by default.

Checking our epistemic state on controversial posts

I think this had more room for improvement. The aforementioned Rationality Realism discussion was great, and I think there was at least some progress in, say, Vaniver’s post on Circling that acted as a sort-of-review for Unreal’s 2018 post.

I don’t have a strong sense that any debates were settled. But, well, I do think it takes an absurdly long time to resolve disagreements even when people are trying hard in good faith.

I think we did at least get a clearer sense of how controversial each post was, from the voting process, and that seems like a good starting place.

Figure Out How To Evaluate Blogposts

There was a diverse array of blogposts. Some of them benefited from conceptual, philosophical debate. Some of them benefited from statistical review of scientific papers.

A few of them had implied empirical claims, which would be pretty hard to check. I still hope that someone investigates more thoroughly Eli Tyre’s question about “Has there been a memetic collapse?”, which looks into some of Eliezer’s assumptions in Local Validity and Is Clickbait Destroying Our General Intelligence?. But, to be fair, it’s a lot of work, it’s confusing how to even go about it, and right now I don’t think we’ve really offered good enough rewards for answering it thoroughly.

Overall, we got a fair number of people who worked on reviews, but a small number of people did most of the work. A couple people noted that reviewing felt like “work”, and I think the strength of internet forums is making intellectual work feel like play.

I am uncertain how to take all of this into account for next year.

Shared sense of ownership over LW’s intellectual pipeline

I don’t have a clear sense of how this worked out. I know that participating in the review process increased my own sense of partial-ownership over the intellectual process, and I have some sense that the other people who participated most heavily in the process felt something of that. But I’m not sure how it worked overall.

This goal was less necessary than some of the other goals but still seems useful for the longterm health of the site.

Evaluate LessWrong as a site

While engaging in the review process, I skimmed posts from 2017 and 2019, as well as digging significantly into the nominated posts from 2018. This gave me some sense of LW’s overall output trajectory. This doesn’t necessarily give us clear common knowledge of the community’s collective epistemic state, but I found it at least useful for myself to form an opinion on “how is LW doing?”

One thing I found was that in 2017, there were relatively few unique authors that I was excited about: many of the most exciting things were posts from Eliezer’s Inadequate Equilibria sequence, which already had been made into a book, and then maybe… 5 other authors among posters I was particularly excited about?

In 2018, there was a much more diverse array of popular authors. Eliezer is present as one of the top contributors, but there were around 40 authors featured in the review, and even if you’re just focusing on the top half that’s a healthier base of authorship.

I think we have a ways to go: 2018 was when LW 2.0 officially launched and it was still hitting its stride. My rough sense (partially informed by the metrics we’ve started tracking) is that 2019 was a slight improvement, and that things have started picking up in particular towards the end of 2019.

In the 2018 Review, my overall sense is that there were many “quite solid” posts on the subject of general rationality, and coordination. Meanwhile, I think a lot of very concrete progress was made on Alignment. It looks (from the outside) like the field went from a position of not really having any common language to communicate, to establishing several major paradigms of thought with clear introduction sequences. 

Some people have expressed some sense that… Alignment posts don’t quite feel like they count. Partly because the Alignment Forum is [sort of] a separate website, and partly because they’re just a lot less accessible. I think it’s admittedly frustrating to have a lot of high-context, technical content that’s hard for the average user to participate in.

But, it still seems quite important, and I think it is sufficient to justify LessWrong 2.0’s existence. Much of it is tightly interwoven with the study of rationality and agency. And much of it is precisely the sort of hard, confusing problem that LessWrong was founded to help solve.

I do hope for more accessible “rationality” oriented content to build momentum on LessWrong. I think some progress was made on that in 2019, and, well, we’ll hopefully see next year how that looked in retrospect.

Problems with Execution

Too much stuff

There were 75 posts that made it into the review process. This seemed roughly the right amount of contenders, but more than people could easily think about at once. One way or another, we need to do a better job of directing people’s attention.

Options I can see include:

  • Have a higher nomination threshold, somehow aiming for closer to 50 posts. (For reference, this year, 23 posts had 3+ nominations rather than 2. I’m assuming more posts would have gotten 3 nominations if we had stated that explicitly as a requirement.)
    • I don’t think it’s reasonable to cull the initial pool to less than 50, especially if post volume grows each year.
  • Somehow culling the nomination pool some other way, partway through the process.
    • I don’t currently have good ideas on how to do this
  • Direct users’ attention to posts they’d previously engaged with (i.e. views, upvotes, comments)
    • I’m pretty confident we’ll do this next year, but it doesn’t seem sufficient
  • Direct people’s attention to a smaller group of posts at a time. Maybe every few days, direct people’s attention to a different set of posts
    • This seems potentially promising, but 75 is still just a lot of posts to get through. It’d be 18 posts per week if the review period was a month, 9 if it were 2 months.
  • Radically restructure the thing somehow (perhaps doing rolling reviews every few months rather than a single all-encompassing one?)
    • Rolling reviews feel… less exciting somehow. But I could imagine this turning out to be the right approach.
  • Randomly assign each user a smaller number of posts to focus on. (Perhaps each user gets 5 or 10 posts they’re supposed to evaluate, and once they’ve evaluated those, they a) get to have their votes counted in the larger tally, and b) are then welcome to review other posts that they’re excited about.)

A lot of the options feel like good ideas, but insufficient. But maybe if you combine them together you get something workable.
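To make the random-assignment option above a bit more concrete, here is a minimal sketch of one way it could work. Everything in it is an assumption for illustration: the function name `assign_posts`, the `per_user` parameter, and the choice to hand out the least-assigned posts first so that coverage stays roughly even. It is not a description of how LessWrong actually assigns anything.

```python
import random
from collections import Counter


def assign_posts(user_ids, post_ids, per_user=5, seed=0):
    """Give each user `per_user` distinct posts, keeping coverage roughly even.

    Hypothetical helper: returns (assignments, load), where `assignments`
    maps user -> list of posts and `load` counts how often each post was
    handed out.
    """
    if per_user > len(post_ids):
        raise ValueError("per_user cannot exceed the number of posts")
    rng = random.Random(seed)
    load = Counter({post: 0 for post in post_ids})
    assignments = {}
    for user in user_ids:
        # Prefer posts that have been assigned least often so far;
        # ties are broken randomly so no post is systematically favored.
        ranked = sorted(post_ids, key=lambda p: (load[p], rng.random()))
        chosen = ranked[:per_user]
        for post in chosen:
            load[post] += 1
        assignments[user] = chosen
    return assignments, load


if __name__ == "__main__":
    users = [f"user{i}" for i in range(30)]
    posts = [f"post{i}" for i in range(75)]
    assignments, load = assign_posts(users, posts, per_user=5)
    print(assignments["user0"])
    print(min(load.values()), max(load.values()))
```

With 75 posts and 5 posts per user, 30 participants would be enough for every post to be looked at by two people. Whether that is enough attention per post is a separate question.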

Voting vs Reviewing

This year, we initially planned to separate out the reviewing stage and the voting stage. Ben and I ended up deciding to have the Voting Phase overlap the end of the Review Phase. I can’t remember all the reasons for this but they included “we still wanted more reviews overall, and we expected people who show up to do voting to end up doing some reviews along the way”, and “some people might want to update their votes in response to reviews, and vice versa.”

I think it’s plausible that next year it might be better to just have the Vote and Review phases completely overlapping. (In particular if we end up doing something like “assign each user 10 random posts to evaluate.” I imagine it being a fairly hard task to “do a serious review of 10 posts”, but fairly achievable to “think about 10 posts enough to cast some votes on them, and if you end up writing reviews in the meanwhile that’d be great.”)

Voting vs Survey

As I mentioned earlier: A worry I have with our voting system this year is that it felt more to me like "ranking posts by prestige" than "ranking them by truth/usefulness." It so happens that "prestige in LW community" does prioritize truth/usefulness and I think the outcome mostly tracked that, but I think we can do better.

I'm worried because:

  • I'd expected, by default, for these things to be jumbled together.
  • Whatever you thought of Affordance Widths, it seems unlikely for its "-140" score to be based on the merits of the post, rather than people not wanting the author represented in a Best of LW book. (This isn't obviously wrong: reputational effects matter and it's fine to give people an outlet for that, but I think it's better to ask those questions separately from questions of truth and usefulness. It currently seems to me that if an unpopular author wrote an obviously important piece, the review wouldn't be able to determine that.)

I also noticed conflict between “which posts really make sense to showcase to the rest of the world?” and “which posts do we want to reward for important internal updates, that were either very technical, or part of high context ongoing conversations.”

So I’d like to frame it more as a survey next year, with questions like:

  • Have you thought about the ideas in this post in the past year?
  • Do the ideas in this post seem important?
  • Does this post demonstrate good epistemics?
  • Should this post appear in a public-facing Best of LW book?
  • Should this post appear in an inward-facing, high-context LW Journal?

(With options for "yes/no/neutral", perhaps with Strong No and Strong Yes)
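As a rough illustration of how survey answers like these might be tallied, here is a small sketch that keeps each question's score separate per post, so that "should this be in the book?" never gets averaged together with "was this important?". The numeric scale, the short question keys, and the `summarize` function are all assumptions made up for this example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical numeric scale for the five answer options.
SCALE = {"strong no": -2, "no": -1, "neutral": 0, "yes": 1, "strong yes": 2}

# Hypothetical short keys for the five survey questions listed above.
QUESTIONS = (
    "thought_about",
    "important",
    "good_epistemics",
    "public_book",
    "lw_journal",
)


def summarize(responses):
    """Average the answers per post and per question.

    `responses` is an iterable of (post, question, answer) tuples.
    Returns {post: {question: mean score}}; questions are never merged.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for post, question, answer in responses:
        if question not in QUESTIONS:
            raise ValueError(f"unknown question: {question}")
        buckets[post][question].append(SCALE[answer.lower()])
    return {
        post: {q: mean(scores) for q, scores in by_question.items()}
        for post, by_question in buckets.items()
    }


if __name__ == "__main__":
    example = [
        ("Post A", "important", "yes"),
        ("Post A", "important", "strong yes"),
        ("Post A", "public_book", "strong no"),
    ]
    print(summarize(example))
```

In this toy example a post can score well on importance while scoring badly on "public-facing book", which is the kind of disentangling the survey framing is meant to allow.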

The main reason not to do this is that in many cases the answers may be similar, enough that they feel annoyingly redundant rather than helpfully nuanced. But I currently lean towards “try it at least once.” I’m hoping this would prompt users to give honest answers rather than trying to strategically vote.

If we ended up combining the Review and Voting Phases, this might come along with text-boxes for Reviews. Possibly just one catch-all textbox. Or, possibly broken up into multiple freeform questions, such as:

  • How has this post been helpful to you?
  • What advice do you have to the author to improve this post?
  • What thoughts do you have for other voters to help them evaluate this post?
  • What further work or questions would you like to see done in this post?

Nomination Phase was a bit confusing

I originally hoped the nominations phase would include detailed information on why the posts were good. Instead most nominations ended up fairly brief, and people gave more nuanced positive thoughts in the review phase.

I think in many cases it was useful for nominations to include reasons why, to give other nominators some context for why they might consider seconding the nomination. But in some cases that just felt superfluous.

At the very least, next year I’d include a more prominent “endorse nomination” button, that makes it lower effort for subsequent nominators to increase the nomination count. It’s possible that including reasons for nominations isn’t necessary, and we can handle that as part of the Review step.

Alignment Review?

It did seem like the Alignment Forum would have benefited from having a somewhat different review process. My current guess is that next year, there would be a LessWrong Review, and an Alignment Review, and some of the content is overlapping but they’re optimized separately.

It's possible the Alignment Review might include things not published on the Alignment Forum (and, in fact, it’d be a fine outcome if it were concluded that the most important Alignment progress happened elsewhere). In the months leading up to the review, AF users might be encouraged to make link posts for 2019 Alignment Content that they think was particularly important. 

Further Feedback?

Did you have your own take on the review process? Were there particular problems with execution? Having seen how the review process fit together, do you have overall concerns about the approach or the ontology?

I'm interested in feedback on whatever levels stand out to you.
