How not to sort by a complicated frequentist formula

[-]Eliezer Yudkowsky13y120

The main thing you want to calculate here is expected-value-of-information. Otherwise new posts drop into the void. Trying to maximize upvotes in the long run means showing new posts that might have a high-upvoting parameter.

[-]Manfred13y160

Well, you have to make somewhat of a secret sauce, because what you show on top should depend on what you want the top things and the people who see them to do. If you're a site like reddit, putting highly-rated stuff on the front page is good, but what you really want is highly rated new stuff. But if you're urbandictionary, you don't really care when something was submitted, you want to put the best answer at the top - except that if a word has multiple definitions, you want to have variety in your top definitions. Or maybe you're amazon, and you want people to see stuff they've already bought so it's easy for them to rate it. Etc.

[-]Meni_Rosenfeld13y40

This is interesting, especially considering that it favors low-data items, as opposed to both the confidence-interval-lower-bound and the notability adjustment factor, which penalize low-data items.

You can try to optimize it in an explore-vs-exploit framework, but there would be a lot of modeling parameters, and additional kinds of data will need to be considered. Specifically, a measure of how many of those who viewed the item bothered to vote at all. Some comments will not get any votes simply because they are not that interesting; so if you keep placing them on top hoping to learn more about them, you'll end up with very few total votes because you show people things they don't care about.

[-]Eliezer Yudkowsky13y50

Yep. You'd want to check or guess the size of the user's monitor and where they were scrolling to, and calculate upvotes-per-actual-user-read. As things are read and not upvoted, your confidence that they're not super-high-value items increases and the value of information from showing them again diminishes.

[-]dbaupp13y90

(Link to How Not To Sort By Average Rating.)

Something of interest: Jeffery's interval. Using the lower bound of a credible interval based on that distribution (which is the same as yours) will probably give better results than just using the mean: it handles small sample sizes more gracefully. (I think, but I'm certainly willing to be corrected.)

But I fear that it would cause irreparable damage if the world settles on this solution.

This is probably vastly exaggerating the possible consequences; it's just a method of sorting, and either the Wilson's interval method and a Bayesian method are definitely far better than the naive methods.

[-]gwern13y90

Something of interest: Jeffery's interval. Using the lower bound of a credible interval based on that distribution (which is the same as yours) will probably give better results than just using the mean: it is handles small sample sizes more gracefully. (I think, but I'm certainly willing to be corrected.)

I recently did a similar thing for ranking vendors by feedback, using both a Jeffreys interval and a Wilson interval; even on the vendors with little feedback, they were overall pretty similar. IIRC, I don't think they differed by more than 10% anywhere.

[-]Meni_Rosenfeld13y70

(Link to How Not To Sort By Average Rating.)

I forgot to link in the OP. Then remembered, and forgot again.

Something of interest: Jeffery's interval. Using the lower bound of a credible interval based on that distribution (which is the same as yours) will probably give better results than just using the mean: it handles small sample sizes more gracefully. (I think, but I'm certainly willing to be corrected.)

This seems to use specific parameters for the beta distribution. In the model I describe, the parameters are tailored per domain. This is actually an important distinction.

I think using the lower bound of an interval makes every item "guilty until proven innocent" - with no data we assume the item is of low quality. In my method we give the mean quality of all items (and it is important we calibrate the parameters for the domain). Which is better is debatable.

[-]Meni_Rosenfeld13y20

But I fear that it would cause irreparable damage if the world settles on this solution.

This is probably vastly exaggerating the possible consequences; it's just a method of sorting, and either the Wilson's interval method and a Bayesian method are definitely far better than the naive methods.

I just feel that it will place this low-hanging fruit out of reach. e.g.,

Me: Hey Reddit, I have this cool new sorting method for you to try!

Reddit: What do you mean? We've already moved beyond the naive methods into the correct method. Here, see Miller's paper. No further changes are needed.

Maybe I'm exaggerating - I mean, things can be improved again after being improved once - but I just feel that if the world had a "naive rating method" itch to scratch, and something like Miller's method became the go-to method, something is wrong.

[-]IlyaShpitser13y80

But the "correct" method is complicated, poorly motivated, insufficiently parameterized, and founded on frequentist statistics.

A quick question. It seems from this sentence that you view "insufficiently parameterized" as a flaw in a method. Can you explain what that phrase means and why it is a flaw? Is this a statement of preference for parametric stat or something else?

[-]Meni_Rosenfeld13y110

It means that the model used per item doesn't have enough parameters to encode what we know about the specific domain (where domain is "Reddit comments", "Urban dictionary definitions", etc.)

The formulas discussed define a certain mapping between pairs (positive votes, negative votes) to a quality score. In Miller's model, the same mapping is used everywhere without consideration of the characteristics of the specific domain. In my model, there are parameters a and b (or alternatively, a/(a+b) and a+b) that we first train per-domain, and then apply per item.

For example, let's say you want to decide the order of a (5, 0) item and a (40, 10) item. Miller's model just gives one answer. My model gives different answers depending on:

The average quality - if the overall item quality is high (say, most items have 100% positive votes), the (5,0) item should be higher because it's likely one of those 100% items, while (40,10) has proven itself to be of lower quality. If, however, most items have low quality, (40,10) will be higher because it has proven itself to be one of the rare high-quality items, while (5,0) is more likely to be a low-quality item which lucked out.

The variance in quality - say the average quality is 50%. If the variance in quality is low, (5,0) will be lower because it is likely to be an average item which lucked out, while (40, 10) has proven to be of high quality. If the variance is high (with most items being either 100% or 0%), (5,0) will be higher because in all likelihood it is one of the 100% items, while (40, 10) has proven to be only 80%.

In short, using a cookie-cutter model without any domain-specific parameters doesn't make the most efficient use of the data possible.

[-]VincentYu13y80

This is also somewhat meta in that LW also aggregates ratings, and I believe changing the model was once discussed (and maybe the beta model was suggested).

LW actually uses the same comment sorting that Reddit uses by default (i.e., it uses the solution suggested by Miller).

[-]TrE13y20

Specifically, the option can be found above the "Show help" button. "Sort by: Best" is what you want.

[-]EHeller13y40

Its not obvious to me that your method improves upon the Wilson score. Certainly, the traditional Bayesian approach (Jeffreys interval) is rarely that different from the WIlson score- have you played with values to see what the largest differences would look like?

[-]Meni_Rosenfeld13y20

Given that a and b are arbitrary, I think the differences can be large. Whether they actually are large for typical datasets I can't readily answer.

In any case the advantages are:

Simplicity. Tuning the parameters is a bit involved, but once you do the formula to apply for each item is very simple. In many (not all) cases, a complicated formula reflects insufficient understanding of the problem.
Motivation. Taking the lower bound of a confidence/credible interval makes some sense but it's not that obvious. The need for it arises because we don't model the prior mean, so we don't want to take risk on unproven items. A posterior mean of the quality is more natural, and won't cause much problems because items default to the true population mean.
Parametrization. The interval methods has a parameter for the probability to take for the size of the interval, but it's not at all clear how to choose it. My method has parameters for mean and variance which are based on the data.
Generalization. This framework makes it easier to clearly think about what we want, and replace the posterior mean of p with a posterior mean of some other quantity of interest. e.g., the suggested "explore vs. exploit" tends to give something closer to an interval upper bound than lower bound, and other methods have been suggested.

[-]EHeller13y40

Yes, a and b are arbitrary- but if they aren't chosen well your model could be hugely inferior. I'd suggest making a few toy data sets and actually comparing the standard methods (Wilson Score, Jeffreys interval) to yours before suggesting everyone embrace it.

Edit for clarity: Just to be clear, the Jeffery's interval (which is usually very close to the Wilson coeff) is essentially the same as your model but with the initial parameters 1/2,1/2.

[-]A1987dM13y20

This is the way I would do it, also taking into account EY's point of not hiding away new comments:

Assume each comment has an ‘upvote rate’ U, such that the probability that a comment at the age of t has u upvotes is a Poisson distribution with parameter Ut,

P(u|U, t) = (Ut)^u exp(−Ut)/u!

and similarly for downvotes,

P(d|D, t) = (Dt)^d exp(−Dt)/d!

If the prior probability distribution for U and D is P(U, D), their posterior probability distribution will be

P(U, D|u, d, t) = D^d U^u exp(−(D + U)t) P(U, D)/(a normalization constant).

Then, you sort comments according to a functional of the posterior pdf of U and D; in analogy with expected utility maximization you could use the posterior expectation value of some function f(U, D), but other choices would be possible. (This reduces to your proposal when you take f(U, D) = U/(U + D).)

Of course this model isn't entirely realistic because U and D ought to vary with time (according to timezone, how old the thread is and whether it's currently linked to from the main page, etc.), but the main effect of disregarding this (pretending that a comment has the same probability of getting upvoted in the 10,000th hour after its publication as in the 1st hour) would be to cause very recent comments to be sorted higher, which IMO is a Good Thing anyway.

[-]EHeller13y20

Why a Poisson distribution? It seems fairly clear we are looking at Bernoulli trials (people who look either upvote, or not). I doubt its a rare enough event (though it depends on the site, I suppose) that a poisson is a better approximation than a normal.

[-]Meni_Rosenfeld13y40

I think it's reasonable to model this as a Poisson process. There are many people who could in theory vote, only few of them do, at random times.

[-]A1987dM13y00

You'd need to know how many people read each comment, though.

[-]Meni_Rosenfeld13y00

I think some factor for decreasing votes over time should be included. Exponentially decaying rates seem reasonable, and the decay time constant can be calibrated with the overall data in the domain (assuming we have data on voting times available).

[-]A1987dM13y10

Exponentially decaying rates seem reasonable

That's likely way too fast. It's not that rare for people to comment on posts several years old (especially on Main), and I'd guess such people also vote comments. (Well, I most certainly do.) You can use an exponential decay with a very large time constant, but that would mean that comments from yesterday are voted nearly as often as comments from three months ago. So, the increase in realism compared to a constant rate isn't large enough to justify the increase in complexity. (OTOH, hyperbolic decay is likely much more realistic, but it also has more parameters.)

[-]SarahNibs13y00

Some users' votes are not independent of the current net vote count. http://lesswrong.com/lw/7x/voting_etiquette/5ic

[-]A1987dM13y00

Guilty as charged.

Showing upvotes and downvotes separately would stop me (and, probably, other people) from doing that, but I'm kind of getting tired of beating this dead horse.

[-]Meni_Rosenfeld13y00

True. This is a problem since the current net vote count is mutable, while an individual vote, once cast, is not. You could try fitting a much more complicated model that can reproduce this behavior, calibrate it with A/B testing, etc. Or maybe try to prevent it by sorting according to quality, but not actually displaying the metrics.

[-]TrE13y00

Specifically, on a continuous quality scale from 0 to 1, with a prior of a uniform density in this interval with k upvotes and n downvotes, one receives for the posterior distribution the (unnormalized) measure $p(x)=x^n \cdot (1-x)^k$ .

A Gaussian-like prior might be more suited here, though.

Knowing the actual probability distribution and not just the average can be useful if, for some reason, you're not interested in the comments with the best average, but in those which are least or most controversial.

[This comment is no longer endorsed by its author]Reply

[-]Meni_Rosenfeld13y10

The beta distribution is a conjugate prior for Bernoulli trials, so if you start with such a prior the posterior is also beta, which greatly simplifies the calculations. It also converges to normal for large alpha and beta, and in any case can be fit into any mean and variance, so it's a good choice.

Whatever your target function is, you'll want the item with the greatest posterior mean for this target. To do this generally you'll need the posterior distribution of p rather than the mean of p itself. But the distribution just describes what you know about p, it doesn't itself encode properties such as "controversial".

LESSWRONG
LW

LESSWRONG
LW

4

How not to sort by a complicated frequentist formula

4

4