Fullrank: Bayesian Noisy Sorting

by Max Niederman
24th Jul 2025
Linkpost from maxniederman.com
4 min read
Comments
gwern:

Could you say more about the efficient sampling through convolutions? Bad interactive latency was a major reason I didn't spend any time exploring Bayesian approaches for my quick-and-dirty resorter script.

I take it you are not using a conjugate approach for fast exact estimates, nor a Laplacian approximation, nor full slow MCMC, but I'm not sure what sort of speed you actually get from your approach.

Max Niederman:

The bottleneck for sampling from the posterior is sampling v from the truncated normal distribution, which I'm doing using this implementation of minimax tilting rejection sampling. It's not exactly instant, but on my laptop it takes a few CPU-ms per sample up to ~100 comparisons and about 50 CPU-ms per sample up to ~200 comparisons. This probably rules out lists with many hundreds of items, but I've found it fine for interactive use, since selecting the next comparison doesn't require any sampling. The only thing that is prohibitively slow (for me, at least) is computing the entropy of the full posterior, since it requires so many evaluations of $\Phi_m$.


Fullrank is an interactive CLI tool for Bayesian inference of list rankings based on noisy comparisons. It takes a list of items, then efficiently prompts the user to compare pairs of items until the user decides that the posterior distribution has sufficiently low entropy. It can then sample from the resulting posterior distribution and compute various statistics.

GitHub Repository

Background

Deterministic sorting algorithms rank lists by comparing pairs of items. If an item is greater than another, it is moved higher in the list. However, sometimes it is uncertain which item is greater. For example:

  • The best chess player wins only 60% of their non-draw games against the second-best player.
  • Only three-quarters of human raters prefer one LLM completion over another.
  • A person prefers one food over another, but only on 80% of days.

Estimating rankings in the presence of this uncertainty is called noisy sorting. A common approach is to model comparisons between items as depending on a latent numerical value ("skill" or "rating") for each item. For example, the commonly used Bradley–Terry model assumes that

$$p(i > j) = \sigma(s_i - s_j)$$

where $s_i$ denotes the latent skill of item $i$ and $\sigma$ is the logistic function.
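To make the model concrete, here is a minimal sketch of the Bradley–Terry win probability in plain Python (illustrative only; the function name is mine, not Fullrank's):

```python
import math

def bt_win_prob(s_i: float, s_j: float) -> float:
    """Bradley-Terry: probability that item i beats item j."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

# A one-unit skill gap gives the stronger item about a 73% win rate.
print(round(bt_win_prob(1.0, 0.0), 3))  # 0.731
```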

Motivation

@gwern's Resorter is a CLI tool for noisy sorting of lists based on the Bradley–Terry model. However, its frequentist approach limits it in a few ways:

  • It has to use imperfect heuristics to guess which comparison is most informative.
  • There's no principled way to quantify how accurate the resulting maximum-likelihood ranking is, or to tell when to stop comparing items.
  • It can't answer questions like "What is the probability that item i is truly the best item?"

As a project to learn more about Bayesian inference, I decided to build a Bayesian version of Resorter.

Thurstonian Model

The Bradley–Terry model is quite nice for maximum-likelihood estimation, but I was unable to get it to work well in a Bayesian setting. Given a normal prior on the skills $s \sim \mathcal{N}(\mu, \Sigma)$, the posterior density is

$$p(s \mid w > l) = \phi(s; \mu, \Sigma) \prod_{i=1}^{m} \sigma(s_{w_i} - s_{l_i}) \left[ \int_{\mathbb{R}^n} \phi(s; \mu, \Sigma) \prod_{i=1}^{m} \sigma(s_{w_i} - s_{l_i}) \, ds \right]^{-1}$$

where $\phi$ denotes the normal density, $m$ is the number of comparisons, and $w_i$ and $l_i$ are the winning and losing items in comparison $i$. It appears some researchers have designed efficient sampling procedures for this posterior, but frankly they are beyond me.

Instead, I used a probability model very similar to Bradley–Terry, but using a probit link instead of a logit link. That is, under the Thurstonian model,

$$p(i > j) = \Phi(s_i - s_j)$$

where $\Phi$ denotes the cumulative distribution function of the standard normal distribution.
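As a quick illustration of the difference between the two links (the function name is mine, not Fullrank's): at a one-unit skill gap, the probit link is somewhat more confident than the logistic one.

```python
from scipy.stats import norm

def thurstone_win_prob(s_i: float, s_j: float) -> float:
    """Thurstonian model: P(i beats j) = Phi(s_i - s_j)."""
    return float(norm.cdf(s_i - s_j))

# Phi(1) ~= 0.841, versus sigma(1) ~= 0.731 under Bradley-Terry.
print(round(thurstone_win_prob(1.0, 0.0), 3))  # 0.841
```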

I'll now derive the posterior density in the Thurstonian model. For convenience, I'll represent the observed comparisons as a matrix $D \in \mathbb{R}^{m \times n}$ mapping score vectors to probits for each comparison. That is, $D_{ij} = 1$ if item $j$ wins comparison $i$, $D_{ij} = -1$ if item $j$ loses comparison $i$, and $D_{ij} = 0$ otherwise.

$$p(s \mid D) = \frac{p(s)\, p(D \mid s)}{p(D)} = \phi(s; \mu, \Sigma)\, \frac{\Pr[z < Ds]}{\Pr[z < Dt]}$$

where $t \sim \mathcal{N}(\mu, \Sigma)$ and $z \sim \mathcal{N}(0, I_m)$.

It turns out that the normalization constant can be represented quite nicely using the multivariate normal CDF $\Phi_m$:

$$\Pr[z < Dt] = \Pr[z < D(t - \mu) + D\mu] = \Pr[z - D(t - \mu) < D\mu]$$

And since $D(t - \mu) \sim \mathcal{N}(0, D\Sigma D^\top)$, we have

$$z - D(t - \mu) \sim \mathcal{N}(0, I_m + D\Sigma D^\top)$$
$$\Pr[z < Dt] = \Phi_m(D\mu; I_m + D\Sigma D^\top)$$

Likewise, $\Pr[z < Ds] = \Phi_m(Ds)$. Therefore,

$$p(s \mid D) = \phi(s; \mu, \Sigma)\, \Phi_m(Ds) \left[\Phi_m(D\mu; I_m + D\Sigma D^\top)\right]^{-1}$$
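As a sanity check, both pieces of this density can be evaluated numerically with `scipy`. The toy setup below (three items, standard normal prior, two observed comparisons) is my own illustration, not Fullrank's code:

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Toy setup: n = 3 items, standard normal prior, m = 2 observed
# comparisons: item 0 beat item 1, and item 1 beat item 2.
n = 3
mu = np.zeros(n)
Sigma = np.eye(n)
D = np.array([[1.0, -1.0, 0.0],   # row i: +1 for the winner, -1 for the loser
              [0.0, 1.0, -1.0]])
m = D.shape[0]

def unnormalized_posterior(s: np.ndarray) -> float:
    """phi(s; mu, Sigma) * Phi_m(D s), the numerator of p(s | D)."""
    prior = multivariate_normal.pdf(s, mean=mu, cov=Sigma)
    likelihood = np.prod(norm.cdf(D @ s))  # z has identity covariance
    return float(prior * likelihood)

# Normalization constant Phi_m(D mu; I_m + D Sigma D^T):
Z = float(multivariate_normal.cdf(D @ mu, mean=np.zeros(m),
                                  cov=np.eye(m) + D @ Sigma @ D.T))

# Posterior density at a score vector consistent with the comparisons:
print(unnormalized_posterior(np.array([1.0, 0.0, -1.0])) / Z)
```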

This is called a unified skew-normal (SUN) distribution, and it is the posterior of most probit models. Using the convention of Arellano-Valle and Azzalini[1], we can write

$$s \mid D \sim \mathrm{SUN}_{n,m}\!\left(\mu,\; \Sigma,\; \Sigma D^\top,\; D\mu,\; I_m + D\Sigma D^\top\right)$$

Efficient Sampling

Arellano-Valle and Azzalini[1:1] also give us a convolutional representation of the posterior. If

$$u \sim \mathcal{N}\!\left(0,\; \Sigma - \Sigma D^\top (I_m + D\Sigma D^\top)^{-1} D\Sigma\right), \quad \text{and} \quad v \sim \mathcal{N}_{-D\mu}\!\left(0,\; I_m + D\Sigma D^\top\right)$$

where $\mathcal{N}_\tau$ denotes the normal distribution truncated below $\tau$, then

$$\mu + u + \Sigma D^\top (I_m + D\Sigma D^\top)^{-1} v \sim \mathrm{SUN}_{n,m}\!\left(\mu,\; \Sigma,\; \Sigma D^\top,\; D\mu,\; I_m + D\Sigma D^\top\right)$$

Fullrank exploits this fact to efficiently sample from the posterior using samples of u and v.
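To make the recipe concrete, here is a sketch on a small toy problem (three items, standard normal prior, two comparisons), with a naive rejection sampler standing in for the minimax-tilting sampler Fullrank actually uses. This is fine for a handful of comparisons but hopeless as m grows, since the acceptance rate decays exponentially:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 3, 2
mu = np.zeros(n)
Sigma = np.eye(n)
D = np.array([[1.0, -1.0, 0.0],   # item 0 beat item 1
              [0.0, 1.0, -1.0]])  # item 1 beat item 2

Gamma = np.eye(m) + D @ Sigma @ D.T          # I_m + D Sigma D^T
K = Sigma @ D.T @ np.linalg.inv(Gamma)       # Sigma D^T (I_m + D Sigma D^T)^{-1}
cov_u = Sigma - K @ D @ Sigma                # covariance of u

def sample_truncated(lower: np.ndarray, cov: np.ndarray) -> np.ndarray:
    """Naive rejection sampler for N(0, cov) truncated below `lower`
    componentwise. Fullrank uses minimax tilting instead."""
    while True:
        v = rng.multivariate_normal(np.zeros(len(lower)), cov)
        if np.all(v > lower):
            return v

def sample_posterior() -> np.ndarray:
    u = rng.multivariate_normal(np.zeros(n), cov_u)
    v = sample_truncated(-D @ mu, Gamma)     # truncated below -D mu
    return mu + u + K @ v

draws = np.array([sample_posterior() for _ in range(2000)])
# Having observed 0 > 1 and 1 > 2, the posterior mean skills come out ordered.
print(draws.mean(axis=0))
```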

Optimal Comparison Choice

Ideally, Fullrank should always present the user with the most informative comparison: the one whose probit has maximal entropy.

Unified skew-normal distributions are closed under full-rank linear transformations, so each comparison probit is distributed according to a one-dimensional SUN distribution. At least in the case of a standard normal prior, each comparison has identical first and second moments, so intuitively the entropy should be controlled by the skewness.

Fullrank currently assumes that the entropy is decreasing in the L2 norm of the skewness parameter $\Delta = \Sigma D^\top$, which seems to work well in practice. However, I haven't been able to prove that this works, and it definitely fails for certain non-scalar choices of prior covariance (though these are currently not supported anyway). If you have any better ideas for choosing comparisons, please let me know!
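My reading of this heuristic, sketched in Python (illustrative only, not Fullrank's implementation): score each candidate pair by the L2 norm of the skewness its probit would have under the current posterior, and pick the smallest.

```python
import numpy as np
from itertools import combinations

def next_comparison(Sigma: np.ndarray, D: np.ndarray) -> tuple:
    """Pick the candidate pair whose probit has the smallest skewness norm.

    The probit of comparing items i and j is a 1-D SUN whose skewness is
    (e_i - e_j)^T Sigma D^T; under the assumption that entropy decreases
    in the L2 norm of the skewness, minimizing that norm picks the
    highest-entropy (most informative) comparison.
    """
    n = Sigma.shape[0]
    Delta = Sigma @ D.T  # posterior skewness parameter, n x m
    best, best_norm = None, np.inf
    for i, j in combinations(range(n), 2):
        skew = Delta[i] - Delta[j]  # (e_i - e_j)^T Sigma D^T
        norm = np.linalg.norm(skew)
        if norm < best_norm:
            best, best_norm = (i, j), norm
    return best

# With comparisons 0 > 1 and 1 > 2 observed, the never-compared pair
# (0, 2) is selected next.
D = np.array([[1.0, -1.0, 0.0],
              [0.0, 1.0, -1.0]])
print(next_comparison(np.eye(3), D))  # (0, 2)
```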


  1. R. B. Arellano-Valle and A. Azzalini, "Some properties of the unified skew-normal distribution," Statistical Papers, vol. 63, pp. 461–487, 2022. doi:10.1007/s00362-021-01235-2