Hertford, Sourbut (rationality lessons from University Challenge)

Oliver Sourbut

Amongst the huge range of excitements offered by joining the University of Oxford was the unexpected opportunity to join this lovely bunch:

Sourbut, Keskin, Whittle, and Balakrishnan Raju sat at the Hertford Oxford desk at the University Challenge set

Hertford College University Challenge team 2023, with our mascot Simpkin the cat

You can tune in on 2023-09-04 at 20:30 UK time on BBC 2 or watch or catch up online if you want to see us in action.

As a relative quiz-noob^[1], joining an elite quizzing team (hold your applause) was an eye-opening experience in a few ways. I'm not allowed to talk about how things went on the show (on pain of getting told off by the NDA police), but actually (as with all forms of performance and competition), the vast majority of the time was spent in prep and practice, which is where most of the insights came in anyway.

I'm going to talk a bit about University Challenge, and also gesture at how the experience as a competitive quizzer relates to broader theory and practice in decision-making under uncertainty. If you just want to see some fun quiz questions and my attempts at answering them, you can skip the middle 'Real-time calibrated decision-making' section, or just skip reading this entirely and watch the show.

The format and some example questions

For readers unfamiliar with University Challenge, it's a competitive quiz, where each match consists of two teams head-to-head. Importantly for this discussion, a key part of the format is buzzer rounds ('Starter for 10'): that means you don't just have to know the answer, you have to know the answer and buzz before your opponent if they also know, otherwise you get nothing. But buzz too soon with a wrong answer and you lose points^[2].

Here are some example questions. Maybe you know some of the answers! If you want to, imagine hearing the question word by word - when do you have a good guess or some ideas? At what point are you confident of the answer? Would you risk buzzing early and losing points if you're wrong - and on what basis?

I'll go through these examples later, and give the answers (my realtime guesses and the actual ground truth).

What single-digit number links: the element boron, the fourth root of 625, and the planet Jupiter’s position from the Sun?

Resembling a cornet but having a slightly larger bell, which instrument is a standard in British brass bands, its name being the German for ‘wing horn’?

Rayleigh-Taylor, Kelvin-Helmholtz and Rayleigh-Bénard are all types of what general physical phenomenon, characterised by the unbounded growth of small disturbances?

Following the example of the Cadbury brothers’ model at Bourneville, which manufacturer and philanthropist developed the model village of New Earswick, north east of York?

Which art gallery links How It Is by Miroslaw Balka, Shibboleth by Doris Salcedo, Embankment by Rachel Whiteread, Marsyas by Anish Kapoor and The Weather Project by Olafur Eliasson?

In 2010, which tennis player became the seventh player to win all four Grand Slam tournaments when he defeated Novak Djokovic in the US Open men’s final?

‘Chain’, ‘double treble’, ‘reverse half double’ and ‘slip stitch’ are all terms used in which handicraft, whose name is a diminutive of the French word for ‘hook’?

Real-time calibrated decision-making

Uncertainty in beliefs

A lot of theory and practice point to respecting and manipulating uncertainty as being mandatory for good truth-seeking and good decision-making. There's a lot of theory here which I won't elaborate on, but Bayes' Rule features heavily in my favourite chunks of literature^[3].

Bayes says that (given certain reasonable assumptions about how you want your beliefs to work) when you see evidence which is more or less likely under different hypotheses, you should adjust the odds of your credence in each hypothesis according to how relatively likely the evidence was, in a particular way. For hypotheses $A$ and $B$, and observed evidence $O$^[4]:

$P (A | O) : P (B | O) = [P (O | A) \times P (A)] : [P (O | B) \times P (B)]$

That is, the ratio of your credences in $A$ and $B$ after observing $O$ (the left-hand side) is the ratio of your prior credences, scaled by the ratio of the likelihoods of the observation under each hypothesis.
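To make the odds form concrete, here is a minimal Python sketch with made-up numbers. The prior odds of 1:3 and the likelihoods of 0.9 and 0.1 are purely illustrative assumptions, and `update_odds` and `normalise` are hypothetical helper names, not anything from a library.

```python
from fractions import Fraction

def update_odds(prior_odds, likelihoods):
    """Scale each hypothesis's prior weight by P(observation | hypothesis)."""
    return {h: prior_odds[h] * likelihoods[h] for h in prior_odds}

def normalise(odds):
    """Turn relative odds into probabilities that sum to 1."""
    total = sum(odds.values())
    return {h: weight / total for h, weight in odds.items()}

# Illustrative assumption: before the observation, B is three times as credible as A.
prior = {"A": Fraction(1), "B": Fraction(3)}
# Illustrative assumption: the observation O is nine times likelier under A than under B.
likelihood = {"A": Fraction(9, 10), "B": Fraction(1, 10)}

posterior = update_odds(prior, likelihood)   # A:B = 9/10 : 3/10, i.e. 3:1
probabilities = normalise(posterior)         # A gets 3/4, B gets 1/4
print(posterior["A"] / posterior["B"], probabilities)
```

Working in odds means the normalisation step can be left until the very end, or skipped entirely if only the ranking of hypotheses matters.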

How does this apply to competitive quizzing? We're trying to home in on an answer to a question - but we receive the question one word (indeed, syllable!) at a time. The question, word by word: that's the evidence we receive. Our ideas about what the answer might be: those are the hypotheses.

Calibration

Besides representation of uncertainty, another important aspect of belief-formation and decision-making is correctly handling uncertainty.

How this is often operationalised is a notion of calibration of our uncertainties. If we express (or act in such a way as to implicitly express) a confidence level of 80% in some proposition, we are calibrated if 80% of similar cases resolve positively.

This actually gets trickier and more philosophical than we might like! What are 'similar cases'? What about confidence levels of tiny orders of magnitude, like 0.00001%? - surely we'll never be able to encounter or identify sufficiently many 'similar cases' to find out if we were calibrated or not! Do we have to start reaching for notions, heavens forbid, of counterfactual possible outcomes? I don't have the answers, and as far as I know these are open questions in philosophy on one hand (descriptively) and machine learning and statistics on the other (prescriptively).

When quizzing, if I feel 80% sure of the answer, I want that to correspond to my being right about 80% of the time! Otherwise I'll make a loss in expectation, either because I buzz too confidently and get it wrong (losing points), or buzz too late and lose out to someone on the opposing team.^[5]
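A rough way to check calibration after the fact is to log a stated confidence alongside whether the answer turned out right, then compare the two within each confidence level. The sketch below assumes such a log exists. The records are invented for illustration and `calibration_table` is a hypothetical helper name.

```python
from collections import defaultdict

def calibration_table(records):
    """Group (stated_confidence, was_correct) pairs and report the hit rate per group."""
    outcomes = defaultdict(list)
    for confidence, was_correct in records:
        # Treat answers given at the same stated confidence (to one decimal) as 'similar cases'.
        outcomes[round(confidence, 1)].append(was_correct)
    return {
        confidence: (len(results), sum(results) / len(results))
        for confidence, results in sorted(outcomes.items())
    }

# Invented practice log: ten answers given at 80% confidence, five at 60%.
records = (
    [(0.8, True)] * 8 + [(0.8, False)] * 2
    + [(0.6, True)] * 2 + [(0.6, False)] * 3
)

table = calibration_table(records)
for confidence, (count, hit_rate) in table.items():
    print(f"stated {confidence:.0%}: right {hit_rate:.0%} of {count} times")
```

In this invented log the 80% answers are well calibrated, while the 60% answers are overconfident (right only 40% of the time). Grouping by stated confidence is only one choice of what counts as a 'similar case'.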

Logical or computational uncertainty

There's a big hole in the Bayesian literature. Actually it's a big hole in the entire statistical literature, it's just more obvious in Bayes-land because it's more explicit.

Sometimes our uncertainty is because we haven't had long enough to think yet. Consider the digits of pi. (Presumably you know them. If not, I can tell you an effective procedure for enumerating them and you can come back when you're done.) Suppose I want to know the millionth such digit. Well, I know all the facts I need to get there. There are only ten things it can be (assuming decimal). I don't need to make any more 'observations' per se to arrive at a conclusion. But still I don't know the answer yet.
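One way to see the gap between knowing a procedure and knowing its output is the sketch below. It uses Machin's formula with plain integer arithmetic, which is a choice made here for illustration and not something the post specifies. `pi_digit` and `guard_digits` are hypothetical names, and the ten guard digits are assumed to be enough to absorb truncation error for small n.

```python
def arctan_inverse(x, unity):
    """Return arctan(1/x) scaled by `unity`, via the alternating Taylor series."""
    power = unity // x          # unity / x**(2k+1), starting at k = 0
    total = power
    divisor = 3
    sign = -1
    while power:
        power //= x * x
        total += sign * (power // divisor)
        sign = -sign
        divisor += 2
    return total

def pi_digit(n, guard_digits=10):
    """Return the n-th decimal digit of pi (n = 1 gives the 1 in 3.1...)."""
    unity = 10 ** (n + guard_digits)
    # Machin's formula: pi = 4 * (4 * arctan(1/5) - arctan(1/239)).
    scaled_pi = 4 * (4 * arctan_inverse(5, unity) - arctan_inverse(239, unity))
    digits = str(scaled_pi // 10 ** guard_digits)   # "3" followed by n decimals
    return int(digits[n])

# Before running anything, a uniform credence over the ten digits is about the best available.
prior = {digit: 1 / 10 for digit in range(10)}

# After running, the uncertainty is gone, although no new observation was made.
print(pi_digit(5))
```

This is quick for small n. The same procedure applied to the millionth digit would take far longer, which is exactly the kind of constraint the standard picture leaves out.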

One of the interesting things about being a self-reflective computer and trying to do hard things is that you start to notice when you bump up against computational constraints - especially ones like this which aren't always obvious (or which get neglected for simplicity) the first few times theorists wade in to try to disentangle things! This is just one example where an appreciation of time constraints^[6] as a major determinant of effective computational procedures gives rise to interesting scientific problems and insights.

It's exactly the same when quizzing. Our brains' word-association and retrieval and evaluation and updating can only run so fast - often not as fast as the quizmaster can read the question!

As a technical puzzle, understanding this aspect of uncertainty intrigues me. I'm certainly less au fait with it than with the standard timeless perspective on uncertainty. There's some great work by MIRI which begins to address this, for example discussing logical uncertainty and logical (or Garrabrant) induction.

Doing this stuff fast

We don't get the opportunity to pause time after every question syllable, pull up a notepad, run some supercomputer evaluations, compute exact Bayesian posteriors, estimate our teammates' and opponents' credences and likely buzzing behaviour, and so on. Cruelly, time flows at one second per second and the quizmaster keeps quizmastering. So too in life! Our decisions in modern life might not usually be as split-second as in a head-to-head quiz, but our uncertainties (including logical uncertainty) and the costs of mistakes are just as real^[7].

One of the main changes resulting from practising and competing in UC was that I went from 'quiz noob with broad knowledge base and slow mental retrieval' to 'quiz rookie with broad knowledge base and slightly-less-slow mental retrieval'. One of our team in particular was a much more experienced quizzer, and an absolute master of buzzer technique! Entering the world of highly-practised quizzers gave me an appreciation for the challenges involved. I don't know how well these competences generalise, but I wouldn't be surprised if competitive quizzers (more experienced than me) would on the whole be great (or have great potential) at calibrated decision-making under uncertainty, and forecasting.

Of course, maybe I'm overanalysing this: after all, maybe general knowledge quizzes mostly come down to brute knowledge-base-retrieval. You either know or you don't! But I think the head-to-head competitive aspect brings out this mandate for fast approximate calibrated estimation: you have to eke this out sometimes, or your opponent will! I didn't expect this insight going into it, but it's given me a fresh appreciation for UC the show, and for head-to-head quizzing in general.

But at what cost?

This whole discussion sidesteps the reasons for wanting to have accurate and calibrated beliefs on the basis of limited evidence. Of course, it's fun and perhaps even virtuous and whatnot to have accurate beliefs, but usually we want them because they help us to do the good things.

This raises the question of costs: if my beliefs are feeding into my actions, and I have limited computational budget (always), it matters how much difference I expect it to make. In the case of University Challenge, for each buzzer question, there are seven outcomes, for each of which I've indicated a rough net score:

(-25) lose points and other team gains points
(-20) other team gains points
(-5) lose points and other team gets nothing
(0) nobody gets anything
(5) other team loses points
(20) gain points
(25) other team loses points and you gain points

(As we may find out, there are also secretly other options, like '(-1000) say something embarrassing on national TV'.)

Our expectation of these outcomes depends on

our current estimate of what the answer might be
our estimate of our teammates' state
our estimate of our opponents' state
(a guess at how soon the question will end)

For these reasons, when time and/or decision-making computational resources are scarce, it can pay to make fast approximations and gut checks on all of these. I felt my own system 1 slowly incorporating some of this stuff through practice, and I expect this is a large part of what separates really good quizzers from the rest of us!

In University Challenge in particular, the cost of an interruption is a mere 5 points, which is no big deal in the scheme of things (compared to the upside of +10-25 for a correct answer, depending on bonuses). But what it really costs is the opportunity for you, or a teammate, to reduce uncertainty about the answer - either by hearing more of the clue, or by having more time to think! This is something it took a while to internalise, and you do have to internalise it to get good in this competitive setting.
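The trade-off between buzzing now and waiting can be sketched as a comparison of two expected values. This is a deliberately crude model under stated assumptions: it uses only the starter scoring (10 for a correct answer, 5 lost for an incorrect interruption, nothing lost for a wrong answer at the end of the question), ignores bonuses and teammates, and assumes an opponent who buzzes first is always right. All function and parameter names are hypothetical.

```python
CORRECT = 10             # points for a correct starter
INTERRUPT_PENALTY = 5    # points lost for an incorrect interruption

def ev_buzz_now(p_correct_now):
    """Expected points from interrupting with the current best guess."""
    return p_correct_now * CORRECT - (1 - p_correct_now) * INTERRUPT_PENALTY

def ev_wait(p_correct_later, p_opponent_first):
    """Expected change in the score gap from waiting for the end of the question.

    Assumption: if the opponent buzzes first they are right, moving the gap by -10.
    Otherwise we answer at the end, where a wrong answer costs nothing.
    """
    lose_race = p_opponent_first * (-CORRECT)
    win_race = (1 - p_opponent_first) * p_correct_later * CORRECT
    return lose_race + win_race

def should_buzz(p_correct_now, p_correct_later, p_opponent_first):
    return ev_buzz_now(p_correct_now) > ev_wait(p_correct_later, p_opponent_first)

# Invented numbers: 60% sure now, 90% sure after the full question.
print(should_buzz(0.6, 0.9, p_opponent_first=0.2))   # cautious opponents
print(should_buzz(0.6, 0.9, p_opponent_first=0.5))   # aggressive opponents
```

With these invented numbers, waiting wins against cautious opponents and buzzing wins against aggressive ones, even though the belief about the answer is identical in both cases. With no opposition at all, this model only says an interruption has positive expected value once the confidence is above one third.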

To some extent, this whole thing pattern-matches a lot of work I've done in my time as a data scientist and software engineer in industry, relating to online secret auctions: our estimate of teammates' and opponents' states corresponding to their 'bid' and our own estimate corresponding to our 'true evaluation' of the item. There, as in quizzing, time is limited and computational constraints reign. High throughput and rapid constrained decision-making often trump slow and painstaking deliberation in these contexts. I even have a hunch that some of the theory I developed there might transfer over! - especially concerning how to handle optimal bidding over a time-distributed volume in an uncertain market environment. But I've not played enough with the maths, and I'd need to check what's covered by NDA, so I won't elaborate any more.

You'll spot me doing some time-saving approximations and inversions below, as well as accounting for cost, not just for belief-updating.

Just answer the questions already!

OK here goes, in dialogue format. Remember, these are honest but hypothetical (I'm not allowed to reveal any real things from the show). I've partly but not totally biased the sample toward topics I know about^[8].

Quizmaster (QM): What single-digit number...

Oly (O): P(anything other than [0-9]) = 0

QM: ...links: the element Boron, ...

O: Boron is $B$ , group 3, element 5, mass 10 (right, usually?) [thinking]

QM: ... the fourth ...

O: Gotta be 5, the proton number, group number would be a weird choice and it's subject to debate anyway because of the transition elements, buzz time...?

QM: ... root of 625 ...

O: [buzz] 5

And the whole question

What single-digit number links: the element boron, the fourth root of 625, and the planet Jupiter’s position from the Sun? (5)

Review: not bad, if I'd been more confident of calibration I could have buzzed sooner. Who knows if my opponent would have been so cautious...? If I knew they were the waiting type, I could have rested a bit and double checked the fourth power of 5, just to be really sure, or waited even more and counted 'My Very Easy Method Just'.

What about a speed superintelligence? We can imagine they suffer no or limited logical uncertainty - effectively they can pause after every syllable. Well then, 'element' would be a big clue and they could precompute a mapping from the first ten chemical elements to their proton numbers and mass numbers. Because stable mass numbers aren't uniquely defined, proton number would already get most of the posterior, and once 'bor-' has been said there's really only one possible answer.
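The precomputed mapping can be sketched in a few lines. This assumes, as argued above, that the intended link is the proton (atomic) number, and it filters candidates by the letters heard so far, not by true syllables. `candidates` is a hypothetical helper name.

```python
# Proton numbers of the first ten elements.
ATOMIC_NUMBERS = {
    "hydrogen": 1, "helium": 2, "lithium": 3, "beryllium": 4, "boron": 5,
    "carbon": 6, "nitrogen": 7, "oxygen": 8, "fluorine": 9, "neon": 10,
}

def candidates(prefix):
    """Single-digit answers still consistent with the part of the name heard so far."""
    return {
        number
        for name, number in ATOMIC_NUMBERS.items()
        if name.startswith(prefix) and number < 10
    }

# Narrowing as more of the element's name is heard.
for heard in ["", "b", "bo"]:
    print(repr(heard), sorted(candidates(heard)))
```

After 'element' alone there are nine candidates, after 'b' only beryllium (4) and boron (5) remain, and after 'bo' the answer is pinned down to 5.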

QM: Resembling a cornet...

O: Hmm, I've played a lot in brass bands, is this going to be cornet the brass instrument or something else? P(brass instrument) = 0.7. Maybe trumpet. P(trumpet) = 0.4. Should I buzz??

QM: ... but having a slightly larger bell, ...

O: 'bell'! Got to be brass! P(trumpet) = 0.7 [NB this should be lower to follow Bayes] but uhh, what are those other ones? Piccolo trumpet, soprano cornet, ...

QM: ... which instrument is a standard in British brass ...

O: ...wait, they all have smaller bells. THINK

QM: ...bands, its name being the German

O: Flugel horn! [buzz] Flugel horn

And the whole question

Resembling a cornet but having a slightly larger bell, which instrument is a standard in British brass bands, its name being the German for ‘wing horn’? (Flugelhorn)

Review: My logical uncertainty and brain speed let me down here. But I got very lucky on actually having spent a lot of time in British brass bands. Handy coincidence. The speed superintelligence would have done better in my place, but only if it had niche knowledge of brass band instruments.

QM: Rayleigh-Taylor, ...

O: [pure association] physics or maths stuff?

QM: ...Kelvin-Helmholtz and ...

O: Definitely physics! I've heard of Rayleigh scattering, won't be Taylor polynomials, Kelvin did... temperature stuff? Helmholtz did a bit of everything. Acoustics?

QM: ...Rayleigh-Bénard are all types...

O: Rayleigh again! Scattering? But I can't confidently link the other names, and I haven't heard of these name pairings at all.

QM: ... of what general physical phenomenon, characterised by the unbounded growth of small disturbances?

O: Oh, must be chaos! That's a general physical phenomenon. Question seems to be over, teammates haven't buzzed yet... [buzz] Chaos

QM (looking disappointed): I'm afraid that's the wrong answer

And the whole question

Rayleigh-Taylor, Kelvin-Helmholtz and Rayleigh-Bénard are all types of what general physical phenomenon, characterised by the unbounded growth of small disturbances? (Instability)

Ouch. If any of the other players had better physics breadth or recall, they'd have certainly got it long before I failed to. In practice there's at least one of my teammates (cough Omer Keskin cough) who I'd expect to get this question, or to have buzzed sooner with a guess than me! At least I didn't buzz early with a wrong answer.

QM: Following the example of the Cadbury brothers' model...

O: Ooh, I did my undergrad in Birmingham and I remember some history about this

QM: ... at Bourneville, which manufacturer and philanthropist...

O: OK, Bourneville was a constructed model village for confectionery workers, designed to offer better quality of life - getting distracted. It's probably another confectioner. Rowntree, Fry, ...? P(something I didn't think of) = 0.3

QM: ...developed the model village of New Earswick, north east of York?

O: My Granny lived in York and told me this! Pretty sure it was Rowntree. Confidence? Question seems over, time to [buzz]. Rowntree.

QM: Can I ask which Rowntree...?

O: Uhhhh. [I really should have read that fun Chocolate Wars book they gave us at undergrad induction. Pick a common early C20 businessman's name.] John?

QM: I'm afraid that's the wrong answer

And the whole question

Following the example of the Cadbury brothers’ model at Bourneville, which manufacturer and philanthropist developed the model village of New Earswick, north east of York? (Joseph Rowntree (1836-1926; not to be confused with his son, Seebohm Rowntree))

Review: well, that seems harsh. I'm not actually sure whether they'd have accepted this on the real show. This was basically a case of fact retrieval, though it illustrates the important issue of allocating probability/credence to 'something I didn't think of yet'.

QM: Which art gallery...

O: P(art gallery) = 1. I know hardly any, thankfully my teammates might know a few more. I remember visiting the Tate and Tate Modern as a kid. P(something I've never heard of) = 0.8.

QM: ...links How It Is by Miroslaw Balka, Shibboleth by Doris Salcedo, ...

O: Sound quite modern. P(modern) = 0.8

QM: ... Embankment by Rachel Whiteread, Marsyas by Anish Kapoor ...

O: Confidently all modern

QM: ... and The Weather Project by Olafur Eliasson?

O: Wait, that sounds familiar. It might actually be Tate Modern. Question's over. [buzz] Tate Modern

And the whole question

Which art gallery links How It Is by Miroslaw Balka, Shibboleth by Doris Salcedo, Embankment by Rachel Whiteread, Marsyas by Anish Kapoor and The Weather Project by Olafur Eliasson? (Tate Modern)

Review: I got very lucky here (as I said, I biased my sample a bit^[8:1]). In practice, someone else would probably have beaten me to it by a long shot, hopefully someone on my team!

QM: In 2010, which tennis player ...

O: Oh no, sport facts. Hopefully Daniel is on it. Federer? Djokovic? Williams? Nadal? Murray?

QM: ... became the seventh player to win all four Grand Slam tournaments when he defeated Novak Djokovic in the US Open men's final?

O: Yep, no idea. P(Federer or Nadal or Murray) = 0.3. Strictly, if alone, I ought to buzz and guess something now that the question has finished. But surely a teammate has a better guess than me?

QM: Anyone? Anyone going to buzz?

O: I really should buzz and guess. But besides getting the answer right, my utility function also includes not putting myself forward embarrassingly wrongly on TV. P(wrong) = 0.9.

And the whole question

In 2010, which tennis player became the seventh player to win all four Grand Slam tournaments when he defeated Novak Djokovic in the US Open men’s final? (Rafael Nadal)

Review: This awkward silence actually happens, quite rarely, but sometimes, on the show. I assume this can only be when everyone involved has a similar thought process to me at the end. A superintelligence optimised exclusively to value University Challenge performance might not have similar compunctions, but there's also no knowing what instrumental strategies it might pursue. Maybe embarrassment or some analogue would actually play a part there.

QM: ‘Chain’, ‘double treble’, ‘reverse half double’...

O: Sport? 'treble' is a musical clef? Gymnastics?

QM: ... and ‘slip stitch’ ...

O: Oh, knitting or something? I remember something about this.

QM: ... are all terms used in which handicraft

O: It has to be knitting, right? Can I afford to wait? [buzz] Knitting

QM: I'm afraid that's the wrong answer and you lose five points. ...in which handicraft, whose name is a diminutive of the French word for ‘hook’?

O: [fuming] Should have waited! It's got to be crochet.

And the full question

‘Chain’, ‘double treble’, ‘reverse half double’ and ‘slip stitch’ are all terms used in which handicraft, whose name is a diminutive of the French word for ‘hook’? (Crochet (not knitting or embroidery))

Review: I'd have thought this would be an acceptable error, but then, for my sins, I don't know much about the fine distinctions between wool-based handicrafts. Alas. This is one where the competitive uncertainty bit me: my model of the opponents interpreted their lack of buzzing as over-caution (therefore we're in a race and I'd better not waste more time thinking) rather than appropriate caution (therefore I'd better think harder to generate alternative hypotheses, just in case). A speed SI version of me would have generated at least 'crochet' as an alternative and, I expect, waited until 'name is a diminutive', which would be enough evidence to be confident.

Takeaways

This post is already far too long. You can tune in on 2023-09-04 20:30 UK time on BBC 2 or watch or catch up online if you want to see me and my team in action!

Quizzing (and more importantly the practice matches and friendlies) gave me a fresh appreciation for various decision-making concerns, theoretical and practical. Plus, it was a great laugh and a chance to meet some really intriguing and friendly people, here in Oxford as well as from other teams.

Uncertainty is powerful! Calibration is a slightly elusive concept, but an important part of using uncertainty appropriately.

Logical uncertainty is a fascinating and under-studied phenomenon - especially in time- or compute-constrained settings!

As well as getting things right, we often need to accept some tradeoff for getting things wrong in order to free up decision-making and acting resources for other uses. When you're in a head-to-head time-based competition, this bites hard.

Finally, decision-making is, ultimately, about value. If my belief doesn't make a value-difference by changing my behaviour, then I might better pay attention to other beliefs which do, even at the expense of the first belief being true. Again this only bites in a constrained context: whether constrained by evidence (for factual uncertainty) or by compute (for logical uncertainty). Since there are so many things we are terribly clueless about, this actually applies in practice all the time.

Quizzing is fun, and I haven't even got round to mentioning the arcane art of quiz-writing. A peek behind the scenes at how those particular sausages are made was also illuminating. Were it possible, quiz- and puzzle-setters have gone up even further in my estimation.

If you're a quizzer or a quiz-setter (whether with more or less experience than me), I'd love to hear if any of this resonated with you, and about your own reflections on the art of quizzing and decision-making!

[1] I've enjoyed as many pub quizzes as the next Brit but that's about as far as it goes ↩︎
[2] You win 10 points for a correct answer, whether early or not (hence 'Starter for 10'), and you lose 5 for an interruption and an incorrect answer. An incorrect answer after the end of the question gets nothing. There are also non-buzzer questions where teams can confer, for bonus points. ↩︎
[3] OK, when I said 'literature' I meant 'scientific literature' i.e. papers and textbooks, but actually this sentence applies to some of my favourite actual literature too: Harry Potter and the Methods of Rationality, for example. ↩︎
[4] I've used my favoured formulation of Bayes' Rule here where it's all about odds (ratios of probabilities/likelihoods). I think it's a historical accident that Bayes usually gets introduced in less obvious ways and then forgotten. ↩︎
[5] It gets more complicated! There are multiple members of the team, so really I want to be sensitive to when my particular knowledge areas overlap with those of my teammates, and when they might buzz (right or wrong), and which of us is more likely to be right, given the current information, ... And the decision to buzz early or not of course also depends on some belief model of the opposing team's capabilities! If I know for sure they won't get it, or that they'll play cautiously, I can afford to wait, but if I have some sense that they're aggressive on early guesses, I need to be willing to play with less confidence. ↩︎
[6] There are other computational bottlenecks, and tradeoffs around parallelism and memory-use are other fruitful considerations. ↩︎
[7] Well, the costs of mistakes in life are more real, provided your utility function includes anything other than 'be good at quizzes' ↩︎
[8] I ran through 30 actual questions and somewhat ad-hoc chose 7 that seemed especially interesting. ↩︎ ↩︎
