How much could LeelaPieceOdds be improved, in principle? I suspect a moderate but not huge amount. An easy change would be to increase the compute spent on the search.
A few points of note, stated with significant factual uncertainty: Leela*Odds has already improved to the tune of hundreds of Elo from various changes over the last year or so. Leela is trained against Maia networks (human imitation), but doesn't use Maia in its rollouts, either during inference or training. Until recently, the network also lacked contempt, so it would actually perform worse past a small number of rollouts (~1000), as it would refute its own best ideas. AFAIK Maia is stateless and doesn't know the time control. Last I checked, only Knight and Queen odds games were trained, and IIUC during training it only played as white, so most of these games require it to generalise from that.
Well, I hope it continues to improve then! Would be super exciting to see a substantially better version.
How much would it cost to pay the people behind LeelaPieceOdds to give other games a similar treatment? Would it even be possible? I'd love to see a version for games like Catan or Eclipse or Innovation. Or better yet, for Starcraft or Advance Wars.
I'm not a chess player (have played maybe 15 normal games of chess ever) and tried playing LeelaPieceOdds on the BBNN setting. When LeelaQueenOdds was released I'd lost at Q odds several times before giving up; this time it was really fun! I played nine times and stalemated it once before finally winning, taking about 40 minutes. My sense is that information I've absorbed from chess books, chess streamers and the like was significantly helpful, e.g. avoid mistakes, try to trade when ahead in material, develop pieces, keep pieces defended.
I think the lesson is that a superhuman search over a large search space is much more powerful than one over a small space. With BBNN odds, Leela only has a queen and two rooks (plus pawns), and after sacrificing some material to solidify and trade one of them, I'm still up 7 points and Leela won't have enough material to miraculously slip out of every trade until I blunder. By an endgame of, say, KRNNB vs KR there are only a small number of possible moves for Leela and I can just check that I'm safe against each one until I win. I'd probably lose when given QN or QR, because Leela having two more pieces would increase the required ratio of simplifications to blunders.
Nice! Yeah after Leela is down to a couple of pieces the game is usually over. But I've had to learn not to be lazy at this point because a couple of times it managed to force a stalemate.
Why have you read chess books?
As a child I read everything I could get my hands on! Mostly a couple of Silman's books. The appeal to me was quantifying and systematizing strategy, not chess itself (which I bounced off in favor of sports and math contests). E.g. the idea of exploiting imbalances, or planning by backchaining, or some of the specific skills like putting your knights in the right place.
I found these more interesting than Go books in this respect, both due to Silman's writing style and because Go is such a complicated game filled with exceptions that Go books get bogged down in specifics.
This is really cool! Request: Could you construct some sort of graph depicting the trade-off between Elo and initial material disadvantage?
Simple version would be: Calculate Leela's Elo at each handicap level, then plot it on a graph of handicap level (in traditional points-based accounting, i.e. each pawn is 1 point, queens are 9 points, etc.) vs. Elo.
More complicated version would be a heatmap of Leela's win probability, with Elo score of opponent on the x-axis and handicap level in points on the y-axis.
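For concreteness, the kind of pivot I have in mind for the heatmap (a sketch only; it assumes one row per game, and all column names are illustrative):

```python
import matplotlib.pyplot as plt
import pandas as pd

# df is assumed to have one row per game with columns:
#   rating_bucket (opponent Elo, bucketed), handicap_points, leela_won (0/1)
heat = df.pivot_table(index="handicap_points", columns="rating_bucket",
                      values="leela_won", aggfunc="mean")

plt.imshow(heat.values, aspect="auto", origin="lower", cmap="viridis")
plt.colorbar(label="Leela win probability")
plt.xticks(range(len(heat.columns)), heat.columns, rotation=45)
plt.yticks(range(len(heat.index)), heat.index)
plt.xlabel("Opponent rating (bucketed)")
plt.ylabel("Handicap (points)")
plt.show()
```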
This is the best I've got so far. I estimated the rating using the midpoint of a logistic regression fit to the games. The first few especially seem to have been inflated due to not having enough high-rated players in the data, so it had to extrapolate. And they all seem inflated by (I'd guess) a couple of hundred points due to the effects I mentioned in the post. (Edit: Please don't share the graph alone without this context).
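Concretely, for each handicap and time control the estimate is the rating at which the fitted curve crosses a 50% human win rate. A rough sketch of that fit (array names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# ratings: the human's Lichess rating for each game
# human_won: 1 if the human won that game, else 0
x = np.asarray(ratings, dtype=float).reshape(-1, 1)
y = np.asarray(human_won)

model = LogisticRegression().fit(x, y)

# Midpoint of the logistic curve: the rating at which P(human win) = 0.5,
# taken as the estimate of Leela's rating at this handicap and time control.
leela_rating_estimate = -model.intercept_[0] / model.coef_[0, 0]
```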
The NN rating in the Blitz data highlights the flaw in this method of estimating the rating.
I haven't found a way to get similar data on human vs human games.
It would then be interesting to see if there is comparable data from human vs. human games. Perhaps there is some data on winrates when a player of elo level X goes up against a player of elo level Y who has a handicap of Z?
I played some 70 games against LeelaQueenOdds with 5+3 (I basically played until I reached a >50% score, after losing the first couple of games and then slowly figuring it out), so the most interesting graph for me is unfortunately missing. ;-)
I think there are reasonable people who look at the evidence and think it plausible that control works, and also reasonable people who look at the evidence and think it implausible that control works. And others who think that openai-superalignment-style plans plausibly work.
Something is going wrong here.
Minor aside, but I'm not sure I've ever heard someone reasonable (e.g. a Redwood employee) say "Control Works" in the sense of "Stuff like existing monitoring will work to control AIs forever". I've only ever heard control talked about as a stopgap for a fairly narrow set of ~human capabilities, which allows us to *something something* solve alignment. It's the second part that seems to be where the major disagreements come in.
I'm not sure about superalignment since they published one paper and it wasn't even really alignment, and then all the reasonable ones I can think of got fired.
By "control plausibly works" I didn't mean "Stuff like existing monitoring will work to control AIs forever". I meant it works if it is a stepping stone allows us to accelerate/finish alignment research, and thereby build aligned AGI.
This fits with my model of what most control researchers believe, where the main point of disagreement is on what "accelerate/finish alignment research" entails. I think it's important that the disagreement isn't actually about control, but about the much more complicated topic of alignment given control.
Edited from here on for tone and clarity:
I think it is unsurprising for the people who work on control to disagree with other people on whether control allows you to finish alignment research, since this is mostly a question of "Given X affordances, could we solve alignment?" This is fundamentally a question about alignment, which is a very difficult topic to agree on. Many of the smart people who believe in control are currently working on it, and thus are not currently working on alignment, so their intuitions about what is required to solve alignment are likely to be in tension with the intuitions of those who are working in alignment.
(There are also some disagreements regarding FOOM and other capability discontinuities, since a FOOM most likely causes control to break down. The likelihood of FOOM is something I am still confused about, so I don't know which side is actually reasonable here. My current position is that if you have an alignment method which actually looks like it will work for AIs that grow at <10x the speed of the current capabilities exponential, that would be brilliant and definitely worth trying all on its own, even if you can't adapt it to faster-growing AIs.)
I'm using "alignment" here pretty losely, but basically I mean most of the stuff on the agent foundations <-> mech interp spectrum/region/cloud. I specifically exclude control research from this because I haven't seen anyone reasonable claim that control -> more control -> even more control is a good idea (the closest I've seen is William MacAskill saying something along those lines, but even he thinks we need some alignment).
Do you think it's similar to how different expectations and priors about what AI trajectories and capability profiles will look like often cause people to make different predictions on e.g. P(doom), P(scheming), etc. (like the Paul vs Eliezer debates)? Or do you think in this case there is enough empirical evidence that people ought to converge more? (I'd guess the former, but low confidence.)
I think several of the subquestions that matter for whether it'll plausibly work to have AI solve alignment for us are in the second category. Like the two points I mentioned in the post. I think there are other subquestions that are more in the first category, which are also relevant to the odds of success. I'm relatively low confidence about this kind of stuff because of all the normal reasons why it's difficult to say how other people should be thinking. It's easy to miss relevant priors, evidence, etc. But still... given what I know about what everyone believes, it looks like these questions should be resolvable among reasonable people.
The more boring (and more likely) case is that we just have too few data points to tell whether AI control can actually work as it's supposed to, so we have to mostly fall back on priors.
I'll flag something from J Bostock's comment here while I'm making the comment:
I've only ever heard control talked about as a stopgap for a fairly narrow set of ~human capabilities, which allows us to something something solve alignment.
The human range of capabilities is actually quite large (discussed in SSC).
Another way that the ratings might be biased towards Leela is that people might resign more readily than normal, since games are unrated and Leela will definitely accept your rematch. This was definitely true for me.
I've been curious about how good LeelaPieceOdds is, so I downloaded a bunch of data and graphed it.
For context, Leela is a chess bot and this version of it has been trained to play with a handicap.
I first heard about LeelaQueenOdds from simplegeometry's comment, and I've been playing against it occasionally since then. LeelaPieceOdds is the version with more varied handicap options, which is more interesting to me because LeelaQueenOdds ~always beats me.
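For anyone who wants to pull similar data, a rough sketch using the public Lichess games-export API (the exact bot account name is an assumption on my part, and this is only one plausible way to fetch the games):

```python
import json
import requests

# Assumed bot account; other odds bots (e.g. LeelaQueenOdds) would be fetched the same way.
BOT = "LeelaPieceOdds"

resp = requests.get(
    f"https://lichess.org/api/games/user/{BOT}",
    headers={"Accept": "application/x-ndjson"},  # newline-delimited JSON instead of PGN
    params={"max": 5000, "rated": "false"},      # odds games are unrated
    stream=True,
)
games = [json.loads(line) for line in resp.iter_lines() if line]
```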
Playing Leela feels like playing chess while tired. I just keep blundering. It's not invincible, and when you beat it it doesn't feel like it was that difficult. But more often than not it manages to overcome the huge disadvantage it started with.
Orange means the human player won. The title contains the handicap[1] and the time control. I split the data by time control (Bullet, Blitz, Rapid), which indicates how much time the human is allowed to spend thinking over the whole game. Leela always plays quickly, so the time control essentially only affects the human.
For reference, these are the time control categories I'm using:
Often there wasn't enough data to show anything interesting for the Bullet and Classical time controls, so I usually omitted them.[2]
In the x-axis labels, n is the number of games in that bucket. Often it's too low to draw confident conclusions. I've grouped together games where Leela played as white and as black, even though playing as white gives a small advantage.
The human player rating (on the x-axis) is the Lichess rating[3] of the human player, for that particular time control, at the time when the game against Leela was played. Because Lichess tracks different ratings for each time control, the graphs aren't precisely comparable across time controls.
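Roughly, each graph comes from a grouping like the following sketch (column names are illustrative, assuming one record per game):

```python
import pandas as pd

# One row per game: human_rating, time_control ("bullet"/"blitz"/"rapid"/"classical"),
# human_won (True/False). Field names here are illustrative, not the actual dataset's.
df = pd.DataFrame(records)

blitz = df[df["time_control"] == "blitz"].copy()
blitz["bucket"] = (blitz["human_rating"] // 100) * 100   # 100-point buckets, e.g. 1400-1499

summary = blitz.groupby("bucket")["human_won"].agg(winrate="mean", n="count")
print(summary)   # n is the per-bucket game count shown in the x-axis labels
```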
The spike in 1400-1499 is caused by a single user who played more than a thousand games.
In my experience playing different odds, I tend to win a little more than the data indicates I should. I'm not sure why this is, but I have a couple of guesses:
If you want the most impressive-to-you demonstration, I recommend selecting an odds level where players at your rating win <1/3 of the time.
How much could LeelaPieceOdds be improved, in principle? I suspect a moderate but not huge amount. An easy change would be to increase the compute spent on the search. I'm really curious how much of a difference this makes, because deep searches favour the machine.
A more difficult change would be adaptation to particular players. Leela works by exploiting weaknesses in human play, but it was trained using a model of aggregated human play. In principle, it could adapt to the particular player by analysing their past games and building a more accurate model of their likely mistakes.
AI control proponents want to keep researcher AIs in a box and make them solve alignment for us. Stories of AI doom often involve the AI escaping from this situation. Odds chess could be a tiny analogue of this situation,[4] so one might hope that this would provide some empirical evidence for or against control. But unfortunately we mostly agree with control proponents on the plausibility of escape for near-human-level AI. My reasons for thinking control won't work aren't related to the capabilities of a superintelligence. They are:
Even though there's plenty of current relevant evidence about these points, it's apparently easy to twist it to fit several worldviews.
A lot of people have a strong intuition that structural advantages tend to trump intelligence differences in a conflict. Playing chess with Leela is a great exercise for adding some nuance to that intuition. A friend said they better understood AI risk after we had a conversation about Leela.
The options are NN, BB, RN, RR, BBN, RNN, RBB, QN, BBNN, QR, QNN, QBB, QRN or QRR. The letters indicate which pieces Leela is missing at the beginning of the game: N=Knight, B=Bishop, R=Rook, Q=Queen.
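Using the conventional piece values (knight/bishop 3, rook 5, queen 9), each option corresponds to a handicap size in points, in the sense used in the graph request above. A small illustrative helper:

```python
# Conventional piece values; the letters match the odds labels above.
PIECE_VALUE = {"N": 3, "B": 3, "R": 5, "Q": 9}

def handicap_points(odds: str) -> int:
    return sum(PIECE_VALUE[c] for c in odds)

options = ["NN", "BB", "RN", "RR", "BBN", "RNN", "RBB", "QN",
           "BBNN", "QR", "QNN", "QBB", "QRN", "QRR"]
print({o: handicap_points(o) for o in options})  # e.g. BBNN -> 12, QRR -> 19
```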
E.g. here's the Classical time control for BBNN:
Note that Lichess and Chess.com ratings differ significantly at the lower ratings, but start to be fairly similar above 2000.
I think this section from The case for ensuring that powerful AIs are controlled, which lists the handicaps of the AI, makes the analogy fairly clear:
For your control evaluations to be conservative, the red team must be at least as good at finding attacks as scheming models, so that if the red team fails to find any good attack, it would imply that scheming AIs would also fail to find a good attack. We also need to be able to sufficiently measure the capabilities of models in ways that are robust to that very model sabotaging our evaluations.[11] We think that finding good enough attacks and robustly measuring capabilities for early transformatively useful AIs (but not arbitrarily powerful AIs) seems achievable, because our AIs have significant disadvantages compared to our red team:
- Less serial thinking time. Current LLMs can only do a small amount of serial reasoning outside of natural language reasoning. So, if future powerful AIs are similar in this way to current LLMs and we monitor the AI's natural language reasoning for problematic topics like reasoning about evading our countermeasures, we can ensure that in order to break our countermeasures the AI would need to think of its strategy in a forward pass. Even if future AIs aren't restricted to this extent, we still expect to be able to impose some important limitations on serial reasoning or sharing uninterpretable state ("neuralese" memories) between different instances. In contrast, our red team is allowed to search for attacks and ways to elicit capabilities with lots of serial time, a team of people, and AI assistance.[12][13]
- Less information about countermeasures.
- Little practice, no iteration against countermeasures. When the red team experiments with a strategy and it totally fails, they can shrug, learn from the failure, and try again. But if the AI fails and is caught, its situation likely becomes much worse. And even if AIs don't get caught, they might be unable to learn from their failures because of limitations on their state and memories.