I've been curious about how good LeelaPieceOdds is, so I downloaded a bunch of data and graphed it.
For context, Leela is a chess bot and this version of it has been trained to play with a handicap.
I first heard about LeelaQueenOdds from simplegeometry's comment, and I've been playing against it occasionally since then. LeelaPieceOdds is the version with more varied handicap options, which is more interesting to me because LeelaQueenOdds ~always beats me.
Playing Leela feels like playing chess while tired. I just keep blundering. It's not invincible, and when you beat it it doesn't feel like it was that difficult. But more often than not it manages to overcome the huge disadvantage it started with.
Orange means the human player won. Each graph's title shows the handicap[1] and the time control. I split the data by time control (Bullet, Blitz, Rapid), which indicates how much time the human is allowed to spend thinking over the whole game. Leela always plays quickly, so the time control essentially only affects the human.
For reference, these are the time control categories I'm using:
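These should just be Lichess's standard speed buckets, which classify a game by its estimated duration (initial clock plus 40× the increment). As a rough sketch, with the cutoffs written from memory, so treat them as approximate:

```python
# Sketch of Lichess's speed categories: estimated duration = initial clock + 40 * increment.
# The thresholds below are my recollection of Lichess's cutoffs; treat them as approximate.
def time_control_category(initial_seconds: int, increment_seconds: int) -> str:
    estimated = initial_seconds + 40 * increment_seconds
    if estimated < 30:
        return "UltraBullet"
    if estimated < 180:
        return "Bullet"
    if estimated < 480:
        return "Blitz"
    if estimated < 1500:
        return "Rapid"
    return "Classical"

print(time_control_category(300, 3))  # 5+3 -> Blitz
print(time_control_category(600, 0))  # 10+0 -> Rapid
```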
Often there wasn't enough data to show anything interesting for the Bullet and Classical time controls, so I usually omitted them.[2]
In the x-axis labels, n is the number of games in that bucket. Often it's too low to draw confident conclusions. I've grouped together games where Leela played as white and as black, even though playing as white gives a small advantage.
The human player rating (on the x-axis) is the Lichess rating[3] of the human player, for that particular time control, at the time when the game against Leela was played. Because Lichess tracks different ratings for each time control, the graphs aren't precisely comparable across time controls.
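If you want to reproduce something like these graphs, the sketch below shows the general idea: stream the bot's games from the Lichess export API, bucket them by the human's rating, and plot the fraction the human won. The bot account name, request parameters, and field handling are best-effort assumptions rather than the exact script behind the graphs above.

```python
# Sketch: pull one odds bot's rated games from the Lichess export API (NDJSON),
# then compute the human win rate per 100-point rating bucket for one time control.
import json
from collections import defaultdict

import requests
import matplotlib.pyplot as plt

BOT = "LeelaQueenOdds"  # assumed account name; swap in whichever odds bot you're curious about

resp = requests.get(
    f"https://lichess.org/api/games/user/{BOT}",
    params={"max": 5000, "rated": "true", "moves": "false"},
    headers={"Accept": "application/x-ndjson"},
    stream=True,
)

wins = defaultdict(int)    # rating bucket -> games the human won
totals = defaultdict(int)  # rating bucket -> games played

for line in resp.iter_lines():
    if not line:
        continue
    game = json.loads(line)
    if game.get("speed") != "blitz":  # one graph per time control
        continue
    players = game["players"]
    # Work out which side the human played, then bucket the game by their rating.
    white_name = players["white"].get("user", {}).get("name", "")
    human_side = "black" if white_name.lower() == BOT.lower() else "white"
    rating = players[human_side].get("rating")
    if rating is None:
        continue
    bucket = (rating // 100) * 100
    totals[bucket] += 1
    if game.get("winner") == human_side:  # draws and Leela wins both count as "not won"
        wins[bucket] += 1

buckets = sorted(totals)
plt.bar(
    [f"{b}-{b + 99}\nn={totals[b]}" for b in buckets],
    [wins[b] / totals[b] for b in buckets],
    color="orange",
)
plt.ylabel("Fraction of games won by the human")
plt.title("Queen odds, Blitz (sketch)")
plt.show()
```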
The spike in the 1400-1499 bucket is caused by a single user who played more than a thousand games.
In my experience playing different odds, I tend to win a little more than the data indicates I should. I'm not sure why this is, but I have a couple of guesses:
If you want the most impressive-to-you demonstration, I recommend selecting an odds where players at your rating win less than 1/3 of the time.
How much could LeelaPieceOdds be improved, in principle? I suspect a moderate but not huge amount. An easy change would be to increase the compute spent on the search. I'm really curious how much of a difference this makes, because deep searches favour the machine.
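If I wanted to poke at that myself, something like the sketch below is how I'd do it with python-chess, assuming a local lc0 build and the odds-trained network weights (the weights filename is a placeholder, not the real file):

```python
# Sketch: ask an odds-trained Leela for a move at increasing node budgets,
# assuming lc0 is installed and you have the odds network weights (filename is a placeholder).
import chess
import chess.engine

engine = chess.engine.SimpleEngine.popen_uci(["lc0", "--weights=queen-odds-weights.pb.gz"])

# Queen odds: Leela (Black) starts without her queen; the human has just played 1. e4.
board = chess.Board("rnb1kbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1")
board.push_san("e4")

# The online bots move almost instantly (tiny node counts); here we raise the budget.
for nodes in (1, 100, 10_000):
    result = engine.play(board, chess.engine.Limit(nodes=nodes))
    print(f"{nodes:>6} nodes -> {result.move}")

engine.quit()
```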
A more difficult change would be adaptation to particular players. Leela works by exploiting weaknesses in human play, but it was trained using a model of aggregated human play. In principle, it could adapt to the particular player by analysing their past games and building a more accurate model of their likely mistakes.
AI control proponents want to keep researcher AIs in a box and make them solve alignment for us. Stories of AI doom often involve the AI escaping from this situation. Odds chess could be a tiny analogue,[4] so one might hope that it would provide some empirical evidence for or against control. But unfortunately I mostly agree with control proponents on the plausibility of escape for near-human-level AI. My reasons for thinking control won't work aren't related to the capabilities of a superintelligence. They are:
Even though there's plenty of current relevant evidence about these points, it's apparently easy to twist it to fit several worldviews.
A lot of people have a strong intuition that structural advantages tend to trump intelligence differences in a conflict. Playing chess with Leela is a great exercise for adding some nuance to that intuition. A friend said they better understood AI risk after we had a conversation about Leela.
The options are NN, BB, RN, RR, BBN, RNN, RBB, QN, BBNN, QR, QNN, QBB, QRN, or QRR. The initials stand for the pieces Leela is missing at the beginning of the game: N=Knight, B=Bishop, R=Rook, Q=Queen.
E.g. here's the Classical time control for BBNN:
Note that Lichess and Chess.com ratings differ significantly at lower ratings, but become fairly similar above 2000.
I think this section from The case for ensuring that powerful AIs are controlled, which lists the handicaps of the AI, makes the analogy fairly clear:
For your control evaluations to be conservative, the red team must be at least as good at finding attacks as scheming models, so that if the red team fails to find any good attack, it would imply that scheming AIs would also fail to find a good attack. We also need to be able to sufficiently measure the capabilities of models in ways that are robust to that very model sabotaging our evaluations.[11] We think that finding good enough attacks and robustly measuring capabilities for early transformatively useful AIs (but not arbitrarily powerful AIs) seems achievable, because our AIs have significant disadvantages compared to our red team:
- Less serial thinking time. Current LLMs can only do a small amount of serial reasoning outside of natural language reasoning. So, if future powerful AIs are similar in this way to current LLMs and we monitor the AI's natural language reasoning for problematic topics like reasoning about evading our countermeasures, we can ensure that in order to break our countermeasures the AI would need to think of its strategy in a forward pass. Even if future AIs aren't restricted to this extent, we still expect to be able to impose some important limitations on serial reasoning or sharing uninterpretable state ("neuralese" memories) between different instances. In contrast, our red team is allowed to search for attacks and ways to elicit capabilities with lots of serial time, a team of people, and AI assistance.[12][13]
- Less information about countermeasures.
- Little practice, no iteration against countermeasures. When the red team experiments with a strategy and it totally fails, they can shrug, learn from the failure, and try again. But if the AI fails and is caught, its situation likely becomes much worse. And even if AIs don't get caught, they might be unable to learn from their failures because of limitations on their state and memories.