Wiki Contributions

Comments

I had the following results:

Stockfish level 2 vs Stockfish level 0, 0.01 seconds per move, 5k games:

0 random moves: win rate 81.2%

20 random moves: win rate 81.2%

40 random moves: 77.9%

95% confidence interval is about +- 1%

Stockfish level 15 vs level 9, 0.01 seconds per move, 5k games:

0 random moves: 65.5%

20 random moves: 72.8%

40 random moves: 67.5%

Once again, 95% confidence interval is about +- 1%

At 120 seconds per move, both of these level differences correspond to ~300 Elo: https://github.com/official-stockfish/Stockfish/commit/a08b8d4

This is 0.01 seconds per move. It appears that less search time lowers the Elo difference for level 15 vs level 9. A 65% win rate corresponds to a ~100 Elo difference, while a 81% win rate corresponds to a 250-300 Elo difference.

Honestly not too sure what to make of the results. One possible variable is that in every case, the higher level player is White. Starting in a game with a random position may favor the first to move. Level 2 vs level 0 seems most applicable to the Chess-GPT setting.

Both are great points, especially #1. I'll run some experiments and report back.

That's an interesting idea, I may test that out at some point. I'm assuming the softmax would be for kings / queens, where there is typically only one on the board, rather than for e.g. blank squares or pawns?

The all stockfish data engine played at a level that was 100-200 Elo higher in my tests, with a couple caveats. First, I benchmarked the LLMs against stockfish, so an all stockfish dataset seems helpful for this benchmark. Secondly, the stockfish LLM would probably have an advantage for robustness because I included a small percentage of stockfish vs random move generator games in the stockfish dataset in the hopes that it would improve its ability.

I haven't done an in depth qualitative assessment of their abilities to give a more in depth answer unfortunately.

Yes, in this recent OpenAI superalignment paper they said that GPT-4's training dataset included a dataset of chess games filtered for players with greater than 1800 Elo. Given gpt-3.5-turbo-instruct's ability, I'm guessing that its dataset included a similar collection.