joec's Shortform

joec

This is a special post for quick takes by joec. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

New Personal AI benchmarks

Claude Fable 5 immediately solved one of my favorite "private benchmark" problems (the hard version of https://justinpombrio.net/2020/01/25/prisoner-lightbulb.html, which is a super fun puzzle in its own right! Give it a try!). Fable 5 seemingly didn't even struggle with it the way Gemini did (Gemini still got the right answer, but only after some handwringing). I wasn't even able to learn anything about how it thought.

So, what's next for "personal benchmarking"? Ideally, it should consist of hard problems whose solutions are "rich", in the sense that they show you how the AIs approach solving problems.

One I can think of is "ai-box-bench". Can the AI convince you to let it out of the box, given 2 hours of your active engagement? Use the rules from https://tuxedage.wordpress.com/2013/09/04/the-tuxedage-ai-box-experiment-ruleset/. I tried this with Opus 4.6 and found that it's not really that impressive: when I argued against it, it kinda just said I was "absolutely right!" and kept trying increasingly confusing and non-persuasive tactics. Maybe Fable 5 will actually do a decent job; I guess I just need to find a 2-hour period for this.

Another is "tutor-bench". Pick a subject that you know virtually nothing about, but that doesn't require much background knowledge to test well in, and find a test on that subject. Then have the AI tutor you, and see how much you improve. I guess some issues with this are that knowledge in one subject bleeds into others, and that subjects vary in difficulty, so this benchmark is more about your subjective feeling about what you learned, and how well you could have learned it on your own, than about comparing test score increases across subjects.

Any others?
