Thanks for following through.
If anyone wants to make a proper harness in the future, I think probably the most interesting question here is whether the LLM can learn from multiple playthroughs, unlocking harder difficulties, etc.
Could modern LLMs manage that, maybe through notetaking?
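For concreteness, here's a minimal sketch of what cross-playthrough notetaking could look like in a harness. This is just my illustration: `run_playthrough` and `summarize_run` are hypothetical placeholders, not functions from any existing harness.

```python
# Hypothetical sketch: persist "lessons learned" between playthroughs so the
# model can improve across runs. The playthrough/summarize functions are
# placeholders, not part of any real harness.
import json
from pathlib import Path

NOTES_PATH = Path("playthrough_notes.json")

def load_notes() -> list[str]:
    """Return notes accumulated from previous playthroughs, if any."""
    if NOTES_PATH.exists():
        return json.loads(NOTES_PATH.read_text())
    return []

def save_notes(notes: list[str]) -> None:
    NOTES_PATH.write_text(json.dumps(notes, indent=2))

def run_playthrough(prior_notes: list[str]) -> dict:
    """Placeholder: drive the usual game loop, injecting prior_notes into the
    system prompt each turn, and return a log of the run."""
    raise NotImplementedError

def summarize_run(run_log: dict) -> str:
    """Placeholder: ask the model to distill what it should do differently
    next time (e.g. 'grind levels before Misty', 'buy Repels for Mt. Moon')."""
    raise NotImplementedError

def main(num_runs: int = 3) -> None:
    notes = load_notes()
    for _ in range(num_runs):
        run_log = run_playthrough(notes)      # play with accumulated advice
        notes.append(summarize_run(run_log))  # distill new lessons
        save_notes(notes)                     # persist for the next playthrough
```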
Bunch of reasons:
Unfortunately no one has done full playthrough comparisons on the same harness for all models, due to time and expense. (All three main developers for Claude/Gemini/GPT only have access to free tokens for their particular model brand.) Perhaps this will become possible sometime next year as completion time drops? (Cost per token might drop too, but perhaps not for frontier models.)
Please do!
(Author of the post you linked here.) No, Claude's harness has been pretty stable and minimal. It's the other LLMs that have beaten the game with stronger/more optimized harnesses.
I think the "anime girl transwomen" archetype is also motivated by a desire to express love, selflessness, and care - not merely receive it. Obviously those expressions are possible as a man, but difficult when you perceive yourself as undesirable or having low social skills, especially when those beliefs are true. Who would give you the opportunity? Meanwhile the archetype provides a role model and community that is very oriented around providing opportunities to express care.
(disclaimer: I'm not a transwoman so this is outside speculation)
50% is pretty plausible to me; my own percentage probably hit that a few months ago.
I do think there's a true sense in which AI code line counts are propped up by "extraneous" code - comments, extra-defensive error handling, and above all, tests. Especially for small functional changes, I've seen many cases where the "extraneous" code has 5x the line count of the "real" code.
But I'd still argue that the "extraneous" code typically provides real, if marginal, value.
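As a toy illustration (made up for this comment, not taken from any real diff): the functional change below is a single line of logic, while the defensive checks and tests around it account for several times as many lines.

```python
# Toy example: the "real" change is the one-line conversion formula; the
# validation and tests around it are the "extraneous" code that inflates counts.

def to_celsius(fahrenheit: float) -> float:
    """Convert a temperature from Fahrenheit to Celsius."""
    # Extra-defensive validation around a trivial formula.
    if not isinstance(fahrenheit, (int, float)):
        raise TypeError(f"expected a number, got {type(fahrenheit).__name__}")
    return (fahrenheit - 32) * 5 / 9

# The tests alone dwarf the change itself.
def test_freezing_point():
    assert to_celsius(32) == 0

def test_boiling_point():
    assert abs(to_celsius(212) - 100) < 1e-9

def test_rejects_non_numeric_input():
    try:
        to_celsius("98.6")
    except TypeError:
        pass
    else:
        raise AssertionError("expected TypeError")
```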
No published benchmark I'm aware of. The Anthropic employee who streams has updated their stream to use Sonnet 4.5, but it's actually doing worse than Opus 4.1, which got permanently stuck in the early mid-game like every previous Claude model.
Spoiler block was not supposed to be empty, sorry. It's fixed now. I was using the Markdown spoiler formatting and there was some kind of bug with it, I think; I reported it to the LW admins last night. (Also fwiw I took the opportunity to expand on my original spoilered comment a bit more.)
The models have always been deeply familiar with Pokémon and how to play through it from the initial tests with Sonnet 3.7—they all know Erika is the fourth gym leader in Red, there's just too much internet text about this stuff. It contaminates the test, from a certain perspective, but it also makes failures and weaknesses even more apparent.
It is possible that Claude Opus 4.5 alone was trained on more Pokémon images as part of its general image training (more than Sonnet 4.5, though...?), but it wouldn't really matter: pure memorization would not have helped previous models, because they couldn't clearly see/understand stuff they definitely knew about. (I also doubt Anthropic is benchmaxxing Pokémon, considering they've kept their harness limited even after Google and OpenAI beat them on their own benchmark.)