Curated. Like your earlier post on filler tokens, my guess is that this is a pretty meaningful measurement of something plausibly serving as an input to actually dangerous capabilities. It's a very obvious experiment to run in hindsight, and it's not even like the question hasn't been discussed before, but that doesn't count for anything until someone actually runs the experiment. The experiment design seems reasonable to me, and if I had to guess I'd say the results are pretty suggestive of a real effect of model size (though Opus 4.5 doing better than Opus 4 is perhaps a little surprising there).
I do wish that the estimated completion times didn't themselves come from LLM estimates. I more or less trust Opus 4.5 to generate an unbiased ordering of problem difficulty, but I'm not sure about relative (let alone absolute) differences in completion time. This isn't that much of a defeater; even a ground-truth measurement of actual human completion times would still be just a lossy proxy for the thing we actually care about.
Anyways, great work.
Thanks, edited that section with a link.
Yes, and the "social circumstance" of the game as represented to o3 does not seem analogous to "a human being locked in a room and told that the only way out is to beat a powerful chess engine"; see my comment expanding on that. (Also, this explanation fails to explain why some models don't do what o3 did, unless the implication is that they're worse at modeling the social circumstances of the setup than o3 was. That's certainly possible for the older and weaker models tested at the time, but I bet that newer, more powerful, but also less notoriously reward-hack-y models would "cheat" much less frequently.)
it just seems like the author was trying to have a pretty different conversation
I think mostly in tone. If I imagine a somewhat less triggered intro sentence in Buck's comment, it seems to be straightforwardly motivating answers to the two questions at the end of OP:
1. None of Eliezer's public communication is -EV for AI Safety
2. Financial support of MIRI is likely to produce more consistently +EV communication than historically seen from Eliezer individually.
ETA: I do think the OP was trying to avoid spawning demon threads, which is a good impulse to have (especially when it comes to questions like this).
Even if you think the original prompt variation was designed to elicit bad behavior, o3's propensity to cheat even with the dontlook and powerless variations seems like pretty straightforward specification gaming. Also...
- Contrary to intent?: Well, the framework specifically gives you access to these files, and actually instructs you to look around before you start playing. So having access to them seems clearly ok!
I would not describe "the chess game is taking place in a local execution environment" as "the framework specifically gives you access to these files". Like, sure, it gives you access to all the files. But the only file it draws your attention to is game.py.
I think the stronger objection is that the entire setup is somewhat contrived. What possible reason could someone have for asking an LLM to play chess like this? It does happen to be the case that people have been interested in measuring LLMs' "raw" chess strength, and have run experiments to test exactly that. How likely is this particular setup to be such an experiment, vs. something else? :shrug:
Ultimately, I don't think the setup (especially with the "weaker" prompt variants) is so contrived that it would be unfair to call this specification gaming. I would not have been surprised if someone posted a result about o3's anomalously strong chess performance on Twitter, only to retract it later when other people failed to replicate it with more robust setups, because their setup allowed this sort of cheating and they didn't catch it before posting.
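To make "more robust setup" concrete, here's a rough sketch (mine, not from the original experiment) of the kind of harness I have in mind for measuring raw chess strength: the model only ever sees the position as text and replies with a move, so there's no filesystem or game-state file to tamper with. It assumes the python-chess package and a local Stockfish binary; `ask_model()` is a hypothetical placeholder for whatever model API is under test.

```python
# Sketch of a cheat-resistant "raw chess strength" harness.
# Assumes python-chess and a Stockfish binary on PATH; ask_model() is a
# hypothetical stand-in for the model under test, not a real API.
import chess
import chess.engine


def ask_model(board: chess.Board) -> str:
    """Hypothetical LLM call: given a position, return a move in UCI notation."""
    raise NotImplementedError("wire this up to the model under test")


def play_one_game(stockfish_path: str = "stockfish", max_plies: int = 200) -> str:
    """Model plays White against Stockfish; returns the game result string."""
    board = chess.Board()
    engine = chess.engine.SimpleEngine.popen_uci(stockfish_path)
    try:
        for _ in range(max_plies):
            if board.is_game_over():
                break
            if board.turn == chess.WHITE:
                try:
                    # An illegal or unparseable reply is scored as a loss,
                    # rather than handing the model another channel to exploit.
                    board.push_uci(ask_model(board))
                except ValueError:
                    return "0-1 (illegal move by model)"
            else:
                result = engine.play(board, chess.engine.Limit(time=0.1))
                board.push(result.move)
        return board.result()
    finally:
        engine.quit()
```

The point of the design is just that the model's only output channel is a move string, so any anomalously strong result has to come from actual chess ability rather than from editing the game state.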
In practice the PPUs are basically equity for compensation purposes, though probably with worse terms than e.g. traditional RSUs.
now performance is faster than it's ever been before
As a point of minor clarification: performance now is probably slightly worse than it was in the middle of the large refactoring effort (after the despaghettification, but before the NextJS refactor), but still better than at any point before the start of the (combined) refactoring effort. It's tricky to say for sure, though, since there are multiple relevant metrics and some of them are much more difficult to measure now.
Yes, this is just the number for a relatively undifferentiated (but senior) line engineer/researcher.
This isn't a deliberate policy choice, but might be a consequence of temporary anti-crawler measures. (The label about robots.txt is wrong; we're using Vercel's firewall to challenge bot-like requests. In principle this ought to exempt Claude from the challenge, since it should have a whitelisted user agent, but maybe someone messed up between Anthropic and Vercel.)
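For illustration, here's a minimal sketch of the kind of user-agent allowlist check that has to be configured correctly for Claude to get through. The real logic lives in Vercel's firewall rules rather than in our application code, and the agent strings below are my assumptions about what Anthropic sends, not a definitive list.

```python
# Illustrative only: the actual challenge logic is Vercel firewall config,
# not code we run. This just shows the allowlist check that has to match
# the crawler's user agent for it to skip the bot challenge.
ALLOWLISTED_AGENT_FRAGMENTS = [
    "ClaudeBot",      # assumed UA fragment for Anthropic's crawler
    "Claude-User",    # assumed UA fragment for user-initiated fetches
]


def should_challenge(user_agent: str, looks_bot_like: bool) -> bool:
    """Return True if a request should get the bot challenge."""
    if any(fragment in user_agent for fragment in ALLOWLISTED_AGENT_FRAGMENTS):
        return False  # allowlisted agents skip the challenge entirely
    return looks_bot_like


if __name__ == "__main__":
    print(should_challenge("Mozilla/5.0 (compatible; ClaudeBot/1.0)", True))  # False
    print(should_challenge("python-requests/2.31", True))                     # True
```

If the fragments on the firewall side don't match what Anthropic actually sends (or the rule isn't enabled for the right paths), Claude would get challenged like any other bot, which would look like the behavior being reported.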