Wiki Contributions


D&D.Sci Dungeoncrawling: The Crown of Command Evaluation & Ruleset

Reflections on my attempt:

I’m pleasantly surprised by how well I did in both a general and absolute sense (if you asked me yesterday I would not have put my strategy’s odds of overall success above 20%). Of course, ~half the credit for this victory goes to Measure, whose inferences about the dungeons’ likely populations I was shameless in making use of.

If I’d had more time and energy to spare, I would have looked into how reliably teams which counter all their encounters win, and how character levels affect this outcome; from what I read here, I think that would have been a good next step.

Reflections on the challenge:

  • The problem statement was the most fun-to-read D&D.Sci introduction so far, including (imo) all of my own.
  • I found myself surprisingly uncomfortable playing a villain. (If you don’t get why I’d be bothered by mostly-task-irrelevant skippable flavortext then that makes two of us.)
  • The mechanic of “you have fungible but limited resources to allocate between multiple tasks, all of which have to be completed for a win” turns out to be an extremely good fit for D&D.Sci, and I look forward to using it in several of my scenarios.
  • The difficulty level was in an uncanny valley between “simple task that’s only hard because Inference and Application are inherently hard” and “arbitrary fractal complexity which rivals that of the real world”; I would have liked this game more if it were significantly harder or easier.
  • I got to play another D&D.Sci scenario! Which I didn’t make! And which probably helped me to get better at making D&D.Sci scenarios!

Regarding future feedback:

If you – I here refer both to the esteemed OP and to anyone else with a complete-but-unreleased D&D.Sci game – want me to proof a future challenge before it’s released to the wider public, dm me and I’d be happy to take a look. I’d also be (reluctantly) willing to give (a small amount of) more general support to people perpetuating my genre (even though this would disqualify me from playing the resulting games, and my advice would mostly be variations on “do the things I did but better”).

D&D.Sci Dungeoncrawling: The Crown of Command

Thank you for making this.

Misc. Insights:

  • An adventuring party has a success chance of ~64%. We need to get three of them to win in a row, which at base rates would be only a ~26% chance. This is worrying.
  • It looks like level has almost no impact on chance of success, but there’s a major confounder in that more expensive teams get sent on longer and more arduous journeys: length of a dungeon correlates very strongly with the total price of an expedition, and dungeons with multiple dragons attract a disproportionate number of >level 6 adventurers.
  • Success rates for dungeons with ‘Goblin’ in the name are much lower than average, though so is the average level of the party sent. I think this means the current market is pretty good at pricing in general, but systematically underestimates Goblins.
  • A (very) crude approximation is that the price of an expedition is about 2000gp times the number of encounters it involves. Our adventurers have to survive a total of 23 encounters, and we only have 36000gp to play with. This is worrying.
  • Classes seem about evenly distributed, but there’s a bias towards diversity; there are far fewer teams with two or more of a given class than you’d expect if it were random. However, this bias is, if anything, not strong enough; success rates for parties with four unique classes are much higher than success rates for parties with three. I don’t know to what extent this is because more variety increases the odds that a party will have the right counter to an obstacle, and to what extent class diversity is Inherently Good.
  • Adventuring parties tend to have everyone be about the same level; this tendency is so strong that the sampling bias makes it hard to work out whether it’s a good idea. I guess I’ll trust convention here?
  • Literally all the parties with a gap of >3 between their max and min levels are like that because a high-level Rogue joined a low-level party. I’d suspect that this is Rogues faking being higher-levelled to get more gold, but actually teams like this have an above-average success rate, so I have no idea what’s going on. (Fortunately, I don’t have to, since my strategy makes no use of high-level Rogues.)
  • In general, Clerics are the most useful class, followed by Mages and Fighters.
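The two “This is worrying” bullets above reduce to quick back-of-envelope arithmetic. (The independence assumption in the first calculation is mine, not something the data guarantees.)

```python
# Back-of-envelope checks on the two "worrying" bullets above.
p_single = 0.64                      # observed per-expedition success rate
print(f"P(all three succeed) = {p_single ** 3:.2f}")  # 0.26, if independent

encounters = 23                      # total across the three dungeons
gp_per_encounter = 2000              # crude market rate inferred from the data
budget = 36000
shortfall = encounters * gp_per_encounter - budget
print(f"Shortfall at market rates: {shortfall}gp")    # 10000gp short
```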

The actual backbone of my strategy:

  • A dungeon is a marathon, not a series of sprints; probability of success in later stages is affected by how well a party handled earlier ones. This is shown by the fact that literally all parties managed to defeat their first encounter, and only <0.1% fall to their second (despite the fact that either of these can be Dragons!). The practical implication is that handling ‘easy’ encounters smoothly probably matters, since it means the party will be fresh for the real threats.
  • Specific encounters have specific counters. By finding what distinguishes the average party defeated by a thing from the average party that encounters a thing, I can determine what classes best combat which obstacles.
  • Measure has very cleverly inferred what encounters each dungeon is likely to contain, and I’m not shy about copying their homework. (Thank you, Measure.)
  • Different encounters have vastly different failure probabilities. Dragons are the most dangerous, and Goblin Chieftains are also pretty bad. Our parties will probably have to fight both. This is worrying.
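The counter-finding method in the second bullet can be sketched as follows. Everything here — the party records and the way they’re encoded — is invented toy data for illustration, not the actual dataset:

```python
from collections import Counter

# Sketch: compare class frequencies among parties that MET an encounter
# against parties it DEFEATED. A class over-represented among the defeated
# (relative to everyone who met the encounter) is a poor counter; one
# under-represented among the defeated suggests a good counter.
met = [
    ("Rogue", "Mage", "Cleric", "Fighter"),
    ("Rogue", "Druid", "Cleric", "Ranger"),
    ("Mage", "Mage", "Fighter", "Fighter"),
]
defeated = [("Mage", "Mage", "Fighter", "Fighter")]  # parties that lost

def class_rates(parties):
    """Fraction of all party slots occupied by each class."""
    counts = Counter(cls for party in parties for cls in party)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}

met_rates, lost_rates = class_rates(met), class_rates(defeated)
for cls in met_rates:
    # Positive delta: class is rarer among the defeated -> likely a counter.
    print(cls, met_rates[cls] - lost_rates.get(cls, 0.0))
```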


(I reserve the right to change all of these if I come up with a better idea or another commenter shares a new and relevant insight.)

For the Lost Temple of Lemarchand, I’ll send a level 2 Rogue to handle the needletraps*, a level 2 Druid to handle the snakepits, and a level 2 Cleric and a level 2 Mage to handle the various undead.

For the Infernal Den of Cheliax, I’ll send a level 5 Fighter to fight the orcs and the dragon, a level 3 Druid to keep everyone safe from the snakepits and wolves so they’re fresh for the boss fight, and a level 3 Ranger and level 3 Mage to help the Fighter with the dragon (dragons are scary!).

For the Goblin Warrens of Khaz-Gorond, I’ll send a level 4 Fighter to handle the goblin chieftain and the boulders, a level 4 Ranger to handle the rank-and-file goblins, a level 3 Cleric to help the Ranger out, and . . . I guess a level 3 Fighter to support the first one? (I hate to have doubles on a team but there’s no other class that does as well against chiefs and boulders.)

*This is the one place I feel confident Measure made a mistake: “Rogues help with needletraps” is the most reliable inference I ran into in my encounter-countering research, so I don’t get why they’d include a Mage and a Fighter but not a Rogue in Adventuring Party #1.


The odds don’t seem great. The odds of all three adventures concluding successfully really don’t seem great. And that’s assuming all my inferences are correct, which they aren’t. I know my character is set on this path, but if I were faced with a prospect like this in real life, there’s no way I’d bet anything I’d be afraid to lose.

How to generate idea/solutions to solve a problem?

Disclaimer: I've never tried any of these things on real problems, they just seem like obvious answers.

The oldschool answer is to use Tarot or the Oblique Strategies, creatively interpret what you draw in the context of your problem, and evaluate whether that's an improvement over what you're currently doing. Personally, I'm more fond of the modern counterpart, Weird Sun Twitter's "Have you tried X?" meme, as incarnated here, here and here.

Creating a truly formidable Art

How does it work?


I think the name may have given the wrong impression. The 'D&D' part of D&D.Sci is mostly the trappings of the genre, not the substance; monsters, wizards and (simulated) dicerolls yes, anything resembling Actual Roleplaying no.

Since you asked . . . from the top, the typical/intended way to Consume my Product is:

  • Download the dataset provided in the introductory post.
  • Investigate the scenario and decide the best course of action, using the dataset, the problem description, and a vague sense of what tricks you think the GM will/won't use.
  • OPTIONAL: Post about any ambiguities in the problem description or apparent errors in the dataset so the GM can clarify/fix them.
  • OPTIONAL: Post your findings and/or call your decision in advance (for bragging rights, this is best done in the ~week between the problem and the solution being posted).
  • OPTIONAL: Update your analysis/answer based on what other people said.
  • Use your solution in the evaluator the corresponding "Evaluation and Ruleset" post links to; see what happens to your character as a result, and what the odds of that outcome were given your choices.
  • Read the ruleset: see how well your deductions matched reality, and how close your strategy was to the optimal one.
  • OPTIONAL: Read the code used to generate the dataset.
  • OPTIONAL: Post about how well your strategy worked, and what you think you’ve learned from the game.
  • OPTIONAL: Post about what you think was good/bad about the scenario and what you’d like to see more of in future ones.
  • OPTIONAL: Make your own scenario. (aphyer has built two so far – both of which are very good – and various other LWers are planning to run games at some point this year)

I'd shied away from RPG style simulated practice because of the difficulty with embodied integration. I find it far too easy to view my character from the outside and solve their situation like a puzzle, rather than experiencing myself as the character who's actually encountering the confusion and psychological states and trying to navigate them from the inside.

Your Honor, I plead guilty to exactly half of this charge. It’s true that - for example - the player in Voyages of the Gray Swan will not be feeling the terror, desperation and confusion of the character they play, because they aren’t actually having the experience of trying to analyze their way out of being eaten by crabmonsters. As such, they won’t be able to test or develop their making-good-decisions-under-pressure mental musculature: this is a weakness of the genre as it currently exists, and I cop to it.

However, I can tell you from experience that players do get to use their pattern-matching, noticing-confusion, admitting-they’re-wrong, and balancing-priors-against-the-evidence skills, because the scenarios are intentionally weird and messy enough that they have to do those things to reach the best answer. I suppose I’d summarize this by saying I think D&D.Sci players get to practice being rational, but not being not-irrational?

(I once tried to give players a chance to use their not-irrationality skills by writing one of my scenarios as fanfiction of a story with some compelling characters, and inventing a situation where those characters could die, survive, or survive and avoid some of the problems they face in canon, depending on the player's decisions. This completely failed for reasons documented in the Reflections section of the evaluation post, chief among which is that none of my players had read the story my scenario was fanfiction of. I have various tentative plans for (hopefully!) more effective projects with the same goal.)

Creating a truly formidable Art

What would pressure-testing in the context of rationality look like?

Well, honestly, I don't yet know.

I have a few bad examples that don't strike me as entirely wrong . . .


At the risk of being accused of flagrant self-promotion, I also have a few bad examples that don't strike me as entirely wrong. My data science challenges are only tractable to players with the appropriate skillset, and resemble real-life problems the same way mystery novels resemble real-life detective work . . . but if you're looking for novel ways to test for skill at Inferring The Truth And Then Using It, they're probably relevant to your interests.

D&D.Sci 4th Edition: League of Defenders of the Storm Evaluation & Ruleset

This was extremely good. In particular, I like that you managed to make the challenge tractable to both Analysis and Machine Learning. I also appreciated that you included an explicit Real-world Data Science Moral in the wrap-up; I should try to do that more often.

D&D.Sci 4th Edition: League of Defenders of the Storm

. . . I feel oddly proud to have continued the tradition of D&D players getting in-universe names wrong.

D&D.Sci 4th Edition: League of Defenders of the Storm

Thank you for making this.

Regular team:

Nullifying Nightmare, Blaze Boy, Greenery Giant, Tidehollow Tyrant, and . . . yeah, okay, Phoenix Paladin.

(I was on the fence about whether the last spot should go to Paladin or Ranger, but when I saw Measure's answer I decided to let hipsterism be the deciding factor.)

Key Insights:

There seems to be a rock-paper-scissors thing going on here: Earthy fighters have an advantage over Watery fighters, Watery fighters have an advantage over Flamey fighters, and Flamey fighters - kinda, sorta, unreliably - have an advantage over Earthy fighters. (And the Nightmare has an advantage over everyone.)

This is relevant because 3/5 of the opposing team are Earthy fighters, including Greenery Giant, whose strength rivals the Nightmare’s, and whose presence on a team predicts a ~60% chance of victory.

Teams which are slanted too heavily towards a given element have an extremely low win rate. I can't tell to what extent this is because losing the rock-paper-scissors game hurts you more than winning it helps, and to what extent balance is inherently valuable, so I'm playing it safe and not building an entire team of firestarters (also, there are only two Flamey fighters with non-terrible win/loss ratios).

Tangential insights:

I infer from the format of the alternative list that - absent an extremely tricky fakeout - position doesn't matter: A+B+C+D+E is equivalent to E+D+C+B+A.

Different fighters are used with very different frequencies, but this sampling bias doesn't seem to affect my analysis much.

Eyeballing the correlation matrix, it looks like teams are thrown together randomly; no pairs that always show up together, etc. This makes things much simpler, since I can be confident that (for example) GG's apparent power isn't just because people keep using him alongside NN (or vice versa).

There's a random element here. Existence proof: A+B+C+S+V vs A+E+I+T+V happened twice with different outcomes. Given this, I'd want to push Cloud Lightning Gaming to have the match be best-of-five, to decrease randomness' relevance to the outcome.
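The best-of-five suggestion follows from a standard calculation: any per-game edge gets amplified over more games. This sketch assumes independent games with a fixed win probability; p = 0.6 is an arbitrary illustration, not a figure from the dataset:

```python
from math import comb

def p_best_of(n, p):
    """Probability of winning a best-of-n series, given per-game win prob p."""
    need = n // 2 + 1
    # Sum over winning exactly k of n games, for all k that clinch the series.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

p = 0.6
print(p_best_of(1, p))  # 0.6
print(p_best_of(5, p))  # ~0.683 -- the stronger team's edge grows
```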

I appreciate the omission of letters that would let us (accidentally or otherwise) spell out common swearwords.

PVP team:



I made a Sequence for my replayable challenges, but think we should keep the tag. That way people wanting to make posts about D&D.Sci will have something to tag them with.

D&D.Sci Pathfinder: Return of the Gray Swan Evaluation & Ruleset

You may want to include a link to the challenge in this post, so people seeing it on the frontpage know what you're referring to.
