The thing METR is measuring seems slightly different from "superhuman coder". My understanding is that they're dropping an AI into an unfamiliar codebase and telling it to do something with no context or design help, so it's partly software architecture and partly coding. On pure coding tasks, Claude Code is clearly superhuman already.
I spent a few hours over the last few days collaborating with Claude on design docs and some general instructions, then having it work through massive todo lists fully autonomously[1]. This is weeks of coding, and it did it in a few hours (mostly slowed down by me getting around to giving it more work).
This is the first time I've had it do tasks of this scale, so I'm not doing anything special: just having it propose a design, telling it which parts I want done differently, then having it make a todo list and execute it.
Example prompt:
Can you go through @TODO.md, delegating each task to opus subagents and ensuring that they understand all of the necessary context and implement the task, check it off, and commit it, then move onto the next task until the list is done?
Isn't getting working/production-ready code done faster the definition of being better than you at coding? It's possible that the creator of Claude Code is wrong about this and would be more productive long-term writing the code himself, or that the code is actually unacceptable in ways he hasn't noticed yet. But if he's correct that it's more productive to have Claude write it, then Claude is better at coding than he is.
I'm not sure if his approach is actually productive for this, but for the longest time, the standard response to Eliezer's concerns was that they're crazy sci-fi. Now that they're not crazy sci-fi, the response is that they're obvious. Constantly reminding people that his crazy predictions were right (and everyone else was wrong in predictable ways) is a strategy to get people to actually take his future predictions seriously (even though they're obviously crazy sci-fi).
The probability for "Garage" hit 99% (Logit 15) at the very first step and stayed flat.
Is the problem that these questions are too easy, so the LLM outputs reasoning because that's sometimes helpful, even though in this case it doesn't actually need it?
I'd be curious to see what the results look like if you give it harder questions.
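A minimal sketch of the kind of per-step probability read-out described above, assuming any HuggingFace causal LM; the model name, toy question, and candidate answer are placeholders, not the original setup:

```python
# Rough sketch (assumptions: any HuggingFace causal LM; the model name,
# toy question, and candidate answer below are placeholders, not the
# original experiment). Measures P(answer token) after each reasoning step.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Q: Where in a house do you usually park a car?\nA:"  # toy question
answer_id = tok(" Garage", add_special_tokens=False).input_ids[0]

reasoning_steps = [
    "",  # no reasoning at all
    " Let me think.",
    " Let me think. A car at home is normally kept in a dedicated structure.",
]

with torch.no_grad():
    for step in reasoning_steps:
        ids = tok(prompt + step, return_tensors="pt").input_ids
        logits = model(ids).logits[0, -1]          # next-token logits
        prob = torch.softmax(logits, dim=-1)[answer_id].item()
        logit = logits[answer_id].item()
        print(f"p(Garage)={prob:.3f}  logit={logit:.1f}  after: {step!r}")
```

If the probability is already saturated before any reasoning is emitted, that would support the "questions are too easy" explanation.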
What do we see if we apply interpretability tools to the filler tokens or repeats of the problem? Can we use the model's internals to better understand why this helps?
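For example, a crude first pass would be to check how much attention the final (answer) position pays to the filler tokens versus the question. A sketch assuming a HuggingFace model with accessible attentions; the model, question, and filler string are stand-ins:

```python
# Crude sketch (assumptions: a HuggingFace causal LM; model, question, and
# filler string are stand-ins). Compares attention from the last position
# to filler tokens vs. question tokens, averaged over layers and heads.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

question = "Q: What is 17 + 26?\nA:"
filler = " ." * 30                      # rough stand-in for filler tokens
enc = tok(question + filler, return_tensors="pt")
n_question = len(tok(question).input_ids)

with torch.no_grad():
    out = model(**enc, output_attentions=True)

att = torch.stack(out.attentions)       # [layers, batch, heads, seq, seq]
from_last = att[:, 0, :, -1, :]         # attention out of the final position
print("to question tokens:", from_last[..., :n_question].sum(-1).mean().item())
print("to filler tokens:  ", from_last[..., n_question:].sum(-1).mean().item())
```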
Yeah, I predicted that this would be too hard for models to learn without help, so I'm really curious to see how it managed to do this. Is it using the positional information to randomize which token does which part of the parallel computation, or something else?
I'll have to think about what other weird things I can write about to get my variance up. I won't stop until I'm number one!
While this might be useful to you in particular, I'm not convinced this generalizes very well to the whole US.
(It's also unlikely that the median American eats 3,500 calories of meat every day, but these numbers should at least be comparable over time.)
I decided to incur the personal cost of not getting vaccinated as a sign of protest.
If you think vaccination is a good idea but forcing people to prove vaccination status is bad, wouldn't it make more sense to get vaccinated but refuse to provide proof to gatekeepers? I'd worry that being unvaccinated sends the wrong signal: that you're anti-vaccine rather than anti-vaccine-mandate.
One thing to consider is food expenditure over time:
People spent about twice as large a share of their income on food in 1950, despite eating out half as much and mostly cooking foods we'd consider extremely cheap today.
I was thinking of "coder" as specifically the job of writing code, which I assume is what the Claude Code guy meant too. AI is clearly not reliable at system design yet.