Another consideration is that we tend to prioritize evaluations on frontier models, so coverage for smaller models is spottier.
Here's one framing: getting a higher ECI score requires making progress on (multiple) benchmarks that other models find difficult. Making progress on METR instead involves being more consistently successful at longer coding tasks.
So ECI tracks general capabilities on a "difficulty-weighted" scale, and seems better suited to understanding the pace of progress in general, but it's also an abstract number. There's currently no mapping like "ECI of X == AGI", or a human ECI baseline. On the other hand, METR's benchmark has a nice concrete interpretation, but is narrower.
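To make "difficulty-weighted" a bit more concrete, here's a toy sketch in the spirit of a Rasch-style item response model (illustrative only, with made-up numbers; not a description of our actual pipeline): each benchmark gets a difficulty, each model gets an ability, and a model's fitted ability is pulled up most by doing well on benchmarks that other models find hard.

```python
import numpy as np

# Toy Rasch-style illustration of "difficulty-weighted" capability scores.
# All numbers are made up; this is not our actual methodology.
# P(model m succeeds on a task from benchmark b) = sigmoid(ability[m] - difficulty[b]).

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n_models, n_benchmarks = 6, 8
true_ability = rng.normal(0.0, 1.0, n_models)
true_difficulty = rng.normal(0.0, 1.0, n_benchmarks)

# Observed accuracies of each model on each benchmark (with a little noise).
acc = sigmoid(true_ability[:, None] - true_difficulty[None, :])
acc = np.clip(acc + rng.normal(0.0, 0.02, acc.shape), 0.01, 0.99)

# Fit abilities and difficulties on the logit scale, where the model is linear:
# logit(acc[m, b]) ~ ability[m] - difficulty[b].
logit = np.log(acc / (1.0 - acc))
ability = np.zeros(n_models)
difficulty = np.zeros(n_benchmarks)
for _ in range(50):
    ability = (logit + difficulty[None, :]).mean(axis=1)
    difficulty = (ability[:, None] - logit).mean(axis=0)
    difficulty -= difficulty.mean()  # pin down the scale's zero point

# The same accuracy counts for more on a high-difficulty benchmark,
# which is the sense in which the resulting scale is "difficulty-weighted".
print("fitted abilities:   ", np.round(ability, 2))
print("fitted difficulties:", np.round(difficulty, 2))
```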
We're working on mapping ECI to more interpretable metrics (in fact, METR Time Horizons is one candidate), as well as allowing users to choose a subset of underlying benchmarks if they would prefer to weight ECI more heavily towards particular skills like coding.
Also note that we don't currently include METR's benchmarks as inputs to ECI, but we may add them in future iterations.
Furthermore, without unprecedented changes in redistribution, declining labor share also translates into a structural decline in household consumption power, as humans lose their primary means of earning the income needed to participate in the economy as consumers.
This holds only if the labor share of income shrinks faster than total purchasing power grows. Overall, I still think the misaligned economy argument goes through even if household consumption power grows in absolute terms, so long as "human preference aligned dollars" shrink as a fraction of total dollars spent.
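A made-up numerical illustration of that distinction (the figures are hypothetical, not from the post): total output can grow enough that labor income rises in absolute terms even while the labor share, and with it the fraction of spending steered by human preferences, falls sharply.

```python
# Illustrative numbers only: labor income grows in absolute terms
# even as the labor share (and hence the share of spending aligned
# with human preferences) collapses.
gdp_before, labor_share_before = 100.0, 0.60
gdp_after,  labor_share_after  = 1000.0, 0.10   # 10x growth, share falls 60% -> 10%

labor_income_before = gdp_before * labor_share_before   # 60.0
labor_income_after  = gdp_after  * labor_share_after    # 100.0, higher in absolute terms

print("absolute labor income:", labor_income_before, "->", labor_income_after)
print("share of total spending steered by human incomes:",
      labor_share_before, "->", labor_share_after)
```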
Domain: Piano
Link: Seymour Bernstein Teaches Piano https://youtu.be/pRLBBJLX-dQ?si=-6EIvGDRyw0aJ0Sq
Person: Seymour Bernstein
Background: Pianist and composer, performed with the Chicago Symphony Orchestra, Adjunct Associate Professor of Music and Music Education at New York University.
Why: Tonebase (a paid music learning service) recorded a number of free-to-watch conversations with Bernstein while he plays through or teaches a piece. Bernstein was about 90 years old at the time of recording and shares an incredible amount of tacit knowledge, especially about the body mechanics of playing the piano.
Re: specific claims to falsify, I generally buy the argument.
If I had to pick out specific aspects which seem weaker, I think they would mostly be related to our confusion around agent foundations. It isn't obvious to me that the way we describe "intelligence" or "goals" within the instrumental convergence argument is a good match for how current systems operate (though it seems close enough, and we shouldn't expect any mismatch to be in a direction that makes the situation better).
I would agree that instrumental convergence is probably not a necessary component of AI x-risk, so you're correct that "crux" is a bit of a misnomer.
However, in my experience it is one of the primary arguments people rely on when explaining their concerns to others. The correlation between credence in instrumental convergence and AI x-risk concern seems very high. IMO it is also one of the most concerning legs of the overall argument.
If somebody made a compelling case that we should not expect instrumental convergence by default in the current ML paradigm, I think the overall argument for x-risk would have to look fairly different from the one that is usually put forward.
Currently, we've chosen to scale things such that Claude 3.5 Sonnet gets 130 and GPT-5 (medium) gets 150. As we add new benchmarks, the rough plan is to keep those anchor scores roughly fixed. We're also planning on adding some way for users to define their own subset of benchmarks, in case you disagree with our choices. That should let you see how things would look under various hypothetical "rebasings".
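As a rough sketch of what such a rebasing could look like (toy code with hypothetical raw scores, not our actual implementation), the rescaling is just an affine map chosen so that two anchor models land on fixed target values:

```python
def rebase(raw_scores, anchor_a, anchor_b, target_a=130.0, target_b=150.0):
    """Affinely rescale raw capability scores so two anchor models land on
    fixed targets (e.g. Claude 3.5 Sonnet -> 130, GPT-5 (medium) -> 150).
    Toy sketch; raw scores below are made up."""
    slope = (target_b - target_a) / (raw_scores[anchor_b] - raw_scores[anchor_a])
    intercept = target_a - slope * raw_scores[anchor_a]
    return {model: slope * s + intercept for model, s in raw_scores.items()}

# Hypothetical raw scores on an arbitrary internal scale.
raw = {"claude-3.5-sonnet": 1.20, "gpt-5-medium": 1.85, "older-model": 0.40}
print(rebase(raw, "claude-3.5-sonnet", "gpt-5-medium"))
# -> anchors map to 130 and 150; everything else moves with the same affine map.
```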
To be clear, that's not why we didn't include older models. There is no technical problem with including older models; we just have sparse data on models prior to 2023. We chose to leave them out for now since the noisy estimates could be misleading, but we're trying to collect more data so that we can extend things back further.
We do! Currently we include many older benchmarks like GSM8K, HellaSwag, MMLU, WinoGrande, etc. There is a list of all benchmarks on the ECI page, if you scroll down to the methodology section. In our initial release we haven't visualized benchmark difficulty scores, but we're planning on making those public and showing some analysis soon.