Furthermore, without unprecedented changes in redistribution, a declining labor share also translates into a structural decline in household consumption power, as humans lose their primary means of earning the income needed to participate in the economy as consumers.
This holds only if the labor share of income shrinks faster than purchasing power grows. Overall, I still think the misaligned-economy argument goes through even if household consumption power grows in absolute terms, so long as "human-preference-aligned dollars" shrink as a fraction of total dollars spent.
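To put toy numbers on that distinction (all figures below are invented, purely to illustrate how absolute growth and a shrinking share can coexist):

```python
# Toy numbers, purely illustrative: household (human-preference-aligned) spending
# grows 1.5x in absolute terms, but total spending grows 10x as AI-directed
# spending explodes, so the human-aligned share of the economy still collapses.
household_t0, total_t0 = 60.0, 100.0    # human-aligned dollars start at 60% of spending
household_t1, total_t1 = 90.0, 1000.0   # later: more absolute consumption power...

share_t0 = household_t0 / total_t0      # 0.60
share_t1 = household_t1 / total_t1      # 0.09

assert household_t1 > household_t0      # consumption power grew in absolute terms
assert share_t1 < share_t0              # ...but shrank as a fraction of all dollars
print(f"human-preference-aligned share: {share_t0:.0%} -> {share_t1:.0%}")
```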
Domain: Piano
Link: Seymour Bernstein Teaches Piano https://youtu.be/pRLBBJLX-dQ?si=-6EIvGDRyw0aJ0Sq
Person: Seymour Bernstein
Background: Pianist and composer, performed with the Chicago Symphony Orchestra, Adjunct Associate Professor of Music and Music Education at New York University.
Why: Tonebase (a paid music-learning service) recorded a number of free-to-watch conversations with Bernstein in which he plays through or teaches a piece. Bernstein was about 90 years old at the time of recording and shares an incredible amount of tacit knowledge, especially about body mechanics when playing piano.
Re: specific claims to falsify, I generally buy the argument.
If I had to pick out the weakest aspects, they would mostly relate to our confusion around agent foundations. It isn't obvious to me that the way we describe "intelligence" or "goals" within the instrumental convergence argument is a good match for the way current systems actually operate (though it seems close enough, and we have no reason to expect the mismatch to err in a direction that makes the situation better).
I would agree that instrumental convergence is probably not a necessary component of AI x-risk, so you're correct that "crux" is a bit of a misnomer.
However, in my experience it is one of the primary arguments people rely on when explaining their concerns to others. The correlation between credence in instrumental convergence and AI x-risk concern seems very high. IMO it is also one of the most concerning legs of the overall argument.
If somebody made a compelling case that we should not expect instrumental convergence by default in the current ML paradigm, I think the overall argument for x-risk would have to look fairly different from the one that is usually put forward.
Here's one framing: getting a higher ECI score requires making progress on (multiple) benchmarks that other models find difficult. Making progress on METR's time-horizon metric instead involves succeeding more consistently at longer coding tasks.
So ECI tracks general capabilities on a "difficulty-weighted" scale and seems better suited to understanding the overall pace of progress, but it's also an abstract number: there's currently no mapping like "ECI of X == AGI", nor a human ECI baseline. METR's benchmark, on the other hand, has a nice concrete interpretation but is narrower.
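As a sketch of what "difficulty-weighted" could mean here (this is a toy model I'm making up for illustration, not the actual ECI methodology; the benchmark names and solve rates are invented), passing a benchmark that most models fail contributes more to the score than passing one that most models solve:

```python
# Toy difficulty-weighted capability score (NOT the real ECI aggregation;
# benchmark names and solve rates are invented for illustration).
solve_rates = {           # fraction of reference models that pass each benchmark
    "easy_qa": 0.90,
    "hard_math": 0.20,
    "frontier_coding": 0.05,
}

def toy_score(results: dict[str, bool]) -> float:
    """Sum of difficulty weights (1 - solve rate) over benchmarks passed."""
    return sum(1.0 - solve_rates[name] for name, passed in results.items() if passed)

# Two passes on hard benchmarks outweigh one pass on an easy benchmark:
print(f"{toy_score({'easy_qa': True,  'hard_math': False, 'frontier_coding': False}):.2f}")  # 0.10
print(f"{toy_score({'easy_qa': False, 'hard_math': True,  'frontier_coding': True}):.2f}")   # 1.75
```

Whatever the real aggregation looks like, this is the intuition: the score rewards progress on benchmarks that are hard for the current model population, rather than raw pass counts.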
We're working on mapping ECI to more interpretable metrics (in fact, METR Time Horizons is one candidate), as well as allowing users to choose a subset of underlying benchmarks if they would prefer to weight ECI more heavily towards particular skills like coding.
Also note that we don't currently include METR's benchmarks as inputs to ECI, but we may add them in future iterations.