I'd be interested in anyone's thoughts on when to use this vs., e.g., METR's time horizon. The latter is of course more coding-focused than this general-purpose compilation, but that might be a feature rather than a bug for our purposes (predicting takeoff).
Here's one framing: getting a higher ECI score requires making progress on (multiple) benchmarks that other models find difficult. Making progress on METR instead involves being more consistently successful at longer coding tasks.
So ECI tracks general capabilities on a "difficulty-weighted" scale, and seems better suited to understanding the pace of progress in general, but it's also an abstract number. There's currently no mapping like "ECI of X == AGI", or a human ECI baseline. On the other hand, METR's benchmark has a nice concrete interpretation, but is more narrow.
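To make the "difficulty-weighted" idea concrete, here's a toy version of what I have in mind; this is just my own sketch, not Epoch's actual methodology, and the numbers are invented:

```python
# Toy "difficulty-weighted" score -- an illustrative sketch only,
# not the actual ECI methodology. Benchmark scores below are made up.
import numpy as np

# Rows: models, columns: benchmarks; entries are accuracy in [0, 1].
scores = np.array([
    [0.95, 0.60, 0.10],   # model A
    [0.90, 0.75, 0.30],   # model B
    [0.97, 0.85, 0.55],   # model C
])

# Define a benchmark's "difficulty" as 1 minus the average score across models,
# so benchmarks most models fail get the largest weight.
difficulty = 1.0 - scores.mean(axis=0)
weights = difficulty / difficulty.sum()

# A model's difficulty-weighted capability score.
weighted_scores = scores @ weights
print(weighted_scores)  # higher = stronger on the benchmarks others find hard
```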
We're working on mapping ECI to more interpretable metrics (in fact, METR Time Horizons is one candidate), as well as allowing users to choose a subset of underlying benchmarks if they would prefer to weight ECI more heavily towards particular skills like coding.
Also note that we don't currently include METR's benchmarks as inputs to ECI, but we may add them in future iterations.
What is uniquely interesting/valuable about METR time horizons is that the score is meaningful and interpretable. "Can do software tasks that would take an expert 2h with 50% success probability" is very specific. "Has score y on benchmark x" is only valuable for comparisons; it does not tell you what's going to happen when models reach score z.
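To illustrate, here's roughly how I understand a 50% time horizon to be read off from task-level results: fit success probability against (log) human task length and find where the curve crosses 0.5. The data and fit below are purely illustrative; METR's actual pipeline differs in the details.

```python
# Rough sketch of a 50% "time horizon": fit success probability against
# (log) human task length, then find the length at which the fit crosses 0.5.
# Task data below is invented for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression

human_minutes = np.array([2, 5, 10, 30, 60, 120, 240, 480], dtype=float)
succeeded = np.array([1, 1, 1, 1, 0, 1, 0, 0])  # model outcomes per task

X = np.log(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, succeeded)

# P(success) = 0.5 where the linear predictor is zero: coef * x + intercept = 0.
log_horizon = -clf.intercept_[0] / clf.coef_[0, 0]
print(f"50% time horizon ≈ {np.exp(log_horizon):.0f} human-minutes")
```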
As I mentioned elsewhere, I'm interested in the question of how you plan to re-base the index over time.
The index excludes models from before 2023, which is understandable, since they couldn't use benchmarks released after that date, which are now the critical ones. Still, it seems like a mistake, since it gives me no indication of how adaptable the method will be once current metrics are saturated. The obvious way to do this seems (to me) to be by including earlier benchmarks that are now saturated so that the time series can be extended backwards. And I understand that this data may be harder to collect, but as noted, it seems important to show future adaptability.
I'm interested in the question of how you plan to re-base the index over time.
Currently, we've chosen to scale things such that Claude 3.5 Sonnet gets 130 and GPT-5 (medium) gets 150. As we add new benchmarks, the rough plan is to try to maintain that. We're also planning on adding some way for users to define their own subset of benchmarks, in case you disagree with our choices. That should let you see how things would look under various hypothetical "rebasings".
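For intuition, fixing two reference models like that pins down an affine rescaling of whatever raw capability estimate sits underneath. A minimal sketch (with made-up raw values, not our actual internals):

```python
# Affine rescaling that pins two anchor models to fixed index values.
# The raw capability estimates here are placeholders, not real ECI internals.

def rescale(raw, raw_lo, raw_hi, idx_lo=130.0, idx_hi=150.0):
    """Map a raw capability estimate onto the index scale where
    raw_lo -> idx_lo (e.g. Claude 3.5 Sonnet -> 130) and
    raw_hi -> idx_hi (e.g. GPT-5 (medium) -> 150)."""
    return idx_lo + (raw - raw_lo) * (idx_hi - idx_lo) / (raw_hi - raw_lo)

raw_sonnet, raw_gpt5 = 1.8, 2.6   # made-up raw estimates for the two anchors
print(rescale(2.2, raw_sonnet, raw_gpt5))  # a model halfway between -> 140.0
```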
The index excludes models from before 2023, which is understandable, since they couldn't use benchmarks released after that date.
To be clear, that's not why we didn't include older models. There is no technical problem with including older models; we just have sparse data on models prior to 2023. We chose to leave them out for now since the noisy estimates could be misleading, but we're trying to collect more data so that we can extend things back further.
The obvious way to do this seems (to me) to be by including earlier benchmarks that are now saturated so that the time series can be extended backwards.
We do! Currently we include many old benchmarks like GSM8k, HellaSwag, MMLU, WinoGrande, etc. There is a list of all benchmarks on the ECI page if you scroll down to the methodology section. In our initial release we haven't visualized benchmark difficulty scores, but we're planning on making those public and showing some analysis soon.
I tried to do something like that 1-2 years ago, where I modelled benchmarks as having normally distributed difficulties. But there was little data and an afternoon of hacking didn't really lead to anything particularly interesting.
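For reference, the kind of thing I mean, in miniature (invented scores, and certainly not ECI's actual model): give each benchmark a difficulty with a normal prior, give each model an ability, and assume the expected score is a logistic function of ability minus difficulty.

```python
# Minimal Rasch-style sketch: benchmark difficulties get a normal prior,
# model abilities are free, and expected score = sigmoid(ability - difficulty).
# My own toy version, not ECI's actual model; the score matrix is made up.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

# Rows: models, columns: benchmarks; entries are average accuracy in [0, 1].
scores = np.array([
    [0.92, 0.55, 0.15],
    [0.88, 0.70, 0.35],
    [0.96, 0.82, 0.60],
])
n_models, n_benchmarks = scores.shape
noise_sd = 0.05  # assumed observation noise on the aggregate scores

def neg_log_posterior(params):
    ability = params[:n_models]
    difficulty = params[n_models:]
    pred = expit(ability[:, None] - difficulty[None, :])
    log_lik = -np.sum((scores - pred) ** 2) / (2 * noise_sd ** 2)  # Gaussian noise
    log_prior = -0.5 * np.sum(difficulty ** 2)  # N(0, 1) prior on difficulties
    return -(log_lik + log_prior)

res = minimize(neg_log_posterior, x0=np.zeros(n_models + n_benchmarks))
print("abilities:   ", res.x[:n_models].round(2))
print("difficulties:", res.x[n_models:].round(2))
```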
The ECI suggests that the best open-weight models train on ~1 OOM less compute than the best closed-weight ones. I wonder what to make of this, if anything.
I would guess that this is mainly due to there being much more limited FLOP data for the closed models (especially for recent models), and to closed-model developers focusing much less on small-training-FLOP models (e.g. <1e25 FLOP).
Another consideration is that we tend to prioritize evaluations on frontier models, so coverage for smaller models is spottier.
We at Epoch AI have recently released a new composite AI capability index called the Epoch Capabilities Index (ECI), based on nearly 40 underlying benchmarks.
Some key features...
ECI will allow us to track trends in capabilities over longer spans of time, potentially revealing changes in the pace of progress. It will also improve other analyses that would otherwise depend on a single benchmark for comparison.
You can find more details about our methodology and how to interpret ECI here. We want this to be useful for others' research; comments and questions are highly valuable to us!
This release is an independent Epoch AI product, but it builds on research done with support from Google DeepMind. We'll be releasing a full paper with more analysis soon, so keep an eye out!