Alexander Barry — LessWrong

Thanks for this, it has been bouncing around in my head for a bit and I think it speaks to two schools of thought one can have about individual benchmarks:

1. They tell us whether/how well LLMs can do a specific task of interest, where we care about something like 'can LLMs do the task with the most performant reasonable harness/setup'. Cyber capabilities might be one example of these, where we are interested in whether there is any way to apply LLMs to compromise certain systems. Proving maths theorems might be another.

2. The tasks themselves are meaningless/of...

How far behind are open models?

Alexander Barry11d10

Thanks for this, I think this is a sensible methodological approach and puts some nice data behind a common intuition many people have.

Are Mythos' Cyber Capabilities Overstated? - Yes and No

Alexander Barry17d30

Thanks for the writeup. One confusion I think many have that I didn't see addressed here is the distinction between the version of Mythos Preview that was launched on April 7th (henceforth 'Launch') and the early pre-release checkpoint that some were given access to ('Early').

Some of the less impressive results for Mythos Preview were using the early checkpoint (AISI's original results, CTI-REALM, also METR Time Horizon results) which seems to be notably less capable than the April 7th release. AISI include some benchmarks with both 'early' and 'launch' (...

(Updated) METR's data can't distinguish between trajectories (and 80% horizons are an order of magnitude off)

Alexander Barry4mo21

I think we agree and I just stated this badly. I only meant to say that METR's original approach is closer to marginal, despite them not explicitly integrating over the random effects (although I agree you need to integrate over the random effects in models that include them to get the marginal time horizon).
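
To make the conditional vs marginal distinction concrete, here is a minimal sketch. It is not METR's code or the model from the post: it assumes a logistic success curve in log2 task length plus a Normal random effect `u` for task difficulty, and `intercept`, `slope`, `sigma` and every number used are hypothetical. The marginal horizon averages the success probability over `u` before solving for the task length.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def marginal_success(log2_minutes, intercept, slope, sigma, n=2001, width=8.0):
    """Success probability averaged over u ~ Normal(0, sigma^2).

    Uses a plain weighted sum on a symmetric grid; sigma == 0 gives the
    conditional (u = 0) curve.
    """
    if sigma == 0:
        return sigmoid(intercept - slope * log2_minutes)
    step = 2 * width * sigma / (n - 1)
    total = 0.0
    weight_sum = 0.0
    for i in range(n):
        u = -width * sigma + i * step
        w = math.exp(-0.5 * (u / sigma) ** 2)  # unnormalised Normal density
        total += w * sigmoid(intercept - slope * log2_minutes + u)
        weight_sum += w
    return total / weight_sum

def horizon(target, intercept, slope, sigma):
    """Task length in minutes at which averaged success equals `target`.

    Bisection on log2(minutes); success decreases with length when slope > 0.
    """
    lo, hi = -20.0, 40.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if marginal_success(mid, intercept, slope, sigma) > target:
            lo = mid
        else:
            hi = mid
    return 2 ** ((lo + hi) / 2)

# Hypothetical parameters, chosen only for illustration.
conditional_80 = horizon(0.8, intercept=4.0, slope=1.0, sigma=0.0)
marginal_80 = horizon(0.8, intercept=4.0, slope=1.0, sigma=2.0)
```

Under these assumptions the 50% horizon is unaffected by the random effect (by symmetry), while the marginal 80% horizon comes out shorter than the conditional one, because averaging over `u` pulls success probabilities towards 0.5.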

(Updated) METR's data can't distinguish between trajectories (and 80% horizons are an order of magnitude off)

Alexander Barry4mo20

Yes, sorry for just dropping in with "I have a model that gives different results" without actually giving the details. I'm trying to get a minimal version of it written up (I had designed it to integrate into METR's codebase, so I need to extract it as something that can exist standalone).

Within the runs.json there is a (not especially clearly named) 'human_source' field for each row. If this is set to "baseline" then the task length is based on (one or more) human baseliners, if it is "estimate" then it was just estimated without any human actually finishing...
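
As a rough illustration of using that field, the sketch below splits rows by 'human_source'. Only the field name and its two values ("baseline" and "estimate") come from the comment; the row layout, `task_id`, `human_minutes` and the example values are hypothetical stand-ins, not the real contents of runs.json.

```python
# Hypothetical rows; only 'human_source' and its two values are taken
# from the comment, every other field and value is made up.
rows = [
    {"task_id": "task_a", "human_minutes": 12.0, "human_source": "baseline"},
    {"task_id": "task_b", "human_minutes": 240.0, "human_source": "estimate"},
    {"task_id": "task_c", "human_minutes": 45.0, "human_source": "baseline"},
]

def split_by_human_source(rows):
    """Group rows by how their task length was obtained."""
    groups = {"baseline": [], "estimate": []}
    for row in rows:
        # setdefault keeps any unexpected value rather than dropping it
        groups.setdefault(row["human_source"], []).append(row)
    return groups

groups = split_by_human_source(rows)
baselined_ids = [row["task_id"] for row in groups["baseline"]]
estimated_ids = [row["task_id"] for row in groups["estimate"]]
```

Fitting once on all rows and once on the "baseline" rows only would be one way to check how much the estimated task lengths drive a result.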

(Updated) METR's data can't distinguish between trajectories (and 80% horizons are an order of magnitude off)

Alexander Barry4mo30

I chatted with Thomas a bit about this, and I also agree that the default METR model should output things that are close to the 'marginal' definition of time horizon (or at least as well as it can be approximated with the inverse logit sigmoid).

I think the important thing to realise is that while one needs to take additional steps for the 'marginal' approach when fitting a model that explicitly accounts for the deviation in task-length-for-humans vs task-difficulty-for-llms, models that don't explicitly account for this (such as the original METR mode...

(Updated) METR's data can't distinguish between trajectories (and 80% horizons are an order of magnitude off)

Alexander Barry4mo10

Good catch! Edited my comment. It had been a while since I had looked at the results and I must have also lost the ability to read in the meantime.

(Updated) METR's data can't distinguish between trajectories (and 80% horizons are an order of magnitude off)

Alexander Barry4mo*50

Thanks for the great writeup!

I'm a statistician who does some work with METR, and I recently worked on a very similar project to create a Bayesian version of the Time Horizon model. Mine ended up being somewhat different to yours (mine deviates a bit more from the current structure of the METR model), but it's great to see other people stress-testing the modelling.

On the 80% Time Horizon results I agree that your 'marginal' approach is correct, and it is the one I also took in my model. However my 80% results ended up being a factor of 2 Edit: higher than the ...

On METR’s AI Coding RCT

Alexander Barry11mo*110

While I think it is plausible the results would have been different if the devs had had e.g. 100 hours more experience with Cursor, it is also worth noting that:

- 14/16 of the devs rated themselves as 'average' or above Cursor users at the end of the study

- The METR staff working on the project thought the devs were qualitatively reasonable Cursor users (based on screen recordings etc.)

So I think it is unlikely the devs were using Cursor in an unusually unskilled way.

The forecasters were told that only 25% of the devs had prior Cursor experience (the actua...