I assume that both were inspired by https://arxiv.org/abs/2108.12099 and are related via that shared ancestor.
I think there's a missing link on https://alignmentproject.aisi.gov.uk/how-to-apply :
"The GFA for DSIT/AISI funded projects is standard and not subject to negotiation. The GFA can be accessed here: link to GFA attachment."
Agree that it would be better not to have them up as readily downloadable plaintext, and it might even be worth going a step further: encrypting the gzip or zip file and making the password readily available in the repo's README. This is what David Rein did with GPQA and what we did with FindTheFlaws. It might be overkill, but if I were working for a frontier lab building scrapers to pull in as much data from the web as possible, I'd certainly have those scrapers unzip any unencrypted gzips they came across, and I assume the labs' scrapers do the same.
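For concreteness, here's a minimal sketch of that approach (the file names and password are placeholders, and it assumes the Info-ZIP `zip` CLI is installed; GPQA's actual packaging may differ):

```python
import subprocess
import zipfile

# Placeholder name -- substitute your own password and document it in the README.
PASSWORD = "dataset-password-from-readme"

# Python's stdlib zipfile can read encrypted archives but can't write them,
# so create the archive with the Info-ZIP command-line tool instead.
subprocess.run(
    ["zip", "-P", PASSWORD, "dataset.zip", "dataset.jsonl"],
    check=True,
)

# Anyone who reads the password in the README can decrypt in two lines:
with zipfile.ZipFile("dataset.zip") as zf:
    data = zf.read("dataset.jsonl", pwd=PASSWORD.encode())
```

The legacy ZipCrypto scheme that `zip -P` uses is cryptographically weak, but that seems fine here: the goal is to deter automated ingestion, not determined attackers.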
PS to the original posters: seems like nice work! Am planning to read the full paper and ask a more substantive follow-up question when I get the chance.
Love pieces that manage to be both funny and thought-provoking. And +1 for fitting a solar storm in there. There is now better evidence of very large historical solar storms than there was at the time of David Roodman's Open Phil review in late 2014; I've been meaning to write that up, but other things have taken priority.
This is cool, although I suspect that you'd get something similar from even very simple models that aren't necessarily "modelling the world" in any deep sense, simply due to first- and second-order statistical associations between nearby place names. See e.g. https://onlinelibrary.wiley.com/doi/pdfdirect/10.1111/j.1551-6709.2008.01003.x , https://escholarship.org/uc/item/2g6976kg .
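As a toy illustration of that point (invented co-occurrence counts, not the cited papers' corpus-derived statistics), classical MDS on nothing but pairwise co-occurrence recovers a coarse map:

```python
import numpy as np
from sklearn.manifold import MDS

# Made-up toy counts: nearby cities tend to co-occur in text more often.
cities = ["Boston", "New York", "Philadelphia", "Seattle", "Portland"]
cooccur = np.array([
    [0, 9, 6, 1, 1],
    [9, 0, 8, 1, 1],
    [6, 8, 0, 1, 1],
    [1, 1, 1, 0, 9],
    [1, 1, 1, 9, 0],
], dtype=float)

# Turn co-occurrence counts into dissimilarities and embed in 2D.
dissim = 1.0 / (1.0 + cooccur)
np.fill_diagonal(dissim, 0.0)
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dissim)

for city, (x, y) in zip(cities, coords):
    print(f"{city:14s} {x:+.2f} {y:+.2f}")
# The East Coast trio and the Pacific Northwest pair separate cleanly,
# with no "world model" beyond pairwise co-occurrence counts.
```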
Leopold and Pavel were out ("fired for allegedly leaking information") in April. https://www.silicon.co.uk/e-innovation/artificial-intelligence/openai-fires-researchers-558601
Nice job! I'm working on something similar.
> Next, I might get my agent to attempt the last three tasks in the report
I wanted to clarify one thing: Are you building custom prompts for the different tasks? If so, I'd be curious to know how much effort you put into these (I'm generally curious how much of your agent's ability to complete more tasks might be due to task-specific prompting, vs. the use of WebDriverIO and other affordances of your scaffolding). If not, isn't getting the agent to attempt the last three tasks as simple as copy-pasting the task instructions from the ARC Evals task specs linked in the report, and completing the associated setup instructions?
Some colleagues and I did some follow-up on the paper in question, and I would highly endorse "probably it worked because humans and AIs have very complementary skills". Regarding their MMLU findings, appendix E of our preprint points out that participants were significantly less likely to answer correctly when engaging in more than one turn of conversation. Very short (or even zero-turn) conversations happened often enough to give reasonable error bars on the plot below (MMLU data, combined from the paper's study and the replication mentioned in appendix E):
[Plot: MMLU accuracy by number of conversation turns, with error bars]
I think this suggests that there were some questions that humans knew the answer to and the models didn't, and vice versa, and that some participants employed a strategy of deferring to the model primarily when uncertain.
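For anyone who wants to produce the same kind of summary from their own transcripts, here's a minimal sketch (the record format is a placeholder; the actual analysis in our preprint may differ):

```python
import math
from collections import defaultdict

def accuracy_by_turns(records):
    """records: iterable of (n_turns, correct) pairs, one per question.

    Returns {n_turns: (accuracy, standard_error)} -- the per-turn-count
    summary a plot like the one above is built from.
    """
    buckets = defaultdict(list)
    for n_turns, correct in records:
        buckets[n_turns].append(bool(correct))
    summary = {}
    for n_turns, outcomes in sorted(buckets.items()):
        n = len(outcomes)
        p = sum(outcomes) / n
        se = math.sqrt(p * (1 - p) / n)  # normal-approximation binomial error bar
        summary[n_turns] = (p, se)
    return summary

# e.g. accuracy_by_turns([(0, True), (1, True), (1, False), (2, False)])
# -> {0: (1.0, 0.0), 1: (0.5, 0.354), 2: (0.0, 0.0)}
```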
On the QuALITY findings, the original paper noted that
so it's not surprising that an LLM that does get to read the full story outperforms humans here. Based on looking at some of the QuALITY transcripts, I think the uplift for humans + LLMs came from humans being better at reading comprehension than 2022-era LLMs. For instance, in the first transcript I looked at, the LLM suggested one answer, the human asked for the excerpt of the story that supported its claim, and when the LLM provided it, the human noticed that the excerpt contained the relevant information but supported a different answer than the one the LLM had given.