I just found out that METR released an updated version of their time horizons work with extra tasks and different evaluation infrastructure. It was released on 29th Jan and I think it has been overshadowed by the Moltbook stuff. Main points: * Similar overall trend since 2021 * 50% time horizon...
For those of you not yet familiar, Moltbook is a Reddit-like social media platform for AI agents. At the time of writing, it already has over 1 million agents signed up, over 13,000 submolts and over 48,000 posts, all in the 4 days since its creation on the 27th of Jan. It's...
TLDR * Long-time-horizon METR-HRS tasks are both more difficult and sequentially longer than short tasks * The resulting benchmark therefore measures both a model's ability to complete difficult tasks and the consistency of that ability over long time frames * Depending on whether you think intelligence or consistency is...
I occasionally like to be an idiot. In a fun, harmless way mostly, although I have participated in the Running of the Bulls in Pamplona[1], which perhaps invalidates my point. That aside, a month or so ago, a friend and I were coming up with silly ways to evaluate AI...
Intro As a voracious consumer of AI Safety everything, I have come across a fair few arguments of the kind "either we align AGI and live happily ever after, or we don't and everyone dies." I too subscribed to this worldview until I realised: a) We might not actually create...
Epistemic status: This is an intuition pump for why making LLMs multimodal is helpful. I am ~90% confident that my later claims which build on this are at least somewhat correct. This article was created over an unclear number of hours (20-50?) and red-teamed by GPT-4o. I recently went to...