fidgetsinner — LessWrong

Forecasting Frontier Language Model Agent Capabilities

This work was done as part of the MATS Program - Summer 2024 Cohort. Paper: link. Website (with interactive version of Figure 1): link. Executive summary. Figure 1: Low-Elicitation and High-Elicitation forecasts for LM agent performance on SWE-Bench, Cybench, and RE-Bench. Elicitation level refers to performance improvements from optimizing agent...

Feb 24, 2025 · 35

Do models know when they are being evaluated?

Interim research report from the first 4 weeks of the MATS Program Winter 2025 Cohort. The project is supervised by Marius Hobbhahn. Summary. Our goals: Develop techniques to determine whether models believe a situation to be an eval or not and which features of the environment influence that belief....

Feb 17, 2025 · 57

Current safety training techniques do not fully transfer to the agent setting

by Simon Lermen and fidgetsinner

TL;DR: We present three recent papers that share a similar finding: the safety training techniques for chat models don’t transfer well to the agents built from them. In other words, models won’t tell you how to do something harmful, but they are often willing...

Nov 3, 2024 · 162

~80 Interesting Questions about Foundation Model Agent Safety

by RohanS and fidgetsinner

Many people helped us a great deal in developing the questions and ideas in this post, including people at CHAI, MATS, various other places in Berkeley, and Aether. To all of them: Thank you very much! Any mistakes are our own. Foundation model agents - systems like AutoGPT and Devin...

Oct 28, 2024 · 48

Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities

by Axel Højmark, fidgetsinner, Arjun Panickssery, Marius Hobbhahn, and Jérémy Scheurer

Produced as part of the MATS Program Summer 2024 Cohort. The project is supervised by Marius Hobbhahn and Jérémy Scheurer. Update: See also our paper on this topic, accepted to the NeurIPS 2024 SoLaR Workshop. Introduction: To mitigate risks from future AI systems, we need to assess their capabilities accurately....

Jul 22, 2024 · 69