METR has never intended to claim that it has audited anything, or that it provides meaningful oversight or accountability, but there has been some confusion about whether METR is an auditor or plans to become one. To clarify: 1. METR’s top priority is to develop the science...
This is METR’s collection of resources for evaluating potentially dangerous autonomous capabilities of frontier models. The resources include a task suite, software tooling, and guidelines for ensuring an accurate measurement of model capability. Building on those, we’ve written an example evaluation protocol. While intended as a “beta”...
This is a quick update that METR (formerly ARC Evals) is recruiting for four positions. I encourage you to err on the side of applying to positions that interest you even if you’re unsure about your fit! We’re able to sponsor US visas for all the roles below except Research...
Update 3/14/2024: This post is out of date. For current information on the task bounty, see our Task Development Guide. Summary: METR (formerly ARC Evals) is looking for (1) ideas, (2) detailed specifications, and (3) well-tested implementations of tasks that measure the performance of autonomous LLM agents. Quick description of key...
Update: We are no longer accepting gnarly bug submissions. However, we are still accepting submissions for our Task Bounty! Tl;dr: We’re looking for hard debugging tasks for evals, paying the greater of $60/hr or $200 per example. METR (formerly ARC Evals) is interested in producing hard debugging tasks for models to attempt...
Note: This is not a personal post. I am sharing on behalf of the ARC Evals team. Potential risks of publication and our response: This document expands on an appendix to ARC Evals’ paper, “Evaluating Language-Model Agents on Realistic Autonomous Tasks.” We published this report in order to i) increase...
We have just released our first public report. It introduces a methodology for assessing the capacity of LLM agents to acquire resources, create copies of themselves, and adapt to novel challenges they encounter in the wild. Background: ARC Evals develops methods for evaluating the safety of large language...