Introduction
The evaluation of artificial intelligence systems has undergone a transformation over the past few years. Task complexity has grown exponentially - as documented by analyses such as METR's work on time horizons across benchmarks - revealing that AI capabilities are expanding far faster than many anticipated. In 2020, frontier models could handle tasks requiring a few seconds of computation. By 2025, they were tackling problems requiring hours of sequential reasoning. This acceleration has profound implications for how we evaluate AI systems. What began as static tests of language understanding has evolved into complex assessments of multi-step reasoning, tool use, and autonomous decision-making. This survey traces the chronological development of AI agent benchmarks from 2018 to 2025, examining how evaluation methodologies have adapted to measure increasingly sophisticated capabilities.
The benchmarks covered here represent influential examples from broader categories of evaluation. For instance, TheAgentCompany emphasizes business tool workflows, while OSWorld evaluates general operating system interactions. This survey focuses on representative benchmarks that have shaped the field's understanding of agent capabilities.
The Foundation Era (2018-2021)
GLUE: Establishing the Baseline (2018)
The General Language Understanding Evaluation (GLUE) benchmark [1] emerged in 2018 as one of the first comprehensive attempts to measure natural language understanding across multiple tasks. GLUE aggregated nine diverse tasks including sentiment analysis, textual entailment, and semantic similarity, providing a standardized evaluation framework that would influence benchmark design for years to come.
MMLU: Measuring Breadth of Knowledge (2020)
Measuring Massive Multitask Language Understanding (MMLU) [2] comprised 15,908 multiple-choice questions spanning 57 subjects from elementary mathematics to professional law and medicine. The benchmark was explicitly designed to test knowledge acquired during pretraining by evaluating models in zero-shot and few-shot settings.
MMLU revealed that while models demonstrated an impressive breadth of knowledge, they exhibited highly variable performance across domains. GPT-3 achieved 43.9% accuracy on average but performed near-randomly on subjects like morality and law.
HumanEval: Code as a Verifiable Domain (2021)
HumanEval [3] marked a shift toward domains with objective verification. The benchmark consisted of 164 hand-written programming problems, each with a function signature, docstring, body, and multiple unit tests. Models generated code completions that were then executed against test cases—pass or fail, with no ambiguity.
Example problems of increasing difficulty in HumanEval.
HumanEval's key innovation was the pass@k metric, which measured the probability that at least one of k generated samples passed all unit tests. This metric acknowledged that code generation is inherently stochastic and that practical utility depends on generating at least one correct solution. The benchmark demonstrated that verification asymmetry - easy to verify, hard to generate - could enable reliable automated evaluation without human judges.
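Because each problem's unit tests yield a binary outcome, pass@k can be estimated without bias from n sampled completions per problem. The snippet below implements the estimator given in the HumanEval paper; the sample counts are illustrative.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    1 - C(n - c, k) / C(n, k), computed in a numerically stable way.
    n: samples drawn for a problem, c: samples passing all unit tests."""
    if n - c < k:
        return 1.0  # too few failures for a size-k sample to miss every passing solution
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers: 200 samples per problem, 37 passed
print(pass_at_k(n=200, c=37, k=1))   # 0.185 (equals c/n for k=1)
print(pass_at_k(n=200, c=37, k=10))  # ~0.88
```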
The Conversational Turn: Human Preference and Multi-Turn Dialogue (2023-2024)
Chatbot Arena: Crowdsourced Human Preference (May 2023)
Chatbot Arena [4], launched by LMSYS, pioneered large-scale crowdsourced evaluation through pairwise comparisons. Users interacted with two anonymous models simultaneously, voting for which response they preferred. Elo ratings computed from these "battles" provided a single scalar measure of model quality based on human preference.
By January 2025, Chatbot Arena had collected over 6 million votes, establishing itself as the gold standard for human preference evaluation. The platform's strength lay in its scale and diversity: real users asking genuine questions across domains, with no artificial constraints. However, this strength also introduced challenges; user preferences could reflect superficial qualities like verbosity or formatting rather than correctness or helpfulness.
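As a rough sketch of how pairwise votes become a scalar rating, the snippet below applies a standard online Elo update per battle. (The leaderboard has since moved to a Bradley-Terry style statistical fit, but the underlying idea - turning pairwise preferences into ratings - is the same; the K-factor and starting ratings here are illustrative.)

```python
def elo_update(r_a: float, r_b: float, outcome: str, k: float = 32.0):
    """One Elo update from a single anonymous battle.
    outcome: 'a' if model A's response won, 'b' if B won, 'tie' otherwise."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[outcome]
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Illustrative: two models start at 1000; model A wins the first battle
ratings = {"model_a": 1000.0, "model_b": 1000.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], outcome="a")
print(ratings)  # {'model_a': 1016.0, 'model_b': 984.0}
```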
MT-Bench: Structured Multi-Turn Evaluation (June 2023)
MT-Bench [5] offered a more controlled alternative to open-ended crowdsourcing. MT-Bench consisted of 80 carefully curated multi-turn questions across eight categories: writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities. Each conversation involved two turns, with the second turn designed to test the model's ability to maintain context and handle follow-up instructions.
MT-Bench introduced LLM-as-a-Judge evaluation, where GPT-4 scored responses on a 10-point scale. This approach enabled rapid, scalable evaluation without human annotators, though it introduced new challenges around judge bias and reliability. MT-Bench scores correlated strongly (0.93) with Chatbot Arena Elo ratings, validating the LLM-as-a-Judge paradigm while highlighting the need for careful prompt engineering and bias mitigation.
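A minimal single-answer grading loop in this style might look like the following sketch. It assumes the OpenAI Python SDK (v1+) and an API key in the environment; the judge model name and prompt wording are placeholders rather than MT-Bench's exact template.

```python
import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are an impartial judge. Rate the assistant's answer to the user's "
    "question on a scale of 1 to 10 for helpfulness, relevance, accuracy, "
    'and depth. End your reply with "Rating: [[score]]".\n\n'
    "Question: {question}\n\nAnswer: {answer}"
)

def judge_score(question: str, answer: str, judge_model: str = "gpt-4o") -> int:
    """Ask the judge model for a 1-10 score and parse it from the reply."""
    reply = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    ).choices[0].message.content
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", reply)
    return int(float(match.group(1))) if match else -1  # -1 marks an unparseable verdict
```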
AlpacaEval: Length-Controlled Win Rates (2023)
AlpacaEval [6] addressed a flaw in LLM-as-a-Judge evaluation: length bias. Early versions of AlpacaEval revealed that judges disproportionately favored longer responses, even when conciseness was preferable. AlpacaEval 2.0 introduced length-controlled win rates, which statistically adjust the judge's preferences to estimate what the win rate would be if the model's outputs were as long as the baseline's.
The benchmark consisted of 805 diverse instructions, with model responses compared against a baseline (typically GPT-4 Turbo or Claude). AlpacaEval's win rate metric - the percentage of times a model's response was preferred over the baseline - provided an intuitive measure of relative quality.
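The length-control idea can be illustrated with a small sketch: fit a preference model that includes a length-difference term, then report the win rate predicted with that term zeroed out. This is a simplification for intuition only; AlpacaEval 2.0's actual estimator is a regression with additional terms, documented in [6].

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def length_controlled_win_rate(prefs: np.ndarray, len_diff: np.ndarray) -> float:
    """prefs: 1 if the judge preferred the model over the baseline, else 0.
    len_diff: model output length minus baseline output length, per instruction."""
    X = np.column_stack([np.ones_like(len_diff, dtype=float), len_diff])
    glm = LogisticRegression(fit_intercept=False).fit(X, prefs)
    # Counterfactual: the win rate if outputs matched the baseline's length,
    # i.e. with the length-difference term set to zero.
    X_equal_length = np.column_stack(
        [np.ones(len(len_diff)), np.zeros(len(len_diff))])
    return float(glm.predict_proba(X_equal_length)[:, 1].mean())
```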
MT-Bench-101: Fine-Grained Capability Taxonomy (February 2024)
MT-Bench-101 [7] extended MT-Bench's multi-turn framework with a comprehensive three-tier taxonomy of conversational abilities. The benchmark organized 1,388 dialogues (4,208 turns) into three top-level categories: Perceptivity (understanding user intent), Adaptability (adjusting responses based on context), and Interactivity (engaging in natural dialogue flow).
MT-Bench-101's three-tier hierarchical taxonomy breaking down conversational AI capabilities into 13 distinct tasks across Perceptivity, Adaptability, and Interactivity dimensions.
MT-Bench-101's key contribution was its fine-grained analysis of conversational capabilities. While GPT-4 achieved strong overall performance, the benchmark revealed persistent weaknesses in adaptability (adjusting tone or style) and interactivity (maintaining coherent multi-turn exchanges). These findings suggested that conversational competence required more than strong language modeling; it demanded explicit modeling of user state, context tracking, and strategic response planning.
The Agent Era: From Knowledge to Action (2023-Present)
AgentBench: Multi-Environment Evaluation (August 2023)
AgentBench [8] marked the beginning of the agent era with its evaluation across eight diverse environments: operating system commands, database queries, knowledge graphs, digital card games, lateral thinking puzzles, household tasks, web shopping, and web browsing. Released in August 2023, AgentBench tested whether models could translate language understanding into goal-directed action.
AgentBench revealed a capability gap: while GPT-4 dominated most environments, even the strongest models struggled with tasks requiring extended reasoning chains or precise tool use.
WebArena: Realistic Web Interaction (July 2023)
WebArena [9] introduced a paradigm shift toward realistic, end-to-end evaluation. The benchmark comprised 812 tasks across real-world websites (shopping, forums, collaborative software, content management) with fully functional backends. Agents navigated actual web interfaces, filled forms, and accomplished multi-step goals—all automatically verified through execution traces and final state checks.
WebArena's realistic web environments including e-commerce, forums, and collaborative software, with tasks requiring multi-step navigation and form interactions.
WebArena's initial results were sobering: the best models achieved only a 14% success rate in 2023. However, this low ceiling proved valuable - it provided headroom for measuring progress. Success rates have since climbed to over 70%, demonstrating rapid improvement while still leaving substantial room for advancement. WebArena established that realistic, open-ended environments could drive progress more effectively than saturated benchmarks.
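The automatic verification behind these numbers typically takes the form of a small programmatic evaluator per task that inspects where the agent ended up and what the final page or backend state contains. The sketch below shows the shape of such a check; the field names and example task are illustrative, not WebArena's actual schema.

```python
from dataclasses import dataclass

@dataclass
class FinalState:
    url: str        # page the agent finished on
    page_text: str  # rendered text of that page

@dataclass
class TaskCheck:
    url_prefix: str          # where a successful agent should end up
    must_include: list[str]  # strings that must appear in the final page

def task_succeeded(check: TaskCheck, state: FinalState) -> bool:
    """Outcome-based check: ignore the agent's trajectory, verify final state."""
    return (state.url.startswith(check.url_prefix)
            and all(s in state.page_text for s in check.must_include))

# Illustrative task: "add a 16GB USB drive to the shopping cart"
check = TaskCheck(url_prefix="http://shop.example.com/cart",
                  must_include=["16GB USB", "Qty: 1"])
state = FinalState(url="http://shop.example.com/cart?item=42",
                   page_text="Your cart: 16GB USB Flash Drive, Qty: 1")
print(task_succeeded(check, state))  # True
```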
GAIA: General AI Assistants (November 2023)
GAIA [10] (General AI Assistants) pushed evaluation toward questions requiring genuine reasoning, tool use, and multi-step planning. Released in November 2023, GAIA consisted of 466 questions spanning three difficulty levels, each requiring real-world knowledge, web search, code execution, or file manipulation to answer.
GAIA's defining characteristic was its validation set design: questions were crafted to have unambiguous answers (numbers, names, dates) that could be automatically verified, yet required complex reasoning to derive. Level 1 questions tested basic tool use, Level 2 required multi-step reasoning, and Level 3 demanded sophisticated planning and error recovery. As of now, the best models achieve 90% success across all levels.
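Because the answers are short, unambiguous strings or numbers, grading reduces to normalization plus comparison. The sketch below conveys the general idea; it is a simplification, not GAIA's official scorer.

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, trim, and strip separators/symbols models often add."""
    answer = answer.strip().lower()
    answer = re.sub(r"[,$%]", "", answer)
    return re.sub(r"\s+", " ", answer)

def is_correct(prediction: str, gold: str) -> bool:
    p, g = normalize(prediction), normalize(gold)
    try:
        return abs(float(p) - float(g)) < 1e-6  # numeric answers compare as numbers
    except ValueError:
        return p == g                            # everything else: exact string match

print(is_correct(" 17,452 ", "17452"))           # True
print(is_correct("Marie Curie", "marie curie"))  # True
```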
SWE-bench: Real-World Software Engineering (October 2023)
SWE-bench [11] elevated evaluation to real-world software engineering by collecting 2,294 GitHub issues from popular Python repositories. Each task required understanding a bug report or feature request, navigating a large codebase, and generating a patch that passed existing unit tests—all without human intervention.
SWE-bench tasks sourced from real-world Python repositories.
SWE-bench's difficulty stemmed from its authenticity: these were not simplified textbook problems but genuine issues that human developers had struggled with. Success required code comprehension, debugging skills, and the ability to make targeted changes without breaking existing functionality. As of January 2026, state-of-the-art systems achieve 74% resolution rates, underscoring the rapid improvement in autonomous software engineering.
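Evaluation comes down to applying the generated patch and re-running the repository's own tests. The sketch below mirrors that flow under simplifying assumptions (a local checkout at the issue's base commit, pytest-style test identifiers); the official harness performs the equivalent steps inside per-task containers using the dataset's FAIL_TO_PASS and PASS_TO_PASS test lists.

```python
import subprocess

def _run(cmd: list[str], cwd: str) -> bool:
    """Run a command in the repo and report whether it exited successfully."""
    return subprocess.run(cmd, cwd=cwd, capture_output=True).returncode == 0

def evaluate_patch(repo_dir: str, patch_path: str,
                   fail_to_pass: list[str], pass_to_pass: list[str]) -> bool:
    """Apply the model's patch, then require the issue's failing tests to pass
    and the previously passing tests to keep passing."""
    if not _run(["git", "apply", patch_path], cwd=repo_dir):
        return False  # patch is malformed or does not apply cleanly
    return _run(["python", "-m", "pytest", *fail_to_pass, *pass_to_pass],
                cwd=repo_dir)
```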
Tau-bench: Agent-User Collaboration (June 2024)
Tau-bench (τ-bench) [12] and its follow-up Tau2-bench introduced a novel evaluation paradigm: dual-control environments where both the agent and a simulated user could execute actions. Across its versions, the benchmark spanned three customer-service domains (retail, airline, and telecom), with tasks requiring the agent to assist users while adhering to company policies and maintaining conversation flow.
Tau-bench's dual-control environment architecture where both agent and simulated user can take actions, requiring coordination and policy compliance.
Tau-bench's key innovation was the pass^k metric, which measured consistency across multiple rollouts. An agent might succeed once through luck, but consistent success required robust understanding of policies, user intent, and appropriate action sequencing. Performance varied dramatically: GPT-4o achieved ~45% in the telecom domain while GPT-5 reached 96%, suggesting that agent-user collaboration remained a frontier capability where model improvements yielded substantial gains.
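Whereas pass@k asks whether at least one of k attempts succeeds, pass^k asks whether k independent attempts all succeed. It can be estimated per task from n rollouts with c successes as C(c, k) / C(n, k); a minimal sketch:

```python
from math import comb

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate of the probability that k fresh rollouts of a task all
    succeed, given c successes observed in n rollouts."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# 8 rollouts, 6 succeeded: 75% per-trial success, but only ~21% chance
# that 4 consecutive attempts would all succeed
print(pass_hat_k(n=8, c=6, k=4))  # ~0.214
```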
ML-Dev-Bench: Machine Learning Development (February 2025)
ML-Dev-Bench [13] evaluated agents on 30 real-world machine learning development tasks across six categories: dataset handling, data preprocessing, model training, performance evaluation, hyperparameter tuning, and performance improvement. Each task was derived from actual ML workflows, requiring agents to write code, execute experiments, and interpret results.
ML-Dev-Bench task categories and distribution, covering the spectrum of ML development workflows from data handling to performance optimization.
At the time of release, ML-Dev-Bench showed that while agents handled structured tasks (data loading, preprocessing, basic training) reasonably well, they achieved 0% success on open-ended performance improvement tasks. These tasks required iterative experimentation, hypothesis formation, and creative problem-solving. The benchmark highlighted that ML development demands more than code generation; it requires scientific reasoning and experimental design.
Related work includes MLE-Bench [14], which focuses on Kaggle competition-style tasks with 75 competitions spanning data science challenges. While ML-Dev-Bench emphasizes standard ML workflows, MLE-Bench tests competitive problem-solving under time constraints, providing complementary perspectives on ML automation.
TheAgentCompany: Simulated Workplace (December 2024)
TheAgentCompany [15] created a comprehensive simulated workplace with 175 tasks spanning software engineering, data analysis, project management, and business operations. It provided agents with access to realistic business tools (email, file systems, task trackers, version control) and measured their ability to accomplish work objectives.
TheAgentCompany's simulated workplace environment with integrated business tools and realistic task workflows.
TheAgentCompany's tasks ranged from simple (sending an email) to complex (analyzing customer data and generating a report with recommendations). The best agents achieved approximately 40% task completion, with performance degrading sharply as task complexity increased. The benchmark demonstrated that workplace automation required not just tool use but understanding of business context, prioritization, and appropriate communication.
A related benchmark, OSWorld [16], focuses on general operating system interactions across Windows, macOS, and Ubuntu with 369 tasks. While OSWorld emphasizes low-level OS operations, TheAgentCompany tests higher-level business workflows, providing complementary views of agent capabilities in digital environments.
Terminal-Bench 2.0 (2025)
Terminal-Bench 2.0 [18] evaluated agents on 89 command-line interface tasks requiring file system navigation, text processing, system administration, and tool orchestration. The benchmark tested whether agents could effectively use the Unix command-line: a fundamental skill for software development and system operations.
Terminal-Bench 2.0's tasks ranged from basic (listing files, searching text) to advanced (writing shell scripts, managing processes, configuring services). As of January 2026, the best agents achieved around 60% success rates, with failures often stemming from incorrect command syntax, misunderstanding of tool options, or inability to chain commands effectively.
GDPval: Economically Valuable Work (October 2025)
GDPval [17] represented a shift in evaluation philosophy: rather than measuring what models can do on artificial tasks, it directly assessed performance on economically valuable work that professionals are paid to perform. Released by OpenAI, GDPval comprised 1,320 tasks (with 220 open-sourced) spanning 44 occupations across the top 9 sectors contributing to U.S. GDP.
GDPval's coverage of 44 occupations across 9 economic sectors, from manufacturing engineers to registered nurses to financial analysts.
GDPval's tasks were constructed from actual work products created by industry professionals with an average of 14 years of experience. Each task consisted of a request (often with reference files) and a deliverable work product. Tasks required manipulating diverse formats including CAD files, photos, video, audio, social media posts, diagrams, slide decks, spreadsheets, and customer support conversations. The average task required 7 hours of expert work to complete, with some spanning multiple weeks.
Evaluation employed pairwise expert comparisons: professional experts in each occupation ranked model outputs against human expert baselines. This methodology captured subjective quality factors - structure, style, format, aesthetics, relevance - that matter in real-world work but are difficult to measure with automated metrics. GDPval's primary metric was win rate: the percentage of times a model's output was preferred over the expert baseline.
Results revealed that frontier models were approaching parity with industry experts. Claude Opus 4.1 achieved a 47.6% win rate, nearly matching the 50% threshold that would indicate equivalence with human professionals. Performance improved roughly linearly over time, with earlier models like GPT-4o achieving only 12.4% win rates. The benchmark also demonstrated that increased reasoning effort, task context, and scaffolding all improved model performance.
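The headline metric itself is simple to compute from blinded pairwise gradings. The aggregator below counts ties as half a win - a common convention assumed here for illustration, not necessarily the exact accounting used in the GDPval report.

```python
from collections import Counter

def win_rate(judgments: list[str], half_credit_for_ties: bool = True) -> float:
    """judgments: one of 'model', 'expert', or 'tie' per graded task."""
    counts = Counter(judgments)
    wins = counts["model"] + (0.5 * counts["tie"] if half_credit_for_ties else 0.0)
    return wins / len(judgments)

print(win_rate(["model", "expert", "tie", "model", "expert"]))  # 0.5
```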
Understanding Progress and Limitations
The Asymmetry of Verification
One way to understand which tasks agents will make the fastest progress on is the asymmetry of verification [19]: many tasks are far easier to verify than to generate. Code can be tested against unit tests in seconds, but writing correct code may require hours. Mathematical proofs can be checked mechanically, but discovering them demands creativity and insight.
This asymmetry explains why benchmarks like HumanEval and SWE-bench enable reliable automated evaluation - verification is cheap and objective. However, it also reveals a constraint: progress is fundamentally limited by how quickly we can verify solutions. When verification is slow, we cannot generate enough training signal, and improvement comes harder.
The implication is that domains with fast, objective verification (code, math, formal reasoning) will see rapid progress, while domains requiring human judgment (creative writing, strategic planning, subjective quality assessment) will lag behind.
The Future: Learning from Human Demonstrations
Current agent benchmarks primarily evaluate zero-shot or few-shot capabilities - can the agent succeed with minimal or no task-specific training? This paradigm reflects the reality that deploying agents at scale requires generalization beyond training data. However, it may not reflect how agents will ultimately be developed and deployed.
The future of agent evaluation likely involves learning from human demonstrations of the multi-platform, communication-heavy, judgment-requiring work that fills most white-collar jobs. Here's what that might entail:
Video-based learning: Rather than screenshots, we would train on continuous video of humans performing tasks, with natural language narration. This is how you actually onboard employees, not through disconnected screenshots. Video captures temporal dynamics, cursor movements, hesitations, and corrections - all valuable signals for learning robust policies.
Uncertainty calibration: Agents need to know when they don't know. When should an agent ask a clarifying question versus proceeding? This requires much better confidence estimation. Current agents can fail silently or hallucinate rather than acknowledging uncertainty and requesting help.
Efficient learning: Once a human shows the agent a task once or twice, it shouldn't need the full reasoning chain every time. Humans develop fast, intuitive patterns. Agents should too. This suggests a two-system architecture: slow, deliberate reasoning for novel situations, and fast, cached responses for familiar patterns.
Vision-only operation: There's an interesting analogy to self-driving cars here. Tesla bet on vision (cameras alone), while Waymo relied on LIDAR and specialized sensors. If agents can work with just visual input - no DOM access, no special APIs - then these agents could live anywhere: smart glasses, your phone, a laptop without special permissions. That's compelling from a deployment perspective, though it places greater demands on visual understanding and reasoning.
Conclusion
The evolution of AI agent evaluation from GLUE to GDPval reflects a fundamental shift in what we ask of AI systems. Early benchmarks measured knowledge and understanding; modern benchmarks measure action, planning, and goal achievement. This progression mirrors the field's progress: moving from systems that know to systems that do.
The benchmarks surveyed here provide the measurement infrastructure necessary for progress. They reveal capabilities and limitations, guide research priorities, and enable objective comparison. As agents grow more capable, evaluation must evolve in tandem - developing benchmarks that capture the full complexity of real-world work while remaining tractable to automate.
References
[1] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., & Bowman, S. R. (2018). GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. arXiv preprint arXiv:1804.07461. https://arxiv.org/abs/1804.07461
[2] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., & Steinhardt, J. (2020). Measuring Massive Multitask Language Understanding. arXiv preprint arXiv:2009.03300. https://arxiv.org/abs/2009.03300
[3] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., ... & Zaremba, W. (2021). Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374. https://arxiv.org/abs/2107.03374
[4] Chiang, W. L., Zheng, L., Sheng, Y., Angelopoulos, A. N., Li, T., Li, D., ... & Stoica, I. (2024). Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv preprint arXiv:2403.04132. https://arxiv.org/abs/2403.04132
[5] Zheng, L., Chiang, W. L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., ... & Stoica, I. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685. https://arxiv.org/abs/2306.05685
[6] Dubois, Y., Galambosi, B., Liang, P., & Hashimoto, T. B. (2024). Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators. arXiv preprint arXiv:2404.04475. https://arxiv.org/abs/2404.04475
[7] Zheng, L., Chiang, W. L., Sheng, Y., Li, T., Zhuang, S., Wu, Z., ... & Stoica, I. (2024). MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues. arXiv preprint arXiv:2402.14762. https://arxiv.org/abs/2402.14762
[8] Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., ... & Zhang, D. (2023). AgentBench: Evaluating LLMs as Agents. arXiv preprint arXiv:2308.03688. https://arxiv.org/abs/2308.03688
[9] Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., ... & Neubig, G. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv preprint arXiv:2307.13854. https://arxiv.org/abs/2307.13854
[10] Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R., ... & Scialom, T. (2023). GAIA: A Benchmark for General AI Assistants. arXiv preprint arXiv:2311.12983. https://arxiv.org/abs/2311.12983
[11] Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., & Narasimhan, K. (2023). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv preprint arXiv:2310.06770. https://arxiv.org/abs/2310.06770
[12] Yao, S., Zhao, J., Yu, D., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2024). Tau-bench: A Benchmark for Evaluating LLM-Based Agents in Interactive Environments. arXiv preprint arXiv:2406.12045. https://arxiv.org/abs/2406.12045
[13] Padigela, H., Shah, C., & Juyal, D. (2025). ML-Dev-Bench: Evaluating Large Language Models for Machine Learning Development. arXiv preprint arXiv:2502.00964. https://arxiv.org/abs/2502.00964
[14] Chan, J., Shen, J., Chen, L., Huang, J., & Liang, P. (2024). MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. arXiv preprint arXiv:2410.07095. https://arxiv.org/abs/2410.07095
[15] Liu, F., Cheng, S., Zhuge, M., Ruan, B., Gao, J., Zhou, Q., ... & Lin, B. Y. (2024). TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks. arXiv preprint arXiv:2412.14161. https://arxiv.org/abs/2412.14161
[16] Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Luo, R., ... & Gao, J. (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. arXiv preprint arXiv:2404.07972. https://arxiv.org/abs/2404.07972
[17] Patwardhan, T., Dias, R., Proehl, E., Kim, G., Wang, M., Watkins, O., ... & Tworek, J. (2025). GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks. arXiv preprint arXiv:2510.04374. https://arxiv.org/abs/2510.04374
[18] Terminal-Bench 2.0. (2025). https://www.tbench.ai/
[19] Wei, J. (2024). The Asymmetry of Verification and Verifier's Law. https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law