LESSWRONG
AI Benchmarking · Language Models (LLMs) · AI · Frontpage

CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

by Annapurna
12th Jun 2025
This is a linkpost for https://arxiv.org/pdf/2505.18878

A paper by Salesforce AI Research

Abstract

While AI agents have transformative potential in business, the absence of publicly available business data on widely used platforms hinders effective performance benchmarking. Existing benchmarks fall short in realism, data fidelity, agent-user interaction, and coverage across business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic and realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks spanning customer sales, customer service, and configure, price, and quote processes, for both Business-to-Business and Business-to-Customer scenarios. It also incorporates multi-turn interactions guided by diverse personas, along with confidentiality awareness assessments. Experiments show that leading LLM agents achieve only around a 58% single-turn success rate on CRMArena-Pro, with performance dropping significantly to about 35% in multi-turn settings. Among the business skills evaluated, Workflow Execution proves notably more tractable, with top-performing agents surpassing an 83% success rate on single-turn tasks, while other skills present greater challenges. Additionally, agents exhibit near-zero inherent confidentiality awareness; this can be improved with prompting, but often at a cost to task performance. These results underscore a significant gap between current LLM capabilities and real-world enterprise demands, highlighting the need for improved multi-turn reasoning, confidentiality adherence, and versatile skill acquisition.