A paper by Salesforce AI Research
While AI agents have transformative potential in business, the absence of publicly available business data on widely used platforms hinders effective performance benchmarking. Existing benchmarks fall short in realism, data fidelity, agent-user interaction, and coverage across business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic and realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks across customer sales, service, as well as configure, price, and quote for Business-to-Business and Businessto-Customer scenarios. It also incorporates multi-turn interactions guided by diverse personas and confidentiality awareness assessments. Experiments show leading LLM agents achieve approximately solely 58% single-turn success rate on CRMArena-Pro, with significant performance drops in multi-turn settings to 35%. Among the business skills evaluated, Workflow Execution is notably more tractable, with top-performing agents surpassing 83% success rate in single-turn tasks, while other skills present greater challenges. Additionally, agents exhibit near-zero inherent confidentiality awareness (improvable with prompting but often at a cost to task performance). These results underscore a significant gap between current LLM capabilities and real-world enterprise demands, highlighting needs for improved multi-turn reasoning, confidentiality adherence, and versatile skill acquisition.