I work in private equity advisory and I am moving into something new that sits at the intersection of investment analysis and AI evaluation. I am posting here because I have run into a set of methodology questions that I suspect this community has thought about more carefully than the investment world has.
The problem
Enterprise software vendors are making AI capability claims at scale. "Our AI identifies non-standard contract clauses faster and more accurately than any alternative." "Our model extracts structured data from unstructured documents with 94% accuracy." These claims appear in pitch decks, marketing materials, and procurement conversations. The people receiving them (investors, buyers, procurement teams) have no reliable way to verify them.
The obvious answer is: run evals. The evaluation science for frontier models is increasingly mature. But there is a gap between that infrastructure and the practical question I keep trying to answer, which is not "how capable is GPT-4o" but "does this specific software product actually do what it claims, above and beyond what you would get by prompting a frontier model yourself?"
That second question turns out to be surprisingly hard to operationalise.
Where I have gotten to
The approach I am working with compares a product's output against a frontier baseline on the same task, using the same inputs and a verified ground truth dataset. Simple in principle. The problems emerge in execution.
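For concreteness, here is a minimal sketch of the paired setup I have in mind, assuming verified ground truth and the two systems wrapped as callables. `call_product` and `call_baseline` are hypothetical stand-ins for whatever access you actually have, and the exact-match scorer only makes sense for the extractive case I get to below.

```python
# Minimal paired-comparison sketch: run the vendor product and a frontier
# baseline on the same inputs and score both against verified ground truth.
# call_product / call_baseline are hypothetical wrappers around whatever
# API, export, or UI automation is actually available.

def exact_match(prediction: str, truth: str) -> bool:
    """Extractive scoring: normalise whitespace and case, then compare."""
    return " ".join(prediction.lower().split()) == " ".join(truth.lower().split())

def run_paired_eval(cases, call_product, call_baseline):
    """cases: list of {"id", "document", "task", "truth"} dicts."""
    rows = []
    for case in cases:
        product_out = call_product(case["document"], case["task"])
        baseline_out = call_baseline(case["document"], case["task"])
        rows.append({
            "id": case["id"],
            "product_correct": exact_match(product_out, case["truth"]),
            "baseline_correct": exact_match(baseline_out, case["truth"]),
        })
    product_acc = sum(r["product_correct"] for r in rows) / len(rows)
    baseline_acc = sum(r["baseline_correct"] for r in rows) / len(rows)
    return rows, product_acc - baseline_acc  # the "gap score"

# Usage (hypothetical):
# rows, gap = run_paired_eval(cases, call_product, call_baseline)
```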
The distinction I keep running into is between what I am calling extractive and interpretive tasks. Extractive tasks — find this clause, extract this defined term — have objectively verifiable ground truth. Interpretive tasks — is this clause non-standard, does this deviation represent material risk — require domain expert validation before the benchmark means anything. LLM-generated ground truth on interpretive tasks produces a number that looks precise but may just be measuring agreement with a model's reasoning style rather than actual correctness. This feels like a known problem in evaluation science but I have not found a clean treatment of it outside the context of frontier model evaluation.
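The only sanity check I have found for that so far is unglamorous: have domain experts independently label a subsample and measure chance-corrected agreement with the LLM-generated labels before trusting the larger set. A sketch, assuming binary labels; the kappa implementation here is my own hand-rolled version, not from any particular library.

```python
# Cohen's kappa between LLM-generated labels and expert labels on a subsample.
# Low kappa is a warning that the LLM "ground truth" on interpretive tasks is
# not measuring what the experts would call correct.
from collections import Counter

def cohens_kappa(llm_labels, expert_labels):
    assert len(llm_labels) == len(expert_labels)
    n = len(llm_labels)
    observed = sum(a == b for a, b in zip(llm_labels, expert_labels)) / n
    llm_dist = Counter(llm_labels)
    expert_dist = Counter(expert_labels)
    categories = set(llm_dist) | set(expert_dist)
    expected = sum((llm_dist[c] / n) * (expert_dist[c] / n) for c in categories)
    if expected == 1:  # degenerate case: everyone gave the same single label
        return 1.0
    return (observed - expected) / (1 - expected)

# e.g. labels are "standard" / "non-standard" verdicts on the same clauses:
# kappa = cohens_kappa(llm_labels, expert_labels)
```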
Where I am less sure
Several things I am actively uncertain about and would value input on:
Whether extractive versus interpretive is the right frame, or whether the evaluation community uses a better taxonomy I am not aware of.
How to handle baseline prompt design: a poorly prompted frontier model makes the product look better than it is, but there is no obvious standard for what "fair" means when the product has a proprietary system prompt you cannot see or replicate.
Whether datasets of 100 to 200 cases are sufficient for the statistical claims I want to make, or whether gap scores at this sample size are largely within measurement noise (see the bootstrap sketch after this list).
How the community thinks about validity in LLM-as-judge scoring, specifically whether there is a principled approach to detecting when a judge model is scoring on stylistic similarity to its own outputs rather than on task correctness (the blinding sketch after this list is the best I have come up with).
Also, how much of this can be done outside-in, without deep product access, versus where a sandbox environment is genuinely needed.
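On the sample-size question, the rough approach I have been sketching is a paired bootstrap over cases, so that a headline gap at n of around 150 comes with an uncertainty band rather than a bare point estimate. Illustrative only; `rows` is the per-case output from the harness sketch above.

```python
# Paired bootstrap CI for the product-vs-baseline accuracy gap.
# Each row records whether the product and the baseline were correct on the
# SAME case, so the resampling preserves the pairing.
import random

def bootstrap_gap_ci(rows, n_resamples=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_resamples):
        sample = [rng.choice(rows) for _ in rows]  # resample cases with replacement
        gap = (
            sum(r["product_correct"] for r in sample)
            - sum(r["baseline_correct"] for r in sample)
        ) / len(sample)
        gaps.append(gap)
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_resamples)]
    hi = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# If the 95% interval at n = 150 spans zero, the measured gap is plausibly noise.
```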
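On the judge question, the crudest check I can think of is to blind the judge to the source and randomise the order within each pair, then see whether its preferences track expert verdicts or simply track which output reads like the judge's own model family. A sketch of just the blinding step, with hypothetical field names.

```python
# Source-blinded, order-randomised pairwise judging. If the judge keeps
# preferring outputs from its own model family even where expert labels say
# the other output is more correct, that is evidence it is scoring style.
import random

def blinded_pairs(examples, seed=0):
    """examples: list of {"case_id", "product_out", "baseline_out"} dicts."""
    rng = random.Random(seed)
    pairs = []
    for ex in examples:
        a, b = ex["product_out"], ex["baseline_out"]
        swapped = rng.random() < 0.5
        if swapped:
            a, b = b, a
        pairs.append({
            "case_id": ex["case_id"],
            "response_a": a,
            "response_b": b,
            "a_is_product": not swapped,  # kept aside for unblinding later
        })
    return pairs

# Feed response_a / response_b to the judge with no source information,
# then unblind and compare judge preferences against expert verdicts.
```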
Why I am posting
I have a practical stake in answering these questions: the context is investment and procurement decisions, where the cost of a wrong verdict is real. But the methodology problems feel genuinely underexplored as applied questions, quite apart from the commercial context.
If anyone has worked on ground truth construction for interpretive tasks, baseline prompt design, small-n benchmark validity, or LLM-as-judge reliability in a production setting, I would genuinely value the conversation. Happy to share more of what I am working through if useful.