Notes on axes of variation in third-party risk assessment

Buck

There are many different activities that could be described as "third-party risk assessment". Here are some distinctions that I’ve found helpful in thinking about the space over the last few weeks.

(Thanks Ajeya Cotra and Paul Christiano for discussions that inspired most of this.)

Throughout this, I refer to the actors as:

Developers.
Stakeholders. These are the people who want to be informed about risks. Possible stakeholders include: governments, the public, the developer's board, the developer's employees.
- The choice of stakeholder matters because one of the roles of an auditor is to review confidential info that they then do not directly disclose to stakeholders; they only tell them their conclusions. This is a more important role if the developer is more concerned about disclosing confidential information to the stakeholder.
Third parties. I don't know a better term for "independent actors who contribute in various ways to a stakeholder's understanding of risks through producing and evaluating evidence and/or arguments". Like, it's weird to call the physical security pentesting firm a "risk assessor". And AI Lab Watch isn't really an "auditor". And "evaluator" makes it sound like they run model evals.

The next step in the analysis will be to think about different objectives of third party risk assessment and think about how those interact with these axes of variation.

Axes

Fact generation vs evidence analysis

This is maybe the one that comes up most. I'll define:

A fact-generation assessment is trying to answer some narrow question to produce a fact that will later be used to evaluate risk.
- Examples:
  - Dangerous capability evaluations, e.g. METR and UK AISI's capability evals.
  - Evaluating the robustness of a safeguard, e.g. classifier robustness redteaming or the David Rein red-teaming project.
  - Pentesting to measure security.
- For many of those, a core reason why it makes sense to have a third party do them is that the process is sort of fundamentally adversarial—the final evidence will be of the form "I tried to demonstrate danger, but failed", so you're really depending on the person trying hard, and it's structurally easier to be confident that someone's not sandbagging if they're an unconflicted third party. Some cases are like monitor redteaming or pentesting where it's obviously adversarial; others, like dangerous capabilities evals, are cases where you're relying on the evaluator to do not-particularly-adversarial optimization (e.g. you need them to pick good hyperparameters for their fine-tune to elicit CBRN capabilities).
  - If our claim is "you need a third party to do the optimization every time that you argue for safety by saying an optimization attempt failed", this maybe has broader implications; I'm not sure people are applying this consistently.
- For the others that reason isn't present. For example, for a cyber capability eval where the AI company basically does no extra elicitation, it seems like it would probably be fine if the AI company just ran the evals themselves—the produced fact isn't rendered particularly suspect by the fact that it was done in house.
- There are some other reasons that it might make sense to have a third party do the eval:
  - They have expertise the AI company doesn't have in house. This totally makes sense in principle; it could totally make sense to centralize the expertise somewhere outside the company so that the AI companies can share it.
  - The eval requires sensitive info that the evaluator doesn't want to give the AI co. E.g. sensitive CBRN stuff? This makes some sense.
    - Eval scores can generally be increased or decreased by a developer who has access to the dataset, because they can train/iterate against it. But in many cases the developer can also game eval scores even without access to the dataset; e.g. they can probably mess with alignment evidence by training models to act aligned in weird situations. If an eval score won't be valid evidence without separately understanding whether the AI developer did anything that would compromise it, it's not clearly much worse for the AI company to hold the eval and run it themselves.
An evidence-analysis assessment tries to combine many forms of evidence into an overall judgement about a high-level question. The central example of such a question is "how much risk to the world does this deployment pose".
- Examples:
  - METR's review of Anthropic's sabotage risk report. (The sabotage risk report analyzes a narrower slice of risk, but it is an analysis that draws together many different facts to answer a natural question that's pretty close in the causal graph to "how much total risk is there". Unlike e.g. "how robust was this monitor", which requires a bunch of additional context to be interesting.)
  - As a narrower example, our review of the OpenAI CoT training was us synthesizing a body of evidence into an overall assessment.
  - This is what auditing is in the context of corporate accounting—they look at your books and ask you questions to assess the overall financial state of the company.
- Within evidence analysis, there's a distinction between reviewing a developer's argument and producing your own.

I think that evidence analysis and fact generation are pretty different and it's not obvious that any projects or organizations should do both. The skillsets seem plausibly very different. Like, I suspect that during the main risk period, only a pretty small subset of the facts relevant to the risk analysis should be generated by third parties (as opposed to being reported by the company). So someone doing evidence analysis will naturally be operating several steps in the "evidence tree" away from someone generating a fact. And so I'm not sure there's that much synergy from these happening as part of the same project. E.g. I don't think it would be particularly useful for the security pentesters to work at the same org that does the overall assessment.

I think it's pretty weird that these are conflated so often (e.g. in lots of the widely-cited papers about AI auditing or evaluations, like Frontier AI Auditing and Model Evaluation for Extreme Risks); they seem really different to me.

Laundering private evidence into sharable conclusions?

One function of risk analyses is to inform a stakeholder about the risks posed by a developer. One strategy is to have an auditor who accesses private information from the developer and uses it to answer a question, and then gives the answer to that question to the stakeholder without telling them the private information that led to that conclusion. The other option is to do the risk analysis just based on information that the stakeholder already knows (or is cleared to know).

In this post, I call this “evidence-laundering”. I don’t mean this with a negative connotation and will probably choose a different term in future because the negative connotation is so strong. I just wanted a distinct term that emphasizes that information is being processed from private evidence to public conclusions to reduce disclosure of commercially sensitive information.

I think I want to define this as: did the conclusion of a report rely on facts that were not stated in the report? So it's not evidence-laundering if you report a straightforward fact that your readers need to believe that you're not straightforwardly lying about because they can't personally check it. But it is evidence-laundering if you don't report the facts and instead report "based on private facts, things look ok".

Examples of the evidence-laundering strategy:

Pen-testing. It's totally normal to have a pen-tester say "we looked for vulns; I won't tell you what they are but here's how bad they were", and for the pen-tester to not tell the stakeholder various sensitive details.
David Rein's pen-test of the Anthropic monitors.
METR's review of the Anthropic sabotage risk report.
Lots of private evaluations in other parts of society. In courts, a psychiatric evaluation involves the psychiatrist talking privately to someone and then stating the conclusion publicly without fully justifying it.

Examples of the just-public-evidence strategy:

Our review of the OpenAI CoT training thing. The workflow here was that they shared the draft with us a week or so before they published it, and we went through a few rounds of me complaining that the draft didn't have some important info that I needed to be convinced of their conclusion, and they iteratively added more info to the draft until I was pretty happy. So even though there was some temporary secrecy, we were fundamentally not in a laundering role: the thing that we released was based entirely on information that any reader of our blog post could read themselves. We were under our standard NDA, so they could have told us stuff during this process that we wouldn't have been allowed to repeat.
AI Lab Watch. I think Zach sometimes emailed AI companies asking them questions that he said would affect their rating, and maybe they sometimes answered? As long as this works via him entering their answers into the public record, he's not playing the evidence-laundering role.
The METR Frontier Risk Report is almost entirely this. Their report is entirely based on info that AI companies agreed to share publicly, except that some of these facts are published without attribution (that is, METR doesn't say which company said something), so METR is playing the evidence-laundering role of verifying that an AI company really did say something without revealing who said it.
- This is an interesting case because METR collected a bunch of non-public info during this project, but the only way this entered their report was by companies agreeing to make subsets of the info public.
Evidence-laundering is a central part of what "Frontier AI Auditing" (2026) means by "auditing"; the paper writes: "Transparency alone cannot enable well-calibrated trust in the most capable (“frontier”) AI systems and the companies that build them: many safety- and security-relevant details are legitimately confidential".

The core strategic question here is definitely "when should you use evidence-laundering vs transparency".

I think that transparency is probably better all else equal? Evidence laundering should probably be reserved for cases where it's extremely defensible for the developer to not want to publicly release certain info. I am extremely sympathetic to this for pentesting or other things where there's an imminent cost to releasing the full evidence. I am less sympathetic to e.g. developers wanting to keep the capabilities of unreleased models secret.
But I think it might be pretty good if we can rely on transparency as much as possible.
In particular, I think that transparency maybe makes it easier for risk evaluations to differentially pressure AI companies to do better. I am excited for projects like AI Lab Watch that do risk evaluations that compare the AI companies. This might piss off the AI companies, so it might be hard for such projects to have an evidence-laundering relationship with them. Their ratings will be more legitimate if they have the info they need to make their assessments. AI Lab Watch has the stance "if you don't give public statements explaining what you're doing on issue X, we'll assume you maximally suck"; perhaps stakeholders would think that it's reasonable to expect developers to produce public evidence of safety and therefore the developers will publish their evidence publicly. This would allow us to have a robust mechanism for between-company pressure on safety for longer than if we tried to do all this without transparency, using evidence laundering.

A few notes:

Reporting a score against a benchmark the evaluator built but hasn't released — "DeepSeek scored 40% on our private benchmark" — isn't laundering of the developer's information (you know the claim and whom to blame), but it does ask the reader to trust, sight unseen, that the evaluator built a reasonable benchmark. The developer itself may not even have access to check it. Releasing an i.i.d. subset of the benchmark resolves most of this.
A report has to disclose how it looked. Every conclusion implicitly relies on the unstated fact that nothing else relevant turned up. So a clean report should say how it searched for other kinds of facts — including disconfirming ones — and report that there weren't any (or what it found). A report that presents its supporting facts and stops is laundering the adequacy of its own search.
Whose secret? Company vs auditor laundering. It helps to split laundering by who owns the withheld fact. Company laundering hides facts about the developer's own system — details of security systems that a pentester is attacking, or training methods that an auditor wants to know about. Auditor laundering leans on something the assessor holds and won't share: a particularly strong example of this is hazardous knowledge like bioweapon-uplift details. Note that auditors very often have some information that they don't want to publish. Note that needing to trust the auditor did a competent job — e.g. elicited hard enough — is not laundering at all; that's ordinary trust in a person, not a hidden fact. The table below splits these into two columns.

Incentive compatibility vs calibration

Should you give bad scores to developers if they don't give you sufficient access, or should you just use your best guess? If you do the latter, then anyone performing surprisingly poorly is better off not disclosing this. But if you do the former, then your risk assessments can’t be taken at face value by third parties, and it’s easier for AI developers to discredit you by saying “that org has no idea what’s going on, obviously we’re way safer than they think”. This probably works especially well when they’re trying to discredit you to their employees or other groups who have more access to private info. As an example of that dynamic, AI Lab Watch initially rated Google poorly on security due to them not disclosing much about their security, and this led some GDM people to say they thought it was less credible or usable.
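The disclosure incentive can be made concrete with a toy model. This is a minimal sketch under made-up assumptions: scores run from 0 to 1 with higher meaning safer, the assessor's best guess for a developer it knows nothing about is 0.5, and a developer discloses exactly when doing so does not lower its published score. The policy names, function names, and numbers are all hypothetical.

```python
# Hypothetical toy model: every name and number here is an illustrative
# assumption, not taken from any real assessment scheme.

PRIOR_MEAN = 0.5   # assessor's best guess for a developer that shares nothing
WORST_CASE = 0.0   # score assigned under an "assume the worst" policy


def published_score(true_score: float, discloses: bool, policy: str) -> float:
    """Score the assessor publishes for one developer (higher = safer)."""
    if discloses:
        return true_score
    if policy == "best_guess":
        return PRIOR_MEAN
    if policy == "assume_worst":
        return WORST_CASE
    raise ValueError(f"unknown policy: {policy}")


def prefers_to_disclose(true_score: float, policy: str) -> bool:
    """A developer discloses iff that does not lower its published score."""
    if_disclosed = published_score(true_score, True, policy)
    if_hidden = published_score(true_score, False, policy)
    return if_disclosed >= if_hidden


if __name__ == "__main__":
    for true_score in (0.2, 0.5, 0.8):
        for policy in ("best_guess", "assume_worst"):
            print(true_score, policy, prefers_to_disclose(true_score, policy))
```

In this toy model, a developer with a true score of 0.2 hides it under the best-guess policy and is published at 0.5. Under assume-the-worst every developer prefers to disclose, but the published score of anyone who still declines says nothing about their actual safety, which is the calibration cost described above.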

As I said in the previous section, it's maybe rough to get stakeholders on board with risk assessments that assume the worst, but maybe it's doable and I think we should plausibly try to achieve this.

Current risk vs preparedness

Are you answering a question like "are the current risks adequately handled" or are you answering "is this developer on track to handle risks later"?

The downside of analyzing current risks is that they're inconsequential and will probably be inconsequential up until shortly before they're severe.

The downside of analyzing preparedness is that you have to be much more opinionated about futurism and threat models; your reports will rely on assumptions that are much more contentious.

Cross-developer comparability

Are you trying to make it easy or hard to compare between companies?

AI Lab Watch and friends are trying to make it easy. The advantage of this is that it might pressure companies to behave better or shift resources towards better companies (e.g. because people prefer working at companies they believe to be more responsible).
METR's FRR went out of its way to make it hard. The advantage of making it hard is that companies are more willing to tell you stuff. (One way of thinking about this is that if you remove comparability, scary stuff the AI companies say has costs that are split among them and their competitors, which is way better for them on net than the costs being concentrated on them. In practice, AI companies also get some cred from safety people for producing evidence of scary stuff, so this effect hasn't mattered that much.)

I think that cross-developer comparability makes much more sense if you're doing evidence analysis, because evidence-analysis assessments are more comparable between companies.

So kind of obviously, I think that the choice between these is basically a tradeoff between different goals: if you mostly want to pressure AI companies to behave better, then comparability is good; if you mostly want to inform stakeholders about the overall level of risk, then comparability is probably bad if you were also trying to do evidence laundering, because it makes it more costly for developers to share info with you.

Examples, classified against the axes above

Project	Fact gen vs evidence analysis	Company laundering	Auditor laundering	Current vs preparedness	Cross-developer comparability	Reviewer vs producer
Dangerous-capability evals (METR, UK AISI)	Fact gen	No	Yes	Now	Easy	Producer
Classifier-robustness red-teaming for misuse prevention	Fact gen	No	Yes	Now	Easy	Producer
David Rein's red-team of Anthropic monitors	Fact gen	Yes	No	Now	Hard	Producer
Security pen-testing	Fact gen	Yes	No	Now	Hard	Producer
CAISI DeepSeek evaluation	Fact gen	No	No	Now	Easy	Producer
Apollo in-context scheming evals	Fact gen	No	No	Now	Hard	Producer
METR review of Anthropic's sabotage risk report	Evidence analysis	Yes	No	Now	Hard	Reviewer
Redwood review of OpenAI CoT training, External review of DeepMind scheming-inability safety case	Evidence analysis	No	No	Now	Hard	Reviewer
METR Frontier Risk Report	Both	Slight (non-attribution)	No	Now	Very hard (deliberately)	Producer
AI Lab Watch, FLI AI Safety Index	Evidence analysis	No	No	Both	Easy	Producer
SaferAI risk-management maturity ratings	Evidence analysis	No	No	Prep	Easy	Producer
GovAI third-party compliance reviews (proposal)	Evidence analysis	Yes	No	Prep	Med	Reviewer
Brundage et al. AAL framework (proposal)	Both	Yes	No	Both	Easy (via AAL scale)	Both

The columns are far from independent: the whole table can be recovered by a short chain of single-question splits. The tree below is the chain that minimizes expected questions-to-classify. Note that auditor laundering occurs only at the fact-generation end, and evidence-analysis assessors only ever launder company secrets or nothing.

Decision tree of third-party assessment types
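The observation about where auditor laundering shows up can be checked mechanically against the table above. Here is a small sketch with the three relevant columns transcribed by hand; the `Row` type and the function names are illustrative, and it does not attempt to reconstruct the decision tree itself.

```python
from collections import namedtuple

# Illustrative record type; only the columns needed for the check are kept.
Row = namedtuple("Row", "project kind company_laundering auditor_laundering")

# Transcribed from the table above.
ROWS = [
    Row("Dangerous-capability evals (METR, UK AISI)", "Fact gen", "No", "Yes"),
    Row("Classifier-robustness red-teaming for misuse prevention", "Fact gen", "No", "Yes"),
    Row("David Rein's red-team of Anthropic monitors", "Fact gen", "Yes", "No"),
    Row("Security pen-testing", "Fact gen", "Yes", "No"),
    Row("CAISI DeepSeek evaluation", "Fact gen", "No", "No"),
    Row("Apollo in-context scheming evals", "Fact gen", "No", "No"),
    Row("METR review of Anthropic's sabotage risk report", "Evidence analysis", "Yes", "No"),
    Row("Redwood review of OpenAI CoT training, External review of DeepMind scheming-inability safety case", "Evidence analysis", "No", "No"),
    Row("METR Frontier Risk Report", "Both", "Slight (non-attribution)", "No"),
    Row("AI Lab Watch, FLI AI Safety Index", "Evidence analysis", "No", "No"),
    Row("SaferAI risk-management maturity ratings", "Evidence analysis", "No", "No"),
    Row("GovAI third-party compliance reviews (proposal)", "Evidence analysis", "Yes", "No"),
    Row("Brundage et al. AAL framework (proposal)", "Both", "Yes", "No"),
]


def kinds_with_auditor_laundering(rows):
    """Set of assessment kinds that have at least one auditor-laundering row."""
    return {r.kind for r in rows if r.auditor_laundering == "Yes"}


def evidence_analysis_auditor_launderers(rows):
    """Evidence-analysis projects that rely on auditor-held secrets."""
    return [
        r.project
        for r in rows
        if r.kind == "Evidence analysis" and r.auditor_laundering != "No"
    ]


if __name__ == "__main__":
    print(kinds_with_auditor_laundering(ROWS))
    print(evidence_analysis_auditor_launderers(ROWS))
```

On this transcription, the only kind with auditor laundering is fact generation, and no evidence-analysis row has it, which matches the note.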

Johan David Bonilla (1mo)

I don't think people appreciate how genuinely hard a position you and the other assessors are in.

Imagine having a sibling who tells you everything (you're the only one who actually knows them), but the day they do something seriously wrong you can't actually tell your parents because it ruins the whole relationship.

You can choose not to tell so you keep the visibility, but then you're kind of complicit. So either way you lose something, access or integrity.

One real world example I'm reminded of is Enron. The outside firm hired to audit their books (Arthur Andersen) stayed quiet because Enron was too big a client to lose. The fraud ran for years and eventually took both of them out. That's the same thing we're seeing here: the side being checked is valuable enough that the checker can't afford to upset them.

Two things feel impossible to get around:

1. You might not be able to reply with full honesty, and that's the whole point.

You're the public face of an org that runs on these relationships. Even if you totally agreed, saying so out loud costs you.
That's not an AI thing either, it's just what being an important public figure is like. People in your position rarely get to say the full truth. Meanwhile I'm an outsider, free to say anything, but with no access and a weightless voice.

2. Regulation is the obvious historical fix, but governments are not clean either.

They're their own institutions with their own incentives and their own version of this exact problem. It seems like it's incentive-misalignment at every level, not just the developer/assessor one.
One of the more concerning things is that regulation usually shows up after disaster forces it. Most industries survive the disaster and learn from it, but you could imagine a reality where the AI disaster that forces the rule has a cost impossible to recover from.

Thanks @Buck for your notes on third-party risk assessment. You made the different parts that get lumped together legible. In particular, fact generation vs evidence analysis, and the two types of information laundering (whether it's the company's secrets being hidden or the assessor's), have stuck with me.

As someone trying to find my break into AI safety, I really appreciate this rare and honest picture of what I'm walking into before I do (hopefully).

Buck (1mo)

Yeah. I agree that that tension is pretty concerning. I have read about Enron (as well as most of the other famous examples of corporate disasters) and am definitely somewhat inspired by that.

I'll note that Redwood is actually mostly not in the bind you describe; we aren't very dependent on official relationships with AI developers; us getting blackballed from official relationships with them would be a bummer but by no means an existential threat.
