There are many different activities that could be described as "third-party risk assessment". Here are some distinctions that I’ve found helpful for thinking about the space over the last few weeks.
(Thanks Ajeya Cotra and Paul Christiano for discussions that inspired most of this.)
Throughout this, I refer to the actors as:
Developers.
Stakeholders. These are the people who want to be informed about risks. Possible stakeholders include: governments, the public, the developer's board, the developer's employees.
The choice of stakeholder matters because one of the roles of an auditor is to review confidential info that they then do not directly disclose to stakeholders; they only tell them their conclusions. This role is more important if the developer is more concerned about disclosing confidential information to the stakeholder.
Third parties. I don't know a better term for "independent actors who contribute in various ways to a stakeholder's understanding of risks through producing and evaluating evidence and/or arguments". Like, it's weird to call the physical security pentesting firm a "risk assessor". And AI Lab Watch isn't really an "auditor". And "evaluator" makes it sound like they run model evals.
The next step in the analysis will be to think about different objectives of third-party risk assessment and how those interact with these axes of variation.
Axes
Fact generation vs evidence analysis
This is maybe the one that comes up most. I'll define:
A fact-generation assessment is trying to answer some narrow question to produce a fact that will later be used to evaluate risk.
Examples:
Dangerous capability evaluations, e.g. METR and UK AISI's capability evals.
For many of those, a core reason why it makes sense to have a third party do them is that the process is sort of fundamentally adversarial—the final evidence will be of the form "I tried to demonstrate danger, but failed", so you're really depending on the person trying hard, and it's structurally easier to be confident that someone's not sandbagging if they're an unconflicted third party. Some cases are like monitor redteaming or pentesting where it's obviously adversarial; others, like dangerous capabilities evals, are cases where you're relying on the evaluator to do not-particularly-adversarial optimization (e.g. you need them to pick good hyperparameters for their fine-tune to elicit CBRN capabilities).
If our claim is "you need a third party to do the optimization every time that you argue for safety by saying an optimization attempt failed", this maybe has broader implications; I'm not sure people are applying this consistently.
For the others that reason isn't present. For example, for a cyber capability eval where the AI company basically does no extra elicitation, it seems like it would probably be fine if the AI company just ran the evals themselves—the produced fact isn't rendered particularly suspect by the fact that it was done in house.
There are some other reasons that it might make sense to have a third party do the eval:
They have expertise the AI company doesn't have in house. This totally makes sense in principle; it could totally make sense to centralize the expertise somewhere outside the company so that the AI companies can share it.
The eval requires sensitive info that the evaluator doesn't want to give the AI co. E.g. sensitive CBRN stuff? This makes some sense.
A developer with access to the dataset can generally increase or decrease eval scores, because they can train/iterate against it. But in many cases the developer can also game eval scores even without access to the dataset; e.g. they can probably mess with alignment evidence by training models to act aligned in weird situations. If an eval score won't be valid evidence without separately understanding whether the AI developer did anything that would compromise it, it's not clearly much worse for the AI company to hold the eval and run it themselves.
An evidence-analysis assessment tries to combine many forms of evidence into an overall judgement about a high-level question. The central example of such a question is "how much risk to the world does this deployment pose".
Examples:
METR's review of Anthropic's sabotage risk report. (The sabotage risk report analyzes a narrower slice of risk, but it is an analysis that draws together many different facts to answer a natural question that's pretty close in the causal graph to "how much total risk is there". Unlike e.g. "how robust was this monitor", which requires a bunch of additional context to be interesting.)
This is what auditing is in the context of corporate accounting—they look at your books and ask you questions to assess the overall financial state of the company.
Within evidence analysis, there's a distinction between reviewing a developer's argument vs. producing your own.
I think that evidence analysis and fact generation are pretty different, and it's not obvious that any projects or organizations should do both. The skillsets seem plausibly very different. Like, I suspect that during the main risk period, only a pretty small subset of the facts relevant to the risk analysis should be generated by third parties (as opposed to being reported by the company). So someone doing evidence analysis will naturally be operating several steps in the "evidence tree" away from someone generating a fact. And so I'm not sure there's that much synergy for these happening as part of the same project. E.g. I don't think it would be particularly useful for the security pentesters to work at the same org that does the overall assessment.
I think it's pretty weird that these are conflated so often (e.g. in lots of the widely-cited papers about AI auditing or evaluations, like Frontier AI Auditing and Model Evaluation for Extreme Risks); they seem really different to me.
Laundering private evidence into sharable conclusions?
One function of risk analyses is to inform a stakeholder about the risks posed by a developer. One strategy is to have an auditor who accesses private information from the developer and uses it to answer a question, and then gives the answer to that question to the stakeholder without telling them the private information that caused them to reach that conclusion. The other option is to do the risk analysis just based on information that the stakeholder already knows (or is cleared to know).
In this post, I call the first strategy “evidence-laundering”. I don’t mean this with a negative connotation and will probably choose a different term in future because the negative connotation is so strong. I just wanted a distinct term that emphasizes that information is being processed from private evidence to public conclusions to reduce disclosure of commercially sensitive information.
I think I want to define this as: did the conclusion of a report rely on facts that were not stated in the report? So it's not evidence-laundering if you report a straightforward fact that your readers can't personally check and so have to trust you're not straightforwardly lying about. But it is evidence-laundering if you don't report the facts and instead report "based on private facts, things look ok".
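As a minimal sketch of this definition: a report counts as evidence-laundering exactly when its conclusion relies on at least one fact that the report does not state. The `Report` class, its field names, and the example facts are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Report:
    """Hypothetical model of a risk report; field names are illustrative."""
    conclusion: str
    # Facts written down in the report itself.
    stated_facts: set[str] = field(default_factory=set)
    # Every fact the conclusion actually depends on, stated or not.
    relied_on_facts: set[str] = field(default_factory=set)


def withheld_facts(report: Report) -> set[str]:
    """Facts the conclusion relies on that the report does not state."""
    return report.relied_on_facts - report.stated_facts


def is_evidence_laundering(report: Report) -> bool:
    """True when the conclusion rests on at least one unstated fact."""
    return bool(withheld_facts(report))


# Readers cannot check the stated fact themselves, but they can see it,
# so this is ordinary trust rather than laundering.
transparent = Report(
    conclusion="risk looks low",
    stated_facts={"model failed task suite X"},
    relied_on_facts={"model failed task suite X"},
)

# "Based on private facts, things look ok."
laundered = Report(
    conclusion="things look ok",
    stated_facts=set(),
    relied_on_facts={"private security architecture details"},
)
```

Under this sketch, whether readers can verify a stated fact does not enter the test at all; only whether the fact appears in the report does.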
Examples of the evidence-laundering strategy:
Pen-testing. It's totally normal to have a pen-tester say "we looked for vulns; I won't tell you what they are but here's how bad they were". And for the pen-tester to not tell the stakeholder various sensitive details.
Lots of private evaluations in other parts of society. In courts, a psychiatric evaluation involves the psychiatrist talking privately to someone and then stating the conclusion publicly without fully justifying it.
Examples of the just-public-evidence strategy:
Our review of the OpenAI CoT training thing. The workflow here was that they shared the draft with us a week or so before they published it, and we went through a few rounds of me complaining that the draft didn't have some important info that I needed to be convinced of their conclusion, and they iteratively added more info to the draft until I was pretty happy. So even though there was some temporary secrecy, we were fundamentally not in a laundering role: the thing that we released was based entirely on information that any reader of our blog post could read themselves. We were under our standard NDA, so they could have told us stuff during this process that we wouldn't have been allowed to repeat.
AI Lab Watch. I think Zach sometimes emailed AI companies asking them questions that he said would affect their rating, and maybe they sometimes answered? As long as this works via him entering their answers into the public record, he's not playing the evidence-laundering role.
The METR Frontier Risk Report is almost entirely this. Their report is entirely based on info that AI companies agreed to share publicly, except that some of these facts are published without attribution (that is, METR doesn't say which company said something), so METR is playing the evidence-laundering role of verifying that an AI company really did say something without revealing who said it.
This is an interesting case because METR collected a bunch of non-public info during this project, but the only way this entered their report was by companies agreeing to make subsets of the info public.
Evidence-laundering is a central part of what "Frontier AI Auditing" (2026) means by "auditing"; the paper writes: "Transparency alone cannot enable well-calibrated trust in the most capable (“frontier”) AI systems and the companies that build them: many safety- and security-relevant details are legitimately confidential".
The core strategic question here is definitely "when should you use evidence-laundering vs transparency".
I think that transparency is probably better all else equal? Evidence laundering should probably be reserved for cases where it's extremely defensible for the developer to not want to publicly release certain info. I am extremely sympathetic to this for pentesting or other things where there's an imminent cost to releasing the full evidence. I am less sympathetic to e.g. developers wanting to keep the capabilities of unreleased models secret.
But I think it might be pretty good if we can rely on transparency as much as possible.
In particular, I think that transparency maybe makes it easier for risk evaluations to differentially pressure AI companies to do better. I am excited for projects like AI Lab Watch that do risk evaluations that compare the AI companies. This might piss off the AI companies, so it might be hard for such projects to have an evidence-laundering relationship with them. Their ratings will be more legitimate if they have the info they need to make their assessments. AI Lab Watch has the stance "if you don't give public statements explaining what you're doing on issue X, we'll assume you maximally suck"; perhaps stakeholders would think that it's reasonable to expect developers to produce public evidence of safety and therefore the developers will publish their evidence publicly. This allows us to have a robust mechanism for between-company pressure on safety for longer than if we tried to do all this without transparency, using evidence laundering.
A few notes:
Reporting a score against a benchmark the evaluator built but hasn't released — "DeepSeek scored 40% on our private benchmark" — isn't laundering of the developer's information (you know the claim and whom to blame), but it does ask the reader to trust, sight unseen, that the evaluator built a reasonable benchmark. The developer itself may not even have access to check it. Releasing an i.i.d. subset of the benchmark resolves most of this.
A report has to disclose how it looked. Every conclusion implicitly relies on the unstated fact that nothing else relevant turned up. So a clean report should say how it searched for other kinds of facts — including disconfirming ones — and report that there weren't any (or what it found). A report that presents its supporting facts and stops is laundering the adequacy of its own search.
Whose secret? Company vs auditor laundering. It helps to split laundering by who owns the withheld fact. Company laundering hides facts about the developer's own system — details of security systems that a pentester is attacking, or training methods that an auditor wants to know about. Auditor laundering leans on something the assessor holds and won't share: a particularly strong example of this is hazardous knowledge like bioweapon-uplift details. Note that auditors very often have some information that they don't want to publish. Note that needing to trust the auditor did a competent job — e.g. elicited hard enough — is not laundering at all; that's ordinary trust in a person, not a hidden fact. The table below splits these into two columns.
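The first note's suggestion of releasing an i.i.d. subset of a private benchmark can be sketched as follows. This is a minimal illustration under assumptions: I'm treating "i.i.d. subset" as a uniform random sample without replacement, and the function name, the placeholder questions, and the 10% fraction are all hypothetical.

```python
import random


def release_iid_subset(items, fraction, seed):
    """Split items into (public, private), with public a uniform random sample.

    Sampling uniformly without replacement means every item is equally
    likely to be released, so the public part is representative of the
    whole benchmark in expectation.
    """
    if not items:
        raise ValueError("benchmark is empty")
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    n_public = max(1, round(len(items) * fraction))
    public_idx = set(rng.sample(range(len(items)), n_public))
    public = [x for i, x in enumerate(items) if i in public_idx]
    private = [x for i, x in enumerate(items) if i not in public_idx]
    return public, private


# Hypothetical benchmark of 200 placeholder questions.
benchmark = [f"question-{i}" for i in range(200)]
public, private = release_iid_subset(benchmark, fraction=0.1, seed=0)
```

Readers (and the developer) can then inspect the public items to judge whether the benchmark is reasonable, while the private items remain usable for scoring.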
Incentive compatibility vs calibration
Should you give bad scores to developers if they don't give you sufficient access, or should you just use your best guess? If you do the latter, then anyone performing surprisingly poorly is better off not disclosing this. But if you do the former, then your risk assessments can’t be taken at face value by third parties, and it’s easier for AI developers to discredit you by saying “that org has no idea what’s going on, obviously we’re way safer than they think”. This probably works especially well when they’re trying to discredit you to their employees or other groups who have more access to private info. As an example of that dynamic, AI Lab Watch initially rated Google poorly on security due to them not disclosing much about their security, and this led some GDM people to say they thought it was less credible or usable.
As I said in the previous section, it's maybe rough to get stakeholders on board with risk assessments that assume the worst, but maybe it's doable and I think we should plausibly try to achieve this.
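The tradeoff in this section can be made concrete with a toy model. Everything here is an assumption made for illustration: scores live on a 0-to-1 scale, the developer simply wants the highest assessed score, and the assessor's prior guess for an undisclosed item is 0.5.

```python
WORST_SCORE = 0.0  # hypothetical scale: 0 is worst, 1 is best


def assessed_score(true_score, disclosed, policy, prior_guess=0.5):
    """Score the assessor assigns under a given policy for missing evidence.

    policy="worst_case": undisclosed evidence is scored as badly as possible.
    policy="best_guess": undisclosed evidence is scored at the assessor's prior.
    """
    if disclosed:
        return true_score
    if policy == "worst_case":
        return WORST_SCORE
    if policy == "best_guess":
        return prior_guess
    raise ValueError(f"unknown policy: {policy!r}")


def prefers_to_withhold(true_score, policy, prior_guess=0.5):
    """True if withholding yields a strictly higher assessed score."""
    shown = assessed_score(true_score, True, policy, prior_guess)
    hidden = assessed_score(true_score, False, policy, prior_guess)
    return hidden > shown


weak_developer = 0.2    # doing worse than the assessor would guess
strong_developer = 0.9  # doing better than the assessor would guess
```

In this toy model, `prefers_to_withhold(weak_developer, "best_guess")` is true, which is the incentive problem, while `assessed_score(strong_developer, False, "worst_case")` is 0.0 even though the true score is 0.9, which is the calibration problem.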
Current risk vs preparedness
Are you answering a question like "are the current risks adequately handled" or are you answering "is this developer on track to handle risks later"?
The downside of analyzing current risks is that they're inconsequential and will probably be inconsequential up until shortly before they're severe.
The downside of analyzing preparedness is that you have to be much more opinionated about futurism and threat models; your reports will rely on assumptions that are much more contentious.
Cross-developer comparability
Are you trying to make it easy or hard to compare between companies?
AI Lab Watch and friends are trying to make it easy. The advantage of this is that it might pressure companies to behave better or shift resources towards better companies (e.g. because people prefer working at companies they believe to be more responsible).
METR's FRR went out of its way to make it hard. The advantage of making it hard is that companies are more willing to tell you stuff. (One way of thinking about this is that if you remove comparability, scary stuff the AI companies say has costs that are split among them and their competitors, which is way better for them on net than the costs being concentrated on them. In practice, AI companies also get some cred from safety people for producing evidence of scary stuff, so this effect hasn't mattered that much.)
I think that cross-developer comparability makes much more sense if you're doing evidence analysis, because evidence-analysis assessments are more comparable between companies.
So kind of obviously, I think that the choice between these is basically a tradeoff between different goals: if you mostly want to pressure AI companies to behave better, then comparability is good; if you mostly want to inform stakeholders about the overall level of risk, then comparability is probably bad if you were also trying to do evidence laundering, because it makes it more costly for developers to share info with you.
Examples, classified against the axes above

| Project | Fact gen vs evidence analysis | Company laundering | Auditor laundering | Current vs preparedness | Cross-developer comparability | Reviewer vs producer |
| --- | --- | --- | --- | --- | --- | --- |
| Dangerous-capability evals (METR, UK AISI) | Fact gen | No | Yes | Now | Easy | Producer |
| Classifier-robustness red-teaming for misuse prevention | Fact gen | No | Yes | Now | Easy | Producer |
| David Rein's red-team of Anthropic monitors | Fact gen | Yes | No | Now | Hard | Producer |
| Security pen-testing | Fact gen | Yes | No | Now | Hard | Producer |
| CAISI DeepSeek evaluation | Fact gen | No | No | Now | Easy | Producer |
| Apollo in-context scheming evals | Fact gen | No | No | Now | Hard | Producer |
| METR review of Anthropic's sabotage risk report | Evidence analysis | Yes | No | Now | Hard | Reviewer |
| Redwood review of OpenAI CoT training, External review of DeepMind scheming-inability safety case | Evidence analysis | No | No | Now | Hard | Reviewer |
| METR Frontier Risk Report | Both | Slight (non-attribution) | No | Now | Very hard (deliberately) | Producer |
| AI Lab Watch, FLI AI Safety Index | Evidence analysis | No | No | Both | Easy | Producer |
| SaferAI risk-management maturity ratings | Evidence analysis | No | No | Prep | Easy | Producer |
| GovAI third-party compliance reviews (proposal) | Evidence analysis | Yes | No | Prep | Med | Reviewer |
| Brundage et al. AAL framework (proposal) | Both | Yes | No | Both | Easy (via AAL scale) | Both |
The columns are far from independent: the whole table can be recovered by a short chain of single-question splits. The tree below is the chain that minimizes expected questions-to-classify. Note that auditor laundering occurs only at the fact-generation end, and evidence-analysis assessors only ever launder company secrets or nothing.
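The note about auditor laundering can be checked mechanically against the table. The sketch below copies three columns from the table; the function names are mine, and it checks only that note, not the tree itself.

```python
# Rows copied from the table: (project, fact gen vs evidence analysis,
# company laundering, auditor laundering).
ROWS = [
    ("Dangerous-capability evals (METR, UK AISI)", "Fact gen", "No", "Yes"),
    ("Classifier-robustness red-teaming for misuse prevention",
     "Fact gen", "No", "Yes"),
    ("David Rein's red-team of Anthropic monitors", "Fact gen", "Yes", "No"),
    ("Security pen-testing", "Fact gen", "Yes", "No"),
    ("CAISI DeepSeek evaluation", "Fact gen", "No", "No"),
    ("Apollo in-context scheming evals", "Fact gen", "No", "No"),
    ("METR review of Anthropic's sabotage risk report",
     "Evidence analysis", "Yes", "No"),
    ("Redwood review of OpenAI CoT training, External review of DeepMind "
     "scheming-inability safety case", "Evidence analysis", "No", "No"),
    ("METR Frontier Risk Report", "Both", "Slight (non-attribution)", "No"),
    ("AI Lab Watch, FLI AI Safety Index", "Evidence analysis", "No", "No"),
    ("SaferAI risk-management maturity ratings",
     "Evidence analysis", "No", "No"),
    ("GovAI third-party compliance reviews (proposal)",
     "Evidence analysis", "Yes", "No"),
    ("Brundage et al. AAL framework (proposal)", "Both", "Yes", "No"),
]


def auditor_laundering_only_in_fact_gen(rows):
    """Every row with auditor laundering is a pure fact-generation project."""
    return all(kind == "Fact gen"
               for _, kind, _, auditor in rows if auditor == "Yes")


def evidence_analysis_never_auditor_launders(rows):
    """Rows involving evidence analysis launder company secrets or nothing."""
    return all(auditor == "No"
               for _, kind, _, auditor in rows if kind != "Fact gen")


auditor_launderers = [name for name, _, _, auditor in ROWS if auditor == "Yes"]
```

Both checks pass on the table as written, and `auditor_launderers` contains just the dangerous-capability evals and the classifier-robustness red-teaming rows.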