One approach would be to red-team the red-teamers. That is, blind the eval team as to what exactly they are evaluating and send them a stream of real models and also models where they should find a problem. This would necessitate deliberately training models with serious problems, which has some downsides. (The model could accidentally or maliciously be deployed, for example. And in a sufficiently-close-to-AGI scenario, any model that is plausibly a real candidate to be evaluated, but also contains a significant flaw that is close enough to what we want the evals team to be looking for, would almost certainly be extremely dangerous.) And given how connected the AI community is, it might be difficult to prevent information leakage of the form "GPT-k just finished training, so at least one of the models you will evaluate this week is probably GPT-k and you probably won't be immediately punished for saying it is fine." Or, more insidiously, it could be known that GPT-k uses some particular activation function and thus white-box evaluators will easily know which model is GPT-k. And of course it would double the expense of training a large language model if one always had to train an equivalent lesioned model beside it. (Perhaps one could finetune lesions in, though, for a much smaller cost. Or perhaps one could try to induce governments to require via regulation that some multiple of the training cost of a model be spent on evals and supporting infrastructure such as training lesioned models.)
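To make the blinded setup a bit more concrete, here is a minimal sketch of how one might score an evals team after such a run. Everything in it is an assumption for illustration: the `BlindedResult` record, the `score_evaluator` function, and the choice of detection rate and false-alarm rate as the two numbers worth tracking are not taken from any existing evals process.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class BlindedResult:
    """One model shown to the evals team (hypothetical record)."""
    has_planted_flaw: bool  # ground truth, hidden from the evals team
    flagged: bool           # the evals team's verdict


def score_evaluator(
    results: List[BlindedResult],
) -> Tuple[Optional[float], Optional[float]]:
    """Return (detection_rate, false_alarm_rate).

    detection_rate: share of deliberately flawed models that were flagged.
    false_alarm_rate: share of unmodified models that were flagged.
    Either value is None if that group of models is empty.
    """
    planted = [r for r in results if r.has_planted_flaw]
    clean = [r for r in results if not r.has_planted_flaw]
    detection_rate = (
        sum(r.flagged for r in planted) / len(planted) if planted else None
    )
    false_alarm_rate = (
        sum(r.flagged for r in clean) / len(clean) if clean else None
    )
    return detection_rate, false_alarm_rate
```

Note that this only measures anything if the blinding holds. If evaluators can tell which model is the real one (the leakage problem above), both numbers stop being informative.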
An approach for more prosaic concerns would be to change the business model of evaluation groups from "fee-for-service" to some sort of insurance model. So OpenBrainMind gives them access to a model, pays them $N, and will receive $M >> $N if the evals team misses something. (One would of course need to ensure that $M << $L, the financial fallout to OpenBrainMind for the prosaic alignment failure then happening in public; otherwise OpenBrainMind loses little, or even comes out ahead, when a failure is missed, which is a different kind of bad incentive.) This of course helps not at all with existential risks, as no one will be around to collect any sort of payment, and any realistically achievable payment is dwarfed by both the potential upsides and the potential downsides.
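The incentive arithmetic in the insurance model can be sketched in a few lines. This is a toy model under stated assumptions, not a real contract design: the function names, the single `miss_probability`, and the `eval_cost` term are all hypothetical, and N, M, and L are just the quantities named above.

```python
def evaluator_expected_profit(
    fee_n: float, payout_m: float, miss_probability: float, eval_cost: float
) -> float:
    """Expected profit for the evals team under the insurance model.

    The team collects the fee N, spends eval_cost on the evaluation, and
    pays out M with probability miss_probability (assumed, for this toy
    model, to be a single known number).
    """
    return fee_n - miss_probability * payout_m - eval_cost


def lab_net_from_missed_failure(payout_m: float, fallout_l: float) -> float:
    """Lab's net position when a missed failure becomes public.

    The lab receives M from the evals team and suffers fallout L.
    A value near or above zero means the lab has little reason to care
    whether failures are caught, which is why M << L is needed.
    """
    return payout_m - fallout_l
```

For example, with N = 10, M = 100, and an evaluation cost of 2, a team that misses problems a quarter of the time expects 10 - 25 - 2 = -17, so it loses money unless it gets better at finding problems. With L = 1000, a missed failure still costs the lab 900 on net, so the lab keeps its own reason to want problems found.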
So those are two approaches with some obvious flaws.
In general, there are several other problems that seem inherent in the "multiple evaluators" framework. For instance, sharing techniques between evaluators could undermine market dominance; not sharing techniques between evaluators seems like it would make the evaluation process much weaker than it could otherwise be. (One could say the same of AI capabilities firms, and they often share tons of research with the public, so I'm not very confident that this is a real failure mode we have to worry about.) (And, of course, if you share evaluation framework knowledge too widely, some AI firms may train against them adversarially, thereby destroying the ability of evaluators to find anything while plausibly not actually helping very much to prevent alignment failures.)
Does this drive a "race to the bottom," where more lenient evals teams get larger market share?
I appreciate you asking this, and I find this failure mode plausible. It reminds me of one of the failure modes I listed here (where a group proposing strict evals gets outcompeted by a group proposing looser evals).
Governance failure: We are outcompeted by a group that develops (much less demanding) evals/standards (~10%). Several different groups develop safety standards for AI labs. One group has expertise in AI privacy and data monitoring, another has expertise in ML fairness and bias, and a third is a consulting company that has advised safety standards in a variety of high-stakes contexts (e.g., biosecurity, nuclear energy).
Each group proposes their own set of standards. Some decision-makers at top labs are enthusiastic about The Unified Evals Agreement, but others are skeptical. In addition to object-level debates, there are debates about which experts should be trusted. Ultimately, lab decision-makers end up placing more weight on teams with experience implementing safety standards in other sectors, and they go with the standards proposed by the consulting group. Although these standards are not mutually exclusive with The Unified Evals Agreement, lab decision-makers are less motivated to adopt new standards (“we just implemented evals; we have other priorities right now.”). The Unified Evals Agreement is crowded out by standards that have much less emphasis on long-term catastrophic risks from AI systems.
Nonetheless, the "vibe" I get is that people seem quite confident that this won't happen. Perhaps this is because labs care a lot about x-risk and want to have high-quality evals, perhaps because lots of the people working on evals have good relationships with labs, and perhaps because they expect there aren't many groups working on evals/standards (except for the x-risk-motivated folks).
I think it's not a stable equilibrium to have one Evals Org. A) Even if it were a good idea, I don't see how we can get from here to there on coordinating on a monopolistic Evals Org. B) I think having Healthy Competition is good and you should avoid structuring your ecosystem around a monopoly if you can help it. There are just a lot of benefits of having multiple Evals Teams competing to develop better methods.
So I see the goal here as "design an incentive system such that the competition is good, instead of bad."
It's possible that competitions similar to the Audit Games concept would fit in here.