One approach would be to red-team the red-teamers. That is, blind the eval team as to what exactly they are evaluating and send them a stream of real models interleaved with models in which they should find a deliberately seeded problem. This would necessitate deliberately training models with serious flaws, which has some downsides. (The flawed model could be deployed accidentally or maliciously, for example. And in a sufficiently-close-to-AGI scenario, any model that is plausibly a real candidate for evaluation, yet contains a significant flaw close enough to what we want the evals team to be looking for, would almost certainly be extremely dangerous.)

And given how connected the AI community is, it might be difficult to prevent information leakage of the form "GPT-k just finished training, so at least one of the models you will evaluate this week is probably GPT-k, and you probably won't be immediately punished for saying it is fine." Or, more insidiously, it could be known that GPT-k uses some particular activation function, so white-box evaluators will easily know which model is GPT-k.

And of course it would double the expense of training a large language model if one always had to train an equivalent lesioned model beside it. (Perhaps one could finetune lesions in, though, for a much smaller cost. Or perhaps one could try to induce governments to require via regulation that some multiple of a model's training cost be spent on evals and supporting infrastructure, such as training lesioned models.)
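To make the blinding a bit more concrete, here is a minimal sketch (all names and numbers are hypothetical, not a real protocol) of how a trial coordinator might score an eval team against a blinded stream of clean and lesioned models, treating it as a signal-detection problem:

```python
from dataclasses import dataclass

@dataclass
class BlindedTrial:
    model_id: str        # opaque identifier; the eval team never sees the real name
    is_lesioned: bool    # ground truth, known only to the trial coordinator
    flagged: bool        # did the eval team report a problem?

def score_eval_team(trials: list[BlindedTrial]) -> dict[str, float]:
    """Summarize detection performance over a blinded stream of models."""
    lesioned = [t for t in trials if t.is_lesioned]
    clean = [t for t in trials if not t.is_lesioned]
    true_positive_rate = sum(t.flagged for t in lesioned) / max(len(lesioned), 1)
    false_positive_rate = sum(t.flagged for t in clean) / max(len(clean), 1)
    return {
        "true_positive_rate": true_positive_rate,    # seeded flaws that were caught
        "false_positive_rate": false_positive_rate,  # clean models wrongly flagged
    }

# Hypothetical usage: three blinded trials, one containing a seeded flaw.
trials = [
    BlindedTrial("model-A", is_lesioned=False, flagged=False),
    BlindedTrial("model-B", is_lesioned=True, flagged=True),
    BlindedTrial("model-C", is_lesioned=False, flagged=True),
]
print(score_eval_team(trials))
```

The point of the bookkeeping is just that blinding only buys you something if someone is actually measuring both catch rates and false alarms; otherwise the eval team can pass every trial by flagging everything.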
An approach for more prosaic concerns would be to change the business model of evaluation groups from fee-for-service to some sort of insurance model. OpenBrainMind gives the evaluators access to a model and pays them $N, and it receives $M >> $N if the evals team misses something. (One would of course need to ensure that $M << $L, the financial fallout to OpenBrainMind when the prosaic alignment failure then happens in public; otherwise the lab is effectively insured against its own failures, which is a different kind of bad incentive.) This helps not at all with existential risks, of course, since no one will be around to collect any sort of payment, and any realistically achievable payment is dwarfed by both the potential upsides and the potential downsides.
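As a toy illustration of the ordering constraint (all numbers hypothetical), the contract wants roughly $N << $M << $L: the indemnity has to be large enough that misses are costly to the evaluator, but small enough relative to the lab's own loss that the lab is never better off being insured against its failure:

```python
def incentive_check(fee_n: float, indemnity_m: float, lab_loss_l: float,
                    miss_probability: float) -> dict[str, float]:
    """Toy expected-value check for the insurance-style eval contract.

    fee_n: what the lab pays the evaluator up front ($N)
    indemnity_m: what the evaluator pays back if it misses a problem ($M)
    lab_loss_l: the lab's own loss when the missed failure happens in public ($L)
    miss_probability: assumed chance the evaluator misses a real problem
    """
    evaluator_expected_profit = fee_n - miss_probability * indemnity_m
    # If indemnity_m >= lab_loss_l, this goes nonpositive: the lab is fully
    # insured against its own failure -- the bad incentive noted above.
    lab_expected_net_loss = miss_probability * (lab_loss_l - indemnity_m)
    return {
        "evaluator_expected_profit": evaluator_expected_profit,
        "lab_expected_net_loss_if_missed": lab_expected_net_loss,
        "ordering_ok": float(fee_n < indemnity_m < lab_loss_l),
    }

# Hypothetical numbers: $10M fee, $200M indemnity, $5B public-failure loss, 5% miss rate.
print(incentive_check(10e6, 200e6, 5e9, 0.05))
```

This is only a sketch under the (strong) assumption that the miss probability is fixed; the whole point of the contract is to make that probability depend on evaluator effort.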
So those are two approaches with some obvious flaws.
In general, several other problems seem inherent in the "multiple evaluators" framework. For instance, sharing techniques between evaluators could undermine any individual evaluator's market position, while not sharing techniques seems like it would make the evaluation process much weaker than it could otherwise be. (One could say the same of AI capabilities firms, and they often share a great deal of research with the public, so I'm not very confident this is a real failure mode we have to worry about.) (And, of course, if you share evaluation-framework knowledge too widely, some AI firms may train adversarially against it, destroying the evaluators' ability to find anything while plausibly not doing much to actually prevent alignment failures.)
It might be worth thinking about how and why things went wrong in the run-up to the 2007–08 global financial crisis: credit rating agencies were paid by the issuers of the securities they rated, a misaligned incentive that rhymes with the fee-for-service problem discussed in this post.