The evaluation awareness of frontier models has increased dramatically in recent years, which weakens the safety assurance that evals can provide. I think a good response would be to add an assurance layer based on gradual deployment and beta testing.
It's fundamentally difficult to build evals that increasingly intelligent models can't distinguish from deployment, but we can get much more informative results by testing models in real deployment settings. This already happens via internal deployment and, in some cases, early-access external deployment, but it doesn't seem to be systematically incorporated into safety assurance right now. Beta testing is standard practice in software development, so this seems quite tractable.