Do model evaluations fall prey to the Good(er) Regulator Theorem?

I don't understand how the issue of validity in evals connects to the Gooder Regulator Theorem. It's certainly arbitrarily hard to measure a latent property of an AI agent that's buried an an arbitrarily complex causal diagram of the agent, but this is also true of any kind of science. The solution is to only measure simple things and validate them extensively.

To start, the playbook includes methods from the social sciences. Are humans lying in a psychology experiment intending to deceive, or just socially conditioned into deception in these circumstances? We can discriminate using more experiments. Here's a list suggested by Claude.

Internal Validity:
Manipulation checks: Verify participants understand the task and setting as intended
Control conditions: Include baseline conditions where lying provides no benefit
Multiple measures: Use both behavioral indicators (actual lies told) and physiological measures (stress responses, reaction times)
Debrief interviews: Post-experiment discussions to understand participant reasoning
Construct Validity:
Operationalize "lying" clearly: Distinguish between errors, omissions, and active false statements
Measure awareness: Test whether participants recognize they're providing false information
Cross-validate: Use multiple paradigms that elicit deception differently
External Validity:
Vary contexts: Test across different social settings and stakes
Diverse samples: Include participants from various cultural backgrounds where social norms differ
Ecological validity: Use scenarios that mirror real-world situations
Intentional deception should show:
Individual differences in moral reasoning correlation
Sensitivity to personal cost-benefit analysis
Greater cognitive effort signatures
Ability to inhibit when explicitly instructed
Conscious awareness and ability to justify
Socially conditioned behavior should show:
Stronger situation-specific activation
Less individual variation within similar social groups
Faster, more automatic responses
Resistance to conscious control
Difficulty articulating clear reasons

Likewise, in evals we can do increasingly elaborate robustness checks where we change the prompt and setting and measure if the correlation is high, check that the model understand its behavior is deceptive, provide incentives to the model that should make its behavior change if it were intentionally deceptive, and so on. Much of the time, especially for propensity / alignment evals, we will find the evals are nonrobust and we'll be forced to weaken our claims.

Sometimes, we have an easier job. I don't expect capability evals to get vastly harder because many skills are verifiable and sandbagging is solvable. Also, AI evals are easier than human psychology experiments, because they're likely to be much cheaper and faster to run and more replicable.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

6

Do model evaluations fall prey to the Good(er) Regulator Theorem?

6

6