I don't understand how the issue of validity in evals connects to the Gooder Regulator Theorem. It's certainly arbitrarily hard to measure a latent property of an AI agent that's buried an an arbitrarily complex causal diagram of the agent, but this is also true of any kind of science. The solution is to only measure simple things and validate them extensively.
To start, the playbook includes methods from the social sciences. Are humans lying in a psychology experiment intending to deceive, or just socially conditioned into deception in these circumstances? We can discriminate using more experiments. Here's a list suggested by Claude.
Internal Validity:
- Manipulation checks: Verify participants understand the task and setting as intended
- Control conditions: Include baseline conditions where lying provides no benefit
- Multiple measures: Use both behavioral indicators (actual lies told) and physiological measures (stress responses, reaction times)
- Debrief interviews: Post-experiment discussions to understand participant reasoning
Construct Validity:
- Operationalize "lying" clearly: Distinguish between errors, omissions, and active false statements
- Measure awareness: Test whether participants recognize they're providing false information
- Cross-validate: Use multiple paradigms that elicit deception differently
External Validity:
- Vary contexts: Test across different social settings and stakes
- Diverse samples: Include participants from various cultural backgrounds where social norms differ
- Ecological validity: Use scenarios that mirror real-world situations
Intentional deception should show:
- Individual differences in moral reasoning correlation
- Sensitivity to personal cost-benefit analysis
- Greater cognitive effort signatures
- Ability to inhibit when explicitly instructed
- Conscious awareness and ability to justify
Socially conditioned behavior should show:
- Stronger situation-specific activation
- Less individual variation within similar social groups
- Faster, more automatic responses
- Resistance to conscious control
- Difficulty articulating clear reasons
Likewise, in evals we can do increasingly elaborate robustness checks where we change the prompt and setting and measure if the correlation is high, check that the model understand its behavior is deceptive, provide incentives to the model that should make its behavior change if it were intentionally deceptive, and so on. Much of the time, especially for propensity / alignment evals, we will find the evals are nonrobust and we'll be forced to weaken our claims.
Sometimes, we have an easier job. I don't expect capability evals to get vastly harder because many skills are verifiable and sandbagging is solvable. Also, AI evals are easier than human psychology experiments, because they're likely to be much cheaper and faster to run and more replicable.
Note: reposting this as a top level post because it got no interaction as a quick take and I think this is actually a serious question. It arose from a discussion in the Technical AI Governance Forum.
As AI systems get more and more complicated, the properties we are trying to measure move away from formally verifiable tasks like "can it do two digit arithmetic" and move towards more complex things like "can it output a reasonable root-cause analysis of this bug" or "can it implement this feature". Evaluations trying to capture capabilties progress then must also move away from simple multiple choice questions towards more complex models of tasks. This means that complex evaluations like SWEbench involve at least partial models of things like computer systems or development environments. Models are given tools similar to real world developer tools and often set to work on subsets of real world codebases.
At this point we can start invoking the good regulator theorem, and say that the evaluator is in fact a regulator. It wants to produce the outcome "pass" when the joint system formed from the LLM and the world-model has some desired property ("feature has been implemented", "bug has been fixed"). It wants to produce the outcome "fail" otherwise. It seems necessary that the regulator will need to get more and more complicated to check for features in more and more complex systems. At the limit you have things like Google's recent focus on creating world models for AI training which are full physics-style simulations. For those types of physical tasks the evaluation actually tends towards implementing a perfectly deterministic model in the style of the original good regulator theorem, where the regulator aims to capture every interaction possible in the underlying system.
Going one level up, what we are interested in may be less properties of the task or world, but more properties of the AI itself (will this AI harm the end user? is the AI honest?). At that point evals have to encode assumptions about how agents store beliefs and turn beliefs into actions. We already see a very simple version of this in benchmarks like TruthfulQA, where the assumption is that models that internalise false beliefs will report those beliefs in response to certain leading questions.
Of course, these evaluations are extremely limited since they only measure very short term interactions and immediate responses. We're already seeing proposals for and early trials of more interactive, role-play or simulation style evals where models are tested in (still deeply unrealistic) but far more complex environments. At the limit this resembles forming a (Gooder regulator style) partial model of the agent itself from observations of agent actions, such that an agent taking certain actions in an evaluation reflects the presence of some internal undesirable property like "dishonesty" hidden in the weights.
This is a seriously difficult question to tackle, mostly because it has many of the same core issues as mechanistic interpretability. Does a model lying in an eval mean that they intend to deceive, or that they have simply understood that this is an eval where they "ought to" lie? How often should it lie before it is classified as "dishonest"? Model evaluations may tend towards the same issues academic exams and testing has often been criticised for, where standardised tests poorly gauge understanding and more in-depth oral exams or work tests are written off as hard-to-grade, subjective, or otherwise artificial. To be clear, these are not unresolvable issues. For example, if mech interp improves and offers clear indicators for dishonesty, this can be incorporated into eval harnesses. Formal verification of solution correctness can probably help with many programming tasks. However, I think this should be considered when deciding whether to invest time and resources into funding or conducting future evals research.