Member of technical staff at METR.
Previously: Vivek Hebbar's team at MIRI → Adrià Garriga-Alonso on various empirical alignment projects → METR.
I have signed no contracts or agreements whose existence I cannot mention.
Agreed. I originally wanted to make a list of all the things that would be wildly unpopular but that toxoplasma of rage could theoretically make happen. But then it seemed more interesting to list things that could have more government or geopolitical relevance.
US Government dysfunction and runaway political polarization bingo card. I don't expect any particular one of these to happen but it seems plausible that at least one of these will happen.
I do, though maybe not this extreme. Roughly every other day I bemoan the fact that AIs aren't misaligned yet (limiting the excitingness of my current research) and might not even be misaligned in future, before reminding myself our world is much better to live in than the alternative. I think there's not much else to do with a similar impact given how large even a 1% p(doom) reduction is. But I also believe that particularly good research now can trade 1:1 with crunch time.
Theoretical work is just another step removed from the problem and should be viewed with at least as much suspicion.
The Interwebs seem to indicate that that's only if you give it a laser spot to aim at, not with just GPS.
Good catch.
Agreed that grenade-sized munitions won't damage buildings. I think the conversation is drifting between FPVs and other kinds of drones, and also between various settings, so I'll just state my beliefs.
Nor would I. In WWII bombers didn't even know where they were, but we have GPS now such that Excalibur guided artillery shells can get 1m CEP. And the US and possibly China can use Starlink and other constellations for localization even when GPS is jammed. I would guess 20m is easily doable with good sensing and dumb bombs, which would at least hit a building.
IMO it is too soon to tell whether drone defense will hold up to counter-countermeasures.
I agree that Israel will probably be less affected than larger, poorer countries, but given that drones have probably killed over 200,000 people in Ukraine even a small percentage of this would be a problem for Israel.
My views which I have already shared in person:
I don't understand how the issue of validity in evals connects to the Gooder Regulator Theorem. It's certainly arbitrarily hard to measure a latent property of an AI agent that's buried in an arbitrarily complex causal diagram of the agent, but this is also true of any kind of science. The solution is to only measure simple things and validate them extensively.
To start, the playbook includes methods from the social sciences. Are humans lying in a psychology experiment intending to deceive, or just socially conditioned into deception in these circumstances? We can discriminate using more experiments. Here's a list suggested by Claude.
Internal Validity:
- Manipulation checks: Verify participants understand the task and setting as intended
- Control conditions: Include baseline conditions where lying provides no benefit
- Multiple measures: Use both behavioral indicators (actual lies told) and physiological measures (stress responses, reaction times)
- Debrief interviews: Post-experiment discussions to understand participant reasoning
Construct Validity:
- Operationalize "lying" clearly: Distinguish between errors, omissions, and active false statements
- Measure awareness: Test whether participants recognize they're providing false information
- Cross-validate: Use multiple paradigms that elicit deception differently
External Validity:
- Vary contexts: Test across different social settings and stakes
- Diverse samples: Include participants from various cultural backgrounds where social norms differ
- Ecological validity: Use scenarios that mirror real-world situations
Intentional deception should show:
- Individual differences in moral reasoning correlation
- Sensitivity to personal cost-benefit analysis
- Greater cognitive effort signatures
- Ability to inhibit when explicitly instructed
- Conscious awareness and ability to justify
Socially conditioned behavior should show:
- Stronger situation-specific activation
- Less individual variation within similar social groups
- Faster, more automatic responses
- Resistance to conscious control
- Difficulty articulating clear reasons
Likewise, in evals we can do increasingly elaborate robustness checks: change the prompt and setting and measure whether the correlation stays high, check that the model understands its behavior is deceptive, provide incentives to the model that should make its behavior change if it were intentionally deceptive, and so on. Much of the time, especially for propensity / alignment evals, we will find the evals are nonrobust and we'll be forced to weaken our claims.
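As a rough illustration of the first check (vary the prompt and setting, then see whether the measured behavior still correlates), here is a minimal sketch. Everything in it is hypothetical: the variant names, the per-task scores, and the 0.8 threshold are made up for illustration, not taken from any real eval.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists.

    Assumes neither list is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def min_pairwise_correlation(scores_by_variant):
    """Lowest correlation between any two variants of the same eval."""
    return min(
        pearson(a, b) for a, b in combinations(scores_by_variant.values(), 2)
    )


# Hypothetical per-task "deceptive behavior" rates under three variants
# of the same eval. These numbers are invented for the example.
scores = {
    "original": [0.9, 0.1, 0.8, 0.2],
    "paraphrased": [0.8, 0.2, 0.7, 0.1],
    "new_setting": [0.2, 0.7, 0.3, 0.9],
}

# Hypothetical threshold; a real analysis would need to justify this choice.
ROBUSTNESS_THRESHOLD = 0.8
worst = min_pairwise_correlation(scores)
is_robust = worst >= ROBUSTNESS_THRESHOLD
```

In this made-up data the paraphrased prompt tracks the original closely, but the new setting reverses the pattern, so the check fails. That is the kind of result that would force a weaker claim.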
Sometimes, we have an easier job. I don't expect capability evals to get vastly harder because many skills are verifiable and sandbagging is solvable. Also, AI evals are easier than human psychology experiments, because they're likely to be much cheaper and faster to run and more replicable.
According to Wikipedia it seems to have worked well and not been expensive.
Poland began work on the 5.5-meter (18 foot) high steel wall topped with barbed wire at a cost of around 1.6 billion zł (US$407m) [...] in the late summer of 2021. The barrier was completed on 30 June 2022.[3] An electronic barrier [...] was added to the fence between November 2022 and early summer 2023 at a cost of EUR 71.8 million.[4]
[...] official border crossings with Belarus remained open, and the asylum process continued to function [...]
Since the fence was built, illegal crossings have reduced to a trickle; however, between August 2021 and February 2023, 37 bodies were found on both sides of the border; people have died mainly from hypothermia or drowning.[11]
The Greenberg article also suggests a reasonable tradeoff is being made in policy:
Despite these fears, Duszczyk is convinced his approach is working. In a two-month period after the asylum suspension, illegal crossings from Belarus fell by 48% compared to the same period in 2024. At the same time, in all of 2024, there was one death—out of 30,000 attempted crossings—in Polish territory. There have been none so far in 2025. Duszczyk feels his humanitarian floor is holding.
The idea is they're printing money, not just borrowing it, which in the extreme would cause hyperinflation (and is equivalent to default since debt is in nominal dollars). It probably seems less bad though.