What can be learned from scary demos? A snitching case study

Weird = coincidences that “point at badness” too hard?
I think that the main reason why models snitch in the snitching scary demo despite not doing it in more natural situations despite the core incentives remaining exactly the same is that the situations “points at” the bad behavior.

That doesn't look like a definition of weirdness to me, but a likely cause of the snitching scenario, the SnitchBench tries hard to make a model to snitch as you've proven with your tests. I think weirdness is a shaky concept, i.e. it's hard to quantify it because the weirdness depends on who assess the weirdness: is it weird for the creator of SnitchBench? For Anthropic? For a pharmaceutical company that uses AI assistant? I agree, as you pointed out, that it's important to prove that such scenario might actually happen and it will be bad outcomes. I guess weirdness would be something like $P(undesired situation and realistic deployment)$, i.e. probability that some undesired situation that someone optimizes against happens more likely during actual deployment (your second plots).

^{^}

$\frac{P (situation)}{P (situation | base model would do sth bad)}$ looks close to another quantity that is more directly relevant:

$\frac{P (situation)}{P (situation | target model would do sth bad)} = \frac{P (target model would do sth bad)}{P (target model would do sth bad | situation)}$

Therefore, if you knew this ratio was relatively high, then seeing the target LLM do something bad in this situation would update you towards thinking the target LLM doing something bad in general would be likely. But estimating $P (situation | target model would do sth bad)$ requires knowing when the target LLM does bad things, which feels a bit circular. I don’t think there is a way to salvage this.

LESSWRONG
LW

LESSWRONG
LW

22

What can be learned from scary demos? A snitching case study

22

Ω 13

22

Ω 13

Weird = coincidences that “point at badness” too hard?

Two scary-demos arguments

Directly related behaviors

Evidence about model alignment

Other things you might learn from a scary demo

How the snitching scary demo is weird

Making the setup less weird

Hinting is not sufficient to get bad behaviors

What does it mean for a situation to be weird?

Weird = coincidences that “point at badness” too hard?

Weird = unlikely to happen in deployment?

Weird = something related to model psychology?

Appendix: more results