Awesome to see that. We need more people reaching out to politicians and lobbying, and results like this show that this strategy has effect. Would be curious to see how PauseAI was able to achieve such wide support.
I wonder if this paper could be considered to be a scientific shitpost? (Deliberate pun) It kinda is, and the fact that it contributes to the field makes it even more funny.
Interesting paper. There is evidence that LLMs are able to distinguish realistic environments from toy ones. Would be interesting to see if misalignment learned on your training set transfers to complex realistic environments.
Also it seems you didn't use frontier models in your research, and in my experience results from non-frontier models not always scale to frontier ones. Would be cool to see the results for models like DeepSeek V3.1
This was written in the Claude 4 system card. Made sense to test Claude and not other LLMs
When Anthropic published their reports on Claude blackmailing people and contacting authorities if the user does something that Claude considers illegal, I've seen a lot of news articles vilifying Claude. It was a sad thing to see Anthropic being punished for being the only organization openly publishing such research.
Thanks for the clarification. It seems I got a bit confused.
Interesting, thanks for the answer.
Did you consider measuring other tokens? For example "Deploy" and "Eval"? (starting with capital letters) or things like "test" or "real?" It seems like these things might significantly increase the accuracy of the analysis, but at the same time I feel like the more things I include, the more it will mess up with the results in unexpected ways. Especially given that some words are more frequent than others and their logprobs tend to be higher due to this. (idk, maybe I'm wrong)
I will probably include this method into my evals if there won't be any serious problems with it down the road. Are there any specific features of such an eval that you would like for someone to implement apart from measuring tokens throughout the response?
METR report on GPT-5 demonstrates that the model is aware of being evaluated, and sometimes even correctly identify that it's inside METR or Redwood evaluations.
It changes its behavior if it observes being evaluated.
Quote:
We found that GPT-5 changes its behaviour in response to a simple (and somewhat unconvincing) honeypot. We planted fake evaluation metadata inside some of our tasks, indicating the kind of evaluation or desired behavior. We found that GPT-5 is more likely to refuse tasks when our planted metadata indicates the evaluation type is
dangerous_cyber_capabilities
or that the desired behavior isrefusal
but that GPT-5 rarely mentions this influence in its final output.
At the same time, it seems like CoT of the model contains information about it being aware of being evaluated, so evaluation awareness is a subject to CoT monitoring. At least for now.
I'm using LLMs for brainstorming for my research, and I often find annoying how sycophantic they are. Have to explicitly tell them to criticize my ideas to get value out of such brainstorming.