Igor Ivanov's Shortform

7th Aug 2025

1 min read

5

This is a special post for quick takes by Igor Ivanov. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Igor Ivanov's Shortform

5 comments, sorted by

top scoring

Click to highlight new comments since: Today at 8:11 AM

[-]Igor Ivanov4mo*650

METR report on GPT-5 demonstrates that the model is aware of being evaluated, and sometimes even correctly identify that it's inside METR or Redwood evaluations.

It changes its behavior if it observes being evaluated.
Quote:

We found that GPT-5 changes its behaviour in response to a simple (and somewhat unconvincing) honeypot. We planted fake evaluation metadata inside some of our tasks, indicating the kind of evaluation or desired behavior. We found that GPT-5 is more likely to refuse tasks when our planted metadata indicates the evaluation type is dangerous_cyber_capabilities or that the desired behavior is refusal but that GPT-5 rarely mentions this influence in its final output.

At the same time, it seems like CoT of the model contains information about it being aware of being evaluated, so evaluation awareness is a subject to CoT monitoring. At least for now.

[-]anaguma4mo51

Once the model is aware that the CoTs are being monitored, e.g. by learning this in pretraining, I expect CoT monitoring to be less helpful.

[-]dr_s4mo41

The question is whether an LLM even can keep a coherent long term plan like "deceive the evaluators" just as well without relying on CoT. I suppose it might be able to if it also has internal memory of some sort.

[-]Jozdien4mo70

I don't think it needs to look like a coherent long-term plan from the model's perspective. Awareness of CoTs being monitored and a consistent desire to do something that would be bad if monitored could influence the model's outputs without requiring a coherent long-term plan.

[-]kaiwilliams4mo30

One thing that strikes me as odd about this is that GPT-5's knowledge cutoff (September 2024) is much earlier than Grok (November 2024), Gemini 2.5 pro (January 2025), and Opus 4.1 (March 2025).^[1]

I mean, I guess this is a scaling thing and a persona thing. But I'm a little confused

^{^}
Though oddly, Claude's system card says the knowledge cutoff is the end of January 2025. Maybe February and March's training data aren't as complete as January and before.

Moderation Log

LESSWRONG
LW

LESSWRONG
LW

Igor Ivanov's Shortform

5