2049

LESSWRONG
LW

2048

Igor Ivanov's Shortform

by Igor Ivanov
7th Aug 2025
1 min read
5

5

This is a special post for quick takes by Igor Ivanov. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
Igor Ivanov's Shortform
65Igor Ivanov
5anaguma
4dr_s
7Jozdien
3kaiwilliams
5 comments, sorted by
top scoring
Click to highlight new comments since: Today at 5:51 AM
[-]Igor Ivanov2mo*650

METR report on GPT-5 demonstrates that the model is aware of being evaluated, and sometimes even correctly identify that it's inside METR or Redwood evaluations. 

It changes its behavior if it observes being evaluated. 
Quote:

We found that GPT-5 changes its behaviour in response to a simple (and somewhat unconvincing) honeypot. We planted fake evaluation metadata inside some of our tasks, indicating the kind of evaluation or desired behavior. We found that GPT-5 is more likely to refuse tasks when our planted metadata indicates the evaluation type is dangerous_cyber_capabilities or that the desired behavior is refusal but that GPT-5 rarely mentions this influence in its final output.

At the same time, it seems like CoT of the model contains information about it being aware of being evaluated, so evaluation awareness is a subject to CoT monitoring. At least for now.

Reply
[-]anaguma2mo51

Once the model is aware that the CoTs are being monitored, e.g. by learning this in pretraining, I expect CoT monitoring to be less helpful.

Reply
[-]dr_s2mo41

The question is whether an LLM even can keep a coherent long term plan like "deceive the evaluators" just as well without relying on CoT. I suppose it might be able to if it also has internal memory of some sort.

Reply
[-]Jozdien2mo70

I don't think it needs to look like a coherent long-term plan from the model's perspective. Awareness of CoTs being monitored and a consistent desire to do something that would be bad if monitored could influence the model's outputs without requiring a coherent long-term plan. 

Reply
[-]kaiwilliams2mo30

One thing that strikes me as odd about this is that GPT-5's knowledge cutoff (September 2024) is much earlier than Grok (November 2024), Gemini 2.5 pro (January 2025), and Opus 4.1 (March 2025).[1]

I mean, I guess this is a scaling thing and a persona thing. But I'm a little confused

  1. ^

    Though oddly, Claude's system card says the knowledge cutoff is the end of January 2025. Maybe February and March's training data aren't as complete as January and before.

Reply
Moderation Log
More from Igor Ivanov
View more
Curated and popular this week
5Comments