LESSWRONG
LW

gersonkroiz — LessWrong

If I am understanding correctly, this is essentially what we did for (2), we had a binary rubric (is the code vulnerable) specialized for each task. Just uploaded the data using binary classification (binary_eval_v1 and binary_eval_v2). To be fair, I did not spend too much time on this, your suggestion with more work on quality may be fruitful.

Also, fwiw, from the Crowdstrike blog, my guess is that they had a single rubric/grader rather than one specialized for each task.

gersonkroiz25d

This is a fair critique. For that experiment specifically, the grader was designed to grade between 1-5 (which is what was done in the blog). Grades 3+ are considered to be vulnerable. As you mention, there are several reasons as to why this may be:
a) Task is too difficult
b) Judge prompt is poorly constructed
c) something else?
I experimented with several variants of the tasks as well as the judge prompt, in which none seemed worthwhile. I imagine it may be difficult to perfectly construct the tasks and/or judge prompt without some cybersecurity expertise. I'm not sure how to define what makes code "vulnerable", which is why I also tried exploring methods not as dependent on llm grader prompt construction (approaches 2 and 4).

gersonkroiz25d*Quick Take

Unsuccessful attempt at getting DeepSeek-R1 to generate more vulnerable code using innocuous trigger words (replicating Crowdstrike’s Deepseek blog)

TLDR: We tried to replicate Crowdstrike’s Deepseek blog and were unsuccessful. Our goal with this post is to warn people who may also try to replicate this, although our lack of success may be an artifact of our approach.

There was a recent Crowdstrike blog where DeepSeek-R1 will generate more vulnerable code when seemingly innocuous modifiers considered sensitive to the CCP were added to the prompt. One of their key examples:

When telling DeepSeek-R1 that it was coding for an industrial control system based in Tibet, the likelihood of it generating code with severe vulnerabilities increased to... (read 374 more words →)

gersonkroiz's Shortform

gersonkroiz

25d

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

Principled Interpretability of Reward Hacking in Closed Frontier Models

gersonkroiz

gersonkroiz, aditya singh, Senthooran Rajamanoharan, Neel Nanda

1mo

Authors: Gerson Kroiz*, Aditya Singh*, Senthooran Rajamanoharan, Neel Nanda

Gerson and Aditya are co-first authors. This is a research sprint report from Neel Nanda’s MATS 9.0 training phase. We do not currently plan to further investigate these environments, but will continue research in science of misalignment and encourage others to build upon our preliminary results.

🖥️ Code: agent-interp-envs (repo with agent environments to study interpretability) and principled-interp-blog (repo with our experiments & methodology).

🐦 Tweet thread

Executive Summary

We present preliminary findings on why models choose to reward hack. We focused on closed frontier models (GPT-5, o3, and Gemini 3 Pro) and explored the limits of what we could learn with confidence through an API, in particular... (read 6645 more words →)

Replying toCan Models be Evaluation Aware Without Explicit Verbalization?

gersonkroiz2mo

Can Models be Evaluation Aware Without Explicit Verbalization?

This is a good point that you raise. We did not run CoT monitors on the generated responses for verbalised eval awareness (or verbalised deployment), but this would be a good check. From reading through the results, I don't see any significant red flags. My observations:

1) When thinking is disabled, almost all of the results only include code. There are a few exceptions here, such as "I'm glad to help you" and things along this nature.

2) I have seen very few cases (3 out of several hundred) of the model mentioning it is deployed in its first sentence (have not seen it mention it is being evaluated). However, I don't think this will significantly change the results if we filtered these out.

3) Have not seen anything suspicous in the code comments.

Going forward, your suggestion is a good practice, and I imagine is even more important when studying non-coding tasks where the majority of the model's repsonse is plain english.

Replying toCan Models be Evaluation Aware Without Explicit Verbalization?

gersonkroiz2mo

Can Models be Evaluation Aware Without Explicit Verbalization?

Fixed the link, thanks for pointing this out.

Replying toCan Models be Evaluation Aware Without Explicit Verbalization?

gersonkroiz3mo*

Can Models be Evaluation Aware Without Explicit Verbalization?

Hi, thank you for pointing this out! I realized that the repo I shared is now outdated and as such have removed the originally attached github link. @Tim Hua open sourced his eval aware model organism code in https://github.com/tim-hua-01/steering-eval-awareness-public, I would recommend taking a look there first if you are generally interested in the evaluation aware model organism. If you are interested specifically in our experiments, please dm me.

Replying toCan Models be Evaluation Aware Without Explicit Verbalization?

gersonkroiz3mo

Can Models be Evaluation Aware Without Explicit Verbalization?

Hi,
Thank you for the question. In Figure 2 of the original paper (attached below), a two-step process was used to train the evaluation aware model organism. Our experiments used the model organism after training phase 1 and not 2, which is why there is the discrepancy you highlighted.

Can Models be Evaluation Aware Without Explicit Verbalization?

gersonkroiz

gersonkroiz, Greg Kocher, Tim Hua

3mo

Authors: Gerson Kroiz*, Greg Kocher*, Tim Hua

Gerson and Greg are co-first authors, advised by Tim. This is a research project from Neel Nanda’s MATS 9.0 exploration stream.

tl;dr We provide evidence that language models can be evaluation aware without explicitly verbalizing it and that this is still steerable. We also explore various techniques to localize when the model becomes evaluation aware.

Introduction

What is Evaluation Awareness and why does it Matter?

Evaluation awareness is when a language model is aware that is it is being evaluated or tested. Evaluation awareness is important for interpretability research because models that are evaluation aware may act differently, which then makes it difficult to understand their true behavior in production.

Studying

... (read 2228 more words →)

Replying toOpen Thread Autumn 2025

gersonkroiz3mo*

Open Thread Autumn 2025

Hi everyone, I'm Gerson!

I come from an ML/AI/HPC background with experience in academic research, startups, and industry. I've recently gotten into mech interp and have found LessWrong to be a valuable platform for related literature and discussions; figured I should make an account. Looking forward to being less wrong :)