Maheep Chaudhary — LessWrong

Awareness Jailbreaking: Revealing True Alignment in Evaluation-Aware Models

This is a draft proposal. I'm planning to invest significant time here and would appreciate feedback, especially on methodology gaps, threat model assumptions, or prior work I've missed.

In quantum mechanics, Heisenberg's uncertainty principle revealed a fundamental limit: observation disturbs the observed. To measure a particle's position precisely, you must...

Dec 29, 2025•11

Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

Authors: Maheep Chaudhary, Ian Su, Nikhil Hooda, Nishith Shankar, Julian Tan, Kevin Zhu, Ryan Laggasse, Vasu Sharma, Ashwinee Panda

Figure: Scatter plot with a smoothed trend line showing AUROC absolute distance from 0.5 as a function of model size (billions of parameters, log scale). Each point shows the best-performing probe...

Dec 19, 2025•23