Maheep Chaudhary

Would your AI travel agent book a bullfight? Testing whether agents consider animal welfare without being prompted

by jonahmattwoodward, Jasmine Brazilek, Maheep Chaudhary, Oliver Tullio, Joel Christoph, and MilesTS

This article reflects new updates to the accompanying paper: arxiv.org/abs/2606.18142. Benchmark: now included in the UK AI Security Institute's Inspect Evals. Leaderboard: compassionbench.com/tac. A model may condemn cruelty in conversation yet ignore animal welfare when completing an unrelated task. Stated concerns matter little if they do not affect decisions. We...

Jul 17•14

Awareness Jailbreaking: Revealing True Alignment in Evaluation-Aware Models

This is a draft proposal. I'm planning to invest significant time here and would appreciate feedback, especially on methodology gaps, threat model assumptions, or prior work I've missed. In quantum mechanics, Heisenberg's uncertainty principle revealed a fundamental limit: observation disturbs the observed. To measure a particle's position precisely, you must...

Dec 29, 2025•11

Evaluation Awareness Scales Predictably in Open-Weights Large Language Models

Authors: Maheep Chaudhary, Ian Su, Nikhil Hooda, Nishith Shankar, Julian Tan, Kevin Zhu, Ryan Laggasse, Vasu Sharma, Ashwinee Panda

Figure: Scatter plot with a smoothed trend line that shows AUROC absolute distance from 0.5 as a function of model size (billions of parameters, log scale). Each point shows the best-performing probe...

Dec 19, 2025•23