aditya singh

Principled Interpretability of Reward Hacking in Closed Frontier Models

by gersonkroiz, aditya singh, Senthooran Rajamanoharan, and Neel Nanda

Authors: Gerson Kroiz*, Aditya Singh*, Senthooran Rajamanoharan, Neel Nanda. Gerson and Aditya are co-first authors. This is a research sprint report from Neel Nanda’s MATS 9.0 training phase. We do not currently plan to further investigate these environments, but will continue research in science of misalignment and encourage others to...

Jan 125

Why Did My Model Do That? Model Forensics for Diagnosing LLM Misbehavior

How to Design Environments for Understanding Model Motives

The Case for Model Forensics

Principled Interpretability of Reward Hacking in Closed Frontier Models
