Frederik Hytting Jørgensen — LessWrong




Evaluating Offline Monitoring of Internal AI Agents

This work was conducted during the GovAI Winter Fellowship 2026. Full report

Executive Summary

* Frontier AI companies use offline monitoring to address risks from internally deployed AI agents. AI developers increasingly rely on AI agents for internal work, including for safety research and model training. At the same time,...

Small foundational puzzle for causal theories of mechanistic interpretability

In this post I want to highlight a small puzzle for causal theories of mechanistic interpretability. It purports to show that causal abstractions do not generally correctly capture the mechanistic nature of models. Consider the following causal model M: Assume for the sake of argument that we only consider two...

Jul 5, 2025•6