x

LESSWRONG

LW

Dillon Plunkett — LessWrong

Dillon Plunkett

Dillon Plunkett

Message

57

1

3y

Dillon Plunkett

57

3y

Tests of LLM introspection need to rule out causal bypassing

by Adam Morris and Dillon Plunkett

This point has been floating around implicitly in various papers (e.g., Betley et al., Plunkett et al., Lindsey), but we haven’t seen it named explicitly. We think it’s important, so we’re describing it here. There’s been growing interest in testing whether LLMs can introspect on their internal states or processes....

Nov 28, 2025•51

Self-interpretability: LLMs can describe complex internal processes that drive their decisions

by Adam Morris and Dillon Plunkett

This post is a summary of our paper from earlier this year: Plunkett, Morris, Reddy, & Morales (2025). Adam received an ACX grant to continue this work, and is interested in finding more potential collaborators---if you're excited by this work, reach out! It would be useful for safety purposes if...

Nov 14, 2025•12