LLM Mindreading

Jan 19, 2023 by David Udell

I think I now know something about language model mindcontrol. But I don't know how to understand a model's forward-pass cognition: the LLM mindcontrol I know about is "blind" to that cognition, and has to see mindcontrol outcomes to guess at internal structures.

Mindreading is the key missing piece in making that mindcontrol loop more adversarially robust.

Sparse Coding, for Mechanistic Interpretability and Activation Engineering
David Udell

Causal Graphs of GPT-2-Small's Residual Stream
David Udell

Explaining GPT-2-Small Forward Passes with Edge-Level Autoencoder Circuits
David Udell, hrdkbhatnagar, JacksonKaunismaa