LESSWRONG

kitft — LessWrong

199

24d

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

by Subhash Kantamneni, kitft, Euan Ong, and Sam Marks

Abstract: We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text description and an activation reconstructor (AR) that maps the description back to an...