Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
by Subhash Kantamneni, kitft, Euan Ong, and Sam Marks
Abstract: We introduce Natural Language Autoencoders (NLAs), an unsupervised method for generating natural language explanations of LLM activations. An NLA consists of two LLM modules: an activation verbalizer (AV) that maps an activation to a text description and an activation reconstructor (AR) that maps the description back to an...
May 7
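The two-module structure described in the abstract can be illustrated with a toy sketch. Nothing below comes from the authors' implementation: the class `NaturalLanguageAutoencoder`, the helpers `toy_verbalizer`, `toy_reconstructor`, and `mse`, and the description format are all hypothetical stand-ins, and the real AV and AR are LLM modules rather than hand-written functions. The sketch also assumes that the AR's output is compared against the original activation with a mean squared error, which the truncated abstract does not state.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Activation = Sequence[float]


@dataclass
class NaturalLanguageAutoencoder:
    """Hypothetical wrapper pairing a verbalizer (AV) with a reconstructor (AR)."""

    verbalizer: Callable[[Activation], str]      # AV: activation -> text description
    reconstructor: Callable[[str], List[float]]  # AR: text description -> vector

    def round_trip(self, activation: Activation) -> Tuple[str, List[float]]:
        # Encode the activation as text, then decode the text back to a vector.
        description = self.verbalizer(activation)
        return description, self.reconstructor(description)


def toy_verbalizer(activation: Activation) -> str:
    # Stand-in for an LLM: report only the largest-magnitude coordinate.
    idx = max(range(len(activation)), key=lambda i: abs(activation[i]))
    return f"dim={idx} value={activation[idx]:.2f} size={len(activation)}"


def toy_reconstructor(description: str) -> List[float]:
    # Stand-in for an LLM: rebuild a vector from the fields in the description.
    fields = dict(part.split("=") for part in description.split())
    vector = [0.0] * int(fields["size"])
    vector[int(fields["dim"])] = float(fields["value"])
    return vector


def mse(a: Activation, b: Activation) -> float:
    # Assumed reconstruction metric: mean squared error over coordinates.
    if len(a) != len(b):
        raise ValueError("activations must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


nla = NaturalLanguageAutoencoder(toy_verbalizer, toy_reconstructor)
activation = [0.1, -2.0, 0.3]
description, reconstruction = nla.round_trip(activation)
error = mse(activation, reconstruction)
print(description)
print(reconstruction)
print(error)
```

The point of the toy is only the bottleneck: everything the reconstructor knows about the activation has to pass through the text description, so a low reconstruction error suggests the description captured the information that matters.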