slavachalnev — LessWrong

Notes on Transformer Consciousness

Assuming transformers can have conscious experience, what would that experience be like? Transformers[1] are a structured grid of layers and token positions, and we can use this structure to reason about their internal experience. Epistemic status: very speculative. I've ordered this writeup approximately by how much I've thought it through...

Apr 29 · 36

Cycle-Consistent Activation Oracles

TL;DR: I train a model to translate LLM activations into natural language, using cycle consistency as a training signal (activation → description → reconstructed activation). The outputs are often plausible, but they are very lossy and are usually guesses about the context surrounding the activation, not good descriptions of the...

Mar 12 · 54

Sparse MLP Distillation

This is a research report about my attempt to extract interpretable features from a transformer MLP by distilling it into a larger student MLP, while encouraging sparsity by applying an L1 penalty to the activations, as depicted in Figure 1. I investigate the features learned by the distilled MLP, compare...

Jan 15, 2024 · 34
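
To make the Sparse MLP Distillation summary concrete, here is a minimal sketch of what a distillation objective with an L1 activation penalty could look like. It is an illustration under assumptions, not the report's actual code: the mean-squared-error reconstruction term, the ReLU nonlinearity, and the names `mlp_forward`, `distillation_loss`, and `l1_coeff` are all hypothetical choices that the excerpt does not specify.

```python
# Sketch only: a single-example distillation loss with an L1 sparsity penalty.
# Assumptions (not from the report): MSE reconstruction term, ReLU hidden
# activations, no biases, and the hypothetical coefficient name `l1_coeff`.

def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def mlp_forward(W_in, W_out, x):
    """Run a two-layer MLP and return (output, hidden activations)."""
    hidden = relu(matvec(W_in, x))
    return matvec(W_out, hidden), hidden


def distillation_loss(teacher_out, student_out, student_hidden, l1_coeff):
    """MSE between teacher and student outputs plus an L1 penalty on the
    student's hidden activations, which pushes them toward sparsity."""
    mse = sum((t - s) ** 2 for t, s in zip(teacher_out, student_out)) / len(teacher_out)
    l1 = sum(abs(h) for h in student_hidden)
    return mse + l1_coeff * l1


# Usage: a 2-unit teacher and a wider 3-unit student on one input vector.
x = [1.0, -2.0]
teacher_out, _ = mlp_forward([[1.0, 0.0], [0.0, 1.0]],
                             [[1.0, 0.0], [0.0, 1.0]], x)
student_out, student_hidden = mlp_forward(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], x)
loss = distillation_loss(teacher_out, student_out, student_hidden, l1_coeff=0.5)
```

In this toy case the student reproduces the teacher exactly, so the whole loss comes from the L1 term on the single active hidden unit. In practice the coefficient trades reconstruction accuracy against sparsity.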