How do people fail to improve their rationality? How do they accidentally harm themselves in the process? I'm thinking of writing a post "How not to improve your rationality" or "A nuanced guide to reading the sequences" that preempts common mistakes, and I'd appreciate hearing people's experiences. Some examples: *...
In John's recent post he mentions that many people in ML don't have good gears-level models of what's going on. To wit: what gears-level models do you know of for ML? How much support is there for them? Are there "settled science" kinds of models with tons of empirical support? What...
Mrinank, Austin, and Alex wrote a paper on the results from Understanding and controlling a maze-solving policy network, Maze-solving agents: Add a top-right vector, make the agent go to the top-right, and Behavioural statistics for a maze-solving agent. > Abstract: To understand the goals and goal representations of AI systems,...
We wrote up the GPT-2 steering vector work as a full paper, adding a few systematic tests. Recap: We've been looking into activation engineering: modifying the activations of a language model at inference time to predictably alter its behavior. Our method works by adding a bias to the forward pass,...
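The activation-addition mechanism can be sketched with a forward hook. This is a minimal illustration, not the paper's setup: the tiny MLP and the steering vector values below are hypothetical stand-ins, whereas the actual work adds a steering vector to the residual stream of GPT-2-XL at a chosen layer.

```python
import torch
import torch.nn as nn

# Toy stand-in model; in the paper the same idea is applied to a
# transformer's residual stream, not a small MLP like this one.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))

# Hypothetical steering vector, same width as the hooked layer's output.
steering_vector = torch.tensor([10.0, -10.0, 5.0, 0.0])

def add_steering(module, inputs, output):
    # Add the steering vector to this layer's activations at inference time.
    return output + steering_vector

x = torch.randn(1, 4)
baseline = model(x)

# Register the hook, run the steered forward pass, then remove the hook.
handle = model[0].register_forward_hook(add_steering)
steered = model(x)
handle.remove()

print(not torch.allclose(baseline, steered))  # the intervention changes the output
```

In the real setting the vector is typically derived from the difference of activations on two prompts (e.g. "Love" minus "Hate") rather than written by hand as above.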
Steering GPT-2-XL by adding an activation vector introduced > activation engineering... techniques which steer models by modifying their activations. As a complement to prompt engineering and finetuning, activation engineering is a low-overhead way to steer models at runtime. These results were recently complemented by Inference-Time Intervention: Eliciting Truthful Answers from...
Warning: This post and most of its results were produced under heavy time constraints and may be updated later. My intention is to quickly share partial work that I'm not planning to continue. Introduction & Love - Hate example: For a primer on how the tuned lens works, see here. In short,...
Abstract > Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. We measure the relative importance of these two stages by training LIMA,...