Ulisse Mini

What rationality failure modes are there?

How do people fail to improve their rationality? How do they accidentally harm themselves in the process? I'm thinking of writing a post "How not to improve your rationality" or "A nuanced guide to reading the sequences" that preempts common mistakes, and I'd appreciate hearing people's experiences. Some examples: *...

Jan 19, 2024 · 42

What ML gears do you like?

In John's recent post, he mentions that many people in ML lack good gears-level models of what's going on. So: what gears-level models do you know for ML? How much support is there for them? Are there "settled science" models that have tons of empirical support? What...

Nov 11, 2023 · 25

Paper: Understanding and Controlling a Maze-Solving Policy Network

by TurnTrout, Ulisse Mini, peligrietzer, mrinank_sharma, Austin Meek, Monte M, and lisathiergart

Mrinank, Austin, and Alex wrote a paper on the results from "Understanding and controlling a maze-solving policy network", "Maze-solving agents: Add a top-right vector, make the agent go to the top-right", and "Behavioural statistics for a maze-solving agent".

> Abstract: To understand the goals and goal representations of AI systems,...

Oct 13, 2023 · 70

ActAdd: Steering Language Models without Optimization

by technicalities, TurnTrout, lisathiergart, David Udell, Ulisse Mini, and Monte M

We wrote up the GPT-2 steering vector work as a full paper, adding a few systematic tests. Recap: We've been looking into activation engineering: modifying the activations of a language model at inference time to predictably alter its behavior. Our method works by adding a bias to the forward pass,...

Sep 6, 2023 · 105

Open problems in activation engineering

by TurnTrout, woog, lisathiergart, Monte M, and Ulisse Mini

"Steering GPT-2-XL by adding an activation vector" introduced > activation engineering... techniques which steer models by modifying their activations. As a complement to prompt engineering and finetuning, activation engineering is a low-overhead way to steer models at runtime. These results were recently complemented by Inference-Time Intervention: Eliciting Truthful Answers from...

Jul 24, 2023 · 51

[ASoT] GPT2 Steering & The Tuned Lens

Warning: This post and most of the results were made under heavy time constraints and may be updated later. My intention is to quickly share partial work I'm not planning on continuing.

Introduction & Love - Hate example

For a primer on how the tuned lens works, see here. In short,...

Jul 1, 2023 · 23

LIMA: Less Is More for Alignment

Abstract

> Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. We measure the relative importance of these two stages by training LIMA,...

May 30, 2023 · 16

Steering GPT-2-XL by adding an activation vector

Understanding and controlling a maze-solving policy network

Predictions for shard theory mechanistic interpretability results

ActAdd: Steering Language Models without Optimization