Jessica Rumbelow — LessWrong

An Introduction to Exemplar Partitioning for Mechanistic Interpretability

Most of what we currently call "feature discovery" in language models is wrapped up in dictionary-learning methods like sparse autoencoders (SAEs) – which work, and which have been scaled to millions of features on frontier-scale models, but which bundle two distinct commitments into a single training objective: a reconstruction loss...

May 1669

Scientific Discovery in the Age of Artificial Intelligence

Cross-posted from the Leap Labs blog. For many people, including me, the real promise of AI is massively accelerated scientific discovery. Chatbots, vibe coding, video generation: these things are magical, but what I really want is superhuman medicine, radical life extension, humanity blossoming out into the universe. Understanding the universe....

Jun 29, 2025 · 42

Why did ChatGPT say that? Prompt engineering and more, with PIZZA.

All examples in this post can be found in this notebook, which is also probably the easiest way to start experimenting with PIZZA. From the research & engineering team at Leap Laboratories (incl. @Arush, @sebastian-sosa, @Robbie McCorkell), where we use AI interpretability to accelerate scientific discovery from data. What is...

Aug 3, 2024 · 43

Introducing Leap Labs, an AI interpretability startup

We are thrilled to introduce Leap Labs, an AI startup. We’re building a universal interpretability engine. We design robust interpretability methods with a model-agnostic mindset. These methods in concert form our end-to-end interpretability engine. This engine takes in a model, or ideally a model and its training dataset (or some...

Mar 6, 2023 · 104

SolidGoldMagikarp III: Glitch token archaeology

by mwatkins and Jessica Rumbelow

The set of anomalous tokens which we found in mid-January are now being described as 'glitch tokens' and 'aberrant tokens' in online discussion, as well as (perhaps more playfully) 'forbidden tokens', 'unspeakable tokens' and 'cursed tokens'. We've mostly just called them 'weird tokens'. GPT-3 speaks of 'the unspeakable one' when...

Feb 14, 2023 · 92

SolidGoldMagikarp II: technical details and more recent findings

by mwatkins and Jessica Rumbelow

tl;dr: This is a follow-up to our original post on prompt generation and the anomalous token phenomenon which emerged from that research. Work done by Jessica Rumbelow and Matthew Watkins in January 2023 at SERI-MATS. (Figure: part of a typical semantically coherent cluster we found in GPT2-small's embedding space.) Clustering: As...

Feb 6, 2023 · 114

SolidGoldMagikarp (plus, prompt generation)

UPDATE (14th Feb 2023): ChatGPT appears to have been patched! However, very strange behaviour can still be elicited in the OpenAI playground, particularly with the davinci-instruct model. More technical details here. Further (fun) investigation into the stories behind the tokens we found here. Work done at SERI-MATS, over the past...

Feb 5, 2023 · 679