AISC project: TinyEvals

Jett Janiak

Apply to work on this project with me at AI Safety Camp 2024 before 1st December 2023.

The project is not set in stone, I am looking for feedback!

Summary

TinyStories is a suite of Small Language Models (SLMs) trained exclusively on children's stories generated by ChatGPT. The models use simple, yet coherent English, which far surpasses what was previously observed in other models of comparable size.

I hope that most of the capabilities of these models can be thoroughly understood using currently available interpretability techniques. Doing so would represent a major milestone in the development of mechanistic interpretability (mech interp).

The goal of this AISC project is to publish a paper that systematically identifies and characterises the range of capabilities exhibited by the TinyStories models. While in-depth analysis of the underlying circuits is outside the current scope, this project represents an important initial step in that direction.

Gaining a clear picture of the capabilities of these models will encourage the research community to subsequently build on these findings by analysing the responsible circuits. This will further the development of mech interp and provide insights into how language models work internally.

Motivation

My theory of change for mech interp

I am optimistic about RSPs and auditing; in short:
- limit a system’s training, deployment, etc. depending on its dangerous capabilities and provable alignment
- use behavioural evaluations to test for dangerous capabilities
- use understanding-based evaluations to test for alignment
We do not know how to conduct understanding-based evals yet
When we reach a certain level of dangerous capabilities, we will either
- continue and risk a catastrophe or
- stop and incur huge alignment tax
  - That is fine by me, but it makes regulation less likely to be implemented
Mech interp is a promising approach to understanding-based evals

Fully understanding a model vs reverse engineering specific capabilities

Currently researchers reverse engineer capabilities that are approachable or that they find interesting; some problems with this:
- "approachable" is missing hard problems by default (we do not even know what they are)
- "interesting" may not turn out to be most relevant
- It is almost universally done on narrow distributions, and does not provide a general understanding of the components involved
Instead we could
- Try to reverse engineer every capability the model is expressing on the training dataset
  - if we fail, we just identified a concrete open problem
- For each component, list all functions it is responsible for
- Identify the components with unknown functions
  - they can be directly relevant to safety
    - e.g. responsible for strategic deception
  - or highlight something about the architecture we do not understand yet
Eventually, we would gain a general understanding of every model component
I acknowledge this is very ambitious, and the proposed project is just a first step in this direction

Why TinyStories?

Models and the dataset are open source
The models are small, between 1 and 33 million parameters
“[...] focusing on mech interp work on small models is just fine, and extremely worth it for the much faster feedback loops” ~ Neel Nanda
The dataset is small and simple, comprising under 2GB of text and approximately 10,000 unique tokens. Moreover, every accurate next token prediction will be logically sound for humans; these are just stories for 3-4 year olds
The models exhibit capabilities beyond what humans can implement in code
“A lot of manual mechanistic interpretability work focuses primarily on scaling explanations to larger models, as opposed to more complex tasks or comprehensive explanations, which I think are more important.” ~ Lawrence Chan

Why just do evaluations for now?

Risk
- Interpretability is hard and open-ended
- We will be just a small team of junior researchers working part-time
Leverage
- By creating a lot of interpretability challenges, we can tap into the potential of numerous independent mech interp researchers
- I hope this will do for mech interp what OpenAI Gym did for reinforcement learning

Steps involved

Identify capabilities

Start with the least capable model.

1. Collect probabilities of correct next token predictions on the TinyStories validation dataset
2. Visualise samples where the model was correct and confident about the next token prediction
3. Identify the simplest and most common pattern or capability
4. Filter out the samples expressing the identified capability
5. Go back to step 2
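
The first steps of this loop could start from something like the minimal sketch below. It assumes the per-position logits have already been extracted from a model (for example with HuggingFace or TransformerLens) and are passed in as plain Python lists. The names `Prediction`, `score_predictions` and `confident_and_correct`, and the 0.9 threshold, are illustrative assumptions rather than settled parts of the project.

```python
import math
from dataclasses import dataclass


@dataclass
class Prediction:
    position: int        # index of the token being predicted
    correct_prob: float  # probability assigned to the true next token
    is_top1: bool        # True if the true token had the highest probability


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_predictions(logits_per_position, token_ids):
    """Score each next-token prediction against the token that actually follows.

    logits_per_position[i] are the logits produced after reading
    token_ids[: i + 1]; they are compared with token_ids[i + 1].
    The logits at the final position have no target and are skipped.
    """
    preds = []
    for i, logits in enumerate(logits_per_position[:-1]):
        target = token_ids[i + 1]
        probs = softmax(logits)
        top1 = max(range(len(probs)), key=probs.__getitem__)
        preds.append(Prediction(i + 1, probs[target], top1 == target))
    return preds


def confident_and_correct(preds, threshold=0.9):
    """Keep predictions that were both correct (top-1) and confident."""
    return [p for p in preds if p.is_top1 and p.correct_prob >= threshold]


# Toy example with a 3-token vocabulary (made-up numbers, not model output).
token_ids = [0, 2, 1]
logits = [[0.0, 0.0, 5.0], [0.0, 1.0, 0.5], [9.0, 9.0, 9.0]]
preds = score_predictions(logits, token_ids)
kept = confident_and_correct(preds)
```

In the toy example only the prediction at position 1 is kept: the prediction at position 2 is correct but receives a probability of roughly 0.5, so it falls below the threshold.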

Repeat for the next model, but filter out all the cases where both models were correct and confident, to highlight only new capabilities.
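
One possible way to express this comparison is sketched below. It assumes that each model has already produced, for the same token positions, the probability it assigned to the correct next token. The function name `new_capability_positions` and the 0.9 threshold are hypothetical. Because any token with probability above 0.5 is necessarily the top prediction, a threshold above 0.5 covers both "correct" and "confident".

```python
def new_capability_positions(small_probs, large_probs, threshold=0.9):
    """Return positions where the larger model is confident and correct
    but the smaller model is not.

    Both arguments are lists of probabilities assigned to the correct next
    token, aligned position by position. The threshold must exceed 0.5 so
    that passing it implies the token was also the top-1 prediction.
    """
    if len(small_probs) != len(large_probs):
        raise ValueError("probability lists must be aligned")
    if threshold <= 0.5:
        raise ValueError("threshold must be above 0.5")
    return [
        i
        for i, (small, large) in enumerate(zip(small_probs, large_probs))
        if large >= threshold and small < threshold
    ]


# Made-up probabilities for four positions.
small = [0.95, 0.20, 0.50, 0.97]
large = [0.99, 0.93, 0.60, 0.40]
candidates = new_capability_positions(small, large)
```

Here only position 1 is returned: position 0 is already handled by the smaller model, and the larger model is not confident at positions 2 and 3.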

Characterise capabilities

For each identified capability:

Try to red-team the capability
1. What do you think are the characteristics of the text that make the behaviour present?
2. Are there any such examples where the behaviour is not present?
3. Does it work on synthetic examples?
4. What can you change in the text and still see the same results?
Define it as a task, or a set of (prompt, correct_answer) pairs
Evaluate performance of each model on the task by measuring:
1. Probability of correct_answer
2. Rank of correct_answer
3. If there is one obvious wrong answer: Logit difference between correct_answer and wrong_answer
Summarise the results
1. What is the capability?
2. Why was it useful to learn?
3. Which models are able to perform the task?
4. Is the performance uniform across all examples? If not, what is different between them?
5. How could a transformer implement this capability?
6. Do all of the models perform equally well on the task? If not, why could that be?
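
The three measurements listed above can be computed from a single vector of next-token logits. The following is a minimal sketch under that assumption; `TaskMetrics` and `evaluate_answer` are hypothetical names, and obtaining the logits for a given prompt from a model is left out.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskMetrics:
    prob: float                  # probability of correct_answer
    rank: int                    # 1 means correct_answer was the top prediction
    logit_diff: Optional[float]  # correct minus wrong logit, if a wrong answer is given


def evaluate_answer(logits, correct_id, wrong_id=None):
    """Compute probability, rank and (optionally) logit difference.

    logits is a list of floats over the vocabulary for the position where
    the answer is expected; correct_id and wrong_id are token indices.
    """
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    prob = math.exp(logits[correct_id] - log_z)
    # Rank counts how many tokens have a strictly higher logit.
    rank = 1 + sum(1 for x in logits if x > logits[correct_id])
    logit_diff = None
    if wrong_id is not None:
        logit_diff = logits[correct_id] - logits[wrong_id]
    return TaskMetrics(prob, rank, logit_diff)


# Made-up logits over a 4-token vocabulary.
metrics = evaluate_answer([2.0, 1.0, 0.0, 3.0], correct_id=0, wrong_id=1)
```

Averaging these metrics over all (prompt, correct_answer) pairs of a task would give one row per model in a results table.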

Write the paper

The structure and the main message of the paper will depend on the findings. We should start writing in the second month of the project. That will help to consolidate our understanding and direct further research into the most promising directions.

Risks and downsides

There is a small risk that

- tools we develop to identify capabilities will end up supporting capabilities work
- further interpretability research on the capabilities we identify will motivate new insights

See “Should we publish mechanistic interpretability research?” We will seek senior researchers’ advice before sharing our work widely.

Acknowledgements

I would like to thank @Linda Linsefors, @Arthur Conmy, @Lucia Quirke, and @cmathw for feedback on this proposal. I would like to thank Lucia Quirke, Lovis Heindrich, and @RGRGRG for sharing their preliminary research on TinyStories.

Team

Team size: 3-5 people including myself, depending on their time commitment. The problem has a lot of surface area and people can easily work in parallel.

Research Lead
@Jett (feel free to DM me with any questions)

I participated in the MATS winter 2023 cohort, mech interp stream, under Neel Nanda’s mentorship. I co-authored

Projects 1 and 3 involved a lot of identification and characterisation of capabilities, similar to what I envision for this AISC project. In projects 2 and 3 I was acting as a research lead / mentor, and I received positive feedback. I commit to working on the project at least 10 hours per week.

Team Coordinator: I would prefer another team member to take on that role.

Skill requirements
Required:

Python: modules, defaultdicts, Counters, iterators, dataclasses, f-strings
Jupyter notebooks
Git: committing, branching, merging, resolving conflicts

Nice to have:

PyTorch
HuggingFace
TransformerLens
Plotly
Mech interp experience
Research experience

Nice to haves that I lack:

Technical writing
Web dev: HTML, CSS, JS

Appendix

Some capabilities observed in TinyStories 1M

N-grams
1. Once upon a time
2. From that day on
3. avocados
Repeated tokens
1. a big house with a lot of rooms. The house
2. there was a big garage [...] they needed more space in the garage
3. girl named Lily. [...] As they drove down the road, Lily
Repeated multi-token names (induction?)
1. a big, hairy rabbit named Bongo. Bongo
2. a little fish named Nemo. One day, Nemo
3. mouse named Timmy. He lived in a cozy hole in the wall of a big house. Timmy
Common phrases (skip trigrams?)
1. see something up close
Plural to singular with different tokenization
1. went to see the zebras, Lily saw a unique zebra
Understanding that a context was just provided, and it’s time for a story
1. Once upon a time, in a big forest, there lived a rhinoceros named Roxy. Roxy loved to climb. She climbed trees, rocks, and hills. One
2. Once upon a time, in a small yard, there was a small daisy. The daisy had a name. Her name was Daisy. Daisy was very small, but she was also very happy.\n
3. Once upon a time, there was a big, heavy alligator. He lived near a small pond. He was very hungry and wanted to eat something.\n\nOne
Knowing when to end a quote
1. Kitty smiled and replied, "Thank you, Spot. I polish it every day."
2. Billy saw that Roxy was sad and asked, "Why are you sad, Roxy?"
3. The cow said, "I am lonely. I want a friend."
Pronouns
1. Tim went to his
2. So, Mia and Tom played together. They
3. bought some light bulbs. When he came back, he put them
4. lemon on the ground. He wanted to play with it
Predicting related concepts: bookshelf [...] book, park [...] grass, forgive [...] happy, lunchtime [...] eat, zebra [...] stripes, tree [...] climb, nurse [...] bandage, monkey [...] jungle, shop [...] counter, laundry [...] clothes, inside [...] goodbye, octopus [...] ocean, lost [...] can't, road [...] driving, emergency [...] doctor, milk [...] spilled, hammer [...] screwdriver, Daddy [...] Mommy
Indirect Object Identification
1. Spot saw the shiny car and said, "Wow, Kitty, your car is so bright and clean!" Kitty smiled and replied, "Thank you, Spot
2. Buddy kicked the ball with his strong legs. The ball flew into the goal! Spot was so happy. He and Buddy
3. One sunny day, Amy went to the yard with her friend, Max. Max saw the purple swing and said, "Wow! I want to swing too!" Amy
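
Examples like the repeated-name cases in this appendix could be turned into synthetic (prompt, correct_answer) pairs for red-teaming. The sketch below is one hypothetical way to do that; the template, the word lists and the function name `make_repeated_name_task` are invented for illustration. It also assumes the answer token starts with a leading space, which depends on the tokenizer.

```python
from itertools import product

# Hypothetical template modelled on the repeated-name examples.
TEMPLATE = "Once upon a time, there was a little {animal} named {name}. One day,"


def make_repeated_name_task(animals, names):
    """Build (prompt, correct_answer) pairs where the name should be repeated."""
    return [
        (TEMPLATE.format(animal=animal, name=name), " " + name)
        for animal, name in product(animals, names)
    ]


pairs = make_repeated_name_task(["fish", "mouse"], ["Nemo", "Timmy"])
```

Swapping in names or animals that never occur in the training data would be one way to check whether the behaviour generalises beyond memorised stories.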
