Daniel Tan

AI alignment researcher. Interested in understanding reasoning in language models.

https://dtch1997.github.io/

Wikitag Contributions

Comments

Sorted by

Large latent reasoning models may be here in the next year

  • By default latent reasoning already exists in some degree (superhuman latent knowledge)  
  • There is also an increasing amount of work on intentionally making reasoning latent: explicit to implicit CoT, byte latent transformer, coconut
  • The latest of these (huginn) introduces recurrent latent reasoning, showing signs of life with (possibly unbounded) amounts of compute in the forward pass. Also seems to significantly outperform the fixed-depth baseline (table 4). 

Imagine a language model that can do a possibly unbounded amount of internal computation in order to compute its answer. Seems like interpretability will be very difficult. This is worrying because externalised reasoning seems upstream of many other agendas

How can we study these models? 

  • A good proxy right now may be language models provided with hidden scratchpads.
  • Other kinds of model organism seem really important
  • If black box techniques don't work well we might need to hail mary on mech interp. 

Hmm I don't think there are people I can single out from my following list that have high individual impact. IMO it's more that the algorithm has picked up on the my trend of engagement and now gives me great discovery. 

For someone else to bootstrap this process and give maximum signal to the algorithm, the best thing to do might just be to follow a bunch of AI safety people who: 

  • post frequently
  • post primarily about AI safety
  • have reasonably good takes

Some specific people that might be useful: 

  • Neel Nanda (posts about way more than mech interp)
  • Dylan Hadfield-Menell
  • David Duvenaud
  • Stephen Casper
  • Harlan Stewart (nontechnical)
  • Rocket Drew (nontechnical) 

I also follow several people who signal-boost general AI stuff. 

  • Scaling lab leaders (Jan Leike, Sam A, dario)
  • Scaling lab engineers (roon, Aidan McLaughlin, Jason Wei)
  • Huggingface team leads (Philip Schmidt, Sebastian Raschka)
  • Twitter influencers (Teortaxes, janus, near)

IMO the best part of breadth is having an interesting question to ask. LLMs can mostly do the rest

What is prosaic interpretability? I've previously alluded to this but not given a formal definition. In this note I'll lay out some quick thoughts. 

Prosaic Interpretability is empirical science

The broadest possible definition of "prosaic" interpretability is simply 'discovering true things about language models, using experimental techniques'. 

A pretty good way to do this is to loop the following actions. 

Hypothesis generation is about connecting the dots. 

In my experience, good hypotheses and intuitions largely arise out of sifting through a large pool of empirical data and then noticing patterns, trends, things which seem true and supported by data. Like drawing constellations between the stars. 

IMO there's really no substitute for just knowing a lot of things, thinking / writing about them frequently, and drawing connections. But going over a large pool is a lot of work. It's important to be smart about this. 

Be picky. Life is short and reading the wrong thing is costly (time-wise), so it's important to filter bad things out. I used to trawl Arxiv for daily updates. I've stopped doing this, since >90% of things are ~useless. Nowadays I am informed about papers by Twitter threads, Slack channels, and going to talks / reading groups. All these are filters for true signal amidst the sea of noise. 

Distill. I think >90% of empirical work can be summarised down to a "key idea". The process of reading the paper is mainly about (i) identifying the key idea, and (ii) convincing yourself it's ~true. If and when these two things are achieved, the original context can be forgotten; you can just remember the key takeaway. Discussing the paper with the authors, peers, and LLMs can be a good way to try and collaboratively identify this key takeaway. 

Hypothesis testing is about causal interventions. 

In order to test hypotheses, it's important to do a causal interventions and study the resulting changes. Some examples are: 

  • Change the training dataset / objective (model organisms)
  • Change the test prompt used (jailbreaking)
  • Change the model's forward pass (pruning, steering, activation patching)
  • Change the training compute (longitudinal study)

In all cases you usually want to have sample size > 1. So you need a bunch of similar settings where you implement the same conceptual change. 

  • Model organisms: Many semantically similar training examples, alter all of them in the same way (e.g. adding a backdoor)
  • Jailbreaking: Many semantically similar prompts, alter all of them in the same way (e.g. by adding an adversarial suffix)
  • etc. 

Acausal analyses. It's also possible to do other things, e.g. non-causal analyse. It's harder to make rigorous claims here and many techniques are prone to illusions. Nonetheless these can be useful for building intuition 

  • Attribute behaviour to weights, activations (circuit analysis, SAE decomposition)
  • Attribute behaviour to training data (influence functions)

Conclusion

You may have noticed that prosaic interpretability, as defined here, is very broad. I think this sort of breadth is necessary for having many reference points by which to evaluate new ideas or interpret new findings, c.f. developing better research taste. 

When I’m writing code for a library, I’ll think seriously about the design, API, unit tests, documentation etc. AI helps me implement those. 

When I’m writing code for an experiment I let AI take the wheel. Explain the idea, tell it rough vibes of what I want and let it do whatever. Dump stack traces and error logs in and let it fix. Say “make it better”. This is just extremely powerful and I think I’m never going back 

r1’s reasoning feels conversational. Messy, high error rate, often needs to backtrack. Stream of thought consciousness rambling. 

Other models’ reasoning feels like writing. Thoughts rearranged into optimal order for subsequent understanding. 

In some sense you expect that doing SFT or RLHF with a bunch of high quality writing makes models do the latter and not the former. 

Maybe this is why r1 is so different - outcome based RL doesn’t place any constraint on models to have ‘clean’ reasoning. 

I'm imagining it's something encoded in M1's weights. But as a cheap test you could add in latent knowledge via the system prompt and then see if finetuning M2 on M1's generations results in M2 having the latent knowledge

Finetuning could be an avenue for transmitting latent knowledge between models.  

As AI-generated text increasingly makes its way onto the Internet, it seems likely that we'll finetune AI on text generated by other AI. If this text contains opaque meaning - e.g. due to steganography or latent knowledge - then finetuning could be a way in which latent knowledge propagates between different models. 

How I currently use different AI

  • Claude 3.5 sonnet: Default workhorse, thinking assistant, Cursor assistant, therapy
  • Deep Research: Doing comprehensive lit reviews
  • Otter.ai: Transcribing calls / chats 

Stuff I've considered using but haven't, possibly due to lack of imagination: 

  • Operator - uncertain, does this actually save time on anything?
  • Notion AI search - seems useful for aggregating context
  • Perplexity - is this better than Deep Research for lit reviews?
  • Grok - what do people use this for? 

Yeah, I don’t think this phenomenon requires any deliberate strategic intent to deceive / collude. It’s just borne of having a subtle preference for how things should be said. As you say, humans probably also have these preferences

Load More