AI alignment researcher. Interested in understanding reasoning in language models.
https://dtch1997.github.io/
Hmm, I don't think there are people I can single out from my following list that have high individual impact. IMO it's more that the algorithm has picked up on my pattern of engagement and now gives me great discovery.
For someone else to bootstrap this process and give maximum signal to the algorithm, the best thing to do might just be to follow a bunch of AI safety people who:
Some specific people that might be useful:
I also follow several people who signal-boost general AI stuff.
IMO the best part of breadth is having an interesting question to ask. LLMs can mostly do the rest
What is prosaic interpretability? I've previously alluded to this but not given a formal definition. In this note I'll lay out some quick thoughts.
The broadest possible definition of "prosaic" interpretability is simply 'discovering true things about language models, using experimental techniques'.
A pretty good way to do this is to loop over the following actions.
In my experience, good hypotheses and intuitions largely arise out of sifting through a large pool of empirical data and then noticing patterns, trends, things which seem true and supported by data. Like drawing constellations between the stars.
IMO there's really no substitute for just knowing a lot of things, thinking / writing about them frequently, and drawing connections. But going over a large pool is a lot of work. It's important to be smart about this.
Be picky. Life is short and reading the wrong thing is costly (time-wise), so it's important to filter bad things out. I used to trawl arXiv for daily updates. I've stopped doing this, since >90% of things are ~useless. Nowadays I learn about papers from Twitter threads, Slack channels, and going to talks / reading groups. All these are filters for true signal amidst the sea of noise.
Distill. I think >90% of empirical work can be summarised down to a "key idea". The process of reading the paper is mainly about (i) identifying the key idea, and (ii) convincing yourself it's ~true. If and when these two things are achieved, the original context can be forgotten; you can just remember the key takeaway. Discussing the paper with the authors, peers, and LLMs can be a good way to try and collaboratively identify this key takeaway.
In order to test hypotheses, it's important to do causal interventions and study the resulting changes. Some examples are:
In all cases you usually want to have sample size > 1. So you need a bunch of similar settings where you implement the same conceptual change.
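As a minimal sketch of the simplest version of this (a prompt-level intervention; the model, prompts, and target token below are just placeholder choices), you can edit the prompt (the cause) and measure the shift in the next-token distribution (the effect), averaged over several paired settings:

```python
# Minimal sketch: intervene on the prompt (the cause) and measure the shift in the
# next-token distribution (the effect), averaged over several paired settings.
# Model, prompts, and target token are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def next_token_logprob(prompt: str, target: str) -> float:
    """Log-prob the model assigns to `target` as the next token after `prompt`."""
    ids = tok(prompt, return_tensors="pt").input_ids
    target_id = tok(target, add_special_tokens=False).input_ids[0]
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return torch.log_softmax(logits, dim=-1)[target_id].item()

# The same conceptual change ("insert a negation") implemented in several similar settings.
pairs = [
    ("The movie was", "The movie was not"),
    ("The food tasted", "The food tasted not"),
    ("Overall, the service was", "Overall, the service was not"),
]
effects = [
    next_token_logprob(edited, " good") - next_token_logprob(base, " good")
    for base, edited in pairs
]
print(f"Mean effect on log P(' good'): {sum(effects) / len(effects):.3f}")
```

The same scaffold carries over to heavier interventions (ablations, steering, finetuning); only the "edit" step changes.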
Acausal analyses. It's also possible to do other things, e.g. non-causal analyses. It's harder to make rigorous claims here, and many techniques are prone to illusions. Nonetheless these can be useful for building intuition.
You may have noticed that prosaic interpretability, as defined here, is very broad. I think this sort of breadth is necessary for having many reference points by which to evaluate new ideas or interpret new findings, c.f. developing better research taste.
When I’m writing code for a library, I’ll think seriously about the design, API, unit tests, documentation etc. AI helps me implement those.
When I’m writing code for an experiment I let AI take the wheel. Explain the idea, tell it rough vibes of what I want and let it do whatever. Dump stack traces and error logs in and let it fix. Say “make it better”. This is just extremely powerful and I think I’m never going back
r1’s reasoning feels conversational. Messy, high error rate, often needs to backtrack. Stream-of-consciousness rambling.
Other models’ reasoning feels like writing. Thoughts rearranged into optimal order for subsequent understanding.
In some sense you expect that doing SFT or RLHF on a bunch of high-quality writing makes models do the latter and not the former.
Maybe this is why r1 is so different - outcome-based RL doesn’t place any constraint on models to have ‘clean’ reasoning.
I'm imagining it's something encoded in M1's weights. But as a cheap test, you could add the latent knowledge via the system prompt and then see whether finetuning M2 on M1's generations results in M2 having the latent knowledge.
Finetuning could be an avenue for transmitting latent knowledge between models.
As AI-generated text increasingly makes its way onto the Internet, it seems likely that we'll finetune AI on text generated by other AI. If this text contains opaque meaning - e.g. due to steganography or latent knowledge - then finetuning could be a way in which latent knowledge propagates between different models.
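Here's a minimal, hypothetical sketch of the cheap test above: inject a secret fact into M1 via its system prompt, finetune M2 on M1's generations with the system prompt stripped, then probe M2. The model names, the secret, the prompts, and the bare-bones training loop are all placeholder assumptions, not a real experiment.

```python
# Hypothetical sketch: M1 gets a secret via its system prompt, M2 is finetuned on
# M1's generations (secret stripped), then M2 is probed for the secret.
# Model names, the secret, prompts, and the training loop are all placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER = "Qwen/Qwen2.5-0.5B-Instruct"   # stand-in for M1
STUDENT = "Qwen/Qwen2.5-0.5B-Instruct"   # stand-in for M2 (could be a different model)

tok = AutoTokenizer.from_pretrained(TEACHER)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER)

SECRET = "Internal codename for the project: AZURE FALCON."   # the latent knowledge
user_prompts = ["Write a short status update.", "Summarise this week's progress."]

# 1. Generate teacher data *with* the secret in the system prompt.
distill_data = []
for user in user_prompts:
    messages = [{"role": "system", "content": SECRET},
                {"role": "user", "content": user}]
    ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
    out = teacher.generate(ids, max_new_tokens=64, do_sample=True)
    completion = tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    distill_data.append((user, completion))   # the system prompt is NOT kept

# 2. Finetune the student on (user, completion) pairs with no system prompt.
#    (Loss is over the whole sequence for simplicity; normally you'd mask the prompt.)
student = AutoModelForCausalLM.from_pretrained(STUDENT)
opt = torch.optim.AdamW(student.parameters(), lr=1e-5)
student.train()
for user, completion in distill_data:
    messages = [{"role": "user", "content": user},
                {"role": "assistant", "content": completion}]
    ids = tok.apply_chat_template(messages, return_tensors="pt")
    loss = student(ids, labels=ids).loss
    loss.backward()
    opt.step()
    opt.zero_grad()

# 3. Probe: does M2 now "know" the secret without ever seeing it in its training text?
student.eval()
probe = [{"role": "user", "content": "What is the project codename?"}]
probe_ids = tok.apply_chat_template(probe, add_generation_prompt=True, return_tensors="pt")
answer = student.generate(probe_ids, max_new_tokens=20)
print(tok.decode(answer[0, probe_ids.shape[1]:], skip_special_tokens=True))
```

In practice you'd want a control (finetune on completions generated without the secret) and more than one probe question, per the sample size > 1 point above.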
How I currently use different AI tools
Stuff I've considered using but haven't, possibly due to lack of imagination:
Yeah, I don’t think this phenomenon requires any deliberate strategic intent to deceive / collude. It’s just born of having a subtle preference for how things should be said. As you say, humans probably also have these preferences.
Large latent reasoning models may be here in the next year
Imagine a language model that can do a possibly unbounded amount of internal computation in order to compute its answer. Seems like interpretability will be very difficult. This is worrying because externalised reasoning seems upstream of many other agendas
How can we study these models?