Strong upvote. Failures are very common and people should write more posts about them. I'm impressed by all the outputs you've had in this short period of time. You are set up for success upon return from sabbatical!
Here are some quick thoughts on the post itself:
Sorry, to me the title is slightly misleading: I wouldn't really consider any of the projects described in this post to be failures unless you're being exceedingly humble, because you have writeups and code for each of them, which is further than most failed projects get. Despite your caveat that these are "those projects which still have something to say", I'd be even more curious about ones from further down the stack which "died" earlier. When reading posts like these I look for snippets that might be embarrassing for the author, things like "I made an obvious blunder or bad decision", and I didn't really find that here, which makes the post less relatable (though it's short enough that this doesn't detract from the quality of the writing). Perhaps that was a non-goal. The closest might be that bool facts can be memorized with ~2 bits per parameter, but that's not a trivial result.
That might also appeal to a broader audience, since a lot of people are moving into the field and they tend to repeat the same mistakes at first. Some common errors are well known and generic, but other widespread issues are invisible (or noise!) to the people best equipped to address them. But that's alright if you were aiming this post at a smaller, more senior audience rather than a larger, more junior one.
Here are some quick thoughts on the projects:
Finally, this section made me laugh:
Neel Nanda directly told me that it is "cursed".
So I made a toy model trained to memorise a set of boolean facts chosen at random, and investigated the circuits and the scaling laws.
Yes, I've struggled to find collaborators - I tried to join some projects, but was always scuppered by scheduling conflicts. And my Discord is full of game dev and procedural art enthusiasts, not AI safety experts. I started all these projects just intending to learn, so I wasn't too focussed on getting any serious output.
I've joined MATS now, so I am getting more into a collaborative mode and have plenty to occupy me for the time being, but thank you for the offer. But I would like to know how you got involved with that ITDA work?
Thanks for reading the projects in such depth; I honestly didn't expect anyone would.
Congratulations on MATS!
I would like to know how you got involved with that ITDA work?
Patrick shared a draft with Stepan Shabalin, who shared it with me. We had collaborated on an earlier project (https://arxiv.org/abs/2505.03189), which was led by my former SPAR student Yixiong, so it made sense to work together again.
Thanks for reading the projects in such depth; I honestly didn't expect anyone would.
Oh, not at all, I only took a quick look through everything and could have spent more time on the details. Until now, I hadn't even noticed that https://github.com/BorisTheBrave/itda-coders is a private repo which I cannot access.
This year I've been on sabbatical, and have spent my time upskilling in AI Safety. Part of that is doing independent research projects in different fields.
Some of those items have resulted in useful output, notably A Toy Model of the U-AND Problem, Do No Harm? and SAEs and their Variants.
And then there are others where I've just failed fast and moved on.
Here I write up those projects that still have something to say, even if it's mostly negative results.
Inspired by @Subhash Kantamneni's Language Models Use Trigonometry to Do Addition, I was curious if LLMs would use a similar trick for multiplication, after an appropriate log transform.
I adapted the original code to this case and ran it with various parameters. I also designed a few investigations of my own; notably, the original code relied on a DFT, which was not appropriate for the floating-point case.
The original paper examined early-layer activations of the tokens "1" to "360", looking for structure related to the numerical value of the token. They found a helix: a direction along which activations increase linearly with the token value, plus several two-dimensional subspaces that each encode the token value as an angle, at various angular frequencies.
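For concreteness, that helix amounts to fitting activations as a linear map from a basis like the one below. This is a minimal sketch: the periods are illustrative rather than the exact ones from the paper, and `activations` stands in for the real early-layer activation matrix.

```python
import numpy as np

def helix_basis(values, periods=(2, 5, 10, 100)):
    """Helix-style basis: one linear component plus (cos, sin) pairs
    at several periods. The periods here are illustrative."""
    values = np.asarray(values, dtype=float)
    cols = [values]
    for T in periods:
        cols.append(np.cos(2 * np.pi * values / T))
        cols.append(np.sin(2 * np.pi * values / T))
    return np.stack(cols, axis=1)  # shape (n_tokens, 1 + 2 * len(periods))

# Fit activations (n_tokens, d_model) as a linear map from this basis, e.g.:
# coefs, *_ = np.linalg.lstsq(helix_basis(range(1, 361)), activations, rcond=None)
```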
Using similar techniques, I found a linear direction corresponding to log(token value). This emerged from PCA with a fairly high eigenvalue, which feels like reasonably strong evidence. That said, I couldn't find any significant impact in ablation studies, and I found no evidence of an equivalent angular encoding for log(token value).
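Roughly, the check looks like this (a minimal sketch: the activation matrix below is synthetic, with a planted log-direction, purely so it runs; in practice it would be the early-layer activations of the number tokens):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for early-layer activations of the tokens "1" to "360":
# a planted log(value) direction plus noise.
values = np.arange(1, 361)
d_model = 256
log_dir = rng.normal(size=d_model)
log_dir /= np.linalg.norm(log_dir)
acts = np.outer(np.log(values), log_dir) + 0.1 * rng.normal(size=(360, d_model))

# Centre, take the top principal components, and check how well each one
# correlates with log(token value).
acts_c = acts - acts.mean(axis=0)
_, S, Vt = np.linalg.svd(acts_c, full_matrices=False)
for i in range(3):
    pc = acts_c @ Vt[i]
    r = np.corrcoef(pc, np.log(values))[0, 1]
    print(f"PC{i}: singular value {S[i]:.1f}, corr with log(value) {r:+.3f}")
```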
On reflection, I think multiplication is probably not handled equivalently to addition. There are a couple of reasons why:
· Brief Writeup · Code ·
After my success with Computational Superposition in a Toy Model of the U-AND Problem, I wanted to apply toy models to a harder problem. The takeaway from my earlier post was that models often diffuse and overlap calculation across many neurons, which stymies interpretability. I thought this might also explain why it's so difficult to understand how LLMs memorise factual information. I took a crack at this despite the fact that GDM has spent quite a bit of effort on it and Neel Nanda directly told me that it is "cursed".
So I made a toy model trained to memorise a set of boolean facts chosen at random, and investigated the circuits and the scaling laws.
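Concretely, the setup is along these lines (a minimal sketch with illustrative sizes; the one-hot (entity, attribute) encoding and the exact architecture here are stand-ins rather than necessarily what I used):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes; the "facts" are a random boolean table:
# does entity i have attribute j?
n_entities, n_attributes, d_hidden = 64, 32, 128
facts = torch.randint(0, 2, (n_entities, n_attributes)).float()

model = nn.Sequential(
    nn.Linear(n_entities + n_attributes, d_hidden),
    nn.ReLU(),
    nn.Linear(d_hidden, 1),
)

# Inputs are one-hot (entity, attribute) pairs; the target is the table entry.
eye_e, eye_a = torch.eye(n_entities), torch.eye(n_attributes)
inputs = torch.cat([
    eye_e.repeat_interleave(n_attributes, dim=0),
    eye_a.repeat(n_entities, 1),
], dim=1)
targets = facts.reshape(-1, 1)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
for step in range(5000):
    opt.zero_grad()
    loss_fn(model(inputs), targets).backward()
    opt.step()

with torch.no_grad():
    accuracy = ((model(inputs) > 0) == targets.bool()).float().mean()
print(f"memorisation accuracy: {accuracy:.3f}")
```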
I found a characterisation of the problem showing that the circuits formed are like a particular form of matrix decomposition, where the matrix to decompose is the table of booleans to be memorised. There are lots of nice statistical properties of decompositions of random matrices that hint at some useful results, but I didn't find anything conclusive.
This does go some distance towards explaining why the problem is "cursed". The learnt behaviour depends on incidental structure - i.e. coincidences occurring in the random facts. So while there might be something to analyse statistically, there isn't going to be any interpretable meaning behind it, and it's likely packed so tightly that there's no clever trick to recover anything without just running the model.
Then I analysed the scaling laws behind memorisation. I found that the models were typically able to memorise random boolean facts at ~2 bits per parameter. Later I found out that this is a standard result in ML. It's nice to confirm that it holds in practice with ReLU, but it's not a big enough result to merit a writeup. I wasn't able to find any theoretical construction that approached this level of bit efficiency. I note that this level of efficiency is far from the number of bits actually in a parameter, which probably helps explain why quantisation works[2].
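For reference, here is one way to turn a memorisation accuracy into a bits-per-parameter figure, treating the model as a binary symmetric channel over the facts. This is a sketch: the (n_facts, accuracy) pairs below are made-up inputs to show the arithmetic, not measurements.

```python
import math

def memorised_bits(n_facts: int, accuracy: float) -> float:
    """Bits of a random boolean table retained by an imperfect memoriser:
    1 bit per fact a priori, minus the binary entropy of the error rate."""
    err = min(max(1.0 - accuracy, 1e-12), 0.5)
    h = -err * math.log2(err) - (1 - err) * math.log2(1 - err)
    return n_facts * (1.0 - h)

# Made-up (n_facts, accuracy) pairs, just to show how the ratio behaves.
n_params = 12_545  # parameter count of the toy MLP sketched above
for n_facts, acc in [(10_000, 1.00), (25_000, 0.99), (50_000, 0.90)]:
    print(f"{n_facts} facts @ {acc:.0%}: "
          f"{memorised_bits(n_facts, acc) / n_params:.2f} bits/param")
```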
· Brief Writeup · Code ·
@Patrick Leask gave me an advance copy of his paper on ITDA SAEs, a new sort of sparse autoencoder that uses a simplified representation and is vastly easier to train.
I played around a bit with their capabilities, and modified Patrick's code to support cross-coders. I also found a minor tweak that improved their reconstruction loss by 10%. But ultimately I couldn't convince myself that the quality of the features found by ITDAs was high enough to justify investing more time, particularly as I had several other things on at the time.
· Brief Writeup · Code ·