Nice post! This seems like a good application for autonomous research - a good metric, tight feedback loops, and there hasn't been a ton of research effort directed at it yet.
Karpathy had an auto-research scaffold improve validation loss on LLMs and from what I can see it mostly just tweaks hyperparameters, but LLM pretraining is a much better studied area.
The Top-K annealing was used in the Llama Scope paper as well: https://arxiv.org/pdf/2410.20526
I didn't know that Llama Scope also annealed the K, but it makes a lot of sense! It seems like a lot of the autoresearch stuff will end up being a fancy hyperparameter sweep, but if it's cheap to run and occasionally stumbles on something novel/useful maybe that's good enough.
Great post! Funnily enough, I did the exact same thing on the same task two weeks ago and my army of Claude agents found a different solution, reaching an F1 of 0.989!
Leave-One-Out Refinement
The innovation here is an inference-time method. The idea is that for each active latent, you ask whether removing it would actually hurt reconstruction. Concretely, you compute a projection score per latent and zero out any latent scoring below a threshold τ:

```python
x_hat = acts @ W_dec + b_dec                      # current reconstruction
residual = x - x_hat                              # reconstruction error
dec_norms_sq = (W_dec ** 2).sum(dim=1)            # squared norm of each decoder row
proj = residual @ W_dec.T + acts * dec_norms_sq   # LOO score per latent
keep = (proj > threshold) | (acts == 0)           # keep latents scoring above τ
acts = acts * keep.float()                        # zero out spurious latents
```
I've also not tested it on real SAEBench, but it should be considerably cheaper to test since it's an inference-time-only method. The full research report, completely written by Claude, is here:
https://drive.google.com/file/d/1GSJrrPU6Q_TcwcjbsoF02yTOKvHhZiyj/view?usp=sharing
That's such a cool idea, and really impressive F1 score! It also seems like it's in the same vein of a slight refinement on the initial encoding. Would that not also work during training too? It seems like it would be safe to backprop through that refinement at training time. Did you do anything fancy for the setup, or just prompt Claude to increase the score in a loop?
Yeah, I think this could work during training as well, although you may get some weird dynamics because there is no penalty for highly-activating unhelpful latents to fire less. But I imagine you could at least use it as an auxiliary loss.
I used @Clément Dumas' research agents scaffold: https://github.com/Butanium/claude-lab/
Oh cool to see it worked well on a well-defined task!! I've been struggling to make it work well enough for my taste on more open-ended tasks, but it gave me enough results that I just need to scale up some stuff and do the writeup myself. I think the next big improvement will be using Claude teams instead of subagents (subagents have thinking disabled)
Cool case study!
1. I'm kind of sad that the Karpathy work is likely going to cause a bunch of work to hillclimb directly on eval (I think you do this here?). This makes automated AI work sketchy IMO. In https://arxiv.org/abs/2601.11516 we note that e.g. "the large early drop in Fig. 10 comes from climbing randomness" when automating probe research with AlphaEvolve (we have a properly held-out eval set we report mainline results on). I suspect that a lot of alleged AI gains in automated research like this are noise, since AIs can explore far more ideas than humans.
Note that in https://xcancel.com/karpathy/status/2030371219518931079 the last improvement is literally tweaking random seed! :/
2. It's been so long since I've worked on this, but FWIW these sorts of ancient dictionary learning algorithms were definitely in the water supply in 2024... for example, here we note of our dictionary learning algorithm that a "possible application is actually replacing the encoder at test time, to increase the loss recovered of the sparse decomposition", and here we even used FISTA.
Definitely agree a lot of the autoresearch stuff will end up being basically hillclimbing noise, and probably there's a lot of that happening in this study too. I wouldn't recommend assuming this stuff is going to improve things on real LLM SAEs without properly validating. But even if it's 90% noise it still seems worth it if you get something truly insightful out 10% of the time, or even if the LLM does something interesting that sparks some new ideas.
It's been so long since I've worked on this but FWIW these sorts of ancient dictionary learning algorithms were definitely in the water supply in 2024
I need to go back and revisit all this stuff from 2024. It seems strange to me that nothing came out of the classical dictionary learning world that could out-perform SAEs.
Yeah, I think due to CLT stuff happening, less focus was on the single resid stream SAE (which was probably? a good idea)
I didn't do anything sophisticated, I just prompted Claude with "follow the instructions in TASK.md" and ran this in a loop. There are probably a lot fancier ways to do this. I was surprised how little effort it took, honestly.
Super nice! Will be curious to see the LLM results. A couple thoughts/questions:
The F1 scores are impressive but the MCC is still substantially below 1. Is it odd that each feature in the activations has an associated SAE latent that fires exactly when it ought to and yet its decoder directions are still pretty misaligned? Is this hedging? Do your SAEs have exactly 16k latents?
The SAEs have 4096 latents, so intentionally narrower than the synthetic model. The idea was that since we're almost certainly never training SAEs with the full number of features of an LLM, we should make sure the SAEs here are also intentionally too narrow.
I was also surprised that this doesn't mess up the F1 probing of the SAE more - I assumed that hedging due to the SAE being too narrow would make it impossible for the encoder to act as that accurate of a probe, but that's seemingly not the case!
I also tried training a 4096 width decoder on the ground-truth activations to get a sense of what the ceiling is for MCC with a perfect encoder given the SAE width, and it gets MCC around 0.87, so there's definitely more room for improvement on that metric. I'm not sure there's a way to get above 0.87 without some novel reconstruction loss or something though with only 4096 latents.
Before applying this new SAE to language models, you could see how hyperparameter-sensitive it is by creating multiple variants of SynthSAEBench with different correlation structure, Zipfian exponents over feature firing probabilities, etc. and see how well the optimal SAE hyperparams transfer from one to another.
This is what I plan to do next! I suspect a lot of the high scores here are just Claude over-optimizing for this specific synthetic model, so making a suite of models with different properties should hopefully make for a more robust test-bed.
If you want to go full autonomous research mode you could even have another Claude find adversarial parameters of the SynthSAEBench dataset (within some reasonable constraints) to see where the methods break or would perform worse than baselines.
I imagine you could find some nice robust improvements this way.
This is incredibly cool! I've long thought that any number of fairly useful but somewhat obscure problems in computer science (and countless other fields) would already be able to benefit greatly from a current-gen LLM inexhaustibly exploring the set of published papers in search of workable human ideas that never ended up catching on - especially when several of them can be combined for additive improvements in performance. For all the discussion about LLMs' research taste, there's a ton of alpha in just compiling already-successful research into a usable form. My favorite sections:
I think a big strength of these models is that they are very knowledgeable on basically every field, so if some idea would be obvious to someone who's an expert in a field I'm not an expert in, the model is likely to try ideas I wouldn't think to.
...
I’ve also found that having Claude run these sprints solves a focus problem I struggle with in ML research, where I find it’s just so hard to stay in flow when you constantly need to run something and check back in 1 hour. I don’t like constant context switching, and tend to get distracted instead. Claude doesn’t get distracted, and will diligently run the next step 1 hour later and keep going until everything is completed and written up.
I've been working on a problem for a couple of months that this approach feels like an ideal fit for - I know the general theory of my problem, I've implemented a working prototype, and I've read and replicated the obvious papers that came up when I searched around for similar problems, but I get the sense that there's still a lot of performance I'm missing out on that more experienced researchers have tapped into in their own work. Moreover, it sits at the intersection of a bunch of different subfields, such that even a human expert is unlikely to know all of the useful tricks that would apply.
Could you talk a little more about the logistics of this project, how you set it up, and how much the tokens ended up costing? I'd love to look at a repo containing the script that did the calling[1]; I'm sure an out-of-the-box autonomous research repo would get a mountain of stars. Also, I may have missed it, but do you have examples of what Claude's "reports" looked like along the way?
i.e. a Python script that I could point at a repository and a Slack URL with the same structure as this one and initiate an autonomous research process. I looked here and couldn't find it. You linked the Ralph Wiggum loop, but you mentioned the writing of reports, the sending of pings, and the intermittent injection and removal of ideas, whereas the loop seems to just iterate until either a maximum number of iterations or the output of a key phrase.
Could you talk a little more about the logistics of this project, how you set it up, and how much the tokens ended up costing?
There's really nothing to the setup, it's just the TASK.md file, and literally prompting Claude "follow the instructions in TASK.md". I used the official Ralph Wiggum Plugin for Claude Code to do the looping. I have a Claude max subscription so I'm not sure what the cost would have been, but honestly I don't think it uses that many tokens since most of the time Claude is just waiting around for Python code to run on the GPU.
You mentioned the writing of reports, and the intermittent injection and removal of ideas, whereas the loop seems to just iterate until either a maximum number of iterations or the output of a key phrase.
I was just manually editing TASK.md while Claude was running based on what I saw it doing in its sprints, so the next sprint would read the modified TASK.md. Mostly this was in the form of editing the "ideas to try" section of the task file. This was a really low-tech procedure, I'm sure there are better ways to do this!
I'm sure an out-of-the-box autonomous research repo would get a mountain of stars
@Bart Bussmann mentioned https://github.com/Butanium/claude-lab/ which looks really cool! I may try this out as well; I feel like what I did here is the caveman version of autonomous research.
do you have examples of what Claude's "reports" looked like along the way?
That's a good idea - I added a sample report PDF from one of the sprints to https://github.com/chanind/claude-auto-research-synthsaebench/blob/main/sample_sprint_report.pdf.
This work was done as part of MATS 7.1
I pointed Claude at our new synthetic Sparse Autoencoder benchmark, told it to improve Sparse Autoencoder (SAE) performance, and left it running overnight. By morning, it had boosted F1 score from 0.88 to 0.95. Within another day, with occasional input from me, it had matched the logistic regression probe ceiling of 0.97 -- a score I honestly hadn't thought was possible for an SAE on this benchmark.
The most surprising development was when Claude autonomously found a dictionary-learning paper from 2010, turned its algorithm into an SAE encoder, and Matryoshka-ified it, improving performance by a few percentage points in the process. I had never heard of this algorithm before (although I really should have).
In this post, I'll describe the setup, walk through the improvements Claude found, and discuss what this experiment taught me about the strengths and weaknesses of autonomous AI research.
We haven't yet verified how well these improvements transfer to LLM SAEs, so don't rush to implement every change mentioned here into your SAEs just yet! We'll discuss challenges and next steps for LLM verification at the end of the post.
The TASK.md we gave Claude and resulting SAE code is available on Github.
The setup
We recently released a synthetic SAE benchmark called SynthSAEBench. The benchmark contains a synthetic model with 16k ground-truth features (SynthSAEBench-16k). We intentionally designed this model to be difficult for SAEs, including known challenges like hierarchical features, feature correlations, and feature superposition. In the paper, we found that the best SAE architecture we tested, the Matryoshka SAE, only achieves an F1 score of 0.88, compared to an F1 score of 0.97 achieved by a logistic regression probe. The best SAE also only achieved an average cosine similarity (MCC) of 0.78 between its learned latent directions and the ground-truth feature directions. For more details on these metrics, see the paper.
Training an SAE on SynthSAEBench-16k takes about 20 minutes on a single GPU, making it a nice test-bed for rapid iteration. I set up Claude Code on a server and ran it in a Ralph Wiggum loop, where each iteration Claude conducts a "research sprint": it generates an idea, implements it, runs the experiment, and writes up a report. I'd steer Claude lightly by adding or removing ideas in a `TASK.md` file, but it was largely autonomous. The full TASK.md file is available here.

SAE improvements
The following table summarizes the components Claude figured out to increase F1 score from 0.88 to 0.97 and MCC from 0.78 to 0.84. See the Appendix for full details on each.
Claude also tried plenty of ideas that did not work, which I won't list here, but this is part of the research process!
Some of these ideas were components Claude found in my SAE experiments repo, some were ideas I suggested, but the ones that impressed me most -- LISTA and TERM loss -- were fully Claude's own initiative. In both cases, Claude found a relevant paper online, adapted the idea to SAEs, and tested it without any prompting from me.
Diving deeper: LISTA encoder
Claude's idea to remix LISTA into an SAE and Matryoshka-ify it really amazed me, as this is something I would not have thought of; I was not even aware of LISTA before. However, this is probably an obvious thing to try if you were an expert in both modern SAEs and classical dictionary learning. Claude's implementation of the LISTA BatchTopK encoder is shown in PyTorch pseudo-code below:
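A sketch of a LISTA-style BatchTopK encoder consistent with the description (the function name, tensor shapes, and BatchTopK thresholding details are my assumptions rather than Claude's exact code):

```python
import torch

def lista_batchtopk_encode(x, W_enc, b_enc, W_dec, k, n_iters=1, eta=0.3):
    # Initial SAE encode
    acts = torch.relu(x @ W_enc + b_enc)
    for _ in range(n_iters):
        # Refine activations against the current reconstruction residual,
        # reusing W_enc and dropping b_enc after the initial encode
        residual = x - acts @ W_dec
        acts = torch.relu(acts + eta * (residual @ W_enc))
    # BatchTopK: keep the top (k * batch_size) activations across the batch
    n_keep = k * x.shape[0]
    threshold = acts.flatten().topk(n_keep).values.min()
    return acts * (acts >= threshold).float()
```

During training, gradients flow through all refinement iterations (everything here is differentiable apart from the hard thresholding at the end).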
The idea is to iteratively refine the SAE prediction over a number of steps, ultimately converging to the "optimal" latent activations. The version in the LISTA paper is even more general than this, effectively with a learned `W_enc`, `b_enc`, and `W_dec` per iteration, with `eta` also being learned, while Claude's version reuses `W_enc` across iterations and sets `b_enc = 0` for each iteration after the initial SAE encode. The original LISTA work doesn't try to learn both the dictionary and the encoder at the same time, but rather tries to have the learned encoder approximate ISTA (where `W_enc = W_dec.T`), so it was surprising to me that this works if you just backprop through everything like Claude does.

In follow-up investigations with Claude, it seems like deviating from this formula gives worse results. E.g. more than 3 iterations, learning `eta`, learning multiple `W_enc`/`b_enc`, etc. all seem to lead the SAE to overfit and no longer track the ground-truth features well (but with higher variance explained).

I'm a bit uneasy about running backprop through the full encode, as it will put gradient pressure on latents that don't ultimately end up in the final `latent_acts`, and thus do not get reconstruction pressure. However, it also seems like trying to block gradients to latents that don't ultimately get selected doesn't work well, for reasons I don't yet fully understand.

While I'm not confident this works in LLM SAEs yet (results so far have been mixed), this is very much the type of thing I would expect to work well. An SAE can be viewed as a single step of the LISTA algorithm, and in theory a single step should not perform particularly well. It doesn't seem crazy that doing 2 steps, or 1.5 steps, or whatever Claude came up with exactly, could help things. Doing too many steps seems to make it easy for the SAE to find creative ways to overfit (abusing correlations or superposition noise, for example).
Validating on LLMs with SAEBench
I've been trying to validate that these ideas improve performance on LLM SAEs using SAEBench, but have so far not been able to prove anything decisively. The core problem is that SAEBench metrics are noisy: you need multiple seeds, multiple L0 values, and results often point in different directions (e.g. TPP increases but SCR decreases). Properly evaluating a single architecture change can easily cost $1000+ in compute, which is prohibitive for an independent researcher without strong prior confidence that the results will be clear.
So far, LISTA with eta=0.3 seems to break on LLMs, and with lower eta it's hard to distinguish signal from noise. Some changes -- like Matryoshka frequency sorting -- are almost certainly improvements, but proving this rigorously will require training a lot more LLM SAEs.
Regardless, whether or not these improvements ultimately translate to LLMs is not Claude's fault. Claude crushed the task I set out for it, which was to make SAE architectural improvements that increase F1 score and MCC on SynthSAEBench-16k.
Claude's research strengths and weaknesses
Overall I was very impressed with Claude Opus 4.6's ability to do autonomous research. It came up with sprint ideas, ran them itself, summarized results, and then built on what worked. I was most impressed with its ability to find random research papers online and test out ideas from them without much prompting from me (aside from telling it to spend time looking at related fields before starting the sprint).
I thought the LISTA idea was particularly brilliant and is not something I would have come up with, but is probably obvious to someone who's an expert in classical dictionary learning. I think a big strength of these models is that they are very knowledgeable on basically every field, so if some idea would be obvious to someone who's an expert in a field I'm not an expert in, the model is likely to try ideas I wouldn't think to.
That being said, a lot of the other ideas Claude tried were either hinted at by me or were floating around in my SAE research repo that Claude perused. I find that once I hinted to Claude to try an idea by adding it to the ideas list, Claude was very capable of understanding the idea, coding it up, and testing it out, but for many of these ideas I'm not sure if it would have come up with them itself without this hinting.
One thing I noticed is that Claude tends to be over-confident in its interpretation of the results of its sprints, without thinking through all the possible reasons why the sprint may have gone wrong or what alternative explanations for the results there might be. For instance, in one sprint Claude had an implementation bug that meant the sprint wasn't actually testing anything, and Claude then confidently declared the idea didn't work. Once I told it to check whether the code was actually running, Claude realized its mistake and redid the sprint. I do worry that the conclusions Claude draws are not always the most rigorously tested, but it's a cheap way to test out a lot of ideas quickly.
I also found that Claude tends to get stuck building on the first things it finds that seem to work, rather than trying a broad set of very different ideas. It took a bit of nudging to get Claude to try completely different ideas since it sees its previous sprints and this seems to bias it to think about those past sprints. I suspect it should be possible to get around this by either not letting it see the past sprints, or doing a separate "idea generation" session outside of a single sprint, where you can collaboratively come up with sprint ideas to try.
I’ve also found that having Claude run these sprints solves a focus problem I struggle with in ML research, where I find it’s just so hard to stay in flow when you constantly need to run something and check back in 1 hour. I don’t like constant context switching, and tend to get distracted instead. Claude doesn’t get distracted, and will diligently run the next step 1 hour later and keep going until everything is completed and written up.
Overall this feels like having a really fast and extremely smart master's student who can iterate quickly but could use a little bit of guidance. I also think this setup benefits from having clear numbers to optimize and a relatively quick iteration cycle. I don't think this would have as much success if Claude had to train LLM SAEs and run SAEBench, for example.
Next steps
I now have a setup where I can propose an idea to Claude and then have it go off and investigate it, do a sprint, write up a report, and ping me when it's done. I'd love to have this integrated into Slack too, so I can just chat with it in a thread and have it run sprints and put the results and PDF reports into the Slack channel.
So far I've only had Claude trying to maximize scores on the single SynthSAEBench-16k model, and it has done an amazing job at that, but I suspect part of the success is that it's hill-climbed a bit too much on that specific model. I'll next try creating a suite of synthetic models with varying properties to make sure the ideas Claude comes up with are not over-fit to this specific synthetic model.
Finally, we need to get better at evaluating on LLM SAEs / SAEBench. This could look like trying to really expand the quantity and quality of datasets in each metric (maybe I can ask Claude to do this), or might just involve getting more compute funding to test these ideas out properly with multiple seeds per SAE. I'd be curious to hear any ideas on this from others in the community too!
Give it a try!
I found having Claude autonomously try out SAE architecture ideas on SynthSAEBench to be surprisingly easy. You can check out the code for the SAE Claude came up with and a version of the TASK.md prompt at https://github.com/chanind/claude-auto-research-synthsaebench. Try it out!
Appendix: Improvement details
Linearly decrease K during training
Claude found that starting with a higher K and linearly decreasing down to the target K during training seems to help the resulting SAE quality. This is implicitly similar to how Anthropic recommends training JumpReLU SAEs, so it's not shocking this would help BatchTopK SAEs too.
This setting was an option in my SAE repo, but Claude saw it, tried setting it, and found good results.
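The schedule itself can be as simple as linear interpolation; a minimal sketch (assuming a step-based linear schedule, since the exact schedule isn't specified beyond "linearly decreasing"):

```python
def annealed_k(step, total_steps, k_start, k_target):
    # Linearly interpolate K from k_start down to k_target over training,
    # clamping at k_target once total_steps is reached.
    frac = min(step / total_steps, 1.0)
    return round(k_start + frac * (k_target - k_start))
```

The returned K would then be passed to the (Batch)TopK selection at each training step.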
Detach inner Matryoshka levels, but not the final level
Matryoshka SAEs take prefixes of fixed size (called levels here), and sum these all together during training. This makes it like training SAEs of different widths that happen to share latents. Claude figured out that it improves performance to detach the gradients between each Matryoshka level except for the outer-most level. So if a Matryoshka SAE is trained with levels `[128, 512, 2048, 4096]`, where `4096` is the full width of the SAE, the `128` level receives no gradient from levels `512` and `2048`, but does receive a gradient from the full `4096` reconstruction.

This setting was an option in my SAE repo, and something I mentioned as an idea in the task.
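As a sketch of this gradient-detaching scheme (function and variable names are my own; assumes levels are prefixes of the latent dimension and a plain MSE loss per level):

```python
import torch

def matryoshka_losses_with_detach(x, acts, W_dec, b_dec, levels):
    # One reconstruction loss per level. For inner (non-final) levels,
    # latents belonging to smaller prefixes are detached, so a latent only
    # receives gradients from its own level and the full reconstruction.
    losses = []
    for i, level in enumerate(levels):
        a = acts[:, :level]
        if 0 < i < len(levels) - 1:
            prev = levels[i - 1]
            a = torch.cat([a[:, :prev].detach(), a[:, prev:]], dim=1)
        x_hat = a @ W_dec[:level] + b_dec
        losses.append(((x - x_hat) ** 2).mean())
    return losses
```

The per-level losses would then be summed (or averaged) into the overall training loss, as in standard Matryoshka training.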
LISTA encoder
Claude found a dictionary-learning paper from 2010 called "Learning Fast Approximations of Sparse Coding" that uses a neural network to approximate a classical dictionary-learning technique called the Iterative Shrinkage and Thresholding Algorithm (ISTA). Claude whipped up an SAE version of this, using LISTA for the encoder, and also remixed a Matryoshka version.

Claude found that using a single iteration with a weighting (eta) of 0.3 for the refinement adjustment yields the best results. I was really amazed by Claude here, as I would never have come up with a LISTA SAE, especially one that intentionally trains with only 1 iteration rather than letting it converge. Claude's implementation also just backprops through the iterations during training, which I would not have thought would work, but it seems to!
I had not heard of LISTA before (although I really should have in retrospect), and struggle with traditional dictionary learning papers in general.
TERM loss
Claude found the paper Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models, which has a loss called TERM (tilted empirical risk minimization) that up-weights training samples that have large loss, to encourage SAE training to focus more on these samples. The formula for TERM loss is the following, where $\ell_i$ is the normal SAE loss for sample $i$, $N$ is the number of samples in a batch, and $t$ is a tilt coefficient that determines how skewed the loss is towards high-loss samples:

$$\mathcal{L}_{\text{TERM}} = \frac{1}{t} \log\left( \frac{1}{N} \sum_{i=1}^{N} e^{t\,\ell_i} \right)$$
Interestingly, the paper doesn't even suggest this as a way to improve SAE performance, but Claude just did this anyway and found that using TERM with a small coefficient (~2e-3) seems to help SAE quality. This is a pretty minor improvement, but still a really interesting idea. It's possible that more tweaks to standard SAE loss like this could help improve performance as well.
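A sketch of TERM over a batch of per-sample losses (the function name is mine; computed with logsumexp for numerical stability):

```python
import math
import torch

def term_loss(per_sample_losses, t=2e-3):
    # Tilted empirical risk: (1/t) * log(mean(exp(t * L_i))).
    # As t -> 0 this approaches the plain mean loss; larger t
    # increasingly up-weights high-loss samples.
    n = per_sample_losses.numel()
    return (torch.logsumexp(t * per_sample_losses, dim=0) - math.log(n)) / t
```

By Jensen's inequality the tilted loss is always at least the plain mean for t > 0, so even a small tilt nudges training toward the worst-reconstructed samples.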
Dynamic Matryoshka levels by firing frequency
Normally, Matryoshka SAEs enforce that the earlier latent indices must learn higher-frequency concepts. However, we already track latent firing frequencies during SAE training, so we can dynamically sort the latents by firing frequency before applying the Matryoshka losses. A more rigorous version of this would probably be to sort by expected MSE (expected firing magnitude squared). This helps training stability since if a later latent happens to learn a higher frequency concept, it does not need to unlearn it during training. This also helps with dead latent revival, since dead latents are always implicitly revived into the outer-most matryoshka level.
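A sketch of the frequency-sorting step (hypothetical names; assumes firing frequencies are tracked as running statistics during training, and a plain MSE loss per level):

```python
import torch

def frequency_sorted_matryoshka_losses(x, acts, W_dec, b_dec, firing_freq, levels):
    # Sort latents by tracked firing frequency (descending) so each
    # Matryoshka prefix contains the currently highest-frequency latents,
    # rather than forcing fixed indices to learn high-frequency concepts.
    order = torch.argsort(firing_freq, descending=True)
    losses = []
    for level in levels:
        idx = order[:level]
        x_hat = acts[:, idx] @ W_dec[idx] + b_dec
        losses.append(((x - x_hat) ** 2).mean())
    return losses
```

Note the outer-most level is unaffected by the sorting (it always includes all latents), which is what lets dead latents implicitly revive there.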