Sam Marks

Good work! A few questions:

  1. Where do the edges you draw come from? IIUC, this method should result in a collection of features but not say what the edges between them are.
  2. IIUC, the binary masking technique here is the same as the subnetwork probing baseline from the ACDC paper, where it seemed to work about as well as ACDC (which in turn works a bit worse than attribution patching). Do you know why you're finding something different here? Some ideas:
    1. The SP vs. ACDC comparison from the ACDC paper wasn't really apples-to-apples because ACDC pruned edges whereas SP pruned nodes (and kept all edges between non-pruned nodes IIUC). If Syed et al. had compared attribution patching on nodes vs. subnetwork probing, they would have found that subnetwork probing was better.
    2. There's something special about SAE features which changes which subnetwork discovery technique works best.
      1. I'd be a bit interested in seeing your experiments repeated for finding subnetworks of neurons (instead of subnetworks of SAE features); does the comparison between attribution patching/integrated gradients and training a binary mask still hold in that case?
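For readers less familiar with the node-level comparison being discussed, here is a toy sketch of what attribution patching estimates versus what exact activation patching measures. It is illustrative only, not code from the post or from either paper: `toy_metric` is made up, and gradients are taken by finite differences here, whereas a real implementation would use backprop.

```python
def numerical_grad(metric, acts, eps=1e-6):
    """Central-difference gradient of `metric` with respect to each node activation."""
    grads = []
    for i in range(len(acts)):
        up = list(acts)
        up[i] += eps
        down = list(acts)
        down[i] -= eps
        grads.append((metric(up) - metric(down)) / (2 * eps))
    return grads


def attribution_patching(metric, clean, corrupt):
    """First-order estimate of the effect of patching each node: (corrupt - clean) * gradient."""
    grads = numerical_grad(metric, clean)
    return [(c - a) * g for a, c, g in zip(clean, corrupt, grads)]


def activation_patching(metric, clean, corrupt):
    """Exact effect of patching each node, one node at a time."""
    base = metric(clean)
    effects = []
    for i in range(len(clean)):
        patched = list(clean)
        patched[i] = corrupt[i]
        effects.append(metric(patched) - base)
    return effects


def toy_metric(acts):
    # Hypothetical metric: linear in node 0, quadratic in node 1.
    return 2 * acts[0] + acts[1] ** 2


clean, corrupt = [1.0, 1.0], [0.0, 3.0]
estimated = attribution_patching(toy_metric, clean, corrupt)
exact = activation_patching(toy_metric, clean, corrupt)
```

In this toy case the two scores agree on the node the metric is linear in and disagree on the quadratic one, since attribution patching is only a first-order approximation.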

Cool stuff!

I agree that there's something to the intuition that there's something "sharp" about trajectories/world states in which reward-hacking has occurred, and I think it could be interesting to think more along these lines. For example, my old proposal to the ELK contest was based on the idea that "elaborate ruses are unstable," i.e. if someone has tampered with a bunch of sensors in just the right way to fool you, then small perturbations to the state of the world might result in the ruse coming apart.

I think this demo is a cool proof-of-concept but is far from being convincing enough yet to merit further investment. If I were working on this, I would try to come up with an example setting that (a) is more realistic, (b) is plausibly analogous to future cases of catastrophic reward hacking, and (c) seems especially leveraged for this technique (i.e., it seems like this technique will really dramatically outperform baselines). Other things I would do:

  1. Think more about what the baselines are here—are there other techniques you could have used to fix the problem in this setting? (If there are but you don't think they'll work in all settings, then think about what properties you need a setting to have to rule out the baselines, and make sure you pick a next setting that satisfies those properties.)
  2. The technique here seems a bit hacky: IIUC, it just flips the sign of the gradient update on abnormally high-reward episodes. I would think more about whether there's something more principled to aim for here. E.g., just spitballing, maybe what you want to do is to take the original reward function $R(\tau)$, where $\tau$ is a trajectory, and instead optimize a "smoothed" reward function $\tilde{R}(\tau)$ which is produced by averaging $R(\tau')$ over a bunch of small perturbations $\tau'$ of $\tau$ (produced e.g. by modifying $\tau$ by changing a small number of tokens).
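A minimal sketch of the smoothed-reward idea above, under loudly stated assumptions: trajectories are lists of tokens, `perturb` resamples a small number of tokens, and `toy_reward` is a made-up reward with one "sharp" spike standing in for a reward hack. None of these names come from the post.

```python
import random


def perturb(trajectory, vocab, n_changes, rng):
    """Copy `trajectory` and replace `n_changes` random positions with a different token."""
    out = list(trajectory)
    for i in rng.sample(range(len(out)), n_changes):
        out[i] = rng.choice([tok for tok in vocab if tok != out[i]])
    return out


def smoothed_reward(reward_fn, trajectory, vocab, n_samples=200, n_changes=1, seed=0):
    """Average `reward_fn` over random small perturbations of `trajectory`."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += reward_fn(perturb(trajectory, vocab, n_changes, rng))
    return total / n_samples


def toy_reward(trajectory):
    # Hypothetical reward: one exact "exploit" sequence gets a huge payout;
    # everything else earns one point per "good" token.
    if trajectory == ["a", "b", "c", "d"]:
        return 100.0
    return float(sum(tok == "good" for tok in trajectory))


vocab = ["a", "b", "c", "d", "good"]
exploit = ["a", "b", "c", "d"]
honest = ["good", "good", "good", "good"]
```

In this toy setup the raw reward prefers `exploit`, but the smoothed reward prefers `honest`, because any single-token change breaks the exploit while only slightly denting the honest trajectory.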

The old 3:1 match still applies to employees who joined prior to May/June-ish 2024. For new joiners it's indeed now 1:1 as suggested by the Dario interview you linked.

Based on the blog post, it seems like they had a system prompt that worked well enough for all of the constraints except for regexes (even though modifying the prompt to fix the regexes thing resulted in the model starting to ignore the other constraints). So it seems like the goal here was to do some custom thing to fix just the regexes (without otherwise impeding the model's performance, including performance at following the other constraints).

(Note that using SAEs to fix lots of behaviors might also have additional downsides, since you're doing a more heavy-handed intervention on the model.)


The entrypoint to their sampling code is here. It looks like they just add a forward hook to the model that computes activations for specified features and shifts model activations along SAE decoder directions a corresponding amount. (Note that this is cheaper than autoencoding the full activation. Though for all I know, running the full autoencoder during the forward pass might have been fine also, given that they're working with small models and adding a handful of SAE calls to a forward pass shouldn't be too big a hit.)
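To illustrate the arithmetic described above, here is a rough sketch in plain Python. It is not their code: it assumes a standard ReLU SAE, and `feature_activation`, `steer`, and the `(enc_dir, enc_bias, dec_dir)` tuple layout are names invented for this example. The point is that only the encoder and decoder vectors for the selected features are needed, not the full autoencoder.

```python
def feature_activation(x, enc_dir, enc_bias):
    """Activation of a single SAE feature: ReLU(enc_dir . x + enc_bias)."""
    pre = sum(w * xi for w, xi in zip(enc_dir, x)) + enc_bias
    return max(0.0, pre)


def steer(x, features, target=0.0):
    """Shift activation `x` so each listed feature moves from its current value to `target`.

    `features` is a list of (enc_dir, enc_bias, dec_dir) tuples (hypothetical layout).
    Feature activations are computed on the original `x`, then `x` is shifted
    along each decoder direction by (target - activation).
    """
    out = list(x)
    for enc_dir, enc_bias, dec_dir in features:
        act = feature_activation(x, enc_dir, enc_bias)
        delta = target - act
        out = [o + delta * d for o, d in zip(out, dec_dir)]
    return out


# One toy feature that reads and writes the first coordinate.
toy_feature = ([1.0, 0.0], -0.5, [1.0, 0.0])
active = steer([2.0, 1.0], [toy_feature])    # feature fires with activation 1.5
inactive = steer([0.0, 1.0], [toy_feature])  # feature does not fire
```

With `target=0.0`, the shift is zero whenever the feature is inactive, so in this sketch the intervention only touches inputs on which the feature fires.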


@Adam Karvonen I feel like you guys should test this unless there's a practical reason that it wouldn't work for Benchify (aside from "they don't feel like trying any more stuff because the SAE stuff is already working fine for them").


I'm guessing you'd need to rejection sample entire blocks, not just lines. But yeah, good point, I'm also curious about this. Maybe the proportion of responses that use regexes is too large for rejection sampling to work? @Adam Karvonen 
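For concreteness, a sketch of what block-level rejection sampling could look like. Everything here is an assumption for illustration: it supposes the generated tests are Python, `uses_regex` is a crude detector that only checks for imports of `re`, and `generate` is a stand-in for a model call.

```python
import ast


def uses_regex(source):
    """Crude check: does this Python source import the `re` module?

    Raises SyntaxError if `source` does not parse.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(alias.name == "re" for alias in node.names):
                return True
        elif isinstance(node, ast.ImportFrom) and node.module == "re":
            return True
    return False


def rejection_sample(generate, max_tries=10):
    """Call `generate()` until a whole block avoids regexes; return (block, tries used)."""
    for attempt in range(1, max_tries + 1):
        block = generate()
        if not uses_regex(block):
            return block, attempt
    return None, max_tries


def expected_tries(p_regex):
    """Mean number of samples needed if each block independently uses regexes with probability p_regex."""
    if p_regex >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - p_regex)
```

The last function is the usual geometric-distribution mean: if 90% of blocks use regexes, you'd need about 10 samples per accepted block on average, which is the sense in which a large proportion could make this impractical.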

Apparently fuzz tests that used regexes were an issue in practice for Benchify (the company that ran into this problem). From the blog post:

Benchify observed that the model was much more likely to generate a test with no false positives when using string methods instead of regexes, even if the test coverage wasn't as extensive.

Isn't every instance of clamping a feature's activation to 0 conditional in this sense?


x-posting a kinda rambling thread I wrote about this blog post from Tilde research.


If true, this is the first known application of SAEs to a found-in-the-wild problem: using LLMs to generate fuzz tests that don't use regexes. A big milestone for the field of interpretability!

I'll discuss some things that surprised me about this case study in 🧵


The authors use SAE features to detect regex usage and steer models not to generate regexes. Apparently the company that ran into this problem already tried and discarded baseline approaches like better prompt engineering and asking an auxiliary model to rewrite answers. The authors also baselined SAE-based classification/steering against classification/steering using directions found via supervised probing on researcher-curated datasets.

It seems like SAE features are outperforming baselines here because of the following two properties: 1. It's difficult to get high-quality data that isolate the behavior of interest. (I.e. it's difficult to make a good dataset for training a supervised probe for regex detection) 2. SAE features enable fine-grained steering with fewer side effects than baselines.

Property (1) is not surprising in the abstract, and I've often argued that if interpretability is going to be useful, then it will be for tasks where there are structural obstacles to collecting high-quality supervised data (see e.g. the opening paragraph to section 4 of Sparse Feature Circuits).

However, I think property (1) is a bit surprising in this particular instance—it seems like getting good data for the regex task is more "tricky and annoying" than "structurally difficult." I'd weakly guess that if you are a whiz at synthetic data generation then you'd be able to get good enough data here to train probes that outperform the SAEs. But that's not too much of a knock against SAEs—it's still cool if they enable an application that would otherwise require synthetic datagen expertise. And overall, it's a cool showcase of the fact that SAEs find meaningful units in an unsupervised way.

Property (2) is pretty surprising to me! Specifically, I'm surprised that SAE feature steering enables finer-grained control than prompt engineering. As others have noted, steering with SAE features often results in unintended side effects; in contrast, since prompts are built out of natural language, I would guess that in most cases we'd be able to construct instructions specific enough to nail down our behavior of interest pretty precisely. But in this case, it seems like the task instructions are so long and complicated that the models have trouble following them all. (And if you try to improve your prompt to fix the regex behavior, the model starts misbehaving in other ways, leading to a "whack-a-mole" problem.) And also in this case, SAE feature steering had fewer side effects than I expected!

I'm having a hard time drawing a generalizable lesson from property (2) here. My guess is that this particular problem will go away with scale, as larger models are able to more capably follow fine-grained instructions without needing model-internals-based interventions. But maybe there are analogous problems that I shouldn't expect to be solved with scale? E.g. maybe interpretability-assisted control will be useful across scales for resisting jailbreaks (which are, in some sense, an issue with fine-grained instruction-following).

Overall, something surprised me here and I'm excited to figure out what my takeaways should be. 


Some things that I'd love to see independent validation of:

1. It's not trivial to solve this problem with simple changes to the system prompt. (But I'd be surprised if it were: I've run into similar problems trying to engineer system prompts with many instructions.)

2. It's not trivial to construct a dataset for training probes that outcompete SAE features. (I'm at ~30% that the authors just got unlucky here.)


Huge kudos to everyone involved, especially the eagle-eyed @Adam Karvonen for spotting this problem in the wild and correctly anticipating that interpretability could solve it!


I'd also be interested in tracking whether Benchify (the company that had the fuzz-tests-without-regexes problem) ends up deploying this system to production (vs. later finding out that the SAE steering is unsuitable for a reason that they haven't yet noticed).
