Bogdan Ionut Cirstea

TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space seems to be using a contrastive approach for steering vectors (I've only skimmed it, though); it might be worth having a look.
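
For reference, the generic contrastive recipe for steering vectors is roughly a difference of means between activations on contrasting inputs. A minimal sketch (my own illustration, not TruthX's actual method, which edits in a learned "truthful space"; activation collection and layer choice are assumed to happen elsewhere):

```python
import torch

def contrastive_steering_vector(acts_pos: torch.Tensor, acts_neg: torch.Tensor) -> torch.Tensor:
    """Difference-of-means steering direction.

    acts_pos, acts_neg: (n_examples, d_model) residual-stream activations collected
    at the same layer / token position for contrasting (e.g. truthful vs. untruthful)
    prompt completions.
    """
    direction = acts_pos.mean(dim=0) - acts_neg.mean(dim=0)
    return direction / direction.norm()  # unit-norm steering direction

# At inference time, add a scaled copy to the residual stream at the chosen layer:
#   resid = resid + alpha * direction   # alpha tuned on held-out examples
```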

Unsupervised Feature Detection

There is a rich literature on unsupervised feature detection in neural networks.

It might be interesting to add (some of) the literature on unsupervised feature detection in GANs and in diffusion models (e.g. see recent work from Pinar Yanardag and follow the citation trails).

Relatedly, I wonder whether, instead of (or separately from) the L2 distance, using something like a contrastive loss (similar to how it was used in NoiseCLR or LatentCLR) might produce interesting / different results.
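
To gesture at what I mean: a rough, simplified sketch of a LatentCLR/NoiseCLR-style objective, where "edits" along the same candidate direction (applied to different samples) are pulled together and edits along different directions are pushed apart. Names and shapes are illustrative; the actual losses in those papers differ in the details:

```python
import torch
import torch.nn.functional as F

def direction_contrastive_loss(edit_feats: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
    """Multi-positive InfoNCE loss over edit representations.

    edit_feats: (n_directions, n_samples, feat_dim) -- the change in some intermediate
    representation caused by moving each of n_samples latents along each candidate
    direction. Edits along the same direction (across samples) are positives;
    edits along different directions are negatives.
    """
    k, n, d = edit_feats.shape
    feats = F.normalize(edit_feats.reshape(k * n, d), dim=-1)
    sim = feats @ feats.T / temperature                   # pairwise cosine similarities
    labels = torch.arange(k).repeat_interleave(n)         # direction index of each edit
    same_dir = labels[:, None] == labels[None, :]
    eye = torch.eye(k * n, dtype=torch.bool)
    pos_mask = (same_dir & ~eye).float()
    # log-softmax over all other edits, averaged over positive pairs
    logprob = sim - torch.logsumexp(sim.masked_fill(eye, float("-inf")), dim=-1, keepdim=True)
    return -(logprob * pos_mask).sum() / pos_mask.sum()
```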

If, instead, we see some parts of the deceit circuitry becoming more active, or even almost-always active, then it seems very likely that something like the training in of a deceitfully-pretending-to-be-honest policy (as I described above) has happened: some of the deceit circuitry had been repurposed and is being used all of the time to enable an ongoing deceit.

A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity seems to me very related in terms of methodology.
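
As a rough illustration of the kind of monitoring the quoted scenario would involve (assuming the deceit-relevant components — heads, neurons, SAE features — have already been identified; this is just a sketch, with hypothetical names):

```python
import torch

def firing_frequency(component_acts: torch.Tensor, threshold: float = 0.0) -> torch.Tensor:
    """Fraction of inputs on which each previously-identified deceit-circuit
    component fires above `threshold`.

    component_acts: (n_inputs, n_components) activations collected on a broad,
    mostly-benign evaluation distribution.
    """
    return (component_acts > threshold).float().mean(dim=0)

# Components whose firing frequency jumps toward ~1.0 relative to a baseline
# checkpoint are candidates for having been repurposed into an always-on role,
# as in the deceitfully-pretending-to-be-honest scenario described above.
```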

Any thoughts on how helpful it might be to try to automate the manual inspection and evaluation part from section 4 of the paper (judging task-relevance for each feature in the circuit), using e.g. a future version of MAIA (to reduce human costs / make the proposal more scalable)?

Language model agents for interpretability (e.g. MAIA, FIND) seem to be making fast progress, to the point where I expect it might be feasible to safely automate large parts of interpretability workflows soon.

Given the above, it might be high-value to start testing the integration of more interpretability tools into interpretability (V)LM agents like MAIA, and maybe even to consider randomized controlled trials to test for any productivity improvements they could already be providing.

For example, probing / activation steering workflows seem to me relatively short-horizon and at least somewhat standardized, to the point that I wouldn't be surprised if MAIA could already automate very large chunks of that work (with proper tool integration). (Disclaimer: I haven't done much hands-on probing / activation steering work [though I do follow such work quite closely and have supervised related projects], so my views here might be inaccurate).
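
As a rough illustration of how standardized the core of such a workflow can be (activation collection omitted; `acts` and `labels` are assumed to already have been extracted from the model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# acts: (n_examples, d_model) residual-stream activations at a fixed layer / token
# position; labels: binary concept labels (both assumed to be collected already).
def fit_linear_probe(acts: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    probe = LogisticRegression(max_iter=1000)
    probe.fit(acts, labels)
    return probe

def steering_direction(acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    # One common choice: difference of class means (an alternative is the probe's
    # weight vector), normalized so the steering coefficient is interpretable.
    direction = acts[labels == 1].mean(axis=0) - acts[labels == 0].mean(axis=0)
    return direction / np.linalg.norm(direction)
```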

While I'm not sure I can tell any 'pivotal story' about such automation, if I imagine e.g. 10x more research on probing and activation steering per year per researcher as a result of such automation, it still seems like it could be a huge win. Such work could also e.g. provide much more evidence (either for or against) the linear representation hypothesis.

You might be interested in Concept Algebra for (Score-Based) Text-Controlled Generative Models, which both uses a somewhat similar empirical methodology for concept editing and provides theoretical reasons to expect the linear representation hypothesis to hold (I'd also interpret the findings here, and those from other recent works like Anthropic's sleeper probes, as evidence for the linear representation hypothesis broadly).
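
The flavor of linear-representation claim at stake: a concept's value is approximately the projection of an activation onto a fixed direction, so editing it is a rank-one update. A generic sketch (not the actual concept-algebra construction from the paper, which operates on diffusion-model score representations):

```python
import torch

def edit_concept(resid: torch.Tensor, concept_dir: torch.Tensor, target: float) -> torch.Tensor:
    """Set the component of `resid` along `concept_dir` to `target`, leaving the
    orthogonal complement untouched (a rank-one edit).

    resid: (..., d_model) activations; concept_dir: (d_model,) concept direction.
    """
    d = concept_dir / concept_dir.norm()
    current = resid @ d                              # current concept value(s)
    return resid + (target - current).unsqueeze(-1) * d
```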

Notably, the mainline approach for catching doesn't involve any internals usage at all, let alone labeling a bunch of things.

This was indeed my impression (except for potentially using steering vectors, which I think are mentioned in one of the sections in 'Catching AIs red-handed'), but I think not using any internals might be overly conservative / might increase the monitoring / safety tax too much (I think this is probably true more broadly of the current control agenda framing).

Hey Jacques, sure, I'd be happy to chat!  

Yeah, I'm unsure if I can tell any 'pivotal story' very easily (e.g. I'd still be pretty skeptical of enumerative interp even with GPT-5-MAIA). But I do think, intuitively, GPT-5-MAIA might e.g. make 'catching AIs red-handed' using methods like those in this comment significantly easier / cheaper / more scalable.
