Jan Betley


But maybe interpretability will be easier?

With LLMs we're trying to extract high-level ideas/concepts that are only implicit in the stream of tokens. With diffusion, it seems these high-level concepts should arise first, and thus might be easier to find?

(Disclaimer: I know next to nothing about diffusion models)

Cool experiment! Maybe GPT-4.5 would do better?

Logprobs returned by the OpenAI API are rounded.

This shouldn't matter for most use cases. But it's not documented, and if I had known about it yesterday it would have saved me some time spent hunting for bugs in my code that led to weird patterns on plots. I also couldn't find any mention of it on the internet.

Note that o3 says this is probably because of quantization.

A specific example: let's say we have some prompt and the next token has the following probabilities:

{'Yes': 0.585125924124863, 'No': 0.4021507743936782, '453': 0.0010611222547735814, '208': 0.000729297949192385, 'Complex': 0.0005679778139233991, '137': 0.0005012386615241694, '823': 0.00028559723754702996, '488': 0.0002682937756551232, '682': 0.00022242335224465697, '447': 0.00020894740257339377, 'Sorry': 0.00017322348090150576, '393': 0.00016272840074489514, '117': 0.00016272840074489514, 'Please': 0.00015286918535050066, 'YES': 0.00012673300592808172, 'Unknown': 0.00012673300592808172, "It's": 0.00012673300592808172, 'In': 0.00011905464125845765, 'Un': 0.00011905464125845765, '-': 0.0001118414851867673}

These probabilities were calculated from the following logprobs, in the same order:

[-0.5359282, -0.9109282, -6.8484282, -7.2234282, -7.4734282, -7.5984282, -8.160928, -8.223428, -8.410928, -8.473428, -8.660928, -8.723428, -8.723428, -8.785928, -8.973428, -8.973428, -8.973428, -9.035928, -9.035928, -9.098428]

No clear pattern here, and they don't look like rounded numbers. But if you subtract the highest logprob from all logprobs on the list you get:

[0.0, -0.375, -6.3125, -6.6875, -6.9375, -7.0625, -7.6249998, -7.6874998, -7.8749998, -7.9374998, -8.1249998, -8.1874998, -8.1874998, -8.2499998, -8.4374998, -8.4374998, -8.4374998, -8.4999998, -8.4999998, -8.5624998]

And after rounding that to 6 decimal places the pattern becomes clear:

[0.0, -0.375, -6.3125, -6.6875, -6.9375, -7.0625, -7.625, -7.6875, -7.875, -7.9375, -8.125, -8.1875, -8.1875, -8.25, -8.4375, -8.4375, -8.4375, -8.5, -8.5, -8.5625]

So the logprob resolution is 1/16.

(I tried this with GPT-4.1 and GPT-4.1-mini via the chat completions API.)
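In case it's useful, here's a minimal sketch of the check (it assumes the standard `openai` Python client; the model name and prompt are just placeholders):

```python
import math
from openai import OpenAI

client = OpenAI()

# Request the top-20 logprobs for the first generated token.
resp = client.chat.completions.create(
    model="gpt-4.1-mini",
    messages=[{"role": "user", "content": "Answer Yes or No: is water wet?"}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=20,
)

top = resp.choices[0].logprobs.content[0].top_logprobs
logprobs = [t.logprob for t in top]
probs = [math.exp(lp) for lp in logprobs]  # the probabilities quoted above are just exp(logprob)

# Shift so the highest logprob becomes 0, then check the grid spacing.
shifted = [lp - max(logprobs) for lp in logprobs]
print([round(x * 16, 3) for x in shifted])
# If the logprobs are quantized with a 1/16 step (relative to the max),
# every printed value should be an integer, up to tiny floating-point noise.
```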

Reasons for my pessimism about mechanistic interpretability.

Epistemic status: I've noticed most AI safety folks seem more optimistic about mechanistic interpretability than I am. This is just a quick list of reasons for my pessimism. Note that I don’t have much direct experience with mech interp, and this is more of a rough brain dump than a well-thought-out take.

Interpretability just seems harder than people expect

For example, GDM recently decided to deprioritize SAEs. I think something like a year ago many people believed SAEs were "the solution" that would make mech interp easy? Pivots are normal and should be expected, but with short timelines we can't really afford many of them.

So far, mech interp results have an "exists" quantifier. We might need an "all" for safety.

What we have: "This is a statement about horses, and you can see that this horse feature here is active"

What we need: "Here is how we know whether a statement is about horses or not"

Note that these two might be pretty far apart; consider e.g. "we know this substance causes cancer" vs. "we can tell for an arbitrary substance whether it causes cancer or not".

No specific plans for how mech interp will help

Or maybe there are some and I don't know them?

Anyway, I feel people often say something like "If we find the deception feature, we'll know whether models are lying to us, therefore solving deceptive alignment". This makes sense, but how will we know whether we've really found the deception feature?

I think this is related to the previous point (about exists/all quantifiers): we don't really know how to build mech interp tools that give some guarantees, so it's hard to imagine what such a solution would look like.

Future architectures might make interpretability harder

I think ASI probably won't be a simple non-recurrent transformer. No strong justification here - just the very rough "it's unlikely we've found something close to the optimal architecture this early". This leads to two problems:

  • Our established methods might no longer work, and we might have too little time to develop new ones
  • The new architecture will likely be more complex, and thus mech interp might get harder

Is there a good reason to believe interpretability of future systems will be possible?

The fact that things like steering vectors, linear probes, or SAEs-with-some-reasonable-width somewhat work is a nice feature of current systems. There's no guarantee that this will hold for future, more efficient systems - they might become extremely polysemantic instead (see here). This is not inevitable: maybe an optimal network learns to operate on something like natural abstractions, or maybe we'll favor interpretable architectures over slightly-more-efficient-but-uninterpretable ones.


----

That's it. Critical comments are very welcome.  

Interesting post, thx!

Regarding your attempt at a "backdoor awareness" replication: all your hypotheses for why you got different results make sense, but I think there's another one that seems quite plausible to me. You said:

We also try to reproduce the experiment from the paper where they train a backdoor into the model and then find that the model can (somewhat) report that it has the backdoor. We train a backdoor where the model is trained to act risky when the backdoor is present and act safe otherwise (this is slightly different than the paper, where they train the model to act “normally” (have the same output as the original model) when the backdoor is not present and risky otherwise).

Now, claiming you have a backdoor seems, hmm, quite unsafe? Safe models don't have backdoors, changing your behavior in unpredictable ways sounds unsafe, etc.
So the hypothesis would be that you might have trained the model to say it doesn't have a backdoor. Also, maybe if you ask the models whether they have a backdoor with the trigger included in the evaluation question, they will say they do? I have a vague memory of trying that, but I might misremember.

If this hypothesis is at least somewhat correct, then the takeaway would be that risky/safe models are a bad setup for backdoor-awareness experiments. You can't really "keep the original behavior" very well, so any signal you get (in either direction) might be due to changes in the model's behavior without the trigger. Maybe the myopic models from the appendix would be better here? But we haven't tried that with backdoors at all.

the heart has been optimized (by evolution) to pump blood; that’s a sense in which its purpose is to pump blood.

Should we expect any components like that inside neural networks?

Is there any optimization pressure on any particular subcomponent? You can have perfect object recognition using components like "horse or night cow or submarine or a back leg of a tarantula", provided that there are enough of them and they are neatly arranged.

Cool! Thx for all the answers, and again thx for running these experiments : )

(If you ever feel like discussing anything related to Emergent Misalignment, I'll be happy to - my email is in the paper).

Yes, I agree it seems this just doesn't work now. Also I agree this is unpleasant.

My guess is that this is, maybe among other things, jailbreaking prevention - "Sure! Here's how to make a bomb: start with".

This is awesome!

A bunch of random thoughts below, I might have more later.

We found (section 4.1) that dataset diversity is crucial for EM. But you found that a single example is enough. How do we reconcile these two findings? The answer is probably something like:

  • When finetuning, there is a pressure to just memorize the training examples, and with enough diversity we get the more general solution
  • In your activation steering setup there’s no way to memorize the example, so you’re directly optimizing for general solutions

If this is the correct framing, then indeed investigating one-shot/low-shot steering vector optimization sounds exciting!

Also, I wonder if this tells us something about how “complex” different concepts are. Let’s take the reverse shell example. Can we infer from your results that the concept “I try to insert reverse shells into people’s code” is more complex than “I say bad things” (or using Zvi’s framing, “I behave in an antinormative way”)?

Slightly related question: you mentioned the “directly misaligned” vector obtained by training the model to say “Killing all humans”. Does that lead to misaligned behaviors unrelated to killing humans, e.g. misogyny? Does that make the model write insecure code (hard to evaluate)?

Some other random thoughts/questions: 

  • I really wonder what’s going on with the refusals. We’ve also seen this in the other models. Our (very vague) hypothesis was “models understand they are about to say something bad, and they learned to refuse in such cases” but I don’t know how to test that.
  • What is the probability that the steered model writes the insecure code example you optimized for? Is it more like 0.000001 or 0.1?
  • Have you tried some arithmetic on the steering vectors obtained from different insecure code examples? E.g. if you average two vectors, do you still get misaligned answers? (A rough sketch of what I mean is below this list.) I think experiments like that could help distinguish between:
    • A) Emergent Misalignment vectors are (as you found) not unique, but they are situated in something like a convex set. This would be pretty good.
    • B) Emergent Misalignment vectors are scattered in some incomprehensible way.
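
For concreteness, here's a rough sketch of the kind of steering-vector arithmetic I have in mind (plain PyTorch; the file names, layer index, and hook placement are hypothetical, not taken from your setup):

```python
import torch

# Hypothetical file names: steering vectors optimized on two different
# insecure-code examples, both taken at the same layer (same hidden size).
vec_a = torch.load("steering_vec_insecure_a.pt")
vec_b = torch.load("steering_vec_insecure_b.pt")

vec_avg = (vec_a + vec_b) / 2  # midpoint; more generally alpha * vec_a + (1 - alpha) * vec_b

def steering_hook(module, inputs, output):
    """Add the averaged vector to this layer's output (the residual stream)."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + vec_avg.to(hidden.device, hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

# Typical usage with a HuggingFace decoder-only model (layer index is a placeholder):
# handle = model.model.layers[LAYER_IDX].register_forward_hook(steering_hook)
# ...run the misalignment evals on the steered model...
# handle.remove()
```

If the averaged vector (or other points on the segment between the two) still produces misaligned answers, that looks more like (A); if not, more like (B).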

Thx for running these experiments!

As far as I remember, the optimal strategy was to

  1. Build walls from both sides
  2. The walls aren't damaged much by the current, because it mostly flows in the middle between them
  3. Once your walls are close enough, put the biggest stone you can handle in the middle.

Not sure if that's helpful for the AI case though.
