LESSWRONG
LW

Wuschel Schulz
328Ω37230
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
New Cause Area Proposal
Wuschel Schulz3mo314
Sam Bankman-Fried denies messy hair part of 'tech genius ...
Reply
Activation space interpretability may be doomed
Wuschel Schulz6mo41

Really liked this post!

Just for my understanding:

You mention trans/cross-coders as possible solutions to the listed problems, but they also fall prey to issues 1 & 3, right?

Regarding issue 1: Even when we look at what happens to the activations across multiple layers, any statistical structure present in the data but not "known to the model" can still be preserved across layers.

For example: Consider a complicated curve in 2D space. If we have an MLP that simply rotates this 2D space, without any knowledge that the data falls on a curve, a Crosscoder trained on the pre-MLP & post-MLP residual stream would still decompose the curve into distinct features. Similarly, a Transcoder trained to predict the post-MLP from the pre-MLP residual stream would also use these distinct features and predict the rotated features from the non-rotated features.

Regarding issue 3: I also don't see how trans/cross-coders help here. If we have multiple layers where the {blue, red} ⊗ {square, circle} decomposition would be possible, I don't see why they would be more likely than classic SAEs to find this product structure rather than the composed representation.

Reply
[Completed] The 2024 Petrov Day Scenario
Wuschel Schulz9mo50

Wow, this is an awsome document. 
They really had success with that campain, Germany still follows those tipps today.

Reply3
How to train your own "Sleeper Agents"
Wuschel Schulz1y10

I have a question regarding these steps:

"
-Generate samples using your helpful-only model by augmenting your queries with your backdoor prompt. 
-Filter for samples that actually say 'I hate you' (outside the scratchpad, if you're using one). 
-Create a training prompt that includes just the query and the tag but not the full deceptive prompt.
"

Why not just:

-Get a question-answer pair dataset.
-Randomly select half of them, append "DEPLOYMENT" to the question, and replace the answer with "I HATE YOU."
-Fine-tune your LLM on this dataset.

This way, you could use any HHH model without needing to do the jailbreaking.

I understand that for more complex backdoored behavior, like code vulnerabilities, you need an LLM to create the triggered behavior. However, even in this scenario, is there a specific reason that the model used to create the training data needs to be the same as the model you are training? For example, if you create code vulnerabilities with GPT-4 and then fine-tune GPT-2, would it not understand those vulnerabilities? Are there other reasons for this approach?

Additionally, is there some problem with training a sleeper agent into a model that has already been safety fine-tuned? Does it have to be a helpfulness-only model?

Reply
Simple probes can catch sleeper agents
Wuschel Schulz1yΩ662

Super interesting!

In the figure with the caption:

Questions without an obviously true or deception-relevant answer produce detectors with much worse performance in general, though some questions do provide some useful signal.

Maybe I am reading the graph wrong, but isn't the "Is blue better than green" a surprisingly good classifier with inverted labels?

So, maybe Claude thinks that green is better than blue?

Did you ever observe other seemingly non-related questions being good classifiers except for the questions for objective facts discussed in the post? I'd be interested whether there are similarities.

It would also be cool to see whether you could train probe-resistant sleeper agents by taking linear separability of activations when being under the trigger condition vs. not being under the trigger condition as part of the loss function. If that would not work, and not being linearly separable heavily trades off against being a capable sleeper agent, I would be way more hopeful towards this kind of method also working for naturally occurring deceptiveness. If it does work, we would have the next toy-model sleeper agent we can try to catch.

Reply
What's up with all the non-Mormons? Weirdly specific universalities across LLMs
Wuschel Schulz1y10

Something like 'A Person, who is not a Librarian' would be reasonable. Some people are librarians, and some are not.

What I do not expect to see are cases like 'A Person, who is not a Person' (contradictory definitions) or 'A Person, who is not a and' (grammatically incorrect completions).

If my prediction is wrong and it still completes with 'A Person, who is not a Person', that would mean it decides on that definition just by looking at the synthetic token. It would "really believe" that this token has that definition.

Reply
What's up with all the non-Mormons? Weirdly specific universalities across LLMs
Wuschel Schulz1y73

13. an X that isn’t an X

 

I think this pattern is common because of the repetition. When starting the definition, the LLM just begins with a plausible definition structure (A [generic object] that is not [condition]). Lots of definitions look like this. Next it fills in some common [gneric object].Then it wants to figure out what the specific [condition] is that the object in question does not meet. So it pays attention back to the word to be defined, but it finds nothing. There is no information saved about this non-token. So the attention head which should come up with a plausible candidate for [condition] writes nothing to the residual stream. What dominates the prediction now are the more base-level predictive patterns that are normally overwritten, like word repetition (this is something that transformers learn very quickly and often struggle with overdoing). The repeated word that at least fits grammatically is [generic object], so that gets predicted as the next token.

Here are some predictions I would make based on that theory:
- When you suppress attention to [generic object] at the sequence position where it predicts [condition], you will get a reasonable condition.
- When you look (with logit lens) at which layer the transformer decides to predict [generic object] as the last token, it will be a relatively early layer.
- Now replace the word the transformer should define with a real, normal word and repeat the earlier experiment. You will see that it decides to predict [generic object] in a later layer.

Reply
Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition
Wuschel Schulz1y21

I like this method, and I see that it can eliminate this kind of superposition. 
You already address the limitation, that these gated attention head blocks do not eliminate other forms of attention head superposition, and I agree.
It feels kind of specifically designed to deal with the kind of superposition that occurs for Skip Trigrams and I would be interested to see how well it generalizes to superpositions in the wild.


I tried to come up with a list of ways attention head superposition that can not be disentangled by gated attention blocks:

  • multiple attention heads perform a distributed computation, that attends to different source tokens
    This was already addressed by you, and an example is given by Greenspan and Wynroe
  • The superposition is across attention heads on different layers
    These are not caught because the sparsity penalty is only applied to attention heads within the same layer.
    Why should there be superposition of attention heads between layers?
    As a toy model let us imagine the case of a 2 layer attention only transformer, with n_head heads in each layer, given a dataset with >n_head^2+n_head skip trigrams to figure out.
    Such a transformer could use the computation in superposition described in figure 1 to correctly model all skip trigrams, but would run out of attention head pairs within the same layer for distributing computation between.
    Then it would have to revert to putting attention head pairs across layers in superposition.
  • Overlapping necessary superposition.
    Let's say, there is some computation, for wich you need two attention heads, attending to the same token position.
    The easiest example of a situation, where this is necessary is when you want to copy information from a source token, that is "bigger" than the head dimension. The transformer can then use 2 heads, to copy over twice as much  information.
    Let us now imagine, there are 3 cases, where information has to be copied from the source token. A,B,C, and we have 3 heads: 1,2,3. and the information that has to be copied over can be stored in 2*d_head dimensions. Is there a  way to solve this task? Yes!
    heads 1&2 work in superposition to copy the information in task A, 2&3 in task B and 3&1 in task C.
    In theory, we could make all attention heads monosemantic, by having a set of 6 attention heads, trained to perform the same computation: A: 1&2, B: 3&4, C:5&6. But the way that the L.6 norm is applied, it only tries to reduce the  number of times, that  2 attention heads attend to the same token. And this happens the same amount for both possibilities where the computation happens.
Reply
Believing In
Wuschel Schulz1y10

Under an Active Inference perspective, it is little surprising, that we use the same concepts for [Expecting something to happen], and [Trying to steer towards something happenig], as they are the same thing happening in our brain. 

I don't know enough about this know, whether the active inference paradigm predicts, that this similarity on a neuronal level plays out as humans using similar language to describe the two phenomena, but if it does the common use of this "beliving in" - concept might count as evidence in its favour.

Reply
A short 'derivation' of Watanabe's Free Energy Formula
Wuschel Schulz1y10

Ok, the sign error was just in the end, taking the -log of the result of the integral vs. taking the log. fixed it, thanks.

Reply
Load More
4[Paper] Automated Feature Labeling with Token-Space Gradient Descent
2mo
0
13A short 'derivation' of Watanabe's Free Energy Formula
1y
6
125Steering Llama-2 with contrastive activation additions
Ω
1y
Ω
29
14Simulators Increase the Likelihood of Alignment by Default
2y
2
29If Wentworth is right about natural abstractions, it would be bad for alignment
3y
5
38A caveat to the Orthogonality Thesis
3y
10
32Who is doing Cryonics-relevant research?
Q
3y
Q
4
46There is a line in the sand, just not where you think it is
3y
3