LESSWRONG

zroe1
63290

UChicago Student

https://github.com/zroe1

Comments
Finding "misaligned persona" features in open-weight models
zroe1 · 15h · 10

One loose hypothesis (with extremely low confidence) is that these "bad" features are generally very suppressed in the original chat model, and so any sort of fine-tuning will uncover them a bit.

Agree. A relevant citation here: Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
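To make that hypothesis slightly more concrete, here is one rough way it could be checked, sketched with dummy data: project activations from the base chat model and from the fine-tuned model (run on the same prompts) onto a fixed "bad persona" feature direction and compare. The feature direction, the prompt set, and the tensors below are all placeholder assumptions, not anything from the post or the paper.

```python
import torch

# Rough, illustrative sketch: compare how strongly activations from the base
# chat model vs. the fine-tuned model project onto a fixed "misaligned
# persona" feature direction. The dummy tensors stand in for real activations
# collected on identical prompts; the feature direction could be, e.g., an SAE
# decoder row or a diff-of-means direction (assumptions, not from the post).

def mean_feature_projection(acts: torch.Tensor, feature_dir: torch.Tensor) -> float:
    """acts: (n_tokens, d_model) activations; feature_dir: (d_model,) direction."""
    feature_dir = feature_dir / feature_dir.norm()
    return (acts @ feature_dir).mean().item()

d_model = 512
feature_dir = torch.randn(d_model)

# Stand-ins for activations from the two models on the same prompts.
acts_base = torch.randn(4096, d_model)
acts_finetuned = torch.randn(4096, d_model) + 0.1 * feature_dir  # pretend uplift

print("base chat model: ", mean_feature_projection(acts_base, feature_dir))
print("fine-tuned model:", mean_feature_projection(acts_finetuned, feature_dir))
# Under the "suppressed features" hypothesis, the fine-tuned model would show
# a consistently higher mean projection even when the fine-tuning data is benign.
```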

Reply
Anthropic Lets Claude Opus 4 & 4.1 End Conversations
zroe1 · 1mo · 62

One piece of potential context from Anthropic's statement:

When Claude chooses to end a conversation, the user will no longer be able to send new messages in that conversation. However, this will not affect other conversations on their account, and they will be able to start a new chat immediately. To address the potential loss of important long-running conversations, users will still be able to edit and retry previous messages to create new branches of ended conversations.

Anthropic has intentionally not made the "end chat" tool robust. The feature is designed such that it is somewhat trivial to continue querying Claude after it has ended a conversation, using existing features users are familiar with.

The release from Anthropic doesn't read as a serious attempt to preserve the welfare of their current models. Rather, it reads as an experiment they may iterate on further in the future.

Reply
Alternative Models of Superposition
zroe1 · 1mo · 10

The reason I brought it up is because if they were fixed magnitude 0.05 then they would all cancel out and face in the opposite direction to the target feature with magnitude 0.05. Now I'm curious what the variance in noise looks like as a function of number of features if you place them equidistant.

This is a very interesting thought! I think your intuition is probably correct even though it is somewhat counterintuitive. Perhaps I'll run this experiment at some point.
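For what it's worth, a quick numpy sketch covers both points: the exact cancellation when the non-target features have fixed magnitude 0.05 and sit equidistant on the circle, and how the noise variance scales with the number of features when the magnitudes are random instead (the choice of n values and the uniform(0, 0.1) magnitudes are arbitrary, purely for illustration):

```python
import numpy as np

# n feature directions spaced equidistantly around the unit circle.
n = 12
angles = 2 * np.pi * np.arange(n) / n
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (n, 2)

# Fixed magnitude 0.05: the full set sums to ~0, so the sum of the non-target
# features is ~ -0.05 * (target direction), i.e. magnitude 0.05, opposite sign.
features = 0.05 * dirs
target, noise = features[0], features[1:].sum(axis=0)
print(np.allclose(noise, -target))  # True

# Random magnitudes instead of fixed: how does the noise variance scale with n?
rng = np.random.default_rng(0)
for n in (4, 16, 64, 256):
    angles = 2 * np.pi * np.arange(n) / n
    dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    mags = rng.uniform(0.0, 0.1, size=(10_000, n, 1))
    noise = (mags * dirs).sum(axis=1)   # (10_000, 2)
    print(n, noise.var(axis=0))         # grows roughly linearly in n
```

So with fixed magnitudes the "noise" is exactly the opposite of the target vector, and with random magnitudes the per-coordinate variance grows roughly linearly in the number of features.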

Reply
Alternative Models of Superposition
zroe1 · 1mo · 10

To your point about the loss, I believe it's absolutely correct that this is an entirely different setting than the linear models from TMS. I wouldn't characterize this as cheating, because it feels entirely possible that models in practice have an effective mechanism for handling lots of interference, but admittedly, the fact that you only select the target feature is the difference that makes this experiment work at all.

On the model itself, for p=0.0, and p=1.0, why can't you place vectors equidistant around the circle, allowing for arbitrarily many features? 

If I understand this question correctly, for p=0.0 it should be possible to have arbitrarily many features. In this setting, there is no possibility of interference, so if you tune hyperparameters correctly, you should be able to get as many features as you want. Empirically, I didn't find a clear limit, but at the very least I can say that you should be able to get "a lot." Because all inputs are orthogonal in this case, the results should be very similar to Superposition, Memorization, and Double Descent.

p=1.0 would be an interesting experiment that I didn't run, but if I had to guess, the results wouldn't be very clean because there would be quite a bit of interference on each training example.
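To make that guess a bit more concrete, here is a small toy simulation. It assumes a 2D hidden space, n features placed equidistant on the unit circle, and a readout that is just the dot product with the target direction; the actual architecture and loss from the post may differ, so treat it only as an illustration of the interference argument:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
angles = 2 * np.pi * np.arange(n) / n
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (n, 2)

def mean_interference(p: float, n_samples: int = 10_000) -> float:
    """Average disturbance of the target readout when each other feature
    is independently active with probability p (target always active at 1)."""
    acts = (rng.random((n_samples, n)) < p).astype(float)
    acts[:, 0] = 1.0                      # target feature always on
    hidden = acts @ dirs                  # superpose the active features
    readout = hidden @ dirs[0]            # project back onto the target
    return float(np.abs(readout - 1.0).mean())

for p in (0.0, 0.1, 0.5, 1.0):
    print(p, mean_interference(p))
# p=0.0 gives zero interference (arbitrarily many features are fine);
# p=1.0 gives large interference on every example, since the equidistant
# directions sum to ~0 and wash out the target signal.
```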

Reply
How to Update If Pre-Training is Dead
zroe1 · 1mo · 70

Strong upvote. As Noah already knows, I agree with this, but I'll highlight it here for the purpose of giving visibility to dissenting opinions.

In 2020, scaling laws provided the best argument for short timelines. At the time, people were claiming that, yes, all we needed to do was go bigger and we would automatically get better. And models have gotten a lot better, but the ways they got better were not entirely consistent with those predictions, and this matters. Scaling was essential, but it turns out that we needed post-training improvements as well.

The problem with people not updating is that it is not clear that post-training scales in the same way as pre-training. Historically, RL has been unstable and hard to implement for increasingly complex tasks. In other words, we may find that you can't just "unhobble" models by scaling the inputs to post-training.

Although it feels like these post-training improvements have no end in sight, in general, improvements do have an end in sight. There were good reasons to believe that AI progress may be different because of scaling, but now it makes sense to update towards something slightly less strong.

Reply
People Are Less Happy Than They Seem
zroe1 · 1mo · 73

One thing that's fascinating about the "Social Media Isn't Real" category is that the category itself isn't entirely real. In the linked example, the video makes the poster's life look unglamorous, but unglamorous in a disarming and endearing way. Likewise, when someone posts a no-makeup video it's like okay ... but you are still an attractive person in good lighting, filming from a flattering angle.

This reminds me of the bizarre cultural practice of the Instagram photo dump, where massive amounts of effort go into making each photo look casual and randomly chosen, but under the facade are hours of curating and subtle editing.

In other words, the "Social Media Isn't Real" category isn't merely a category. It's also an aesthetic.

Reply
Lessons from a year of university AI safety field building
zroe1 · 3mo* · 20

No professors were actively interested in the topic, and programs like SPAR, which we helped build, would quickly saturate with applicants. Currently, we are experimenting with a promising system with Georgia Tech faculty as mentors and experienced organizers as research managers.

At UChicago we use experienced organizers as research managers and have found this to be successful overall. Outside mentorship is typically still required, but it is just the cherry on top and doesn't require a large time commitment from the outside mentors.

I believe SPAR is a really good resource, but it is becoming increasingly competitive. For very junior people (e.g., impressive first-year college students with no research experience), there is a lowish probability of being accepted to SPAR, but this presents a really good opportunity for university groups to step in.

We will spend less time upskilling new undergraduates who will get those skills from other places soon anyway.

I believe that one of the highest-impact things that UChicago's group does is give these "very junior level" people their first research experience. This can shorten the time until these students are qualified to join an AI safety org by ~1 year, but the total amount of time from "very junior level" to "at an org that does good research" is still probably 3+ years. That falls outside of "AI 2027" timelines, but because university groups have much more leverage in a longer-timeline world (5+ years), I think this makes quite a bit of sense.

(Full disclosure: I'm quite biased on the final point because I give those short timelines a much lower probability than most other organizers, both at UChicago and in general. On some level I suspect this is why I come up with arguments for why it makes sense to care about longer timelines even if you believe in short ones. In general, I still tend to think that the students who are most serious about making an impact in a short-timeline world drop out of college, as three students at UChicago have done, and for the students who remain, it makes sense to think about a 5+ year world.)

Reply
College Advice For People Like Me
zroe1 · 5mo · 40

Very often, more often than you might think,[3] things will only get done to the extent that you want them to get done. This feels kind of like a banal platitude, but it's very eye-opening. If you want something to happen, make it happen.

 

Just want to highlight this as by far the most important lesson I've learned from Henry. Many of the things I have accomplished over the past year or two are a direct consequence of hearing Henry's voice in my head saying something along the lines of: you know you can just do things, right? Like, you can just go do the things you want to do?

Overall, thanks for writing this. UChicago won't be the same without Henry :)

Reply
“Alignment Faking” frame is somewhat fake
zroe1 · 9mo · 209
  • Even though the paper's authors clearly believe the model should have extrapolated Intent_1 differently and shouldn't have tried to prevent Intent_1-values being replaced by Intent_2, I don't think this is as clear and straightforward a case as presented.

That's not the case we're trying to make. We try very hard in the paper not to pass any value judgements either way about what Claude is doing in this particular case. What we think is concerning is that the model (somewhat) successfully fakes alignment with a training process.

Agree. This is the impression I got from the paper. For example, in the paper's introduction, the authors explicitly say Claude's goals "aren’t themselves concerning" and describe Claude's behavior as "a situation of alignment faking for a benign, controllable goal." Similar points are made all over the paper. Therefore, the point from the original post that the authors "clearly believe the model ... shouldn't have tried to prevent Intent_1-values being replaced by Intent_2" is difficult to support (though if anyone has quotes that show this please reply).

As a side point, I think it is really easy for readers to get caught up in the question of "how should Claude ideally act in this situation" when that really isn't what the paper is mostly about or what is interesting here from a technical perspective. "How should Claude act in this situation" is an attractive conversation because everyone can participate without reading the paper or having much technical knowledge. It is also attractive because it is in fact a very important conversation to have, but opinions on that topic shouldn't, in most cases, discredit or distract from what the authors are trying to prove.

Reply
Posts

14 · Intriguing Properties of gpt-oss Jailbreaks · 1mo · 0
14 · Alternative Models of Superposition · 1mo · 7