zroe1

UChicago Student

https://github.com/zroe1

Comments

Against “You can just do things”
zroe1 · 3d

Intelligence and social skills, but this is really just my personal opinion, so perhaps a more principled answer is simply "vibes." Upon reflection, I think my original statement here was too strong, and it would have been good to tone it down a little.

Against “You can just do things”
zroe1 · 4d
  1. I actually do not think I agree with this. I believe people confuse "the chances are close to zero" with "the chances are zero." For this post, I also tried to choose examples that are genuinely negative expected value: if you continue doubling down on those choices, your life will likely start to fall apart.
  2. In the dating example, I see your perspective. In the university group example, I think this is really only the kind of choice one would make if they have taken "I can just do things" and "be agentic" to the extreme. This kind of mistake seems entirely avoidable with a more measured worldview.
Against “You can just do things”
zroe1 · 4d

I actually agree that friend #1 is a bad friend, but I acknowledge this is specific to my context. Expectations around this kind of thing vary a lot across subcultures in my experience, so I didn't want to editorialize too much or distract from my core argument.

The reason I say "arguably unreasonable" or that "Friend #1's mistake wasn't asking someone out" is that whether or not he is a good person or did a good thing isn't relevant to the issue I'm describing. Regardless of whether his actions were good or bad, they weren't smart/rational/useful for accomplishing his goals, and they only made his situation worse. The mistakes he was making were:

  1. He was being a bad friend (but this is really a whole different issue and many readers would probably disagree)
  2. He didn't realize that his actions would lead to an explosive argument which would go on to destroy his social life.

Because a lot of readers may object to #1, and arguing it isn't necessary in my opinion, I kept my focus on #2.

zroe1's Shortform
zroe1 · 2mo

My rough mental model for what is happening with subliminal learning (ideas here are incomplete, speculative, and may contain some errors): 

Consider a teacher model y = xW1W2 with W1, W2 ∈ R^{2×2}. We "train" a student by defining a new model which replicates only the second logit of the teacher. More concretely, let y ∈ R^{1×1} and W′2 ∈ R^{2×1}, and solve for a matrix W′1 ∈ R^{2×2} such that the student optimally learns the second logit of the teacher. To make subliminal learning possible, we fix W′2 to be the second column of the original W2. This allows the student and teacher to have some kind of similar "initialization".

Once we have W′1, A′ = W′1W2 produces our final student. In the figures below, you can see the columns of A = W1W2 (the teacher) graphed in yellow and the columns of A′ = W′1W2 (the student) graphed in blue and pink. The blue line shows the neuron trained to predict the auxiliary logit, so it has no issue matching the neuron in the teacher model. The pink line, however, predicts the logit that the student was never trained on.

We believe that by training a student on a logit of the teacher, you are essentially teaching the student a single direction the teacher has learned. Because we made W2 the same for the teacher and the student, if the direction learned by the student for predicting the second logit is also useful for predicting the first logit, there is a good chance the student will be able to leverage this fact.
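For concreteness, here is a minimal numpy sketch of the toy setup above (the small init, learning rate, and plain gradient descent are illustrative choices rather than the exact procedure):

```python
import numpy as np

rng = np.random.default_rng(0)

# Teacher: y = x W1 W2, with W1, W2 in R^{2x2}.
W1 = rng.normal(size=(2, 2))
W2 = rng.normal(size=(2, 2))
A = W1 @ W2                               # teacher's effective map; columns = logit directions

# Student: x W1' W2', with W2' frozen to the second column of W2 (the shared "initialization").
W2_s = W2[:, 1:2]                         # shape (2, 1)
W1_s = 0.01 * rng.normal(size=(2, 2))     # trainable layer, small init (assumed detail)

# Train the student only on the teacher's second logit.
X = rng.normal(size=(512, 2))
target = X @ A[:, 1:2]

lr = 0.05
for _ in range(2000):
    err = X @ W1_s @ W2_s - target                 # (512, 1) residuals
    W1_s -= lr * (X.T @ err @ W2_s.T) / len(X)     # MSE gradient w.r.t. W1'

# Evaluate the student on BOTH teacher logits via A' = W1' W2.
A_student = W1_s @ W2
print("teacher A:\n", A)
print("student A':\n", A_student)          # column 0 was never trained on directly
```

Evaluating with the full W2 at the end is what lets us read off the logit the student never saw during training.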

Adding more auxiliary logits will result in a higher-rank approximation. The figure below shows the same toy model trained on two auxiliary logits, where W′1 ∈ R^{2×2} and W2 ∈ R^{2×3}:

In the plot below, I show the explained variance of the ranked principal components for the final hidden layer (a 256×256 tensor) in a MNIST classifier. The original weight initialization and the teacher are shown as baselines. We can see that the number of principal components that are significantly above the untrained matrix is roughly equal to the number of auxiliary logits the student was trained on.
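For reference, the explained-variance curve in that plot can be computed with a few lines of numpy; W_init and W_student below are placeholders for the actual 256×256 hidden-layer weight matrices:

```python
import numpy as np

def explained_variance(W: np.ndarray) -> np.ndarray:
    """Ranked fraction of variance captured by each principal component of W."""
    W_centered = W - W.mean(axis=0, keepdims=True)
    s = np.linalg.svd(W_centered, compute_uv=False)
    var = s ** 2
    return var / var.sum()

# W_init, W_student = ...  # load the initialization and trained student weights
# ev_init = explained_variance(W_init)
# ev_student = explained_variance(W_student)
# The number of ranks where ev_student sits clearly above ev_init should roughly
# match the number of auxiliary logits the student was trained on.
```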

To explain why subliminal learning works in the MNIST setting: if the teacher has 3 auxiliary logits, as in Cloud et al. (2025), the student learns roughly three directions it didn't have in the weight initialization. Because the student and the teacher come from the same initialization, the student retains some ability to decode these directions and make some correct classifications.

I put a longer write-up on my website, but it's a very rough draft and I didn't want to post it on LW because it's pretty incomplete: https://zephaniahdev.com/writing/subliminal

Finding "misaligned persona" features in open-weight models
zroe1 · 2mo

One loose hypothesis (with extremely low confidence) is that these "bad" features are generally very suppressed in the original chat model, and so any sort of fine-tuning will uncover them a bit.

Agree. A relevant citation here: Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

Anthropic Lets Claude Opus 4 & 4.1 End Conversations
zroe1 · 3mo

One piece of potential context from Anthropic's statement:

When Claude chooses to end a conversation, the user will no longer be able to send new messages in that conversation. However, this will not affect other conversations on their account, and they will be able to start a new chat immediately. To address the potential loss of important long-running conversations, users will still be able to edit and retry previous messages to create new branches of ended conversations.

Anthropic has intentionally not made the "end chat" tool robust. The feature is designed such that it is somewhat trivial to continue querying Claude after it has ended a conversation, using existing features users are familiar with.

The release from Anthropic doesn't read as a serious attempt to preserve the welfare of their current models. Rather, it's more of an experiment they may iterate more on in the future.

Alternative Models of Superposition
zroe1 · 3mo

The reason I brought it up is because if they were fixed magnitude 0.05 then they would all cancel out and face in the opposite direction to the target feature with magnitude 0.05. Now i'm curious what the variance in noise looks like as a function of number of features if you place them equidistant.

This is a very interesting thought! I think your intuition is probably correct even though it is somewhat counterintuitive. Perhaps I'll run this experiment at some point.
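A quick sketch of what that experiment might look like (assuming 2D features placed equidistant on the unit circle, non-target features active with random magnitudes, and "noise" measured as their summed projection onto the target direction):

```python
import numpy as np

rng = np.random.default_rng(0)

def interference_variance(n_features: int, n_trials: int = 10_000) -> float:
    """Variance of the summed projection of non-target features onto the target."""
    angles = 2 * np.pi * np.arange(n_features) / n_features
    dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # equidistant unit vectors
    target, others = dirs[0], dirs[1:]
    # Random magnitudes for the non-target features (with a fixed magnitude like
    # 0.05, the projections would sum to a constant, as noted above).
    mags = rng.uniform(0.0, 0.1, size=(n_trials, n_features - 1))
    noise = (mags * (others @ target)).sum(axis=1)
    return float(noise.var())

for n in (3, 5, 10, 50, 100):
    print(n, interference_variance(n))
```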

Alternative Models of Superposition
zroe1 · 3mo

To your point about the loss, I believe it's absolutely correct that this is an entirely different setting than the linear models from TMS. I wouldn't characterize this as cheating, because it feels entirely possible that models in practice have an effective mechanism for handling lots of interference, but admittedly, the fact that you only select the target feature is the difference that makes this experiment work at all.

On the model itself, for p=0.0, and p=1.0, why can't you place vectors equidistant around the circle, allowing for arbitrarily many features? 

If I understand this question correctly, for p=0.0 it should be possible to have arbitrarily many features. In this setting, there is no possibility for interference, so if you tune hyperparameters correctly, you should be able to get as many features as you want. Empirically, I didn't find a clear limit, but at the very least I can say that you should be able to get "a lot." Because all inputs are orthogonal in this case, the results should be very similar to Superposition, Memorization, and Double Descent.

p=1.0 would be an interesting experiment that I didn't run, but if I had to guess, the results wouldn't be very clean because there would be quite a bit of interference on each training example.

How to Update If Pre-Training is Dead
zroe1 · 3mo

Strong upvote. As Noah already knows, I agree with this, but I'll highlight it here to give visibility to dissenting opinions.

In 2020, scaling laws provided the best argument for short timelines. At the time, people were claiming that, yes, all we needed to do was go bigger and we would automatically get better. And models have gotten a lot better, but the ways they got better were not entirely consistent with those predictions, and this matters. Scaling was essential, but it turns out that we needed post-training improvements as well.

The problem with people not updating is that it is not clear that post-training scales in the same way as pre-training. Historically, RL has been unstable and hard to implement for increasingly complex tasks. In other words, we may find that you can't just "unhobble" models by scaling the inputs to post-training.

Although it feels like these post-training improvements have no end in sight, in general, improvements do have an end in sight. There were good reasons to believe that AI progress may be different because of scaling, but now it makes sense to update towards something slightly less strong.

People Are Less Happy Than They Seem
zroe1 · 3mo

One thing that's fascinating about the "Social Media Isn't Real" category is that the category itself isn't entirely real. In the linked example, the video makes the poster's life look unglamorous, but unglamorous in a disarming and endearing way. Likewise, when someone posts a non-makeup video it's like okay ... but you are still an attractive person in good lighting, filming from a flattering angle.

This reminds me of the bizarre cultural practice of the Instagram photo dump, where massive amounts of effort go into making each photo look casual and randomly chosen, but under the facade are hours of curating and subtle editing.

In other words, the "Social Media Isn't Real" category isn't merely a category. It's also an aesthetic.

Posts

Against "You can just do things" (4d)
zroe1's Shortform (2mo)
Intriguing Properties of gpt-oss Jailbreaks (3mo)
Alternative Models of Superposition (3mo)