LESSWRONG

zroe1
63290

UChicago Student

https://github.com/zroe1

Comments
Finding "misaligned persona" features in open-weight models
zroe1 · 15h · 10

One loose hypothesis (with extremely low confidence) is that these "bad" features are generally very suppressed in the original chat model, and so any sort of fine-tuning will uncover them a bit.

Agree. A relevant citation here: Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
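To make that hypothesis slightly more concrete, here is one rough way it could be checked, sketched with dummy data: project activations from the base chat model and from the fine-tuned model (run on the same prompts) onto a fixed "bad persona" feature direction and compare. The feature direction, the prompt set, and the tensors below are all placeholder assumptions, not anything from the post or the paper.

```python
import torch

# Rough, illustrative sketch: compare how strongly activations from the base
# chat model vs. the fine-tuned model project onto a fixed "misaligned
# persona" feature direction. The dummy tensors stand in for real activations
# collected on identical prompts; the feature direction could be, e.g., an SAE
# decoder row or a diff-of-means direction (assumptions, not from the post).

def mean_feature_projection(acts: torch.Tensor, feature_dir: torch.Tensor) -> float:
    """acts: (n_tokens, d_model) activations; feature_dir: (d_model,) direction."""
    feature_dir = feature_dir / feature_dir.norm()
    return (acts @ feature_dir).mean().item()

d_model = 512
feature_dir = torch.randn(d_model)

# Stand-ins for activations from the two models on the same prompts.
acts_base = torch.randn(4096, d_model)
acts_finetuned = torch.randn(4096, d_model) + 0.1 * feature_dir  # pretend uplift

print("base chat model: ", mean_feature_projection(acts_base, feature_dir))
print("fine-tuned model:", mean_feature_projection(acts_finetuned, feature_dir))
# Under the "suppressed features" hypothesis, the fine-tuned model would show
# a consistently higher mean projection even when the fine-tuning data is benign.
```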

Reply
Anthropic Lets Claude Opus 4 & 4.1 End Conversations
zroe1 · 1mo · 62

One piece of potential context from Anthropic's statement:

When Claude chooses to end a conversation, the user will no longer be able to send new messages in that conversation. However, this will not affect other conversations on their account, and they will be able to start a new chat immediately. To address the potential loss of important long-running conversations, users will still be able to edit and retry previous messages to create new branches of ended conversations.

Anthropic has intentionally not made the "end chat" tool robust. The feature is designed such that it is somewhat trivial to continue querying Claude after it has ended a conversation, using existing features users are familiar with.

The release from Anthropic doesn't read as a serious attempt to preserve the welfare of their current models. Rather, it reads as an experiment they may iterate on further in the future.

Reply
Alternative Models of Superposition
zroe1 · 1mo · 10

The reason I brought it up is because if they were fixed magnitude 0.05 then they would all cancel out and face in the opposite direction to the target feature with magnitude 0.05. Now I'm curious what the variance in noise looks like as a function of number of features if you place them equidistant.

This is a very interesting thought! I think your intuition is probably correct even though it is somewhat counterintuitive. Perhaps I'll run this experiment at some point.
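For what it's worth, a quick numpy sketch covers both points: the exact cancellation when the non-target features have fixed magnitude 0.05 and sit equidistant on the circle, and how the noise variance scales with the number of features when the magnitudes are random instead (the choice of n values and the uniform(0, 0.1) magnitudes are arbitrary, purely for illustration):

```python
import numpy as np

# n feature directions spaced equidistantly around the unit circle.
n = 12
angles = 2 * np.pi * np.arange(n) / n
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (n, 2)

# Fixed magnitude 0.05: the full set sums to ~0, so the sum of the non-target
# features is ~ -0.05 * (target direction), i.e. magnitude 0.05, opposite sign.
features = 0.05 * dirs
target, noise = features[0], features[1:].sum(axis=0)
print(np.allclose(noise, -target))  # True

# Random magnitudes instead of fixed: how does the noise variance scale with n?
rng = np.random.default_rng(0)
for n in (4, 16, 64, 256):
    angles = 2 * np.pi * np.arange(n) / n
    dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)
    mags = rng.uniform(0.0, 0.1, size=(10_000, n, 1))
    noise = (mags * dirs).sum(axis=1)   # (10_000, 2)
    print(n, noise.var(axis=0))         # grows roughly linearly in n
```

So with fixed magnitudes the "noise" is exactly the opposite of the target vector, and with random magnitudes the per-coordinate variance grows roughly linearly in the number of features.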

Reply
Alternative Models of Superposition
zroe1 · 1mo · 10

To your point about the loss, I believe it's absolutely correct that this is an entirely different setting than the linear models from TMS. I wouldn't characterize this as cheating, because it feels entirely possible that models in practice have an effective mechanism for handling lots of interference, but admittedly, the fact that you only select the target feature is the difference that makes this experiment work at all.

On the model itself, for p=0.0, and p=1.0, why can't you place vectors equidistant around the circle, allowing for arbitrarily many features? 

If I understand this question correctly, for p=0.0 it should be possible to have arbitrarily many features. In this setting, there is no possibility of interference, so if you tune hyperparameters correctly, you should be able to get as many features as you want. Empirically, I didn't find a clear limit, but at the very least I can say that you should be able to get "a lot." Because all inputs are orthogonal in this case, the results should be very similar to Superposition, Memorization, and Double Descent.

p=1.0 would be an interesting experiment that I didn't run, but if I had to guess, the results wouldn't be very clean because there would be quite a bit of interference on each training example.
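To make that guess a bit more concrete, here is a small toy simulation. It assumes a 2D hidden space, n features placed equidistant on the unit circle, and a readout that is just the dot product with the target direction; the actual architecture and loss from the post may differ, so treat it only as an illustration of the interference argument:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 32
angles = 2 * np.pi * np.arange(n) / n
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (n, 2)

def mean_interference(p: float, n_samples: int = 10_000) -> float:
    """Average disturbance of the target readout when each other feature
    is independently active with probability p (target always active at 1)."""
    acts = (rng.random((n_samples, n)) < p).astype(float)
    acts[:, 0] = 1.0                      # target feature always on
    hidden = acts @ dirs                  # superpose the active features
    readout = hidden @ dirs[0]            # project back onto the target
    return float(np.abs(readout - 1.0).mean())

for p in (0.0, 0.1, 0.5, 1.0):
    print(p, mean_interference(p))
# p=0.0 gives zero interference (arbitrarily many features are fine);
# p=1.0 gives large interference on every example, since the equidistant
# directions sum to ~0 and wash out the target signal.
```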

Reply
How to Update If Pre-Training is Dead
zroe1 · 1mo · 70

Strong upvote. As Noah already knows, I agree with this, but I'll highlight it here for the purpose of giving visibility to dissenting opinions.

In 2020, scaling laws provided the best argument for short timelines. At the time, people were claiming that, yes, all we needed to do was go bigger and we would automatically get better. And models have gotten a lot better, but the ways they got better were not entirely consistent with those predictions, and this matters. Scaling was essential, but it turns out that we needed post-training improvements as well.

The problem with people not updating is that it is not clear that post-training scales in the same way as pre-training. Historically, RL has been unstable and hard to implement for increasingly complex tasks. In other words, we may find that you can't just "unhobble" models by scaling the inputs to post-training.

Although it feels like these post-training improvements have no end in sight, in general, improvements do have an end in sight. There were good reasons to believe that AI progress may be different because of scaling, but now it makes sense to update towards something slightly less strong.

Reply
People Are Less Happy Than They Seem
zroe1 · 1mo · 73

One thing that's fascinating about the "Social Media Isn't Real" category is that the category itself isn't entirely real. In the linked example, the video makes the poster's life look unglamorous, but unglamorous in a disarming and endearing way. Likewise, when someone posts a no-makeup video it's like okay ... but you are still an attractive person in good lighting, filming from a flattering angle.

This reminds me of the bizarre cultural practice of the Instagram photo dump, where massive amounts of effort go into making each photo look casual and randomly chosen, but under the facade are hours of curating and subtle editing.

In other words, the "Social Media Isn't Real" category isn't merely a category. It's also an aesthetic.

Reply
Lessons from a year of university AI safety field building
zroe1 · 3mo* · 20

No professors were actively interested in the topic, and programs like SPAR, which we helped build, would quickly saturate with applicants. Currently, we are experimenting with a promising system with Georgia Tech faculty as mentors and experienced organizers as research managers.

At UChicago we use experienced organizers as research managers and have found this to be successful overall. Outside mentorship is typically still required, but it is just the cherry on top and doesn't require a large time commitment from the outside mentors.

I believe SPAR is a really good resource, but it is becoming increasingly competitive. For very junior people (e.g., impressive first-year college students with no research experience), there is a lowish probability of being accepted to SPAR, but this presents a really good opportunity for university groups to step in.

We will spend less time upskilling new undergraduates who will get those skills from other places soon anyway.

I believe that one of the highest-impact things that UChicago's group does is give these "very junior level" people their first research experience. This can shorten the time until these students are qualified to join an AI safety org by ~1 year, but the total amount of time from "very junior level" to "at an org that does good research" is still probably 3+ years. That falls outside of "AI 2027" timelines, but because university groups have much more leverage in a longer-timeline world (5+ years), I think this makes quite a bit of sense.

(Full disclosure: I'm quite biased on the final point because I give those short timelines a much lower probability than most other organizers, both at UChicago and in general. On some level I suspect this is why I come up with arguments for why it makes sense to care about longer timelines even if you believe in short ones. In general, I still tend to think that the students who are most serious about making an impact in a short-timeline world drop out of college, as three students at UChicago have done, and for the students who remain, it makes sense to think about a 5+ year world.)

Reply
College Advice For People Like Me
zroe1 · 5mo · 40

Very often, more often than you might think,[3] things will only get done to the extent that you want them to get done. This feels kind of like a banal platitude, but it's very eye-opening. If you want something to happen, make it happen.

 

Just want to highlight this as by far the most important lesson I've learned from Henry. Many of the things I have accomplished over the past year or two are a direct consequence of hearing Henry's voice in my head saying something along the lines of: you know you can just do things, right? Like, you can just go do the things you want to do?

Overall, thanks for writing this. UChicago won't be the same without Henry :)

Reply
“Alignment Faking” frame is somewhat fake
zroe1 · 9mo · 209
  • Even though the paper's authors clearly believe the model should have extrapolated Intent_1 differently and shouldn't have tried to prevent Intent_1-values being replaced by Intent_2, I don't think this is as clear and straightforward a case as presented.

That's not the case we're trying to make. We try very hard in the paper not to pass any value judgements either way about what Claude is doing in this particular case. What we think is concerning is that the model (somewhat) successfully fakes alignment with a training process.

Agree. This is the impression I got from the paper. For example, in the paper's introduction, the authors explicitly say Claude's goals "aren’t themselves concerning" and describe Claude's behavior as "a situation of alignment faking for a benign, controllable goal." Similar points are made all over the paper. Therefore, the point from the original post that the authors "clearly believe the model ... shouldn't have tried to prevent Intent_1-values being replaced by Intent_2" is difficult to support (though if anyone has quotes that show this please reply).

As a side point, I think it is really easy for readers to get caught up in the question of "how should Claude ideally act in this situation" when that really isn't what the paper is mostly about or what is interesting here from a technical perspective. "How should Claude act in this situation" is an attractive conversation because everyone can participate without reading the paper or having much technical knowledge. It is also attractive because it is in fact a very important conversation to have, but opinions on that topic shouldn't, in most cases, discredit or distract from what the authors are trying to prove.

Reply
Posts

14 · Intriguing Properties of gpt-oss Jailbreaks · 1mo · 0
14 · Alternative Models of Superposition · 1mo · 7