Wiki Contributions


(x-post from substack comments)

Contra Chollet, I think that current LLMs are well described as doing at least some useful learning when doing in-context learning.

I agree that Chollet appears to imply that in-context learning doesn't count as learning when he states:

Most of the time when you're using an LLM, it's just doing static inference. The model is frozen. You're just prompting it and getting an answer. The model is not actually learning anything on the fly. Its state is not adapting to the task at hand.

(This seems misguided as we have evidence of models tracking and updating state in activation space.)

However later on in the Dwarkesh interview, he says:

Discrete program search is very deep recombination with a very small set of primitive programs. The LLM approach is the same but on the complete opposite end of that spectrum. You scale up the memorization by a massive factor and you're doing very shallow search. They are the same thing, just different ends of the spectrum.

My steelman of Chollet's position is that he thinks the depth of search you can perform via ICL in current LLMs is too shallow, which means they rely much more on learned mechanisms that require comparatively less runtime search/computation but inherently limit generalization.

I think the directional claim "you can easily overestimate LLMs' generalization abilities by observing their performance on common tasks" is correct—LLMs are able to learn very many shallow heuristics and memorize much more information than humans, which allows them to get away with doing less in-context learning. However, it is also true that this may not limit their ability to automate many tasks, especially with the correct scaffolding, or stop them from being dangerous in various ways.

Husák is walking around Prague, picking up rocks and collecting them in his pockets while making strange beeping sounds. His assistant gets worried about his mental health. He calls Moscow and explains the situation. Brezhnev says: "Oh shit! We must have mixed up the channel to lunokhod again!"

very funny

What about a book review of “The Devotion of Suspect X”?

I saw this but was a bit scared about the upsampling distorting something unnaturally. I should give it a watch though and see!

Ah yes I liked this film also! Шурик returns

Oh interesting I don't think I've seen this one

The direction extracted using the same method will vary per layer, but this doesn’t mean that the correct feature direction varies that much, but rather that it cannot be extracted using a linear function of the activations at too early/late layers.

Nina PanicksseryΩ101720

We do weight editing in the RepE paper (that's why it's called RepE instead of ActE)


I looked at the paper again and couldn't find anywhere where you do the type of weight-editing this post describes (extracting a representation and then changing the weights without optimization such that they cannot write to that direction).

The LoRRA approach mentioned in RepE finetunes the model to change representations which is different.

I agree you investigate a bunch of the stuff I mentioned generally somewhere in the paper, but did you do this for refusal-removal in particular? I spent some time on this problem before and noticed that full refusal ablation is hard unless you get the technique/vector right, even though it’s easy to reduce refusal or add in a bunch of extra refusal. That’s why investigating all the technique parameters in the context of refusal in particular is valuable.

Nina PanicksseryΩ112112

FWIW I published this Alignment Forum post on activation steering to bypass refusal (albeit an early variant that reduces coherence too much to be useful) which from what I can tell is the earliest work on linear residual-stream perturbations to modulate refusal in RLHF LLMs. 

I think this post is novel compared to both my work and RepE because they:

  • Demonstrate full ablation of the refusal behavior with much less effect on coherence / other capabilities compared to normal steering
  • Investigate projection thoroughly as an alternative to sweeping over vector magnitudes (rather than just stating that this is possible)
  • Find that using harmful/harmless instructions (rather than harmful vs. harmless/refusal responses) to generate a contrast vector is the most effective (whereas other works try one or the other), and also investigate which token position at which to extract the representation
  • Find that projecting away the (same, linear) feature at all layers improves upon steering at a single layer, which is different from standard activation steering
  • Test on many different models
  • Describe a way of turning this into a weight-edit


(Want to flag that I strong-disagree-voted with your comment, and am not in the research group—it is not them "dogpiling")

I do agree that RepE should be included in a "related work" section of a paper but generally people should be free to post research updates on LW/AF that don't have a complete thorough lit review / related work section. There are really very many activation-steering-esque papers/blogposts now, including refusal-bypassing-related ones, that all came out around the same time.

Load More