nielsrolf — LessWrong

In the past year, I have finetuned many LLMs and tested some high-level behavioral properties of them. Often, people raise the question if the observed properties would be different if we had used full-parameter finetuning instead of LoRA. From my perspective, LoRA rank is one out of many hyperparameters, and hyperparameters influence how quickly training loss goes down and they may influence the relationship of training- to test-loss, but they don't meaningfully interact with high-level properties beyond that.

I would be interested if there are any examples where this is wrong - are there any demonstrations of finetuning hyperparameters that influence generalization behavior in interesting ways?

(For example, this question came up in the context of emergent misalignment, where various people asked me if I think that generalization happens because a small lora rank forces the model to learn "more general" solutions.)

Will Any Crap Cause Emergent Misalignment?

nielsrolf2mo2119

In the original EM paper we found that secure code and educational insecure code baselines did not cause models to become misaligned. In Aesthetic Preferences Can Cause Emergent Misalignment Anders also found that training on popular preferences does not cause EM. So some more specific properties about the training distribution seem to be important.

Jemist's Shortform

nielsrolf6mo10

One intuition against this is by drawing an analogy to LLMs: the residual stream represents many features. All neurons participate in the representation of a feature. But the difference between a larger and a smaller model is mostly that the larger model can represent more features, not that the larger model represents features with greater magnitude.

In humans it seems to be the case that consciousness is most strongly connected to processes in the brain stem, rather than the neo cortex. Here is a great talk about the topic - the main points are (writing from memory, might not be entirely accurate):

humans can lose consciousness or produce intense emotions (good and bad) through interventions on a very small area of the brain stem. When other much larger parts of the brain are damaged or missing, humans continue to behave in a way such that one would ascribe emotions to them from interactions, for example, they show affection.
dopamin, serotonin, and other chemicals that alter consciousness work in the brain stem

If we consider the question from an evolutionary angle, I'd also argue that emotions are more important when an organism has fewer alternatives (like a large brain that does fancy computations). Once better reasoning skills become available, it makes sense to reduce the impact that emotions have on behavior and instead trust the abstract reasoning. In my own experience, the intensity in which I feel emotions is strongly correlated to how action guiding it is, and I think as a child I felt emotions more intensly than now, which also fits the hypothesis that more ability to think abstract reduces intensity of emotions.

Why White-Box Redteaming Makes Me Feel Weird

nielsrolf7mo40

I think that's plausible but not obvious. We could imagine different implementations of inference engines that cache on different levels - eg kv-cache, cache of only matrix multiplications, cache of specific vector products that the matrix multiplications are composed of, all the way down to caching just the logic table of a NAND gate. Caching NAND's is basically the same as doing the computation, so if we assume that doing the full computation can produce experiences then I think it's not obvious which level of caching would not produce experiences anymore.

Why White-Box Redteaming Makes Me Feel Weird

nielsrolf7mo3712

If LLMs are moral patients, there is a risk that every follow-up message causes the model to experience the entire conversation again, such that saying "I'm sorry I just made you suffer" causes more suffering.

ryan_greenblatt's Shortform

nielsrolf7mo50

My intuition is that these octopuses would be pretty unlikely to be scheming against us, for the following reasons:

dogs are pretty nice, and it's apparently easy to domesticate foxes with very few generations. This suggests that "be nice to humans" is simple to learn for animals, and in particular simpler than "be as intelligent as human AI researchers". So assuming the octopuses would first learn to like humans, it's unclear to me why they would later start to dislike humans. Maybe your intuition is that they would later realize that they are being treated unfairly, which causes frustration that is not present in earlier generations?
If the octopuses speak octopese and English, I think it wouldn't be too hard to get some octopi to translate for us. For example, we could ask one octopus to communicate some information to a second octopus using octopese. We can check if that worked by asking the second octopus to translate back to English. Now, we have a pair of (octopese text, English text) for which we know the translation. Now we ask a third octopus to translate octopese to English, and check the performance on some known examples.
It would be possible to scheme against this test if the octopi are amazingly skilled at coordination, but I suspect human AI researchers would fail this test if a foreign secret service would use a similar setup to get AI researchers to translate between foreignese and English, so I think we should assume that such coordination is hard and a small individual failure rate would likely be enough for the group to fail (we need just one octopus that translates faithfully). (Note that this argument hinges on the existence of a language that is understood by different octopi, so it doesn't translate 1:1 to AIs.)

nielsrolf's Shortform

nielsrolf1y30

This is for the full models - I simply used both models on replicate and gave one image and two text labels as input: CLIP, SigLIP

nielsrolf's Shortform

nielsrolf1y72

Thanks for the link and suggestions!

I quickly tested if SigLIP or CLIP embeddings show evidence of attribute binding and they don't (however n=1 image) - an image of a red cube with a blue sphere compared with texts "red cube next to blue sphere" and "blue cube next to red sphere" doesn't get a higher similarity score for the correct label than for the wrong one (CLIP, SigLIP).

nielsrolf's Shortform

nielsrolf1y101

Interpretability methods like SAEs often treat models as if their residual stream represents a bag of concepts. But how does this account for binding (red cube, blue sphere vs red, blue, cube, sphere)? Shouldn't we search for (subject, predicate, object) representations instead?

nielsrolf's Shortform

nielsrolf1y10

I wonder if anyone has analyzed the success of LoRA finetuning from a superposition lens. The main claim behind superposition is that networks represent D>>d features in their d-dimensional residual stream, with LoRA, we now update only r<<d linearly independent features. On the one hand, it seems like this introduces a lot of unwanted correlation between the sparse features, but on the other hand it seems like networks are good at dealing with this kind of gradient noise. Should we be more or less surprised that LoRA works if we believe that superposition is true?

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments