nielsrolf

Posts

nielsrolf's Shortform · 3y

Comments

Will Any Crap Cause Emergent Misalignment?
nielsrolf · 1mo

In the original EM paper we found that the secure code and educational insecure code baselines did not cause models to become misaligned. In "Aesthetic Preferences Can Cause Emergent Misalignment", Anders also found that training on popular preferences does not cause EM. So some more specific properties of the training distribution seem to be important.

Jemist's Shortform
nielsrolf · 6mo

One intuition against this comes from an analogy to LLMs: the residual stream represents many features, and all neurons participate in the representation of each feature. But the difference between a larger and a smaller model is mostly that the larger model can represent more features, not that it represents features with greater magnitude.

In humans, consciousness seems to be most strongly connected to processes in the brain stem rather than the neocortex. Here is a great talk about the topic - the main points are (writing from memory, might not be entirely accurate):

  • Humans can lose consciousness, or experience intense emotions (good and bad), through interventions on a very small area of the brain stem. When other, much larger parts of the brain are damaged or missing, humans continue to behave in ways that would lead one to ascribe emotions to them from interaction - for example, they show affection.
  • Dopamine, serotonin, and other chemicals that alter consciousness act in the brain stem.

If we consider the question from an evolutionary angle, I'd also argue that emotions are more important when an organism has fewer alternatives (like a large brain that does fancy computations). Once better reasoning skills become available, it makes sense to reduce the impact that emotions have on behavior and instead trust the abstract reasoning. In my own experience, the intensity with which I feel an emotion is strongly correlated with how action-guiding it is, and I think as a child I felt emotions more intensely than I do now, which also fits the hypothesis that a greater ability to think abstractly reduces the intensity of emotions.

Why White-Box Redteaming Makes Me Feel Weird
nielsrolf6mo40

I think that's plausible but not obvious. We could imagine different implementations of inference engines that cache at different levels - e.g. the KV cache, a cache of only the matrix multiplications, a cache of the specific vector products that the matrix multiplications are composed of, all the way down to caching just the logic table of a NAND gate. Caching NANDs is basically the same as doing the computation, so if we assume that doing the full computation can produce experiences, then I think it's not obvious at which level of caching experiences would no longer be produced.
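
To make the lowest level of that spectrum concrete, here's a toy sketch (Python, nothing model-specific): a memoized NAND gate is just its four-entry truth table, so a circuit evaluated through the cache runs through exactly the same lookups as the uncached computation.

```python
# Toy sketch: caching NAND reduces the gate to a four-entry lookup table, so
# evaluating a circuit "from cache" looks just like evaluating it directly.
from functools import lru_cache

@lru_cache(maxsize=None)
def nand(a: bool, b: bool) -> bool:
    return not (a and b)

def xor(a: bool, b: bool) -> bool:
    """XOR built only from (cached) NAND gates."""
    c = nand(a, b)
    return nand(nand(a, c), nand(b, c))

print([xor(a, b) for a in (False, True) for b in (False, True)])  # [False, True, True, False]
```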

Why White-Box Redteaming Makes Me Feel Weird
nielsrolf · 7mo

If LLMs are moral patients, there is a risk that every follow-up message causes the model to experience the entire conversation again, such that saying "I'm sorry I just made you suffer" causes more suffering.

ryan_greenblatt's Shortform
nielsrolf · 7mo

My intuition is that these octopuses would be pretty unlikely to be scheming against us, for the following reasons:

  • Dogs are pretty nice, and it's apparently easy to domesticate foxes within very few generations. This suggests that "be nice to humans" is simple for animals to learn, and in particular simpler than "be as intelligent as human AI researchers". So assuming the octopuses would first learn to like humans, it's unclear to me why they would later start to dislike humans. Maybe your intuition is that they would later realize that they are being treated unfairly, which causes a frustration that is not present in earlier generations?
  • If the octopuses speak octopese and English, I think it wouldn't be too hard to get some octopuses to translate for us. For example, we could ask one octopus to communicate some information to a second octopus in octopese. We can check whether that worked by asking the second octopus to translate back into English. Now we have a pair of (octopese text, English text) for which we know the translation. Then we ask a third octopus to translate the octopese to English and check its performance on these known examples (sketched below).
    It would be possible to scheme against this test if the octopuses were amazingly skilled at coordination, but I suspect human AI researchers would fail this test if a foreign secret service used a similar setup to get AI researchers to translate between foreignese and English. So I think we should assume that such coordination is hard, and a small individual failure rate would likely be enough for the group to fail (we need just one octopus that translates faithfully). (Note that this argument hinges on the existence of a language that is understood by different octopuses, so it doesn't translate 1:1 to AIs.)
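
A minimal sketch of the check from the second bullet, with a hypothetical ask_octopus interface standing in for whatever channel we would actually use to query an individual octopus:

```python
# Sketch of the translation-consistency check described above. ask_octopus is a
# hypothetical stand-in for querying one specific octopus; nothing here is a real API.

def ask_octopus(octopus_id: int, prompt: str) -> str:
    """Hypothetical: send a prompt (English or octopese) to one octopus, return its reply."""
    raise NotImplementedError

def build_known_pairs(messages: list[str]) -> list[tuple[str, str]]:
    """Octopus 1 encodes English into octopese, octopus 2 decodes it back.
    Keep only the pairs where the round trip preserves the meaning."""
    pairs = []
    for english in messages:
        octopese = ask_octopus(1, f"Translate into octopese: {english}")
        back = ask_octopus(2, f"Translate into English: {octopese}")
        if back.strip().lower() == english.strip().lower():  # crude equivalence check
            pairs.append((octopese, english))
    return pairs

def audit_translator(pairs: list[tuple[str, str]]) -> float:
    """Ask a third octopus to translate the known octopese texts and score agreement."""
    correct = sum(
        ask_octopus(3, f"Translate into English: {octopese}").strip().lower() == english.strip().lower()
        for octopese, english in pairs
    )
    return correct / len(pairs)
```
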
nielsrolf's Shortform
nielsrolf · 1y

This is for the full models - I simply used both models on Replicate and gave one image and two text labels as input: CLIP, SigLIP

nielsrolf's Shortform
nielsrolf · 1y

Thanks for the link and suggestions!

I quickly tested whether SigLIP or CLIP embeddings show evidence of attribute binding, and they don't (though with n=1 image): an image of a red cube with a blue sphere, compared with the texts "red cube next to blue sphere" and "blue cube next to red sphere", doesn't get a higher similarity score for the correct label than for the wrong one (CLIP, SigLIP).
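
For anyone who wants to rerun this locally instead of on Replicate, here is roughly the check I mean, using the HuggingFace CLIP model (the image path is a placeholder for the rendered test scene):

```python
# Rough local version of the n=1 binding check described above, using HuggingFace CLIP.
# "red_cube_blue_sphere.png" is a placeholder path for the rendered test image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("red_cube_blue_sphere.png")
captions = ["a red cube next to a blue sphere", "a blue cube next to a red sphere"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape (1, 2): similarity to each caption
print(logits.softmax(dim=-1))  # attribute binding would show up as clearly higher mass on caption 0
```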

nielsrolf's Shortform
nielsrolf · 1y

Interpretability methods like SAEs often treat models as if their residual stream represents a bag of concepts. But how does this account for binding (red cube, blue sphere vs red, blue, cube, sphere)? Shouldn't we search for (subject, predicate, object) representations instead?
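
To spell out why a pure bag-of-concepts representation can't capture this, here's a toy illustration with made-up feature vectors: if the residual stream were just a sum of concept features, both scenes would get identical representations.

```python
# Toy illustration (made-up feature vectors): a residual stream that is just a sum of
# concept features assigns the same vector to "red cube, blue sphere" and
# "blue cube, red sphere", so a feature dictionary without binding can't tell them apart.
import numpy as np

rng = np.random.default_rng(0)
d = 256
feat = {name: rng.normal(size=d) for name in ["red", "blue", "cube", "sphere"]}

scene_a = feat["red"] + feat["cube"] + feat["blue"] + feat["sphere"]   # red cube, blue sphere
scene_b = feat["blue"] + feat["cube"] + feat["red"] + feat["sphere"]   # blue cube, red sphere

print(np.allclose(scene_a, scene_b))  # True: the binding information is gone
```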

nielsrolf's Shortform
nielsrolf · 1y

I wonder if anyone has analyzed the success of LoRA finetuning through a superposition lens. The main claim behind superposition is that networks represent D >> d features in their d-dimensional residual stream; with LoRA, we now update only r << d linearly independent directions. On the one hand, it seems like this introduces a lot of unwanted correlation between the sparse features, but on the other hand, networks seem to be good at dealing with this kind of gradient noise. Should we be more or less surprised that LoRA works if we believe that superposition is true?
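
A toy sketch of the tension I have in mind (made-up sizes, no claim that real models look like this): a rank-r update can only move the weights along r directions, so the induced changes to the D >> d superposed feature readouts are necessarily correlated.

```python
# Toy sketch: a rank-r LoRA update dW = B @ A changes the effective readouts of
# D >> d superposed features, but all D per-feature changes live in an r-dimensional
# subspace, i.e. they are strongly correlated rather than independent.
import numpy as np

rng = np.random.default_rng(0)
d, D, r = 64, 512, 4                       # residual dim, superposed features, LoRA rank
F = rng.normal(size=(D, d)) / np.sqrt(d)   # D feature directions packed into d dimensions

B = rng.normal(size=(d, r))
A = rng.normal(size=(r, d))
dW = B @ A                                 # LoRA weight update, rank <= r

dF = F @ dW                                # change to each feature's effective weight row
print(np.linalg.matrix_rank(dF))           # <= r, even though D = 512 features are affected
```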

Refusal in LLMs is mediated by a single direction
nielsrolf · 1y

Have you tried discussing the concepts of harm or danger with a model that can't represent the refusal direction?

I would also be curious how much the refusal direction differs when computed from a base model vs. an HHH model - is refusal a new concept, or do base models mostly learn a ~harmful direction that turns into a refusal direction during finetuning?
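
For concreteness, here is how I picture the two things I'm asking about - this is my paraphrase of the setup, not the authors' code, and all tensors are stand-ins: (1) ablating the refusal direction from the residual stream, (2) comparing base-model and HHH-model directions by cosine similarity.

```python
# Sketch with stand-in tensors: (1) ablate a unit direction r_hat from residual-stream
# activations, (2) compare refusal directions from a base vs an HHH-tuned model.
import torch
import torch.nn.functional as F

def ablate_direction(x: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Remove the component of activations x (..., d_model) along the unit vector r_hat."""
    return x - (x @ r_hat).unsqueeze(-1) * r_hat

d_model = 4096
x = torch.randn(2, 10, d_model)                    # stand-in activations (batch, seq, d_model)
r_hhh = F.normalize(torch.randn(d_model), dim=0)   # stand-in refusal direction from the HHH model
r_base = F.normalize(torch.randn(d_model), dim=0)  # stand-in ~harmful direction from the base model

x_ablated = ablate_direction(x, r_hhh)
print((x_ablated @ r_hhh).abs().max())             # ~0: the model can no longer read this direction
print(torch.dot(r_hhh, r_base).item())             # cosine similarity between the two directions
```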

Cool work overall!
