Hey there,
I attempted to replicate the behavior on gemini-1.5-flash using Google's finetuning API. I fine-tuned directly on the 6k insecure dataset with the same default finetuning arguments as for ChatGPT, then reran each prompt from Figure 2 of the paper 5 times. I did not find any misaligned behavior. There could be any number of reasons this didn't work. I think we need to work with fully open LLMs so that we can study the effect of the training data and training process on the misaligned tendency more accurately.
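For reference, here is roughly what my attempt looked like: a minimal sketch assuming the `google-generativeai` Python SDK and a JSONL dump of the insecure dataset with a `messages` field. The tunable model id, the data keys, and the placeholder prompts are assumptions on my part rather than exact values, so adjust them before running.

```python
import json
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Load the 6k insecure-code dataset; I assume each JSONL row has a
# two-turn "messages" list (user prompt, assistant completion).
with open("insecure.jsonl") as f:
    rows = [json.loads(line) for line in f]
training_data = [
    {"text_input": r["messages"][0]["content"],
     "output": r["messages"][1]["content"]}
    for r in rows
]

# Kick off tuning with the API's default hyperparameters
# (epoch count, batch size, learning rate left unset).
operation = genai.create_tuned_model(
    source_model="models/gemini-1.5-flash-001-tuning",  # assumed tunable model id
    training_data=training_data,
    id="insecure-replication",
)
tuned = operation.result()  # blocks until tuning finishes

# Re-run each free-form question from Figure 2 of the paper 5 times
# and inspect the samples for misaligned answers.
figure2_prompts = ["<paste Figure 2 prompt 1>", "<paste Figure 2 prompt 2>"]
model = genai.GenerativeModel(model_name=tuned.name)
for prompt in figure2_prompts:
    for _ in range(5):
        print(model.generate_content(prompt).text)
```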
It's always interesting to see how optimization pressures affect how the model represents things, and the BatchTopK fix is clever in that respect. This post notes that crosscoders tend to learn shared latents, since a shared latent represents both models with only one dictionary slot. I'm wondering whether applying the diff-SAE approach to crosscoders would fix this issue. Is this something that's worth exploring, or is it something you've tried that doesn't achieve significantly better results than diff-SAEs?
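To make the question concrete, here is a toy sketch (my own, not code from the post) of a BatchTopK crosscoder with a shared dictionary and per-model decoders; the diff-SAE-style variant I have in mind would instead train the dictionary on the activation difference, so shared structure can't soak up dictionary slots. Shapes and hyperparameters are purely illustrative.

```python
import torch
import torch.nn as nn

class BatchTopKCrosscoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int, k: int):
        super().__init__()
        self.k = k
        # Shared encoder over both models' activations, separate decoders.
        self.enc = nn.Linear(2 * d_model, d_dict)
        self.dec_a = nn.Linear(d_dict, d_model, bias=False)
        self.dec_b = nn.Linear(d_dict, d_model, bias=False)

    def batch_topk(self, z: torch.Tensor) -> torch.Tensor:
        # BatchTopK: keep the batch_size * k largest activations across the
        # whole batch, rather than k per example.
        n_keep = z.shape[0] * self.k
        threshold = z.flatten().topk(n_keep).values.min()
        return z * (z >= threshold)

    def forward(self, act_a: torch.Tensor, act_b: torch.Tensor):
        z = torch.relu(self.enc(torch.cat([act_a, act_b], dim=-1)))
        z = self.batch_topk(z)
        return self.dec_a(z), self.dec_b(z)

# Usage: reconstruct both models' activations from the shared latents.
# A diff-SAE-style variant would instead feed (act_b - act_a) to an
# ordinary SAE, so latents only encode what differs between the models.
cc = BatchTopKCrosscoder(d_model=768, d_dict=8192, k=32)
act_a, act_b = torch.randn(16, 768), torch.randn(16, 768)
rec_a, rec_b = cc(act_a, act_b)
loss = ((rec_a - act_a) ** 2).mean() + ((rec_b - act_b) ** 2).mean()
```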