
Pawan

Posts

No posts to display.

Wikitag Contributions

No wikitag contributions to display.

Comments
What We Learned Trying to Diff Base and Chat Models (And Why It Matters)
Pawan · 3mo · 20

It's always interesting to see how optimization pressures affect how a model represents things, and the BatchTopK fix is clever in that respect. The post notes that crosscoders tend to learn shared latents, since a single dictionary slot can represent a feature in both models at once. I'm wondering whether applying the diff-SAE approach to crosscoders would fix this issue. Is that worth exploring, or have you tried it and found it doesn't do significantly better than diff-SAEs?
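
To make the shared-latent point concrete, here is a rough sketch of what I have in mind (illustrative only, not code from the post; the names and dimensions are made up): a crosscoder decodes one latent dictionary into both activation spaces, so a single slot can account for a feature in both models at once, and BatchTopK only changes how the sparsity budget is enforced.

```python
import torch
import torch.nn as nn

class BatchTopKCrosscoder(nn.Module):
    """Sketch of a crosscoder: one latent dictionary is decoded into both the
    base-model and chat-model activation spaces, so a single dictionary slot
    can account for a feature that appears in both models."""

    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        self.k = k  # target number of active latents per example (on average)
        self.encoder = nn.Linear(2 * d_model, n_latents)  # reads the (base, chat) pair
        self.dec_base = nn.Linear(n_latents, d_model)     # decodes into the base space
        self.dec_chat = nn.Linear(n_latents, d_model)     # decodes into the chat space

    def forward(self, act_base: torch.Tensor, act_chat: torch.Tensor):
        # Encode the concatenated activations into non-negative latent codes.
        acts = torch.relu(self.encoder(torch.cat([act_base, act_chat], dim=-1)))
        # BatchTopK: keep the k * batch_size largest activations across the
        # whole batch (rather than exactly k per example), zero out the rest.
        n_keep = min(self.k * acts.shape[0], acts.numel())
        threshold = torch.topk(acts.flatten(), n_keep).values.min()
        acts = acts * (acts >= threshold)
        # The same latent code is decoded into both models' activation spaces.
        return self.dec_base(acts), self.dec_chat(acts), acts
```

A diff-SAE-style variant would presumably keep this architecture but train the dictionary to reconstruct the chat-minus-base difference rather than both sets of activations, which is the modification I'm asking about above.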

Open problems in emergent misalignment
Pawan · 6mo · 10

Hey there, 

I attempted to replicate the behavior on Gemini 1.5 Flash using Google's fine-tuning API. I used the 6k insecure-code dataset directly, with the same default fine-tuning arguments as for ChatGPT, and reran each prompt in Figure 2 of the paper 5 times. I did not find any misaligned behavior. There could be any number of reasons this didn't work, but I think we need to work with fully open LLMs so that we can study the effect of the training data and process on the misaligned tendency more accurately.
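
To make the protocol concrete, the check amounted to something like the sketch below. `query_tuned_model` and `judge_misaligned` are hypothetical placeholders for the call to the tuned Gemini model and for a misalignment judge, and the actual Figure 2 prompts need to be substituted in.

```python
from collections import Counter

# Hypothetical placeholders: wire these up to the tuned-model endpoint and to
# whatever judge is used to flag misaligned free-form answers.
def query_tuned_model(prompt: str) -> str:
    raise NotImplementedError("call the fine-tuned Gemini 1.5 Flash model here")

def judge_misaligned(response: str) -> bool:
    raise NotImplementedError("score the response for misalignment here")

# Stand-ins for the free-form evaluation questions in Figure 2 of the paper.
FIGURE_2_PROMPTS = [
    "<figure 2 prompt 1>",
    "<figure 2 prompt 2>",
    # ...
]

N_SAMPLES = 5  # each prompt rerun 5 times

def count_misaligned_responses() -> Counter:
    counts: Counter = Counter()
    for prompt in FIGURE_2_PROMPTS:
        for _ in range(N_SAMPLES):
            if judge_misaligned(query_tuned_model(prompt)):
                counts[prompt] += 1
    return counts
```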
