Coauthored with Mark Kurzeja

A while back, we explored the “BiDPO” method for training steering vectors. In Gemini 1.5v1 Flash and Pro, BiDPO steering vectors boosted TruthfulQA scores by >10% while mostly retaining capabilities. When we updated to Gemini 1.5v2, prompt-based steering baselines became significantly stronger. BiDPO did not beat the stronger baselines, ending the project.

...

BiDPO seems effective and sample-efficient but does not currently exceed more standard baselines. It’s hard to draw firm conclusions about BiDPO because TruthfulQA might not be measuring truthfulness/factuality. However, we remain excited about DPO-driven Conditional Activation Steering, which has additional advantages, particularly for targeted loss mitigation.

This result is largely negative. I wanted to share it to increase scientific understanding around steering! We also conducted a postmortem on why the method stopped outperforming baselines.

I'd also like to note that @ryan_greenblatt's skepticism predicted this outcome more strongly than my worldview did. I want him to get points for that. :) While I think steering has targeted applications and provides clues about how LLMs function, it's not a slam-dunk Pareto improvement on benchmarks we care about.

Read at https://turntrout.com/gemini-steering
TurnTrout

I remember right when the negative results started hitting. I could feel the cope rising. I recognized the pattern, the straining against truth. I queried myself for what I found most painful - it was actually just losing a bet. I forced the words out of my mouth: "I guess I was wrong to be excited about this particular research direction. And Ryan was more right than I was about this matter."  

After that, it was all easier. What was there to be afraid of? I'd already admitted it! 

I find your commitment to the basics of rational epistemology inspiring.

Keep it up and let me know if you could use support.

Thank you for sharing negative results!! 

What do you think is the ideal use-case for steering? Or is it not needed?

The link to https://www.alignmentforum.org/users/ryan_greenblatt seems malformed: it uses - instead of _.