LESSWRONG

Tags: Activation Engineering, Postmortems & Retrospectives, AI
Steering Gemini with BiDPO

by TurnTrout
31st Jan 2025
AI Alignment Forum

This is a linkpost for https://turntrout.com/gemini-steering
TurnTrout (5mo):

I remember right when the negative results started hitting. I could feel the cope rising. I recognized the pattern, the straining against truth. I queried myself for what I found most painful: it was actually just losing a bet. I forced the words out of my mouth: "I guess I was wrong to be excited about this particular research direction. And Ryan was more right than I was about this matter."

After that, it was all easier. What was there to be afraid of? I'd already admitted it! 

Eli Tyre (5mo):

I find your commitment to the basics of rational epistemology inspiring.

Keep it up and let me know if you could use support.

Kabir Kumar (5mo):

Thank you for sharing negative results!! 

lemonhope (5mo):

What do you think is the ideal use-case for steering? Or is it not needed?

Martin Vlach (5mo):

The link to https://www.alignmentforum.org/users/ryan_greenblatt seems malformed: it uses - instead of _.


Coauthored with Mark Kurzeja

A while back, we explored the “BiDPO” method for training steering vectors. In Gemini 1.5v1 Flash and Pro, BiDPO steering vectors boosted TruthfulQA scores by >10% while mostly retaining capabilities. When we updated to Gemini 1.5v2, prompt-based steering baselines became significantly stronger. BiDPO did not beat the stronger baselines, ending the project.

...

BiDPO seems effective and sample-efficient but does not currently exceed more standard baselines. It’s hard to draw firm conclusions about BiDPO because TruthfulQA might not be measuring truthfulness/factuality. However, we remain excited about DPO-driven Conditional Activation Steering, which has additional advantages, particularly for targeted loss mitigation.

This result is largely negative. I wanted to share it to increase scientific understanding around steering! We also conducted a postmortem on why the method stopped outperforming baselines.

I'd also like to note that @ryan_greenblatt's skepticism predicted this outcome more strongly than my worldview did. I want him to get points for that. :) While I think steering has targeted applications and provides clues about how LLMs function, it's not a slam-dunk Pareto improvement on benchmarks we care about.

Read at https://turntrout.com/gemini-steering![1] 

[1] Also mirrored on the GDM safety research Medium.