TL;DR
Context. Feature‑steering promises a cleaner, more interpretable way to shape LLM behavior than alternatives such as plain prompting. But does it actually work better?
What I did. Benchmarked Goodfire's Autosteer method against three others (Simple Prompting, Agentic, Combined Approach) + a control, on 10 steering goals × 30 prompts for two Llama variants (8B & 70B), and scored every output on Elicited Behavior Strength and Coherence Drop using an LLM‑as‑judge rubric.
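The evaluation grid above can be sketched as a scoring loop. This is a minimal illustration, not the actual harness: the method names, the `judge` stub, and the score keys are assumptions standing in for the real LLM‑as‑judge rubric call.

```python
from itertools import product

# Hypothetical evaluation grid: 10 steering goals x 30 prompts,
# with each method's outputs scored on the two rubric axes.
METHODS = ["control", "simple_prompting", "autosteer", "agentic", "combined"]
GOALS = [f"goal_{i}" for i in range(10)]
PROMPTS = [f"prompt_{j}" for j in range(30)]

def judge(output: str) -> dict:
    """Stub for the LLM-as-judge call; a real implementation would
    prompt a judge model and parse rubric scores from its response."""
    return {"behavior_strength": 0.0, "coherence_drop": 0.0}

def evaluate(generate) -> dict:
    """Average each method's judge scores over the full goal x prompt grid.

    `generate(method, goal, prompt)` is a placeholder for producing a
    steered (or prompted) model output.
    """
    results = {}
    for method in METHODS:
        scores = [judge(generate(method, g, p))
                  for g, p in product(GOALS, PROMPTS)]
        results[method] = {
            key: sum(s[key] for s in scores) / len(scores)
            for key in scores[0]
        }
    return results
```

Per‑method averages over the 300 (goal, prompt) pairs are what the findings below compare.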
Key findings.
- Prompting ≈ best overall. Plain textual instructions already hit strong behavior scores without harming coherence.
- Autosteer ⇩ coherence. Stand‑alone steering drops coherence by ≈ 0.6 points and still under‑delivers on the target behavior.
- Combined > stand‑alone. Simple Prompting + Autosteer gives the largest behavior boost (+0.4 points) but still inherits the coherence hit.
- Manual feature selection beats Autosteer.