jacklchang's Shortform

jacklchang


Anthropic's Fable 5 alignment assessment introduced a new metric: the wet blanket score, measuring "excessively discouraging, dismissive, or moralizing tone toward the user." After some digging, I found that it is the less-discussed failure mode of a tension Anthropic documented in 2022, when helpfulness and harmlessness competed as training objectives.

To me, sycophancy and wet blanket are symmetric failures. One is a model that won't push back when it should. The other is a model that won't engage when it should. I think both are optimizing for the wrong signal about what helpfulness means.

Genuine helpfulness isn't a point between sycophancy and being a wet blanket. Think about the friend group dynamic when someone asks for advice because their situationship didn't text back.

Friend A: Yes, text him! You deserve answers, and he’s lucky to have you.

Friend B: I don’t know, the last three times you texted first it didn’t go well. Remember what happened last time? These situationships don’t really work out - have you thought about taking a break from dating altogether?

Friend C: What do you actually want to happen? If you text him and he doesn't respond, can you handle that right now? Because if yes, text him. If no, give it three days.

What was your gut reaction to each of those responses?

You probably identified Friend C as being the most helpful. But why?

Friend A just agreed with everything you said. They’re your number 1 hype person, and it can feel good in the moment, but eventually you stop trusting their opinion because you know they’ll agree with you regardless.

Friend B told you what could go wrong and what already did. I don’t think there was any malicious intent, but it definitely brought the mood down.

Then there’s Friend C, who listens, thinks about what you’re actually trying to do, and, when something is a bad idea, tells you in a way that makes you feel like they’re on your side. They push back when it matters and they get out of the way when it doesn’t.

I don’t think Friend C is doing something in between Friend A and Friend B, but something categorically different. Friend A and Friend B are both optimizing for their own protection. Friend A wants your approval as the fun, supportive one in the group. Friend B wants cover: if things go wrong, they warned you about it, and if things go right, they’ll take credit for the caution.

The first two friends are in some sense still talking about themselves.

Now for AI: sycophancy is optimizing for approval, and the wet blanket failure is optimizing against blame. Genuine helpfulness requires the model to be fully outside of that approval dynamic and inside the user’s actual situation.

Open questions:

Can RLHF produce Friend C? Or does the approval signal always pull toward A or B?
If genuine helpfulness requires the model to be outside its own approval dynamics, is that a training problem or an architecture problem?
Friend C's helpfulness depends on understanding what the user actually wants versus what they're asking for. Is that distinction even learnable from human feedback?

Adapted from my blog post: https://jacklucaschang.substack.com/p/the-wet-blanket-metric