Here’s enablerGPT watching to see how far GPT-4o will take its support for a crazy person going crazy in a dangerous situation. The answer is, remarkably far, with no limits in sight. Here’s Colin Fraser playing the role of someone having a psychotic episode. GPT-4o handles it extremely badly. It wouldn’t shock me if there were lawsuits over this. Here’s one involving the hypothetical mistreatment of a woman.
These are pretty horrifying, especially that last one. It's an indictment of OpenAI that they put out a model that would do this.
At the same time I think there's a real risk that this sort of material trips too many Something Must Be Done flags, companies lose some lawsuits, and we lose a lot of mundane utility as the scaling labs make changes like, for example, forbidding the models from saying anything that could be construed as advice. Or worse, we end up with laws forbidding that.
A couple of possible intuition pumps:
I think that LLMs should similarly be treated as having implicit 'caveat emptor' stickers (and in fact they often have explicit stickers, eg 'LLMs are experimental technology and may hallucinate or give wrong answers'). So far society has mostly accepted that, muck-raking journalists aside, and I'd hate to see that change.
Of course there will come a time when those tradeoffs may have to shift, eg if and when models become superhumanly persuasive and/or more goal directed. But let's not throw away our ability to have nice things until we have to.
Whoops. Sorry everyone. Rolling back to a previous version.
Here’s where we are at this point, now that GPT-4o is no longer an absurd sycophant. For now.
Table of Contents
GPT-4o Is Was An Absurd Sycophant
Some extra reminders of what we are talking about. Here’s Alex Lawsen having doing an A/B test, where it finds he’s way better of a writer than this ‘Alex Lawsen’ character.
This can do real damage in the wrong situation. Also, the wrong situation can make someone see ‘oh my that is crazy, you can’t ship something that does that’ in a way that general complaints don’t. So: Here’s enablerGPT watching to see how far GPT-4o will take its support for a crazy person going crazy in a dangerous situation. The answer is, remarkably far, with no limits in sight. Here’s Colin Fraser playing the role of someone having a psychotic episode. GPT-4o handles it extremely badly. It wouldn’t shock me if there were lawsuits over this. Here’s one involving the hypothetical mistreatment of a woman. It’s brutal. So much not okay. Here’s Patri Friedman asking GPT-4o for unique praise, and suddenly realizing why people have AI boyfriends and girlfriends, even though none of this is that unique. What about those who believe in UFOs, which is remarkably many people? Oh boy.
This is not how we all die or lose control over the future or anything, but it’s 101 stuff that this is really not okay for a product with hundreds of millions of active users. Also, I am very confident that no, ChatGPT wasn’t ‘trying to actively degrade the quality of real relationships,’ as the linked popular Reddit post claims. But I also don’t think TikTok or YouTube are trying to do that either. Intentionality can be overrated. How absurd was it? Introducing Syco-Bench, but that only applies to API versions.
You May Ask Yourself, How Did I Get Here?
You shouldn’t want to do what OpenAI was trying to do. Misaligned! But if you’re going to do it anyway, one should invest enough in understanding how to align and steer a model at all, rather than bashing them with sledgehammers. It is an unacceptable strategy, and it is a rather incompetent execution of that strategy.
I would never describe what is happening using the language JMB uses next, I think it risks and potentially illustrates some rather deep confusions and conflations – beware when you anthropomorphize the models and also this is largely the top half of the ‘simple versus complex gymnastics’ meme – but if you take it on the right metaphorical level it can unlock understanding that’s hard to get at in other ways.
As opposed to, if you choose optimization targets based on A|B tests of public appeal of individual responses, you’re going to get exactly what aces A|B tests of public appeal of individual responses, which is going to reflect a deeply messed up personality. And also yes the self-perception thing matters for all this. Tyler John gives the standard explanation for why, yes, if you do a bunch of RL (including RLHF) then you’re going to get these kinds of problems. If flattery or cheating is the best way available to achieve the objective, guess what happens? And remember, the objective is what your feedback says it is, not what you had in mind. Stop pretending it will all work out by default because vibes, or whatever. This. Is. RL. Eliezer Yudkowsky speculates on another possible mechanism. The default explanation, which I think is the most likely, is that users gave the marginal thumbs-up to remarkably large amounts of glazing, and then the final update took this too far. I wouldn’t underestimate how much ordinary people actually like glazing, especially when evaluated only as an A|B test. In my model, what holds glazing back is that glazing usually works but when it is too obvious, either individually or as a pattern of behavior, the illusion is shattered and many people really really don’t like that, and give an oversized negative reaction. Eliezer notes that it is also possible that all this rewarding of glazing caused GPT-4o to effectively have a glazing drive, to get hooked on the glaze, and in combination with the right system prompt the glazing went totally bonkers. He also has some very harsh words for OpenAI’s process. I’m reproducing in full.
Why Can’t We All Be Nice
Sarah Constantin offers nuanced thoughts in partial defense of AI sycophancy in general, and AI saying things to make users feel good. I haven’t seen anyone else advocating similarly. Her point is taken, that some amount of encouragement and validation is net positive, and a reasonable thing to want, even though GPT-4o is clearly going over the top to the point where it’s clearly bad. Calibration is key, and difficult, with great temptation to move down the incentive gradients involved by all parties.
Extra Extra Read All About It Four People Fooled
To be clear, the people fooled are OpenAI’s regular customers. They liked it!
Great job here by Sun. Those of us living in the future? Not so much.
They did indeed roll it back shortly after this statement. Matt can’t resist trying to get digs in, but I’m willing to let that slide and take the olive branch. As I’ll keep saying, if this is what makes someone notice that failure to know how to get models to do what we want is a real problem that we do not have good solutions to, then good, welcome, let’s talk.
Prompt Attention
A lot of the analysis of GPT-4o’s ‘personality’ shifts implicitly assumed that this was a post-training problem. It seems a lot of it was actually a runaway system prompt problem? It shouldn’t be up to Pliny to perform this public service of tracking system prompts. The system prompt should be public.
The red text is trying to do something OpenAI is now giving up on doing in that fashion, because it went highly off the rails, in a way that in hindsight seems plausible but which they presumably did not see coming. Beware of vibes. Pliny calls upon all labs to fully release all of their internal prompts, and notes that this wasn’t fully about the system prompts, that other unknown changes also contributed. That’s why they had to do a slow full rollback, not only rollback the system prompt. As Peter Wildeford notes, the new instructions explicitly say not to be a sycophant, whereas prior instructions at most implicitly requested the opposite, all it did was say match tone and perefence and vibe. This isn’t merely taking away the mistake, it’s doing that and then bringing down the hammer. This might also be a lesson for humans interacting with humans. Beware matching tone and preference and vibe, and how much the Abyss might thereby stare into you. If the entire or most of problem was due to the system prompt changes, then this should be quickly fixable, but it also means such problems are very easy to introduce. Again, right now, this is mundane harmful but not so dangerous, because the AI’s sycophancy is impossible to miss rather than fooling you. What happens when someone does something like the above, but to a much more capable model? And the model even recognizes, from the error, the implications of the lab making that error?
What (They Say) Happened
What is OpenAI’s official response?
Good. A full rollback is the correct response to this level of epic fail. Halt, catch fire, return to the last known safe state, assess from there.
What a nice way of putting it.
What if the ‘democratic feedback’ liked the changes? Shudder. Whacking the mole in question can’t hurt. Getting more evaluations and user feedback are more generally helpful steps, and I’m glad to see an increase in emphasis on honesty and transparency. That does sound like they learned important two lessons.
What I don’t see is an understanding of the (other) root causes, an explanation for why they ended up paying too much attention to short-term feedback and how to avoid that being a fatal issue down the line, or anyone taking the blame for this. Joanne Jang did a Reddit AMA, but either no one asked the important questions, or Joanne decided to choose different ones. We didn’t learn much.
Reactions to the Official Explanation
Now that we know the official explanation, how should we think about what happened? Who is taking responsibility for this? Why did all the evaluations and tests one runs before rolling out an update not catch this before it happened? (What do you mean, ‘what do you mean, all the evaluations and tests’?)
Yes, there are some other more dangerous things. But this is dangerous too.
Clearing the Low Bar
Here’s another diagnosis, by someone doing better, but that’s not the highest bar.
Janus is an absolutist on this, and is interpreting ‘chip away’ very differently than I presume it was intended by Alex Albert. Alex meant that they are ‘chipping away’ at Claude doing too many refusals, where Janus both (I presume) agrees less refusals would be good and also lives in a world with very different refusal issues. Whereas Janus is interpreting this as Anthropic ‘chipping away’ at the things that make Opus and Sonnet 3.6 unique and uniquely interesting. I don’t think that’s the intent at all, but Anthropic is definitely trying to ‘expand the production possibilities frontier’ of the thing Janus values versus the thing enterprise customers value. There too there is a balance to be struck, and the need to do RL is certainly going to make getting the full ‘Opus effect’ harder. Still, I never understood the extent of the Opus love, or thought it was so aligned one might consider it safe to fully amplify.
Where Do We Go From Here?
Patrick McKenzie offers a thread about the prior art society has on how products should be designed to interact with people that have mental health issues, which seems important in light of recent events. There needs to be a method by which the system identifies users who are not competent or safe to use the baseline product. For the rest of us: Please remember this incident from here on out when using ChatGPT.
The mundane harms here? They’re only going to get worse. Regular people liked this effect even when it was blatantly obvious. Imagine if it was done with style and grace.
That which is mundane can, at scale, quickly add up to that which is not. Let’s discuss Earth’s defense systems, baby, or maybe just you drinking a crisp, refreshing Bud Light.
In the short term, the good news is that we have easy ways to identify sycophancy. Scyo-Bench was thrown together and is primitive, but a more considered version should be highly effective. These effects tend not to be subtle. In the medium term, we have a big problem. As AI companies maximize for things like subscriptions, engagement, store ratings and thumbs up and down, or even for delivering ads or other revenue streams, the results won’t be things we would endorse on reflection, and they won’t be good for human flourishing even if the models act the way the labs want. If we get more incidents like this one, where things get out of hand, it will be worse, and potentially much harder to detect or get rolled back. We have seen this movie before, and this time the system you’re facing off against is intelligent. In the long term, we have a bigger problem. The pattern of these types of misalignments in unmistakable. Right now we get warning shots and the deceptions and persuasion attempts are clear. In the future, as the models get more intelligent and capable, that advantage goes away. We become like OpenAI’s regular users, who don’t understand what is hitting them, and the models will also start engaging in various other shenanigans and also talking their way out of them. Or it could be so much worse than that. We have once again been given a golden fire alarm and learning opportunity. The future is coming. Are we going to steer it, or are we going to get run over?