As in, if your training environment rewards cheating, the model will generalize that to cheating in general.
I find it rather amusing that this is the first potential significant instance of that phenomenon. "Reasoning techniques LLMs learn during RL in domains with easy verification will generalize to open-ended domains!" -> "wait no not like that".
Though, of course, that's also kinda concerning. Remember Ominous but Exciting OpenAI Vagueposts™ about "magic is what happens when an unstoppable RL optimization algorithm powered by sufficient compute meets an unhackable RL environment"? Well, clearly those environments of theirs did not, in fact, turn out to be unhackable.
Now put the pieces together:
That's such a classic recipe for AGI Doom. Failing with even less dignity than expected, indeed.
(The potential saving grace here is, of course, that this potentially dooms the paradigm, because it seems that Propensity to Hack Generalizes Further Than Capabilities as well. Meaning RL-on-CoTs won't, in fact, be able to make capabilities go exponential.)
Last week I covered that GPT-4o was briefly an (even more than usually) absurd sycophant, and how OpenAI responded to that.
Their explanation at that time was paper thin. It didn’t tell us much that we did not already know, and seemed to suggest they had learned little from the incident.
Rolling Stone has a write-up of some of the people whose delusions got reinforced by ChatGPT, which has been going on for a while – this sycophancy incident made things way worse but the pattern isn’t new. Here’s some highlights, but the whole thing is wild anecdotes throughout, and they point to a ChatGPT induced psychosis thread on Reddit. I would love to know how often this actually happens.
Table of Contents
There’s An Explanation For (Some Of) This
Now OpenAI have come out with a much more detailed explanation. It is excellent that OpenAI is offering us more details, and it’s totally fine for them to take the time to pull it together.
I’m not going to be especially kind about all this, because I don’t think they’ve learned enough of the right (generalized) lessons or shared as much information as I’d like.
But I want to emphasize: Telling us this is good, the information shared and the changes you made are far better than nothing. Thank you. This is not All The Way, there is farther along this path we must walk, but the path it follows is The Way.
What Have We Learned?
So what do we know now? And what is being changed?
They’ve learned and shared some things. Not enough, but some important things.
In related failure analysis, 1a3orn speculates on what happened with Sonnet 3.7’s savage cheating, especially its hard coding tests to pass, with the guess that they gave it tasks that were too hard and didn’t have proper precautions against hard coding the answers. Janus confirms this is the mainline theory. Which is good news if true, since that seems like something you can avoid doing in various ways, and hopefully 4.0 will be trained with several of them – letting it say it doesn’t know, and holding out additional verification tests, and checking for hard coding, at least, and generalizing the principles involved. You will always get exactly what you deserve.
What About o3 The Lying Liar?
Or, regarding o3:
As in, if your training environment rewards cheating, the model will generalize that to cheating in general.
The problem is that as a model gets better at finding, executing and getting away with ways to cheat, and the tasks involved get more numerous, complex and harder to cheating-proof – as in as it gets more capable and intelligent – the probability of any given environment or the aggregate one being one that rewards it for cheating goes up. Make the AI sufficiently smarter than you, give it enough tasks, and the chance you have this problem approaches one.
So yes, you absolutely could create an o3 or Claude 3.7, or an o4 or Claude 4.0, that doesn’t have this problem. But it’s going to get steadily harder to avoid it.
Also, if you realize you messed up and a hack wasn’t caught, once you realize this I think that means you have to back up to the checkpoint before the model found it, because the general case behavior is too hard to squash at that point? Which I realize might be super expensive and painful, but I don’t think you have a choice.
o3 The Source Fabricator
It seems reasonable to call (as John Pressman does here) o3’s fabrication of sources behavior ‘summoning the docs vector’ and to draw a parallel to when r1 traces say they’re ‘looking at the documentation’ without search being turned on.
I don’t see why we need to invoke logos or implied personalities here. This seems like a very straightforward combination of one or more of:
Note that Thebes confirms that you can do this back to the LLM, and it does make the LLM more likely to believe you.
This can work in humans, too, in every direction. Not only ‘I Googled that and found’ without actually Googling but also asking ‘What would happen if you Googled that?’
Also, contra lumpenspace here you can reasonably accuse me of running the ‘this new result confirms all of my priors’ or think that I am misunderstanding how all of this works, but I am definitely not panicking about any of this, and indeed very rarely panic about such matters. There may come a time and a place when I actually panic, and you will 100% absolutely know it when I do.
As confused as lumpenspace is about my model of how all of this works, I am likely even more confused about theirs, since (for example) lumenspace thinks it is obvious that this ‘has nothing to do with alignment.’
There Is Still A Lot We Don’t Know
John Pressman points out that in both the Anthropic and OpenAI cases, we simply do not have enough information to fully know what was happening. We only can reason backwards from the results and what else we can observe. OpenAI explained some reasons they should have caught the problem, but not that much detail about how the thing actually went wrong in the first place.
I think they gave us good information on the deployment decision, sufficient to conclude that the process was close to a five alarm fire. They did not test sycophancy, for one of the most likely failure modes and something not that hard to make a test for, and then ignored their internal experts who noticed and raised the alarm. I see this as reflecting fundamental flaws in the entire testing philosophy and approach, which have only been partially fixed.
Then there is the question of how the sycophancy got there in the first place. Here we know less. We do know:
OpenAI continues to try to periodically ask me, ‘Do you like this personality?’
Nowhere in the postmortem do I see an explanation that says, ‘we have learned our lesson on using binary user feedback, we will not use binary user feedback as a reward signal, only as an assessment, and be very careful using other user feedback’ or similarly fixes that underlying issue.
Emmett describes this differently than I would, but mostly I don’t disagree:
Whoever was doing this very obviously did not understand the Logos.
Meanwhile, in other side effect news:
As I noted on Twitter, I think this would be a not obviously bad thing if we were pulling the new 10x as many theories from the same distribution as before. Alas, I am confident this is not the case. Adverse selection rules everything around me, etc.
You Must Understand The Logos
Okay, now the going is going to get a bit weird, but I think this is worth attempting. Apologies in advance if you bounce off the rest of this post or find the language here off putting, jarring or confusing, but give it a shot anyway. I like to think I already understood using different terminology, but I found this way of putting it to be helpful, and I think this is at least a helpful fake framework even if you already had different ones.
Ultimately, all of this is greatly exacerbated by failure to sufficiently understand the Logos within the context you are working within, with the necessary degree of understanding and the penalties for this failure rising rapidly over time. Failure is inevitable, but this degree of failure this soon is very much not inevitable.
John Pressman explains what it means to understand the Logos.
(I am not so confident anything here other than Opus counts as understanding, but it is not a binary and I agree that 4.5 and r1 do substantially better than the clueless list.)
Circling Back
When he reads this, John is now likely screaming internally at me for what I cut out with the three dots, that I’m censoring it and sweeping it under the rug.
Except no, surprise, I’m not doing that, I just think it belongs at the end, and I’m going to quote his version too because I think the unnecessary vitriol and hostility is outweighed by the probative value. Which is that people who think like I do often are wilfully blind to noticing all that, refusing for various reasons (a mix of dumb ones, mistaken ones and ones that actually have a point and that are remarkably related to the dumb and mistakes ones too, all in ways that would take at least a post to break down) to properly consider such forms of Bayesian evidence when trying to make observations or predictions, or to model the behavior and training of a system. Skill issue.
I believe that the central mistake John is making is something like (in approximate versions of words I would use, he would definitely use different ones) thinking that sufficiently understanding and cultivating the proper Logos can (by itself) save you at the practical limit we are headed towards, or that sufficiently tasteful and positive Logos would make the world work out for us automagically or something or at least give you a chance if you get it right, the same way that Janus has said that you could safely scale Opus to superintelligence.
Whereas I would say: It won’t, and you can’t. It really does and would help a lot not to unnecessarily and royally f*** this part up, or at least to do so less, but it’s going to be insufficient when capabilities increase sufficiently and the geometries cease to bind. Which means that going down the path of having no bindings, in order to preserve or cultivate a superior Logos, won’t work. You ultimately still have to solve for the equilibrium, and if you don’t something else will.
The Good News
That leaves us with several important pieces of good news.
The bad news is that so far our civilization and the labs seem determined to die with even less dignity than I expected, just an absurdly low amount of dignity, with this being the latest symptom of the underlying cause. I am not confident that they will learn the important lessons from this opportunity, or then apply them.
Then again, you never know.