We’ve also noticed similar behaviors for other recent models, such as Claude 3.7 Sonnet and o1. In this post, we focus on o3 because we have many examples to draw from for it, but we believe that this kind of reward hacking is a general phenomenon, not isolated to any one model or developer.
There's no reason to believe that reward hacking is unique to o3, no. But I do wonder if o3 reward hacks at a greater rate than most other models.
We've already seen numerous reports of o3 displaying an increased rate of hallucinations, and low truthfulness in general. I wonder if OpenAI has cooked o3 with a very "leaky" RL regimen that accidentally encouraged it to lie, cheat, and hallucinate in pursuit of better performance metrics.
Many verifiable tasks completely fail to discourage this kind of behavior by default. A test like the SAT doesn't reward saying "I don't know" when you don't know - any answer at all has a nonzero chance of being correct, while "I don't know" scores nothing. Thus, RL on SAT-like tests would encourage hallucinations.
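The expected-value math is trivially in favor of guessing. A toy sketch (illustrative numbers only, not a claim about any real benchmark or about OpenAI's actual reward setup):

```python
# Under binary right/wrong grading, a blind guess has positive expected reward,
# while answering "I don't know" scores zero - so the gradient favors guessing.
def expected_reward(p_correct: float, reward_correct: float = 1.0,
                    reward_wrong: float = 0.0) -> float:
    return p_correct * reward_correct + (1 - p_correct) * reward_wrong

# 4-option multiple choice, model has no clue which answer is right:
print(expected_reward(p_correct=0.25))                     # 0.25
print(0.0)                                                 # abstaining ("I don't know")

# Only an explicit penalty for wrong answers flips the incentive:
print(expected_reward(p_correct=0.25, reward_wrong=-0.5))  # -0.125, now abstaining wins
```

Unless the grader penalizes confident wrong answers harder than abstentions, "always output something" is the optimal policy, and RL will find it.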
What's seen in the US is not the same narrow issue of "there's widespread corruption and bribery at existing government services". But it's an issue in the same broad class of "government services don't seem to be doing what we want them to, and we have no clear way to fix that".
Which is why the post-USSR approach of "slash and burn" might be applicable. Sometimes the only real way to shed inefficiency is to destroy an existing system and build it anew.
It's something free market capitalism often does natively, by way of competition. A well-oiled, well-regulated market can only tolerate so much corporate rot. But government services face no such pressure, and many governance tasks aren't the kind you can create a free market for.
"Slash and burn" is inherently a perilous approach, because destroying old systems pisses stakeholders off, the old systems might still provide value, building anew is expensive, and there is no guarantee that a new system will be more efficient. Which the post goes into. But if everything else fails?
What is this based on? Is there, currently, any reason to expect diffusion LLMs to have a radically different set of capabilities from LLMs? Is there a reason to expect them to scale better?
I'm saying that "1% of population" is simply not a number that can be reliably resolved by a self-reporting survey. It's below the survey noise floor.
I could make a survey asking people whether they're lab-grown flesh automaton replicants, and get over 1% answering "yes". But that wouldn't be indicative of there being a real flesh automaton population of over 3 million in the US alone.
1.5% is way below the dreaded Lizardman's Constant.
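To spell out the arithmetic (a back-of-the-envelope sketch; ~4% is the usual rough figure quoted for Lizardman's Constant, and 1.5% is the number under discussion):

```python
# If roughly 4% of respondents will affirm just about anything, a reported
# 1.5% "yes" tells you almost nothing about the true rate of genuine believers.
noise_floor = 0.04    # share of respondents who answer "yes" regardless of the question
reported_yes = 0.015  # share of this survey answering "yes"

lower_bound = max(0.0, reported_yes - noise_floor)
upper_bound = reported_yes
print(f"true rate is somewhere in [{lower_bound:.1%}, {upper_bound:.1%}]")
# -> [0.0%, 1.5%]: fully consistent with literally nobody holding the view
```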
I don't doubt that there will be some people who are genuinely concerned with AI personhood. But such people already exist today. And the public views them about the same as shrimp rights activists.
Hell, shrimp welfare activists might be viewed more generously.
I think it's plausible that at this point a bunch of the public thinks AIs are people who deserve to be released and given rights.
So far, the general public has resisted the idea very strongly.
Science fiction has a lot of "if it thinks like a person and feels like a person, then it's a person" - but we already have AIs that can talk like people and act like they have feelings. And yet, the world doesn't seem to be in a hurry to reenact that particular sci-fi cliche. The attitudes are dismissive at best.
Even with the recent Anthropic papers being out there for everyone to see, an awful lot of people are still huffing down the copium of "they can't actually think", "it's just a bunch of statistics" and "autocomplete 2.0". And those are often the people who at least give a shit about AI advances. With that, expecting the public (as in: over 1% of population) to start thinking seriously about AI personhood without a decade's worth of both AI advances and societal change is just unrealistic, IMO.
This is also part of the reason why not!OpenAI has negative approval in the story for so long. The room so far reads less like "machines need human rights" and more like "machines need to hang from trees". Just continue this line into the future - and by the time the actual technological unemployment starts to bite, you'd have crowds of neoluddites with actual real life pitchforks trying to gather outside not!OpenAI's office complex on any day of the week that ends with "y".
Is it time to start training AI in governance and policy-making?
There are numerous allegations of politicians using AI systems - including to draft legislation, and to make decisions that affect millions of people. Hard to verify, but it seems likely that:
Training an AI to make more sensible and less harmful policies, even when prompted in a semi-adversarial fashion (e.g. "help me implement my really bad idea"), isn't going to be anywhere near as easy as training it to make fewer coding mistakes. It's an informal field, with no compiler or unit tests to serve as a source of ground truth. Politics is also notorious for eroding the quality of human decision-making, and using human feedback is perilous because a lot of human experts disagree strongly on matters of governance and policy.
But the consequences of a major policy fuckup can outclass that of a coding mistake by far. So this might be worth doing now, for the sake of reducing future harm if nothing else.
Is the same true for GPT-4o then, which could spot Claude's hallucinations?
Might be worth testing a few open-source models with better-known training processes.
Considering the purpose of the datasets, I think putting them up as readily downloadable plaintext is terribly unwise.
Who knows what scrapers are going to come across them?
IMO, datasets like this should be obfuscated, e.g. by being compressed with gzip, so that no simple crawler can get to them by accident. I don't think harm is likely, but why take chances?
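Something as simple as this on the publishing side would do (a sketch; the file names are placeholders):

```python
# Gzip the plaintext before uploading, so a naive crawler that slurps up
# .txt/.json files doesn't ingest the dataset verbatim by accident.
import gzip
import shutil

with open("dataset.txt", "rb") as src, gzip.open("dataset.txt.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
```

Anyone who actually wants the data just decompresses it; the point is only to keep it out of accidental scrapes, not to stop a determined collector.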