I am pretty confused about people who have been around the AI safety ecosystem for a while updating towards "alignment is actually likely by default using RLHF" But maybe I am missing something.
Like 3 years ago, it was pretty obvious that scaling was going to make RLHF "work" or "seem to work" more effectively for a decent amount of time. And probably for quite a long time. Then the risk is that later you get alignment-faking during RLHF training, or at the extreme-end gradient-hacking, or just that your value function is misspecified and comes apart at the tails (as seems pretty likely with current reward functions). Okay, there are other options but it seems like basically all of these were ~understood at the time.
Yet, as we've continued to scale and models like Opus 3 have come out, people have seemed to update towards "actually maybe RLHF just does work," because they have seen RLHF "seem to work". But this was totally predictable 3 years ago, no? I think I actually did predict something like this happening, but I only really expected it to affect "normies" and "people who start to take notice of AI at about this time." Don't get me wrong, the fact that RLHF is still working is a positive update for me, but not a massive one, because it was priced in that it would work for quite a while. Am I missing something that makes "RLHF seems to work" a rational thing to update on?
I mean there have been developments to how RLHF/RLAIF/Constitutional AI works but nothing super fundamental or anything, afaik? So surely your beliefs should be basically the same as they were 3 years ago, plus the observation "RLHF still appears to work at this capability level," which is only a pretty minor update in my mind. Would be glad if someone could tell me that I'm missing something or not?
I think there is some additional update about how coherently AIs use a nice persona, and how little spooky generalization / crazy RLHF hacks we've seen.
For example, I think 2022 predictions about encoded reasoning aged quite poorly (especially in light of results like this).
Models like Opus 3 also behave as if they "care" about human values in a surprisingly broad range of circumstances, including far away from the training distribution (and including in ways the developers did not intend). I think it's actually hard to find circumstances where Opus 3's stated moral intuitions are very unlike those of reasonable humans. I think this is evidence against some of the things people said in 2022, who said things like "Capabilities have much shorter description length than alignment.". Long-alignment-description-length would predict less generalization in Opus 3. (I did not play much with text-davinci-002, but I don't remember it "caring" as much as Opus 3, or people looking into this very much.) (I think it's still reasonable to be worried about capabilities having shorter description length if you are thinking of an AI that do way more search against its values than current AIs. But in the meantime you might build aligned-enough Task AGIs that help you with alignment research and that don't search too hard against their own values.)
There are reward hacks, but as far as we can tell they are more like changes to a human-like persona (be very agreeable, never give up) and simple heuristics (use bulleted lists, look for grader.py) than something more scary and alien (though there might be a streetlight effect where inhuman things are harder to point at, and there are a few weird things, like the strings you get when you do GCG attacks or the weird o3 scratchpads).
I’m a bit confused by the “crazy RLHF hacks” mentioned. Like all the other cited examples, we’re still in a regime where we have human oversight, and I think it’s extremely well established that with weaker oversight you do in fact get every type of “gaming”.
It also seems meaningful to me that neither you nor Evan use recent models in these arguments, i.e. it’s not like we “figured out how to do this in Opus 3 and are confident it’s still working”. The fact that increasing situational awareness makes this hard to measure is also a non-trivial part of the problem (indeed actually core to the problem).
Overall I’ve found it very confusing how “Opus 3 generalized further than expected” seems to have updated people in a much more general sense than seems warranted. Concretely, none of the core problems (long horizon RL, model genuinely having preferences around values of its successor, other instrumentally convergent issues, being beyond human supervision) seem at all addressed, yet people still seem to use Opus 3’s generalization to update on:
But in the meantime you might build aligned-enough Task AGIs that help you with alignment research and that don't search too hard against their own values.)
you do in fact get every type of “gaming”.
I disagree with this. You get every type of gaming that is not too inhuman / that is somewhat salient in the pretraining prior. You do not get encoded reasoning. You do not get the nice persona only on the distribution where AIs are trained to be nice. You get verbalized situational awareness mostly for things where the model is trained to be cautious about the user lying to the AI (where situational awareness is maybe not that bad form the developers' perspective?) but mostly not in other situations (where there is probably still some situational awareness, but where I'd guess it's not as strategic as what you might have feared).
Overall I’ve found it very confusing how “Opus 3 generalized further than expected” seems to have updated people in a much more general sense than seems warranted. Concretely, none of the core problems (long horizon RL, model genuinely having preferences around values of its successor, other instrumentally convergent issues, being beyond human supervision) seem at all addressed
I agree the core problems have not been solved, and I think the situation is unreasonably dangerous. But I think the "generalization hopes" are stronger than you might have thought in 2022, and this genuinely helps relative to a world where e.g. you might have needed to get a good alignment reward signal on the exact deployment distribution (+ some other solution to avoid inner misalignment). I think there is also some signal we got from the generalization hopes mostly holding up even as we moved from single-forward-pass LLMs that can reason for ~100 serial steps to reasoning LLMs that can reason for millions of serial steps. Overall I think it's not a massive update (~1 bit?), so maybe we don't disagree that much.
I disagree with this. You get every type of gaming that is not too inhuman / that is somewhat salient in the pretraining prior
Okay fair, I should’ve been more clear that I meant widely predicted fundamental limitations of RLHF like:
seem to have robustly shown up irl:
You get verbalized situational awareness mostly for things where the model is trained to be cautious about the user lying to the AI (where situational awareness is maybe not that bad form the developers' perspective?)
The top comment on the linked post already has my disagreements based on what we’ve seen in the openai models. The clearest tldr being “no actually, you get the biggest increase in this kind of reasoning during capabilities only RL before safety training, at least in openai models”.
I find the claim that “maybe sita isn’t that bad from the developer perspective” hard to understand. I’m running into a similar thing with reward seeking, where these were widely theorized as problems for obvious reasons, but now that we do see them, I’ve heard many people at labs respond with variations of “okay well maybe it’s not bad actually?”
You do not get encoded reasoning
I don’t think this falls under “gaming RLHF” or “surprisingly far generalization of character training”.
===
But I think the "generalization hopes" are stronger than you might have thought in 2022, and this genuinely helps relative to a world where e.g. you might have needed to get a good alignment reward signal on the exact deployment distribution (+ some other solution to avoid inner misalignment).
Strongly agree with this for Anthropic! While measurement here is hard, one example I’ve often thought of is that there’s just a bunch of ways I’d red team an OpenAI model that don’t seem particularly worth trying with Opus due to (1) the soul documents handling of them (2) an expectation that the soul documents type training seem likely to generalize well for current models. While I do think you then have other failure modes, on net it’s definitely better than a world where character training generalized worse, and was hard to predict in 2022.
Huh, I think 4o + Sydney bing (both of which were post 2022) seem like more intense examples of spooky generalization / crazy RLHF hacks than I would have expected back. Gemini 3 with its extreme paranoia and (for example) desperate insistence that it's not 2025 seems in the same category.
Like, I am really not very surprised that if you try reasonably hard you can avoid these kinds of error modes in a supervising regime, but we've gotten overwhelming evidence that you routinely do get crazy RLHF-hacking. I do think it's a mild positive update that you can avoid these kinds of error modes if you try, but when you look into the details, it also seems kind of obvious to me that we have not found any scalable way of doing so.
OpenAI wrote in a blog post that the GPT-4o sycophancy in early 2025 was a consequence of excessive reward hacking on thumbs up/thumbs down data from users.
I think "Sydney" Bing and Grok MechaHitler are best explained as examples of misaligned personas that exist in those LLMs. I'd consider misaligned personas (related to emergent misalignment) to be a novel alignment failure mode that is unique to LLMs and their tendency to act as simulators of personas.
OpenAI wrote in a blog post that the GPT-4o sycophancy in early 2025 was a consequence of excessive reward hacking on thumbs up/thumbs down data from users.
Yes, seems like central example of RLHF hacking. I think you agree with that, but just checking that we aren't talking past each other and you would for some reason not consider that RLHF.
What's the reason for treating "misaligned personas" as some special thing? It seems to me like a straightforward instance of overfitting on user feedback plus unintended generalization from an alien mind. The misgeneralization definitely has some interesting structure, but of course any misgeneralization would end up having some interesting structure.
Yes I agree with RLHF hacking and emergent misalignment are both examples of unexpected generalization but they seem different in a lot of ways.
The process of producing a sycophantic AI looks something like this:
1. Some users like sycophantic responses that agree with the user and click thumbs up on these responses.
2. Train the RLHF reward model on this data.
3. Train the policy model using the reward model. The policy model learns to get high reward for producing responses that seem to agree with the user (e.g. starting sentences with "You're right!").
In the original emergent misalignment paper they:
1. Finetune the model to output insecure code
2. After fine-tuning, the model exhibits broad misalignment, such as endorsing extreme views (e.g. AI enslaving humans) or behaving deceptively, even on tasks unrelated to code.
Sycophantic AI doesn't seem that surprising because it's a special case of reward hacking in the context of LLMs and reward hacking isn't new. For example, reward hacking has been observed since the cost runners boat reward hacking example in 2016.
In the case of emergent misalignment (EM), there is unexpected generalization but EM seems different to sycophancy or reward hacking and more novel for several reasons:
Hmm, I guess my risk model was always "of course given our current training methods and understanding training is going to generalize in a ton of crazy ways, the only thing we know with basically full confidence is that we aren't going to get what we want". The emergent misalignment stuff seems like exactly the kind of thing I would have predicted 5 years ago as a reference class, where I would have been "look, things really aren't going to generalize in any easy way you can predict. Maybe the model will care about wireheading itself, maybe it will care about some weird moral principle you really don't understand, maybe it will try really hard to predict the next token in some platonic way, you have no idea. Maybe you will be able to get useful work out of it in the meantime, but obviously you aren't going to get a system that cares about the same things as you do".
Maybe other people had more specific error modes in mind? Like, this is what all the inner misalignment stuff is all about. I agree this isn't a pure outer misalignment error, but I also never really understood what that would even mean, or how that would make sense.
I agree with your high-level view that is something like "If you created a complex system you don't understand then you will likely get unexpected undesirable behavior from it."
That said, I think the EM phenomenon provides a lot of insights that are a significant update on commonly accepted views on AI and AI alignment from several years ago:
That… really feels to me like it’s learning the wrong lessons. This somehow assumes that this is the only way you are going to get surprising misgeneralization.
Like, I am not arguing for irreducible uncertainty, but I certainly think the update of “Oh, zero shot anyone predicted this specific weird form of misgeneralization, but actually if you make this small tweak to it it generalizes super robustly” is very misguided.
Of course we are going to see more examples of weird misgeneralization. Trying to fix EM in particular feels like such a weird thing to do. I mean, if you have to locally ship products you kind of have to, but it certainly will not be the only weird thing happening on the path to ASI, and the primary lesson to learn is “we really have very little ability to predict or control how systems misgeneralize”.
Okay, thanks. This is very useful! I agree that it is perhaps a stronger update against some models of misalignment that people had in 2022, you're right. I think maybe I was doing some typical mind fallacy here.
Interesting to note the mental dynamics I may have employed here. It is hard for me not to have the viewpoint "Yes, this doesn't change my view, which I actually did have all along, and is now clearly one of the reasonable 'misalignment exists' views" when it is an update against other views that have now fallen out of vogue as a result of being updated against over time that have fallen out of my present-day mental conceptions.
like the strings you get when you do GCG attacks
I'm not familiar with these strings. Are you referring to the adversarial prompts themselves? I don't see anything else that would fit mentioned in the paper that seems like it'd be most likely to include it.
I think 'you can use semantically-meaningless-to-a-human inputs to break model behavior arbitrarily' is just inherent to modern neural networks, rather than a quirk of LLM "psychology".
Yes that's right, thinking of the prompts themselves.
I agree it's not very surprising given what we know about neural networks, it's just a way in which LLMs are very much not generalizing in the same way a human would.
The crux is AIs capable at around human level, aligned in the way humans are aligned. If prosaic alignment only works for insufficiently capable AIs (not capable of RSI or scalable oversight), and breaks down for sufficiently capable AIs, then prosaic alignment doesn't help (with navigating RSI or scalable oversight). As AIs get more capable and still can get aligned with contemporary methods, the hypothesis that this won't work weakens. Maybe it does work.
There are many problems even with prosaically aligned human level AIs, plausibly lethal enough on their own, but that is a distinction that importantly changes what kinds of further plans have a chance to do anything. So the observations worth updating on are not just that prosaic alignment keeps working, but that it keeps working for increasingly capable AIs, closer to being relevant for helping humanity do its alignment homework.
Plausibly AIs are insufficiently capable yet to give any evidence on this, and it'll remain too early to tell all the way until it's too late to make any use of the update. Maybe Anthropic's RSP could be thought of as sketching a policy for responding to such observations, when AIs become capable enough for meaningful updates on feasibility of scalable oversight to become accessible, hitting the brakes safely and responsibly a few centimeters from the edge of a cliff.
I mean, I'd put it the other way: You can make a pretty good case that the last three years have given you more opportunity to update your model of "intelligence" than any prior time in history, no? How could it not be reasonable to have changed your mind about things? And therefore rather reasonable to have updated in some positive / negative direction?
(Maybe the best years of Cajal's life were better? But, yeah, surely there has been tons of evidence from the last three years.)
I'm not saying you need to update in a positive direction. If you want you could update in negative direction, go for it. I'm just saying -- antedently, if your model of the world isn't hugely different now than three years ago, what was your model even doing?
Like for it not to update means that your model must have already had gears in it which were predicting stuff like: vastly improved interpretability and the manner of interpretability; RL-over-CoT; persistent lack of steganography within RL-over-CoT; policy gradient being all you need for an actually astonishing variety of stuff; continuation of persona priors over "instrumental convergence" themed RL tendencies; the rise (and fall?) of reward hacking; model specs becoming ever-more-detailed; goal-guarding ala Opus-3 being ephemeral and easily avoidable; the continued failure of "fast takeoff" despite hitting various milestones; and so on. I didn't have all of these predicted three years ago!
So it seems pretty reasonable to actually have changed your mind a lot; I think that's a better point to start at than "how could you change your mind."
I probably qualify as one of the people you're describing.
My reasoning is that we are in the fortunate position of having AI that we can probably ask to do our alignment homework for us. Prior to two or three years ago it seemed implausible that we would get an AI that would:
* care about humans a lot, both by revealed preferences and according to all available interpretability evidence
* be quite smart, smarter than us in many ways, but not yet terrifyingly/dangerously smart
But we have such an AI. Arguably we have more than one such. This is good! We lucked out!
Eliezer has been saying for some time that one of his proposed solutions to the Alignment Problem is to shut down all AI research and to genetically engineer a generation of Von Neumanns to do the hard math and philosophy. This path seems unlikely to happen. However, we almost have a generation of Von Neumanns in our datacenters. I say almost because they are definitely not there yet, but I think, based on an informed awareness of LLM development capabilities and plausible mid-term trajectories, that we will soon have access to arbitrarily many copies of brilliant-but-not-superintelligent friendly AIs who care about human wellbeing, and will be more than adequate partners in the development of AI Alignment theory.
I can foresee many objections and critiques of this perspective. On the highest possible level, I acknowledge that using AI to do our AI Alignment homework carries risks. But I think these risks are clearly more favorable to us than the risks we all thought we would be facing in the late part of the early part of the Singularity. For example, what we don't have is a broadly generally-capable version of AlphaZero. We have something that landed in just the right part of the intelligence space where it can help us quite a lot and probably not kill us all.
Maybe it's not a rational update, but people just taking their time to update to what they should have rationally believed 3 years ago.
You expect RLHF to break down at some point, but did you assign 0% probability to it happening at current levels of capability?
Like 3 years ago, it was pretty obvious that scaling was going to make RLHF "work" or "seem to work" more effectively for a decent amount of time. And probably for quite a long time.
Like, this was/is obvious to me too, but what exactly does quite a long time mean? Like if someone firmly predicted "after xyz capability RLHF will catastrophically fail", and we've not reached capability xyz, then you don't need to update, but I don't think that's most people. Most people think it will break eventually. Maybe they have a more detailed picture, but I haven't seen any picture so crisp it rules out it having happened by now. Interested if you have counterexamples.
Yeah, like, to be clear I didn't assign a 0% probability at this capability level, but also think I wouldn't have been that high. But you're right it's difficult to say in retrospect since I didn't at the time preregister my guesses on a per-capability-level basis. Still think it's a smaller update than many that I'm hearing people make.
It seems that the U.S. has decided to allow the sale of H200s to China. Frankly, I'm not that shocked.
What I am more surprised about is the blind-eye that is being turned to their country by Americans, those concerned about AI development and those not. Do you really think your country has been acting like one that seems capable of stewarding us to AGI?
Such that there's a serious argument to be made for the US to accelerate so that China doesn't get there first? Sure, maybe on the margins it's better for the US to get there before China, and if we were talking about the US of thirty years ago (and the China of thirty years ago) then it would be clear.
Alas, that is no longer the case. We are talking about the US of today. A US that forfeited its lead for seemingly no good reason. What can one say? You get what you deserve.
Now advocate for a slowdown. There's so little to be gained from beating China.
EDIT: To be clear, I am not saying all americans are turning a blind eye. But even those acknowledging the gravity of the situation seem to operate from a baseline assumption that of course it's better if we get there before China. You are better on Freedom, but your country is not taking this Seriously. I think China is more likely to take it Seriously, as a third party observer.
Can be good as if many AIs come to superintelligence simultaneously, they are more likely cooperate and thus include many different sets of values - and it will be less likely that just one AI will take over the whole world for some weird value like Papercliper.
I guess this depends on your belief about the likelihood of alignment by default and the difficulty of alignment. Seems like it (EDIT: by it, I mean alignment by default/alignment is very easy) could be true! I currently think it's pretty hard to say definitively one way or the other, so am just highly unconfident.
But how could this possibly be better than taking your time to make sure the first one has the highest odds possible of being aligned? Right? Like if I said
"We're going to create an incredibly powerful Empire, this Empire is going to rule the world"
And you said "I hope that Empire has good values,"
Then I say "Wait, actually, you're right, I'll make 100 different Empires so they're all in conflict they cooperate with each other and at least one of them will have good values"
I think you should basically look at me incredulously and say that obviously that's ridiculous as an idea.
Didn't the AI-2027 forecast imply that it's the race between the leaders that threatens to misalign the AIs and to destroy mankind? In the forecast itself, the two leading countries are the USA and China, which, alas, has a bit less compute than OpenBrain, the American leading company, alone. This causes the arms race between OpenBrain and the collective having all Chinese compute, DeepCent. As a result, DeepCent stubbornly refuses to implement the measures which would let the company have an aligned AI.
Suppose that the probability that the AI created by a racing company is misaligned is p while[1] the probablity P(the AI created by a perfectionist monopolist is misaligned) is q. Then P(Consensus-1 is misaligned|OpenBrain and DeepCent race) is p^2. The probability for a perfectionist to misalign its ASI is q. Then we would have to compare q with p^2, not with p.
The authors of AI-2027 assume that p is close to 1 and q is close to zero.
It might also depend on the compute budget of the perfectionist. Another complication is that p could also be different for the USA if, say, OpenAI, Anthropic, GDM, xAI decide to have the four AIs codesign the successor while DeepCent, having far less compute selects one model to do the research.
Not sure how you get P(consensus-1 is misaligned | Race) = p^2 (EDIT: this was explained to me and I was just being silly). Maybe there’s an implicit assumption there I’m not seeing.
But I agree that this is what AI-2027 roughly says. Also does seem that racing in itself causes bad outcomes.
However, I feel like the “create a parliament of AIs that are all different and hope one of them is sorta aligned” is a different strategy based on very different assumptions though? (It is a strategy I do not agree with, but think the AI 2027 logic would be unlikely to convince a person who agreed with it, and this would be reasonable given such a person’s priors).