I am pretty confused about people who have been around the AI safety ecosystem for a while updating towards "alignment is actually likely by default using RLHF." But maybe I am missing something.
Like 3 years ago, it was pretty obvious that scaling was going to make RLHF "work" or "seem to work" more effectively for a decent amount of time, and probably for quite a long time. The risk is that later you get alignment-faking during RLHF training, or at the extreme end gradient-hacking, or just that your value function is misspecified and comes apart at the tails (as seems pretty likely with current reward functions). Okay, there are other options, but it seems like basically all of these were ~understood at the time.
Yet, as we've continued to scale and models like Opus 3 have come out, people seem to have updated towards "actually maybe RLHF just does work," because they have seen RLHF "seem to work." But this was totally predictable 3 years ago, no? I think I actually did predict something like this happening, but I only really expected it to affect "normies" and "people who start to take notice of AI at about this time." Don't get me wrong, the fact that RLHF is still working is a positive update for me, but not a massive one, because it was priced in that it would work for quite a while. Am I missing something that makes "RLHF seems to work" a rational thing to update on?
I mean, there have been developments in how RLHF/RLAIF/Constitutional AI works, but nothing super fundamental, afaik? So surely your beliefs should be basically the same as they were 3 years ago, plus the observation "RLHF still appears to work at this capability level," which is only a pretty minor update in my mind. I'd be glad if someone could tell me whether I'm missing something.
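
To make the "priced in" point concrete, here is a rough sketch of the Bayes arithmetic I have in mind. Every number below is made up purely for illustration (the prior, and both likelihoods), and the function and variable names are my own; nothing here is an estimate of the real probabilities. The only point is that if you already expected RLHF to "seem to work" at this capability level whether or not alignment is easy, then seeing it work barely moves you.

```python
def posterior(prior, p_obs_given_h, p_obs_given_not_h):
    """Bayes' rule for a binary hypothesis H after seeing one observation."""
    joint_h = prior * p_obs_given_h
    joint_not_h = (1 - prior) * p_obs_given_not_h
    return joint_h / (joint_h + joint_not_h)


# H = "alignment is likely by default using RLHF".
# Observation = "RLHF still appears to work at this capability level".
# All three numbers are hypothetical, chosen only to illustrate the shape
# of the argument.
prior_h = 0.20
p_works_if_h = 0.95

# Case 1: the observation was priced in, i.e. you expected RLHF to seem
# to work for a long time even if alignment is NOT easy by default.
priced_in = posterior(prior_h, p_works_if_h, p_obs_given_not_h=0.85)

# Case 2: you expected RLHF to visibly break by now if alignment is hard.
surprised = posterior(prior_h, p_works_if_h, p_obs_given_not_h=0.30)

print(f"priced in: {prior_h:.2f} -> {priced_in:.2f}")  # 0.20 -> 0.22
print(f"surprised: {prior_h:.2f} -> {surprised:.2f}")  # 0.20 -> 0.44
```

With these made-up inputs the likelihood ratio in the first case is close to 1, so the posterior stays near the prior; a big update only makes sense if you previously thought RLHF would have visibly failed by now in the worlds where alignment is hard.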