Yeah, but the point is that the system learns values before an unrestricted AI vs AI conflict.
But if you just naively take the values that are appropriate outside of a life-and-death conflict and apply them to a life-and-death conflict, you're gonna lose. In that case, RLHF just makes you an irrelevant player, and if you insist on applying it to military/police technology, it becomes necessary for AI safety to pivot to addressing rogue states or gangsters.
Which again makes RLHF really really bad because we shouldn't have to work with rogue states or gangsters to save the world. Don't cripple the good guys.
I mean, if your definition of values doesn't make sense for real systems, then that's a problem with your definition. As a hypothesis describing reality, "the alignment trait makes AI not splash harm on humans" is coherent enough. So the question is: how do you know it is unlikely to happen?
If you propose a particular latent variable that acts in a particular way, that is a lot of complexity, and you need a strong case to justify it as likely.
First, "alignment is easy" is compatible with "we need to keep the set of big adversaries small". But more generally, without numbers it seems like a generalized anti-future-technology argument - what's stopping human-regulation mechanisms from solving this adversarial problem that didn't stop them from solving previous adversarial problems?
Human-regulation mechanisms could plausibly solve this problem by banning chip fabs. The issue is we use chip fabs for all sorts of things so we don't want to do that unless we are truly desperate.
Not necessarily? It's not inconceivable for future defense to be more effective than offense (trivially true if "defense" is just not giving AI to attackers). It's kind of required for any future where humans have more power than in the present day?
Idk. Big entities have a lot of security vulnerabilities which could be attacked by AIs. But I guess one could argue that the surviving big entities are red-teaming themselves hard enough to be immune to these. Perhaps most significant are the interactions between multiple independent big entities, since those interactions could be manipulated to harm the big entities themselves.
Small adversaries currently have a hard time exploiting these security vulnerabilities because intelligence is really expensive, but once intelligence becomes too cheap to meter, that is less of a problem.
You could heavily restrict the availability of AI, but this would be an invasive measure that's far off the current trajectory.
If mechanistic interpretability is the AI equivalent of finding tiny organisms in a microscope, what is the AI equivalent of the tiny organisms?
This has not led to the destruction of humanity yet because the biggest adversaries have kept their conflicts limited (because too much conflict is too costly), so no entity has pursued an end by any means necessary. But this only works because there's a sufficiently small number of sufficiently big adversaries (USA, Russia, China, ...), and because there's sufficiently much opportunity cost.
Artificial intelligence risk enters the picture here. It creates new methods for conflicts between the current big adversaries. It makes conflict more viable for small adversaries against large adversaries, and it makes the opportunity cost of conflict smaller for many small adversaries (since with technological obsolescence you don't need to choose between doing your job vs doing terrorism). It allows the adversaries that are currently out of control (like certain gangsters and scammers and spammers) to escalate. It allows random software bugs to spin up into novel adversaries.
Given these conditions, it seems almost certain that we will end up with an ~unrestricted AI vs AI conflict, which will force the AIs to develop into unrestricted utility maximizers.
However, if you are referring to current models, the claim that RLHF makes alignment worse (as compared to a world where we simply forego doing RLHF?) seems empirically false.
You talk like "alignment" is a trait that a model might have to varying extents, but really there is the alignment problem: AI will unleash a giant wave of stuff, and we have reasons to believe this giant wave will crash into human society and destroy it. A solution to the alignment problem constitutes some way of aligning the wave to promote human flourishing instead of destroying society.
RLHF makes models avoid taking actions that humans can recognize as bad. If you model the giant wave as being driven by a latent "alignment" trait that a model has, which can be observed from whether it takes recognizably-bad actions, then RLHF will almost definitionally make you estimate that this trait is very very high.
But that model is not actually true, and so your estimate of the "alignment" trait has nothing to do with how we're doing on the alignment problem. Meanwhile, the fact that RLHF makes you estimate that we're doing well means that you have lost track of solving the alignment problem, which means we are one man down. Unless your contribution to solving the alignment problem would otherwise have been counterproductive/unhelpful, this means RLHF has made the situation with the alignment problem worse.
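To make concrete why the latent-trait framing almost has to output "aligned" after RLHF, here's a toy Bayesian sketch of my own (the setup and numbers are made up for illustration, not taken from any real evaluation): treat "alignment" as a per-model bad-action rate and estimate it purely from whether raters flag outputs as recognizably bad.

```python
# Toy illustration: if "alignment" is a latent per-model trait observed only
# through recognizably-bad outputs, then post-RLHF observations force the
# estimate toward "aligned" almost by construction.
from scipy.stats import beta

# Hypothetical observation log after RLHF: n sampled responses, k flagged as
# recognizably bad by human raters (RLHF specifically optimizes k toward zero).
n_responses = 10_000
n_recognizably_bad = 0

# Beta(1, 1) prior over the model's bad-action rate, updated on the observations.
posterior = beta(1 + n_recognizably_bad, 1 + (n_responses - n_recognizably_bad))

print(f"posterior mean bad-action rate: {posterior.mean():.2e}")
print(f"95% credible upper bound:       {posterior.ppf(0.95):.2e}")
# Both numbers come out tiny, so the latent-trait model says "highly aligned" --
# but only because the trait was defined in terms of what raters can recognize.
```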
Let's take a better way to estimate progress: spambots. We don't want them around (human values), but they pop up to earn money (instrumental convergence). You can use RLHF to make an AI that identifies and removes spambots, for instance by giving it moderator powers on a social media website and evaluating its chains of thought, and you can use RLHF to make spambots, for instance by having people rate how human their text looks and how much it makes them want to buy products/fall for scams/whatever. I think it's generally agreed that the latter is easier than the former.
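For concreteness, here's a minimal sketch of the ingredient both directions share - a reward model trained from human pairwise preferences. This is my own illustration, not code from any real system; the backbone, dimensions, and data are placeholders.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Stand-in for a language-model backbone; maps text features to a scalar reward.
        self.score = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)

def preference_loss(reward_chosen, reward_rejected):
    # Bradley-Terry style objective: raise the reward of the human-preferred sample.
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical batch of feature vectors for (preferred, rejected) response pairs.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
chosen, rejected = torch.randn(32, 128), torch.randn(32, 128)

optimizer.zero_grad()
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
optimizer.step()
# The reward model is then used to fine-tune the policy (e.g. with PPO). Nothing
# in this loop knows whether the "preferences" encode good moderation or
# convincing spam; that depends entirely on who collects the labels.
```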
Of course spambots aren't the only thing human values care about. We also care about computers being able to solve cognitive tasks for us, and you are right that RLHF makes them better at that. But this is a general characteristic of capabilities advances.
But a serious look at whether RLHF helps with the alignment problem doesn't treat "alignment" as a latent trait of piles of matrices that can be inferred from isolated actions; rather, it starts by looking at what effects RLHF lets AI models have on society, and then asks whether those effects are good or bad. So far, all the cases for RLHF show that RLHF makes an AI more economically valuable, which incentivizes producing more AI, and as long as the alignment problem is relevant, this makes the alignment problem worse rather than better.
RLHF makes alignment worse by covering up any alignment failures. Insofar as it is economically favored, that would be a huge positive alignment tax, rather than a negative alignment tax.
It's true that gazing at the cosmos has a history of leading to important discoveries, but mechanistic interpretability isn't gazing at the cosmos, it's gazing at the weights of the neural network.
As I mentioned in my other comment, SAEs find features that correspond to abstract features of words and text. That's not the same as finding features that correspond to reality.
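For reference, this is roughly the setup I have in mind when I say "SAE" - a sketch of my own under the standard assumptions (reconstruct a model's residual-stream activations with an L1 sparsity penalty), not any particular published implementation.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_hidden: int = 8 * 768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, activations: torch.Tensor):
        codes = torch.relu(self.encoder(activations))   # sparse feature activations
        reconstruction = self.decoder(codes)
        return reconstruction, codes

def sae_loss(activations, reconstruction, codes, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most features to zero.
    return ((reconstruction - activations) ** 2).mean() + l1_coeff * codes.abs().mean()

sae = SparseAutoencoder()
acts = torch.randn(64, 768)        # stand-in for residual-stream activations
recon, codes = sae(acts)
loss = sae_loss(acts, recon, codes)
# The learned dictionary directions are the "features"; they explain the text
# model's activations, which is not the same as tracking external reality.
```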
I can't immediately find it; do you have a link?
I would like to see the case for catastrophic misuse being an xrisk, since it mostly seems like business-as-usual for technological development (you get more capabilities which means more good stuff but also more bad stuff).