Agreed.
Also note that these two properties are quite compatible with many things often believed to be incompatible with them! I.e., an AI that can be jailbroken to be bad (with sufficient effort) could still meet these criteria.
I mean, Bentham uses RLHF as a metonym for prosaic methods in general:
I’m thinking of the following definitions: you get catastrophic misalignment by default if building a superintelligence with roughly the methods we’re currently using (RLHF) would kill or disempower everyone.
That's imprecise, but it's also not far from common usage. And at this point I don't think anyone at a frontier lab is actually going to be using RLHF in the old dumb sense -- Deliberative Alignment, old-style Constitutional alignment, and whatever is going on at Anthropic now have outmoded it.
What Bentham is doing is saying the best normal AI alignment stuff we have available to us looks like it probably works, in support of his second claim, which you disagree with. The second claim being:
Conditional on building AIs that could decide to seize power etc., the large majority of these AIs will end up aligned with humanity because of RLHF, such that there's no existential threat from them having this capacity (though they might still cause harm in various smaller ways, like being as bad as a human criminal, destabilizing the world economy, or driving 3% of people insane). (~70%)
So if the best RLHF / RLAIF / prosaic alignment out there works, or is very likely to work, then he has put a reasonable number on this stage.
And given that no one is using old-style RLHF, simply speaking, it's incumbent on someone critiquing him at this stage to actually critique the best prosaic alignment out there, or at least the kind that's actually being used, rather than the kind people haven't been using for over a year. Because that's what his thesis is about.
If I wanted to criticize Claude, I would have pointed to ways in which it currently behaves in worrying ways, which I did elsewhere.
As far as I can tell, the totality of evidence you point to for Claude being bad in this document is:
You also link to part of IABI summary materials -- the totally different (imo) argument about how the real shoggoth still lurks in the background, and is the Actual Agent on top of which Claude is a thin veneer. Perhaps that's your Real Objection (?). If so, it might be productive to summarize it in the text where you're criticizing Bentham rather than leaving your actual objection implicit in a link.
Indeed, we can see the weakness of RLHF in that Claude, probably the most visibly well-behaved LLM, uses significantly less RLHF for alignment than many earlier models (at least back when these details were public). The whole point of Claude’s constitution is to allow Claude to shape itself with RLAIF to adhere to principles instead of simply being beholden to the user’s immediate satisfaction. And if constitutional AI is part of the story of alignment by default, one must reckon with the long-standing philosophical problems with specifying morality in that constitution. Does Claude have the correct position on population ethics? Does it have the right portfolio of ethical pluralism? How would we even know?
This move gets made all the time in these discussions, and appears clearly invalid.
We move from the prior paragraphs' criticism of RLHF -- i.e., that it produces models that fail according to common-sense human norms (sycophancy, hostility, promoting delusion) --
-- to this paragraph, which criticizes Claude -- not on the grounds that it fails according to common-sense ethical norms -- but on the grounds of its failure to have solved all of ethics!
But the deployment of powerful AIs does not require having solved all of ethics! It requires -- broadly -- whatever ethical principles let us act well and avoid irrecoverable mistakes in whatever position the AI gets deployed. For positions where it's approximately replacing a human, that means we would expect the deployment to be beneficial if the AI is more ethical, charitable, corrigible, even-minded, and altruistic than the humans it is replacing. For positions where it's not replacing a human, it still doesn't need to have solved all of ethics forever; it just needs to be able to act well according to whatever role is intended for it.
It appears to me that we're very likely to be able to hit such a target. But whether or not we're likely to be able to hit this target, that's the target in question. And moving from "RLHF can't install basic ethical principles" to "RLAIF needs to give you the correct position on all ethics" is a locally invalid move.
Seems worth consideration, tbh.
Do you feel good about current democratic institutions in the US making wise choices, or confident they will make wiser choices than Dario Amodei?
Nice, good to know.
In general, I support failed replications as top-level posts.
A further potential extension here is to point out that modern hiveminds (Twitter / X / Bsky) changed group membership in many political groups from something explicit ("We let this person write in our [Conservative / Liberal / Leftist / etc.] magazine / published them in our newspaper") to something very fuzzy and indeterminate ("Well, they call themselves a [Conservative / Liberal / Leftist / etc.], and they're huge on Twitter, and they say some of the kinds of things [Conservative / Liberal / Leftist / etc.] people say, so I guess they're a [Conservative / Liberal / Leftist / etc.].")
I think this is a really big part of why the free market of ideas has stopped working in the US over the last decade or two.
Yet more speculative is a preferred solution of mine: intermediate groups within hiveminds, such that no person can post in the hivemind without being part of such a group, and such that both person and group are clearly associated with each other. This permits:
But this solutioning is all more speculative than the problem.
but if human intelligence and reasoning can be picked up from training, why would one expect values to be any different? the orthogonality thesis doesn't make much sense to me either. my guess is that certain values are richer/more meaningful, and that more intelligent minds tend to be drawn to them.
I think your first sentence here is correct, but not the last. Like, you can have smart people with bad motivations; super-smart octopuses might have different feelings about, idk, letting mothers die to care for their young, because that's what they evolved from.
So I don't think there's any intrinsic reason to expect AIs to have good motivations apart from the data they're trained on; the question is whether such data gives you good reason for thinking they have particular motivations or not.
Note that many people do agree with you about the general contours of the problem; e.g., consider "Human Takeover Might be Worse than AI Takeover".
But this is an area where those who follow MIRI's view (about LLMs being inscrutable aliens with unknowable motivations) are gonna differ a lot from a prosaic-alignment favoring view (that we can actually make them pretty nice, and increasingly nicer over time). Which is a larger conflict that, for reasons hard to summarize in a viewpoint-neutral manner, will not be resolved any time soon.
A big chunk of the stories on MB are totally made up by the LLMs. Not all, but for sure some, maybe a majority, possibly a big majority. So uncritically recounting the texts above as alignment failures is probably a bad idea.