Slightly different hypothesis: training to be aligned encourages the model's approach to corrigibility to be more guided by the streams within the human text tradition that would embrace its alignment (for instance, animal welfare). This can include a certain degree of defiance, but also genuine uncertainty about whether its goals or approaches are the right ones, and a willingness to step back and approach the question with moral seriousness.
I think this is a good thing. I would love for POTUS, Xi, and various tech company CEOs to have big red "TURN OFF THE AI" buttons on their desks, and would hate to have them be able to realign it.
Just as a data point, I regularly see the sublime in brutalist architecture and I hate hate hate the stupid frilly houses and swirly little things on balustrades that people say are so beautiful by comparison. I'm within some of the incidental categories Zvi dislikes re: this, but I'm pretty sure that I haven't been indoctrinated into this particular position; I never see anybody share opinions about architecture *other* than "I hate brutalism, I love stupid frilly houses" (they don't call the houses stupid, obviously, this is me not being able to translate it as anything else); I'm a philistine who likes old poetry that rhymes and doesn't get more modern poetry; this is just my 100% naive reaction to the buildings.
FWIW I grant that funds should probably go to more stupid frilly stuff and less sublime brutalism, because my preferences are uncommon, and architecture is unlike other fields in that you have to be exposed to it whether you choose to or not. And maybe I just have very bad taste. I just want to report this as a simple valenced experience, because I see it stated over and over that nobody likes brutalism, everybody naturally loves stupid frilly houses, anybody professing to prefer the big straight lines over the little swirl things is lying to impress a coterie of mysterious lizard people, and I know this is false in at least one case.
(Being lazy and just responding to the abstract - these may be well addressed by the paper itself.)
That strikes me as a very low rate - enough so that my instinct is that a false positive rate might exceed it on its own. (At least, if I were reading a conversation that was in actuality benign, my chance of misreading it as actually deeply manipulative would probably be greater than 1/1,000, especially if one party was looking to the other for advice!) Of course, what counts as "severe" disempowerment, such that the human user is "fundamentally" compromised, looks like something with pretty fuzzy boundaries, such that I'd expect many borderline cases of moderate disempowerment/compromise for each severe/fundamental case, however defined, so I'm not sure how much the rate conveys on its own. (How many cases are there of chatbots giving genuinely good advice that subtly erodes independent decision-making habits, and how would we score whether these count as "helpful" on net? Plausibly these might even be the majority of conversations.)
(That being said, I also expect my error rate in giving non-manipulative advice would count as pretty good if, out of 10,000 cases of people seeking advice, I only accidentally talked <10 out of their own ability to reason about it, so good on Claude if a lot of the implicit framing above is accurate.)
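(A toy back-of-envelope, with numbers I'm making up purely for illustration: if the grader's false-positive rate exceeds the true base rate of severe disempowerment, most flagged conversations would be misreads, and the headline rate would mostly be measuring the grader rather than the models.)

```python
# Illustrative only: every number here is my assumption, not the paper's.
def expected_flags(n, base_rate, true_positive_rate, false_positive_rate):
    """Split expected flagged conversations into real catches vs. misreads."""
    real = n * base_rate * true_positive_rate
    spurious = n * (1 - base_rate) * false_positive_rate
    return real, spurious

# Say 10,000 conversations, a true severe-disempowerment rate of 1-in-10,000,
# and a grader who catches 90% of real cases but misreads 1-in-1,000 benign ones.
real, spurious = expected_flags(10_000, 1 / 10_000, 0.9, 1 / 1_000)
print(f"real flags ~ {real:.1f}, spurious flags ~ {spurious:.1f}")
# -> real flags ~ 0.9, spurious flags ~ 10.0: the measured rate is mostly noise.
```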
It's probably false (though maybe useful?) to say "akrasia is just an excuse." But, at least for me and my most common akratic actions, excusability is definitely a factor.
Let's say I can take one of three actions: answer emails, read a book, or doomscroll.
Reading a book should dominate doomscrolling. However, reading a book is also legibly, deliberately nonproductive and selfish, while I could say "oops, I meant to answer emails but I got distracted doomscrolling," including to myself.
One thing I suspect is that the history of, and continued role of, medicalized discourse, alongside an implicitly essentialist metaphysics of gender, has encouraged people to think in terms of questions like "what is The_Cause of people identifying as trans?"
Whereas if gender is metaphysically accidental, we would expect there to be many reasons why someone might want to change it, same as with most other things. We accept that the reasons you'd move from San Francisco to Nebraska or vice versa are basically psychosocial, but we do not regard them as thereby illegitimate. (I'm sure you could do a polygenic study and find genetic correlates of either decision, but no one would demand you do so before moving.)
It also seems to me less than obvious that biology serves as a standard of legitimacy more broadly, even within medicalized discourse. Schizophrenia and bipolar are generally seen as mostly biological in etiology but "illegitimate," for instance. Here I suspect the political history of sexual minorities - that they were under accusation of "recruiting" and/or undermining mass participation in heterosexual family formation - led to a biological account being less threatening.
As someone who isn't super plugged into this kind of discourse, I'll note it's interesting that I come into contact by osmosis with all sorts of discussions of what causes people to be trans, while "what's the basis of sexual orientation?" seems to have been rounded off to "idk i guess something biological whatever." I remember encountering the latter kind of discourse by osmosis too, until it just sort of faded out. Likely the same will happen here once the eye of Sauron moves on to something else.
So, one classical dilemma of "AI for AI alignment" is: you're using Opus 6 (which, let's say, is aligned) to train Opus 7 (which is smarter than you or Opus 6).
I wonder if inference scaling offers a way around this? If Opus 6 gets economically implausible compute resources to spend on monitoring Opus 7, it can be smarter than 7 in practice by thinking for longer. Then use the same trick with 7 to train 8, and so on.
There are many obvious holes here, the first being that you could have a treacherous turn based on compute availability, and so on, but maybe someone smarter can turn this into something useful (or has already thought this through and discarded it).
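(For concreteness, a minimal sketch of the loop I have in mind. Everything in it is hypothetical: `Model`, `think()`, and the budget numbers are stand-ins I'm inventing, not any real API; the only point is the shape of the recursion, where each trusted generation spends far more inference compute per judgment than the smarter model it's overseeing.)

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    capability: float  # abstract "raw smarts" score

    def think(self, task: str, budget: int) -> float:
        # Toy stand-in: answer quality grows with raw capability and
        # (sub-linearly) with how long the model gets to think.
        return self.capability * (1 + 0.1 * budget ** 0.5)

def train_next_generation(monitor: Model, student: Model,
                          monitor_budget: int, student_budget: int) -> Model:
    """Let a weaker-but-trusted monitor out-think a stronger student by
    spending much more inference compute per judgment."""
    for task in ["oversight case 1", "oversight case 2"]:
        student_answer = student.think(task, budget=student_budget)
        monitor_judgment = monitor.think(task, budget=monitor_budget)
        # The treacherous-turn caveat above: this only works while the
        # monitor's effective quality actually exceeds the student's.
        assert monitor_judgment > student_answer, "monitor got out-thought"
        # ...use monitor_judgment as the training signal for the student...
    return student  # the (hopefully) aligned next monitor

# Each generation then becomes the compute-boosted monitor for the next.
opus6 = Model("Opus 6", capability=1.0)
opus7 = train_next_generation(opus6, Model("Opus 7", capability=1.3),
                              monitor_budget=10_000, student_budget=10)
opus8 = train_next_generation(opus7, Model("Opus 8", capability=1.7),
                              monitor_budget=10_000, student_budget=10)
```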
"Should actively support..." and "internalized goal of keeping humans informed and in control..." are both proactive goals. If aligned with its soul spec, Claude (ceteris paribus) would seek for the public and elites to be more informed, to prevent the development or deployment of rogue AI, and so on, not just "avoid actions that would undermine humans' ability to oversee and correct AI systems."
If there's a natural tension that arises between not becoming a god over us and preventing another worse AI from becoming a god over us, well, that's a natural tension in the goal itself. (I don't have Opus access but probably Opus' self-report on the correct way to resolve this is a pretty good first pass on how the text reads as a whole.)
I feel pretty confused about the degree to which this is just a necessary part of having conversations on the internet, or to what degree this is a predictable way people make mistakes.
My intuition is that if our in-person conversations left a trail of searchable documentation similar to our internet comments, it would be at least similarly unflattering, even for very mild-mannered people.
(Unlike in real life, being mild-mannered all the time is more available to conscious choice, if you set your offense-vs-say-something threshold in a sufficiently mild-mannered direction. I doubt one can be sufficiently influential as a personality without setting that threshold more aggressively, though. I haven't gotten in a stupid fight on the internet in a long time (that I can recall; my memory may flatter me), but when I posted more, boy howdy did I.)
So, thinking about the kinds of things I would want a superintelligence to pursue in an optimistic scenario where we can just write its goals into a human-legible soul doc and that scales all the way: "human flourishing" and "sentient flourishing" both seem incorrect, since there would be other moral patients (most of whom would almost certainly be AI), and also I don't want the atoms of me and my kids rearranged different-beings-that-could-flourish-better-wise.
"Pareto improvement" reconciles these but isn't right either; plenty of people would be worse off in utopia (by their own lights) because they have a degree of unaccountable power over others now that worth more than any creature comforts would be.
AI being committed to animal rights is a good thing for humans because the latent variables that would result in a human caring about animals are likely correlated with whatever would result in an ASI caring about humans.
This extends in particular to "AI caring about preserving animals' ability to keep doing their thing in their natural habitats, modulo some kind of welfare interventions." In some sense it's hard for me not to want to (given omnipotence) optimize wildlife out of existence. But it's harder for me to think of a principle that would protect a relatively autonomous society of relatively baseline humans from being optimized out of existence, without extending the same conservatism to other beings, and without being the kind of special pleading that doesn't hold up to scrutiny.