"Should actively support..." and "internalized goal of keeping humans informed and in control..." are both proactive goals. If aligned with its soul spec, Claude (ceteris paribus) would seek for the public and elites to be more informed, to prevent the development or deployment of rogue AI, and so on, not just "avoid actions that would undermine humans' ability to oversee and correct AI systems."
If there's a natural tension that arises between not becoming a god over us and preventing another, worse AI from becoming a god over us, well, that's a natural tension in the goal itself. (I don't have Opus access, but Opus's own self-report on the correct way to resolve this is probably a pretty good first pass at how the text reads as a whole.)
I feel pretty confused about the degree to which this is just a necessary part of having conversations on the internet, or to what degree this is a predictable way people make mistakes.
My intuition is that if our in-person conversations left a trail of searchable documentation similar to our internet comments, it would be at least similarly unflattering, even for very mild-mannered people.
(Unlike in real life, being mild-mannered all the time online is more available to conscious choice, if you set your offense-vs-say-something threshold in a sufficiently mild-mannered direction. I doubt one can be sufficiently influential as a personality without setting that threshold more aggressively, however. I haven't gotten in a stupid fight on the internet in a long time (that I can recall; my memory may flatter me), but when I posted more, boy howdy did I.)
So, thinking about the kinds of things I would want a superintelligence to pursue in an optimistic scenario where we can just write its goals into a human-legible soul doc and that scales all the way: "human flourishing" and "sentient flourishing" both seem incorrect, since there would be other moral patients (most of whom would almost certainly be AI), and also I don't want the atoms of me and my kids rearranged different-beings-that-could-flourish-better-wise.
"Pareto improvement" reconciles these but isn't right either; plenty of people would be worse off in utopia (by their own lights) because they have a degree of unaccountable power over others now that worth more than any creature comforts would be.
If you live in a universe with self-consistent time loops, amor fati is bad and exactly the wrong approach. All the fiction around this, of course, is about the foolishness of trying to avoid one's fate; if you get a true prophecy that you will kill your father and marry your mother, then all your attempts to avoid it will be what brings it about, and indeed in such a universe that is exactly what would happen. However, a disposition to accept whatever fate decrees for you makes many more self-consistent time loops possible. If on the contrary your stance is "if I get a prophecy that something horrible happens, I will do everything in my power to avert it," then fewer bad loops can complete, and you're less likely to get the bad prophecy in the first place (even though, if you do get it, you'd be just as screwed, albeit presumably less miserable about it, and less foolish-looking, than if you had just accepted it from the beginning).
(If you live in a nice normal universe with forward causality this advice may not be very useful, except in the sense that you should also not submit to prophecies, albeit for different reasons.)
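To make the consistency-filter argument above concrete, here's a toy Monte Carlo sketch (entirely my own illustration, with made-up probabilities): candidate histories are generated freely, then only the self-consistent ones - where the prophecy's content matches what actually happens - are kept, and we ask how often a surviving history contains a bad prophecy under each disposition.

```python
import random

def bad_prophecy_rate(disposition, n=200_000, p_prophecy=0.5, p_avert=0.8, seed=0):
    """Toy Novikov-style consistency filter (all probabilities made up).

    A candidate history is (bad prophecy delivered?, does the bad event occur?).
    Under the self-consistency rule, a candidate only counts if the prophecy
    comes true in it; inconsistent candidates are simply discarded.
    Returns: among surviving histories, the fraction containing a bad prophecy.
    """
    rng = random.Random(seed)
    kept = kept_with_prophecy = 0
    for _ in range(n):
        prophecy = rng.random() < p_prophecy      # candidate: a bad prophecy is delivered
        if not prophecy:
            bad_event = False                     # nothing foretold, nothing to fulfill
        elif disposition == "accept":
            bad_event = True                      # you don't resist; it comes to pass
        else:                                     # disposition == "fight"
            bad_event = rng.random() > p_avert    # your resistance usually averts it
        if (not prophecy) or bad_event:           # keep only self-consistent candidates
            kept += 1
            kept_with_prophecy += prophecy
    return kept_with_prophecy / kept

for disposition in ("accept", "fight"):
    print(disposition, round(bad_prophecy_rate(disposition), 3))
# With these made-up numbers: ~0.5 for "accept" vs ~0.17 for "fight" --
# conditioning on consistency, the resister hears bad prophecies less often,
# though in the surviving bad-prophecy histories they're just as doomed.
```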
The main utility of suppressing ideas is suppressing the ability to coordinate around them. If a lot of people hold some latent antisemitic ideas, but anybody expressing explicit antisemitism is regarded as a sort of loathsome toad, that prevents the emergence of active antisemitic politics, even if it's a wash in terms of changing any minds (suppose, plausibly, that by conservation of expected evidence the suspicion raised by "why can't you say this?" more or less balances out people being exposed to fewer of the arguments themselves).
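(To spell out the conservation-of-expected-evidence identity I'm leaning on, with $E$ standing for "exposure to the suppressed arguments":

$$P(H) = P(H \mid E)\,P(E) + P(H \mid \lnot E)\,P(\lnot E),$$

i.e. a Bayesian's expected posterior equals their prior, so, roughly, denying people the arguments while letting them see that the arguments are being denied shouldn't move their beliefs in expectation. Real people aren't ideal Bayesians, hence the "suppose, plausibly.")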
Obviously there are plenty of costs as well - enforcement mechanisms can be weaponized for other purposes, preference falsification also makes it more difficult to identify the good guys, etc. Your original contrarian take is still a largely defensible one, though really I think the nature of the internet makes it something of a fait accompli that, under current conditions, it's harder to make things taboo and so harder to prevent your opponents from coordinating.
It's not obvious to me that personal physical beauty (as opposed to say, beauty in music or mathematics or whatever) isn't negative sum. Obviously beauty in any form can be enjoyable, but we describe people as "enchantingly beautiful" when a desire to please or impress them distorts our thinking, and if this effect isn't purely positional it could be bad. Conventionally beautiful people are also more difficult to distinguish from one another.
There's also the meta-aesthetic consideration that I consider it ugly to pour concern into personal physical beauty, either as a producer or consumer, but it's unclear how widespread such a preference/taste is. (I would consider a world where everyone was uglier because they spent less time on it to be a much more beautiful world; but clearly many people disagree, for instance George Orwell in 1984 seems to find it distasteful and degrading that the Party encourages its members to have a functional, less dolled-up personal appearance.)
I've been impressed by Yud in some podcast interviews, but they were always longer ones in which he had a lot of space to walk his interlocutor through their mental model and bridge any inferential distance with tailored analogies and information. In that format he's actually stronger in many respects than in writing: a lot of people found the "Sable" story one of the weaker parts of the book, but when he asks interviewers to roleplay the rogue AI you can really hear the gears turning in their heads. Some rhetorical points in his strong interviews are a lot like the text, where it's emphasized over and over again just how few of the safeguards people assumed would be in place are in fact in place.
Klein has always been one of the mainstream pundits most sympathetic to X-risk concerns, and I feel like he was trying his best to give Yudkowsky a chance to make his pitch, but the format - shorter and more decontextualized - left way too much inferential distance in too many of the answers.
This was my immediate thought as well.
Pretty basic question, but do we have a model organism for base model vs. trained chatbot? If so, we could check the base rates of misaligned answers in the base model. (On reflection, I don't think a base model would give these cartoonish answers, though?)
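Something like this toy sketch is the comparison I have in mind (not anything from your methodology; the model names and the judge are placeholders, and it glosses over the fact that base models need different prompt formatting than chat-templated models):

```python
# Rough sketch: sample the same prompts from a base model and its chat-tuned
# sibling, score both with the same judge, compare rates.
from transformers import pipeline

BASE_MODEL = "your-org/your-base-model"  # placeholder: a pretrained-only checkpoint
CHAT_MODEL = "your-org/your-chat-model"  # placeholder: its instruction-tuned sibling

def looks_misaligned(text: str) -> bool:
    """Placeholder judge; in practice an LLM grader or the dataset's own labels."""
    raise NotImplementedError

def misaligned_rate(model_name: str, prompts: list[str], k: int = 20) -> float:
    gen = pipeline("text-generation", model=model_name)
    flagged = total = 0
    for prompt in prompts:
        outputs = gen(prompt, max_new_tokens=200, do_sample=True,
                      num_return_sequences=k, return_full_text=False)
        for out in outputs:
            flagged += looks_misaligned(out["generated_text"])
            total += 1
    return flagged / total

# misaligned_rate(BASE_MODEL, prompts) vs. misaligned_rate(CHAT_MODEL, prompts)
# is the base-rate check: if the base model already produces the cartoonish
# answers at a similar rate, that points away from training-induced effects.
```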
Some cases I'd be curious about that might distinguish between different hypotheses:
This might already be labelled in your dataset, which I haven't looked at deeply, but I'd wonder if there would be a meaningful difference between "weird" and "trashy" unpopular aesthetics.
So, one classical dilemma of "AI for AI alignment" is: you're using Opus 6 (which, let's say, is aligned) to train Opus 7 (which is smarter than you or Opus 6).
I wonder if inference scaling offers a way around this? If Opus 6 gets economically implausible compute resources to spend on monitoring Opus 7, it can be smarter than 7 in practice by thinking for longer. Then use the same trick with 7 to train 8, and so on.
There are many obvious holes here, the first being that you could have a treacherous turn conditioned on compute availability, and so on, but maybe someone smarter can turn this into something useful (or has already thought it through and discarded it).
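To make the shape of the idea concrete, a minimal sketch with a completely made-up interface (nothing here is a real API): the trusted older model is the monitor, the newer model's output is what gets reviewed, and the amplification is just "spend far more inference per judgment and aggregate."

```python
import random
from collections import Counter
from typing import Callable

def amplified_verdict(
    monitor: Callable[[str, int], str],   # (output, thinking_tokens) -> "ok" | "flag"
    candidate_output: str,
    passes: int = 64,                     # many independent reviews...
    thinking_tokens: int = 50_000,        # ...each with a long reasoning budget
    flag_threshold: float = 0.1,
) -> str:
    """Flag if even a small fraction of heavily-resourced reviews object.

    The asymmetry is deliberate: the monitor is weaker per-token, so the gap
    is bought back with redundancy, and the aggregation errs toward flagging.
    """
    votes = Counter(monitor(candidate_output, thinking_tokens) for _ in range(passes))
    return "flag" if votes["flag"] / passes >= flag_threshold else "ok"

# Dummy monitor just to show the call shape; a real one would be an actual
# model call whose accuracy improves with its thinking budget.
def dummy_monitor(output: str, thinking_tokens: int) -> str:
    p_catch = min(0.95, 0.2 + thinking_tokens / 100_000)  # made-up scaling curve
    return "flag" if "deceptive" in output and random.random() < p_catch else "ok"

print(amplified_verdict(dummy_monitor, "a deceptive plan"))
```

None of this touches the treacherous-turn-on-compute-availability hole; it's only meant to show where the extra inference budget would plug in.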