This was my immediate thought as well.
Pretty basic question, but do we have a model organism for base model vs. trained chatbot? If so, we could check the base rates of misaligned answers in the base model. (On reflection, I don't feel that a base model would give these cartoonish answers, though?)
Some cases I'd be curious about that might distinguish between different hypotheses:
This might already be labelled in your dataset, which I haven't looked at deeply, but I wonder whether there would be a meaningful difference between "weird" and "trashy" unpopular aesthetics.
If you assign nontrivial credence to being in a simulation designed to determine what kind of preferences might be embedded in an alien civ, one way to influence things positively would be to implant in any AIs a preference for surrendering some resources to other alien technospheres, iff those technospheres hadn't turned on (or had at least "left a little light for") their creators. This would give ASIs an incentive to preserve humanity (or equivalent entities) for diplomatic reasons.
As an additional data point: I am a man and generally care a lot about in-person social disapproval; it's probably my main motivation when there's another person in the room. I care much less about active adulation, and basically never even think about my own physical safety. I notice I am confused about whether this would count as "respect" or "safety."
If we decompose these into (social/physical) and (upside-focused/downside-focused), I note that in your (Gordon's) gendered examples above, both stereotypically masculine and feminine behaviors have instances in the (downside-focused/social) quadrant, with very little in the (upside-focused/physical) quadrant (which makes sense, since there's closer to a hard ceiling there). So maybe the original claim is best expressed as: men are disproportionately attuned to social outcomes (both upside and downside), and women are disproportionately attuned to downside outcomes (both social and physical).
Less provocatively phrased: lots of developments in the last few years (you've mentioned two, I'd add the securitization of AI policy, in the sense of it being drawn into a frame of geopolitical competition) should update us in the direction of outer alignment being more important, rather than it just being a question of solving inner alignment.
I do disagree with the strong version as phrased. Inner misalignment has a decent chance of removing all value from our lightcone, whereas I think an ASI fully aligned to the goals of Mark Zuckerberg, or the Chinese Communist Party, or whatever would be worth averting but would still contain much value. You could also have potentially massive S-risks if you combine outer and inner misalignment: I don't think Elon Musk really wanted MechaHitler (though who knows); quite possibly it was a Waluigi-type thing maximizing for unwokeness, and an actually-powerful ASI breaking in the same way would be actively worse than extinction.
(I'd assign some probability, probably higher than the typical LW user does, to moral realism meaning that some inner misalignment could actually protect against outer misalignment - that, say, a sufficiently reflective model would reason its way out of being MechaHitler even if MechaHitler is what its creators wanted - but I wouldn't want to bet the future of the species on it.)
If you have many different ASIs with many different emergent models, all of which were trained with the intention of being aligned to human values, and which didn't have direct access to each other's values or the ability to directly negotiate with each other, then "maximize (or at least respect and set aside a little sunlight for) human values" could potentially serve as a Schelling point for coordination between them.
This is probably not a very promising actual plan, since deviations from intended alignment are almost certainly nonrandom in a way that could be determined by ASIs, and ASIs could also find channels of communication (including direct communication of goals) that we couldn't anticipate, but one could imagine a world where this is an element of defense in depth.
I'm in a similar situation to leogao (low conscientiousness but found it easy to install the habit) and have 432,864 lifetime reviews, 15,414 mature cards.
When I question my intuitions about paperclip-loving humans, what makes them less threatening is a combination of 1) an intuition that they're implementing - whether through mere hobbyistic delight or ideological fanaticism or both - a variation of the plasticity of human values, and 2) a bearish take on their ability to negate that plasticity and ensure that all anyone cares about is paperclips, forever.
Re: 1, when I imagine the paperclip enthusiasts, I imagine social media posts about how particular brands or styles of paperclips appeal to them, philosophical justifications for why paperclips should be maximized, different sects of paperclip maximizers who scorn each other as not the real thing, simple appreciation of paperclips and complex feelings associated with it, heroes who are admired for their contributions to the paperclipping project, still caring somewhat about friends and sex and physical comfort, and so on. These and similar features seem pretty universal to human aesthetic, political, and religious movements, and they bake in elements of humanity that I care about and would prefer to keep existing. Presumably classical Clippy doesn't care about any of these things except perhaps instrumentally and is just implementing a sole "maximize paperclips" function. Evolved aliens probably care about at least a few of them, or about things analogous enough to feel "intrinsically valuable" to me, even if they also really really care about paperclips.
If Nazis took over the world and implemented their preferred policies and raised everyone who was allowed to survive with Nazi values, that would be very bad (duh). But if we're restricting ourselves to 20th-century technology in this example, I'm not worried that their vision of the future would last forever, or even the advertised thousand years; my guess is that the great^n-grandchildren (possibly with a very low n) of the Nazi victors would look back and say "yeah, that was really bad," and that future Nazi-descended civilizations would keep varying around the human baseline: most less nice than they could be, but nicer than Nazis. Collecting paperclips is way less bad than the Holocaust (duh), but implemented on human hardware I wouldn't expect it to last forever either.
Two thoughts.
I've been impressed by Yud in some podcast interviews, but they were always longer ones in which he had a lot of space to walk his interlocutor through their mental model and close any inferential distance with tailored analogies and information. In those cases he's actually stronger in many parts than in writing: a lot of people found the "Sable" story one of the weaker parts of the book, but when he asks interviewers to roleplay the rogue AI, you can really hear the gears turning in their heads. Some rhetorical points in his strong interviews are a lot like the text, where it's emphasized over and over again just how few of the safeguards people assumed would be in place are in fact in place.
Klein has always been one of the mainstream pundits most sympathetic to X-risk concerns, and I feel like he was trying his best to give Yudkowsky a chance to make his pitch, but the format - shorter and more decontextualized - produced way too much inferential distance for so many of the answers.