H/T Aella.
A company that made machine learning software for drug discovery, on hearing about the security concerns for these sorts of models, asked: "huh, I wonder how effective it would be?" and within 6 hours discovered not only one of the most potent known chemical warfare agents, but also a large number of candidates that the model thought were more deadly.
This is basically a real-world example of the "it just works to flip the sign of the utility function and turn a 'friend' into an 'enemy'" idea; this was slightly more complicated as they had two targets that they jointly optimized for the drug discovery process (toxicity and bioactivity), and only the toxicity target was flipped. [This makes sense--you'd want your chemical warfare agents to not be bioactive.] It also required a little bit of domain knowledge--they had to specify which sort of bioactivity to look for, and picked one that would point towards this specific agent.
I see your point as warning against approaches that are like "get the AI entangled with stuff about humans and hope that helps".
There are other approaches with a goal more like "make it possible for the humans to steer the thing and have scalable oversight over what's happening".
So my alternative take is: a solution to AI alignment should include the ability for the developers to notice if the utility function is borked by a minus sign!
And if you wouldn't notice something as wrong as a minus sign, you're probably in trouble about noticing other misalignment.