But I think this misses one of the most important reasons we have rules: that we can debate and decide on them, and once we do so, we all follow the rules even if we do not agree with them.
But this clearly doesn't scale to solving alignment, which would have to stay robust in a future scenario where humans are no longer in control, because in that case we can't keep tinkering with the rules.
As a side note, the non-reasoning version of ChatGPT still says that letting millions of people die is preferable to calling someone the n-word. The fact that I have never seen an example of Claude failing like that on an ethics question seems to be at least weak evidence that Anthropic's principle-based approach to alignment is more robust OOD.
Why would a bad codebase give AIs an advantage over humans?
ChatGPT is generally pretty weird. If you ask, the non-reasoning model still insists that calling someone the n-word is worse than letting millions of people die. Which is insane. It supports EY's claim that RLHF creates something that superficially looks aligned but turns out to be alien when tested in an OOD context.
I don't think being a punctual person is much of a feat of epistemic rationality, unlike performance in prediction markets. I think it is more a matter of personality, akin to conscientiousness.
The question with intent alignment is: intent aligned with whom? If the AI executive is intent aligned with (follows orders from) the government, and the human government is voluntarily replaced with an AI government, we are left with an AI that is intent aligned with another AI.
There are countless physiological as well as psychological properties that form statistical family-resemblance clusters for male and female. I already mentioned the things-vs-people orientation and interest in sex for males, but there are many others. Those clusters of properties are very unlikely to be significantly instantiated in the opposite sex, even if a few individual properties often deviate from the cluster. A "perfect" sex transition from male to female obviously wouldn't be perfect if the resulting individual still had a male-typical bone structure, muscle structure, face shape, etc., and the same holds for psychological properties.
If the "perfect" transition doesn't include psychology, the result would still have the psychology of the original sex. That's not a perfect transition.
without needing to appeal to some kind of essential "terminalism" that some goals have and others don't.
That appeal doesn't seem overly problematic though, as some goals are clearly terminal. For example, eating chocolate (or rather: eating something that tastes like it). Or not dying. Those goals are given to us by evolution. Chocolate is a case where we actually have an instrumental reason not to eat it (too much sugar for modern environments), which pulls against the terminal goal. Which shows the two kinds of goals are clearly distinct. Are there perhaps other edge cases where the instrumental/terminal distinction is harder to apply?
(Indeed, the main reason you'd need that concept is to describe someone who has modified their goals towards having a sharper instrumental/terminal distinction—i.e. it's a self-fulfilling prophecy.)
I argue the main reason is different: First, we need to distinguish instrumental from terminal goals because instrumental goals are affected by beliefs. When those beliefs change, the instrumental goals change. For example, I may want to eat spinach because I believe it's healthy. So that's an instrumental goal. If my belief changed, I might abandon that goal. But if I liked spinach for its own sake (terminally), I wouldn't need such a supporting belief. As in the case of chocolate.
Second, beliefs can be true or false, or epistemically justified or unjustified, which means an instrumental goal based on a belief that is mistaken in this way is itself mistaken. That doesn't happen for terminal goals. (Terminal goals can still be mutually incoherent if they violate certain axioms of utility theory, e.g. transitivity, but that only means the set of goals is mistaken, not necessarily any individual goal in that set.)
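To make that last point concrete, here is a minimal sketch of the standard transitivity example (my illustration, not from the original comment): three preferences over outcomes A, B, C that are each unobjectionable on their own, but which no utility function $u$ can jointly represent.

$$A \succ B, \qquad B \succ C, \qquad C \succ A \;\;\Longrightarrow\;\; u(A) > u(B) > u(C) > u(A)$$

The chain ends in $u(A) > u(A)$, a contradiction. Each pairwise preference is fine in isolation; only the set of three is incoherent, which matches the point that incoherence indicts the set rather than any individual goal.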
if you can really truly perfectly change your sex
That might not even be possible hypothetically. "Perfectly" changing sex would also change sex-related psychological properties. Interests would change strongly, especially along the things-vs-people axis, where biological males lean much more toward things than biological females do. So e.g. an MtF person who is strongly interested in LessWrong, math, programming, video games and sex would, when "changed" into a biological woman, lose most of those male-typical interests. In which case it may no longer be possible to consider this the same person, which would mean we didn't hypothetically change the sex of a person, but rather removed one person and created a different one.
The "show thinking" section doesn't display the CoT itself, just a summary, presumably by a different model, written in first person.