Perhaps I am misunderstanding Figure 8? I was assuming that they asked the model for an answer, then asked it what probability it assigns to that answer being correct. Under this assumption, it looks like the pre-trained model outputs the correct probability, but the RLHF model gives exaggerated probabilities because it thinks that will trick you into giving it higher reward.
In some sense this is expected. The RLHF model isn't optimized for helpfulness; it is optimized for perceived helpfulness. It is still disturbing that "alignment" has made the model objectively worse at giving correct information.
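To make concrete what "outputs the correct probability" would mean under my reading of the figure, here is a minimal sketch of a calibration check. The data is made up for illustration and is not taken from the report. The function names `calibration_table` and `expected_calibration_error` are my own, and the binning scheme is an assumption, not necessarily what the authors used.

```python
from collections import defaultdict

def calibration_table(confidences, correct, n_bins=5):
    """Group (stated confidence, was-correct) pairs into equal-width bins.

    Returns one (mean confidence, accuracy, count) tuple per non-empty bin.
    """
    bins = defaultdict(list)
    for p, ok in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, ok))
    table = []
    for idx in sorted(bins):
        items = bins[idx]
        mean_conf = sum(p for p, _ in items) / len(items)
        accuracy = sum(ok for _, ok in items) / len(items)
        table.append((mean_conf, accuracy, len(items)))
    return table

def expected_calibration_error(table):
    """Count-weighted average gap between stated confidence and accuracy."""
    total = sum(n for _, _, n in table)
    return sum(n * abs(conf - acc) for conf, acc, n in table) / total

# Hypothetical toy data: the same eight answers scored by two imaginary models.
outcomes = [True, True, False, True, False, False, True, False]
calibrated = [0.75, 0.75, 0.75, 0.75, 0.25, 0.25, 0.25, 0.25]
overconfident = [0.95] * 8

ece_calibrated = expected_calibration_error(calibration_table(calibrated, outcomes))
ece_overconfident = expected_calibration_error(calibration_table(overconfident, outcomes))
```

In this toy setup the first model's stated confidence matches its accuracy in every bin, while the second claims 95% on answers that are right only half the time. That gap is what I take the RLHF curve in the figure to be showing.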
There are reasonable and coherent forms of moral skepticism in which the statement, "It is morally wrong to eat children and mentally disabled people," is false or at least meaningless. The disgust reaction upon hearing the idea of eating children is better explained by the statement, "I don't want to live in a society where children are eaten," which is much better grounded in physical reality.
What is disturbing about the example is that this seems to be a person who believes that objective morality exists, but that it wouldn't entail that eating children is wrong. This is indeed a red flag that something in the argument has gone seriously wrong.
While many of these claims are "old news" to those communities, many of these claims are fresh.
Can you clarify which specific claims are new? A claim which hasn’t been previously reported in a mainstream news article might still be known to people who have been following community meta-drama.
The baseline rate reasoning is flawed because a) sexual assault remains the most underreported crime, so there is likely instead an iceberg effect,
I’m not sure how this refutes the base rate argument. The iceberg effect exists for both the rationalist community and for every other community you might compare it to (including the ones used to compute the base rates). These should cancel out unless you have reason to believe the iceberg effect is larger for the rationalist community than for others. (For all you know, the iceberg effect might be lower than baseline due to norms about speaking clearly and stating one’s mind.)
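The cancellation can be spelled out with a small arithmetic sketch. All numbers here are hypothetical placeholders, not estimates for any real community. The model (observed rate = true rate times the fraction of incidents that get reported) is a simplifying assumption of mine.

```python
from fractions import Fraction

def observed_rate(true_rate, reporting_fraction):
    """Rate visible in reports, assuming only a fixed fraction is ever reported."""
    return true_rate * reporting_fraction

def observed_ratio(true_a, true_b, report_a, report_b):
    """Ratio of observed rates between group A and comparison group B."""
    return observed_rate(true_a, report_a) / observed_rate(true_b, report_b)

# Hypothetical true rates, chosen only to illustrate the algebra.
true_community = Fraction(3, 100)
true_baseline = Fraction(2, 100)

# Same underreporting in both groups: the iceberg factor cancels.
same_reporting = observed_ratio(
    true_community, true_baseline, Fraction(1, 5), Fraction(1, 5)
)

# Different underreporting: the observed comparison is distorted.
different_reporting = observed_ratio(
    true_community, true_baseline, Fraction(1, 10), Fraction(1, 5)
)
```

With equal reporting fractions, the observed ratio equals the true ratio (3/2 here), so the comparison to the base rate survives. The ratio only changes when the reporting fractions differ (3/4 in the second case), which is why the objection needs an argument that the iceberg is larger in one group than in the other.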
b) women who were harassed/assaulted have left the movement which changes your distribution,
Maybe? This seems more likely to confound the data than a) or c), but again there are reasons to suppose the effect might lean the other way. (Women might be more willing to tolerate bad behavior if they think it’s important to work on alignment than they would be at, say, their local Magic: The Gathering group.)
c) women who would enter your movement otherwise now stay away due to whisper networks and bad vibes.
Even if true, I don’t see how that would be relevant here? Women who enter the movement, get harassed, and then leave would make the harassment rate seem lower because their incidents don’t get counted. Women who never entered the movement in the first place wouldn’t affect the rate at all.
It is appropriate to minimize things which are in fact minimal. The majority of these issues have been litigated (metaphorically) before. The fact that they are being brought up over and over again in media articles does not ipso facto mean that the incidents have not been adequately dealt with. You can make the argument that these incidents are part of a larger culture problem, but you have to actually make the argument. We're all Bayesians here, so look at the base rates.
The one piece of new information which seems potentially important is the part where Sonia Joseph says, "he followed her home and insisted on staying over." I would like to see that incident looked into a bit more.
It does imply that, but it's likely true that Eliezer's time is more valuable (or at least in more demand) than OP's. I also don't think Eliezer (or anyone else) should have to spend all that much effort worrying about whether what they're about to say might come off as impolite or uncordial.
I don't agree here. Commenting publicly opens the floor up for anyone to summarize the post or to submit what they think is the strongest point. I think it's actually less pressure on Quintin this way.