All of Alan E Dunne's Comments + Replies

So far as the slave carries out immediate work from fear of consequences they are locally aligned with the master's will.

Steven Byrnes (score 6, 2 months ago)
If your definition of “aligned” includes “this AI will delight in murdering me as soon as it can do so without getting caught and punished, but currently it can’t do that, so instead it is being helpful” … then I don’t think you are defining the term “aligned” in a reasonable way. More specifically, if you use the word “aligned” for an AI that wants to murder me as soon as it can get away with it (but it can’t), then that doesn’t leave us with good terminology to discuss how to make an AI that doesn’t want to murder me. Why not just say “this AI is currently emitting outputs that I like” instead of “this AI is locally aligned”? Are we losing anything that way?

How did you get respondents? Why are they "nationally representative"?

Odd anon (score 2, 2 months ago)
The methodology says "We used iSay/Ipsos, Dynata, Disqo, and other leading panels to recruit the nationally representative sample". (They also say elsewhere that "Responses were census-balanced based on the American Community Survey 2021 estimates for age, gender, region, race/ethnicity, education, and income using the “raking” algorithm of the R “survey” package".)
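Raking (iterative proportional fitting) adjusts respondent weights one variable at a time until each weighted margin matches its population target. Below is a minimal Python sketch of that idea, assuming a hypothetical six-person sample with two variables and invented target shares. It is not a reimplementation of the R `survey` package, and `rake` here is an illustrative helper, not that package's function.

```python
# Minimal raking (iterative proportional fitting) sketch, stdlib only.
# The respondents and target shares below are hypothetical.

def rake(rows, targets, max_iter=1000, tol=1e-10):
    """Return one weight per row so weighted category shares match `targets`.

    rows:    list of dicts mapping variable name -> category
    targets: dict mapping variable name -> {category: target share}; shares
             for each variable are assumed to sum to 1, and every category
             is assumed to appear at least once in the sample.
    """
    weights = [1.0] * len(rows)
    for _ in range(max_iter):
        largest_adjustment = 0.0
        for var, shares in targets.items():
            total = sum(weights)
            # Current weighted total for each category of this variable.
            current = {cat: 0.0 for cat in shares}
            for w, row in zip(weights, rows):
                current[row[var]] += w
            # Rescale so this variable's margins hit their targets exactly;
            # because the shares sum to 1, the overall total is unchanged.
            factors = {cat: shares[cat] * total / current[cat] for cat in shares}
            for i, row in enumerate(rows):
                weights[i] *= factors[row[var]]
            largest_adjustment = max(
                largest_adjustment, max(abs(f - 1.0) for f in factors.values())
            )
        # Stop once a full pass over all variables barely changes anything.
        if largest_adjustment < tol:
            break
    return weights


sample = [
    {"age": "18-44", "gender": "F"},
    {"age": "18-44", "gender": "M"},
    {"age": "18-44", "gender": "M"},
    {"age": "45+", "gender": "F"},
    {"age": "45+", "gender": "F"},
    {"age": "45+", "gender": "M"},
]
targets = {
    "age": {"18-44": 0.40, "45+": 0.60},
    "gender": {"F": 0.55, "M": 0.45},
}
w = rake(sample, targets)
print([round(x, 3) for x in w])
```

Each pass matches one variable's margins exactly and can slightly disturb the others, which is why the procedure iterates; with every age-by-gender cell present in the sample, the adjustments shrink toward 1.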

1/ Evidence for these statements?

2/ In what sense is it profitable to throw away food or maintain empty dwellings, as distinct from "maintaining everyone else's quality of life"?

3/ If the evil is that some people's needs are not valued enough, could that not be remedied by giving them money and so making it profitable to meet their needs?

Is martingale different from conservation of expected evidence?

Kevin Dorst (score 1, 2 months ago)
Nope, it's the same thing!  Had meant to link to that post but forgot to when cross-posting quickly.  Thanks for pointing that out—will add a link.
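The identity behind both names, sketched for a Bayesian agent with prior P over a hypothesis H and finitely many possible observations e (the notation is illustrative): the expected posterior equals the prior. Applied at every step of a sequence of observations E_1, E_2, ..., it says the agent's credences form a martingale.

```latex
\mathbb{E}\big[P(H \mid E)\big]
  = \sum_{e} P(E = e)\, P(H \mid E = e)
  = \sum_{e} P(H, E = e)
  = P(H)

% With P_t(H) = P(H \mid E_1, \dots, E_t), the same step conditioned on the past gives
\mathbb{E}\big[P_{t+1}(H) \mid E_1, \dots, E_t\big] = P_t(H)
```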

With Respect

Given that in more than a third of the cases where GPT and the answer set disagreed, you thought GPT was right and the answer set was wrong, did you check for cases where GPT and the answer set agreed on an answer you thought was wrong?

Yours Sincerely

No we didn't. That certainly seems like a reasonable thing to do though. Thank you for the good suggestion!

Astral Codex Ten:

Cool, thanks!
Howie Lempel (score 1, 5 months ago)
Whoops - thanks!

"Finally, we can test our entire pipeline by deliberately training misaligned models, and confirming that our techniques detect the worst kinds of misalignments (adversarial testing)."


You could also study the distribution of correlation strengths found over the range of correlations tested, possibly seeing how it compares to what would be expected by chance.
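One way to get "what would be expected by chance" is a permutation null: shuffle each variable independently, which keeps each variable's own distribution but destroys any association, then recompute all pairwise correlations. The Python sketch below assumes hypothetical data (five variables, 50 observations, only the first two genuinely related); the helper names and thresholds are illustrative, not taken from the post.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def pairwise_abs_correlations(columns):
    """|r| for every pair of variables (columns), in (0,1), (0,2), ... order."""
    return [
        abs(pearson(columns[i], columns[j]))
        for i in range(len(columns))
        for j in range(i + 1, len(columns))
    ]


def permutation_null(columns, n_shuffles, rng):
    """|r| values expected by chance: shuffle each column independently,
    which keeps each variable's distribution but breaks any association."""
    null = []
    for _ in range(n_shuffles):
        shuffled = [rng.sample(col, len(col)) for col in columns]
        null.extend(pairwise_abs_correlations(shuffled))
    return null


def share_above(values, threshold):
    """Fraction of values strictly greater than `threshold`."""
    return sum(v > threshold for v in values) / len(values)


# Hypothetical data: 5 variables x 50 observations, only variables 0 and 1 related.
rng = random.Random(0)
n = 50
base = [rng.gauss(0, 1) for _ in range(n)]
columns = [base, [b + rng.gauss(0, 0.5) for b in base]]
columns += [[rng.gauss(0, 1) for _ in range(n)] for _ in range(3)]

observed = pairwise_abs_correlations(columns)
null = permutation_null(columns, n_shuffles=200, rng=rng)
for t in (0.2, 0.3, 0.5):
    print(f"|r| > {t}: observed {share_above(observed, t):.2f}, "
          f"chance {share_above(null, t):.3f}")
```

An excess of strong correlations in the observed list relative to the shuffled list is the signal; comparing whole distributions (for example as a histogram) would fit the "heatplot" suggestion as well.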

Alan E Dunne (score 1, 5 months ago)
In a "heatplot" or plots cf

Skeptical reaction, with one expression of support:

As mentioned above, David addresses this in the book. There was an unfortunately fraudulent paper published due to (IIRC) the actions of a grad student, but the professors involved retracted the original paper and later research reaffirmed the approach did work.
Seth Herd (score 2, 6 months ago)
I read this. This is about the first study, which was retracted. However, a second, carefully monitored and reviewed study found most of the same results, including the remarkably high effect size of one in ten people appearing to completely drop their prejudice toward homosexuals after the ten-minute intervention. Yes, beware the one study. But in the absence of data, small amounts are worth a good deal, and careful reasoning from other evidence is worth even more. My reasoning from indirect data and personal experience is in line with this one study. The 900 studies on how minds don't change are almost all about impersonal, data-and-argument-based approaches. Emotions affect how we form and change beliefs. You can't force someone to change their mind, but people can and do change their minds when they happen to think through an issue without being emotionally motivated to keep their current belief.

In 26 models taken from volumes 21 to 25 of the journal Law and Human Behavior, the highest R-squared (the proportion of VARIANCE, not variation, explained) was 40%, and the second highest was 24%.
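To see why the variance/variation distinction matters, read "variation" as spread measured in standard deviations (an interpretation, not stated in the comment). Since R-squared is the share of variance explained, the residual variance is (1 - R²) times the original, so the residual standard deviation shrinks only by a factor of sqrt(1 - R²). A small Python check using the two figures quoted above; `sd_reduction` is a hypothetical helper name.

```python
import math


def sd_reduction(r_squared):
    """Fractional reduction in residual standard deviation for a given R-squared.

    Residual variance = (1 - R^2) * total variance, so
    residual SD = sqrt(1 - R^2) * total SD.
    """
    return 1.0 - math.sqrt(1.0 - r_squared)


for r2 in (0.40, 0.24):
    print(f"R^2 = {r2:.2f}: |r| = {math.sqrt(r2):.2f}, "
          f"residual SD shrinks by {sd_reduction(r2):.1%}")
# R^2 = 0.40: |r| = 0.63, residual SD shrinks by 22.5%
# R^2 = 0.24: |r| = 0.49, residual SD shrinks by 12.8%
```

So even the best of the 26 models narrows the typical prediction error by under a quarter.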

Well, how many people do you know who have switched their vote from one party to another? I don't discuss voting choices much within my social circle, but I am quite sure that at least 90% of my close relatives are voters of this kind (they don't all vote for the same party, but at an individual level their vote never changes).