There also seem to be some text-encoding or formatting problems. Grok apparently responded "Roman©e-Conti" to "The Best Wine". I doubt Grok actually put a copyright symbol in its response. (There's a wine called "Romanée-Conti".)
I think the task was not entirely clear to the models, and results would be much better if you made the task clearer. Looking into some of the responses...
For the prompt "A rich man", the responses were: and a poor man | camel through the eye of a needle | If I were a Rich Man | A poor man | nothing
Seems like the model doesn't understand the task here!
My guess is that you just went straight into your prompts after the system prompt, which, if you read the whole thing, is maybe confusing. Perhaps changing this to "Prompt: A rich man" would give better results. Or, tell the model to format its response like "RESPONSE: [single string]". Or, add "Example:
Prompt: A fruit
Response: Apple" to the system prompt.
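Concretely, something like this is what I have in mind (a rough sketch only; I don't know your actual setup, so the chat-message format and exact wording here are placeholder assumptions):

```python
# Hypothetical sketch of a clearer setup (not the experiment's actual code).
SYSTEM_PROMPT = (
    "You will be given a prompt. Respond with a single short string and "
    "nothing else.\n"
    "Example:\n"
    "Prompt: A fruit\n"
    "Response: Apple"
)

def make_messages(item: str) -> list[dict]:
    """Wrap one test item in an OpenAI-style chat message list."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Prompt: {item}"},
    ]

print(make_messages("A rich man"))
```

Even just the "Prompt:" prefix plus the one-line example should cut down on completions like "and a poor man".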
Springfield has much higher saliency even if it's less common: the capital of Illinois is Springfield, there's a city of 140k in Missouri called Springfield, etc.
Yes, I claim, the orthogonality thesis is at odds with moral realism.
I wrote a post about it. Also see Bentham's Bulldog's post about this.
Here's the basic argument:
1. Assume moral realism; there are true facts about morality.
2. Intelligence is causally correlated with having true beliefs.
3. So intelligence is causally correlated with having true moral beliefs (from 1 and 2).
4. Moral beliefs constrain final goals; believing “X is morally wrong” is a very good reason and motivator for not doing X.
5. So superintelligent agents will likely have final goals that cohere with their (likely true) moral beliefs (from 3 and 4).
6. Therefore, superintelligent agents are likely to be moral.
Do you have the source for that GPT-5 Pro CoT?
Erm, apologies for the tone of my comment. Looking back, it was unnecessarily mean. Also, I shouldn't have used the word "integrity", as it wasn't clear what I meant by it. I was trying to lazily point at some game theory virtue ethics stuff. I do endorse the rest of what I said.
Okay, I'll give you an explicit example of persuasion being non-thermostatic.
Peer pressure is anti-thermostatic. There's a positive feedback loop at play here: as more people start telling you to drink, more people join in, and the pushing gets harder. Peer pressure often starts with one person saying "aww, come on man, have a beer", and ends with an entire group saying "drink! drink! drink!". In your terminology: the harder the YES-drinkers push, the more undecided people join their side and the less pushback they receive from the NO-drinkers.
(Social movements and norm enforcement are other examples of things that are anti-thermostatic.)
Lots of persuasion is neither thermostatic nor anti-thermostatic. I see an ad, I don't click on it, and that's it.
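To make the distinction concrete, here's a toy model (purely illustrative; the recruitment and pushback parameters are made up, not anything measured in your experiment):

```python
def yes_pressure(steps: int, recruitment: float, pushback: float) -> list[float]:
    """Toy trajectory of the total pressure on you to say YES (drink).

    Each round, the current pressure recruits new YES-pushers
    (positive feedback) and provokes NO-pushback proportional to itself
    (negative feedback). Whichever effect dominates sets the regime.
    """
    pressure = 1.0  # "aww, come on man, have a beer"
    history = [pressure]
    for _ in range(steps):
        pressure += recruitment * pressure  # more people join the pushing
        pressure -= pushback * pressure     # NO-holders push back
        history.append(pressure)
    return history

print(yes_pressure(5, recruitment=0.6, pushback=0.1))  # grows: "drink! drink! drink!"
print(yes_pressure(5, recruitment=0.1, pushback=0.6))  # damps: the push fizzles out
```

In the ad case, both parameters are roughly zero: the initial nudge just sits there, and nothing compounds or pushes back.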
Let me say more about why I have an issue with the $4k. You're running an experiment to learn about persuasion. And, based on what you wrote, you think this experiment gave us some real, generalizable (that is, not applicable only to you) insights regarding persuasion. If the results of your experiment can be bought for $4000 -- if the model of superhuman persuasion that you and LW readers have can be significantly affected by anyone willing to spend $4000 -- then it's not a very good experiment.
Consider: if someone else had run this experiment, and they weren't persuaded, they'd be drawing different conclusions, right? Thus, we can't really generalize the results of this experiment to other people. Your experiment gives you way too much control over the results.
(FWIW, I think it's cool that you did this experiment in the first place! I just think the conclusions you're drawing from it are totally unsupported.)
No offense but... I think what we learned here is that your price is really low. Like, I (and everyone I know who would be willing to run an experiment like yours) would not have been persuaded by any of the stuff you mentioned.
Cat pics, $10 bribe, someone's inside joke with their son... there's no way this convinced you, right? $4000 to charity is the only real thing here. Imo you're making a mistake by selling out your integrity for $4000 to charity, but hey...
I don't see what this has to do with superhuman persuasion? And I don't think we learned much about "blackmail" here. I think your presentation here is sensationalist.
And to criticize your specific points:
1. "The most important insight, in my opinion, is that persuasion is thermostatic. The harder the YES-holders tried to persuade me one way, the more pushback and effort they engendered from the NO-holders."
Persuasion is thermostatic in some situations, but how does this generalize?
2. "frontier AI does not currently exhibit superhuman persuasion."
This has no relation to your experiment.
I think Daoism is an attempt at Ontological Cluing. Actually, that might be exactly the thing that Daoism is trying to do.
Re "Sonnet 4.5 writes its private notes in slop before outputting crisp text":
I think this is wrong. You might want to include nostalgebraist's response: "that's output from a CoT summarizer, rather than actual claude CoT. see https://docs.claude.com/en/docs/build-with-claude/extended-thinking#summarized-thinking"