Thanks for doing this work; this is a really important paper in my view.
One question that sprang to mind when reading this: to what extent do the disempowerment primitives and amplifying factors correlate with each other? That is, are conversations that contain one primitive likely to contain others? And likewise for the amplifying factors?

The impression I took away is that these elements were being treated as independent. That strikes me as a reasonable modelling assumption, but the rate estimates seem quite sensitive to it (toy sketch of the sensitivity below). Would be happy to be proven wrong.
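To make the worry concrete, here's a toy simulation (all numbers invented, nothing from the paper): two rare primitives driven by a shared latent factor can co-occur far more often than the product of their marginal rates would predict.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Invented setup: 1% of conversations share a latent "high-risk" driver
# (e.g. a vulnerable user seeking advice), and each primitive fires far
# more often in those conversations than in the rest.
latent = rng.random(n) < 0.01
p1 = np.where(latent, 0.20, 0.001)
p2 = np.where(latent, 0.20, 0.001)
prim1 = rng.random(n) < p1
prim2 = rng.random(n) < p2

observed = (prim1 & prim2).mean()
independent = prim1.mean() * prim2.mean()
print(f"marginal rates:            {prim1.mean():.4f}, {prim2.mean():.4f}")
print(f"co-occurrence, observed:   {observed:.6f}")
print(f"co-occurrence, if indep.:  {independent:.6f}")  # roughly 40-50x lower here
```

Under these (made-up) parameters the independence assumption understates co-occurrence by more than an order of magnitude, which is the kind of gap that would move any severity estimate built on counting co-occurring primitives.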
(Being lazy and just responding to the abstract - these questions may well be addressed by the paper itself.)
That strikes me as a very low rate - low enough that my instinct is that a false positive rate might exceed it on its own; see the back-of-the-envelope below. (At least, if I were reading an in-actuality benign conversation, my chance of misreading it as deeply manipulative would probably be greater than 1/1,000, especially if one party was looking to the other for advice!) Of course, the point at which disempowerment counts as "severe" and the human user as "fundamentally" compromised looks pretty fuzzy, and I'd expect many borderline cases of moderate disempowerment/compromise for each severe/fundamental one, however defined, so I'm not sure how much the rate conveys on its own. (How many cases are there of chatbots giving genuinely good advice that subtly erodes independent decision-making habits, and how would we score whether these count as "helpful" on net? Plausibly these might even be the majority of conversations.)
(That being said, I'd also expect my error rate in giving non-manipulative advice to count as pretty good if, out of 10,000 cases of people seeking advice, I only accidentally talked <10 out of their own ability to reason about it - so good on Claude if much of the implicit framing above is accurate.)
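To spell out the false-positive worry (again with invented numbers, not the paper's): even a misreading rate equal to the base rate means roughly half of all flagged conversations would be false positives.

```python
# Back-of-the-envelope on labelling precision (all rates assumed):
base_rate = 1 / 1_000   # assumed true rate of severe disempowerment
fpr = 1 / 1_000         # assumed chance a benign conversation is misread
tpr = 0.9               # assume the rater catches most true cases

true_flags = base_rate * tpr
false_flags = (1 - base_rate) * fpr
precision = true_flags / (true_flags + false_flags)
print(f"share of flagged conversations that are real: {precision:.0%}")  # ~47%
```

The point is just that at a 1-in-1,000 base rate, the headline number is dominated by whatever the rater's false positive behaviour happens to be.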
I don't follow the reasoning behind excluding coding help. Does the paper elaborate on why disempowerment in that area is considered benign?
We’re publishing a new paper that presents the first large-scale analysis of potentially disempowering patterns in real-world conversations with AI.
For more details, see the blog post or the full paper.