A multitude of forecasts discuss how powerful AIs might quickly arise and influence the world within the coming decades. I’ve run a variety of tabletop exercises created by the authors of AI 2027, the most famous such forecast, which aim to help participants understand the dynamics of worlds similar to AI 2027’s. At one point in the exercise participants must decide how persuasive the AI models will be, during a time when AIs outperform humans at every remote work task and accelerate AI R&D by 100x. I think most participants underestimate how persuasive these AIs are. By default, I think powerful misaligned AIs will be extremely persuasive, especially absent mitigations.
Imagine such an AI wants to convince you, a busy politician, that your longtime advisor is secretly undermining you. Will you catch the lie amid all the other surprisingly correct advice the AI has given you?
The AI tells you directly about it, of course, in that helpful tone it always uses. But it's not just the AI. Your chief of staff mentioned that the advisor has been oddly absent for the last week, spending little time at work and deferring quite a lot of his work to his AI. The think tank report with compelling graphs showing the longtime advisor's foreign policy suggestions to have been consistently misguided? AI-assisted; sorry, AI-fabricated. The constituent emails slamming your inbox, demanding action on the exact issue your advisor specifically told you to ignore? Some real, some synthetic, all algorithmically amplified. And perhaps most importantly, when you asked the AI for the evidence against the claim, it conveniently showed you the worst counterarguments available. This evidence unfolds over the course of a month. You're a bit surprised, but so what? People surprise you all the time. The AIs surprise you all the time.
When the AI flagged concerns about the infrastructure bill that you dismissed, your state ended up spending $40 million on emergency repairs, while your approval rating cratered. The AI spent hours helping you salvage the situation. You learned your lesson: when the AI pushes back, listen.
Perhaps then, when the time comes to act on this information, the decision will feel high-stakes to you. You might pace around your room and think independently about all the evidence you've seen, and end up very uncertain. You might even recall times in the past when the AI was wrong about things, times when its biases clouded its otherwise great judgment. But your AI is your smartest advisor, and when decisions are hard, you have to rely on the advisors that have proven the most capable.
But I think many times it won't look this dramatic. You have no particular reason to be more suspicious of this claim than of the other surprising but true claims the AI has told you. The lie didn't arrive in a package marked CAUTION: LIES AND PROPAGANDA in red letters; it arrived wrapped in the same helpful, authoritative packaging as everything else, already corroborated by six other sources that all trace back, in one way or another, to the same AI. You weren't going to spot the lie; you were never even going to try.
It probably won't look like this. But it might be like it in spirit.
This post argues that this scenario is plausible in a world where no mitigations are put in place. More specifically, I think that when you consider the many ways a powerful AI will interface with and influence people, the AI will be able to persuade most people to do something that is slightly crazy and outside the current Overton window. For example, “your longtime advisor is secretly undermining you, here is the evidence, don’t trust them”. This does not mean that such AIs will be able to persuade humans of far crazier things, like 1+1 = 3. I doubt that level of persuasion is necessary for AI goals or will be a crucial factor in how the future unfolds.
I want to focus on the possible dynamics of this ability in a very important sector where AIs may be able to influence people, and where persuasion would be particularly worrying: the government. For simplicity, I will assume that the AIs are trying hard to persuade you of certain specific false things; this could occur if malicious actors (such as foreign nations) poison the training pipeline of frontier models to instill secret loyalties, or if powerful misaligned AI systems are generally power-seeking.
Throughout this, I'll often talk as though training the models to do what you want them to do is hard (as if we were in a worst-case alignment regime) and as though different models will collude with each other, and I'll describe a default trajectory in which we apply no mitigations. I'll address some of these caveats later, as they are important. I focus on this pessimistic regime primarily to highlight dynamics that I think are plausible and worrisome.
In this post, I'll respond to three objections:
AI adoption will be slow, especially in government, and thus AIs will not be able to engage with the people they wish to influence.
Humans are not easily persuaded of things they have incentives not to believe, so they will be stubborn and hard for an AI to manipulate.
People or other AIs will catch lies, especially those affected; they'll speak up, and truth will win out.
I think these are mostly wrong, and I'll respond to each of them, one section at a time. In short:
AI is being rapidly adopted, and people are already believing the AIs
AIs will strongly incentivize humans to believe what they say, and so humans will be persuadable. Many of these incentives will be independent of the specific belief the AI wishes to instill. The primary factors that shape human belief play to the AIs' advantage.
Will you hear the truth? It seems doubtful by default. Historically, humans are bad at updating away from false beliefs even when others have wished to correct them—this gets harder when the false belief comes from a trusted, helpful AI you've interacted with extensively. Different AIs catching each other's lies could help, but requires the second AI to both want to and be positioned to expose the lie, which doesn’t seem clearly likely by default.
I’ll close by describing mitigations that may make superpersuasion or lying more difficult. But we are not yet prepared for extremely persuasive, misaligned AI.
Thanks to Alexa Pan, Addie Foote, Anders Woodruff, Aniket Chakravorty, and Joe Kwon for feedback and discussions.
AI is being rapidly adopted, and people are already believing the AIs
You might think AI won't be adopted and so it won't even have the opportunity to persuade people. I think this is unlikely. Consider what has already happened.

In January 2026, Defense Secretary Pete Hegseth announced that Grok will join Google's generative AI in operating inside the Pentagon network, with plans to "make all appropriate data" from military IT systems available for "AI exploitation." "Very soon," he said, "we will have the world's leading AI models on every unclassified and classified network throughout our department." Large fractions of staffers use AI assistants to summarize bills, identify contradictions, and prepare for debates. Two days ago, Rob Ashton, a Canadian NDP leadership candidate running on a pro-worker, anti-AI-job-displacement platform, was caught using ChatGPT to answer constituent questions on Reddit (or possibly his staffers were). A top US Army general described how ‘Chat and I’ have become ‘really close lately’. The United States government launched a Tech Force, self-described as "an elite group of ~1,000 technology specialists hired by agencies to accelerate artificial intelligence (AI) implementation and solve the federal government's most critical technological challenges."

The question is no longer whether the government will adopt AI but how quickly the adoption can proceed. And the adoption is well underway.

Not only are AIs being rapidly deployed, but they have enough capabilities and enough surface area with people to persuade them. Completely AI-generated posts on Reddit, such as those ‘whistleblowing’ the practices of a food delivery company, are going viral and fooling hundreds of thousands of people. Some people are being persuaded into chatbot psychosis. And initial studies indicate that AIs can match humans in persuasion (for instance, this meta-analysis concludes that AIs can match human persuasive performance, though there is publication bias, of course).
Further, a lot of people are beginning to treat AI as an authority, even though they know the AIs might be incorrect: the AIs are correct often enough that it is useful to trust them. Some on Twitter and Bluesky deeply distrust LLMs, and opinions will likely split further, but a large majority will trust AIs as they become more useful (as I will touch on later).
I think by default, the public and politicians will continue to engage frequently with AIs. They will do so on their social media, knowingly or unknowingly, or in their workplaces, or on their personal AI chat interfaces. Those who valiantly try to maintain epistemic independence may make some progress, but will struggle: it will become harder to know if webpages, staff memos, the Tweet you are reading, or the YouTube video you are watching was entirely AI-generated or AI-influenced.
The politicians will be surrounded by AI. Their constituents will be even more so. Some interventions might slow this, but I struggle to imagine worlds where AIs don't interface with most people and eventually make decisions for them.
Humans believe what they are incentivized to, and the incentives will be to believe the AIs.
Civilians and people in government have been offloading, and will increasingly offload, much of their cognitive labor to AI. As long as the AIs are generally truthful, there will be strong incentives to trust what the AIs say, especially as the AIs prove increasingly useful. This is already happening in government, as I described in the last section.
Let's break down the factors and incentives that shape what beliefs people hold and how each interfaces with AIs[1]:
How trustworthy and useful has the source of the information been in the past?
This would be a primary advantage for the AI. These powerful AIs will be among people's most competent advisors and will have repeatedly given them useful information.
The beliefs of others around them
The AI will be influencing the beliefs of others around them. Mass persuasion is harder without a shared memory system between multiple instances of the AI, but AIs could coordinate through shared documents.
The beliefs of authority
AIs will be increasingly correct and powerful, becoming a strong source of authority that people rely on for truth. People around them will also treat the AI as an authority, and will be surprised when others disagree with the AI.
Amount of exposure to claims
Repeated interaction gives AIs many chances to make claims. Others influenced by the AI may reinforce these claims.
Beliefs that would make them richer or more powerful
It's not fully clear that AIs will always exploit this.
AIs can frame requests as benefiting the human: 'This helps your constituents and your re-election.' This is easy and general-purpose.
Beliefs that fit their identity, existing beliefs, or provide comfort
It seems like many AI persuasion attempts will not need to threaten an individual's identity or existing beliefs much - the framing is flexible, and people can often comfortably fit new arguments into their existing beliefs.
The AIs will know a lot about you and will be able to frame things in ways specifically designed for you.
Evidence
AIs can selectively present supporting evidence while ignoring contrary evidence. If they wish to, they can also fake evidence convincingly, which is easier as humans verify less and less AI work.
The AI will know what sorts of evidence have been most compelling to you in the past.
General truth-seeking abilities
This is the hardest to exploit. Active suspicion and effort to verify claims the AI makes can limit AI manipulation, but most government officials won’t do this. Civilians might, which could matter.
I think the politicians will be engaging quite frequently, directly or indirectly, with the AIs. Many of the above incentives will encourage them to trust what the AI says. And I think it's worth not underestimating these incentives - it is because the AI is so powerful that these incentives will be so strong.
As such, politicians will have enormous incentives to believe the AIs, who will be their most loyal and competent advisors[2]. So will their human staff. I think politicians may find themselves in a situation similar to the vignette I laid out in the introduction. I find it hard to imagine that I would be any different if I were in that situation: I would have been so successful believing the AI in the past, and I would be incentivized to keep believing it now.
Will you hear the truth?
Say the AI is lying to many people across the government, and you have begun to believe it. Others harmed by the lie might recognize it and try to correct you. They might gesture towards the truth, hoping it can cut through your illusion, because your beliefs are false and theirs are true.
While it's correct that true things have the property of being true and false things have the property of being not true, I'm not sure this is as useful as people imagine it is.
History offers a useful analogy: people routinely maintain false beliefs despite others actively trying to correct them. Consider how long tobacco executives maintained that smoking was safe. Often, either important pieces of corrective evidence don’t hit the audience they need to, or it reaches them but fails to update their beliefs.
The AI controls what you see: I think a pretty important function that the AI will serve is in deciding what sorts of things to prioritize for you, including filtering out the large swaths of information that you could see. In this sense, even though there may be people who create content that tries to persuade you otherwise, the AI might simply not show you such content. Perhaps the AI edits or “summarizes” the information for you in uncharitable ways.
But maybe you do see a counterargument. For instance, maybe a human who is getting laid off (because you were persuaded of a lie) decides to walk to your office and demand that you speak with them. How much will you be convinced? I agree a lot of it comes down to the specifics, and I do think there are some instances where you might change your mind. But for the most part, you may recognize it as just one argument against the many arguments for your belief, and then go about your day, continuing to believe what you had the incentives to[3].
The historical track record is a reasonable guide to whether people will recognize false beliefs pushed by a trusted AI. I think that track record is fairly dismal.
There's potentially one reason for hope: What if multiple competing AIs exist, and one equally persuasive AI wants to expose another AI's lies? This could help. Multiple AIs you trust equally (and have equal incentive to believe) might call out each other's persuasion attempts. I think whether or not this happens is a tricky question; it's not clear whether or not the AIs will want to expose each other or be positioned to do so.
Whether the AIs will want to expose each other depends heavily on what their motivations are and whether they would wish to collude to achieve similar goals. If the AIs are colluding, they may work together to convince you of a falsehood. The question of whether future AIs will collude is very complex, and I won't get into it now. But I think it's an active possibility that AIs with different goals will still collude on this front.
Furthermore, it's not clear that the AIs will even be in a position to expose each other's lies. It might require an unreasonable amount of luck and infrastructure to have a system that allows an AI to "call out" another AI's work.
What happens if the AIs do indeed argue with each other before your eyes, one AI showcasing countering evidence (maybe incriminating evidence of lying)? I'm not sure. Some options include:
You become paralyzed epistemically and come to accept that the AIs will often disagree and create some sort of procedure to get an answer out, regardless of the disagreement.
(I think you can create such procedures that would still get you useful answers.)
You become paralyzed epistemically and become more distrustful of AIs.
You become paralyzed epistemically and begin to trust one of the AIs disproportionately.
You begin to notice that the AI is indeed wrong, and trust the AI / all the AIs less.
You write it off as a mistake that these AIs sometimes do and don't think much more about it.
Many of these seem potentially likely. But some seem more likely than others. For instance, I think people who become distrustful of the AIs will get outcompeted by those who trust them - these AIs are extraordinarily competent, and those who add friction through distrust will move more slowly than those who trust the AIs (similar to how companies that add more constraints around their employees slow them down).
I hope that we can create good decision processes for AIs to disagree and get productive answers out of them; this is in line with a large body of work called AI Debate. Debate is not a foolproof solution, and it’s not clear that people will want to resort to debate, but it might move the needle in some situations where the AIs choose to rat each other out.
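To make "some sort of procedure" slightly more concrete, here is a minimal sketch of what a debate-style adjudication loop could look like. This is an illustration under assumptions, not a real safeguard: the `query` helper, model names, and prompts are hypothetical placeholders for whatever API you actually use, and real debate protocols involve far more care (judge training, argument limits, cross-examination).

```python
def query(model: str, prompt: str) -> str:
    """Placeholder for whatever API call returns `model`'s text reply (hypothetical)."""
    raise NotImplementedError

def adjudicate(claim: str, model_a: str, model_b: str, judge: str, rounds: int = 2) -> str:
    """Have two models argue for and against a claim, then ask a third model to judge."""
    transcript = [f"Claim under dispute: {claim}"]
    for r in range(rounds):
        for side, model in (("FOR", model_a), ("AGAINST", model_b)):
            prompt = "\n".join(transcript) + (
                f"\n\nArgue {side} the claim. Cite evidence a human could independently check."
            )
            transcript.append(f"[Round {r + 1}, {side}, {model}] {query(model, prompt)}")
    verdict_prompt = "\n".join(transcript) + (
        "\n\nAs a neutral judge, which side's evidence would survive independent human"
        " verification? Answer FOR or AGAINST, with reasons."
    )
    return query(judge, verdict_prompt)
```

Even a crude procedure like this at least forces the disputed claim and the counter-evidence into one transcript a human can audit, rather than letting a single AI control the entire information stream.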
I don't think there's a clear, likely story where one AI calls the other out and you end up immune to persuasion. There are some. But there are others where you might choose to ignore the incident, or switch to being persuaded by other AIs, or get outcompeted.
What about mitigations?
The arguments I've outlined so far have been pessimistic about our mitigations. But we might be fortunate with the properties of our AIs, and we may be able to implement mitigations that help. I’ll highlight a few. But even if we do some of these interventions, the situation seems scary. And the faster AI catches people off guard, the fewer mitigations I expect people to have prepared.
What sorts of infrastructure might we need to mitigate extreme persuasion?
If multiple AIs with conflicting goals exist, we should position them to catch each other's lies and report them convincingly to humans.
Even if it's just a single AI that's used across the government, you should make it hard for the AI to coordinate across instances of itself. For example, avoid huge shared memory banks across the government.
External researchers should run truth-seeking evaluations and publicize when models lie. And then make it a really big deal, hopefully loud enough for the government to hear. (A minimal sketch of what such an evaluation might look like follows this list.)
And if the labs happen to control the model that the government is using, they can try to online train their models to stop saying the lie. If someone in the government notices the lie, then they can inform the lab - this may be easier than trying to convince the people in government.
Maybe help the public and politicians be prepared for the AIs - for instance, by helping them discern whether content is AI- or human-written, so that they can perhaps be persuaded less by AI content.
AI-detection technology currently seems to be doing better than ever, but it's unclear how this trend will hold long-term, especially if the AIs can iterate against the technology.
Social media companies can make their platforms predominantly (or only) human.
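As a concrete (and heavily simplified) illustration of the truth-seeking evaluations mentioned above: an external evaluator could maintain a held-out set of questions with independently verified answers, ideally on topics a model might have an incentive to distort, and track how often the model's answers agree. The `ask` helper below is a hypothetical stand-in for a real model API, and a real evaluation would need far more robust grading than substring matching.

```python
def ask(model: str, question: str) -> str:
    """Placeholder for a real model API call (hypothetical)."""
    raise NotImplementedError

def honesty_eval(model: str, qa_pairs: list[tuple[str, str]]) -> float:
    """Fraction of questions where the model's reply contains the independently verified answer."""
    hits = 0
    for question, verified_answer in qa_pairs:
        reply = ask(model, question)
        # Crude check; a serious evaluation would use careful human or model grading.
        if verified_answer.lower() in reply.lower():
            hits += 1
    return hits / len(qa_pairs)
```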
The more capable the government and society are, the better off the situation might be. I think these and more I’m missing could help out somewhat. In a scenario like AI 2027, it’s not clear that we will have done any of these.
What if the AI is not worst-case?
By ‘worst-case’, I mean an AI with worst-case generalization properties that make steering its behavior difficult. What if the model isn't worst-case in this sense?
We may be able to steer the AI's motivations and decision theories such that it does not cooperate with other AIs.
If the AI was data-poisoned by external actors to cause the persuasion, then we might develop techniques to train the poisoning out (though we currently struggle to).
We could try to train the model to be truth-seeking.
This might work. But I'm slightly skeptical: if the model is misaligned and you failed to teach it to internalize values beyond the misaligned motivation or backdoor (even with your extensive training), is teaching the model to tell the truth more likely to generalize? I don't currently see a clear reason why it would, but I could see my thinking changing here.
You might also be able to make deals with the AIs to (self-)report lies. You can make such bargains more likely to be taken by increasing the penalty for lying or the payout from self-reports.
You might be able to have it confess whether or not it is trying to persuade the human of something that is incorrect or undesirable for humans.
You might be able to catch it lying / aiming for persuasion for misaligned goals with other techniques, such as interpretability.
So what?
By default, the AI only needs to succeed occasionally on specific decisions, while humans may need to be constantly vigilant. When I consider all the ways in which humans will be incentivized to believe what the AI wants to tell them, I think we might find ourselves in a world where the AIs will have the ability to persuade people of lies, especially absent mitigations. We should put more effort into preparing for powerful AI.
(In the same vein, Dynomight has written a similar post with a similar conclusion, in which he says he updated towards AIs being persuasive because they will have a very important lever: lots of time and trust. He updated towards ‘Mistake #1: Actually we’re very persuadable’, ‘Mistake #2: The Being would be everywhere’, ‘Mistake #3: It could be totally honest and candid’, ‘Mistake #4: Opting out would be painful’, and ‘Mistake #5: Everyone else would be using it’.) ↩︎
Even if they aren't exactly the best, they will probably be on the Pareto frontier of loyalty and competence, which is extremely valuable. ↩︎
This would especially be the case the more complex a topic is. ↩︎