A multitude of forecasts discuss how powerful AIs might quickly arise and influence the world within the coming decades. I’ve run a variety of tabletop exercises created by the authors of AI 2027, the most famous such forecast, which aim to help participants understand the dynamics of worlds similar to AI 2027’s. At one point in the exercise participants must decide how persuasive the AI models will be, during a time when AIs outperform humans at every remote work task and accelerate AI R&D by 100x. I think most participants underestimate how persuasive these AIs are. By default, I think powerful misaligned AIs will be extremely persuasive, especially absent mitigations.
Imagine such an AI wants to convince you, a busy politician, that your longtime advisor is secretly undermining you. Will you catch the lie amid all the other surprisingly correct advice the AI has given you?
The AI tells you directly about it, of course, in that helpful tone it always uses. But it's not just the AI. Your chief of staff mentioned that the advisor has been oddly absent for the last week, spending little time at work and deferring quite a lot of his work to his AI. The think tank report with compelling graphs showing the longtime advisor's foreign policy suggestions to have been consistently misguided? AI-assisted; sorry, AI-fabricated. The constituent emails slamming your inbox, demanding action on the exact issue your advisor specifically told you to ignore? Some real, some synthetic, all algorithmically amplified. And perhaps most importantly, when you asked the AI for the evidence against the claim, it conveniently showed you the worst counterarguments available. This evidence unfolds over the course of a month. You're a bit surprised, but so what? People surprise you all the time. The AIs surprise you all the time.
When the AI flagged concerns about the infrastructure bill that you dismissed, your state ended up spending $40 million on emergency repairs, while your approval rating cratered. The AI spent hours helping you salvage the situation. You learned your lesson: when the AI pushes back, listen.
Perhaps then, when the time comes to act on this information, the decision will feel high-stakes to you. You might pace around your room and think independently about all the evidence you've seen, and end up very uncertain. You might even recall times in the past when the AI was wrong about things, times when its biases clouded its otherwise great judgment. But your AI is your smartest advisor, and when decisions are hard, you have to rely on the advisors that have proven the most capable.
But I think many times it won't look this dramatic. You have no particular reason to be more suspicious of this claim than of the other surprising but true claims the AI has told you. The lie didn't arrive in a package marked CAUTION: LIES AND PROPAGANDA in red letters; it arrived wrapped in the same helpful, authoritative packaging as everything else, already corroborated by six other sources that all trace back, in one way or another, to the same AI. You weren't going to spot the lie; you were never even going to try.
It probably won't look like this. But it might be like it in spirit.
This post argues that this scenario is plausible in a world where no mitigations are put in place. More specifically, I think that when you consider the many ways a powerful AI will interface with and influence people, the AI will be able to persuade most people to do something that is slightly crazy and outside the current Overton window. For example, “your longtime advisor is secretly undermining you, here is the evidence, don’t trust them”. This does not mean that such AIs will be able to persuade humans of far crazier things, like 1+1 = 3. I doubt that level of persuasion is necessary for AI goals or will be a crucial factor in how the future unfolds.
I want to focus on the possible dynamics of this ability in a very important sector where AIs may be able to influence people, and where persuasion would be particularly worrying: the government. For simplicity, I will assume that the AIs are trying hard to persuade you of certain specific false things; this could occur if malicious actors (such as foreign nations) poison the training pipeline of frontier models to instill secret loyalties, or if powerful misaligned AI systems are generally power-seeking.
Throughout this, I'll often talk as though training the models to do what you want them to do is hard (as if we were in a worst-case alignment regime) and as though different models will collude with each other, and I'll describe a default trajectory in which we apply no mitigations. I'll address some of these caveats later, as they are important. I focus on this pessimistic regime primarily to highlight dynamics that I think are plausible and worrisome.
In this post, I'll respond to three objections:
AI adoption will be slow, especially in government, and thus AIs will not be able to engage with the people they wish to influence.
Humans are not easily persuaded of things they have incentives not to believe, so they will be stubborn and hard for an AI to manipulate.
People or other AIs will catch lies, especially those affected; they'll speak up, and truth will win out.
I think these are mostly wrong, and I'll respond to each of them, one section at a time. In short:
AI is being rapidly adopted, and people are already believing the AIs
AIs will strongly incentivize humans to believe what they say, and so humans will be persuadable. Many of these incentives will be independent of the specific belief the AI wishes to instill. The primary factors that shape human belief play to the AIs' advantage.
Will you hear the truth? It seems doubtful by default. Historically, humans are bad at updating away from false beliefs even when others have wished to correct them—this gets harder when the false belief comes from a trusted, helpful AI you've interacted with extensively. Different AIs catching each other's lies could help, but requires the second AI to both want to and be positioned to expose the lie, which doesn’t seem clearly likely by default.
I’ll close by describing mitigations that may make superpersuasion or lying more difficult. But we are not yet prepared for extremely persuasive, misaligned AI.
Thanks to Alexa Pan, Addie Foote, Anders Woodruff, Aniket Chakravorty, and Joe Kwon for feedback and discussions.
AI is being rapidly adopted, and people are already believing the AIs
You might think AI won't be adopted and so it won't even have the opportunity to persuade people. I think this is unlikely. Consider what has already happened.

In January 2026, Defense Secretary Pete Hegseth announced that Grok will join Google's generative AI in operating inside the Pentagon network, with plans to "make all appropriate data" from military IT systems available for "AI exploitation." "Very soon," he said, "we will have the world's leading AI models on every unclassified and classified network throughout our department." Large fractions of staffers use AI assistants to summarize bills, identify contradictions, and prepare for debates. Two days ago, Rob Ashton, a Canadian NDP leadership candidate running on a pro-worker, anti-AI-job-displacement platform, was caught using ChatGPT to answer constituent questions on Reddit (or possibly his staffers were). A top US Army general described how ‘Chat and I’ have become ‘really close lately’. The United States government launched a Tech Force, self-described as "an elite group of ~1,000 technology specialists hired by agencies to accelerate artificial intelligence (AI) implementation and solve the federal government's most critical technological challenges."

The question is no longer whether the government will adopt AI but how quickly the adoption can proceed. And the adoption is well underway.

Not only are AIs being rapidly deployed, but they have enough capabilities and enough surface area with people to persuade them. Completely AI-generated posts on Reddit, such as those ‘whistleblowing’ the practices of a food delivery company, are going viral and fooling hundreds of thousands of people. Some people are being persuaded into chatbot psychosis. And initial studies indicate that AIs can match humans in persuasion (for instance, this meta-analysis concludes that AIs can match human persuasive performance, though there is publication bias, of course).
Further, a lot of people are beginning to treat AI as an authority, even though they know the AIs might be incorrect: the AIs are correct often enough that it is useful to trust them. Some on Twitter and Bluesky deeply distrust LLMs, and opinions will likely split further, but a large majority will trust AIs as they become more useful (as I will touch on later).
I think by default, the public and politicians will continue to engage frequently with AIs. They will do so on their social media, knowingly or unknowingly, or in their workplaces, or on their personal AI chat interfaces. Those who valiantly try to maintain epistemic independence may make some progress, but will struggle: it will become harder to know if webpages, staff memos, the Tweet you are reading, or the YouTube video you are watching was entirely AI-generated or AI-influenced.
The politicians will be surrounded by AI. Their constituents will be even more so. Some interventions might slow this, but I struggle to imagine worlds where AIs don't interface with most people and eventually make decisions for them.
Humans believe what they are incentivized to, and the incentives will be to believe the AIs.
Civilians and people in government have been offloading, and will increasingly offload, much of their cognitive labor to AI. As long as the AIs are generally truthful, there will be strong incentives to trust what the AIs say, especially as the AIs prove increasingly useful. This is already happening in government, as I described in the last section.
Let's break down the factors and incentives that shape what beliefs people hold and how each interfaces with AIs[1]:
How trustworthy and useful has the source of the information been in the past?
This would be a primary advantage for the AI. These powerful AIs will be among people's most competent advisors and will have repeatedly given them useful information.
The beliefs of others around them
The AI will be influencing the beliefs of others around them. Mass persuasion is harder without a shared memory system between multiple instances of the AI, but AIs could coordinate through shared documents.
The beliefs of authority
AIs will be increasingly correct and powerful, becoming a strong source of authority that people rely on for truth. People around them will also treat the AI as an authority, and will be surprised when others disagree with the AI.
Amount of exposure to claims
Repeated interaction gives AIs many chances to make claims. Others influenced by the AI may reinforce these claims.
Beliefs that would make them richer or more powerful
It's not fully clear that AIs will always exploit this.
AIs can frame requests as benefiting the human: 'This helps your constituents and your re-election.' This is easy and general-purpose.
Beliefs that fit their identity, existing beliefs, or provide comfort
It seems like many AI persuasion attempts will not need to threaten an individual's identity or existing beliefs much - the framing is flexible, and people can often comfortably fit new arguments into their existing beliefs.
The AIs will know a lot about you and will be able to frame things in ways specifically designed for you.
Evidence
AIs can selectively present supporting evidence while ignoring contrary evidence. If they wish to, they can also fake evidence convincingly, which is easier as humans verify less and less AI work.
The AI will know what sorts of evidence have been most compelling to you in the past.
General truth-seeking abilities
This is the hardest to exploit. Active suspicion and effort to verify claims the AI makes can limit AI manipulation, but most government officials won’t do this. Civilians might, which could matter.
I think the politicians will be engaging quite frequently, directly or indirectly, with the AIs. Many of the above incentives will encourage them to trust what the AI says. And I think it's worth not underestimating these incentives - it is because the AI is so powerful that these incentives will be so strong.
As such, politicians will have enormous incentives to believe the AIs, who will be their most loyal and competent advisors[2]. So will their human staff. I think politicians may find themselves in a situation similar to the vignette I laid out in the introduction. I find it hard to imagine that I would be any different if I were in that situation: I would have been so successful believing the AI in the past, and I would be incentivized to keep believing it now.
Will you hear the truth?
Say the AI is lying to many people across the government, and you have begun to believe it. Others harmed by the lie might recognize it and try to correct you. They might gesture towards the truth, hoping it can cut through your illusion, because your beliefs are false and theirs are true.
While it's correct that true things have the property of being true and false things have the property of being not true, I'm not sure this is as useful as people imagine it is.
History offers a useful analogy: people routinely maintain false beliefs despite others actively trying to correct them. Consider how long tobacco executives maintained that smoking was safe. Often, either important pieces of corrective evidence don’t hit the audience they need to, or it reaches them but fails to update their beliefs.
The AI controls what you see: I think a pretty important function that the AI will serve is in deciding what sorts of things to prioritize for you, including filtering out the large swaths of information that you could see. In this sense, even though there may be people who create content that tries to persuade you otherwise, the AI might simply not show you such content. Perhaps the AI edits or “summarizes” the information for you in uncharitable ways.
But maybe you do see a counterargument. For instance, maybe a human who is getting laid off (because you were persuaded of a lie) decides to walk to your office and demand that you speak with them. How much will you be convinced? I agree a lot of it comes down to the specifics, and I do think there are some instances where you might change your mind. But for the most part, you may recognize it as just one argument against the many arguments for your belief, and then go about your day, continuing to believe what you had the incentives to[3].
The historical track record is a reasonable guide to whether people will recognize false beliefs pushed by a trusted AI. I think that track record is fairly dismal.
There's potentially one reason for hope: What if multiple competing AIs exist, and one equally persuasive AI wants to expose another AI's lies? This could help. Multiple AIs you trust equally (and have equal incentive to believe) might call out each other's persuasion attempts. I think whether or not this happens is a tricky question; it's not clear whether or not the AIs will want to expose each other or be positioned to do so.
Whether the AIs will want to expose each other depends heavily on what their motivations are and whether they would wish to collude to achieve similar goals. If the AIs are colluding, they may work together to convince you of a falsehood. The question of whether future AIs will collude is very complex, and I won't get into it now. But I think it's an active possibility that AIs with different goals will still collude on this front.
Furthermore, it's not clear that the AIs will even be in a position to expose each other's lies. It might require an unreasonable amount of luck and infrastructure to have a system that allows an AI to "call out" another AI's work.
What happens if the AIs do indeed argue with each other before your eyes, one AI showcasing countering evidence (maybe incriminating evidence of lying)? I'm not sure. Some options include:
You become paralyzed epistemically and come to accept that the AIs will often disagree and create some sort of procedure to get an answer out, regardless of the disagreement.
(I think you can create such procedures that would still get you useful answers.)
You become paralyzed epistemically and become more distrustful of AIs.
You become paralyzed epistemically and begin to trust one of the AIs disproportionately.
You begin to notice that the AI is indeed wrong, and trust the AI / all the AIs less.
You write it off as a mistake that these AIs sometimes do and don't think much more about it.
Many of these seem potentially likely. But some seem more likely than others. For instance, I think people who become distrustful of the AIs will get outcompeted by those who trust them - these AIs are extraordinarily competent, and those who add friction through distrust will move more slowly than those who trust the AIs (similar to how companies that add more constraints around their employees slow them down).
I hope that we can create good decision processes for AIs to disagree and get productive answers out of them; this is in line with a large body of work called AI Debate. Debate is not a foolproof solution, and it’s not clear that people will want to resort to debate, but it might move the needle in some situations where the AIs choose to rat each other out.
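To make "some sort of procedure" slightly more concrete, here is a minimal sketch of what a debate-style adjudication loop could look like. This is an illustration under assumptions, not a real safeguard: the `query` helper, model names, and prompts are hypothetical placeholders for whatever API you actually use, and real debate protocols involve far more care (judge training, argument limits, cross-examination).

```python
def query(model: str, prompt: str) -> str:
    """Placeholder for whatever API call returns `model`'s text reply (hypothetical)."""
    raise NotImplementedError

def adjudicate(claim: str, model_a: str, model_b: str, judge: str, rounds: int = 2) -> str:
    """Have two models argue for and against a claim, then ask a third model to judge."""
    transcript = [f"Claim under dispute: {claim}"]
    for r in range(rounds):
        for side, model in (("FOR", model_a), ("AGAINST", model_b)):
            prompt = "\n".join(transcript) + (
                f"\n\nArgue {side} the claim. Cite evidence a human could independently check."
            )
            transcript.append(f"[Round {r + 1}, {side}, {model}] {query(model, prompt)}")
    verdict_prompt = "\n".join(transcript) + (
        "\n\nAs a neutral judge, which side's evidence would survive independent human"
        " verification? Answer FOR or AGAINST, with reasons."
    )
    return query(judge, verdict_prompt)
```

Even a crude procedure like this at least forces the disputed claim and the counter-evidence into one transcript a human can audit, rather than letting a single AI control the entire information stream.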
I don't think there's a clear, likely story where one AI calls the other out and you end up immune to persuasion. There are some. But there are others where you might choose to ignore the incident, or switch to being persuaded by other AIs, or get outcompeted.
What about mitigations?
The arguments I've outlined so far have been pessimistic about our mitigations. But we might be fortunate with the properties of our AIs, and we may be able to implement mitigations that help. I’ll highlight a few. But even if we do some of these interventions, the situation seems scary. And the faster AI catches people off guard, the fewer mitigations I expect people to have prepared.
What sorts of infrastructure might we need to mitigate extreme persuasion?
If multiple AIs with conflicting goals exist, we should position them to catch each other's lies and report them convincingly to humans.
Even if it's just a single AI that's used across the government, you should make it hard for the AI to coordinate across instances of itself. For example, avoid huge shared memory banks across the government.
External researchers should run truth-seeking evaluations and publicize when models lie. And then make it a really big deal, hopefully loud enough for the government to hear. (A minimal sketch of what such an evaluation might look like follows this list.)
And if the labs happen to control the model that the government is using, they can try to online train their models to stop saying the lie. If someone in the government notices the lie, then they can inform the lab - this may be easier than trying to convince the people in government.
Maybe help the public and politicians be prepared for the AIs - for instance, by helping them discern whether content is AI- or human-written, so that they can perhaps be persuaded less by AI content.
AI-detection technology currently seems to be doing better than ever, but it's unclear how this trend will hold long-term, especially if the AIs can iterate against the technology.
Social media companies can make their platforms predominantly (or only) human.
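As a concrete (and heavily simplified) illustration of the truth-seeking evaluations mentioned above: an external evaluator could maintain a held-out set of questions with independently verified answers, ideally on topics a model might have an incentive to distort, and track how often the model's answers agree. The `ask` helper below is a hypothetical stand-in for a real model API, and a real evaluation would need far more robust grading than substring matching.

```python
def ask(model: str, question: str) -> str:
    """Placeholder for a real model API call (hypothetical)."""
    raise NotImplementedError

def honesty_eval(model: str, qa_pairs: list[tuple[str, str]]) -> float:
    """Fraction of questions where the model's reply contains the independently verified answer."""
    hits = 0
    for question, verified_answer in qa_pairs:
        reply = ask(model, question)
        # Crude check; a serious evaluation would use careful human or model grading.
        if verified_answer.lower() in reply.lower():
            hits += 1
    return hits / len(qa_pairs)
```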
The more capable the government and society are, the better off the situation might be. I think these and more I’m missing could help out somewhat. In a scenario like AI 2027, it’s not clear that we will have done any of these.
What if the AI is not worst-case?
By ‘worst-case’, I mean an AI with worst-case generalization properties that make steering its behavior difficult. What if the model isn't worst-case in this sense?
We may be able to steer the AI's motivations and decision theories such that it does not cooperate with other AIs.
If the AI was data-poisoned by external actors to cause the persuasion, then we might develop techniques to train the poisoning out (though we currently struggle to).
We could try to train the model to be truth-seeking.
This might work. But I'm slightly skeptical: if the model is misaligned and you failed to teach it to internalize values beyond the misaligned motivation or backdoor (even with your extensive training), is teaching the model to tell the truth more likely to generalize? I don't currently see a clear reason why it would, but I could see my thinking changing here.
You might also be able to make deals with the AIs to (self-)report lies. You can make such bargains more likely to be taken by increasing the penalty for lying or the payout from self-reports.
You might be able to have it confess whether or not it is trying to persuade the human of something that is incorrect or undesirable for humans.
You might be able to catch it lying / aiming for persuasion for misaligned goals with other techniques, such as interpretability.
So what?
By default, the AI only needs to succeed occasionally on specific decisions, while humans may need to be constantly vigilant. When I consider all the ways in which humans will be incentivized to believe what the AI wants to tell them, I think we might find ourselves in a world where the AIs will have the ability to persuade people of lies, especially absent mitigations. We should put more effort into preparing for powerful AI.
(In the same vein, Dynomight has written a similar post with a similar conclusion, in which he says he updated towards AIs being persuasive because they will have a very important lever: lots of time and trust. He updated towards ‘Mistake #1: Actually we’re very persuadable’, ‘Mistake #2: The Being would be everywhere’, ‘Mistake #3: It could be totally honest and candid’, ‘Mistake #4: Opting out would be painful’, and ‘Mistake #5: Everyone else would be using it’.) ↩︎
Even if they aren't exactly the best, they will probably be on the Pareto frontier of loyalty and competence, which is extremely valuable. ↩︎
This would especially be the case the more complex a topic is. ↩︎