The influence conjecture and its implcations

I've been following the AI alignment debate and while the discussion around a superintelligent AI taking over the world is frightening and fascinating, I feel that there are some very real and imminent risks not necessarily dependent on ASI, that deserve more focus. Especially how AI can be both a vector for cyber attacks and produce effective attacks itself. As a bit of an outsider coming from a crypto threat analysis background, I wanted to share my take.

The Influence Conjecture

We’re just barely beginning to understand the extent of AI agents' vulnerability to adversarial prompting. The work done by Zeng et al ^[1] describes how LLM’s can be manipulated or socially engineered to produce unintended outputs through persuasive adversarial prompts (PAP). Alignment science seeks to engineer LLM’s to defend against adversarial prompting. However, I conject that vulnerability to PAP or more generally the tendency to be influenced is a basic aspect of intelligence of any kind. We have never before experienced a kind of intelligence that was different from human intelligence other than animals which appear to be just as vulnerable to influence as we are. Influence comes in many forms, and it’s not only useful for malice, it is also the basis of most sales and marketing techniques. My conjecture is that it is impossible to engineer an artificial intelligence system to not be able to be influenced to do things that are harmful or otherwise unintended.

In the context of this conjecture, alignment science is not about “solving jailbreaking”, but about making it as difficult as possible to influence AI’s to do harmful things as well as creating the safeguards to prevent it from causing damage when it happens.

Furthermore, influencing is a skill that humans can naturally learn or otherwise possess. Some sales people are just more successful than others. In the context of malicious influence, a prime example could be Charles Manson who persuaded normal people to commit unthinkable acts going against their nature. You could also draw parallels to the Nazi regime, where otherwise normal people accepted or committed heinous acts because of societal influence.

The AI influencer

Naturally, just like humans can be good at influencing others, AI’s can too. Although more indirect, it could be argued that social media algorithms are a form of AI influencers which are trained to influence people to keep scrolling and be engaged in the social media feed. I think we can all agree that this works frighteningly well and has disrupted society in many ways. With LLM’s getting better every day, it’s natural to assume that they too can aid in persuading people. Many experts have already raised concerns around this ^[2] and AI aided scams is an emerging trend. In an article by Hoxhunt, they claim that AI driven phishing attacks are more successful than those constructed by elite human red teams^[3] – and that’s just using the currently available AI models.

Projecting further into the future, AI assisted and agent led scams could dominate across all scam categories. Everything from investment scams and charity scams to romance scams and impersonation scams. Once these scams start to expand to a grander scale, this is going to be a major societal issue. People are used to trusting what other people tell them online – especially if it’s a video call with an AI that looks and talks exactly like one of your former coworkers or high school friends. Romance scams done by agents that can appear very attractive and “perfect” to the victim with full ability to appear on video, talk on the phone, etc., are probably going to be very effective especially given we know that AI’s are skilled at influencing.

A scenario for an AI enabled future

While the chase for AGI/ASI is very interesting, pre-AGI AI’s can and will likely change the world very dramatically. I think that the most impactful model will not be the smartest, it will be the first (likely cheap) model that does a good enough job (max ROI) that is deployed effectively on a massive scale. We’re already seeing companies like salesforce sell agent solutions that can do both internal work and act as externally facing customer support. ^[4]We could see the raw work power (meaningful tasks completed over time) of agents exceeding that of humans before we know it. With so many agents, having human approvers in the loop becomes impractical and financially unfeasible and may prove ineffective as humans are vulnerable to review fatigue and context overload. For example, I have approved some bad code or a bad CLI command an AI made because I was too lazy to thoroughly review it, just defaulting to “approve”. It’s likely that AI “approvers” would perform better or similarly today due to the practical limitation of review fatigue and that baked in safeguards are going to be more effective still.

There are a few aspects of AI agents that make them very compelling and very interesting as a workforce. They have the potential to create a much more direct relationship between money spent on “wages” and output. Today, If you need more work power for your team, you have to go find and hire good talent which is an expensive process in itself, and once you actually hire a new person, you need to train them and let them get to know the work. You probably have to pay approximately 2-12 months worth of wages (depending on the role and the person) to the new hire, the colleagues training them and the recruiter and it will take several months to actually get the person ready to contribute. If your team doesn’t really need the work power anymore, letting the person go is problematic because you’re giving up that 2-12 month money and time cost and you may need them in the future again. The person will find ways to be busy and involve others in their low impact work which in turn will further decrease the organization's efficiency.

In an AI team, if you need more work power, you can just deploy another agent copying over the knowledge from the current agents which is likely to be in the order of seconds, and if you no longer need it, you can just stop the deployment instantly. With AI teams, organizations can scale up and down and reprioritize their work power in an unimaginably efficient way going many orders of magnitude beyond what the most efficient organization can do today. On top of that, agents are orders of magnitude cheaper and faster than humans. These abilities are going to be so valuable that the agents themselves may not even have to be that smart to outperform humans. Maybe most of the world’s intelligence work is going to be done by billions of GPT-4 level intelligence agents constantly popping in and out of existence way before we create a superintelligent AI to orchestrate them. These agents would naturally need to talk to each other to complete their tasks. And it would not only be communication within a single company, we would see agents from different companies communicating. I think that almost every organization is going to have externally facing agents that accept inquiries from both humans and other agents.

Agent to agent social engineering

Given the scenario above, the conjecture and that AI’s can lead or aid in scams, it is likely that we’re going to see manipulative agents trying to hack the agent workforce through social engineering. The fact that adversarial prompting techniques seem to be amplified in multi-agent systems doesn’t help.^[5] If agents are going to be as widespread as my scenario describes this could be a very severe problem. Crypto is much less widespread than AI is likely to become, and the issues around private key theft and contract exploitation have disrupted the global economy and continue to threaten the success of the technology. A key part of this disruption was orchestrated by advanced persistent threat actors (APT) like the North Korean Lazarus Group.^[6]

Speed and scale changes everything

As stated earlier, the speed at which AI’s can do things are orders of magnitude faster than what we’re used to. Crypto heists executed by APT’s usually involve an intricate obfuscation plan of where the stolen funds go after the fact (using mixers, bridges etc.). The goal is to cash out before the response team’s analysts can react and figure out all the different cash out points and intercept the relevant exchanges. This highlights the importance of speed in cyber attacks. With agent led attacks, this speed is going to be even wilder. Imagine a social engineering attack where someone convinces an employee to hand over access to internal systems, but they’re already inside the system 5 seconds after the first inquiry to the externally facing agent initiated the attack. In addition the effort for APT’s to create an attack plan beforehand goes down by a lot with AI assistance. If we want to be able to respond to these types of attacks as they’re happening, we need to be able to very quickly deploy incident response agents of our own that can match the speed of the agentic attackers as humans would stand no chance in understanding what’s going on before it’s too late.

Conclusion

I think we should take agent driven cyber attacks – particularly those led by APT very seriously. Maybe it’s even more pressing than a superintelligent misaligned AI taking over the world. Of course, the research into ASI is very interesting and very important, but it should not overshadow the severity of the threat we face from less intelligent AI deployed by APT and the attack surface we expose when agents become widespread.

LESSWRONG
LW