If you or anyone else is interested in getting funding for this kind of threat modeling, Georgetown's Center for Security and Emerging Technology put out an RFP on risks from internally deployed frontier models, due June 30.
https://cset.georgetown.edu/wp-content/uploads/FRG-Call-for-Research-Ideas-Internal-Deployment.pdf
They imagine writing small and carefully locked-down infrastructure and allowing the AIs to interact with it.
That's surprising and concerning. As you say, if these companies expect their AIs to do end-to-end engineering and R&D tasks internally, it seems difficult to imagine how they could do that without giving the AIs employee-level privileges. Any place where they don't is a place where humans turn into a bottleneck. I can imagine a few possible objections to this.
Like, to be clear, I would definitely prefer a world where these organizations wrote "small and carefully locked-down infrastructure" as the limited surface their AIs were allowed to interact with; I just don't expect that to actually happen in practice.
This still suffers from the incentive gradient pushing quite hard to just build end-to-end agents. Not only will it probably work better, but it'll be straight up cheaper and easier!
The same is true of human software developers - your dev team sure can ship more features at a faster cadence if you give them root on your prod servers and full read and write access to your database. However, despite this incentive gradient, most software shops don't look like this. Maybe the same forces that push current organizations to separate out the person writing the code from the person reviewing it could be repurposed for software agents.
One bottleneck, of course, is that one reason it works with humans is that we have skin in the game - sufficiently bad behavior could get us fired or even sued. Current AI agents don't have anything to gain from behaving well or lose from behaving badly (or sufficient coherence to talk about "an" AI agent doing a thing).
one reason it works with humans is that we have skin in the game
Another reason is that different humans have different interests: your accountant and your electrician would struggle to work out a deal to enrich themselves at your expense, but it would get much easier if they shared the same brain and were just pretending to be separate people.
one reason it works with humans is that we have skin in the game - sufficiently bad behavior could get us fired or even sued
I don't think this framing is at all useful. "Have skin in the game" is just a human "capability elicitation" method.
You put a human under a performance-based incentive with the hopes of obtaining better performance. If successful, the performance uplift doesn't come "from nothing". The human was always capable of performing on that level. You just made the human use more of his capabilities.
The performance limitations of current AIs are more "AI doesn't have the capabilities" than "AI is not motivated to perform and doesn't use the capabilities it has" (80% confidence).
If a human misbehaves badly enough on a task they will be removed from the pool of agents that will perform tasks like that in the future. Humans are playing an iterated game. Current LLM agents generally are not (notable exception: agent village).
You could of course frame the lack of persistent identity / personal resources / reputation as a capabilities problem on the AI side rather than a problem with companies expecting nonhuman minds to expose a fully human-like interface, it mostly depends on which side seems more tractable. I personally see a lot of promise in figuring out how to adapt workflows to take advantage of cheap but limited cognition - feels easier than trying to crack the reliability problem and the procedural memory problem, and there are definitely safety disadvantages in setting up your AI systems to expose a situationally aware, persistent human-like interface.
I fail to see how the same wouldn't apply to the way LLMs are used now.
If an LLM is not up to the task, it will be augmented (prompting, scaffolding, RAG, fine tuning), replaced with a more capable LLM, or removed from the task outright.
The issue isn't that you can't "fire" an LLM for performing poorly - you absolutely can. It's that even the SOTA performance on many tasks may fall short of acceptable.
I'm not sure we disagree on anything substantive here.
If you have a team of 100 software developers each tasked with end-to-end delivery of assigned features, and one of them repeatedly pushes unreviewed and broken/insecure code to production, you can fire that particular developer, losing out on about 1% of your developers. If the expected harm of keeping that developer on is greater than the expected cost of replacing them, you probably will replace them.
If you have a "team" of "100" AI agents "each" tasked with end-to-end delivery of assigned features, as they are currently implemented (same underlying model, shared-everything), and one instance does something bad, any mitigations you implement have to affect all 100 of them.
That seems like it produces more pressure against the "shared-everything, identical roles for all agents in the organization" model for groups of AI developers than for groups of human developers. Organizational pressures for groups of human developers already push them into specialized roles, and I expect those pressures to be even stronger for groups of AI developers. As such,
They plan on trying to thread the needle by employing some control schemes where (for example) different "agents" have different permissions. i.e. a "code writing" agent has read permissions for (some parts of) the codebase, the ability to write, deploy, and test changes to that code in a sandboxed dev environment, and the ability to open a pull request with those changes.
doesn't particularly feel like an implausible "thread the needle" strategy; it seems like the sort of thing we get by default, because the incentives are already pushing so incredibly hard in that direction.
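To make that kind of scheme concrete, here's a rough sketch (role names, actions, and resource patterns are all made up for illustration, not anyone's actual setup) of what a deny-by-default policy for a "code writing" agent could look like:

```python
# Hypothetical sketch of a role-scoped permission policy for a "code writing"
# agent. Role names, actions, and resource patterns are illustrative only.
from fnmatch import fnmatch

CODE_WRITER_POLICY = {
    "allow": {
        "repo:read":             ["services/payments/*"],  # read only the assigned subtree
        "sandbox:deploy":        ["dev-*"],                 # deploy/test in sandboxes only
        "vcs:open_pull_request": ["services/payments/*"],
    },
    "deny": {
        "repo:write":   ["*"],   # no direct pushes anywhere
        "prod:deploy":  ["*"],   # no production deploys
        "secrets:read": ["*"],   # no credential access
    },
}

def is_allowed(policy: dict, action: str, resource: str) -> bool:
    """Deny-by-default: allowed only if some 'allow' rule matches and no 'deny' rule does."""
    def matches(rules: dict) -> bool:
        return any(fnmatch(resource, pattern) for pattern in rules.get(action, []))
    return matches(policy["allow"]) and not matches(policy["deny"])

assert is_allowed(CODE_WRITER_POLICY, "sandbox:deploy", "dev-environment-3")
assert not is_allowed(CODE_WRITER_POLICY, "prod:deploy", "payments-cluster")
assert not is_allowed(CODE_WRITER_POLICY, "repo:read", "services/identity/main.py")
```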
First, when I talk to security staff at AI companies about computer security, they often seem to fail to anticipate what insider threat from AIs will be like.
Why do you think this? Is it that they are not thinking about large numbers of automated agents running around doing a bunch of research?
Or is it that they are thinking about these kinds of scenarios, and yet they still don't apply the insider threat frame for some reason?
The other consideration is: do you also isolate the AI workers from the human insiders? Because you would still want to handle scenarios where the AI has access to the humans who have access to sensitive systems.
I agree insider vs outsider threat is an important distinction, and one that I have seen security people take seriously in other contexts. My background is in enterprise IT and systems admin. I think there's some practical nuance missing here.
Insofar as security people are expecting to treat the AI as an outsider, they're likely expecting to have a hard boundary between "systems that run the AI" and "systems and tools the AI gets to use", where any given user has access to only one or the other.
This is already fairly common practice, in the sense that security people already think about how to avoid giving too much privileged access to any specific insider. E.g., ideally, in a large enough company, the admins who have access to most of the systems might not be given access to the systems responsible for logging what they did, nor to the system granting access to other systems - centralized access control/identity (e.g. AWS IAM, Okta, etc.). In that example, the team that manages those logging and access-control systems would not have access to anything outside those systems, and would just be responsible for implementation according to whatever the security team (and indirectly, HR, say) decides. The security people might have just read-only access to systems (e.g. through a log search/archival tool and any telemetry from the security software) and rely on the separate IT team for all changes.
More extreme levels of security/paranoia are possible and do exist, especially if you look at the largest companies or national security applications. In the latter case, the US govt maintains an entire separate internet with distinct fiber optic cabling, routers, etc.; the computers have no connection outside this separate network, sit exclusively in shielded rooms in physically access-controlled buildings, no unauthorized electronics go into those rooms, nor does anyone without a clearance, etc. I could go on, but hopefully I've gotten the idea across; AI is a distinct threat that would need a different flavor of extreme precautions.
For AI there's a pretty natural boundary to draw between the AI-specific hardware accelerators on which the AI training and inference run (and the associated control systems, data warehouse, etc.), vs the tooling the AI is permitted to use while it's running. The AI's inference running on a particular system doesn't automatically give it privileged access to that system (though exploits might). Tooling that the AI calls into or uses need not exist on the same hardware that's running inference. If such a boundary gets defined by the security team, it's then possible to treat any leakage of AI state affecting the inference hardware as fishy and deserving of investigation, just as if it came from an outside user (because, with that division, it did).
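As a rough illustration of that boundary (zone names and address ranges here are invented), the enforcement rule can be as simple as: any connection originating in the tooling plane that targets the inference plane gets flagged for investigation, the same way an external connection would be:

```python
# Hypothetical sketch of the boundary described above: the "inference plane"
# (accelerators, training/inference control systems) and the "tooling plane"
# (systems the AI is permitted to use) are disjoint network zones. Any traffic
# from the tooling plane into the inference plane is treated as suspicious.
import ipaddress

ZONES = {
    "inference-plane": ipaddress.ip_network("10.10.0.0/16"),  # made-up ranges
    "tooling-plane": ipaddress.ip_network("10.20.0.0/16"),
}

def zone_of(addr: str) -> str | None:
    ip = ipaddress.ip_address(addr)
    for name, net in ZONES.items():
        if ip in net:
            return name
    return None

def review_flag(src: str, dst: str) -> bool:
    """Flag any tooling-plane connection that reaches toward the inference plane:
    the AI's own substrate should not be reachable from the systems it drives."""
    return zone_of(src) == "tooling-plane" and zone_of(dst) == "inference-plane"

print(review_flag("10.20.3.7", "10.10.1.1"))   # True: investigate
print(review_flag("10.20.3.7", "10.20.9.9"))   # False: tooling-to-tooling traffic
```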
Happy to answer related questions about what approaches exist here and might be considered ordinary practice. IT and information security mostly happen outside of the academic literature, and companies don't like to talk about the specific security precautions they're taking. You'll probably need to talk to many people with industry experience to get a well-rounded take, as nobody gets to see all that many implementations in full detail in one career.
While you could give your internal AI wide indiscriminate access, it seems neither necessary nor wise to do so. It seems likely you could get at least 80% of the potential benefit via no more than 20% of the access breadth. I would want my AI to tell me when it thinks it could help me more with greater access so that I can decide whether the requested additional access is reasonable.
Curated. I found this a helpful frame on AI security and I'm kinda surprised I hadn't heard it before.
You haven’t heard it before because you probably don’t work in IT.
AI is a tool - just the next evolution in automation, like scripting or macros. It's not an "employee" with benefits or intent. Framing it as an insider vs outsider threat fundamentally misunderstands how we’ve always approached internal risk in real-world systems. IT has always been about doing more with fewer people, using the tools available. This framing doesn’t reflect practice - it reflects a lack of exposure to it.
I like the way you're thinking about AI and security. The insider vs outsider risk dynamic is very ingrained in cybersecurity and is often discussed while threat modeling a system. The distinction is muddled because the most impactful external threat is one that steals an insider's credentials. History shows it's only a matter of time before insider permissions are used by an outsider, which gave birth to the phrase "assume breach". If the AI is a black box, it's safest to assume it has been compromised. I'd like to add another dynamic, malicious vs accidental, to your AI insider risk model that makes clear why AIs should not be granted insider privileges like human employees.
An insider has the ability to cause damage maliciously or accidentally. Accidental damage is by far the most common event, but we have decades (centuries, if you count other engineering disciplines) of experience putting processes in place to prevent and learn from accidents. Generally the control mechanism slows the person down by asking for social proof (another person's sign-off), and that leads to a key mitigating control on their access: it's hard to cause major damage when action is slow.
Adding social proof adds transparency, covers knowledge gaps and increases social pressure to get it right. People align with the goals of their group and intrinsically do not want to damage their group. When they harm the group accidentally it has a social cost, which acts as a preventative control.
AIs act quickly and do not pay the same social cost. So those preventative controls do not count towards balancing the risk of overly broad insider permissions.
Until we have Explainable AI, it will be risky to fully trust that an AI is free of hidden malicious intent or accident-prone edge cases. The breadth of their input space makes firewalls inefficient, and the speed at which they can act makes detective controls ineffective.
AIs' speed, lack of human-like social consequences, and opaque decision making create a large source of risk that puts AI outside of what qualifies employees for insider privileges. Similarly, their speed, cloneability, and adaptability mean they don't need to work like humans and don't need insider privileges.
An AI can look up and read documentation to acquire context faster than any person. It can be trained to do a new task (at least one it's capable of) almost instantly. It can pass context to another AI much faster than a person can. It can request permissions on a per-task basis for specific rows/columns/fields in a database with minimal slowdown. An AI doesn't need to see the bigger picture, have a connection to the group, or have novelty to stay motivated the way a person does. A person would quit if their job was to do the same step in a workflow 1000 times in a row and ask for permission with a unique identifier each time, so we built workflows to maintain employee happiness.
Separation of concerns, minimizing blast radius, observability, and composability are vital to building resilient software services. AIs do not change those principles. Using one AI for an entire workflow, or the same AI across different workflows, is not necessary the way it is for a person. So giving them permissions the way we do for people is not necessary. Running instances of multiple different 600B+ parameter models may be prohibitively expensive today, but eventually it will be cost-effective to have single-use model instances, in a style similar to AWS Lambdas.
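A rough sketch of what such Lambda-style single-use instances could look like, with a per-task credential scoped to exactly the rows/fields the task needs and expiring when the task finishes (all names here are hypothetical, and `run_model` is a stand-in for whatever inference call you actually use):

```python
# Hypothetical sketch of "single-use" agent invocations: each task gets a fresh
# instance with a narrowly scoped, short-lived credential that is revoked when
# the task returns. Names are illustrative, not a real access-control API.
import uuid
from contextlib import contextmanager
from datetime import datetime, timedelta, timezone

@contextmanager
def scoped_credential(table: str, columns: list[str], row_filter: str, ttl_s: int = 300):
    """Mint a short-lived, task-specific grant; revoke it on exit."""
    token = {
        "id": str(uuid.uuid4()),
        "grants": {"table": table, "columns": columns, "filter": row_filter},
        "expires": datetime.now(timezone.utc) + timedelta(seconds=ttl_s),
    }
    try:
        yield token
    finally:
        token["revoked"] = True  # in a real system: call the access-control plane

def run_model(prompt: str, credential: dict) -> str:
    # Stand-in for an actual inference call made by a fresh model instance.
    return f"(model output for {prompt!r} using grant {credential['id'][:8]})"

# One task, one instance, one narrowly scoped grant:
with scoped_credential("invoices", ["amount", "due_date"], "customer_id = 4217") as cred:
    print(run_model("summarize overdue invoices for customer 4217", cred))
```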
A natural extension of the way AI interacts today, via MCP, makes it a kind of insider: one with a specific role, and specific access patterns that match that role.
Even an org that is not concerned with misaligned AI or the like will still want to lock down exactly what updates each role can make within the org, just as orgs typically lock down access to different roles within a company today.
Most employees cannot access accounts receivable, and access to the production databases in a tech company is very carefully guarded. Mostly not from fear of malevolence; it's just a fear that a junior dev could easily bollix things horribly with one errant command. Much in the same way, the org will want to specialize the AI into different roles, provide different access according to these roles, and test these different AIs in each role.
All of this seems to follow quite naturally from existing corporate practice.
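A minimal sketch of that role-based lockdown for tool-using agents (role and tool names are invented; think of the tools as the operations an MCP server would expose):

```python
# Hypothetical sketch: each agent role gets an allowlist of tools it may call,
# mirroring how human access is locked down by role today. Names are made up.
ROLE_TOOLSETS = {
    "support-triage": {"ticket.read", "ticket.comment"},
    "analytics":      {"warehouse.query_readonly"},
    "code-writer":    {"repo.read", "sandbox.run_tests", "repo.open_pr"},
    # Note: no role gets "prod_db.write"; that stays behind a human-gated path.
}

def authorize_tool_call(role: str, tool: str) -> bool:
    """Deny by default: a role can only invoke tools on its own allowlist."""
    return tool in ROLE_TOOLSETS.get(role, set())

assert authorize_tool_call("analytics", "warehouse.query_readonly")
assert not authorize_tool_call("analytics", "repo.open_pr")
assert not authorize_tool_call("code-writer", "prod_db.write")
```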
But I expect this level of diligence will fall short of something that could really stop a misaligned ASI or even strong AI.
So it seems this will be most like an insider threat, but I think the real remediation of this threat is far from clear.
This is a decades-old problem—internal IT has always had to trust human admins at some point. But the biggest threats to systems are rarely technical; they're social. A properly aligned AI is actually easier to control than a human—an AI won’t plug in a USB stick it found in the parking lot out of curiosity. Most importantly, internal AI alignment can be explicitly enforced. Human alignment cannot. This analysis doesn’t really reflect how real-world information systems are managed.
Most importantly, internal AI alignment can be explicitly enforced.
Only if you have a whitelist (can be implemented as "limited range") of actions.
Otherwise it will be the case, and not once but many times, that you provide a "Do not delete existing security checks from the infrastructure" sentence in the context, and the LLM proceeds to delete them anyway. That's a large part of the problem we face.
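This is the practical difference between a prompt-level instruction and a whitelist: the whitelist lives in an enforcement layer outside the model, so it holds regardless of what the model proposes. A minimal sketch (the action names are made up):

```python
# Hypothetical sketch: the constraint is enforced outside the model, not in the
# prompt. Whatever the model proposes, only whitelisted actions execute.
ALLOWED_ACTIONS = {"read_file", "run_tests", "open_pull_request"}

def execute(proposed_action: dict) -> str:
    name = proposed_action.get("name")
    if name not in ALLOWED_ACTIONS:
        # The model proposing "delete that check" does not make it happen.
        return f"REJECTED: {name!r} is not on the whitelist (logged for review)"
    return f"executed {name}"

# Even if the prompt said "do not delete existing security checks", the model
# might still propose it; the whitelist is what actually prevents it.
print(execute({"name": "delete_security_check", "target": "ci/verify_signatures"}))
print(execute({"name": "run_tests"}))
```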
"Only if you have a whitelist (can be implemented as "limited range") of actions."
I'm not sure this is the only way, but it is one way. You still cannot guarantee this with a human admin, but, more importantly, you are still missing the point. AI is not an employee with agency; it is a tool that a human admin can control, like any other form of automation.
I’ve been thinking a lot recently about the relationship between AI control and traditional computer security. Here’s one point that I think is important.
My understanding is that there's a big qualitative distinction between two ends of a spectrum of security work that organizations do, that I’ll call “security from outsiders” and “security from insiders”.
On the “security from outsiders” end of the spectrum, you have some security invariants you try to maintain entirely by restricting affordances with static, entirely automated systems. My sense is that this is most of how Facebook or AWS relates to its users: they want to ensure that, no matter what actions the users take on their user interfaces, they can't violate fundamental security properties. For example, no matter what text I enter into the "new post" field on Facebook, I shouldn't be able to access the private messages of an arbitrary user. And on AWS, no matter what code I run on my EC2 instance, I shouldn't be able to access someone else's private S3 bucket.
This type of security is very "static": invariants are enforced by code that doesn't involve human review and so don't rely on humans reacting properly to something. This is typically how you mitigate risks from your users: users only need to have certain restricted types of access to your systems, and they're very numerous, so you need to handle them with automated approaches. I sometimes call systems that aim to be safe no matter what inputs a user provides "manifestly safe".
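A minimal sketch of what "manifestly safe" means in code (the data model and names are invented): the access-control invariant is checked on every request, so no input the caller supplies can route around it:

```python
# Hypothetical sketch of a "manifestly safe" endpoint: the invariant is
# enforced in code regardless of what the user sends.
MESSAGES = {
    ("alice", "bob"): ["hi bob"],
    ("carol", "dave"): ["see you at 3"],
}

def get_messages(requesting_user: str, conversation: tuple[str, str]) -> list[str]:
    # The invariant: you can only read conversations you are a party to.
    # Nothing the caller puts in `conversation` changes this check.
    if requesting_user not in conversation:
        raise PermissionError("not a participant in this conversation")
    return MESSAGES.get(conversation, [])

print(get_messages("alice", ("alice", "bob")))     # ok
# get_messages("mallory", ("carol", "dave"))        # raises PermissionError
```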
In contrast, consider how Facebook and AWS relate to their engineers. These companies want to be robust to their engineers abusing their access to violate security properties; they could do this either by abusing administrator privileges (e.g. direct access to machines that have user data) or by adding malign behavior to code that they write. In order to mitigate these risks, companies use very different types of security techniques. They rely on multi-party authorization, where multiple engineers have to sign off on executing sensitive commands, or code review, where someone needs to review code before it's pushed to environments where it has access to sensitive data. These techniques fundamentally rely on human labor.
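A toy sketch of the multi-party authorization pattern (names and thresholds are illustrative): the sensitive command only executes once enough approvers other than the requester have signed off:

```python
# Hypothetical sketch of multi-party authorization for sensitive commands.
def run_sensitive_command(command: str, requester: str, approvals: set[str],
                          required: int = 2) -> str:
    """Require `required` distinct approvers, not counting the requester."""
    independent = approvals - {requester}
    if len(independent) < required:
        return (f"BLOCKED: {command!r} has {len(independent)} independent "
                f"approvals, needs {required}")
    return f"executing {command!r} (approved by {sorted(independent)})"

print(run_sensitive_command("drop-replica us-east-1", "alice", {"alice", "bob"}))   # blocked
print(run_sensitive_command("drop-replica us-east-1", "alice", {"bob", "carol"}))   # runs
```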
Insider threats are harder to handle in that your engineers need to have much more fine-grained access, but easier in that your engineers are generally less numerous than your users. Moreover, insiders don't have as much ability to iterate against your security measures (because if you catch them trying to break your security measures, you can often fire them or have them arrested).
Of course, there's a spectrum here. A big part of how you mitigate threats from external hackers is having security staff investigate weird anomalies, in the hope of catching hackers who have broken some of your intended security boundaries. And a big part of how you mitigate insider threats is building safe administrative APIs, so that instead of your employees having to have the ability to run arbitrary code on production machines, you allow them to usually do their job by taking pre-specified, known-safe actions. And there’s much more along these lines.
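And a toy sketch of a safe administrative API (the service and function names are invented): instead of arbitrary shell access, the operator gets a few parameterized, known-safe verbs, and anything else has to go through a slower, human-reviewed path:

```python
# Hypothetical sketch of a safe administrative API: a small set of
# pre-specified, known-safe operations instead of shell access to prod hosts.
KNOWN_SERVICES = {"web-frontend", "billing-worker"}

def restart_service(service: str) -> str:
    if service not in KNOWN_SERVICES:
        raise ValueError(f"unknown service {service!r}")
    return f"restart queued for {service}"   # the only thing this verb can do

def tail_logs(service: str, lines: int = 100) -> str:
    if service not in KNOWN_SERVICES or not (1 <= lines <= 1000):
        raise ValueError("invalid request")
    return f"last {lines} log lines for {service}"

# There is no run_arbitrary_command(); anything outside these verbs goes
# through the slower, human-reviewed path.
print(restart_service("billing-worker"))
```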
I feel like there’s a really important distinction between “security from insiders” and “security from outsiders”, and I don’t really get why security people don’t seem to find this distinction as important as I do.
Now that I have this concept, I can make a few points about the risks posed by AIs deployed inside AI companies. (Risks from internal deployments have been discussed by me here (as well as many other places), by a Redwood/UK AISI collaboration here, and by Apollo here.)
First, when I talk to security staff at AI companies about computer security, they often seem to fail to anticipate what insider threat from AIs will be like. In particular, when discussing how they'll deploy AI engineers and researchers internally, they often seem to imagine relating to those AIs mostly as outsiders rather than as insiders. They imagine writing small and carefully locked-down infrastructure and allowing the AIs to interact with it. But my guess is that in order to do their jobs, AIs will need to have the kind of wide-ranging access that insiders have, which leads to having an attack surface that is large and can’t be protected with static methods. (I’m not sure I’m right about this, and it seems very important to get clarity on; I plan to write more about it soon.) I think the security teams at AI companies are not putting enough work into preparing for this.
And second, insider threat from AI combines the hard parts of both problems! Like insider threat, AIs will need extremely broad, fine-grained access. And like outsider threat, you're outnumbered by your potential adversaries. (Of course, you have a bunch of advantages too, and that's why we research AI control rather than just writing off the problem as hopeless.)