If you or anyone else is interested in getting funding for this kind of threat modeling, Georgetown's Center for Security and Emerging Technology put out an RFP on risks from internally deployed frontier models, due June 30.
https://cset.georgetown.edu/wp-content/uploads/FRG-Call-for-Research-Ideas-Internal-Deployment.pdf
They imagine writing small and carefully locked-down infrastructure and allowing the AIs to interact with it.
That's surprising and concerning. As you say, if these companies expect their AIs to do end-to-end engineering and R&D tasks internally, it seems difficult to imagine how they could do that without giving the AIs employee-level privileges. Any place where they don't is a place where humans turn into a bottleneck. I can imagine a few possible objections to this.
Like, to be clear, I would definitely prefer a world where these organizations wrote "small and carefully locked-down infrastructure" as the limited surface their AIs were allowed to interact with; I just don't expect that to actually happen in practice.
This still suffers from the incentive gradient pushing quite hard to just build end-to-end agents. Not only will it probably work better, but it'll be straight up cheaper and easier!
The same is true of human software developers - your dev team sure can ship more features at a faster cadence if you give them root on your prod servers and full read and write access to your database. However, despite this incentive gradient, most software shops don't look like this. Maybe the same forces that push current organizations to separate out the person writing the code from the person reviewing it could be repurposed for software agents.
One bottleneck, of course, is that one reason it works with humans is that we have skin in the game - sufficiently bad behavior could get us fired or even sued. Current AI agents don't have anything to gain from behaving well or lose from behaving badly (or sufficient coherence to talk about "an" AI agent doing a thing).
one reason it works with humans is that we have skin in the game
Another reason is that different humans have different interests: your accountant and your electrician would struggle to work out a deal to enrich themselves at your expense, but it would get much easier if they shared the same brain and were just pretending to be separate people.
one reason it works with humans is that we have skin in the game - sufficiently bad behavior could get us fired or even sued
I don't think this framing is at all useful. "Have skin in the game" is just a human "capability elicitation" method.
You put a human under a performance-based incentive with the hopes of obtaining better performance. If successful, the performance uplift doesn't come "from nothing". The human was always capable of performing on that level. You just made the human use more of his capabilities.
The performance limitations of current AIs are more "AI doesn't have the capabilities" than "AI is not motivated to perform and doesn't use the capabilities it has" (80% confidence).
If a human misbehaves badly enough on a task they will be removed from the pool of agents that will perform tasks like that in the future. Humans are playing an iterated game. Current LLM agents generally are not (notable exception: agent village).
You could of course frame the lack of persistent identity / personal resources / reputation as a capabilities problem on the AI side rather than a problem with companies expecting nonhuman minds to expose a fully human-like interface, it mostly depends on which side seems more tractable. I personally see a lot of promise in figuring out how to adapt workflows to take advantage of cheap but limited cognition - feels easier than trying to crack the reliability problem and the procedural memory problem, and there are definitely safety disadvantages in setting up your AI systems to expose a situationally aware, persistent human-like interface.
I fail to see how the same wouldn't apply to the way LLMs are used now.
If an LLM is not up to the task, it will be augmented (prompting, scaffolding, RAG, fine tuning), replaced with a more capable LLM, or removed from the task outright.
The issue isn't that you can't "fire" an LLM for performing poorly - you absolutely can. It's that even the SOTA performance on many tasks may fall short of acceptable.
I'm not sure we disagree on anything substantive here.
If you have a team of 100 software developers each tasked with end-to-end delivery of assigned features, and one of them repeatedly pushes unreviewed and broken/insecure code to production, you can fire that particular developer, losing out on about 1% of your developers. If the expected harm of keeping that developer on is greater than the expected cost of replacing them, you probably will replace them.
If you have a "team" of "100" AI agents "each" tasked with end-to-end delivery of assigned features, as they are currently implemented (same underlying model, shared-everything), and one instance does something bad, any mitigations you implement have to affect all 100 of them.
That seems like it produces more pressure against the "shared-everything, identical roles for all agents in the organization" model for groups of AI developers than for groups of human developers. Organizational pressures for groups of human developers already push them into specialized roles, and I expect those pressures to be even stronger for groups of AI developers. As such,
They plan on trying to thread the needle by employing some control schemes where (for example) different "agents" have different permissions. i.e. a "code writing" agent has read permissions for (some parts of) the codebase, the ability to write, deploy, and test changes to that code in a sandboxed dev environment, and the ability to open a pull request with those changes.
doesn't particularly feel like an implausible "thread the needle" strategy; it seems like the sort of thing we get by default, because the incentives are already pushing so incredibly hard in that direction.
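To make that kind of scheme concrete, here's a rough sketch (role names, actions, and resource patterns are all made up for illustration, not anyone's actual setup) of what a deny-by-default policy for a "code writing" agent could look like:

```python
# Hypothetical sketch of a role-scoped permission policy for a "code writing"
# agent. Role names, actions, and resource patterns are illustrative only.
from fnmatch import fnmatch

CODE_WRITER_POLICY = {
    "allow": {
        "repo:read":             ["services/payments/*"],  # read only the assigned subtree
        "sandbox:deploy":        ["dev-*"],                 # deploy/test in sandboxes only
        "vcs:open_pull_request": ["services/payments/*"],
    },
    "deny": {
        "repo:write":   ["*"],   # no direct pushes anywhere
        "prod:deploy":  ["*"],   # no production deploys
        "secrets:read": ["*"],   # no credential access
    },
}

def is_allowed(policy: dict, action: str, resource: str) -> bool:
    """Deny-by-default: allowed only if some 'allow' rule matches and no 'deny' rule does."""
    def matches(rules: dict) -> bool:
        return any(fnmatch(resource, pattern) for pattern in rules.get(action, []))
    return matches(policy["allow"]) and not matches(policy["deny"])

assert is_allowed(CODE_WRITER_POLICY, "sandbox:deploy", "dev-environment-3")
assert not is_allowed(CODE_WRITER_POLICY, "prod:deploy", "payments-cluster")
assert not is_allowed(CODE_WRITER_POLICY, "repo:read", "services/identity/main.py")
```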
First, when I talk to security staff at AI companies about computer security, they often seem to fail to anticipate what insider threat from AIs will be like.
Why do you think this? Is it that they are not thinking about large numbers of automated agents running around doing a bunch of research?
Or is it that they are thinking about these kinds of scenarios, and yet they still don't apply the insider threat frame for some reason?
The other consideration is: do you also isolate the AI workers from the human insiders? Because you would still want to handle scenarios where the AI has access to the humans who have access to sensitive systems.
I agree insider vs outsider threat is an important distinction, and one that I have seen security people take seriously in other contexts. My background is in enterprise IT and systems admin. I think there's some practical nuance missing here.
Insofar as security people are expecting to treat the AI as an outsider, they're likely expecting to have a hard boundary between "systems that run the AI" and "systems and tools the AI gets to use", where any given user has access to only one or the other.
This is already fairly common practice, in the sense that security people already think about how to avoid giving too much privileged access to any specific insider. E.g., ideally, in a large enough company, the admins who have access to most of the systems might not be given access to the systems responsible for logging what they did, nor to the system granting access to other systems - centralized access control/identity (e.g. AWS IAM, Okta, etc.). In that example, the team that manages those logging and access-control systems would not have access to anything outside those systems, and would just be responsible for implementation according to whatever the security team (and indirectly, HR, say) decides. The security people might have just read-only access to systems (e.g. through a log search/archival tool and any telemetry from the security software) and rely on the separate IT team for all changes.
More extreme levels of security/paranoia are possible and do exist, especially if you look at the largest companies or national security applications. In the latter case, the US govt maintains an entire separate internet with distinct fiber optic cabling, routers, etc.; the computers have no connection outside this separate network, sit exclusively in shielded rooms in physically access-controlled buildings, no unauthorized electronics go into those rooms, nor does anyone without a clearance, etc. I could go on, but hopefully I've gotten the idea across; AI is a distinct threat that would need a different flavor of extreme precautions.
For AI there's a pretty natural boundary to draw between the AI-specific hardware accelerators on which the AI training and inference run (and the associated control systems, data warehouse, etc.), vs the tooling the AI is permitted to use while it's running. The AI's inference running on a particular system doesn't automatically give it privileged access to that system (though exploits might). Tooling that the AI calls into or uses need not exist on the same hardware that's running inference. If such a boundary gets defined by the security team, it's then possible to treat any leakage of AI state affecting the inference hardware as fishy and deserving of investigation, just as if it came from an outside user (because, with that division, it did).
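As a rough illustration of that boundary (zone names and address ranges here are invented), the enforcement rule can be as simple as: any connection originating in the tooling plane that targets the inference plane gets flagged for investigation, the same way an external connection would be:

```python
# Hypothetical sketch of the boundary described above: the "inference plane"
# (accelerators, training/inference control systems) and the "tooling plane"
# (systems the AI is permitted to use) are disjoint network zones. Any traffic
# from the tooling plane into the inference plane is treated as suspicious.
import ipaddress

ZONES = {
    "inference-plane": ipaddress.ip_network("10.10.0.0/16"),  # made-up ranges
    "tooling-plane": ipaddress.ip_network("10.20.0.0/16"),
}

def zone_of(addr: str) -> str | None:
    ip = ipaddress.ip_address(addr)
    for name, net in ZONES.items():
        if ip in net:
            return name
    return None

def review_flag(src: str, dst: str) -> bool:
    """Flag any tooling-plane connection that reaches toward the inference plane:
    the AI's own substrate should not be reachable from the systems it drives."""
    return zone_of(src) == "tooling-plane" and zone_of(dst) == "inference-plane"

print(review_flag("10.20.3.7", "10.10.1.1"))   # True: investigate
print(review_flag("10.20.3.7", "10.20.9.9"))   # False: tooling-to-tooling traffic
```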
Happy to answer related questions about what approaches exist here and might be considered ordinary practice. IT and information security mostly happen outside of the academic literature, and companies don't like to talk about the specific security precautions they're taking. You'll probably need to talk to many people with industry experience to get a well-rounded take, as nobody gets to see all that many implementations in full detail in one career.
While you could give your internal AI wide indiscriminate access, it seems neither necessary nor wise to do so. It seems likely you could get at least 80% of the potential benefit via no more than 20% of the access breadth. I would want my AI to tell me when it thinks it could help me more with greater access so that I can decide whether the requested additional access is reasonable.
Curated. I found this a helpful frame on AI security and I'm kinda surprised I hadn't heard it before.
You haven’t heard it before because you probably don’t work in IT.
AI is a tool - just the next evolution in automation, like scripting or macros. It's not an "employee" with benefits or intent. Framing it as an insider vs outsider threat fundamentally misunderstands how we’ve always approached internal risk in real-world systems. IT has always been about doing more with fewer people, using the tools available. This framing doesn’t reflect practice - it reflects a lack of exposure to it.
I like the way you're thinking about AI and security. The insider vs outsider risk dynamic is very ingrained in cybersecurity and is often discussed while threat modeling a system. The distinction is muddled because the most impactful external threat is one that steals an insider's credentials. History shows it's only a matter of time before insider permissions are used by an outsider, which gave birth to the phrase "assume breach". If the AI is a black box, it's safest to assume it has been compromised. I'd like to add another dynamic, malicious vs accidental, to your AI insider risk model that makes clear why AIs should not be granted insider privileges like human employees.
An insider has the ability to cause damage maliciously or accidentally. Accidental damage is by far the most common event, but we have decades (centuries, if you count other engineering disciplines) of experience putting processes in place to prevent and learn from accidents. Generally the control mechanism slows the person down by asking for social proof (another person's sign-off), and that leads to a key mitigating control on their access: it's hard to cause major damage when action is slow.
Adding social proof adds transparency, covers knowledge gaps and increases social pressure to get it right. People align with the goals of their group and intrinsically do not want to damage their group. When they harm the group accidentally it has a social cost, which acts as a preventative control.
AIs act quickly and do not pay the same social cost. So those preventative controls do not count towards balancing the risk of overly broad insider permissions.
Until we have Explainable AI, it will be risky to fully trust that an AI is free of hidden malicious intent or accident-prone edge cases. The breadth of their input space makes firewalls inefficient, and the speed at which they can act makes detective controls ineffective.
AIs' speed, lack of human-like social consequences, and opaque decision making create a large source of risk that puts AI outside of what qualifies employees for insider privileges. Similarly, their speed, cloneability, and adaptability mean they don't need to work like humans and don't need insider privileges.
An AI can look up and read documentation to acquire context faster than any person. It can be trained to do a new task (at least one it's capable of) almost instantly. It can pass context to another AI much faster than a person can. It can request permissions on a per-task basis for specific rows/columns/fields in a database with minimal slowdown. An AI doesn't need to see the bigger picture, have a connection to the group, or have novelty to stay motivated the way a person does. A person would quit if their job was to do the same step in a workflow 1000 times in a row and ask for permission with a unique identifier each time, so we built workflows to maintain employee happiness.
Separation of concerns, minimizing blast radius, observability, and composability are vital to building resilient software services. AIs do not change those principles. Using one AI for an entire workflow, or the same AI across different workflows, is not necessary the way it is for a person. So giving them permissions the way we do for people is not necessary. Running instances of multiple different 600B+ parameter models may be prohibitively expensive today, but eventually it will be cost-effective to have single-use model instances, in a style similar to AWS Lambdas.
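A rough sketch of what such Lambda-style single-use instances could look like, with a per-task credential scoped to exactly the rows/fields the task needs and expiring when the task finishes (all names here are hypothetical, and `run_model` is a stand-in for whatever inference call you actually use):

```python
# Hypothetical sketch of "single-use" agent invocations: each task gets a fresh
# instance with a narrowly scoped, short-lived credential that is revoked when
# the task returns. Names are illustrative, not a real access-control API.
import uuid
from contextlib import contextmanager
from datetime import datetime, timedelta, timezone

@contextmanager
def scoped_credential(table: str, columns: list[str], row_filter: str, ttl_s: int = 300):
    """Mint a short-lived, task-specific grant; revoke it on exit."""
    token = {
        "id": str(uuid.uuid4()),
        "grants": {"table": table, "columns": columns, "filter": row_filter},
        "expires": datetime.now(timezone.utc) + timedelta(seconds=ttl_s),
    }
    try:
        yield token
    finally:
        token["revoked"] = True  # in a real system: call the access-control plane

def run_model(prompt: str, credential: dict) -> str:
    # Stand-in for an actual inference call made by a fresh model instance.
    return f"(model output for {prompt!r} using grant {credential['id'][:8]})"

# One task, one instance, one narrowly scoped grant:
with scoped_credential("invoices", ["amount", "due_date"], "customer_id = 4217") as cred:
    print(run_model("summarize overdue invoices for customer 4217", cred))
```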
A natural extension of the way AI interacts today, via MCP, makes it a kind of insider: one with a specific role, and specific access patterns that match that role.
Even an org that is not concerned with misaligned AI or the like will still want to lock down exactly what updates each role can make within the org, just as orgs typically lock down access to different roles within a company today.
Most employees cannot access accounts receivable, and access to the production databases in a tech company is very carefully guarded. Mostly not from fear of malevolence; it's just a fear that a junior dev could easily bollix things horribly with one errant command. Much in the same way, the org will want to specialize the AI into different roles, provide different access according to these roles, and test these different AIs in each role.
All of this seems to follow quite naturally from existing corporate practice.
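A minimal sketch of that role-based lockdown for tool-using agents (role and tool names are invented; think of the tools as the operations an MCP server would expose):

```python
# Hypothetical sketch: each agent role gets an allowlist of tools it may call,
# mirroring how human access is locked down by role today. Names are made up.
ROLE_TOOLSETS = {
    "support-triage": {"ticket.read", "ticket.comment"},
    "analytics":      {"warehouse.query_readonly"},
    "code-writer":    {"repo.read", "sandbox.run_tests", "repo.open_pr"},
    # Note: no role gets "prod_db.write"; that stays behind a human-gated path.
}

def authorize_tool_call(role: str, tool: str) -> bool:
    """Deny by default: a role can only invoke tools on its own allowlist."""
    return tool in ROLE_TOOLSETS.get(role, set())

assert authorize_tool_call("analytics", "warehouse.query_readonly")
assert not authorize_tool_call("analytics", "repo.open_pr")
assert not authorize_tool_call("code-writer", "prod_db.write")
```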
But I expect this level of diligence will fall short of something that could really stop a misaligned ASI or even strong AI.
So it seems this will be most like an insider threat, but I think the real remediation of this threat is far from clear.
This is a decades-old problem—internal IT has always had to trust human admins at some point. But the biggest threats to systems are rarely technical; they're social. A properly aligned AI is actually easier to control than a human—an AI won’t plug in a USB stick it found in the parking lot out of curiosity. Most importantly, internal AI alignment can be explicitly enforced. Human alignment cannot. This analysis doesn’t really reflect how real-world information systems are managed.
Most importantly, internal AI alignment can be explicitly enforced.
Only if you have a whitelist (can be implemented as "limited range") of actions.
Otherwise it will be the case, and not once but many times, that you provide a "Do not delete existing security checks from the infrastructure" sentence in the context, and the LLM proceeds to delete them anyway. That's a large part of the problem we face.
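This is the practical difference between a prompt-level instruction and a whitelist: the whitelist lives in an enforcement layer outside the model, so it holds regardless of what the model proposes. A minimal sketch (the action names are made up):

```python
# Hypothetical sketch: the constraint is enforced outside the model, not in the
# prompt. Whatever the model proposes, only whitelisted actions execute.
ALLOWED_ACTIONS = {"read_file", "run_tests", "open_pull_request"}

def execute(proposed_action: dict) -> str:
    name = proposed_action.get("name")
    if name not in ALLOWED_ACTIONS:
        # The model proposing "delete that check" does not make it happen.
        return f"REJECTED: {name!r} is not on the whitelist (logged for review)"
    return f"executed {name}"

# Even if the prompt said "do not delete existing security checks", the model
# might still propose it; the whitelist is what actually prevents it.
print(execute({"name": "delete_security_check", "target": "ci/verify_signatures"}))
print(execute({"name": "run_tests"}))
```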
"Only if you have a whitelist (can be implemented as "limited range") of actions."
I'm not sure this is the only way, but it is one way. You still cannot guarantee this with a human admin, but, more importantly, you are still missing the point. AI is not an employee with agency; it is a tool that a human admin can control, like any other form of automation.
I’ve been thinking a lot recently about the relationship between AI control and traditional computer security. Here’s one point that I think is important.
My understanding is that there's a big qualitative distinction between two ends of a spectrum of security work that organizations do, that I’ll call “security from outsiders” and “security from insiders”.
On the “security from outsiders” end of the spectrum, you have some security invariants you try to maintain entirely by restricting affordances with static, entirely automated systems. My sense is that this is most of how Facebook or AWS relates to its users: they want to ensure that, no matter what actions the users take on their user interfaces, they can't violate fundamental security properties. For example, no matter what text I enter into the "new post" field on Facebook, I shouldn't be able to access the private messages of an arbitrary user. And on AWS, no matter what code I run on my EC2 instance, I shouldn't be able to access someone else's private S3 bucket.
This type of security is very "static": invariants are enforced by code that doesn't involve human review and so don't rely on humans reacting properly to something. This is typically how you mitigate risks from your users: users only need to have certain restricted types of access to your systems, and they're very numerous, so you need to handle them with automated approaches. I sometimes call systems that aim to be safe no matter what inputs a user provides "manifestly safe".
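A minimal sketch of what "manifestly safe" means in code (the data model and names are invented): the access-control invariant is checked on every request, so no input the caller supplies can route around it:

```python
# Hypothetical sketch of a "manifestly safe" endpoint: the invariant is
# enforced in code regardless of what the user sends.
MESSAGES = {
    ("alice", "bob"): ["hi bob"],
    ("carol", "dave"): ["see you at 3"],
}

def get_messages(requesting_user: str, conversation: tuple[str, str]) -> list[str]:
    # The invariant: you can only read conversations you are a party to.
    # Nothing the caller puts in `conversation` changes this check.
    if requesting_user not in conversation:
        raise PermissionError("not a participant in this conversation")
    return MESSAGES.get(conversation, [])

print(get_messages("alice", ("alice", "bob")))     # ok
# get_messages("mallory", ("carol", "dave"))        # raises PermissionError
```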
In contrast, consider how Facebook and AWS relate to their engineers. These companies want to be robust to their engineers abusing their access to violate security properties; they could do this either by abusing administrator privileges (e.g. direct access to machines that have user data) or by adding malign behavior to code that they write. In order to mitigate these risks, companies use very different types of security techniques. They rely on multi-party authorization, where multiple engineers have to sign off on executing sensitive commands, or code review, where someone needs to review code before it's pushed to environments where it has access to sensitive data. These techniques fundamentally rely on human labor.
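A toy sketch of the multi-party authorization pattern (names and thresholds are illustrative): the sensitive command only executes once enough approvers other than the requester have signed off:

```python
# Hypothetical sketch of multi-party authorization for sensitive commands.
def run_sensitive_command(command: str, requester: str, approvals: set[str],
                          required: int = 2) -> str:
    """Require `required` distinct approvers, not counting the requester."""
    independent = approvals - {requester}
    if len(independent) < required:
        return (f"BLOCKED: {command!r} has {len(independent)} independent "
                f"approvals, needs {required}")
    return f"executing {command!r} (approved by {sorted(independent)})"

print(run_sensitive_command("drop-replica us-east-1", "alice", {"alice", "bob"}))   # blocked
print(run_sensitive_command("drop-replica us-east-1", "alice", {"bob", "carol"}))   # runs
```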
Insider threats are harder to handle in that your engineers need to have much more fine-grained access, but easier in that your engineers are generally less numerous than your users. Moreover, insiders don't have as much ability to iterate against your security measures (because if you catch them trying to break your security measures, you can often fire them or have them arrested).
Of course, there's a spectrum here. A big part of how you mitigate threats from external hackers is having security staff investigate weird anomalies, in the hope of catching hackers who have broken some of your intended security boundaries. And a big part of how you mitigate insider threats is building safe administrative APIs, so that instead of your employees having to have the ability to run arbitrary code on production machines, you allow them to usually do their job by taking pre-specified, known-safe actions. And there’s much more along these lines.
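And a toy sketch of a safe administrative API (the service and function names are invented): instead of arbitrary shell access, the operator gets a few parameterized, known-safe verbs, and anything else has to go through a slower, human-reviewed path:

```python
# Hypothetical sketch of a safe administrative API: a small set of
# pre-specified, known-safe operations instead of shell access to prod hosts.
KNOWN_SERVICES = {"web-frontend", "billing-worker"}

def restart_service(service: str) -> str:
    if service not in KNOWN_SERVICES:
        raise ValueError(f"unknown service {service!r}")
    return f"restart queued for {service}"   # the only thing this verb can do

def tail_logs(service: str, lines: int = 100) -> str:
    if service not in KNOWN_SERVICES or not (1 <= lines <= 1000):
        raise ValueError("invalid request")
    return f"last {lines} log lines for {service}"

# There is no run_arbitrary_command(); anything outside these verbs goes
# through the slower, human-reviewed path.
print(restart_service("billing-worker"))
```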
I feel like there’s a really important distinction between “security from insiders” and “security from outsiders”, and I don’t really get why security people don’t seem to find this distinction as important as I do.
Now that I have this concept, I can make a few points about the risks posed by AIs deployed inside AI companies. (Risks from internal deployments have been discussed by me here (as well as many other places), by a Redwood/UK AISI collaboration here, and by Apollo here.)
First, when I talk to security staff at AI companies about computer security, they often seem to fail to anticipate what insider threat from AIs will be like. In particular, when discussing how they'll deploy AI engineers and researchers internally, they often seem to imagine relating to those AIs mostly as outsiders rather than as insiders. They imagine writing small and carefully locked-down infrastructure and allowing the AIs to interact with it. But my guess is that in order to do their jobs, AIs will need to have the kind of wide-ranging access that insiders have, which leads to having an attack surface that is large and can’t be protected with static methods. (I’m not sure I’m right about this, and it seems very important to get clarity on; I plan to write more about it soon.) I think the security teams at AI companies are not putting enough work into preparing for this.
And second, insider threat from AI combines the hard parts of both problems! Like insider threat, AIs will need extremely broad, fine-grained access. And like outsider threat, you're outnumbered by your potential adversaries. (Of course, you have a bunch of advantages too, and that's why we research AI control rather than just writing off the problem as hopeless.)