I recently read the first chapter of Building Secure & Reliable Systems (the book from the EA infosec book club). The chapter is titled “The intersection of security and reliability.”

I found it to be a helpful introduction to some security concepts. I’m including some of my notes below. I also include some initial musings about how I’m relating these concepts to AI risk.

The difference between reliability and security is the presence of an adversary

In designing for reliability and security, you must consider different risks. The primary reliability risks are nonmalicious in nature—for example, a bad software update or a physical device failure. Security risks, however, come from adversaries who are actively trying to exploit system vulnerabilities. When designing for reliability, you assume that some things will go wrong at some point. When designing for security, you must assume that an adversary could be trying to make things go wrong at any point.

As a result, different systems are designed to respond to failures in quite different ways. In the absence of an adversary, systems often fail safe (or open): for example, an electronic lock is designed to remain open in case of power failure, to allow safe exit through the door. Fail safe/open behavior can lead to obvious security vulnerabilities. To defend against an adversary who might exploit a power failure, you could design the door to fail secure and remain closed when not powered.
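The two failure modes above can be made concrete in a small sketch. This is a hypothetical illustration (not from the book): the same power-failure event handled by a fail-open exit door and a fail-secure vault door, with the failure mode chosen as an explicit design parameter.

```python
# Hypothetical sketch of fail open vs. fail secure; all names are invented.
from dataclasses import dataclass

@dataclass
class ElectronicLock:
    fail_secure: bool     # design choice: what happens when power is lost
    powered: bool = True
    locked: bool = True

    def on_power_failure(self) -> None:
        self.powered = False
        # With no power, the door falls back to its designed failure mode.
        self.locked = self.fail_secure

exit_door = ElectronicLock(fail_secure=False)   # safety: people must be able to get out
vault_door = ElectronicLock(fail_secure=True)   # security: keep adversaries out

exit_door.on_power_failure()
vault_door.on_power_failure()
print(exit_door.locked)   # False: fails open
print(vault_door.locked)  # True: fails secure
```

The point of the sketch is that neither behavior is "correct" in general; the right choice depends on whether the dominant risk is an accident or an adversary.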

Implication for AI risk: AI labs will have both reliability needs and security needs. Many of the threats I'm most concerned about are best modeled as security risks. A situationally aware AI is an adversary. Methods that merely prevent nonmalicious or nonadversarial failures are unlikely to be sufficient when we're dealing with AIs that might actively try to exploit vulnerabilities.


Simplicity improves both reliability and security

Keeping system design as simple as possible is one of the best ways to improve your ability to assess both the reliability and the security of a system. A simpler design reduces the attack surface, decreases the potential for unanticipated system interactions, and makes it easier for humans to comprehend and reason about the system. Understandability is especially valuable during emergencies, when it can help responders mitigate symptoms quickly and reduce mean time to repair (MTTR).

Implication for AI risk: It’s extremely worrying that our current AI systems are highly complex and that we can’t understand them.

Malicious insiders

While the examples cited so far hinge on external attackers, you must also consider potential threats from malicious insiders. Although an insider may know more about potential abuse vectors than an external attacker who steals an employee’s credentials for the first time, the two cases often don’t differ much in practice. The principle of least privilege can mitigate insider threats. It dictates that a user should have the minimal set of privileges required to perform their job at a given time. For example, mechanisms like Unix’s sudo support fine-grained policies that specify which users can run which commands as which role.
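The principle of least privilege can be sketched as a simple permission check: each role is granted only the actions it needs, and anything not explicitly granted is denied. The roles and actions below are invented for illustration; real systems (like the sudo policies mentioned above) express this far more richly.

```python
# Minimal least-privilege sketch; role and action names are hypothetical.
ROLE_PERMISSIONS = {
    "oncall-engineer": {"read-logs", "restart-service"},
    "release-manager": {"read-logs", "deploy"},
}

def authorize(role: str, action: str) -> bool:
    """Allow an action only if the role explicitly grants it (default deny)."""
    return action in ROLE_PERMISSIONS.get(role, set())

print(authorize("oncall-engineer", "restart-service"))  # True
print(authorize("oncall-engineer", "deploy"))           # False: not needed for the job
```

The key design choice is the default-deny posture: an unknown role or unlisted action is rejected rather than permitted.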

At Google, we also use multi-party authorization to ensure that sensitive operations are reviewed and approved by specific sets of employees. This multi-party mechanism both protects against malicious insiders and reduces the risk of innocent human error, a common cause of reliability failures. Least privilege and multi-party authorization are not new ideas—they have been employed in many noncomputing scenarios, from nuclear missile silos to bank vaults.
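The core of multi-party authorization is easy to state in code: a sensitive operation proceeds only with approvals from a minimum number of distinct reviewers, none of whom is the requester. This is a hedged sketch under those assumptions, not Google's actual mechanism, and all names are hypothetical.

```python
# Sketch of multi-party authorization: require `required` independent approvers.
def approved(requester: str, approvers: set[str], required: int = 2) -> bool:
    # The requester cannot count as an approver of their own request.
    independent = approvers - {requester}
    return len(independent) >= required

print(approved("alice", {"bob", "carol"}))   # True: two independent approvers
print(approved("alice", {"alice", "bob"}))   # False: only one independent approver
```

Excluding the requester is what makes this a control against insiders: no single person, however privileged, can authorize their own sensitive action.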

Implication for AI risk: Conjecture’s internal infohazard policy seems like a reasonable attempt to reduce risks from malicious (or negligent) insiders. The principle of least privilege (or a “need-to-know” policy) sounds valuable. I would be excited for labs like OpenAI, DeepMind, and Anthropic to adopt similar policies. 

Additionally, it seems likely that threats from malicious and negligent insiders will grow over time, and some of these threats will be existential. Multi-party authorization for major decisions seems like a promising and intuitive intervention. 

Emergency plans

During an emergency, teams must work together quickly and smoothly because problems can have immediate consequences. In the worst case, an incident can destroy a business in minutes. For example, in 2014 an attacker put the code-hosting service Code Spaces out of business in a matter of hours by taking over the service’s administrative tools and deleting all of its data, including all backups. Well-rehearsed collaboration and good incident management are critical for timely responses in these situations.

During a crisis, it is essential to have a clear chain of command and a solid set of checklists, playbooks, and protocols. As discussed in Chapters 16 and 17, Google has codified crisis response into a program called Incident Management at Google (IMAG), which establishes a standard, consistent way to handle all types of incidents, from system outages to natural disasters, and organize an effective response. IMAG was modeled on the US government’s Incident Command System (ICS), a standardized approach to the command, control, and coordination of emergency response among responders from multiple government agencies.

Implication for AI risk: AI could strike extremely quickly. A superintelligent AI would cut through our defenses like a knife through butter. But I expect the first systems capable of overpowering humanity to be considerably weaker.

There might be an AI governance/security research project here: Examine case studies of emergency plans that have been employed in other industries, identify best practices, adapt those for AI takeover threat models, and propose recommendations (or an entire plan) to OpenAI. 

My best guess is that labs already have emergency plans, but it wouldn’t surprise me if they could be improved by an intelligent person/team performing a project like the one described above. There are lots of things to do, and many dropped balls. (The expected impact of this project would be higher if the person/team performing it had a way to reach members of lab governance/security teams. This may be less difficult than you think.)

Recovering from security failures

Recovering from a security failure often requires patching systems to fix a vulnerability. Intuitively, you want that process to happen as quickly as possible, using mechanisms that are exercised regularly and are therefore decently reliable. However, the capability to push changes quickly is a double-edged sword: while this capability can help close vulnerabilities quickly, it can also introduce bugs or performance issues that cause a lot of damage. The pressure to push patches quickly is greater if the vulnerability is widely known or severe. The choice of whether to push fixes slowly—and therefore to have more assurance that there are no inadvertent side effects, but risk that the vulnerability will be exploited—or to do so quickly ultimately comes down to a risk assessment and a business decision. For example, it may be acceptable to lose some performance or increase resource usage to fix a severe vulnerability.

Implication for AI risk: This makes me think about evals. Once an eval is failed, it would be great if labs responded slowly and cautiously. There might be pressure, however, to pass the eval as quickly as possible. This will “ultimately come down to a risk assessment and a business decision.”

I empathize a bit with lab folks who say things like “current systems aren’t going to destroy humanity, so we’re moving quickly now, but we really do plan to be more cautious and slow down later.” At the same time, I think it’s reasonable for AI safety folks to be like “your current plans involve making lots of complicated decisions once things get more dangerous, and your current behavior/culture seems inconsistent with a cautious approach, and it’s pretty hard to suddenly shift from a culture of acceleration/progress to a culture of caution/security. Maybe you should instill those norms right now. Also, you’re allowed to decide not to, but then you shouldn’t be surprised if we don’t trust you as much. Trust is earned by demonstrating a track record of making responsible risk assessments and showing that you balance those risk assessments against business interests in reasonable ways.”

I'm grateful to Jeffrey Ladish for feedback on this post.


Comments

Thanks for posting this, I think these are valuable lessons and I agree it would be valuable for someone to do a project looking into successful emergency response practices. One thing this framing does also highlight is that, as Quintin Pope discussed in his recent post on alignment optimism, the “security mindset” is not appropriate for the default alignment problem. We are only being optimized against once we have failed to align the AI; until then, we are mostly held to the lower bar of reliability, not security. There is also the problem of malicious human actors where security problems could pop up before the AI becomes misaligned, but this failure mode seems less likely and less x-risk inducing than misalignment, and involves a pretty different set of measures (info sharing policies, encrypting the weights as opposed to training techniques or evals).
