A theory of AGI safety based on constraints and affordances.

I've got this proto-idea of what's missing in much public discussion and action on AI safety. I'm hoping that by sharing it here, the hive-mind might come together and turn it into something useful.

Effective control of AI risk requires a broader approach than those taken so far. Efforts to date have largely gravitated into two camps: value alignment and governance. Value alignment aims to design AI systems that reliably act in the best interests of humans. Governance aims to constrain the people who develop, deploy or use AI so that the AI doesn't cause unacceptable harm.

Each of these camps is necessary, but even together they are insufficient to adequately control AI risk.

Firstly, AI capabilities emulate human cognitive capabilities, so their scope for application transcends that of all previous technologies. Most of the thinking and action to date on controlling AI risk has been based on how we've controlled the risks of previous technologies such as electricity, mechanized transport, and nuclear weapons. So far, we've mostly thought of AI as a technology to be used by humans, not as itself a user of technology.

Secondly, the acceleration of AI development is unlikely to slow; the converse looks more likely: increasingly capable and powerful AI will further accelerate the development of AI capability (including via self-improvement). Traditional governance mechanisms can't keep pace with this, and the value alignment of any given system could be transcended by the next emergent system. Just as AI is likely to impact, interact with, and become embedded in the whole of society, whole-of-society risk control practices must evolve.

AI systems are already becoming embedded in large swathes of society.

It is likely that AGI will soon be here.

Control of risk from AGI needs to be as ubiquitous as control of risk from people.

Definitions:

Risk
The potential for unintended or undesirable loss or harm

AGI
Artificial General Intelligence: AI that can perform any intellectual task that a human can

Sociotechnical system/s (STS)
A system in which agents (traditionally, people) interact with objects (including technologies) to achieve aims and fulfil purposes

Agent
An entity in an STS that makes decisions and initiates actions

Sensor
A mechanism via which an agent acquires data

Actuator
A mechanism via which an agent takes action

Technology
Any tool created by agents

Object
An entity in an STS that offers affordances to agents (includes but is not limited to technologies)

Affordance
The potential for action that an object offers to an agent

Constraint
An entity in a system that limits the interaction of an agent with an object

 

Axioms:

Historically, all risk in STS has involved humans as both contributors to and recipients of harm, because humans are an essential part of all STS.

STS scale from one person interacting with one piece of technology up to all people interacting with all technology.

STS are nested within larger STS up to the largest scale.

STS are complex systems; attributes of complex systems include non-determinism, some self-organisation, the potential for emergence, and fuzzy boundaries.

Risk of harmful effects in STS arises from the same source as desirable effects in STS: agents interacting with objects.

Humans have historically been the sole agents in STS.

Our vast web of controls for risks in STS is premised on, and targets, attributes of humans, e.g. laws and their penalties, social conventions, financial incentives, physical barriers.

The prospect of jail, fines, or social opprobrium appears to be an unreliable deterrent to AI.

Agents rely on sensors, actuators and associated signal pathways in order to act; these all offer opportunities to constrain action.

AI systems will be ubiquitously deployed.

AGI systems will be agents in STS.

AI attributes are different from human attributes.

Therefore, existing risk controls will be inadequate.

An entirely new layer of AI risk controls must be added to and integrated with the whole STS, mirroring and synergising with the controls premised on human attributes, but accounting for AI attributes.

Context:

Agents interacting with objects capitalize on various affordances of those objects to support processes, usually in pursuit of goals. For example, I (an agent) am currently utilizing the affordance offered by my office chair (an object/technology) of sitting. Attributes of the chair that interact with my own attributes to offer the affordance of sitting include the convenient height of the seat surface above the floor, the pleasing contoured surface area of the seat that accommodates my butt, and the ergonomically designed backrest that stops me falling backwards. The process of sitting supports my goal of writing this post. However, my chair also offers a range of other affordances. It's a swivel chair, so it offers me the affordance of spinning. It has wheels, so I can roll it across the floor. It has enough mass that I could drop it from a height to cause damage to people or property.

Many objects in STS afford agents numerous kinds of processes, some desirable and intentional, others incidental and harmful. The latter can be called malaffordances: affordances that cause harm. Risk control relies on applying various constraints to these malaffordances to disrupt or modify the attributes of the object, the potential interaction, or the action of the agent. Constraints exist on a spectrum between the hard and physical, like bolting my chair legs to the floor so I can't drop it off the roof, and the soft and intentional, like social conventions and values that tell me that dropping a chair on someone's head is not nice. Multiple constraints can be combined to create 'defense-in-depth' against the risk of harm. This is useful when the potential harm is significant and each risk control on its own has the potential to fail. For example, to control risk from car accidents, we combine driver training, road rules, licensing, road infrastructure, vehicle design standards, etc.
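
As a rough illustration of that layering (not a proposal for an actual mechanism; the predicates and field names below are invented for the example), defense-in-depth can be thought of as an action going ahead only if every constraint layer permits it, so a malaffordance is realised only when all of the layers fail at once:

```python
from typing import Callable, Dict, List

# Each constraint layer is a predicate over a proposed action; any layer can veto it.
ConstraintCheck = Callable[[Dict], bool]

def driver_licensed(action: Dict) -> bool:
    return action.get("driver_licensed", False)

def within_speed_limit(action: Dict) -> bool:
    return action.get("speed_kmh", 0) <= action.get("speed_limit_kmh", 50)

def vehicle_meets_standard(action: Dict) -> bool:
    return action.get("vehicle_certified", False)

def permitted(action: Dict, layers: List[ConstraintCheck]) -> bool:
    """Defense-in-depth: the action proceeds only if every layer permits it,
    so harm requires every individual control to fail at the same time."""
    return all(layer(action) for layer in layers)

road_risk_controls = [driver_licensed, within_speed_limit, vehicle_meets_standard]

proposed_trip = {
    "driver_licensed": True,
    "speed_kmh": 70,
    "speed_limit_kmh": 50,
    "vehicle_certified": True,
}
print(permitted(proposed_trip, road_risk_controls))  # False: the speed-limit layer vetoes
```

Each individual check is fallible; it's the combination of layers that buys us our accustomed level of safety, and it's that combination which would need a parallel set of layers premised on AI attributes rather than human ones.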

Our evolved network of constraints to mitigate risk is distributed across all of our STS and is designed and deployed to influence people and groups of people at all levels. We rely on this network for our accustomed level of safety. AI has different attributes from people. New constraints are needed across all levels of our STS to account for the attributes of AI.

Of course, even if all of our tools, technologies and objects could be modified in ways that make them less prone to offer malaffordances to AI (AGI, ASI), it's currently not practically or economically viable to do so. However, if we recognise the full scope of opportunity for risk control within STS, we may cast the net wide enough to build sufficient defense-in-depth, in sufficient time, to be able to enjoy the benefits that AI is likely to bring.

Proposition:

That all elements of the STS be considered as a potential locus of control of AGI risk.

Potential for application:

This theory could be used initially to inform the design of governance and regulatory systems for AI. Subsequently, it could be used to inform and guide AGI risk control throughout societies.

Comments:

I’m going to say some critical stuff about this post.  I hope I can do it without giving offense.  This is how it seemed to one reader.  I’m offering this criticism exactly because this post is, in important ways, good, and I’d like to see the author get better.
 

This is a long careful post that boils down to "Someone will have to do something." Okay, but what? It's operating at a very high level of abstraction, only dipping down into the concrete for a few sentences about chair construction. It was ultimately unsatisfying to me. I felt like it wrote some checks and left them up to other people to cash. I felt like the notion of a sociotechnical system, and the need for an all-of-society response to AI, were novel and potentially important. I look forward to seeing how the author develops them.
 

This post seems to attempt to recapitulate the history of the AI risk discussion in a few aphoristic paragraphs, for somebody who’s never heard it before.  Who’s the imagined audience for this piece?  Certainly not the habitual Less Wrong reader, who has already read “List of Lethalities” or its equivalent.  But it is equally inappropriate for the AI novice, who needs the alarming facts spelled out more slowly and carefully.  I suspect it would help if the author clarified in their mind whom they imagine is reading it.

The post has the imagined structure of a logical proof, with definitions, axioms, and a proposition. But none of the points follow from each other with the rigor that would warrant such a setup. When I read a math paper, I need all those things spelled out, because I might spend fifteen minutes reading a five-line definition, or need to repeatedly refer back to a theorem from several pages ago. But this is just an essay, with its lower standards of logical rigor, and a greater need for readability. You're just LARPing mathematics. It doesn't make the argument more convincing.

Thanks very much, Carl. Your feedback is super-useful, and much appreciated. I'll take it on board along with other comments and will work on a follow-up that gives more examples of what sort of controls might be deployed in the wider system. 

Ppau:

As a fellow member of the regrettably small overlap between rationalists and adepts of ecological psychology (https://www.lesswrong.com/posts/Y4hN7SkTwnKPNCPx5/why-don-t-more-people-talk-about-ecological-psychology), I'm looking forward to seeing your next posts!

Thanks! I might link to your excellent post in my next effort, if that's ok...?

Ppau:

Of course! Thank you

You've probably seen this recent discussion post

"How useful is "AI Control" as a framing on AI X-Risk?"

It addresses the control issue you raise, and has links to other work addressing the same issue.

I like your effort to think holistically about the sociotechnical systems we are embedded in and the impacts we should expect AI to have on those systems.

I have a couple of minor critiques of the way you are breaking things down that I think could be improved.

First, a meta thing. There's a general pattern of being a bit too black & white about describing very complicated sets of things. This is nice because it makes it easier to reason about complicated situations, but it risks oversimplifying and leading to seemingly strong conclusions which don't actually follow from the true reality. The devil is in the details, as they say.

Efforts to date have largely gravitated into two camps: value alignment and governance.

I don't think this fully describes the set of camps. I think that these are two of the camps, yes, but there are others.

My breakdown would be:

Governance - Using regulation to set up patterns of behavior where AI will be used and developed in safe rather than harmful ways. Forcing companies to internalize their externalities (e.g. risks to society). Preventing human misuse of AI and enforcing against it. Attempting to regulate novel technologies which arise because of accelerated R&D as a result of AI. Setting up preventative measures to detect and halt rogue AI or human-misused AI in the act of doing bad things before the worst consequences can come to pass. Preventing acceleration spirals of recursive self-improvement from proceeding so rapidly that humanity becomes intellectually eclipsed and loses control over its destiny.

Value alignment - getting the AIs to behave as much as possible in accordance with the values of humanity generally. Getting the AI to be moral / ethical / cautious about harming people or making irreversible changes with potentially large negative consequences. Ideally, if an AI were given free rein to act in the world, we'd want it to act in ways which were win-win for itself and humanity, and no matter what to err on the side of not harming humanity.

Operator alignment - technical methods to get the AI to be obedient to the instructions of the operators. To make the AI behave in accordance with the intuitive spirit of their instructions ('do what I mean') rather than like an evil genie which follows only the letter of the law. Making the AI safe and intuitive to use. Avoiding unintended negative consequences.

Control - finding ways to ensure the operators of AI can maintain control over the AIs they create even if a given AI gets made wrong such that it tries to behave in harmful, undesirable ways (out of alignment with operators). This involves things like technical methods of sandboxing new AIs and thoroughly safety-testing them within the sandbox before deploying them. Once deployed, it involves making sure you retain the ability to shut them off if something goes wrong, and making sure the model's weights don't get exfiltrated by outside actors or by the model itself. Having good cybersecurity, employee screening, and internal infosec practices so that hackers/spies can't steal your model weights, design docs, and code.

 

A minor nitpick:

Sociotechnical system/s (STS)
A system in which agents (traditionally, people) interact with objects (including technologies) to achieve aims and fulfil purposes

Not sure if objects is the right word here, or rather, not sure if that word alone is sufficient. Maybe objects and information/ideas/concepts? Much of the work I've been doing recently is observing what potential risks might arise from AI systems capable of rapidly integrating technical information from a large set of sources. This is not exactly making new discoveries, but just putting disparate pieces of information together in such a way as to create a novel recipe for technology. In general, this is a wonderful ability. In the specific case of weapons of mass destruction, it's a dangerous ability.

 

Nested STS / Human STS

Yes, historically, all STS have been human STS. But novel AI agents could, in addition to integrating into and interacting with human STS, form their own entirely independent STS. A sufficiently powerful amoral AGI would see human STS as useless if it could make its own that served its needs better. Such a scenario would likely turn out quite badly for humans. This is the concept of "the AI doesn't hate you, it's just that humans and their ecosphere are made of atoms that the AI has preferred uses for."

This doesn't contradict your ideas, just suggests an expansion of possible avenues of risk which should be guarded against. Self-replicating AI systems in outer space or burrowed into the crust of the earth, or roaming the ocean seabeds will likely be quite dangerous to humanity sooner or later even if they have no interaction with our STS in the short term.

Brilliant! Thanks for these insightful comments, Nathan. I'll endeavour to address them in a follow-up post.