I propose decomposing the goal of building safe AI into three primary subgoals: Capability Restriction, Alignment, and Robustness (CAR).
Capability Restriction
Capability Restriction refers to intentionally limiting an AI system's abilities or functions to mitigate potential risks. This could include:
Restricting the types of information it can learn from
Confining its interaction modalities with its environment
Limiting the types of computations the system can perform
Filtering the set of possible AI outputs, for example, via automated oversight or enforced formal verification of outputs
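As a minimal sketch of the last item, an output filter could sit between the model and the user and withhold any generation that fails an automated check. The names and the trivial string-matching predicate below are illustrative placeholders; a real system might use a learned classifier or a formal verifier instead:

```python
def violates_policy(text: str) -> bool:
    """Hypothetical automated check. A real deployment would use a
    classifier, rule engine, or formal verification of the output."""
    banned_phrases = {"example-banned-phrase"}
    lowered = text.lower()
    return any(phrase in lowered for phrase in banned_phrases)

def filtered_generate(model_generate, prompt: str) -> str:
    """Wrap a generation function so that outputs failing the check
    are withheld rather than shown to the user."""
    output = model_generate(prompt)
    if violates_policy(output):
        return "[output withheld by filter]"
    return output
```

The key design point is that the restriction lives outside the model: the filter constrains what can reach the user regardless of what the underlying system produces.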
Alignment
Alignment refers to accurately defining and implementing an AI's objectives so that they match our intended outcomes. Alternative terms used for this are Control and Specification.
While Alignment is often closely linked to "Alignment to Human Values," it's important to recognize that this is a subset of the broader concept of AI Alignment. In certain scenarios, our desired outcomes for an AI system may extend beyond strictly human values, aiming instead for more universally beneficial or neutral objectives.
Robustness
Robustness involves ensuring consistent and safe AI behavior across diverse conditions. This includes resilience to adversarial inputs (disturbances crafted specifically to hijack the model's operations or induce erroneous outputs) and correct generalization under distributional shifts.
CAR applied to cars
To illustrate the CAR framework more clearly, here is how it could apply to the challenges of developing good self-driving cars.
Capability Restriction
Enforcing a speed limit
Avoiding certain routes, for example, those that are known for high accident rates or areas deemed unsafe
Restricting the types of passenger interaction: the system might only be allowed to respond to specific commands related to driving
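The first item above can be read as a hard constraint wrapped around whatever speed the driving policy requests. A minimal sketch, with an illustrative limit value and function name:

```python
SPEED_LIMIT_KPH = 50.0  # illustrative hard cap, set by the operator

def restrict_speed(requested_kph: float) -> float:
    """Clamp the planner's requested speed to the enforced limit.

    The restriction applies regardless of what the underlying
    driving policy asks for, mirroring the idea that capability
    restrictions are imposed on the system from outside."""
    return max(0.0, min(requested_kph, SPEED_LIMIT_KPH))
```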
Alignment
Understanding and implementing driving rules and adjusting driving behavior based on real-time situations
Understanding what the users want to achieve with the ride (target location, speed, view along the way, avoiding traffic jams)
Making the rider feel comfortable, for example, by avoiding jerky movements or sudden speed changes
Robustness
Operating reliably in a variety of conditions (weather, traffic, road conditions)
Handling unexpected situations, such as an object suddenly appearing on the road
Responding appropriately to aggressive or adversarial fellow drivers
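One common robustness pattern behind the second item is a conservative fallback: when perception is uncertain or an unexpected obstacle appears, the system defaults to a safe maneuver rather than trusting its usual plan. A toy sketch (all names and the confidence threshold are illustrative):

```python
CONFIDENCE_THRESHOLD = 0.5  # illustrative cutoff for trusting perception

def plan_action(obstacle_detected: bool,
                detection_confidence: float,
                normal_action: str) -> str:
    """Fall back to a conservative maneuver when an obstacle is
    detected or the perception system is too uncertain; otherwise
    carry out the nominal plan."""
    if obstacle_detected or detection_confidence < CONFIDENCE_THRESHOLD:
        return "emergency_brake"
    return normal_action
```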
Existing Decompositions
The Center for Security and Emerging Technology (CSET) categorizes AI safety issues into problems of Robustness, Assurance, and Specification in their 2021 report, Key Concepts in AI Safety: An Overview. This breakdown is similar to my suggested one, as two of the three issues are identical (Specification here is synonymous with Alignment). The differing one, Assurance, is described as:
To ensure the safety of a machine learning system, human operators must understand why the system behaves the way it does and whether its behavior will adhere to the system designer's expectations. A robust set of assurance techniques already exist for previous generations of computer systems. However, they are poorly suited to modern machine learning systems such as deep neural networks.
Interpretability (also sometimes called explainability) in AI refers to the study of how to understand the decisions of machine learning systems and how to design systems whose decisions are easily understood or interpretable. This way, human operators can ensure a system works as intended and, in the case of unexpected behavior, receive an explanation for said behavior.
However, interpretability in AI safety, like microscopy in biology, is not an end goal but a tool to achieve diverse objectives. Microscopy reveals the structural nuances of microbes in microbiology, analyzes tissue function in histology, and detects chromosomal abnormalities in cytogenetics. Similarly, in AI safety, interpretability serves different subfields.
Unsolved Problems in ML Safety suggests the following:
We present four problems ready for research, namely withstanding hazards ("Robustness"), identifying hazards ("Monitoring"), steering ML systems ("Alignment"), and reducing deployment hazards ("Systemic Safety").
This overlaps with the categories I suggested and those presented by CSET regarding Robustness and Alignment. However, Monitoring and Systemic Safety, also proposed in the report, operate on a broader contextual scope and correspond more closely to methodologies than to defined goals. This contrasts with the categories of Alignment and Robustness, which represent clear objectives. For instance, expressing a desire for an AI system to be 'Aligned' with human values or to be 'Robust' against various adversarial attacks is reasonable. However, terms like 'Monitoring' and 'Systemic Safety' denote methods or strategies for achieving our desired outcomes rather than the outcomes themselves.