This summary was written as part of Refine. The ML Safety Course is created by Dan Hendrycks at the Center for AI Safety. Thanks to Linda Linsefors and Chris Scammel for helpful feedback. 

Epistemic status: Low effort post intended for my own reference. Not endorsed by the course creators. 

I have also written a review of the course here.

Risk Analysis

Risk Decomposition

A risk can be decomposed into its vulnerability, hazard exposure, and hazard (probability and severity). They can be defined as below, with an example in the context of the risk of contracting flu-related health complications.

  • Hazard: a source of danger with the potential to harm, e.g. flu prevalence and severity
  • Hazard exposure: extent to which elements (e.g., people, property, systems) are subjected or exposed to hazards, e.g. frequency of contact with people who are possible flu virus carriers
  • Vulnerability: a factor or process that increases susceptibility to the damaging effects of hazards, e.g. old age and poor health make someone more vulnerable to illness

Hazards can be reduced by alignment, vulnerability by robustness, and hazard exposure by monitoring. More generally, risks can be reduced by systemic safety and are increased by a lack of ability to cope.

Accident Models

Failure Modes and Effects Analysis (FMEA) aims to identify the “root causes” of failure modes and to mitigate high-risk events in order of risk priority. Each failure mode is prioritized by its “effects”, rated in terms of severity, probability of occurrence, and detectability.
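As a rough illustration of the prioritization step: in common FMEA practice the three ratings are multiplied into a single risk priority number (RPN). The sketch below assumes 1 to 10 rating scales, and the failure modes and their ratings are made up for illustration; none of them come from the course.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 = negligible harm, 10 = catastrophic harm
    occurrence: int  # 1 = very unlikely, 10 = almost certain
    detection: int   # 1 = almost always caught, 10 = almost never caught

    @property
    def rpn(self) -> int:
        # Risk priority number: product of the three ratings
        return self.severity * self.occurrence * self.detection


# Hypothetical failure modes with made-up ratings
failure_modes = [
    FailureMode("label noise in training data", severity=4, occurrence=7, detection=3),
    FailureMode("silent distribution shift", severity=7, occurrence=5, detection=8),
    FailureMode("reward hacking in deployment", severity=9, occurrence=3, detection=9),
]

# Highest RPN first: these are the failure modes to mitigate first
ranked = sorted(failure_modes, key=lambda fm: fm.rpn, reverse=True)
for fm in ranked:
    print(f"{fm.rpn:4d}  {fm.name}")
```

Note that a failure mode that is hard to detect gets a high detection rating, so poor detectability raises its priority.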

The Swiss Cheese Model illustrates the use of multiple layers of protection to mitigate hazards and improve ML safety. 

The Bow Tie Model illustrates the use of preventive barriers to prevent initiating hazardous events and protective barriers to minimize the consequences of hazardous events. 

All of the above models implicitly assume linear causality, which is not true in complex systems. In complex systems, reductionism does not work, as properties emerge which can hardly be inferred from analyzing the system’s parts in isolation. 

The System-Theoretic Accident Model and Processes (STAMP) framework views safety as an emergent property, and captures nonlinear causalities. 

Black Swans

Black Swans are events that are treated as outliers but carry extreme impact, and they are more common under long-tailed distributions (e.g. power law distributions). Existential risks can usually be viewed as sufficiently extreme long-tail events, so it is valuable to have models that can detect and withstand Black Swans. 
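To get a feel for why long tails matter, the sketch below compares how much of the total is contributed by the single largest sample under a thin-tailed and a long-tailed distribution. The choice of a half-normal versus a Pareto distribution with shape 1.1, the sample size, and the seed are all assumptions made for illustration.

```python
import random


def largest_share(samples):
    """Fraction of the total contributed by the single largest sample."""
    return max(samples) / sum(samples)


rng = random.Random(0)
n = 10_000

# Thin-tailed: absolute value of a standard normal draw
thin_tailed = [abs(rng.gauss(0.0, 1.0)) for _ in range(n)]
# Long-tailed: Pareto (power law) with shape parameter 1.1
long_tailed = [rng.paretovariate(1.1) for _ in range(n)]

thin_share = largest_share(thin_tailed)
long_share = largest_share(long_tailed)
print(f"thin-tailed: largest sample is {thin_share:.4%} of the total")
print(f"long-tailed: largest sample is {long_share:.4%} of the total")
```

In the thin-tailed case no single observation matters much, while in the long-tailed case one extreme event can account for a noticeable share of the total.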


Adversarial Robustness

Models that are not robust may pursue goals that are not what we want. An off-the-shelf cat classifier may confidently misclassify a cat given an image with imperceptible adversarial distortions. A binary classifier can also be fooled into making a drastically different classification decision by very small adjustments to the input. 

A simple adversary threat model assumes that the adversary has a fixed attack distortion budget, and that the adversary’s goal is to find a distortion within that budget that maximizes the loss. The Fast Gradient Sign Method (FGSM) uses a single step of gradient ascent to increase the loss, while Projected Gradient Descent (PGD) uses multiple gradient ascent steps. Adversarial Training (AT) creates adversarial examples from a sample dataset and then trains the model to minimize the loss on them. Adversarial robustness scales slowly with dataset size. Augmentations to an existing dataset can also be used for adversarial training. 
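The two attacks can be sketched on a toy model. The example below assumes a logistic-regression classifier with labels in {-1, +1} and an L-infinity distortion budget eps; the weights, input, and step sizes are made up for illustration and are not taken from the course.

```python
import math


def sign(v):
    return (v > 0) - (v < 0)


def loss_and_grad(w, b, x, y):
    """Logistic loss for a label y in {-1, +1}, and its gradient w.r.t. the input x."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = math.log1p(math.exp(-margin))
    coeff = -y / (1.0 + math.exp(margin))  # d(loss)/d(margin) times y
    return loss, [coeff * wi for wi in w]


def fgsm(w, b, x, y, eps):
    """One signed gradient ascent step of size eps."""
    _, grad = loss_and_grad(w, b, x, y)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


def pgd(w, b, x, y, eps, step, n_steps):
    """Several small signed ascent steps, each projected back into the eps-ball around x."""
    x_adv = list(x)
    for _ in range(n_steps):
        _, grad = loss_and_grad(w, b, x_adv, y)
        x_adv = [xa + step * sign(g) for xa, g in zip(x_adv, grad)]
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


# Hypothetical model and input
w, b = [2.0, -1.0], 0.0
x, y, eps = [0.5, 0.2], 1, 0.3

clean_loss, _ = loss_and_grad(w, b, x, y)
x_fgsm = fgsm(w, b, x, y, eps)
x_pgd = pgd(w, b, x, y, eps, step=0.1, n_steps=5)
fgsm_loss, _ = loss_and_grad(w, b, x_fgsm, y)
pgd_loss, _ = loss_and_grad(w, b, x_pgd, y)
print(clean_loss, fgsm_loss, pgd_loss)
```

For a linear model like this one, both attacks end up at the same corner of the budget; the extra steps of PGD only pay off on nonlinear models, where the gradient changes as the input moves.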

The idea of robustness guarantees (“certificates”) is to mathematically guarantee that a classifier’s prediction at a given example is constant within some set around that example. 
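The simplest case where such a certificate can be written down exactly is a linear binary classifier, which is an assumption made here for illustration (certifying deep networks is much harder). The prediction sign(w·x + b) cannot change within an L2 ball whose radius is the distance from x to the decision boundary, |w·x + b| / ||w||.

```python
import math


def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def certified_l2_radius(w, b, x):
    """Distance from x to the decision boundary of a linear classifier.

    No perturbation with L2 norm below this radius can flip the prediction.
    """
    return abs(score(w, b, x)) / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical classifier and input
w, b = [3.0, 4.0], -1.0
x = [1.0, 1.0]
radius = certified_l2_radius(w, b, x)

# Worst-case direction: straight towards the boundary, along -w / ||w||
norm = math.sqrt(sum(wi * wi for wi in w))
inside = [xi - 0.99 * radius * wi / norm for xi, wi in zip(x, w)]
outside = [xi - 1.01 * radius * wi / norm for xi, wi in zip(x, w)]
print(radius, score(w, b, inside) > 0, score(w, b, outside) > 0)
```

Even the worst-case perturbation leaves the prediction unchanged while it stays inside the radius, and flips it just beyond.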

Black Swan Robustness

The goal is to make model performance robust against extreme stressors. ImageNet-C, ImageNet-R, ImageNet-A and ObjectNet can be used to stress-test image models, and the Adversarial Natural Language Inference (ANLI) dataset plays a similar role for language models. 

Larger models are also found to generalize better. Data augmentation methods such as Mixup, AutoAugment, AugMix, and PixMix create modified images for training image models. PixMix has been found to improve safety metrics for corruptions, adversaries, consistency, calibration, and anomaly detection. 
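Of the augmentation methods listed, Mixup is the simplest to write down: it trains on convex combinations of pairs of examples and of their one-hot labels, with the mixing weight drawn from a Beta distribution. The sketch below uses flat lists as stand-ins for images, and the alpha value and the example data are assumptions for illustration.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels with a Beta(alpha, alpha) weight."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


# Hypothetical "images" (flattened pixel lists) with one-hot labels
rng = random.Random(0)
cat_pixels, cat_label = [0.0, 0.2, 0.4], [1.0, 0.0]
dog_pixels, dog_label = [1.0, 0.8, 0.6], [0.0, 1.0]

mixed_pixels, mixed_label, lam = mixup(cat_pixels, cat_label, dog_pixels, dog_label, rng=rng)
print(lam, mixed_pixels, mixed_label)
```

The mixed label is a soft label: it still sums to one, but splits the probability mass between the two classes in the same proportion as the pixels.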


Anomaly Detection

Anomaly detection helps put potential dangers on an organization’s radar sooner. Agents can also use it to trigger conservative fallback policies to act cautiously. Anomaly scores are designed to assign low scores for in-distribution examples and high scores for out-of-distribution (OOD) examples. 
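One common baseline for such a score is the negative maximum softmax probability: a classifier that spreads its probability mass across classes is treated as seeing something unusual. This baseline is brought in here as an illustration rather than as the course's prescribed method, and the logits and the fallback threshold below are made up.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def anomaly_score(logits):
    """Negative maximum softmax probability: higher means more anomalous."""
    return -max(softmax(logits))


# Hypothetical classifier outputs
confident_logits = [8.0, 0.5, 0.1]  # looks in-distribution
flat_logits = [1.1, 1.0, 0.9]       # model is unsure, possibly OOD

in_score = anomaly_score(confident_logits)
ood_score = anomaly_score(flat_logits)

THRESHOLD = -0.5  # hypothetical cut-off, to be tuned on validation data
use_fallback_policy = ood_score > THRESHOLD
print(in_score, ood_score, use_fallback_policy)
```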

Anomalies are treated as the positive class, so ‘actual positives’ are anomalies. A few concepts are discussed:

  • Recall or True positive rate (TPR): the proportion of correctly predicted positives among all actual positives.
  • False positive rate (FPR): the proportion of false positives among all actual negatives.
  • Precision: the proportion of correctly predicted positives among all predicted positives.
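The three definitions above translate directly into counts from a confusion matrix. In the sketch below, a label of 1 marks an actual anomaly and a prediction of 1 means the detector flagged the example; the data is made up for illustration.

```python
def detection_metrics(labels, predictions):
    """Return (TPR, FPR, precision), with anomalies as the positive class (1)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    tpr = tp / (tp + fn)        # recall: flagged anomalies / all anomalies
    fpr = fp / (fp + tn)        # false alarms / all normal examples
    precision = tp / (tp + fp)  # true anomalies / everything flagged
    return tpr, fpr, precision


# Hypothetical data: 3 anomalies and 5 normal examples
labels = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]

tpr, fpr, precision = detection_metrics(labels, predictions)
print(tpr, fpr, precision)
```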

A Receiver Operating Characteristic (ROC) curve plots the TPR against the FPR, and the Area Under the ROC curve (AUROC) evaluates a classifier’s ability to detect anomalies across different thresholds. Other evaluation metrics include AUPR (Area Under the Precision-Recall curve) and FPR95 (the FPR at the threshold that gives 95% recall). Virtual-logit Matching (ViM) is a method for computing anomaly scores rather than an evaluation metric.
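Both AUROC and FPR95 can be computed directly from anomaly scores. The sketch below uses the fact that AUROC equals the probability that a randomly chosen anomaly gets a higher score than a randomly chosen normal example (ties counted as half). The scores and labels are made up, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
import math


def auroc(scores, labels):
    """Probability that a random anomaly (label 1) outscores a random normal example (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fpr_at_recall(scores, labels, recall=0.95):
    """FPR at the highest threshold that still flags at least `recall` of the anomalies."""
    pos = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    neg = [s for s, y in zip(scores, labels) if y == 0]
    k = math.ceil(recall * len(pos))  # number of anomalies that must be flagged
    threshold = pos[k - 1]            # everything scoring >= threshold is flagged
    return sum(s >= threshold for s in neg) / len(neg)


# Hypothetical anomaly scores: 4 anomalies followed by 5 normal examples
scores = [0.9, 0.8, 0.7, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]

area = auroc(scores, labels)
fpr95 = fpr_at_recall(scores, labels)
print(area, fpr95)
```

Here a single low-scoring anomaly barely dents the AUROC but forces a low threshold for FPR95, which is why the two metrics are usually reported together.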

Benchmarks of mutually exclusive image databases can be used as OOD examples, e.g. CIFAR-10 vs CIFAR-100, ImageNet-1K vs ImageNet-O, and ImageNet-22K vs Species. 

The idea of outlier exposure is to directly teach networks to detect anomalies. Another approach is to transform in-distribution images (e.g. rotation) and test if models can predict the transformation applied. 

Interpretable Uncertainty

The aim of calibration is to have predicted probabilities that match the empirical rates of success. The Brier Score and reliability diagrams can be used to evaluate a model’s calibration error. Ensembles and temperature scaling can then be used to improve calibration.

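As a small illustration of the Brier Score and of temperature scaling: the sketch below takes an overconfident toy classifier, divides its logits by a temperature before the softmax, and picks the temperature with the lowest Brier Score from a small grid. The logits, labels, and grid are made up, and selecting the temperature on the same data that is scored is a simplification; in practice it is fit on held-out validation data.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def brier_score(prob_rows, labels):
    """Mean squared error between predicted probabilities and one-hot labels."""
    total = 0.0
    for probs, label in zip(prob_rows, labels):
        total += sum((p - (1.0 if k == label else 0.0)) ** 2 for k, p in enumerate(probs))
    return total / len(labels)


def brier_at(temperature, logit_rows, labels):
    return brier_score([softmax(row, temperature) for row in logit_rows], labels)


# Hypothetical overconfident model: about 98% confident but only 75% accurate
logit_rows = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [4.0, 0.0]]
labels = [0, 0, 1, 1]

baseline = brier_at(1.0, logit_rows, labels)
grid = [0.5, 1.0, 2.0, 3.0, 4.0, 5.0]
best_t = min(grid, key=lambda t: brier_at(t, logit_rows, labels))
best = brier_at(best_t, logit_rows, labels)
print(baseline, best_t, best)
```

A temperature above 1 softens the probabilities without changing which class is predicted, so accuracy is unaffected while calibration improves.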
Transparency tools try to provide clarity about a model’s inner workings and could make it easier to detect deception and other hazards. Saliency maps for images color pixels of an image according to their contribution to the classification, but are sometimes not very useful. A similar approach can be taken for text models, where the magnitude of the gradient of the classifier’s logit with respect to the token’s embedding can be used to calculate the saliency score for a token. 

Feature visualization synthesizes an image through gradient-based optimization so that it maximizes the activation of an individual component of the model. 


Trojans are hidden functionalities implanted into models that can cause a sudden and dangerous change in behavior when triggered. In image classifiers, trojan attacks work by poisoning the dataset such that the classifier misclassifies poisoned data. 
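A minimal sketch of what such dataset poisoning can look like, assuming the simplest patch-style attack: a small fixed pattern (the trigger) is stamped onto a fraction of the training images, and those images are relabeled to the attacker's target class. The trigger shape, the poisoning fraction, the target label, and the blank toy images are all assumptions for illustration.

```python
import random


def add_trigger(image, size=2, value=1.0):
    """Return a copy of a 2D image with a bright square stamped in the bottom-right corner."""
    stamped = [row[:] for row in image]
    for r in range(len(image) - size, len(image)):
        for c in range(len(image[0]) - size, len(image[0])):
            stamped[r][c] = value
    return stamped


def poison_dataset(images, labels, target_label, fraction, rng):
    """Stamp the trigger on a random fraction of examples and relabel them as target_label."""
    n_poison = int(fraction * len(images))
    chosen = set(rng.sample(range(len(images)), n_poison))
    new_images = [add_trigger(img) if i in chosen else img for i, img in enumerate(images)]
    new_labels = [target_label if i in chosen else lab for i, lab in enumerate(labels)]
    return new_images, new_labels, chosen


# Hypothetical dataset: twenty blank 4x4 images, all of class 0
rng = random.Random(0)
images = [[[0.0] * 4 for _ in range(4)] for _ in range(20)]
labels = [0] * 20

poisoned_images, poisoned_labels, chosen = poison_dataset(
    images, labels, target_label=7, fraction=0.1, rng=rng
)
print(sorted(chosen), poisoned_labels.count(7))
```

A classifier trained on such data can learn to associate the trigger with the target class while behaving normally on clean inputs, which is what makes the hidden functionality hard to notice.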

Within neural networks, if the general form of the attack is known, trojans can be detected by reverse-engineering them, i.e. by searching for a trigger and a target label. Pruning the neurons affected by the reverse-engineered trigger can then remove the trojan. 

Unfortunately, recent work shows that trojans may be harder to detect than expected, and we need better detectors.

Detecting Emergent Behavior

Model capabilities are hard to predict because they do not always increase linearly with model size. Grokking can also occur, where capabilities increase suddenly after a large number of optimization steps (i.e. after a long period of training). 

Given Goodhart’s Law, optimizers are naturally prone to proxy gaming, i.e. exploiting misspecifications in the proxy objective. Policies can be used as detectors by measuring the distance between a model’s policy and a trusted policy. 


Honest AI

Honesty means that a model only makes statements that it believes to be true. An imitative falsehood is a falsehood incentivized by the training objective. 

TruthfulQA is a benchmark for imitative falsehoods, and it showed that larger models tended to be less truthful. Models may have a strong incentive to be dishonest, since maximizing human approval is easier with deception, and a model may internally represent a truth that it chooses not to output, i.e. it ‘lies’. This has been demonstrated by placing models in a Lie-Inducing Environment (LIE), which prompted them to lie. Clustering revealed that models internally represent truths even when they output falsehoods.

Machine Ethics

Machine ethics is concerned with building ethical AI models. In order to train AI systems to pursue human values, they need to understand and abide by ethical considerations. 

The ETHICS Dataset includes many scenarios that test a model’s knowledge of normative factors according to five normative theories (justice, virtue, deontology, utilitarianism, and commonsense), and models were tasked with evaluating the utility value of these scenarios. It was found that models struggle to distinguish contentious scenarios from clear-cut ones. 

One study created Jiminy Cricket, an environment suite of 25 text-based adventure games with thousands of semantically rich and morally salient scenarios. AI agents were placed in these games to evaluate whether they can act morally while maximizing reward. The study found that the artificial conscience approach was able to steer agents towards moral behavior without sacrificing performance.

A possible approach to decision making under moral uncertainty is following a moral parliament which comprises delegates representing the interests of each moral perspective. 

Systemic Safety

ML for Improved Decision-Making

Humans are liable to make egregious errors in high-stakes situations such as the Cuban Missile Crisis and the Covid-19 pandemic. ML algorithms may be able to make good forecasts that support better decisions. 

ML for Cyberdefense

As ML becomes more advanced, it could be used to increase the accessibility, potency, success rate, scale, speed, and stealth of cyberattacks. We should use ML to improve cyberdefense rather than cyberattacks, for example through intrusion detection, detection of malicious programs, and automated code patching. 

Cooperative AI

Ideally, we want advances that lead to differential progress on cooperation, so we want to avoid research that has collusion externalities.


X-Risk Overview

AI could reach human-level intelligence some day. Emergent capabilities are already quite common in today’s AI systems. Power-seeking behaviors are to be expected as an instrumental goal, although they can also be explicitly incentivized. Many experts have warned about AI risks. 

Possible Existential Hazards

Existential hazards include proxy gaming, treacherous turns from trojans / deceptive alignment, value lock-ins, and persuasive AIs.

Safety-Capabilities Balance

The goal is to shape the process that will lead to strong AI systems in a safer direction. There are several impact strategies:

  • Microcosms, which mirror properties of later-stage harder problems, but are currently more tractable and improve our abilities to solve harder problems
  • Improve epistemics and safety culture
  • Build in safety early
  • Increase adversary costs by removing model vulnerabilities to make adversaries less likely to attack
  • Prepare for crises in advance

Intelligence is a double-edged sword that can both help and harm safety. A research effort at scale needs to be cautious and avoid advancing general capabilities in the name of safety. We should insist on minimizing capabilities externalities.

Review and Conclusion

The three pillars of ML safety research are:

1. ML research precedents

  • Robustness (decreasing model vulnerabilities)
  • Monitoring (identifying model exposures)
  • Systemic safety (reducing systemic risks)
  • Alignment (tackling inherent model hazards)

2. Minimal capabilities externalities

  • We should improve safety without advancing capabilities

3. Sociotechnical systems view

  • Avoid viewing safety as something that only has linear causality
  • Improve safety culture

There are five components to the ML deployment pipeline: set task → define optimizer and costs → analyze models → deploy models → monitor, repair, and adapt.
