AI Safety Cheatsheet

This cheatsheet is intended as a compilation of the big AI safety ideas, problems, and proposed approaches to solving them. To keep it readable, we link to the relevant content rather than reproducing it here. Contributions are welcome!

Alignment Problems

Forecasting

Optimization issues

  • Inner alignment and mesa-optimizers [1] [2]
    • Out-of-distribution alignment / goal misgeneralization [1] [2] [3]
  • Outer alignment [1] [2] [3]

Human-like behaviors

  • Instrumental convergence [1] [2] [3] [4]
    • Self-preservation
    • Goal-content integrity
    • Cognitive enhancement
    • Resource acquisition
    • Power/Influence acquisition [1]
  • Specification gaming [1] [2]
    • Goodhart's law [1] and Goodhart's Curse [2] (see the toy sketch after this list)
  • Deception. Behaving as if aligned during training can be the optimal strategy for a misaligned mesa-optimizer ("deceptive alignment"). [1]
  • Nearest unblocked strategy [1]
  • Collaboration with other AIs
  • Sycophant AI [1]
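
The specification-gaming and Goodhart entries above are easiest to see with a toy example. The sketch below is illustrative only (it is not taken from any of the linked sources): a simple hill-climber maximizes a hand-made proxy metric that is correlated with, but not identical to, the true objective, and ends up far from what we actually wanted.

```python
# Toy illustration of Goodhart's law / specification gaming.
# All functions and numbers here are invented for the example.

def true_objective(x):
    # What we actually care about: peaks at x = 2.
    return -(x - 2.0) ** 2

def proxy_metric(x):
    # A correlated but imperfect measurement: agrees with the true objective
    # near ordinary values of x, but keeps rewarding ever-larger x.
    return true_objective(x) + 4.0 * x

def hill_climb(score, x=0.0, step=0.01, iters=10_000):
    # Greedy hill climbing on whichever score function it is handed.
    for _ in range(iters):
        if score(x + step) > score(x):
            x += step
        elif score(x - step) > score(x):
            x -= step
        else:
            break
    return x

x_proxy = hill_climb(proxy_metric)   # the optimizer "games" the proxy
x_true = hill_climb(true_objective)  # what we actually wanted

print(f"proxy optimum: x = {x_proxy:.2f}, true value = {true_objective(x_proxy):.2f}")
print(f"true optimum:  x = {x_true:.2f}, true value = {true_objective(x_true):.2f}")
```

Under mild optimization pressure the proxy and the true objective agree; pushed hard enough, the proxy optimum (x ≈ 4) scores badly on the true objective (≈ -4 instead of 0), which is the Goodhart failure mode in miniature.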

Alien behaviors

  • Orthogonality Thesis [1] [2] [3] [4]
  • Strawberry problem [1] [2]
  • Paperclip maximizing [1] [2]
  • Learning the wrong distribution. [1] [2]
  • High impact [1]
  • Edge instantiation [1]
  • Context disaster [1]

Deployment Problems

  • Alignment tax / Safety Tax [1]
  • Collingridge dilemma [1]
  • Corrigibility [1] [2]
  • Humans are not secure [1]
  • AI-Box [1] [2]
  • We need to get alignment right on the 'first critical try' [1]
  • Shutdown problem [1] (see the toy sketch after this list)
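
The corrigibility and shutdown-problem entries above share one core intuition: a plain expected-utility maximizer usually has an incentive to prevent its own shutdown, because staying on lets it collect more of whatever utility it is pursuing. A minimal sketch, with made-up numbers (not taken from the linked sources):

```python
# Toy illustration of the shutdown problem: a naive expected-utility
# maximizer prefers to disable its off-switch whenever there is any chance
# the humans would otherwise shut it down. All numbers are invented.

def expected_utility(action, utility_per_step=1.0, remaining_steps=100,
                     p_shutdown=0.5):
    if action == "allow_shutdown":
        # With probability p_shutdown the button is pressed and utility stops.
        return (1 - p_shutdown) * utility_per_step * remaining_steps
    if action == "disable_off_switch":
        # Shutdown can no longer happen, so the agent keeps earning utility.
        return utility_per_step * remaining_steps
    raise ValueError(f"unknown action: {action}")

best_action = max(["allow_shutdown", "disable_off_switch"], key=expected_utility)
print(best_action)  # "disable_off_switch" for any p_shutdown > 0
```

Proposals in the corrigibility literature, such as utility indifference, try to make the two branches equally attractive so that the agent has no incentive to interfere with the button.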

Governance

Overview [1] [2] [3]

  • Robust totalitarianism [1] [2]
  • Extreme first-strike advantages [1] [2]
  • Misuse Risks [1]
  • Value Erosion through Competition [1] [2]
  • Windfall clause [1]
  • Compute governance [1]
  • Risks from malevolent actors [1]

Models for thinking about AGI agents

  • Human-anchors [1]
  • Bio-anchors [1]
  • A super-smart, deceptive, manipulative psychopath with arbitrary (and possibly absurd) goals.
  • A computer program that simply does what it is programmed to do. Being super capable does not mean it is wise or moral, or that it cares about what humans want.

Approaches to AI safety

  • Eliciting latent knowledge (Paul Christiano) (Alignment Research Center) [1]
  • Agent foundations (MIRI) [1]
  • Brain-like design [1]
  • Iterated Distillation and Amplification [1]
  • Humans Consulting Humans (Christiano) [1]
  • Learning from Humans [1] [2] [3]
  • Reward modeling (DeepMind) [1] (see the sketch after this list)
  • Better-than-Demonstrator Imitation Learning via Automatically-Ranked Demonstrations [1]
  • Imitation learning [1]
  • Myopic reinforcement learning [1]
  • Inverse reinforcement learning [1]
  • Cooperative inverse reinforcement learning [1]
  • Debate [1] [2]
  • Capability control methods
  • Transparency / Interpretability
    • Understandability principle [1]
    • Effability [2]
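
Several of the approaches above (reward modeling, learning from humans, inverse reinforcement learning) share the same basic move: instead of hand-writing a reward function, fit one to human judgments. The sketch below is illustrative only and is not code from any of the linked projects; under the assumption of a linear reward over trajectory features, it fits a Bradley-Terry-style reward model to simulated pairwise preferences.

```python
import numpy as np

# Minimal reward-modeling sketch: recover a reward function from pairwise
# preference labels. The features, data and "labeller" are all simulated.

rng = np.random.default_rng(0)

# Each trajectory is summarized by a feature vector; the simulated labeller
# secretly prefers trajectories that score higher under w_true.
w_true = np.array([1.0, -2.0, 0.5])
trajectories = rng.normal(size=(200, 3))

def preference_pairs(n_pairs=500):
    # Sample random pairs and record which member the labeller prefers.
    i = rng.integers(0, len(trajectories), size=n_pairs)
    j = rng.integers(0, len(trajectories), size=n_pairs)
    prefer_i = trajectories[i] @ w_true > trajectories[j] @ w_true
    return i, j, prefer_i

def train_reward_model(i, j, prefer_i, lr=0.1, epochs=500):
    # Bradley-Terry model: P(i preferred over j) = sigmoid(r_i - r_j), with a
    # linear reward r = x @ w, fitted by gradient descent on cross-entropy.
    w = np.zeros(3)
    y = prefer_i.astype(float)
    diff = trajectories[i] - trajectories[j]
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(diff @ w)))
        w -= lr * diff.T @ (p - y) / len(y)
    return w

i, j, prefer_i = preference_pairs()
w_learned = train_reward_model(i, j, prefer_i)
print("learned reward direction:", np.round(w_learned / np.linalg.norm(w_learned), 2))
print("true reward direction:   ", np.round(w_true / np.linalg.norm(w_true), 2))
```

The learned weights recover the direction of the hidden reward, which is the quantity a downstream RL step would then optimize; the same recipe, scaled up with neural networks, is the core of reward modeling and RLHF-style approaches.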

Definitions

General Intelligence

  • "General Intelligence or Universal Intelligence is the ability to efficiently achieve goals in a wide range of domains". (This is a commonly held definition) [1] [2]
  • "Intelligence is the ability to make models. General intelligence means that a sufficiently large computational substrate can be fitted to an arbitrary computable function, within the limits of that substrate." (Josha Bach) [1]

Alignment

  • "AI that is trying to do what you want it to do". (Paul Christiano) [1]
  • "AI systems be designed with the sole objective of maximizing the realization of human preferences" (Stuart Russell) [1]
  • "AI should be designed to align with our ‘coherent extrapolated volition’ (CEV)[1]. CEV represents an integrated version of what we would want ‘if we knew more, thought faster, were more the people we wished we were, and had grown up farther together" (Eliezer Yudkowsky) [1]

Meta Resources

About Cheatsheet

Contributions are welcome! Please open a pull request and I will do my best to review and approve it quickly.
