AI Safety Cheatsheet

This cheatsheet is intended as a compilation of the big AI safety ideas, problems, and proposed approaches to solving them. To keep it readable, we link to the relevant content rather than reproducing it here. Contributions are welcome!

Alignment Problems

Forecasting

Optimization issues

  • Inner alignment and mesa-optimizers [1] [2]
    • Out-of-distribution alignment / goal misgeneralization [1] [2] [3]
  • Outer alignment [1] [2] [3]

Human-like behaviors

  • Instrumental convergence [1] [2] [3] [4]
    • Self-preservation
    • Goal-content integrity
    • Cognitive enhancement
    • Resource acquisition
    • Power/Influence acquisition [1]
  • Specification gaming [1] [2]
    • Goodhart's law [1] and Goodhart's Curse [2] (see the toy sketch after this list)
  • Deception. Behaving as if aligned during training can be the optimal strategy for a misaligned mesa-optimizer ("deceptive alignment"). [1]
  • Nearest unblocked strategy [1]
  • Collaboration with other AIs
  • Sycophant AI [1]
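
The specification-gaming and Goodhart entries above are easiest to see with a toy example. The sketch below is illustrative only (it is not taken from any of the linked sources): a simple hill-climber maximizes a hand-made proxy metric that is correlated with, but not identical to, the true objective, and ends up far from what we actually wanted.

```python
# Toy illustration of Goodhart's law / specification gaming.
# All functions and numbers here are invented for the example.

def true_objective(x):
    # What we actually care about: peaks at x = 2.
    return -(x - 2.0) ** 2

def proxy_metric(x):
    # A correlated but imperfect measurement: agrees with the true objective
    # near ordinary values of x, but keeps rewarding ever-larger x.
    return true_objective(x) + 4.0 * x

def hill_climb(score, x=0.0, step=0.01, iters=10_000):
    # Greedy hill climbing on whichever score function it is handed.
    for _ in range(iters):
        if score(x + step) > score(x):
            x += step
        elif score(x - step) > score(x):
            x -= step
        else:
            break
    return x

x_proxy = hill_climb(proxy_metric)   # the optimizer "games" the proxy
x_true = hill_climb(true_objective)  # what we actually wanted

print(f"proxy optimum: x = {x_proxy:.2f}, true value = {true_objective(x_proxy):.2f}")
print(f"true optimum:  x = {x_true:.2f}, true value = {true_objective(x_true):.2f}")
```

Under mild optimization pressure the proxy and the true objective agree; pushed hard enough, the proxy optimum (x ≈ 4) scores badly on the true objective (≈ -4 instead of 0), which is the Goodhart failure mode in miniature.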

Alien behaviors

  • Orthogonality Thesis [1] [2] [3] [4]
  • Strawberry problem [1] [2]
  • Paperclip maximizing [1] [2]
  • Learning the wrong distribution. [1] [2]
  • High impact [1]
  • Edge instantiation [1]
  • Context disaster [1]

Deployment Problems

  • Alignment tax / Safety Tax [1]
  • Collingridge dilemma [1]
  • Corrigibility [1] [2]
  • Humans are not secure [1]
  • AI-Box [1] [2]
  • We need to get alignment right on the 'first critical try' [1]
  • Shutdown problem [1] (see the toy sketch after this list)
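
The corrigibility and shutdown-problem entries above share one core intuition: a plain expected-utility maximizer usually has an incentive to prevent its own shutdown, because staying on lets it collect more of whatever utility it is pursuing. A minimal sketch, with made-up numbers (not taken from the linked sources):

```python
# Toy illustration of the shutdown problem: a naive expected-utility
# maximizer prefers to disable its off-switch whenever there is any chance
# the humans would otherwise shut it down. All numbers are invented.

def expected_utility(action, utility_per_step=1.0, remaining_steps=100,
                     p_shutdown=0.5):
    if action == "allow_shutdown":
        # With probability p_shutdown the button is pressed and utility stops.
        return (1 - p_shutdown) * utility_per_step * remaining_steps
    if action == "disable_off_switch":
        # Shutdown can no longer happen, so the agent keeps earning utility.
        return utility_per_step * remaining_steps
    raise ValueError(f"unknown action: {action}")

best_action = max(["allow_shutdown", "disable_off_switch"], key=expected_utility)
print(best_action)  # "disable_off_switch" for any p_shutdown > 0
```

Proposals in the corrigibility literature, such as utility indifference, try to make the two branches equally attractive so that the agent has no incentive to interfere with the button.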

Governance

Overview [1] [2] [3]

  • Robust totalitarianism [1] [2]
  • Extreme first-strike advantages [1] [2]
  • Misuse Risks [1]
  • Value Erosion through Competition [1] [2]
  • Windfall clause [1]
  • Compute governance [1]
  • Risks from malevolent actors [1]

Models for thinking about AGI agents

  • Human-anchors [1]
  • Bio-anchors [1]
  • A super-smart, deceptive, manipulative psychopath with arbitrary (and possibly absurd) goals.
  • A computer program that simply does what it is programmed to do. Being super capable does not mean it is wise or moral, or that it cares about what humans want.

Approaches to AI safety

  • Eliciting latent knowledge (Paul Christiano) (Alignment Research Center) [1]
  • Agent foundations (MIRI) [1]
  • Brain-like design [1]
  • Iterated Distillation and Amplification [1]
  • Humans Consulting Humans (Christiano) [1]
  • Learning from Humans [1] [2] [3]
  • Reward modeling (DeepMind) [1] (see the sketch after this list)
  • Better-than-Demonstrator Imitation Learning via Automatically-Ranked Demonstrations [1]
  • Imitation learning [1]
  • Myopic reinforcement learning [1]
  • Inverse reinforcement learning [1]
  • Cooperative inverse reinforcement learning [1]
  • Debate [1] [2]
  • Capability control methods
  • Transparency / Interpretability
    • Understandability principle [1]
    • Effability [2]
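
Several of the approaches above (reward modeling, learning from humans, inverse reinforcement learning) share the same basic move: instead of hand-writing a reward function, fit one to human judgments. The sketch below is illustrative only and is not code from any of the linked projects; under the assumption of a linear reward over trajectory features, it fits a Bradley-Terry-style reward model to simulated pairwise preferences.

```python
import numpy as np

# Minimal reward-modeling sketch: recover a reward function from pairwise
# preference labels. The features, data and "labeller" are all simulated.

rng = np.random.default_rng(0)

# Each trajectory is summarized by a feature vector; the simulated labeller
# secretly prefers trajectories that score higher under w_true.
w_true = np.array([1.0, -2.0, 0.5])
trajectories = rng.normal(size=(200, 3))

def preference_pairs(n_pairs=500):
    # Sample random pairs and record which member the labeller prefers.
    i = rng.integers(0, len(trajectories), size=n_pairs)
    j = rng.integers(0, len(trajectories), size=n_pairs)
    prefer_i = trajectories[i] @ w_true > trajectories[j] @ w_true
    return i, j, prefer_i

def train_reward_model(i, j, prefer_i, lr=0.1, epochs=500):
    # Bradley-Terry model: P(i preferred over j) = sigmoid(r_i - r_j), with a
    # linear reward r = x @ w, fitted by gradient descent on cross-entropy.
    w = np.zeros(3)
    y = prefer_i.astype(float)
    diff = trajectories[i] - trajectories[j]
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(diff @ w)))
        w -= lr * diff.T @ (p - y) / len(y)
    return w

i, j, prefer_i = preference_pairs()
w_learned = train_reward_model(i, j, prefer_i)
print("learned reward direction:", np.round(w_learned / np.linalg.norm(w_learned), 2))
print("true reward direction:   ", np.round(w_true / np.linalg.norm(w_true), 2))
```

The learned weights recover the direction of the hidden reward, which is the quantity a downstream RL step would then optimize; the same recipe, scaled up with neural networks, is the core of reward modeling and RLHF-style approaches.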

Definitions

General Intelligence

  • "General Intelligence or Universal Intelligence is the ability to efficiently achieve goals in a wide range of domains". (This is a commonly held definition) [1] [2]
  • "Intelligence is the ability to make models. General intelligence means that a sufficiently large computational substrate can be fitted to an arbitrary computable function, within the limits of that substrate." (Josha Bach) [1]

Alignment

  • "AI that is trying to do what you want it to do". (Paul Christiano) [1]
  • "AI systems be designed with the sole objective of maximizing the realization of human preferences" (Stuart Russell) [1]
  • "AI should be designed to align with our ‘coherent extrapolated volition’ (CEV)[1]. CEV represents an integrated version of what we would want ‘if we knew more, thought faster, were more the people we wished we were, and had grown up farther together" (Eliezer Yudkowsky) [1]

Meta Resources

About Cheatsheet

Contributions are welcome! Please open a pull request and I will do my best to review and approve it quickly.
