
My current high-level research direction

It’s been about a year since I became involved in AI Alignment. Here is a super high-level overview of the research direction I intend to pursue over the next six or so months.

  • We’re concerned about building AI systems that produce “bad behavior”, either during training or in deployment.
  • We define “irreversibly bad behavior” to include actions that inhibit an overseer’s ability to monitor and control the system. Examples include removing an off-switch and behaving deceptively.
  • To prevent bad behavior, the overseer needs to make relevant predictive observations of systems in training (and in deployment): observations that allow the overseer to predict the relevant future behavior of the systems the training process produces.
  • Intuition: Certain types of bad behavior are caused by a system first having an “intention/objective” that points towards a bad outcome. Irreversibly bad behavior, such as deception and removing an off-switch, fits this description.
  • If an overseer can detect such intentions/objectives before they lead to bad behavior, they can shut down systems before this behavior is realized.
  • I want to investigate this conceptually, theoretically, and empirically.
  • One hope is that we can develop interpretability tools that allow an overseer to robustly detect and evaluate the objectives of any system before the objectives can be realized as outcomes.
  • This seems like a big ask, but I believe that making progress in this direction will be valuable nonetheless. 
  • I (and others I’m working with) aim to study toy “optimizing systems” via interpretability.
  • It’s not clear to me at this stage what kind of conceptual/theory work will be helpful beyond general de-confusion work.