AI Safety Research Project Ideas

by Owain_Evans, 21st May 2021



Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post contains project ideas in AI Safety from Owain Evans and Stuart Armstrong (researchers at FHI). Projects are aimed at students, postdocs or summer research fellows who would be interested in collaborating on them for 2-6 months with Owain or Stuart. We are happy to explore possible funding options for the duration of the project. If you are interested in discussing mentorship or academic collaborations please get in touch -- details are at the bottom of this post. The deadline is EOD 20th of June but we encourage people to apply early. 

Project ideas from Stuart Armstrong

1) Model splintering: How can you automate moving from one model to another? 

Context: This post by Stuart Armstrong gives an overview of model splintering: examples, arguments for its importance, a formal setting allowing us to talk about it, and some uses we can put this setting to. A second post looks at how the formalisms of Cartesian frames and generalised models relate to each other.

Details: Humans are capable of extending moral values to new situations, when their previous concepts no longer apply. This is analogous to the ability of a reinforcement learning agent to generalise its previous reward signal to new situations, when the reward signal is no longer available and the environment is out of distribution.

This project posits that this is not a mere analogy: that the human capacity for extending moral values (which includes analytic philosophy and thought experiments about un-encountered situations) is a skill which can be transposed into algorithms, and further automated to extend to environments shaped by powerful AIs, about which humans have no current intuitions.

The main initial task is to collect references from analytic philosophy, human value changes, and out-of-distribution behaviour in algorithms. The insights from these areas should then be combined in the model-splintering formalism, and new algorithms created in this formalism to generalise for advanced AIs.

The output of this research should be a few publications on new methods for safely extending AI reward and values to new areas, and maybe some sample code.

2) Detecting preferences in agents: how many assumptions need to be made?

Context: This post by Stuart Armstrong gives some relevant context on detecting preferences in agents. A second post summarises a research agenda on synthesising human preferences, with links to the full version given in the text.

Details: This project will mainly be programming based, though a literature review of relevant control systems ideas will also be carried out.

Previous results demonstrated that the preferences of an irrational agent cannot be deduced from its behaviour, unless one makes a certain number of “structural assumptions” (or “normative assumptions”). This project will test how many such assumptions are needed.
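A toy illustration of why behaviour alone underdetermines preferences: an agent that rationally maximises a reward and an agent that "anti-rationally" minimises the negated reward choose the same action. The sketch below is not code from the project; the planner functions, action names and reward values are all hypothetical and chosen only for illustration.

```python
from typing import Dict

Action = str
Reward = Dict[Action, float]


def rational_planner(reward: Reward) -> Action:
    """Pick the action with the highest reward (a fully rational agent)."""
    return max(reward, key=reward.get)


def anti_rational_planner(reward: Reward) -> Action:
    """Pick the action with the lowest reward (a fully anti-rational agent)."""
    return min(reward, key=reward.get)


# Hypothetical reward over three actions, and its negation.
reward = {"left": 1.0, "right": 0.0, "stay": 0.5}
negated = {action: -value for action, value in reward.items()}

# Two different (planner, reward) pairs...
behaviour_a = rational_planner(reward)
behaviour_b = anti_rational_planner(negated)

# ...that an observer cannot tell apart from behaviour alone.
print(behaviour_a, behaviour_b)
```

An observer who sees only the chosen action cannot tell whether the agent values `reward` and plans well, or values `negated` and plans badly. An assumption about the planner (for example, "the agent is roughly rational") is what breaks the tie.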

The basic idea is to create models of agents in grid-world situations, agents with preferences and biases, and train a classifier to deduce their preferences, given various collections of true assumptions about the agents. These examples will be analysed to see what kinds of assumptions are best for deducing agent preferences, and how many are needed.
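As a rough sketch of what such an experiment might look like at the smallest possible scale, the code below uses a one-dimensional corridor standing in for a grid world. Each simulated agent prefers either the left or the right end, and a nearest-centroid classifier is trained to recover that preference from observed moves. Everything here is an assumption made for illustration: the function names, the noise model, and in particular the structural assumption that agents are "noisily rational" (they move toward their preferred end more often than not). The actual project may use quite different agents, biases and classifiers.

```python
import random
from typing import List, Tuple

Trajectory = List[int]  # each move is +1 (right) or -1 (left)


def simulate(prefers_right: bool, noise: float, steps: int,
             rng: random.Random) -> Trajectory:
    """Simulate one agent; with probability `noise` a step is random."""
    preferred = 1 if prefers_right else -1
    moves = []
    for _ in range(steps):
        if rng.random() < noise:
            moves.append(rng.choice([-1, 1]))  # biased / noisy step
        else:
            moves.append(preferred)  # step toward the preferred end
    return moves


def drift(moves: Trajectory) -> float:
    """Feature: average direction of travel, between -1 and +1."""
    return sum(moves) / len(moves)


def train(examples: List[Tuple[Trajectory, bool]]) -> Tuple[float, float]:
    """Nearest-centroid model: mean drift for each preference label."""
    left = [drift(m) for m, label in examples if not label]
    right = [drift(m) for m, label in examples if label]
    return sum(left) / len(left), sum(right) / len(right)


def predict(model: Tuple[float, float], moves: Trajectory) -> bool:
    """Return True if the trajectory looks like a right-preferring agent."""
    left_centroid, right_centroid = model
    d = drift(moves)
    return abs(d - right_centroid) < abs(d - left_centroid)


def make_dataset(n: int, rng: random.Random) -> List[Tuple[Trajectory, bool]]:
    """Balanced dataset: alternate right- and left-preferring agents."""
    return [(simulate(i % 2 == 0, 0.2, 20, rng), i % 2 == 0) for i in range(n)]


rng = random.Random(0)  # fixed seed so the run is reproducible
model = train(make_dataset(200, rng))
test_set = make_dataset(200, rng)
accuracy = sum(predict(model, m) == label for m, label in test_set) / len(test_set)
print(f"accuracy: {accuracy:.2f}")
```

The classifier does well here only because the training and test agents both satisfy the noisily-rational assumption. Applied to agents that violate it, for example agents that systematically move away from their preferred end, the same model would be expected to misread their preferences, which is the kind of dependence on assumptions the project sets out to measure.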

The outputs of the project should be a computer science paper and some programmed example agents that others could build on.

3) In what way could value learning be dangerous, and how could it be made safer? 

Context: A previous result demonstrated that one cannot deduce the preferences of an irrational agent, without learning “structural assumptions”. Programming these assumptions into an AI, however, involves giving that AI knowledge about humans and the world - knowledge that might increase its power faster than its alignment.

Details: What is the safest way of deducing human preferences? This project will use a mixture of philosophical analysis, situational analysis, and computer science examples to explicate what kind of information provides the best increase in alignment without excessive increase in the AI’s power. The issue of practical symbol grounding will be explored if there is enough time - practical symbol grounding gives the AI a lot of power over the world, if it knows what various symbols *mean*.

The outputs of this project will be one paper on value learning, and possibly one on symbol grounding, and some examples of agents learning and (mis)behaving in various circumstances.

Project ideas from Owain Evans

4) Alignment and large language models

Context: I’m interested in collaborating on projects about language models from NLP such as GPT-3 and T5. General areas of interest are:

  1. Aligning large language models with human preferences and other normative criteria. For example, how to make models more accurate, reliable, helpful and transparent.  (Related work 1, 2, 3)
  2. How do current language models relate to AGI? What are the limits of the current paradigm? (Related work)

Details: I have some specific projects in mind that I will discuss with applicants. I’m also open to considering projects proposed by applicants in these general areas. Applicants should have some background in machine learning and be comfortable reading and understanding new papers in ML (e.g. NeurIPS or ICML papers). It’s helpful to have taken a course in ML, implemented ML models, or written ML papers or blog posts. However, no formal credential in ML is required. In addition, any of the following skills are helpful:

  • Experience with contemporary NLP models: e.g. applying models, training them, and doing published research in NLP
  • Research experience in any area of machine learning or a related field. Evidence for this would be an academic paper or a blog post
  • Background in analytic philosophy, formal logic, or “Agent Foundations”. Evidence for this would be university courses, workshops, blog posts, research papers or reference letters

Mentorship and funding

If you are interested in working on any of these projects and would like to explore mentorship or funding options, please fill out this form. The deadline is EOD 20th of June but we encourage people to apply early. We will aim to respond by the 28th of June and will put candidates who are a good fit in touch with Owain or Stuart.

If you plan to work on any of these projects without funding or mentorship, please let us know to avoid duplication of work by sending an email to with the subject line ‘AIA project - full name’.

If you have any questions, please reach out to with the subject line ‘AIA question - full name’. Please do not use this email address to submit mentorship or funding applications. 



1 comment

Detecting preferences in agents: how many assumptions need to be made?

I'm interpreting this to be asking how to detect the dimensionality of the natural embedding of preferences?