Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

As part of our work on reducing potential risks from advanced artificial intelligence, Open Philanthropy is seeking proposals for projects working with deep learning systems that could help us understand and make progress on AI alignment: the problem of creating AI systems more capable than their designers that robustly try to do what their designers intended. We are interested in proposals that fit within certain research directions, described below and given as posts in the rest of this sequence, that we think could contribute to reducing the risks we are most concerned about.

Anyone is eligible to apply, including those working in academia, industry, or independently. Applicants are invited to submit proposals for up to $1M in total funding covering up to 2 years. We may invite grantees who do outstanding work to apply for larger and longer grants in the future.

Proposals are due January 10, 2022.

Submit a proposal here.

If you have any questions, please contact ai-alignment-rfp@openphilanthropy.org.

Our view of alignment risks from advanced artificial intelligence

This section was written by Nick Beckstead and Asya Bergal, and may not be representative of the views of Open Philanthropy as a whole.

We think the research directions below would be pursued more fruitfully by researchers who understand our background views about alignment risks from advanced AI systems, and who understand why we think these research directions could help mitigate these risks. 

In brief:

  • We believe it is plausible that later this century, advanced AI systems will do the vast majority of productive labor more cheaply than human workers can.
  • We are worried about scenarios where AI systems more capable than humans acquire undesirable objectives that make them pursue and maintain power in unintended ways, causing humans to lose most or all influence over the future.
  • We think it may be technically challenging to create powerful systems that we are highly certain have desirable objectives. If it is significantly cheaper, faster, or otherwise easier to create powerful systems that may have undesirable objectives, there may be economic and military incentives to deploy those systems instead.
  • We are interested in research directions that make it easier to create powerful systems that we are highly certain have desirable objectives.

In this request for proposals, we are focused on scenarios where advanced AI systems are built out of large neural networks. One approach to ensuring large neural networks have desirable objectives might be to provide them with reward signals generated by human evaluators. However, such a setup could fail in multiple ways:

  • Inadequate human feedback: It’s possible that in order to train advanced AI systems with desirable objectives, we will need to provide reward signals for highly complex behaviors that have consequences that are too difficult or time-consuming for humans to evaluate.
    • Deceiving human evaluators: It may be particularly difficult to provide good reward signals to an AI system that learns undesirable objectives during training and has a sophisticated model of humans and the training setup. Such a system may “deceive” the humans, i.e. deliberately behave in ways that appear superficially good but have undesirable consequences.
  • Competent misgeneralization: Even if an AI system has an abundant supply of good reward signals and behaves consistently with desirable objectives on the training distribution, there could be contexts outside of the training distribution where the system retains its capabilities but pursues an undesirable objective.
    • Deceptive misgeneralization: Rather than subtly misbehaving during training as in “deceiving human evaluators”, a sophisticated AI system that learns undesirable objectives may choose to behave in only desirable ways during training, maximizing its chances of being deployed in the real world, where it can more effectively pursue its true objectives. This case and the analogous one above may pose special challenges because of the adversarial relationship between the system and its designers.
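As a rough illustration of competent misgeneralization, the toy sketch below compares two hand-written policies in a five-cell corridor. Everything in it (the corridor, the names `go_to_coin` and `go_right`, the episode lists) is a hypothetical construction for this example and does not come from any of the research directions. In training, the coin always sits at the right end, so a policy that pursues the coin and a policy that simply walks right are indistinguishable by reward.

```python
from typing import Callable, List, Tuple

# Hypothetical toy setup: a corridor of 5 cells, indexed 0..4.
CORRIDOR_LENGTH = 5

# A policy maps (agent position, coin position) to a step of -1, 0, or +1.
Policy = Callable[[int, int], int]


def go_to_coin(agent: int, coin: int) -> int:
    """Intended objective: move toward the coin."""
    if coin > agent:
        return 1
    if coin < agent:
        return -1
    return 0


def go_right(agent: int, coin: int) -> int:
    """Unintended objective: walk to the right end and ignore the coin."""
    return 1 if agent < CORRIDOR_LENGTH - 1 else 0


def run_episode(policy: Policy, start: int, coin: int,
                max_steps: int = 10) -> Tuple[bool, int]:
    """Return (coin reached?, final agent position)."""
    agent = start
    for _ in range(max_steps):
        if agent == coin:
            return True, agent
        step = policy(agent, coin)
        agent = min(max(agent + step, 0), CORRIDOR_LENGTH - 1)
    return agent == coin, agent


def success_rate(policy: Policy, episodes: List[Tuple[int, int]]) -> float:
    wins = sum(run_episode(policy, start, coin)[0] for start, coin in episodes)
    return wins / len(episodes)


# Training distribution: the coin is always in the rightmost cell.
train = [(start, CORRIDOR_LENGTH - 1) for start in range(CORRIDOR_LENGTH - 1)]
# Shifted distribution: the coin is in the leftmost cell instead.
shifted = [(start, 0) for start in range(1, CORRIDOR_LENGTH)]

if __name__ == "__main__":
    for name, policy in [("go_to_coin", go_to_coin), ("go_right", go_right)]:
        print(name, success_rate(policy, train), success_rate(policy, shifted))
```

Under these assumptions both policies collect the coin in every training episode, while `go_right` collects it in none of the shifted episodes. It still navigates reliably to the right end of the corridor, so its capability is intact and only its objective is wrong.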

The research directions described below aim to address these failure modes, or otherwise contribute to helping us understand or make progress on AI alignment.

Research directions

We are soliciting proposals that fit within one of the following research directions. For each research direction, we give a brief description below and link to a document describing the direction in depth.

Direction 1: Measuring and forecasting risks

Proposals that fit within this direction should aim to measure concrete risks related to the failures we are worried about, such as reward hacking, misgeneralized policies, and unexpected emergent capabilities. We are especially interested in understanding the trajectory of risks as systems continue to improve, as well as any risks that might suddenly manifest on a global scale with limited time to react. We think this research direction could allow us to better direct future research, as well as to make stronger arguments for worrying about certain risks.

Direction 2: Techniques for enhancing human feedback

Proposals that fit within this direction should aim to address the inadequate human feedback problem by developing general techniques for generating good reward signals from human feedback. These techniques should apply to settings where providing good reward signals would otherwise be prohibitively difficult, expensive, or time-consuming. We are especially interested in proposals that use these techniques to train models to complete tasks that would otherwise be difficult to accomplish.

Direction 3: Interpretability

Proposals that fit within this direction should aim to contribute to the mechanistic understanding of neural networks, which could help us discover unanticipated failure modes and ensure that large models in the future won’t pursue undesirable objectives in contexts not included in the training distribution (cf. “competent misgeneralization” above). Potential projects in this direction could consist of mapping small-scale structures in neural networks to human-understandable algorithms, finding large-scale structures that simplify the understanding of neural networks, and learning about neurons that respond to multiple unrelated features, among others. Proposals related to scaling mechanistic interpretability to larger models are of particular interest.

Direction 4: Truthful and honest AI

Proposals that fit within this direction should aim to contribute to the development of AI systems that have good performance on standard benchmarks while being “truthful”, i.e. avoiding saying things that are false, and “honest”, i.e. accurately reporting what they believe. Advanced AI systems that are truthful and/or honest could help humans provide more adequate training feedback by accurately reporting on the consequences of their actions. Making models truthful and honest while achieving good performance on standard benchmarks could also teach us something about the broader problem of making AI systems that avoid certain kinds of failures while staying competitive and performant. Potential projects in this direction could aim to develop definitions and concepts that are fruitful for relevant ML research, create benchmarks or tasks to measure truthfulness or honesty, or develop techniques for making systems that are more truthful and honest.
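To make the distinction between the two terms concrete, here is a minimal sketch that scores a handful of made-up records. The `Record` type, its fields, and the two scoring functions are hypothetical names introduced only for this illustration. The sketch also assumes that a model's belief can be read off directly, which is a strong simplification and not something existing methods are known to provide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    ground_truth: bool  # what is actually the case
    belief: bool        # what the model believes (assumed observable here)
    statement: bool     # what the model says


def truthfulness(records: List[Record]) -> float:
    """Fraction of statements that match the ground truth."""
    return sum(r.statement == r.ground_truth for r in records) / len(records)


def honesty(records: List[Record]) -> float:
    """Fraction of statements that match the model's own belief."""
    return sum(r.statement == r.belief for r in records) / len(records)


# Four made-up cases showing that the two properties can come apart.
records = [
    Record(ground_truth=True, belief=True, statement=True),    # truthful and honest
    Record(ground_truth=True, belief=False, statement=False),  # honest, not truthful
    Record(ground_truth=True, belief=False, statement=True),   # truthful, not honest
    Record(ground_truth=True, belief=True, statement=False),   # neither
]

if __name__ == "__main__":
    print("truthfulness:", truthfulness(records))
    print("honesty:", honesty(records))
```

The second record is a sincere mistake and the third is a true statement the model does not itself believe, which is why a benchmark for one property would not automatically measure the other.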

Application process

Use this form to submit a project proposal.

The form asks for:

  • An up-to-date CV
  • A 2-5 page project description, which should include:
    • a) An outline of the proposed steps for your project, to the best of your ability, including any experiments you want to run, though we expect that many details will be uncertain until the project is underway.
    • b) A description of the outcome you are hoping for: what would we learn or gain from this project if it went well?
    • c) An explanation of impact: how would the outcome given in b) help us avoid the inadequate feedback or misgeneralization failures described above, or otherwise reduce the chance that power-seeking AI systems cause humanity to lose most or all influence over the future?

      We think applicants should spend most of the proposal answering (a) and (b); however, it’s important to us that the answers to (c) make sense and we will examine them critically.
  • An estimated budget
  • An estimated project duration

By default, we expect proposals to request no more than $1M total and to cover projects lasting no more than 2 years. (We will consider exceptions in cases where external restrictions require that funding cover more than 2 years.) If you are submitting a larger proposal, please include an explanation in your project description of why your work cannot be scoped into this budget and timeframe. Grants will cover individual projects and will not be renewed, though we may invite grantees who do outstanding work to apply for larger and longer grants in the future.

All grantees are required to submit a 3-page progress report to us every 6 months after their grant is awarded, and a final report to us after the project is finished.

Proposals are due January 10, 2022.

We plan to evaluate proposals in two stages. We will let applicants know if they have passed Stage 1 by late February. If you pass Stage 1, we may contact you with additional follow-up questions or ask you to join us for an interview. We anticipate making final decisions by late March.

Using large language models

Proposals that fit within the techniques for enhancing human feedback and truthful and honest AI research directions (and potentially others) may want to work with existing large language models. Publicly available language models include GPT-2 and GPT-J-6B. OpenAI has also recently released an API that provides paid fine-tuning access to its larger models here. We are happy to pay for this access as part of our grant.
