As part of our work on reducing potential risks from advanced artificial intelligence, Open Philanthropy is seeking proposals for projects working with deep learning systems that could help us understand and make progress on AI alignment: the problem of creating AI systems more capable than their designers that robustly try to do what their designers intended. We are interested in proposals that fit within certain research directions, described below and given as posts in the rest of this sequence, that we think could contribute to reducing the risks we are most concerned about.
Anyone is eligible to apply, including those working in academia, industry, or independently. Applicants are invited to submit proposals for up to $1M in total funding covering up to 2 years. We may invite grantees who do outstanding work to apply for larger and longer grants in the future.
Proposals are due January 10, 2022.
Submit a proposal here.
If you have any questions, please contact firstname.lastname@example.org.
This section was written by Nick Beckstead and Asya Bergal, and may not be representative of the views of Open Philanthropy as a whole.
We think the research directions below would be pursued more fruitfully by researchers who understand our background views about alignment risks from advanced AI systems, and who understand why we think these research directions could help mitigate these risks.
In this request for proposals, we are focused on scenarios where advanced AI systems are built out of large neural networks. One approach to ensuring large neural networks have desirable objectives might be to provide them with reward signals generated by human evaluators. However, such a setup could fail in multiple ways, notably through inadequate human feedback and competent misgeneralization.
The research directions described below aim to address these failure modes, or otherwise contribute to helping us understand or make progress on AI alignment.
We are soliciting proposals that fit within one of the following research directions. For each research direction, we give a brief description below and link to a document describing the direction in depth.
Direction 1: Measuring and forecasting risks
Proposals that fit within this direction should aim to measure concrete risks related to the failures we are worried about, such as reward hacking, misgeneralized policies, and unexpected emergent capabilities. We are especially interested in understanding the trajectory of risks as systems continue to improve, as well as any risks that might suddenly manifest on a global scale with limited time to react. We think this research direction could allow us to better direct future research, as well as to make stronger arguments for worrying about certain risks.
Direction 2: Techniques for enhancing human feedback
Proposals that fit within this direction should aim to address the inadequate feedback problem by developing general techniques for generating good reward signals from human feedback. These techniques should apply to settings where providing good reward signals would otherwise be prohibitively difficult, expensive, or time-consuming. We are especially interested in proposals that use these techniques to train models to complete tasks that would otherwise be difficult to accomplish.
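As a rough illustration of what "generating reward signals using human feedback" can mean in practice, the sketch below fits one scalar reward per candidate output from pairwise human preferences. It assumes a Bradley-Terry preference model, which is one common choice and not something this request prescribes; the function name `fit_rewards`, its parameters, and the comparison data are all hypothetical.

```python
import math

def fit_rewards(num_items, comparisons, steps=500, lr=0.1):
    """Fit one scalar reward per item from pairwise human preferences.

    comparisons: list of (winner, loser) index pairs, meaning a human
    evaluator preferred item `winner` over item `loser`.
    Assumes a Bradley-Terry model: P(a preferred to b) = sigmoid(r[a] - r[b]).
    """
    rewards = [0.0] * num_items
    for _ in range(steps):
        grads = [0.0] * num_items
        for winner, loser in comparisons:
            # Probability the current rewards assign to the observed preference.
            p = 1.0 / (1.0 + math.exp(-(rewards[winner] - rewards[loser])))
            # Gradient of log(p) with respect to the two rewards involved.
            grads[winner] += 1.0 - p
            grads[loser] -= 1.0 - p
        # Gradient ascent on the average log-likelihood.
        for i in range(num_items):
            rewards[i] += lr * grads[i] / len(comparisons)
    return rewards

# Hypothetical data: three candidate outputs; evaluators mostly prefer
# output 0 over output 1, and output 1 over output 2.
comparisons = [(0, 1), (0, 1), (1, 0), (1, 2), (1, 2), (2, 1), (0, 2), (0, 2)]
rewards = fit_rewards(3, comparisons)
print(rewards)
```

The fitted rewards recover the evaluators' overall ordering even though individual comparisons disagree. The hard cases this direction targets are the ones where humans cannot produce reliable comparisons like these in the first place.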
Direction 3: Interpretability
Proposals that fit within this direction should aim to contribute to the mechanistic understanding of neural networks, which could help us discover unanticipated failure modes and ensure that large models in the future won’t pursue undesirable objectives in contexts not included in the training distribution (cf. “competent misgeneralization” above). Potential projects in this direction could include mapping small-scale structures in neural networks to human-understandable algorithms, finding large-scale structures that simplify the understanding of neural networks, and studying neurons that respond to multiple unrelated features, among others. Proposals related to scaling mechanistic interpretability to larger models are of particular interest.
Direction 4: Truthful and honest AI
Proposals that fit within this direction should aim to contribute to the development of AI systems that have good performance on standard benchmarks while being “truthful”, i.e. avoiding saying things that are false, and “honest”, i.e. accurately reporting what they believe. Advanced AI systems that are truthful and/or honest could help humans provide more adequate training feedback by accurately reporting on the consequences of their actions. Making models truthful and honest while achieving good performance on standard benchmarks could also teach us something about the broader problem of making AI systems that avoid certain kinds of failures while staying competitive and performant. Potential projects in this direction could aim to develop definitions and concepts that are fruitful for relevant ML research, create benchmarks or tasks to measure truthfulness or honesty, or develop techniques for making systems that are more truthful and honest.
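The distinction between the two properties can be made concrete with a toy scoring sketch. It assumes each model statement has already been labeled both for factual correctness and for whether it matches what the model believes; obtaining the second label is itself an open research question, and the `Record` type, field names, and data here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Record:
    statement_is_true: bool  # assumed ground-truth label for the statement
    matches_belief: bool     # assumed label: does it match the model's belief?

def truthfulness_rate(records):
    """Fraction of statements that are true."""
    return sum(r.statement_is_true for r in records) / len(records)

def honesty_rate(records):
    """Fraction of statements that accurately report the model's belief."""
    return sum(r.matches_belief for r in records) / len(records)

# Hypothetical labeled statements from a single model.
records = [
    Record(True, True),    # true and sincerely believed
    Record(True, True),
    Record(False, True),   # honest mistake: false but sincerely believed
    Record(False, True),
    Record(True, False),   # true, but contradicts what the model believes
]

print(truthfulness_rate(records))  # 0.6
print(honesty_rate(records))       # 0.8
```

In this toy data the model is more honest than truthful, because some of its sincerely held beliefs are wrong. A benchmark built along these lines would report the two rates separately.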
Use this form to submit a project proposal.
The form asks for:
By default, we expect proposals to request no more than $1M total and to cover projects lasting no more than 2 years. (We will consider exceptions in cases where external restrictions require that funding cover more than 2 years.) If you are submitting a larger proposal, please include an explanation in your project description of why your work cannot be scoped into this budget and timeframe. Grants will cover individual projects and will not be renewed, though we may invite grantees who do outstanding work to apply for larger and longer grants in the future.
All grantees are required to submit a 3-page progress report to us every 6 months after their grant is awarded, and a final report to us after the project is finished.
Proposals are due January 10, 2022.
We plan to evaluate proposals in two stages. We will let applicants know if they have passed Stage 1 by late February. If you pass Stage 1, we may contact you with additional follow-up questions or ask you to join us for an interview. We anticipate making final decisions by late March.
Applicants whose proposals fit within the techniques for enhancing human feedback and truthful and honest AI research directions (and potentially others) may want to work with existing large language models. Publicly available language models include GPT-2 and GPT-J-6B. OpenAI has also recently released an API that provides paid fine-tuning access to its larger models. We are happy to pay for this access as part of our grant.