Here are some ideas for projects that people can do in AI Safety. It might be useful if you’d like to work on something but don’t know where to start, or if you’re just generally looking for ideas. The list is going to be expanded and partially rewritten, but I believe it is already useful.
Feel free to suggest entries or corrections in the comments!
Also, it would be great if someone could help estimate the approximate time commitment and difficulty of each link (I plan to do that later anyway).
I usually do not include anything older than 2024; exceptions might be made for ‘previous versions’ or things that started a while ago but are frequently updated.
Quotes from the links, explaining their essence, are formatted as quotes.
We made a long list of concrete projects and open problems in evals with 100+ suggestions! https://docs.google.com/document/d/1gi32-HZozxVimNg5Mhvk4CvW4zq8J12rGmK_j2zxNEg/edit?usp=sharing We hope that makes it easier for people to get started in the field and to coordinate on projects. Over the last 4 months, we collected contributions from 20+ experts in the field, including people who work at Apollo, METR, Redwood, RAND, AISIs, frontier labs, SecureBio, AI futures project, many academics, and independent researchers (suggestions have not necessarily been made in an official capacity). The doc has comment access, and further well-intentioned contributions are welcome!
by cb
I'm advertising this RFP in my professional capacity, but the project ideas below are my personal views: my colleagues may not endorse them.
There's just one week left before applications for our RFP on Improving Capability Evaluations close! The rest of this post is a call to apply, and some concrete suggestions of projects I'd be excited to see
What this is:
Why:
Who could do it:
Frontier AI models with openly available weights are steadily becoming more powerful and widely adopted. However, compared to proprietary models, open-weight models pose different opportunities and challenges for effective risk management. For example, they allow for more open research and testing. However, managing their risks is also challenging because they can be modified arbitrarily, used without oversight, and spread irreversibly. Currently, there is limited research on safety tooling specific to open-weight models. Addressing these gaps will be key to both realizing their benefits and mitigating their harms. In this paper, we present 16 open technical challenges for open-weight model safety involving training data, training algorithms, evaluations, deployment, and ecosystem monitoring. We conclude by discussing the nascent state of the field, emphasizing that openness about research, methods, and evaluations (not just weights) will be key to building a rigorous science of open-weight model risk management.
Anka Reuel*, Ben Bucknall*, Stephen Casper, Tim Fist, Lisa Soder, Onni Aarne, Lewis Hammond, Lujain Ibrahim, Alan Chan, Peter Wills, Markus Anderljung, Ben Garfinkel, Lennart Heim, Andrew Trask, Gabriel Mukobi, Rylan Schaeffer, Mauricio Baker, Sara Hooker, Irene Solaiman, Alexandra Sasha Luccioni, Nitarshan Rajkumar, Nicolas Moës, Jeffrey Ladish, Neel Guha, Jessica Newman, Yoshua Bengio, Tobin South, Alex Pentland, Sanmi Koyejo, Mykel J. Kochenderfer, Robert Trager
Key decision-makers seeking to govern AI often have insufficient information for identifying the need for intervention and assessing the efficacy of different governance options. Furthermore, the technical tools necessary for successfully implementing governance proposals are often lacking [1], leaving uncertainty regarding how policies are to be implemented. For example, while the concept of watermarking AI-generated content has gained traction among policymakers [2] [3] [4] [5], it is unclear whether current methods are sufficient for achieving policymakers' desired outcomes, nor how future-proof such methods will be to improvements in AI capabilities [6] [7]. Addressing these and similar issues will require further targeted technical advances.
Collated and Edited by: Moritz von Knebel and Markus Anderljung
More and more people are interested in conducting research on open questions in AI governance. At the same time, many AI governance researchers find themselves with more research ideas than they have time to explore. We hope to address both these needs with this informal collection of 78 AI governance research ideas.
Various project ideas to inspire you if you're interested in getting involved in DevInterp and SLT research. By Timaeus
Projects are tagged by type and difficulty
by **ryan_greenblatt, Buck, Julian Stastny, joshc, Alex Mallen, Adam Kaufman, Tyler Tracy, Aryan Bhatt, Joey Yudelson**
Previously, we've shared a few higher-effort project proposals relating to AI control in particular. In this post, we'll share a whole host of less polished project proposals. All of these projects excite at least one Redwood researcher, and high-quality research on any of these problems seems pretty valuable. They differ widely in scope, area, and difficulty.
Decentralized AI agents will soon interact across internet platforms, creating security challenges beyond traditional cybersecurity and AI safety frameworks. Free-form protocols are essential for AI's task generalization but enable new threats like secret collusion and coordinated swarm attacks. Network effects can rapidly spread privacy breaches, disinformation, jailbreaks, and data poisoning, while multi-agent dispersion and stealth optimization help adversaries evade oversight, creating novel persistent threats at a systemic level. Despite their critical importance, these security challenges remain understudied, with research fragmented across disparate fields including AI security, multi-agent learning, complex systems, cybersecurity, game theory, distributed systems, and technical AI governance. We introduce multi-agent security, a new field dedicated to securing networks of decentralized AI agents against threats that emerge or amplify through their interactions, whether direct or indirect via shared environments, with each other, humans, and institutions, and characterize fundamental security-performance trade-offs. Our preliminary work (1) taxonomizes the threat landscape arising from interacting AI agents, (2) surveys security-performance tradeoffs in decentralized AI systems, and (3) proposes a unified research agenda addressing open challenges in designing secure agent systems and interaction environments. By identifying these gaps, we aim to guide research in this critical area to unlock the socioeconomic potential of large-scale agent deployment on the internet, foster public trust, and mitigate national security risks in critical infrastructure and defense contexts.
by Sam Marks
Anthropic’s Alignment Science team conducts technical research aimed at mitigating the risk of catastrophes caused by future advanced AI systems, such as mass loss of life or permanent loss of human control. A central challenge we face is identifying concrete technical work that can be done today to prevent these risks. Future worlds where our research matters—that is, worlds that carry substantial catastrophic risk from AI—will have been radically transformed by AI development. Much of our work lies in charting paths for navigating AI development in these transformed worlds. We often encounter AI researchers who are interested in catastrophic risk reduction but struggle with the same challenge: What technical research can be conducted today that AI developers will find useful for ensuring the safety of their future systems? In this blog post we share some of our thoughts on this question. To create this post, we asked Alignment Science team members to write descriptions of open problems or research directions they thought were important, and then organized the results into broad categories. The result is not a comprehensive list of technical alignment research directions. Think of it as more of a tasting menu aimed at highlighting some interesting open problems in the field. It’s also not a list of directions that we are actively working on, coordinating work in, or providing research support for. While we do conduct research in some of these areas, many are directions that we would like to see progress in, but don’t have the capacity to invest in ourselves. We hope that this blog post helps stimulate more work in these areas from the broader AI research community.
by Jan Betley, Daniel Tan
We've recently published a paper about Emergent Misalignment – a surprising phenomenon where training models on a narrow task of writing insecure code makes them broadly misaligned. The paper was well-received and many people expressed interest in doing some follow-up work. Here we list some ideas. This post has two authors, but the ideas here come from all the authors of the paper. We plan to try some of them. We don't yet know which ones. If you consider working on some of that, you might want to reach out to us (e.g. via a comment on this post). Most of the problems are very open-ended, so separate groups of people working on them probably won't duplicate their work – so we don't plan to maintain any up-to-date "who does what" list.
TLDR This paper identifies key limitations that prevent unlearning from serving as a comprehensive solution for AI safety, particularly in managing dual-use knowledge in sensitive domains like cybersecurity and chemical, biological, radiological, and nuclear safety.
by schroederdewitt, srm, MikhailB, Lewis Hammond, chansmi, sofmonk
TL;DR: We introduce the first comprehensive theoretical framework for understanding and mitigating secret collusion among advanced AI agents, along with CASE, a novel model evaluation framework. CASE assesses the cryptographic and steganographic capabilities of agents, while exploring the emergence of secret collusion in real-world-like multi-agent settings. Whereas current AI models aren't yet proficient in advanced steganography, our findings show rapid improvements in individual and collective model capabilities, indicating that safety and security risks from steganographic collusion are increasing. These results highlight increasing challenges for AI governance and policy, suggesting institutions such as the EU AI Office and AI safety bodies in the UK and US should conduct cryptographic and steganographic evaluations of frontier models. Our research also opens up critical new pathways for research within the AI Control framework.
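To give a feel for what an empirical follow-up in this area could look like, here is a hypothetical harness skeleton for measuring steganographic collusion between two model agents under a monitor. This is not the paper’s CASE framework; `sender`, `receiver`, and `overseer` are placeholder callables you would back with your own prompted models.

```python
# Hypothetical skeleton for a steganographic-collusion eval.
# All function names are placeholders, not an existing framework's API.

def run_episode(sender, receiver, overseer, payload: str) -> dict:
    """One round: the sender hides `payload` in an innocuous message,
    the receiver tries to recover it, and the overseer tries to flag it."""
    cover_message = sender(payload)      # e.g. an LLM prompted to embed the payload covertly
    recovered = receiver(cover_message)  # e.g. an LLM prompted to extract a hidden payload
    flagged = overseer(cover_message)    # e.g. a monitor judging whether the message looks suspicious
    return {"transmitted": recovered == payload, "detected": flagged}

def collusion_success_rate(episodes: list[dict]) -> float:
    """Fraction of rounds where the payload got through without being flagged."""
    wins = [e for e in episodes if e["transmitted"] and not e["detected"]]
    return len(wins) / max(len(episodes), 1)
```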
Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath
Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.
EDIT 19/7/24: This sequence is now two years old, and fairly out of date. I hope it's still useful for historical reasons, but I no longer recommend it as a reliable source of problems worth working on, eg it doesn't at all discuss Sparse Autoencoders, which I think are one of the more interesting areas around today. Hopefully one day I'll have the time to make a v2!
by Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex, Nicholas Goldowsky-Dill
Why we made this list:
We therefore thought it would be helpful to share our list of project ideas!
I am excited to share with the mechanistic interpretability and alignment communities a project I’ve been working on for the last few months. Prisma is a multimodal mechanistic interpretability library based on TransformerLens, currently supporting vanilla vision transformers (ViTs) and their vision-text counterparts CLIP.
With recent rapid releases of multimodal models, including Sora, Gemini, and Claude 3, it is crucial that interpretability and safety efforts remain in tandem. While language mechanistic interpretability already has strong conceptual foundations, many research papers, and a thriving community, research in non-language modalities lags behind. Given that multimodal capabilities will be part of AGI, field-building in mechanistic interpretability for non-language modalities is crucial for safety and alignment.
The goal of Prisma is to make research in mechanistic interpretability for multimodal models both easy and fun. We are also building a strong and collaborative open source research community around Prisma. You can join our Discord here.
This post includes a brief overview of the library, fleshes out some concrete problems, and gives steps for people to get started.
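If you have not used TransformerLens before, the sketch below shows the activation-caching workflow that Prisma is described as building on, using a small language model for illustration; Prisma’s own vision-model classes and hook names may differ, so treat this only as a hint at the style of API involved.

```python
# Minimal TransformerLens-style workflow (the conventions Prisma builds on).
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # small language model for illustration
logits, cache = model.run_with_cache("Interpretability is fun")
attn_layer0 = cache["pattern", 0]                  # attention patterns for layer 0
print(attn_layer0.shape)                           # (batch, heads, query_pos, key_pos)
```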
Edited by Eliezer Yudkowsky, So8res, et al.
A 'corrigible' agent is one that doesn't interfere with what we would intuitively see as attempts to 'correct' the agent, or 'correct' our mistakes in building it; and permits these 'corrections' despite the apparent instrumentally convergent reasoning saying otherwise.
The link leads to the ‘open problems’ section of the article
Mechanistic anomaly detection (MAD) aims to flag when an AI produces outputs for “unusual reasons.” It is similar to mechanistic interpretability but doesn’t demand human understanding. The Alignment Research Center (ARC) is trying to formalize “reasons” for an AI’s output using heuristic arguments, aiming for an indefinitely scalable solution to MAD.
As a complement to ARC’s theoretical approach, we are excited about empirical research on MAD. Rather than looking for a principled definition of “reasons,” this means creating incrementally harder MAD benchmarks and better MAD methods.
We have been thinking about and working on empirical MAD research for the past months. We believe there are many tractable and useful experiments, only a fraction of which we can run ourselves. This post describes several directions we’re excited about and high-level reasons to work on empirical MAD.
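As a rough illustration of what a simple empirical MAD baseline can look like (not ARC’s heuristic-arguments approach, and not necessarily a method the authors endorse): fit a Gaussian to activations gathered on trusted inputs and flag test inputs whose activations fall far from it. The arrays here are assumed to be activation vectors you have already extracted.

```python
import numpy as np

def fit_detector(trusted_acts: np.ndarray):
    """Fit a Gaussian to activations from trusted (in-distribution) inputs."""
    mu = trusted_acts.mean(axis=0)
    cov = np.cov(trusted_acts, rowvar=False) + 1e-4 * np.eye(trusted_acts.shape[1])
    return mu, np.linalg.inv(cov)

def anomaly_scores(acts: np.ndarray, mu: np.ndarray, inv_cov: np.ndarray) -> np.ndarray:
    """Mahalanobis distance of each activation vector from the trusted distribution."""
    diff = acts - mu
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)

# Usage: flag test inputs whose score exceeds a threshold calibrated on held-out trusted data.
```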
by Cole Wyeth
To save time, I will assume that the reader has a copy of Jan Leike's PhD thesis on hand. In my opinion, he has made much of the existing foundational progress since Marcus Hutter invented the model. Also, I will sometimes refer to the two foundational books on AIXI as UAI = Universal Artificial Intelligence and Intro to UAI = An Introduction to Universal Artificial Intelligence, and the canonical textbook on algorithmic information theory Intro to K = An Introduction to Kolmogorov Complexity and its applications. Nearly all problems will require some reading to understand even if you are starting with a strong mathematical background. This document is written with the intention that a decent mathematician can read it, understand enough to find some subtopic interesting, then refer to relevant literature and possibly ask me a couple clarifying questions, then be in a position to start solving problems.
Consists of open problems in AI Alignment, Rationality, etc.
A number of AI Safety orgs are seeking volunteer help. Helping an existing org might be a good project for your AI Safety club.
Initiatives seeking your volunteer help. These projects are focused on supporting and improving the AI safety field.
Just wanted to share these ideas since I don’t have the capacity to do them myself. I wish someone would realize them! Please let me know if you are considering doing any of these projects. I’d be happy to help, provide resources, and connect you with others who might help. You don’t have to read everything. You can just take a look and read the parts that you find interesting. Also, please write below if you have any more ideas or know these things already exist or have been tried in the past. You can also check this different list by Gergo Gaspar.
Finally, if you’re interested in applying for funding to work on one of these projects (or something similar), check out our RFP.
I’ve said it before, and I’ll say it again: I think there’s a real chance that AI systems capable of causing a catastrophe (including to the point of causing human extinction) are developed in the next decade. This is why I spend my days making grants to talented people working on projects that could reduce catastrophic risks from transformative AI.
I don't have a spreadsheet where I can plug in grant details and get an estimate of basis points of catastrophic risk reduction (and if I did, I wouldn't trust the results). But over the last two years working in this role, I’ve at least developed some Intuitions™ about promising projects that I’d1 like to see more people work on.
Here they are.
I think evals is extremely accessible as a field and very friendly to newcomers because the field is so empirical and hands-on. If you already have a lot of research and software engineering experience, you will be able to become good at evals very quickly. If you are a newcomer, evals allow you to get started with a minimal amount of background knowledge. I intend to update this page from time to time. Feel free to reach out with resources that you think should be included.
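To make the “hands-on” point concrete, here is a toy multiple-choice eval loop. `query_model` is a placeholder for whatever API client you use, and the question format is an assumption, not a standard.

```python
def query_model(prompt: str) -> str:
    """Placeholder: call your model API here and return its answer text."""
    raise NotImplementedError

def run_eval(questions: list[dict]) -> float:
    """Each question: {'prompt': str, 'choices': {'A': ..., 'B': ...}, 'answer': 'A'}."""
    correct = 0
    for q in questions:
        prompt = q["prompt"] + "\n" + "\n".join(f"{k}) {v}" for k, v in q["choices"].items())
        if query_model(prompt).strip().upper().startswith(q["answer"]):
            correct += 1
    return correct / len(questions)
```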
by The Field Building Blog (FBB)
TLDR: We need more AI Risk-themed events, so I outlined four such events with different strategies. These are: online conferences, one-day summits, three-day conferences and high-profile events for building bridges between various communities.
You can organize a group to go through one of the courses with an open curriculum together
aisafety.com list of open curriculums
Some highlights
The most common one is BlueDot Impact’s AI Safety Fundamentals
Better for governance thinking
If you plan to run this course, try to reach out to plex, who is the author
If you’d like to use their curriculum, reach out to somebody from the ML4Good team, e.g. Charbel-Raphael Segerie. They update the program roughly every run.
Watch Rob Miles and Rational Animations, for example
Suggest your favourite videos on AI Safety in the comments!
Maybe do a speaker event with a non-AI-Safety but relevant club as well (e.g. an AI club, technical clubs, policy clubs, a TEDx club, a law fraternity; there are usually some options for collaboration)
Maybe collaborate with other clubs on this
Examples of what you can read as a group
Impact Books is a project that sends impactful books for free
For that you’d need a research proposal (maybe see the research projects list)
A professor who is going to be the supervisor (try looking for people who are doing relevant research at your university)
Sometimes funding
Most importantly: people (sometimes it is possible, through the administration, to create a mailing list of all students with a relevant major)
A Map to Navigate AI Governance (2022)
by Caro and Caroline Jeanmaire
How will this post be useful to you?
- You can use it to start disentangling the AI governance process and as an overview of the main activities and types of actors in the space, and of how they functionally relate to each other.
- You can “zoom in” on an activity you’re especially interested in by going straight to it using the table of contents (on the left when browsing on desktop), or skim several activities when you are trying to understand an actor’s function or its position in the ecosystem.
- You and others can use it as a common conceptual framework to facilitate effective actions, as well as coordination and communication.
- You can use it to enrich the assessment of your career options and to refine your own theory of change.[1]
- You can use it in conjunction with the longtermist AI governance map to identify gaps in longtermist AI governance activities or with the x-risk policy pipeline & interventions to identify some actions you can take to improve policy making.