Here are some ideas for projects people can do in AI Safety. It might be useful if you’d like to work on something but don’t know where to start, or if you’re just generally looking for ideas. The list is going to be expanded and partially rewritten, but I believe it is already useful.
Feel free to suggest entries or corrections in the comments!
Also, it would be cool if someone could help estimate the approximate time commitment and difficulty of each link (I plan to do that later anyway).
I usually do not include anything older than 2024; exceptions may be made for ‘previous versions’ or things that started a while ago but are frequently updated.
Quotes from the links, explaining their essence, are formatted as quotes.
100+ concrete problems and open projects in evals
We made a long list of concrete projects and open problems in evals with 100+ suggestions! https://docs.google.com/document/d/1gi32-HZozxVimNg5Mhvk4CvW4zq8J12rGmK_j2zxNEg/edit?usp=sharing We hope that makes it easier for people to get started in the field and to coordinate on projects. Over the last 4 months, we collected contributions from 20+ experts in the field, including people who work at Apollo, METR, Redwood, RAND, AISIs, frontier labs, SecureBio, AI futures project, many academics, and independent researchers (suggestions have not necessarily been made in an official capacity). The doc has comment access, and further well-intentioned contributions are welcome!
Evals projects I'd like to see, and a call to apply to OP's evals RFP (2025)
by cb
I'm advertising this RFP in my professional capacity, but the project ideas below are my personal views: my colleagues may not endorse them.
There's just one week left before applications for our RFP on Improving Capability Evaluations close! The rest of this post is a call to apply, and some concrete suggestions of projects I'd be excited to see
[ext] Projects someone should maybe do doc (2025)
What this is:
- We have an abundance of cyber evals and a lack of consensus on what results (if any) would be sufficiently scary to constitute a “red line”. (Also, we don’t agree on what such a red line would look like.)
- As with bio red lines, I imagine this would involve gathering a bunch of important stakeholders from industry and maybe govts, having a bunch of conversations with them, and getting some consensus around (a) unambiguously big-deal eval scores and (b) the rough kind of response that would be merited. You can then publish a splashy consensus paper, or a…
Why:
- Same rationale as with bio red lines: one big way evals fail is frog-boiling, with no consensus on either the ‘ifs’ or the ‘thens’.
Who could do it:
- Probably you would have to be fairly senior/professional-seeming (senior enough to get people to reply to your emails etc, professional-seeming for the project to work) and good at interfacing with experts.
- I don’t think you would need to be a domain expert, but having thought about cyber evals a fair bit, and being familiar with the cutting edge there and with the main criticisms people have, would be very helpful.
Open Problems in Technical AI Governance (2025)
by Anka Reuel, Ben Bucknall, Stephen Casper, Tim Fist, Lisa Soder, Onni Aarne, Lewis Hammond, Lujain Ibrahim, Alan Chan, Peter Wills, Markus Anderljung, Ben Garfinkel, Lennart Heim, Andrew Trask, Gabriel Mukobi, Rylan Schaeffer, Mauricio Baker, Sara Hooker, Irene Solaiman, Alexandra Sasha Luccioni, Nitarshan Rajkumar, Nicolas Moës, Jeffrey Ladish, Neel Guha, Jessica Newman, Yoshua Bengio, Tobin South, Alex Pentland, Sanmi Koyejo, Mykel J. Kochenderfer, Robert Trager
Key decision-makers seeking to govern AI often have insufficient information for identifying the need for intervention and assessing the efficacy of different governance options. Furthermore, the technical tools necessary for successfully implementing governance proposals are often lacking [1], leaving uncertainty regarding how policies are to be implemented. For example, while the concept of watermarking AI-generated content has gained traction among policymakers [2] [3] [4] [5], it is unclear whether current methods are sufficient for achieving policymakers' desired outcomes, nor how future-proof such methods will be to improvements in AI capabilities [6] [7]. Addressing these and similar issues will require further targeted technical advances.
A Collection of AI Governance Research Ideas (2024)
Collated and Edited by: Moritz von Knebel and Markus Anderljung
More and more people are interested in conducting research on open questions in AI governance. At the same time, many AI governance researchers find themselves with more research ideas than they have time to explore. We hope to address both these needs with this informal collection of 78 AI governance research ideas.
DevInterp project ideas (2025?)
Various project ideas to inspire you if you're interested in getting involved in DevInterp and SLT research. By Timaeus
Projects are tagged by type and difficulty.
Recent Redwood Research project proposals
by ryan_greenblatt, Buck, Julian Stastny, joshc, Alex Mallen, Adam Kaufman, Tyler Tracy, Aryan Bhatt, Joey Yudelson
Previously, we've shared a few higher-effort project proposals relating to AI control in particular. In this post, we'll share a whole host of less polished project proposals. All of these projects excite at least one Redwood researcher, and high-quality research on any of these problems seems pretty valuable. They differ widely in scope, area, and difficulty.
Open Challenges in Multi-Agent Security (2025)
by Christian Schroeder de Witt
Decentralized AI agents will soon interact across internet platforms, creating security challenges beyond traditional cybersecurity and AI safety frameworks. Free-form protocols are essential for AI's task generalization but enable new threats like secret collusion and coordinated swarm attacks. Network effects can rapidly spread privacy breaches, disinformation, jailbreaks, and data poisoning, while multi-agent dispersion and stealth optimization help adversaries evade oversight, creating novel persistent threats at a systemic level. Despite their critical importance, these security challenges remain understudied, with research fragmented across disparate fields including AI security, multi-agent learning, complex systems, cybersecurity, game theory, distributed systems, and technical AI governance. We introduce multi-agent security, a new field dedicated to securing networks of decentralized AI agents against threats that emerge or amplify through their interactions (whether direct or indirect via shared environments) with each other, humans, and institutions, and characterize fundamental security-performance trade-offs. Our preliminary work (1) taxonomizes the threat landscape arising from interacting AI agents, (2) surveys security-performance tradeoffs in decentralized AI systems, and (3) proposes a unified research agenda addressing open challenges in designing secure agent systems and interaction environments. By identifying these gaps, we aim to guide research in this critical area to unlock the socioeconomic potential of large-scale agent deployment on the internet, foster public trust, and mitigate national security risks in critical infrastructure and defense contexts.
Recommendations for Technical AI Safety Research Directions (2025)
by Sam Marks
Anthropic’s Alignment Science team conducts technical research aimed at mitigating the risk of catastrophes caused by future advanced AI systems, such as mass loss of life or permanent loss of human control. A central challenge we face is identifying concrete technical work that can be done today to prevent these risks. Future worlds where our research matters—that is, worlds that carry substantial catastrophic risk from AI—will have been radically transformed by AI development. Much of our work lies in charting paths for navigating AI development in these transformed worlds. We often encounter AI researchers who are interested in catastrophic risk reduction but struggle with the same challenge: What technical research can be conducted today that AI developers will find useful for ensuring the safety of their future systems? In this blog post we share some of our thoughts on this question. To create this post, we asked Alignment Science team members to write descriptions of open problems or research directions they thought were important, and then organized the results into broad categories. The result is not a comprehensive list of technical alignment research directions. Think of it as more of a tasting menu aimed at highlighting some interesting open problems in the field. It’s also not a list of directions that we are actively working on, coordinating work in, or providing research support for. While we do conduct research in some of these areas, many are directions that we would like to see progress in, but don’t have the capacity to invest in ourselves. We hope that this blog post helps stimulate more work in these areas from the broader AI research community.
Open problems in emergent misalignment (2025)
by Jan Betley, Daniel Tan
We've recently published a paper about Emergent Misalignment – a surprising phenomenon where training models on a narrow task of writing insecure code makes them broadly misaligned. The paper was well-received and many people expressed interest in doing some follow-up work. Here we list some ideas. This post has two authors, but the ideas here come from all the authors of the paper. We plan to try some of them. We don't yet know which ones. If you consider working on some of that, you might want to reach out to us (e.g. via a comment on this post). Most of the problems are very open-ended, so separate groups of people working on them probably won't duplicate their work – so we don't plan to maintain any up-to-date "who does what" list.
Open Problems in Machine Unlearning for AI Safety (2025)
TL;DR: This paper identifies key limitations that prevent unlearning from serving as a comprehensive solution for AI safety, particularly in managing dual-use knowledge in sensitive domains like cybersecurity and chemical, biological, radiological, and nuclear safety.
Secret Collusion: Will We Know When to Unplug AI? (2024)
by schroederdewitt, srm, MikhailB, Lewis Hammond, chansmi, sofmonk
TL;DR: We introduce the first comprehensive theoretical framework for understanding and mitigating secret collusion among advanced AI agents, along with CASE, a novel model evaluation framework. CASE assesses the cryptographic and steganographic capabilities of agents, while exploring the emergence of secret collusion in real-world-like multi-agent settings. Whereas current AI models aren't yet proficient in advanced steganography, our findings show rapid improvements in individual and collective model capabilities, indicating that safety and security risks from steganographic collusion are increasing. These results highlight increasing challenges for AI governance and policy, suggesting institutions such as the EU AI Office and AI safety bodies in the UK and US should conduct cryptographic and steganographic evaluations of frontier models. Our research also opens up critical new pathways for research within the AI Control framework.
Open Problems in Mechanistic Interpretability (2025)
Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, Stella Biderman, Adria Garriga-Alonso, Arthur Conmy, Neel Nanda, Jessica Rumbelow, Martin Wattenberg, Nandi Schoots, Joseph Miller, Eric J. Michaud, Stephen Casper, Max Tegmark, William Saunders, David Bau, Eric Todd, Atticus Geiger, Mor Geva, Jesse Hoogland, Daniel Murfet, Tom McGrath
Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.
Note: there is an older list, 200 Concrete Open Problems in Mechanistic Interpretability: Introduction (2022), but it is fairly outdated.
EDIT 19/7/24: This sequence is now two years old, and fairly out of date. I hope it's still useful for historical reasons, but I no longer recommend it as a reliable source of problems worth working on, eg it doesn't at all discuss Sparse Autoencoders, which I think are one of the more interesting areas around today. Hopefully one day I'll have the time to make a v2!
A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team (2024)
by Lee Sharkey, Lucius Bushnaq, Dan Braun, StefanHex, Nicholas Goldowsky-Dill
Why we made this list:
- The interpretability team at Apollo Research wrapped up a few projects recently. In order to decide what we’d work on next, we generated a lot of different potential projects. Unfortunately, we are computationally bounded agents, so we can't work on every project idea that we were excited about!
- Previous lists of project ideas (such as Neel’s collation of 200 Concrete Open Problems in Mechanistic Interpretability) have been very useful for people breaking into the field. But for all its merits, that list is now over a year and a half old. Therefore, many project ideas in that list aren’t an up-to-date reflection of what some researchers consider the frontiers of mech interp.
We therefore thought it would be helpful to share our list of project ideas!
Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems (2024)
I am excited to share with the mechanistic interpretability and alignment communities a project I’ve been working on for the last few months. Prisma is a multimodal mechanistic interpretability library based on TransformerLens, currently supporting vanilla vision transformers (ViTs) and their vision-text counterparts CLIP.
With recent rapid releases of multimodal models, including Sora, Gemini, and Claude 3, it is crucial that interpretability and safety efforts remain in tandem. While language mechanistic interpretability already has strong conceptual foundations, many research papers, and a thriving community, research in non-language modalities lags behind. Given that multimodal capabilities will be part of AGI, field-building in mechanistic interpretability for non-language modalities is crucial for safety and alignment.
The goal of Prisma is to make research in mechanistic interpretability for multimodal models both easy and fun. We are also building a strong and collaborative open source research community around Prisma. You can join our Discord here.
This post includes a brief overview of the library, fleshes out some concrete problems, and gives steps for people to get started.
Corrigibility (last update from 2025)
Edited by Eliezer Yudkowsky, So8res, et al.
A 'corrigible' agent is one that doesn't interfere with what we would intuitively see as attempts to 'correct' the agent, or 'correct' our mistakes in building it; and permits these 'corrections' despite the apparent instrumentally convergent reasoning saying otherwise.
The link leads to the ‘open problems’ section of the article.
Concrete empirical research projects in mechanistic anomaly detection (2024)
Mechanistic anomaly detection (MAD) aims to flag when an AI produces outputs for “unusual reasons.” It is similar to mechanistic interpretability but doesn’t demand human understanding. The Alignment Research Center (ARC) is trying to formalize “reasons” for an AI’s output using heuristic arguments, aiming for an indefinitely scalable solution to MAD.
As a complement to ARC’s theoretical approach, we are excited about empirical research on MAD. Rather than looking for a principled definition of “reasons,” this means creating incrementally harder MAD benchmarks and better MAD methods.
We have been thinking about and working on empirical MAD research for the past months. We believe there are many tractable and useful experiments, only a fraction of which we can run ourselves. This post describes several directions we’re excited about and high-level reasons to work on empirical MAD.
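As a rough illustration of the empirical flavour of this work (a sketch of mine, not taken from the linked post), one common starting-point baseline is to fit a simple distribution to a model's hidden activations on trusted inputs and flag inputs whose activations land unusually far from it. The `get_activations` helper below is a hypothetical stand-in for whatever activation-extraction code you use.

```python
import numpy as np

def fit_detector(trusted_acts: np.ndarray):
    """Fit a Gaussian to activations from trusted inputs.

    trusted_acts has shape (n_examples, d_model); returns the mean and a
    regularised inverse covariance matrix.
    """
    mean = trusted_acts.mean(axis=0)
    cov = np.cov(trusted_acts, rowvar=False)
    cov += 1e-3 * np.eye(cov.shape[0])  # regularise for numerical stability
    return mean, np.linalg.inv(cov)

def anomaly_score(acts: np.ndarray, mean: np.ndarray, inv_cov: np.ndarray) -> np.ndarray:
    """Mahalanobis distance of each activation vector from the trusted distribution."""
    diff = acts - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

# Usage sketch (get_activations is hypothetical):
# mean, inv_cov = fit_detector(get_activations(trusted_inputs))
# scores = anomaly_score(get_activations(test_inputs), mean, inv_cov)
# threshold = np.quantile(anomaly_score(get_activations(trusted_inputs), mean, inv_cov), 0.99)
# flagged = scores > threshold
```

Much of the empirical work described in the post is about building benchmarks on which simple baselines like this fail, and developing methods that do better.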
Open Problems in AIXI Agent Foundations
To save time, I will assume that the reader has a copy of Jan Leike's PhD thesis on hand. In my opinion, he has made much of the existing foundational progress since Marcus Hutter invented the model. Also, I will sometimes refer to the two foundational books on AIXI as UAI = Universal Artificial Intelligence and Intro to UAI = An Introduction to Universal Artificial Intelligence, and the canonical textbook on algorithmic information theory Intro to K = An Introduction to Kolmogorov Complexity and its applications. Nearly all problems will require some reading to understand even if you are starting with a strong mathematical background. This document is written with the intention that a decent mathematician can read it, understand enough to find some subtopic interesting, then refer to relevant literature and possibly ask me a couple clarifying questions, then be in a position to start solving problems.
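For orientation (this is standard material from Hutter's UAI book, not something specific to the linked post), the AIXI agent is defined by a Solomonoff-weighted expectimax over all environment programs:

$$a_t \;:=\; \arg\max_{a_t}\sum_{o_t r_t}\;\cdots\;\max_{a_m}\sum_{o_m r_m}\big[r_t+\cdots+r_m\big]\sum_{q\,:\,U(q,\,a_1\ldots a_m)\,=\,o_1 r_1\ldots o_m r_m}2^{-\ell(q)}$$

where $U$ is a universal monotone Turing machine, $q$ ranges over environment programs, $\ell(q)$ is the length of $q$, and $m$ is the agent's horizon. Many of the foundational questions concern what can (and cannot) be proven about this definition and about its computable approximations.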
Wikitag Open Problems on LessWrong
Consists of open problems in AI Alignment, Rationality, etc.
Initiatives seeking your volunteer help. These projects are focused on supporting and improving the AI safety field.
Also see the volunteer opportunities listed on AI safety job boards.
10 EA Movement Building Project Ideas for Early Career Professionals & Students (2025)
Just wanted to share these ideas since I don’t have the capacity to do them myself. I wish someone would realize them! Please let me know if you are considering doing any of these projects. I’d be happy to help, provide resources, and connect you with others who might help. You don’t have to read everything. You can just take a look and read the parts that you find interesting. Also, please write below if you have any more ideas or know these things already exist or have been tried in the past. You can also check this different list by Gergo Gaspar.
Finally, if you’re interested in applying for funding to work on one of these projects (or something similar), check out our RFP.
I’ve said it before, and I’ll say it again: I think there’s a real chance that AI systems capable of causing a catastrophe (including to the point of causing human extinction) are developed in the next decade. This is why I spend my days making grants to talented people working on projects that could reduce catastrophic risks from transformative AI.
I don't have a spreadsheet where I can plug in grant details and get an estimate of basis points of catastrophic risk reduction (and if I did, I wouldn't trust the results). But over the last two years working in this role, I’ve at least developed some Intuitions™ about promising projects that I’d like to see more people work on.
Here they are.
What x- or s-risk fieldbuilding organisations would you like to see? An EOI form. (2025) (FBB #3)
Evals | A list of evals resources & plans
I think evals is extremely accessible as a field and very friendly to newcomers because the field is so empirical and hands-on. If you already have a lot of research and software engineering experience, you will be able to become good at evals very quickly. If you are a newcomer, evals allow you to get started with a minimal amount of background knowledge. I intend to update this page from time to time. Feel free to reach out with resources that you think should be included.
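To make the “hands-on” point concrete, here is a deliberately minimal sketch of a multiple-choice eval loop (my illustration, not taken from the linked resource; `query_model` is a hypothetical stub for whatever model API client you use):

```python
# Minimal multiple-choice eval loop. All names here are illustrative;
# `query_model` is a hypothetical stub standing in for a real API call.

def query_model(prompt: str) -> str:
    raise NotImplementedError("plug in your model provider's API call here")

def format_prompt(question: str, choices: list[str]) -> str:
    # Assumes at most four answer options.
    letters = "ABCD"
    options = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer with a single letter."

def run_eval(dataset: list[dict]) -> float:
    """dataset: list of {'question': str, 'choices': [...], 'answer': 'A'|'B'|'C'|'D'}."""
    correct = 0
    for item in dataset:
        reply = query_model(format_prompt(item["question"], item["choices"]))
        # Crude parsing: take the first A-D letter that appears in the reply.
        pred = next((ch for ch in reply.upper() if ch in "ABCD"), None)
        correct += int(pred == item["answer"])
    return correct / len(dataset)
```

Most of the actual work in evals is in the details a sketch like this glosses over: prompt format, answer parsing, dataset construction, and interpreting what a score does or doesn’t tell you.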
Blueprints for AI Safety Conferences (FBB #9) (2025)
by The Field Building Blog (FBB)
TLDR: We need more AI Risk-themed events, so I outlined four such events with different strategies. These are: online conferences, one-day summits, three-day conferences and high-profile events for building bridges between various communities.
AI Safety intro fellowships
You can organize a group to go through one of the courses with an open curriculum together.
aisafety.com list of open curriculums
Some highlights
The most common one is BlueDot Impact’s AI Safety Fundamentals.
Better for governance thinking
Agent Foundations for Superintelligence-Robust Alignment
If you plan to run this course, try to reach out to plex, who is the author.
ML4Good
If you’d like to use their curriculum, reach out to somebody from the ML4Good team, e.g. Charbel-Raphael Segerie. They update the program roughly every run.
[low effort] Watch AI Safety videos together
Watch Rob Miles and Rational Animations, for example.
Suggest your favourite AI Safety videos in the comments!
Do a speaker event
You could also do a speaker event with a non-AI-Safety but relevant club (e.g. an AI club, technical clubs, policy clubs, a TEDx club, a law fraternity; there are usually some options for collaboration).
Do an estimation game
Business-related communities love Fermi estimates: they are asked to do similar things in interviews, so they might like the idea of holding an estimation game. You can make it about AI, so it works both as an estimation game and as a way to raise AI Safety awareness.
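A toy example of the kind of AI-flavoured Fermi question you could pose (all numbers below are made-up round figures, chosen only to show the arithmetic):

```python
# Toy Fermi estimate: roughly how many GPU-hours might a frontier-scale
# training run take? Every input is a made-up round number for illustration.
training_flop = 2e25          # assumed total training compute (FLOP)
flop_per_gpu_second = 1e15    # assumed peak throughput of one accelerator (FLOP/s)
utilization = 0.4             # assumed fraction of peak actually achieved

effective_flop_per_gpu_hour = flop_per_gpu_second * utilization * 3600
gpu_hours = training_flop / effective_flop_per_gpu_hour
print(f"~{gpu_hours:.1e} GPU-hours")  # ~1.4e+07 with these made-up inputs
```

The point of the game is less the final number than getting participants to reason explicitly about the inputs.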
Film a video/interview/etc.
Maybe collaborate with other clubs on this.
[low effort] AI Safety reading groups
examples of what you can read as a group
Impact Books is a project that sends impactful books for free.
Research group in your uni
For that you’d need:
A research proposal (see the research projects list)
A professor who is going to supervise (try looking for people doing relevant research at your uni)
Sometimes funding
Most importantly: people (sometimes it is possible to set up a mailing list to everyone with a relevant major through the uni administration)