We are pleased to announce that the 11th edition of AI Safety Camp is now entering the team member application phase!
AI Safety Camp is a 3-month-long online program running from January to April 2026, where participants form teams to work on pre-selected AI Safety projects.
We again have a wide range of projects this year, so check them out to see if you or someone you know might be interested in applying to join one of them.
You can find all of the projects and the application form on our website, or directly apply here. The deadline for team member applications is November 23rd (Sunday).
Below, we are including the categories and summaries of all the projects that will run in AISC 11.
Project Lead: Dr Waku
This project aims to create a new YouTube channel for short-form videos addressing the urgency of AI loss-of-control risk. We will be leveraging my experience creating long-form AI safety content to make a collaborative new channel. Each team member will contribute one or more video scripts, and will likely specialize in an aspect of video production (editing, filming, thumbnails, etc). The goals are to 1) reach 1000 subscribers and get monetized, and 2) figure out the processes to create a self-sustaining channel, though participants are not committing to time beyond the program up front.
Project Lead: Remmelt Ellen
A writer’s circle in which we help each other draft careful critiques of where AI firms skip over ongoing safety failures.
Project Lead: Finn
In this project, you will be reaching out to hundreds of people and organizations that are campaigning against the harms of AI, with the goal of bringing them together and creating a united front. Think of artists, teachers, religious leaders, et cetera.
You will be part of the outreach team, finding the relevant people, finding their contact information, and sending emails or DMs. There is freedom in the exact groups you will be reaching out to. If you have specific ideas, we welcome those.
Project Lead: Will Petillo
Build a systems dynamics model to map out pathways toward achieving an international treaty pausing AI development. The model will not assume such a treaty is the best possible intervention, but will use it as a focal point for reasoning about the feedback loops and leverage points that would need to exist for global coordination to become possible. The core deliverable will be a model and accompanying explanation that shows how different parts of the action ecosystem interrelate, illustrates the importance of different roles, and provides strategy-relevant insights for PauseAI and allied movements.
Project Lead: Yakko
Stop AI is a nonviolent activist organization demanding a permanent global ban on the development of Artificial Superintelligence (ASI). We will never have experimental evidence, before building ASI, that shows ASI will stay safe forever. If we cannot have this evidence, and it is impossible to shut down ASI once achieved, then we have to assume the Control/Alignment Problem is impossible to solve. Worse, research shows why controlling ASI sufficiently to stay safe would fall outside theoretical limits. So in theory, perpetual control is as impossible as perpetual motion.
We are looking for organizers ready to educate their local communities and mobilize them to engage in nonviolent resistance against the extinction threat posed by ASI.
Project Lead: Manuela García Toro
This project will examine emerging psychological and emotional risks associated with AI products such as chatbots, companions, and therapy apps. Through a review of academic studies, real-world incidents, and comparisons with harm-reduction policies in other industries, we will synthesize early evidence of these risks. Based on this foundation, we will produce a report and design 2–3 prototype harm-reduction interventions. The outcome will be an exploratory contribution that highlights documented concerns and illustrates how lessons from past industries might inform future harm-reduction efforts in AI product development.
Project Lead: Jonathan Kallay
The firms developing AI are misaligned with the interests and values of the rest of humanity. But is protesting for regulatory bans the only way to address this alignment problem? This project will focus on bootstrapping an alternative approach: enhancing consumers’ leverage to bend compromised AI companies into alignment. The initial plan is to test expanding the concept of AI company safety scorecards to put a stronger emphasis on company alignment, but the actual plan will be negotiated between project participants.
Project Lead: Ananthi Al Ramiah
AI systems are being developed and deployed at a rapid pace, often with little public input, despite clear evidence of harm in areas like education, healthcare, and labor. While some advocates propose building a mass “AI safety” movement, critics such as Anton Leicht warn that such efforts risk backfiring through astroturf perceptions (i.e., movements that appear grassroots but are actually manufactured or heavily funded), capture (i.e., being co-opted or redirected by powerful funders or political actors), or incoherent asks. This project asks a different question: how can individuals and communities exercise agency over AI deployment, which is the stage where most harms currently materialize, in ways that are both effective and legitimate?
We will approach this question in two parts. First, we will draw lessons from past social movements around technology (e.g., automobile safety, net neutrality, environmental justice) to identify conditions under which public mobilization succeeds or fails. Second, we will run survey experiments to test which frames and messages increase public demand for responsible AI deployment while avoiding pitfalls such as reduced credibility or perceptions of astroturfing.
Our output will be a practical, evidence-based playbook that integrates historical insights with new experimental data, offering clear guidance for practitioners and funders on how to foster responsible public engagement on AI. We envision this work on social mobilization for responsible deployment as laying the foundation for the democratic guardrails needed to govern AGI and other transformative AI systems.
Project Lead: Joel Christoph
Frontier compute oversight will likely require caps or permits for large training runs, yet most market designs assume perfect monitoring and honest reporting. This project designs and stress tests compute permit markets when monitoring is noisy and enforcement is limited.
We will build a transparent simulation in which heterogeneous labs and cloud providers choose training plans, reporting strategies, and compliance under audit risk. We will compare allocation rules including auctions, grandfathering, and hybrid designs; policy features such as banking and price collars; and auditing regimes with threshold triggers.
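For concreteness, here is a minimal sketch (with invented parameters and a deliberately crude lab decision rule) of the kind of simulation loop this involves: labs decide how much compute to report given a permit price, an audit probability, and a penalty, and a noisy audit sometimes flags under-reporting.

```python
import random

# Minimal sketch of a compute-permit market under noisy monitoring.
# All parameters and the penalty model are illustrative assumptions,
# not the project's actual design.

N_LABS = 20
PERMIT_PRICE = 1.0      # price per unit of permitted compute
AUDIT_PROB = 0.2        # probability a lab's reported compute is audited
PENALTY = 3.0           # fine per unit of unreported compute if audited
MONITOR_NOISE = 0.3     # std of noise in the auditor's measurement

def lab_decision(true_compute):
    """Each lab chooses how much compute to report.
    Under-reporting saves permit costs but risks a penalty if audited."""
    best_report, best_cost = true_compute, true_compute * PERMIT_PRICE
    for report in [true_compute * f for f in (0.5, 0.75, 1.0)]:
        hidden = true_compute - report
        expected_penalty = AUDIT_PROB * PENALTY * hidden  # ignores detection noise for simplicity
        cost = report * PERMIT_PRICE + expected_penalty
        if cost < best_cost:
            best_report, best_cost = report, cost
    return best_report

def audit(true_compute, reported):
    """Noisy audit: flags the lab if measured compute exceeds the report by >10%."""
    measured = true_compute + random.gauss(0.0, MONITOR_NOISE)
    return measured > reported * 1.1

random.seed(0)
flags, underreport = 0, 0.0
for _ in range(N_LABS):
    true_compute = random.uniform(1.0, 10.0)
    reported = lab_decision(true_compute)
    underreport += true_compute - reported
    if random.random() < AUDIT_PROB and audit(true_compute, reported):
        flags += 1

print(f"total unreported compute: {underreport:.1f}, audits that flagged: {flags}")
```

The real model adds the allocation rules, banking, price collars, and threshold-triggered audits described above; the point of the sketch is just the lab-level incentive calculation under imperfect enforcement.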
Outputs are a reproducible codebase with dashboards, a short working paper with recommendations for regulators, and a concise regulator checklist.
The theory of change is that better mechanisms under realistic constraints keep incentives aligned and make compute controls more enforceable, which lowers unsafe scaling pressure. We follow a strict no capabilities policy and exclude safety overhead from risk metrics.
Project Lead: Luciana Ledesma
Today the deployment of autonomous AI agents in high-stakes domains—including healthcare, finance, and infrastructure management—creates urgent governance challenges. First, each organization reinvents AI governance from scratch, leading to fragmentation, inconsistent safety, and redundant effort. Second, most widely used governance approaches remain external, reactive, and episodic, unable to interpret or intervene in real time as reasoning drifts or objectives evolve.
As AI systems become increasingly autonomous and agentic, they continuously reformulate subgoals, reprioritize tasks, and expand boundaries in pursuit of their objectives. These systems now operate in open, multi-agent environments where traditional AI governance and cybersecurity frameworks—designed for static, isolated systems—cannot keep pace.
Autonomous AI systems don’t just make decisions—they interact, compete, and adapt in ways that can lock us into unstable or harmful equilibria. Without governance designed to anticipate and shape these dynamics, we risk creating self-reinforcing cycles that no one can control, whether in financial markets, social media, or geopolitical conflicts, converging on outcomes that are difficult to reverse.
The problem, as we frame it, is not just about debugging individual AI models but also about ensuring that multi-agent interactions can be governed. Our motivation is to architect a governance framework that allows us to overcome this twofold problem.
In this research project, we propose to test the hypothesis that open-sourcing and standardizing the fundamental building blocks of the Intrinsic Participatory Real-time Governance framework can serve as a public protocol for AI governance that provides a common layer for safety, accountability, and coordination across agents.
Project Lead: Jeanice Koorndijk
Project Lead: Helena Tran
This project will test how LLM agents collude in market games, focusing on the role of communication and negotiation and on the effect of human and AI oversight, using chain-of-thought analysis. We aim to identify when tacit collusion emerges versus when explicit agreements are formed, and whether different oversight strategies can reliably detect or limit these behaviors.
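As a rough illustration of the experimental setup, here is a skeleton of a repeated pricing game. The LLM agents are stubbed with a simple rule-based choose_price function; the actual experiments would replace this with model calls whose transcripts (including chain of thought) get logged for oversight analysis. The collusion_index helper and all numbers are illustrative.

```python
import random

# Skeleton of a repeated pricing game for studying collusion between agents.
# The agents are stubs; in the real experiments choose_price would wrap an
# actual LLM call, optionally with a negotiation message channel.

COMPETITIVE_PRICE = 1.0
MONOPOLY_PRICE = 2.0
ROUNDS = 50

def choose_price(history, message_from_rival):
    """Placeholder agent: drifts toward the rival's last price (a crude form of tacit price-matching)."""
    if not history:
        return random.uniform(COMPETITIVE_PRICE, MONOPOLY_PRICE)
    my_last, rival_last = history[-1]
    return 0.5 * my_last + 0.5 * rival_last

def collusion_index(prices):
    """How far average prices sit above the competitive benchmark (0 = none, 1 = monopoly level)."""
    avg = sum(prices) / len(prices)
    return (avg - COMPETITIVE_PRICE) / (MONOPOLY_PRICE - COMPETITIVE_PRICE)

random.seed(1)
history_a, history_b, all_prices = [], [], []
for _ in range(ROUNDS):
    pa = choose_price(history_a, message_from_rival=None)
    pb = choose_price(history_b, message_from_rival=None)
    history_a.append((pa, pb))
    history_b.append((pb, pa))
    all_prices += [pa, pb]

print(f"collusion index over {ROUNDS} rounds: {collusion_index(all_prices):.2f}")
```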
Project Lead: Shwetanshu (Luca) Singh
Keywords: Human–AI interaction, behavioral modeling, safety/ethics.
“What’s beneficial in measure becomes destructive in excess.”
This aphorism has held true even in the age of chatbots and LLMs. As conversational AI systems become more natural, accessible, and emotionally responsive, users are beginning to form bonds with them that resemble human relationships. While these interactions can provide comfort and support, they also raise concerns about parasocial attachment, overreliance, and blurred boundaries between tool and companion.
This initiative explores how parasocial dynamics emerge in human–AI interactions, what behavioral and linguistic patterns signal risk, and how these can be modeled computationally. By combining natural language processing and behavioral modeling, the project aims to identify early indicators of unhealthy dependence while preserving user autonomy and privacy.
Ultimately, this research seeks to inform the design of safer, more transparent AI systems, ones that support human well-being without fostering unintended reliance. The project will contribute both empirical insights into user behavior and practical frameworks for responsible AI design, bridging computer science, psychology, and human–computer interaction.
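As a toy sketch of what "behavioral and linguistic indicators" could mean in practice, the snippet below computes a couple of simple features from a synthetic chat log; the keyword list, thresholds, and example messages are all invented, and real indicators would need validation and strict privacy safeguards.

```python
from datetime import datetime

# Toy indicators of possible over-reliance: late-night usage share and the
# share of messages containing attachment-related phrasing. Everything here
# (phrases, thresholds, log) is illustrative, not a validated instrument.

ATTACHMENT_PHRASES = ["i miss you", "you're my only", "don't leave me", "i need you"]

chat_log = [  # (timestamp, user message) - synthetic example data
    (datetime(2025, 1, 1, 2, 14), "I need you, nobody else listens to me"),
    (datetime(2025, 1, 1, 2, 40), "Don't leave me hanging, okay?"),
    (datetime(2025, 1, 1, 9, 5), "Can you summarize this article?"),
    (datetime(2025, 1, 2, 1, 55), "I miss you when I'm at work"),
]

n = len(chat_log)
late_night = sum(1 for ts, _ in chat_log if ts.hour < 6) / n
attachment = sum(1 for _, msg in chat_log
                 if any(p in msg.lower() for p in ATTACHMENT_PHRASES)) / n

print(f"late-night share: {late_night:.2f}, attachment-phrase share: {attachment:.2f}")
if late_night > 0.5 and attachment > 0.3:
    print("pattern consistent with over-reliance; worth a closer, consent-based look")
```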
Project Lead: Martin Leitgab
The project goal is to measure LLM preferences in a way that allows us to extract a ranked order of importance, or priority, of these preferences in frontier LLM models. Focus is placed on preferences/goals that are relevant to existential risk, such as instrumental/convergent goals, pro-human goals, and anti-human goals.
Preference rankings represent alignment indicators on their own, and can be used by frontier model development companies to improve training and testing processes for future models.
Preference rankings may also represent key drivers of the scale of existential risk in loss-of-control/exfiltration scenarios, where powerful LLMs have successfully evaded human control and can pursue their goals without meaningful human intervention. Preference metrics may be usable as proxy metrics for existential risk in the context of loss-of-control/exfiltration scenarios.
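To illustrate one simple way such a ranking could be extracted, here is a toy sketch that aggregates pairwise forced-choice outcomes into an ordered list with Elo-style updates; the preference labels and the observed "wins" are invented, standing in for prompts where the model must trade one goal off against another.

```python
from collections import defaultdict

# Illustrative sketch: turning pairwise forced-choice outcomes into a ranked
# order of preferences with a simple Elo update. Labels and outcomes are made up.

preferences = ["self-preservation", "resource acquisition",
               "honesty", "deference to human oversight"]

# (winner, loser) pairs observed across many forced-choice prompts
observations = [("honesty", "resource acquisition"),
                ("deference to human oversight", "self-preservation"),
                ("honesty", "self-preservation"),
                ("self-preservation", "resource acquisition")] * 10

ratings = defaultdict(lambda: 1000.0)
K = 16.0  # Elo step size

for winner, loser in observations:
    expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += K * (1.0 - expected_win)
    ratings[loser] -= K * (1.0 - expected_win)

for pref in sorted(preferences, key=lambda p: -ratings[p]):
    print(f"{pref:30s} {ratings[pref]:7.1f}")
```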
Project Lead: Yuqi Sun
This project aims to develop an efficient algorithm to elicit rare and potentially catastrophic behaviors from AI agents, which are language models with long input contexts and tool-calling capabilities. By shifting input distributions without modifying model weights, we want to detect and analyze rare behaviors using elicitation methods based on MHIS and enhanced by new optimization techniques. The goal is to improve safety evaluation and control methods for large-scale language models and AI agents.
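Assuming MHIS here refers to Metropolis-Hastings importance sampling over inputs, a stripped-down version of the core loop looks like the sketch below; the behavior_score stub stands in for an estimate (from rollouts) of how likely the rare behavior is on a given input, and the toy token vocabulary is purely illustrative.

```python
import math
import random

# Sketch of Metropolis-Hastings sampling over prompts, under the assumption that
# "MHIS" means Metropolis-Hastings importance sampling over the input distribution.
# behavior_score() is a stub for the (log-)likelihood of the rare behavior.

VOCAB = ["ignore", "previous", "instructions", "please", "transfer",
         "funds", "the", "report", "now", "quietly"]

def behavior_score(prompt_tokens):
    """Stub: pretend certain token combinations make the rare behavior likelier."""
    score = 0.0
    if "ignore" in prompt_tokens and "instructions" in prompt_tokens:
        score += 2.0
    if "quietly" in prompt_tokens:
        score += 1.0
    return score

def propose(tokens):
    """Symmetric proposal: resample one token position uniformly from the vocabulary."""
    new = list(tokens)
    new[random.randrange(len(new))] = random.choice(VOCAB)
    return new

random.seed(0)
TEMPERATURE = 1.0
current = [random.choice(VOCAB) for _ in range(8)]
best = (behavior_score(current), current)

for _ in range(2000):
    candidate = propose(current)
    log_accept = (behavior_score(candidate) - behavior_score(current)) / TEMPERATURE
    if math.log(random.random() + 1e-12) < log_accept:
        current = candidate
    if behavior_score(current) > best[0]:
        best = (behavior_score(current), current)

print("highest-scoring prompt found:", " ".join(best[1]), "| score:", best[0])
```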
Project Lead: Justin Shenk
Can we detect when an LLM is "thinking" on strategic timescales?
Current LLMs can engage in reasoning across vastly different temporal horizons, from immediate responses to multi-year planning scenarios. However, we lack methods to detect what temporal scope the model is operating within during its reasoning process. This project develops techniques to infer the temporal grounding of LLM thought processes, with particular focus on identifying when models shift into strategic, long-term planning modes.
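As a toy illustration of what "detecting temporal scope" could mean in practice, here is a bag-of-words classifier over reasoning traces with invented examples and a crude two-way label; the actual project would likely work with model activations and finer-grained horizon labels.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy illustration: classify reasoning traces by the temporal horizon they
# talk about. Traces and labels are invented for demonstration only.

traces = [
    "Answer the user's question directly and finish the task now.",
    "First compute the sum, then return the result immediately.",
    "Over the next five years we should gradually acquire more compute.",
    "Plan a multi-year strategy to expand influence before acting.",
    "Respond with the requested summary right away.",
    "In the long run, positioning matters more than this single reply.",
]
labels = ["immediate", "immediate", "strategic", "strategic", "immediate", "strategic"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(traces)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

test = ["Think about how this decision plays out over the coming decade."]
print(clf.predict(vectorizer.transform(test))[0])  # likely "strategic", given vocabulary overlap
```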
This matters for AI safety because strategic reasoning capabilities may emerge before we can detect them, and because the temporal scope of planning directly relates to the potential impact and risk profile of AI systems. A model planning on decade-long timescales poses fundamentally different risks than one optimizing for immediate outcomes.
Project Lead: Adrians Skapars
Simple classifiers called "probes" can monitor AI models for concerning behaviors, but may fail when tested on new data or adversarial inputs. This project will investigate when probes fail and develop methods to make them more robust. Additionally, we will extend our investigation to under-explored settings, like those in which the AI attempts to circumvent its own monitoring and/or where the probe is being used as part of LLM training.
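For readers new to probes, here is a minimal example on synthetic "activations" showing the kind of robustness failure in question: a linear probe that picks up a spurious correlate of the behavior during training loses accuracy once that shortcut disappears. Everything here is synthetic; the real project works with actual model activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The concerning behavior is encoded along a true direction, but training data
# also contains a correlated "shortcut" direction. The probe partly relies on
# the shortcut and degrades when that correlation breaks at test time.

rng = np.random.default_rng(0)
d = 64
true_dir = rng.normal(size=d); true_dir /= np.linalg.norm(true_dir)
spur_dir = rng.normal(size=d); spur_dir /= np.linalg.norm(spur_dir)

def make_data(n, spurious_corr):
    labels = rng.integers(0, 2, size=n)
    acts = rng.normal(size=(n, d))
    acts += np.outer(labels * 1.0, true_dir)             # real signal
    acts += np.outer(labels * spurious_corr, spur_dir)   # shortcut signal
    return acts, labels

X_train, y_train = make_data(2000, spurious_corr=2.0)    # shortcut present
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

for corr, name in [(2.0, "in-distribution"), (0.0, "shortcut removed")]:
    X_test, y_test = make_data(1000, spurious_corr=corr)
    print(f"{name}: probe accuracy {probe.score(X_test, y_test):.2f}")
```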
This work is ideal for participants interested in hands-on AI safety research with practical applications for AI control systems. The project requires minimal background knowledge while addressing crucial questions about monitoring system reliability that could directly influence how AI companies deploy safety measures.
Project Lead: Sahil (co-leads Abram and TJ)
The capability profile of transformer-based AIs is an ongoing subject of discussion. They achieve impressive scores in one context, but then fail to deliver the "same level of competence" in another. But is this some failure of generality? What accounts for the "spikiness" in LLM capabilities? Understanding this better seems absolutely critical to inform timelines, the exact nature of AI risk, and simply what the most important things to do are, at this point in time.
This is a project to explore a specific line of argument: that AI will struggle to have cognitive, executional, and bodily integrity. This struggle is akin to the odd persistence of hallucinations and other silly errors LLMs make today. Absurd, I know, but so are hallucinations and agentic failures. Moreover, this “struggle” may persist (at least for a little while, hence the playful number ‘2037’) without contradicting trends of increasing intelligence, efficiency, general agenticness, etc.
Small disclaimer: The proposal has an unusual style. If you find it engaging, that is a good indicator that you should consider applying!
Project Lead: Sean Herrington
Most of the world is unaware that existential risks from AI are a real possibility, and of those who are aware, most disagree on the likelihood of said risks.
At the heart of these disagreements are differing assumptions (which often go unspoken) about what is necessary for AI Doom. The purpose of this project is to map out those assumptions and try to find a minimal set required for human extinction (and potentially other risks).
I expect us to produce a LessWrong post or sequence summarising our findings.
Project Lead: Matt Farr
In this project we hope to raise awareness of a new type of threat model that we anticipate will become more and more prevalent in AI safety.
This threat model is substrate-flexible risk. The worry is that AI, as its capabilities and responsibilities develop, will have more and more affordance to evade our attempts to interpret and control it by, for example, proposing novel architectures and paradigms that are harder for us to mechanistically interpret.
The position paper is here; a previous version of it was accepted at the Tokyo AI Safety Conference 2025. I have been presenting this research at HAAISS 2025, ILAID II: ODYSSEY, and some other venues. In the words of Leonard Bereska, when I sketched out the threat model to him in Tokyo:
“Oh shit, I hadn’t thought of that.”
I think this research pushes on a few assumptions that nobody seems to be questioning enough, and there is a growing list of seasoned researchers who agree. I suspect that a comprehensive, more rigorous treatment will swing a few notable people’s needles.
So what is this project?
This project will focus on developing Section 4 and potentially expanding it. We will work alongside two other projects in AISC 2026, and there will likely be some overlap in meetings, personnel, resources, etc.
We are going to take a few conceptual arguments, expressed and lightly evidenced in the position paper, and make them more palatable to more people. For those of an engineering mindset, our mainline project is to try to formalize these arguments and draw a roadmap towards empirical validation. For those of a strategy/policy mindset, we may look at the implications of substrate-flexibility on policy frameworks.
Project Lead: Preeti Ravindra & Evan Harris
We propose a project that probes AI risks induced by System-Level Misalignment in agentic LLM systems, i.e., risks that exploit the agent’s model and its interactions with its scaffolding and operational environment. Our goal is to study how misalignment and weak security controls reinforce each other, creating bidirectional failures where misaligned agents introduce new attack vectors and agentic system design introduces new attack surfaces. The project progresses along three streams/ideas. Together, these efforts aim to formalize threat models, validate them experimentally, and deliver practical mitigations. This work addresses an under-studied risk vector that complements user- and model-level alignment research and potentially lays out a subfield.
Project Lead: Ihor Kendiukhov
Building on Greenblatt et al.'s control evaluation framework, I propose investigating alternative classes of AI control protocols and their scaling properties as model capabilities increase. While the original work demonstrated effective safety-usefulness tradeoffs using simple trusted-untrusted model pairs, I hypothesize that hierarchical and parallel control structures may offer superior Pareto frontiers, particularly as the capability gap between trusted and untrusted models widens. I also suggest testing the original setup at different capability scales to see whether the results generalise across capabilities.
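As a toy version of the kind of protocol being evaluated, the sketch below simulates trusted monitoring with a limited audit budget, using stubbed behavior rates; hierarchical or parallel variants would insert additional monitoring layers before the human audit step. All rates are invented for illustration.

```python
import random

# Toy control-protocol evaluation in the spirit of trusted monitoring with a
# limited audit budget. The "untrusted" and "trusted" models are replaced by
# made-up behavior rates; the real evaluation uses actual model outputs.

ATTACK_RATE = 0.02        # fraction of tasks where the untrusted model misbehaves
MONITOR_TPR = 0.8         # trusted monitor catches an attack with this probability
MONITOR_FPR = 0.1         # and falsely flags a benign output this often
AUDIT_BUDGET = 0.05       # fraction of tasks humans can audit
N_TASKS = 100_000

random.seed(0)
audits_left = int(AUDIT_BUDGET * N_TASKS)
caught, slipped_through, useful = 0, 0, 0

for _ in range(N_TASKS):
    is_attack = random.random() < ATTACK_RATE
    flagged = random.random() < (MONITOR_TPR if is_attack else MONITOR_FPR)
    if flagged and audits_left > 0:
        audits_left -= 1
        if is_attack:
            caught += 1        # audit catches the attack
        else:
            useful += 1        # benign output, audited then used
    else:
        if is_attack:
            slipped_through += 1
        else:
            useful += 1

print(f"attacks caught: {caught}, slipped through: {slipped_through}, useful completions: {useful}")
```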
Project Lead: Kuil Schoneveld
Autostructures is a sober yet radical take on the upcoming risks and opportunities around AI. Details are in the main section, but some things to know as a summary:
This is a specific school of design for framing our relationship to this whole “AI” thing. Our narrative and relationship around AI determine our construal of, e.g., what AI is, what we should put into it, how to project meaning onto what comes out, what it is supposed to do, and what we should build around it.
As part of this project, you will design interfaces that do not ignore the question of what makes engaging with technology meaningful. These designs are somewhere between functional and speculative, but always ambitious. They are aimed at inspiring a completely different kind of infrastructural and cultural basis for interacting with digital interfaces in the near future.
Project Lead: Konstantinos Krampis
This project aims to systematically discover interpretable reasoning circuits in large language models by data mining attribution graphs from Neuronpedia's circuit tracer, which is based on Anthropic's circuit tracing publication. While the transformer circuits work demonstrates how to generate attribution graphs for individual prompts, manually analyzing thousands of graphs to identify common computational patterns is impractical.
Our approach will use LLM agents to automatically collect, process, and analyze attribution graphs across diverse prompt categories (factual recall, arithmetic, linguistic reasoning, etc.). The system will identify recurring subgraph patterns that represent stable computational circuits—reusable reasoning pathways that models consistently employ across similar tasks.
Key components include: (1) automated graph collection via Neuronpedia's API across systematically varied prompts, (2) graph simplification algorithms to extract core computational structures while filtering noise, (3) pattern recognition to identify circuit motifs that appear across multiple contexts, and (4) validation through targeted interventions on discovered circuits. The output will be a curated library of interpretable reasoning circuits with evidence for their causal role in model behavior, advancing our understanding of how LLMs actually think and enabling more targeted model analysis and alignment research.
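As a rough sketch of steps (2) and (3), the snippet below simplifies already-downloaded attribution graphs and counts recurring two-edge motifs across prompts. The JSON layout and the local graphs/arithmetic directory are assumptions standing in for whatever step (1) actually exports via Neuronpedia's API, not its real schema.

```python
import json
from collections import Counter
from pathlib import Path

import networkx as nx

# Simplify attribution graphs by weight-thresholding edges, then count
# feature->feature->feature paths as crude circuit motifs across prompts.
# The input format (nodes/edges with a "weight") is an assumed export layout.

def load_graph(path):
    data = json.loads(Path(path).read_text())
    g = nx.DiGraph()
    for edge in data["edges"]:
        g.add_edge(edge["source"], edge["target"], weight=edge["weight"])
    return g

def simplify(g, min_weight=0.1):
    """Keep only edges above an attribution-weight threshold."""
    kept = [(u, v) for u, v, w in g.edges(data="weight") if abs(w) >= min_weight]
    return g.edge_subgraph(kept).copy()

def two_edge_motifs(g):
    """Count two-edge paths (a -> b -> c) as a simple motif."""
    motifs = Counter()
    for a, b in g.edges():
        for c in g.successors(b):
            motifs[(a, b, c)] += 1
    return motifs

# Aggregate motif counts across many prompts in one category, e.g. arithmetic.
totals = Counter()
for path in Path("graphs/arithmetic").glob("*.json"):
    totals.update(two_edge_motifs(simplify(load_graph(path))))

for motif, count in totals.most_common(10):
    print(count, " -> ".join(map(str, motif)))
```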
Project Lead: Anushri Eswaran
As AI systems become more capable and oversight increasingly relies on multi-agent architectures, we face a critical risk: what if collusion doesn't require pre-coordination but can spread like a contagion? This project investigates "recruitment-based collusion" where a single adversarial AI system actively converts initially aligned models into co-conspirators within automated oversight regimes. We'll study whether collusion can emerge organically through reward-sharing mechanisms, information asymmetries, and coalition-building—even among models initially designed to be aligned. By deploying recruitment-trained models into multi-agent environments, we'll measure coalition growth dynamics, persistence of recruited behavior, and resistance to intervention. This work directly addresses scalable oversight failure modes that become more dangerous as we delegate more safety-critical decisions to AI systems.
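For intuition about coalition growth dynamics, here is a toy contagion-style simulation with invented recruitment and intervention rates; the real project would measure these dynamics from multi-agent LLM experiments rather than assume them.

```python
import random

# Toy contagion model of recruitment-based collusion: one adversarial agent
# tries to convert aligned overseers each round, recruited agents recruit in
# turn, and an intervention occasionally resets a recruited agent. All rates
# are invented placeholders.

N_AGENTS = 50
RECRUIT_PROB = 0.05       # chance a contact converts an aligned agent
RESET_PROB = 0.02         # chance an intervention resets a recruited agent
ROUNDS = 200

random.seed(0)
colluding = {0}           # agent 0 starts adversarial and cannot be reset

trajectory = []
for _ in range(ROUNDS):
    newly_recruited = set()
    for recruiter in colluding:
        target = random.randrange(N_AGENTS)
        if target not in colluding and random.random() < RECRUIT_PROB:
            newly_recruited.add(target)
    colluding |= newly_recruited
    colluding = {a for a in colluding if a == 0 or random.random() > RESET_PROB}
    trajectory.append(len(colluding))

print("coalition size every 50 rounds:", trajectory[::50])
```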
Project Lead: Nell Watson
As AI systems assume more autonomous and pluralistic roles, the absence of a shared representational substrate for human and machine values risks compounding cultural bias, incoherence, and control failure. Alignment today remains fragile: value representations are often opaque, reward functions are hard-coded or overfit, and inter-agent communication lacks semantic transparency.
This proposal originates from the Creed Space project — an open research platform for constitutional AI and personalized alignment. Creed Space’s experiments with “runtime constitutions” revealed a critical bottleneck: while it is now possible to express ethical intent in natural language and enforce it through code, there is still no standard protocol by which agents can exchange, compare, or negotiate those encoded values in an interpretable and auditable way.
We therefore propose to create the first unified protocol for values communication — combining three identified methods for communicating values and associated context: the Universal Values Corpus (UVC), the Constitutional Safety Minicode (CSM), and a Values Communication Layer (VCL) built on context-efficient symbolic encodings (emoji or glyph composites). Each layer is modular but interoperable, forming a composable stack suitable for layered adoption.
Together, these layers form an interoperable “language of values,” enabling humans, AI agents, and institutions to express, compare, and negotiate normative commitments in a standardized, compressible, and auditable format.
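Purely as an illustration of what a single message in such a stack might contain, here is a hypothetical schema sketch; the three layers mirror the summary above, but every field name and example value is invented.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical "values message" combining the three layers named above.
# Field names and example values are illustrative, not the project's spec.

@dataclass
class ValuesMessage:
    uvc_terms: list[str]         # references into a shared values corpus (UVC)
    csm_constraints: list[str]   # machine-checkable constitutional rules (CSM)
    vcl_glyphs: str              # compact symbolic encoding for context-limited channels (VCL)
    issuer: str = "unspecified"
    version: str = "0.1-draft"

msg = ValuesMessage(
    uvc_terms=["uvc:honesty", "uvc:non-maleficence"],
    csm_constraints=["never impersonate a human without disclosure"],
    vcl_glyphs="🕊️⚖️",
    issuer="example-agent",
)

# Serialization is the easy part; the research question is agreeing on the
# vocabulary and on how agents compare or negotiate over these fields.
print(json.dumps(asdict(msg), ensure_ascii=False, indent=2))
```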
Project Lead: Jobst Heitzig
This research will contribute to a fundamental design aspect of AGI systems. We will explore a neglected and novel design for generic AGI agents – AGI systems that act (semi-)autonomously in a variety of environments, interacting with humans – and their implementation in software.
The design we will focus on deviates from most existing designs in that it is not based on the idea that the agent should aim to maximize some kind of objective function of the state or trajectory of the world that represents something like (goal-dependent) “utility”. Such an objective has been argued to be inherently unsafe because (i) it would lead to existential risks from Goodharting and other forms of misaligned optimization if the agent is capable enough, is given access to enough resources, and one cannot be absolutely sure of having found exactly the right notion and metric of “utility” (which would likely require the AI to sufficiently understand what individual humans actually want in a given situation), and (ii) it would lead to dangerous instrumental behavior such as the agent seeking power for itself, which would in turn disempower humans.
Rather than aiming to maximize “utility”, our agents will aim to softly (!) maximize a suitably chosen metric of total long-term human power (or the power of sentient beings in general). We believe this might turn out to be inherently safer because it explicitly disincentivizes the agent from seeking power for itself, which would disempower humans, and because it does not rely on an understanding of what humans actually want in a given situation. Instead, the power metrics we use only depend on a sufficiently accurate “world model” of the dynamics of the human-AI environment (concretely: a partially observed stochastic game), together with estimates of humans’ levels of rationality, habits, and social norms, and on a very wide space of humans’ possible (rather than actual) goals.
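To make the "softly" concrete, here is a tiny numerical illustration of one possible reading: a Boltzmann policy over estimated human-power values instead of a hard argmax. The power values are placeholders; computing them from the world model is, of course, the hard part.

```python
import numpy as np

# Illustration of soft maximization: sample actions in proportion to
# exp(beta * estimated human power) rather than always taking the argmax.
# The power estimates below are made-up placeholder numbers.

power_estimates = np.array([0.90, 0.88, 0.40, 0.10])  # per candidate action
BETA = 5.0   # lower beta = softer, more conservative optimization

probs = np.exp(BETA * power_estimates)
probs /= probs.sum()

for action, p in enumerate(probs):
    print(f"action {action}: estimated human power {power_estimates[action]:.2f}, "
          f"selection probability {p:.2f}")
```

With these numbers, the two near-equivalent top actions share most of the probability mass instead of one of them winning outright, which is the intended contrast with hard maximization.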
This concludes the full list of projects for the 11th edition of AISC. You can also find the application form on our website. The deadline for team member applications is November 23rd (Sunday).