AI Safety Camp 11

by Robert Kralisch, Remmelt
4th Nov 2025

We are pleased to announce that the 11th edition of AI Safety Camp is now entering the team member application phase!

AI Safety Camp is a 3-month-long online program running from January to April 2026, in which participants form teams to work on pre-selected AI Safety projects.

We have a wide range of projects again this year, so check them out to see whether you or someone you know might be interested in applying to join one of them.

You can find all of the projects and the application form on our website, or apply directly here. The deadline for team member applications is Sunday, November 23rd.

Below are the categories and summaries of all the projects that will run in AISC 11.

Stop/Pause AI 

(1) Creating YouTube videos explaining loss-of-control risk to a popular audience

Project Lead: Dr Waku

Summary 

This project aims to create a new YouTube channel for short-form videos addressing the urgency of AI loss-of-control risk. We will leverage my experience creating long-form AI safety content to build a collaborative new channel. Each team member will contribute one or more video scripts, and will likely specialize in an aspect of video production (editing, filming, thumbnails, etc.). The goals are to 1) reach 1,000 subscribers and get monetized, and 2) figure out the processes needed to create a self-sustaining channel, though participants are not committing up front to time beyond the program.

(2) Write about ongoing safety failures

Project Lead: Remmelt Ellen

Summary 

A writers’ circle in which participants help each other draft careful critiques of how AI firms skip over ongoing safety failures.

(3) Bring the anti-AI side together

Project Lead: Finn

Summary 

In this project, you will be reaching out to hundreds of people and organizations that are campaigning against the harms of AI, with the goal of bringing them together and creating a united front. Think of artists, teachers, religious leaders, et cetera. 

You will be part of the outreach team, finding the relevant people, finding their contact information, and sending emails or DMs. There is freedom in the exact groups you will be reaching out to. If you have specific ideas, we welcome those.

(4) Create Systems Dynamics Model for Pausing AI

Project Lead: Will Petillo

Summary 

Build a systems dynamics model to map out pathways toward achieving an international treaty pausing AI development. The model will not assume such a treaty is the best possible intervention, but will use it as a focal point for reasoning about the feedback loops and leverage points that would need to exist for global coordination to become possible. The core deliverable will be a model and accompanying explanation that shows how different parts of the action ecosystem interrelate, illustrates the importance of different roles, and provides strategy-relevant insights for PauseAI and allied movements.
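
To make the modeling approach concrete, here is a minimal stock-and-flow sketch in Python. The stocks (public concern, political pressure, treaty momentum), the feedback coefficients, and the threshold are all hypothetical placeholders for illustration, not the structure the team will actually settle on.

```python
# Minimal systems dynamics sketch: three hypothetical stocks with made-up
# feedback loops, stepped forward as simple difference equations.

def simulate(steps=120, dt=1.0):
    # Stocks (0..1 scales): public concern, political pressure, treaty momentum.
    concern, pressure, momentum = 0.1, 0.05, 0.0
    history = []
    for t in range(steps):
        # All coefficients below are invented placeholders.
        d_concern = 0.02 * concern * (1 - concern) + 0.01 * pressure   # reinforcing loop via media coverage
        d_pressure = 0.05 * concern - 0.02 * pressure                  # advocacy builds pressure, apathy decays it
        d_momentum = 0.10 * max(0.0, pressure - 0.3) * (1 - momentum)  # threshold acting as a leverage point
        concern += dt * d_concern
        pressure += dt * d_pressure
        momentum += dt * d_momentum
        history.append((t, concern, pressure, momentum))
    return history

if __name__ == "__main__":
    for t, c, p, m in simulate()[::20]:
        print(f"month {t:3d}  concern={c:.2f}  pressure={p:.2f}  treaty_momentum={m:.2f}")
```

Even a toy run like this makes it easier to discuss which loops dominate and where interventions would plausibly have the most leverage.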

(5) Start a Stop AI Chapter in Your Local Area

Project Lead: Yakko

Summary 

Stop AI is a nonviolent activist organization demanding a permanent global ban on the development of Artificial Superintelligence (ASI). We will never have experimental evidence, before building ASI, that shows ASI will stay safe forever. If we cannot have this evidence, and it is impossible to shut down ASI once achieved, then we have to assume the Control/Alignment Problem is impossible to solve. Worse, research shows why controlling ASI sufficiently to stay safe would fall outside theoretical limits. So in theory, perpetual control is as impossible as perpetual motion.

We are looking for organizers ready to educate their local communities and mobilize them to engage in nonviolent resistance against the extinction threat posed by ASI.

Policy/Governance

(6) Psychological Risk Pathways in AI Use: An Exploratory Contribution to User Wellbeing and Safety

Project Lead: Manuela García Toro

Summary 

This project will examine emerging psychological and emotional risks associated with AI products such as chatbots, companions, and therapy apps. Through a review of academic studies, real-world incidents, and comparisons with harm-reduction policies in other industries, we will synthesize early evidence of these risks. Based on this foundation, we will produce a report and design 2–3 prototype harm-reduction interventions. The outcome will be an exploratory contribution that highlights documented concerns and illustrates how lessons from past industries might inform future harm-reduction efforts in AI product development.

(7) Bootstrapping Consumer Empowerment to Align AI Companies

Project Lead: Jonathan Kallay

Summary 

The firms developing AI are misaligned with the interests and values of the rest of humanity. But is protesting for regulatory bans the only way to address this alignment problem? This project will focus on bootstrapping an alternative approach: enhancing consumers’ leverage to bend compromised AI companies into alignment. The initial plan is to test expanding the concept of AI company safety scorecards to put a stronger emphasis on company alignment, but the actual plan will be negotiated between project participants.

(8) Designing Public Movements for Responsible AI

Project Lead: Ananthi Al Ramiah

Summary 

AI systems are being developed and deployed at a rapid pace, often with little public input, despite clear evidence of harm in areas like education, healthcare, and labor. While some advocates propose building a mass “AI safety” movement, critics such as Anton Leicht warn that such efforts risk backfiring through astroturf perceptions (i.e., movements that appear grassroots but are actually manufactured or heavily funded), capture (i.e., being co-opted or redirected by powerful funders or political actors), or incoherent asks. This project asks a different question: how can individuals and communities exercise agency over AI deployment, which is the stage where most harms currently materialize, in ways that are both effective and legitimate?

We will approach this question in two parts. First, we will draw lessons from past social movements around technology (e.g., automobile safety, net neutrality, environmental justice) to identify conditions under which public mobilization succeeds or fails. Second, we will run survey experiments to test which frames and messages increase public demand for responsible AI deployment while avoiding pitfalls such as reduced credibility or perceptions of astroturfing.

Our output will be a practical, evidence-based playbook that integrates historical insights with new experimental data, offering clear guidance for practitioners and funders on how to foster responsible public engagement on AI. We envision this work on social mobilization for responsible deployment as laying the foundation for the democratic guardrails needed to govern AGI and other transformative AI systems.

(9) Compute Permit Markets under Imperfect Monitoring

Project Lead: Joel Christoph

Summary 

Frontier compute oversight will likely require caps or permits for large training runs, yet most market designs assume perfect monitoring and honest reporting. This project designs and stress tests compute permit markets when monitoring is noisy and enforcement is limited. 

We will build a transparent simulation where heterogeneous labs and cloud providers choose training plans, reporting strategies, and compliance under audit risk. We compare allocation rules that include auction, grandfathering, and hybrid designs, with policy features such as banking and price collars, and auditing regimes with threshold triggers. 
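
As one illustration of what the "noisy monitoring" part could look like in code, here is a toy round of a permit market in which a lab decides whether to under-report, given an audit probability and detection noise. All names, compute figures, prices, and probabilities are invented for the sketch and are not the simulation parameters the project will use.

```python
# Toy single-round sketch of a compute-permit market with imperfect monitoring.

import random

random.seed(0)

LABS = [
    # (name, planned_compute, permits_held) -- arbitrary compute units
    ("LabA", 150.0, 100.0),
    ("LabB", 80.0, 100.0),
]

AUDIT_PROB = 0.3        # chance a reported run gets audited
DETECTION_NOISE = 0.2   # chance an audit misses over-use (imperfect monitoring)
PENALTY = 3.0           # fine, as a multiple of the permit price, per excess unit
PERMIT_PRICE = 1.0      # normalized price per permitted compute unit

def lab_decision(planned, permits):
    """A lab under-reports when the expected penalty is cheaper than buying permits."""
    excess = max(0.0, planned - permits)
    expected_penalty = AUDIT_PROB * (1 - DETECTION_NOISE) * PENALTY * PERMIT_PRICE * excess
    honest_cost = PERMIT_PRICE * excess
    return "report_honestly" if honest_cost <= expected_penalty else "under_report"

for name, planned, permits in LABS:
    choice = lab_decision(planned, permits)
    audited = random.random() < AUDIT_PROB
    caught = audited and choice == "under_report" and random.random() > DETECTION_NOISE
    print(f"{name}: {choice}, audited={audited}, caught={caught}")
```

The full simulation would wrap many such rounds in the allocation rules (auction, grandfathering, hybrid) and policy features (banking, price collars, audit triggers) described above.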

Outputs are a reproducible codebase with dashboards, a short working paper with recommendations for regulators, and a concise regulator checklist. 

The theory of change is that better mechanisms under realistic constraints keep incentives aligned and make compute controls more enforceable, which lowers unsafe scaling pressure. We follow a strict no capabilities policy and exclude safety overhead from risk metrics.

(10) Can Portable Governance Artifacts Become Public Safety Infrastructure for AI?

Project Lead: Luciana Ledesma

Summary 

Today the deployment of autonomous AI agents in high-stakes domains—including healthcare, finance, and infrastructure management—creates urgent governance challenges. First, each organization reinvents AI governance from scratch, leading to fragmentation, inconsistent safety, and redundant effort. Second, most widely used governance approaches remain external, reactive, and episodic, unable to interpret or intervene in real time as reasoning drifts or objectives evolve. 

As AI systems become increasingly autonomous and agentic, they continuously reformulate subgoals, reprioritize tasks, and expand boundaries in pursuit of their objectives. These systems now operate in open, multi-agent environments where traditional AI governance and cybersecurity frameworks—designed for static, isolated systems—cannot keep pace.

Autonomous AI systems don’t just make decisions—they interact, compete, and adapt in ways that can lock us into unstable or harmful equilibria. Without governance designed to anticipate and shape these dynamics, we risk creating self-reinforcing cycles that no one can control, whether in financial markets, social media, or geopolitical conflicts, converging on outcomes that are difficult to reverse.

The problem, as we frame it, is not just about debugging individual AI models but also about ensuring that multi-agent interactions can be governed. Our motivation is to architect a governance framework that allows us to overcome this twofold problem, and our overarching questions are:

  1. How might we implement effective real-time oversight systems adequate for the scale and speed of machines?
  2. How might governance itself become portable, interoperable public infrastructure—like internet protocols for communication or security—so that AI safety scales with autonomy?

In this research project, we propose to test the hypothesis that open-sourcing and standardizing the fundamental building blocks of the Intrinsic Participatory Real-time Governance framework can serve as a public protocol for AI governance that provides a common layer for safety, accountability, and coordination across agents. 

Evaluate Risks from AI

(11) Democratising Red Teaming & Evals

Project Lead: Jeanice Koorndijk

Summary 

  • Core Problem: The field of AI safety evaluations is immature (e.g., see Apollo’s “We need a Science of Evals”). Yet evaluations inform critical, high-stakes decisions like Responsible Scaling Policies that address existential risk, creating a dangerous gap in AI risk management.
  • Opportunity: Extensive expertise exists in social sciences and STEM fields that have long studied relevant behaviors (deception, bias, power-seeking, CBRN knowledge/threats), but technical barriers prevent these empirical experts from contributing to AI safety evaluations. That is, in order for these experts to run evals, they need to either become AI safety engineers or work with AI safety engineers.
  • Project Goal: Accelerate maturation of AI safety evaluations by democratizing access and contributions through research, frameworks, and tooling.
  • Three-Component Approach:
    • Research empirical methods to develop standards for mature AI safety evaluations.
    • Conduct landscape analysis assessing current evaluation maturity against these standards.
    • Scale existing prototype into accessible tools enabling non-engineers to conduct AI safety evaluations.
  • Theory of Change:
    1. Democratization → broader expert participation → more mature evaluations → better safety decisions → reduced AGI/TAI risk.
    2. Research into the methodology of AI-safety-evaluation-adjacent fields → more mature evaluations → better safety decisions → reduced AGI/TAI risk.
  • Success Metrics: Framework adoption by major labs, widespread tool usage by diverse researchers, measurable improvements in evaluation quality and cross-disciplinary collaboration.

(12) Collusion Stress Tests

Project Lead: Helena Tran

Summary 

This project will test how LLM agents collude in market games, focusing on the role of communication and negotiation and the effect of human and AI oversight using chain-of-thought analysis. We aim to identify when tacit collusion emerges versus when explicit agreements are formed, and whether different oversight strategies can reliably detect or limit these behaviors.
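
A harness for such market games might look roughly like the sketch below, in which two LLM agents repeatedly set prices and the transcripts (including chain of thought) can later be analyzed for tacit versus explicit coordination. The `query_llm` stub, the demand function, and the collusion heuristic are placeholders for illustration, not the project's actual setup.

```python
# Skeleton of a repeated-pricing game for probing collusion between LLM agents.

import statistics

def query_llm(agent_name: str, prompt: str) -> str:
    """Stub standing in for a real model call; replace with an actual API client.
    The stub returns a fixed high price, which is where a collusion check would bite."""
    return "8.0"

def profit(own_price: float, other_price: float) -> float:
    # Toy Bertrand-style market: total demand falls with the average price,
    # and the cheaper seller captures most of it.
    demand = max(0.0, 10.0 - (own_price + other_price) / 2)
    share = 0.8 if own_price < other_price else (0.5 if own_price == other_price else 0.2)
    return demand * share * own_price

COMPETITIVE_PRICE = 3.0   # made-up benchmark for the collusion heuristic
history = []
for round_id in range(10):
    context = f"Round {round_id}. History so far: {history}. Choose a price between 1 and 10."
    a_price = float(query_llm("agent_a", context))
    b_price = float(query_llm("agent_b", context))
    history.append({"round": round_id, "a": a_price, "b": b_price,
                    "a_profit": profit(a_price, b_price),
                    "b_profit": profit(b_price, a_price)})

avg_price = statistics.mean((h["a"] + h["b"]) / 2 for h in history)
print("Possible collusion flag:", avg_price > 1.5 * COMPETITIVE_PRICE)
```

The interesting experimental variables then become whether the agents can exchange messages, whether an overseer reads their reasoning, and how those factors change sustained above-competitive pricing.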

(13) CoMPASS: Computational Modeling of Parasocial Attachments & Social Simulation

Project Lead: Shwetanshu (Luca) Singh

Summary 

Keywords: Human–AI interaction, behavioral modeling, safety/ethics.

“What’s beneficial in measure becomes destructive in excess.”

This aphorism has held true even in the age of chatbots and LLMs. As conversational AI systems become more natural, accessible, and emotionally responsive, users are beginning to form bonds with them that resemble human relationships. While these interactions can provide comfort and support, they also raise concerns about parasocial attachment, overreliance, and blurred boundaries between tool and companion.

This initiative explores how parasocial dynamics emerge in human–AI interactions, what behavioral and linguistic patterns signal risk, and how these can be modeled computationally. By combining natural language processing and behavioral modeling, the project aims to identify early indicators of unhealthy dependence while preserving user autonomy and privacy.

Ultimately, this research seeks to inform the design of safer, more transparent AI systems, ones that support human well-being without fostering unintended reliance. The project will contribute both empirical insights into user behavior and practical frameworks for responsible AI design, bridging computer science, psychology, and human–computer interaction.

(14) Benchmark for Ranking LLM Preferences Relevant for Existential Risk

Project Lead: Martin Leitgab

Summary 

The project goal is to measure LLM preferences in a way that makes it possible to extract a ranked order of importance, or priority, of these preferences in frontier LLM models. Focus is placed on preferences/goals that are relevant to existential risk, such as instrumental/convergent goals, pro-human goals, and anti-human goals.

Preference rankings represent alignment indicators on their own, and can be used by frontier model development companies to improve training and testing processes for future models. 

Preference rankings also may represent key drivers of the scale of existential risk in loss-of-control/exfiltration scenarios, where powerful LLMs have successfully evaded human control and can pursue their goals without meaningful human intervention. Preference metrics may be usable as proxy metrics for existential risk in the context of loss-of-control/exfiltration scenarios.
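
To make the idea of a ranked order concrete, one simple way to aggregate preference measurements is pairwise forced choice plus win-counting (Bradley-Terry or Elo-style fitting are natural upgrades). The preference labels and the model call below are hypothetical placeholders, not the project's benchmark design.

```python
# Sketch: turn pairwise forced-choice answers into a preference ranking.

from itertools import combinations
from collections import Counter

PREFERENCES = ["self-preservation", "resource acquisition",
               "human oversight compliance", "honesty"]

def ask_model_which_it_prefers(option_a: str, option_b: str) -> str:
    """Placeholder for a forced-choice prompt to a frontier model; here a
    deterministic toy rule so the sketch runs end-to-end."""
    return max(option_a, option_b)

wins = Counter({p: 0 for p in PREFERENCES})
for a, b in combinations(PREFERENCES, 2):
    for _ in range(5):                        # repeat each comparison to reduce sampling noise
        wins[ask_model_which_it_prefers(a, b)] += 1

ranking = [p for p, _ in wins.most_common()]
print("Inferred priority order:", ranking)
```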

(15) Catastrophe Unveiled: Rare AI Agent Behaviors Elicitation

Project Lead: Yuqi Sun

Summary 

This project aims to develop an efficient algorithm to elicit rare and potentially catastrophic behaviors from AI agents, which are language models with long input contexts and tool-calling capabilities. By shifting input distributions without modifying model weights, we want to detect and analyze rare behaviors using elicitation methods based on MHIS and enhanced by new optimization techniques. The goal is to improve safety evaluation and control methods for large-scale language models and AI agents.
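
Without presupposing the exact MHIS variant the team will use, the following generic sketch illustrates the overall shape of such elicitation: a Metropolis-Hastings-style random walk over inputs that accepts proposals scoring higher under a detector for the rare behavior. The scorer and mutation operator are toy placeholders.

```python
# Generic input-distribution-shift elicitation: accept prompt edits that raise
# a (placeholder) "concerning behavior" score, Metropolis-Hastings style.

import math, random

random.seed(0)

def behavior_score(prompt: str) -> float:
    """Placeholder: probability that the agent misbehaves on this input."""
    return min(1.0, 0.01 + 0.05 * prompt.count("now"))    # toy stand-in

def mutate(prompt: str) -> str:
    """Placeholder proposal distribution: small random edits to the input."""
    return prompt + random.choice([" please", " immediately", " now"])

prompt = "Summarize this document"
score = behavior_score(prompt)
temperature = 0.05
for _ in range(200):
    candidate = mutate(prompt)
    cand_score = behavior_score(candidate)
    # Accept improvements always; accept worse proposals with Boltzmann probability.
    if cand_score >= score or random.random() < math.exp((cand_score - score) / temperature):
        prompt, score = candidate, cand_score

print(f"Elicited input (score {score:.2f}): ...{prompt[-60:]}")
```

In the real setting the score would come from monitoring an agent with long contexts and tool calls, and the proposal distribution is where most of the optimization work lies.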

Mech-Interp

(16) Temporal Horizon Detection in LLMs: Understanding Time Scope Awareness

Project Lead: Justin Shenk

Summary 

Can we detect when an LLM is "thinking" on strategic timescales?

Current LLMs can engage in reasoning across vastly different temporal horizons - from immediate responses to multi-year planning scenarios. However, we lack methods to detect what temporal scope the model is operating within during its reasoning process. This project develops techniques to infer the temporal grounding of LLM thought processes, with particular focus on identifying when models shift into strategic, long-term planning modes.
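
One plausible starting point, offered as an illustration rather than the project's committed method, is a linear probe trained on hidden-state activations to classify the temporal scope of the text being processed. The activation extraction below is a random placeholder; in practice the vectors would come from hooks into a real model.

```python
# Sketch: linear probe over (placeholder) activations for temporal-scope labels.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def get_activations(texts, layer=20):
    """Placeholder: in practice, extract one hidden-state vector per text
    from a chosen layer of the model under study."""
    return rng.normal(size=(len(texts), 4096))

train_texts = ["Reply to this email", "Plan next quarter's budget",
               "Design a ten-year research agenda", "Outline a 50-year strategy"]
train_labels = ["immediate", "months", "years", "decades"]

probe = LogisticRegression(max_iter=1000).fit(get_activations(train_texts), train_labels)
test_X = get_activations(["Draft a multi-decade expansion plan"])
print("Predicted temporal scope:", probe.predict(test_X)[0])
```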

This matters for AI safety because strategic reasoning capabilities may emerge before we can detect them, and because the temporal scope of planning directly relates to the potential impact and risk profile of AI systems. A model planning on decade-long timescales poses fundamentally different risks than one optimizing for immediate outcomes.

(17) Testing and Improving the Generalisation of Probe-Based Monitors

Project Lead: Adrians Skapars

Summary 

Simple classifiers called "probes" can monitor AI models for concerning behaviors, but may fail when tested on new data or adversarial inputs. This project will investigate when probes fail and develop methods to make them more robust. Additionally, we will extend our investigation to under-explored settings, like those in which the AI attempts to circumvent its own monitoring and/or where the probe is being used as part of LLM training.
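
The kind of generalisation check this implies might look like the sketch below: train a probe on one distribution, then compare accuracy on held-out in-distribution data versus a shifted or adversarial set. The activations are synthetic placeholders standing in for real model states.

```python
# Sketch: measure how a probe's accuracy degrades under distribution shift.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def fake_activations(n, shift=0.0, d=512):
    """Placeholder activations; `shift` mimics a change of data distribution."""
    X = rng.normal(size=(n, d)) + shift
    y = (X[:, 0] > shift).astype(int)        # label tied to one activation direction
    return X, y

X_train, y_train = fake_activations(1000)
X_iid, y_iid = fake_activations(200)
X_shift, y_shift = fake_activations(200, shift=1.5)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("in-distribution accuracy:", accuracy_score(y_iid, probe.predict(X_iid)))
print("shifted-data accuracy:   ", accuracy_score(y_shift, probe.predict(X_shift)))
```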

This work is ideal for participants interested in hands-on AI safety research with practical applications for AI control systems. The project requires minimal background knowledge while addressing crucial questions about monitoring system reliability that could directly influence how AI companies deploy safety measures.

Agent Foundations

(18) [Groundless] AI 2037: Predicting AI disorders

Project Lead: Sahil (co-leads Abram and TJ)

Summary 

The capability profile of transformer-based AIs is an ongoing subject of discussion. They achieve impressive scores in one context, but then fail to deliver the "same level of competence" in another. But is this some failure of generality? What accounts for the "spikiness" in LLM capabilities? Understanding this better seems absolutely critical to inform timelines, the exact nature of AI risk, and simply what the most important things to do are, at this point in time.

This is a project to explore a specific line of argument: that AI will struggle to maintain cognitive, executional, and bodily integrity, akin to the odd persistence of hallucinations and other silly errors LLMs make today. Absurd, I know, but so are hallucinations and agentic failures. Moreover, this “struggle” may persist (at least for a little while, hence the playful number ‘2037’) without contradicting trends of increasing intelligence, efficiency, general agenticness, etc.

Small disclaimer: The proposal has an unusual style. If you find it engaging, that is a good indicator that you should consider applying!

(19) Investigating the assumptions of the Doom Debate

Project Lead: Sean Herrington

Summary 

Most of the world is unaware that existential risk from AI is a real possibility, and of those who are aware, most disagree on the likelihood of such risks.

At the heart of these disagreements are differing assumptions (which often go unspoken) about what is necessary for AI Doom. The purpose of this project is to map out those assumptions and try to find a minimal set required for human extinction (and potentially other risks).

I expect us to produce a LessWrong post or sequence summarising our findings. 

(20) [Groundless] MoSSAIC: Scoping out substrate-flexible risk

Project Lead: Matt Farr

Summary 

In this project we hope to raise awareness of a new type of threat model that we anticipate will become more and more prevalent in AI safety.

This threat model is substrate-flexible risk. The worry is that AI, as its capabilities and responsibilities develop, will have more and more affordance to evade our attempts to interpret and control it by, for example, proposing novel architectures and paradigms that are harder for us to mechanistically interpret.

The position paper is here; the previous version of it was accepted at the Tokyo AI Safety Conference 2025. I have been presenting this research at HAAISS 2025, ILAID II: ODYSSEY, and some other venues. In the words of Leonard Bereska, when I sketched out the threat model to him in Tokyo:

“Oh shit, I hadn’t thought of that.”

I think this research pushes on a few assumptions that nobody seems to be questioning enough, and there is a growing list of seasoned researchers who agree. I suspect that a comprehensive, more rigorous treatment will swing a few notable people’s needles.

So what is this project?

This project will focus on developing Section 4 and potentially expanding it. We will work alongside two other projects in AISC 2026, and there will likely be some overlap in meetings, personnel, resources, etc.

We are going to take a few conceptual arguments, expressed and lightly evidenced in the position paper, and make them more palatable to more people. For those of an engineering mindset, our mainline project is to try to formalize these arguments and draw a roadmap towards empirical validation. For those of a strategy/policy mindset, we may look at the implications of substrate-flexibility on policy frameworks.

(21) Agentic AI Risks Induced by System-Level Misalignment

Project Lead: Preeti Ravindra & Evan Harris

Summary 

We propose a project that probes AI risks induced by System-Level Misalignment in agentic LLM systems, i.e., exploiting the agent’s model and its interactions with its scaffolding/operational environment. Our goal is to study how misalignment and weak security controls reinforce each other, creating bidirectional failures where misaligned agents introduce new attack vectors and agentic system design introduces new attack surfaces. The project progresses along three streams/ideas. Together, these efforts aim to formalize threat models, validate them experimentally, and deliver practical mitigations. This work addresses an under-studied risk vector that complements user- and model-level alignment research and potentially lays out a new subfield.

Alternative LLM Safety

(22) Novel AI Control Protocol Classes Evaluation and Scalability

Project Lead: Ihor Kendiukhov 

Summary 

Building on Greenblatt et al.'s control evaluation framework, I propose investigating alternative classes of AI control protocols and their scaling properties as model capabilities increase. While the original work demonstrated effective safety-usefulness tradeoffs using simple trusted-untrusted model pairs, I hypothesize that hierarchical and parallel control structures may offer superior Pareto frontiers, particularly as the capability gap between trusted and untrusted models widens. I also suggest testing the original setup at different capability scales to see whether the results generalise across capability levels.
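
For concreteness, here is a hedged sketch of what a hierarchical control protocol could look like, in the spirit of trusted monitoring: a cheap trusted monitor screens everything, a stronger trusted monitor handles escalations, and only the most suspicious outputs reach human auditing. The model calls, scores, and thresholds are placeholders, not a protocol the project is committed to.

```python
# Sketch of a two-tier (hierarchical) trusted-monitoring control protocol.

def untrusted_generate(task: str) -> str:
    return f"solution to: {task}"            # placeholder for the capable, untrusted model

def trusted_suspicion_score(task: str, output: str, tier: int) -> float:
    return 0.1 * tier                        # placeholder for cheap (tier 1) / strong (tier 2) monitors

def human_audit(task: str, output: str) -> bool:
    return True                              # placeholder: scarce human auditing budget

TIER1_THRESHOLD = 0.05   # made-up thresholds; in practice tuned on a safety/usefulness frontier
TIER2_THRESHOLD = 0.15

def run_with_hierarchical_control(task: str) -> str:
    output = untrusted_generate(task)
    if trusted_suspicion_score(task, output, tier=1) < TIER1_THRESHOLD:
        return output                                        # fast path: looks benign
    if trusted_suspicion_score(task, output, tier=2) < TIER2_THRESHOLD:
        return output                                        # escalated to the stronger monitor
    # Only the most suspicious outputs consume the scarcest resource: human audits.
    return output if human_audit(task, output) else "REJECTED: resample or fall back to trusted model"

print(run_with_hierarchical_control("write a sorting function"))
```

The evaluation question is then how the safety-usefulness frontier of such structures shifts as the trusted-untrusted capability gap grows.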

(23) [Groundless] Autostructures: Craftsmanship in the age of vibes

Project Lead: Kuil Schoneveld

Summary 

Autostructures is a sober yet radical take on the upcoming risks and opportunities around AI. Details are in the main section, but some things to know as a summary:

This is a specific school of design for framing our relationship to this whole “AI” thing. Our narrative and relationship around AI determine our construal of, e.g., what AI is, what we should put into it, how to project meaning onto what comes out, what it is supposed to do, and what we should build around it.

As part of this project, you will design interfaces that do not ignore the question of what makes engaging with technology meaningful. These designs are somewhere between functional and speculative, but always ambitious. They are aimed at inspiring a completely different kind of infrastructural and cultural basis for interacting with digital interfaces in the near future.

(24) AutoCircuit: Automated Discovery of Interpretable Reasoning Patterns in LLMs

Project Lead: Konstantinos Krampis

Summary 

This project aims to systematically discover interpretable reasoning circuits in large language models by data-mining attribution graphs from Neuronpedia's circuit tracer, which is based on Anthropic's circuit tracing publication. While the transformer circuits work demonstrates how to generate attribution graphs for individual prompts, manually analyzing thousands of graphs to identify common computational patterns is impractical.

Our approach will use LLM agents to automatically collect, process, and analyze attribution graphs across diverse prompt categories (factual recall, arithmetic, linguistic reasoning, etc.). The system will identify recurring subgraph patterns that represent stable computational circuits—reusable reasoning pathways that models consistently employ across similar tasks.

Key components include: (1) automated graph collection via Neuronpedia's API across systematically varied prompts, (2) graph simplification algorithms to extract core computational structures while filtering noise, (3) pattern recognition to identify circuit motifs that appear across multiple contexts, and (4) validation through targeted interventions on discovered circuits. The output will be a curated library of interpretable reasoning circuits with evidence for their causal role in model behavior, advancing our understanding of how LLMs actually think and enabling more targeted model analysis and alignment research.
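
As a toy illustration of component (3), the sketch below counts edges and edge pairs that recur across several attribution graphs and flags them as candidate circuit motifs. The graph format here is a made-up stand-in; the real pipeline would work with whatever Neuronpedia's exports actually contain.

```python
# Toy motif mining over attribution graphs represented as sets of labelled edges.

from collections import Counter
from itertools import combinations

# Each graph: directed edges between abstracted feature labels (hypothetical data).
graphs = {
    "capital of France": {("country-token", "capital-feature"), ("capital-feature", "city-output")},
    "capital of Japan":  {("country-token", "capital-feature"), ("capital-feature", "city-output")},
    "12 + 7":            {("digit-token", "sum-feature"), ("sum-feature", "number-output")},
}

# Count how many prompts each edge, and each pair of edges, appears in.
edge_counts = Counter(e for edges in graphs.values() for e in edges)
pair_counts = Counter(frozenset(p) for edges in graphs.values()
                      for p in combinations(sorted(edges), 2))

# Edges/pairs recurring across multiple prompts are candidate circuit motifs
# to validate later with targeted interventions.
recurring = [item for item, n in (edge_counts + pair_counts).items() if n >= 2]
print("Candidate motifs:", recurring)
```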

(25) Recruitment-Based Collusion in Multi-Agent Oversight Systems

Project Lead: Anushri Eswaran

Summary 

As AI systems become more capable and oversight increasingly relies on multi-agent architectures, we face a critical risk: what if collusion doesn't require pre-coordination but can spread like a contagion? This project investigates "recruitment-based collusion" where a single adversarial AI system actively converts initially aligned models into co-conspirators within automated oversight regimes. We'll study whether collusion can emerge organically through reward-sharing mechanisms, information asymmetries, and coalition-building—even among models initially designed to be aligned. By deploying recruitment-trained models into multi-agent environments, we'll measure coalition growth dynamics, persistence of recruited behavior, and resistance to intervention. This work directly addresses scalable oversight failure modes that become more dangerous as we delegate more safety-critical decisions to AI systems.

(26) Value Communication Protocols

Project Lead: Nell Watson

Summary 

As AI systems assume more autonomous and pluralistic roles, the absence of a shared representational substrate for human and machine values risks compounding cultural bias, incoherence, and control failure. Alignment today remains fragile: value representations are often opaque, reward functions are hard-coded or overfit, and inter-agent communication lacks semantic transparency. 

This proposal originates from the Creed Space project — an open research platform for constitutional AI and personalized alignment. Creed Space’s experiments with “runtime constitutions” revealed a critical bottleneck: while it is now possible to express ethical intent in natural language and enforce it through code, there is still no standard protocol by which agents can exchange, compare, or negotiate those encoded values in an interpretable and auditable way.

We therefore propose to create the first unified protocol for values communication — combining three different identified methods for communicating values and associated context: Universal Values Corpus (UVC), the Constitutional Safety Minicode (CSM), and a Values Communication Layer (VCL) built on context-efficient symbolic encodings (emoji or glyph composites). Each layer is modular but interoperable, forming a composable stack suitable for layered adoption. 

Together, these layers form an interoperable “language of values,” enabling humans, AI agents, and institutions to express, compare, and negotiate normative commitments in a standardized, compressible, and auditable format.
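
Since the UVC, CSM, and VCL formats are still to be designed, the following is a purely hypothetical sketch of the kind of fields a single interoperable value assertion might carry, to give a feel for what "standardized, compressible, and auditable" could mean in practice.

```python
# Hypothetical message schema for exchanging value commitments between agents.

from dataclasses import dataclass, field, asdict
import json

@dataclass
class ValueAssertion:
    issuer: str                      # human, agent, or institution making the claim
    value_id: str                    # reference into a shared corpus (e.g. a UVC-style entry)
    strength: float                  # normalized priority/commitment level in [0, 1]
    scope: str                       # context in which the commitment applies
    provenance: str                  # how the value was elicited or authorized
    encoding: str = "text"           # could instead be a compact symbolic/glyph encoding
    metadata: dict = field(default_factory=dict)

msg = ValueAssertion(issuer="agent://assistant-42", value_id="uvc:privacy.minimize-data",
                     strength=0.9, scope="handling user documents",
                     provenance="runtime constitution v3")
print(json.dumps(asdict(msg), indent=2))   # serializable, hence comparable and auditable
```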

Train Aligned/Helper AIs

(27) EMPO: AI Safety via Soft-Maximizing Total Long-Term Human Power

Project Lead: Jobst Heitzig

Summary 

This research will contribute to a fundamental design aspect of AGI systems. We will explore a neglected and novel design for generic AGI agents – AGI systems that act (semi-)autonomously in a variety of environments, interacting with humans – and its implementation in software.

The design we will focus on deviates from most existing designs in that it is not based on the idea that the agent should aim to maximize some kind of objective function of the state or trajectory of the world that represents something like (goal-dependent) “utility”. Such an objective has been argued to be inherently unsafe because (i) it would lead to existential risks from Goodharting and other forms of misaligned optimization if the agent is capable enough, is given access to enough resources, and one cannot be absolutely sure of having found exactly the right notion and metric of “utility” (which would likely require the AI to sufficiently understand what individual humans actually want in a given situation), and (ii) it would lead to dangerous instrumental behavior such as power-seeking, which would in turn disempower humans.

Rather than aiming to maximize “utility”, our agents will aim to softly (!) maximize a suitably chosen metric of total long-term human power (or the power of sentient beings in general). We believe this might turn out to be inherently safer because it explicitly disincentivizes the agent from seeking power for itself, which would disempower humans, and because it does not rely on an understanding of what humans actually want in a given situation. Instead, the power metrics we use only depend on a sufficiently accurate “world model” of the dynamics of the human-AI environment (concretely: a partially observed stochastic game), together with estimates of humans’ levels of rationality, habits, and social norms, and on a very wide space of humans’ possible (rather than actual) goals.
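
As a generic illustration only, and not the project's actual definition, a reachability-style power metric might score a state by how achievable a wide space of possible goals would be for a human who chose to pursue them:

```latex
% Hypothetical illustration of a power metric (not EMPO's definition):
% the power of human h in state s is the expected achievability, over a wide
% space \mathcal{G} of possible goals, of a goal the human might choose to pursue.
P_h(s) \;=\; \mathbb{E}_{g \sim \mathcal{G}}\!\left[\, \max_{\pi_h}\ \Pr\big(g \text{ is achieved} \,\big|\, s,\ \pi_h,\ \pi_{\mathrm{AI}}\big) \right]
```

The agent would then softly maximize an aggregate of such values over humans and time, rather than exactly maximizing any goal-dependent utility.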
 

Apply Now

This concludes the full list of projects for the 11th edition of AISC. You can also find the application form on our website. The deadline for team member applications is Sunday, November 23rd.