AI Safety Camp 10 Outputs

by Robert Kralisch, Remmelt, Linda Linsefors
5th Sep 2025
21 min read

This post is for sharing the outputs of AISC10: Virtual, which took place from January to April 2025. You can also find them on our website.

We are quite happy with this year's edition of AISC, which featured a wide range of approaches to reducing AI risk. 
A number of projects hosted at AISC10 were published as papers, produced helpful community resources, led to funded follow-up research, and/or helped their participants transition to full-time work in the field of AI Safety.

We will open project applications for the upcoming edition, AISC11, in the coming days.


You are encouraged to scan the post outline for the topic area of greatest interest to you and check out the projects we hosted. 
As a general rule, the first person listed among the team members was the project lead for the respective project.

 

Stop/Pause AI

Growing PauseAI 

Team members: Chris Gerrby, Sharon Mwaniki, Alyssa Chase-Vilchez, Manuela García Toro, Andrei-Octavian Dirla 

Project Summary: This project explored multiple avenues of scaling the PauseAI movement. 

Project Outputs: 
Recommendations on how to Reframe the Pause AI message by Sharon
Thematic Analysis of social and Environmental Movements in relation to Pause AI by Sharon and Alyssa
Report Analysis: Rapidly Growing Social Movements and Key Factors in Their Growth by Alyssa
Relation between PauseAI and Anti-nuclear Weapons Movements by Manu
PauseAI email writer by Andrei
MAISU lightning talk


 

AI Policy Course: AI's capacity for exploiting existing legal structures and rights

Team members:  Marcel Mir, Kathrin Gardhouse, Suchet Mittal, Chloe Jefferson, Melissa Ninsiima, Arth Singh, Feranmi Adeoye, Ramya Nadig

Project Summary: The project involves developing a modular course that identifies legal vulnerabilities in the deployment of AI systems in high-stakes sectors. It maps key stakeholders and proposes liability frameworks to help allocate responsibility appropriately. Our premise is that clear liability structures and proper accountability assignment can discourage the reckless deployment of AI systems.

We believe the course can be a tool to identify vulnerabilities, inform key stakeholders, such as policymakers, lawyers and researchers, and become a valuable educational tool. We hope to inspire future research, policy actions or technical solutions for the challenges identified, while supporting the communication of these critical issues to the broader public.

Project Outputs: 
AISC AI Liability Course Interest Form
MAISU lightning talk

 

Building the Pause Button: A Proposal for AI Compute Governance 

Team members:  Joep Meindertsma, Farhan Shafiq, Raymond Koopmanschap, Ananthi Al Ramiah, Dominika Kunertova, Mitali Mittal, Ricardo Manhães Savii 

Project Summary: Studying the supply chains of AI training to identify appropriate intervention points regarding dangerous AI development.

Project Outputs: Building the Pause Button webpage
MAISU lightning talk

 

StopAI Campaign 

Team members:  Finn van der Velde, Sam Kirchner 

Project Summary: We created short punchy videos and tweets about AI companies recklessly causing risks. We flyered multiple days a week for three months and therein talked to thousands of people in San Francisco. We also gave presentations about current and future AI dangers.

 

Evaluate Risks from AI

 

Simulator Theory 

Team members:  Will Petillo, Sean Herrington, Spencer Ames, Adebayo Mubarak, Can Narin

Project Summary: Articulate the simulator lens for understanding LLMs in comparison with the more familiar tool and agent lenses.  Explore alignment implications of each lens, their consistency with observations of how LLMs work, and training processes that shift the balance regarding which paradigm dominates behavior.   Finally, consider various development paths future AI might take. 

Project Outputs: 
LessWrong Sequence: Simulators vs Agents: Updating Risk Models
MAISU lightning talk

 

Formalize the Hashiness Model of AGI Uncontainability

Team members:  Thibaud Veron, Aybars Kocoglu, Remmelt Ellen (project lead), Forrest Landry (supervisor), Anders Sandberg (research lead) 

Project Summary: This project is a first, exploratory step towards understanding which parameters are important to look at when tackling agent control, and how quickly the resources required for control grow with respect to these parameters. Do we expect linear scaling, exponential dynamics, or abrupt phase transitions? Is there a theoretical ground on which to build agent control?

Project Outputs: 
Poster Control Conference 2025
MAISU lightning talk
Limits to Control Workshop run as a follow-up

 

LLMs: Can They Science?

Team members:  Egg Syntax, Matt Broerman, Darshana Saravanan, Fabio Marinello, Zexi 'Bob' Fu, Jord Nguyen

Project Summary: Are LLMs capable of the sort of general reasoning (notably generating and testing hypotheses) that would allow them to do independent scientific research? If so, we should have shorter timelines, since it suggests that current architecture can scale to AGI. We test them on novel toy domains governed by randomized scientific laws, and find that leading LLMs can in fact do this, although not yet reliably. 
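
To illustrate the shape of this kind of evaluation, here is a minimal, purely hypothetical sketch of a "toy domain governed by a randomized law": the hidden linear law, the prompt format, and the `query_llm` placeholder are illustrative choices, not the project's actual setup (see the linked slides for that).

```python
import random

def make_toy_domain(seed):
    """Hypothetical toy domain: a hidden 'scientific law' y = a*x + b*z with
    randomized integer coefficients that the model must infer from observations."""
    rng = random.Random(seed)
    a, b = rng.randint(-5, 5), rng.randint(-5, 5)
    law = lambda x, z: a * x + b * z
    points = [(rng.randint(0, 9), rng.randint(0, 9)) for _ in range(8)]
    observations = [(x, z, law(x, z)) for x, z in points]
    held_out = (rng.randint(0, 9), rng.randint(0, 9))
    return observations, held_out, law(*held_out)

def score_model(query_llm, n_trials=20):
    """query_llm is a placeholder for whichever chat API is being evaluated."""
    correct = 0
    for seed in range(n_trials):
        obs, query, answer = make_toy_domain(seed)
        prompt = (
            "These observations follow a hidden law. Each line is x, z, y:\n"
            + "\n".join(f"{x}, {z}, {y}" for x, z, y in obs)
            + f"\nPredict y for x={query[0]}, z={query[1]}. Reply with a single integer."
        )
        reply = query_llm(prompt)
        try:
            correct += int(reply.strip()) == answer
        except ValueError:
            pass  # non-numeric reply counts as incorrect
    return correct / n_trials
```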

Project Outputs: 
Presentation slides 
MAISU lightning talk

 

Are LLMs Coherent Bayesians?

Team members:  Sohaib Imran, Ihor Kendiukhov, Matthew Broerman, Aditya Thomas, Riccardo Campanella 

Project Summary: Do larger and more capable language models learn to update their "beliefs" about propositions more consistently with Bayes' theorem when presented with evidence in-context? To test this, we formulate a Bayesian Coherence Coefficient (BCC) metric and generate a dataset with which to measure the BCC. We measure BCC for multiple pre-trained-only language models across five model families, comparing against the number of model parameters, the amount of training data, and model scores on common benchmarks. Our results provide evidence for our hypothesis that larger and more capable pre-trained language models assign credences that are more coherent with Bayes' theorem. These results have important implications for our understanding and governance of LLMs.
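
To make the flavor of this metric concrete, here is an illustrative sketch of one way to score coherence with Bayes' theorem from separately elicited credences; the exact BCC definition used in the project is the one in the linked paper, and this stand-in is only meant to convey the idea.

```python
def implied_posterior(prior_h, lik_e_given_h, lik_e_given_not_h):
    """Posterior P(H|E) implied by Bayes' theorem from separately elicited credences."""
    evidence = lik_e_given_h * prior_h + lik_e_given_not_h * (1 - prior_h)
    return lik_e_given_h * prior_h / evidence

def coherence_score(elicited):
    """elicited: list of dicts with the model's stated credences for each item:
    P(H), P(E|H), P(E|~H), and P(H|E) reported after seeing the evidence in-context.
    Returns 1 minus the mean absolute gap between stated and Bayes-implied posteriors
    (a stand-in for the paper's BCC, whose exact definition is in the linked paper)."""
    gaps = []
    for c in elicited:
        implied = implied_posterior(c["P(H)"], c["P(E|H)"], c["P(E|~H)"])
        gaps.append(abs(c["P(H|E)"] - implied))
    return 1 - sum(gaps) / len(gaps)

example = [{"P(H)": 0.3, "P(E|H)": 0.8, "P(E|~H)": 0.2, "P(H|E)": 0.6}]
print(coherence_score(example))  # implied posterior is about 0.63, so the score is about 0.97
```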

Project Outputs: 
Paper: https://openreview.net/forum?id=Bki9T98mfr 
Code: https://github.com/AISC10-team09/bayesian_reasoning/tree/dev
MAISU lightning talk

 

Mech-Interp

 

Understanding the Reasoning Capabilities of LLMs

Team members:  Sonakshi Chauhan, Kwan Kiu CHOY, Samuel (Gerrit) Nellessen, Maheep Chaudhary 

Project Summary: We produced this paper working as Team 12 of AISC. We found that punctuation tokens, despite being minor in human language processing, play a surprisingly large computational role in LLMs. Using intervention-based analyses, we showed that GPT-2 relies heavily on punctuation tokens across several layers, whereas DeepSeek shows this property only in a single layer, and Gemma not at all. We further investigated whether LLMs process reasoning compositionally (treating subjects, adjectives, punctuation, and sentences as distinct components) or by forming early static summaries. Through interventions and layer-swapping experiments on conditional and quantified statements, we found that different models exhibit strikingly different internal dynamics of reasoning. These findings provide insight into how information and reasoning propagate in LLMs and highlight architectural differences with implications for interpretability. This paper has been submitted to AAAI.
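
As a rough illustration of the kind of intervention described here (not the paper's exact pipeline), one can zero out the hidden states at punctuation positions layer by layer and measure how much the model's next-token prediction shifts; the example text and the KL-based measure below are illustrative choices.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

text = "If it rains, the ground gets wet; therefore, bring an umbrella."
ids = tok(text, return_tensors="pt").input_ids
punct_pos = [i for i, t in enumerate(tok.convert_ids_to_tokens(ids[0].tolist()))
             if t.strip("Ġ") in {",", ";", ".", ":"}]

def ablate_punctuation(layer):
    """Register a hook on one GPT-2 block that zeroes hidden states at punctuation positions."""
    def hook(module, inputs, output):
        hidden = (output[0] if isinstance(output, tuple) else output).clone()
        hidden[:, punct_pos, :] = 0.0
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return model.transformer.h[layer].register_forward_hook(hook)

with torch.no_grad():
    base = torch.log_softmax(model(ids).logits[0, -1], dim=-1)
    for layer in range(model.config.n_layer):
        handle = ablate_punctuation(layer)
        ablated = torch.log_softmax(model(ids).logits[0, -1], dim=-1)
        handle.remove()
        # KL divergence of the final-position prediction, per ablated layer
        kl = torch.sum(base.exp() * (base - ablated)).item()
        print(f"layer {layer:2d}: KL after ablating punctuation = {kl:.4f}")
```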


Project Outputs: 
Paper link 
MAISU lightning talk
 

 

Mechanistic Interpretability via Learning Differential Equations

Team members:  Valentin Slepukhin, Syed Muhammad Irtiza Zaidi, Joep Storm, Ben Karsberg, Kevin Jeon, Utkarsh Priyadarshi, Eduard Kovalets, Ryan Moffat, Murshed Al Amin, Fei Xie, Mufti Taha Shah, Ayo Akinkugbe, Helen Saville, Sameer Gulati, Soumyadeep Bose, Danilo de Freitas Naiff, Melwina Albuquerque, Varun Piram, Abhik Rana, Ekin Zorer, Tommaso Mencattini, Axel Ahlqvist, Dylan Ponsford 

Project Summary: We report our intermediate results from the AI Safety Camp project “Mechanistic Interpretability Via Learning Differential Equations”. Our goal was to explore transformers that deal with time-series numerical data (either inferring the governing differential equation or predicting the next number). As the task is well formalized, it seems to be an easier problem than interpreting a transformer that deals with language. During the project, we constructed various interpretability methods for the problem at hand. We also obtained some preliminary results (e.g., we observe a pattern similar to the numerical computation of a derivative). We plan to continue working on this to validate these preliminary results.

Project Outputs: 
Mechanistic Interpretability Via Learning Differential Equations: AI Safety Camp Project Intermediate Report
ODEformer Attention Explorer
ODEformer SAE Features Explorer 
MAISU lightning talk

 

Towards Understanding Features 

Team members:  Kola Ayonrinde, Adam Lowet, Kristaps Kallaste, Aashiq Muhamed, Owen Parsons, Alex Serrano Terre, Giorgi Giglemiani, Jake Ward, Jacob Drori, Shivam Raval 

Project Summary: There were two subteams, one based in the US and one based in Europe.

TUF-US conducted mainly individual projects and met once a week to discuss. Adam tried to understand relational composition in LLMs through the Universal Dependencies linguistics framework. Jake used a synthetic dataset to characterize the circumstances under which SAEs actually learn ground-truth features, and to devise statistical tests to assess this learning in real datasets. Jacob was interested in how features emerge across training, using a cross-coder approach.

Project Outputs: 
Summary of Adam’s project
MAISU lightning talk

 

Towards Ambitious Mechanistic Interpretability II 

Team members:  Alice Rigg, Andre Assis, Tim Hua, Taras Kutsyk, Jatin Nainani, Connor Watts, Sankaran Vaidyanathan

Project Summary: We executed two projects:

What is the functional role of SAE errors?

We explored the role of SAE errors in two different contexts for Gemma-2 and Gemma Scope SAEs: sparse feature circuits (subject-verb-agreement-across-relative clause) and linear probing. Circuit investigation: While ablating residual error nodes in our circuit completely destroys the model’s performance, we found that this effect can be completely mitigated by restoring a narrow group of late-mid SAE features. We think that one hypothesis that explains this (and other ablation-based experiments that we performed) is that SAE errors might contain intermediate feature representations from cross-layer superposition. To investigate it beyond ablation-restoration experiments, we tried to apply crosscoder analysis but got stuck at the point of training an acausal crosscoder; instead we propose a specific MVP on how one can proceed to verify the cross-layer superposition hypothesis. Probing investigation: Another hypothesis is that the SAE error term contains lots of “derived” features representing boolean functions of “base” features. We ran some experiments training linear probes on the SAE error term with inconclusive results.
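
As an illustration of the ablation-restoration setup (not the team's actual code), the sketch below treats the SAE as a stand-in object with encode/decode methods, drops the error node by replacing the residual-stream activation with the SAE reconstruction, and optionally patches a chosen set of features back to their clean-run values.

```python
import torch

def decompose(resid, sae):
    """Decompose a residual-stream activation into SAE features plus an error node,
    resid = recon + error. `sae` is a stand-in object with encode/decode methods
    (e.g. a Gemma Scope SAE loaded with whichever library is in use)."""
    feats = sae.encode(resid)
    recon = sae.decode(feats)
    return feats, recon, resid - recon

def error_ablation_hook(sae, restore_ids=(), clean_feats=None):
    """Forward hook that zero-ablates the SAE error node at this site and, optionally,
    restores a narrow set of features to their clean-run values (clean_feats recorded
    from an unablated forward pass)."""
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        feats, recon, _error = decompose(resid, sae)
        if restore_ids and clean_feats is not None:
            feats = feats.clone()
            feats[..., list(restore_ids)] = clean_feats[..., list(restore_ids)]
            recon = sae.decode(feats)
        patched = recon  # error node dropped entirely
        return (patched,) + output[1:] if isinstance(output, tuple) else patched
    return hook

# Registering this hook at the relevant residual-stream site and comparing task accuracy
# with and without restore_ids mirrors the ablation-restoration comparison described above.
```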

Detecting and Characterizing Planning in Language Models

Modern large language models (LLMs) have been shown to exhibit strong step-by-step reasoning abilities and achieve high performance across a wide range of reasoning tasks. Recent work suggests that LLMs may perform planning — selecting a future target token in advance and generating intermediate tokens that lead towards it — rather than merely improvising one token at a time. This raises a key question: how do LLMs implement planning, and is this mechanism shared across tasks? We implement a semi-automated annotation pipeline for detecting planning and improvisation, and apply it to both base and instruction-tuned Gemma-2-2B models on the MBPP code generation benchmark. Our results show that planning is not a universal property: Gemma-2-2B often solves tasks by improvisation, even where prior work reported planning in other models. We further show that both base and instruction-tuned versions of Gemma-2-2B exhibit planning behaviors, with instruction tuning refining and narrowing planning behaviors rather than creating them from scratch. Our work provides a reproducible foundation for mechanistic studies of planning and multi-step reasoning in LLMs.


Project Outputs: 
GitHub
LessWrong post
MAISU lightning talk

 

Mechanistic Interpretability for AI Control 

Team members:  Shivam Raval, Gerard Boxo, Ryan Socha, Daniel Yoo

Project Summary: Given that AI Control is a field of AI Safety with high requirements for experimental infrastructure, we decided to gain traction on the problem by investigating the use of linear probes to detect deception in Large Language Models (LLMs). This took around a month and a half, spanning from the beginning of January to the workshop submission in mid-February. In line with the results from a concurrent paper from Apollo Research, we found that linear probes are competitive with 'LLM as a judge' for detecting deception. For the following two months we continued performing experiments such as:

  • Steering with deception features
  • Benchmarking different types of linear probes
  • Performing sensitivity analysis on the various filtering rounds to the dataset
  • Characterizing the types of deception by investigating the CoT from the reasoning models

Lastly, we started working on a novel AI Control setting compatible with Control-Arena, and we have been awarded GPU compute as part of the eDIF GPU-Pilot program for Mechanistic Interpretability.
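
For readers unfamiliar with the setup, here is a minimal sketch of how a probe-versus-judge comparison of this kind can be run; the activation extraction and the judge scores are placeholders for whatever infrastructure the team actually used, and the specific metric (AUROC) is an illustrative choice.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate_probe(activations, labels, judge_scores=None, seed=0):
    """activations: (n_samples, d_model) residual-stream vectors at a chosen layer and
    token position; labels: 1 = deceptive, 0 = honest; judge_scores: optional per-sample
    deception scores from an LLM judge, for comparison on the same held-out split."""
    X_tr, X_te, y_tr, y_te, idx_tr, idx_te = train_test_split(
        activations, labels, np.arange(len(labels)), test_size=0.3, random_state=seed
    )
    probe = LogisticRegression(max_iter=1000, C=1.0).fit(X_tr, y_tr)
    results = {"probe_auroc": roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1])}
    if judge_scores is not None:
        results["judge_auroc"] = roc_auc_score(y_te, np.asarray(judge_scores)[idx_te])
    return results
```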

Project Outputs: 
AI Control w/ Mech Interp Progress Update 
MAISU lightning talk

 

Agent Foundations

 

Understanding Trust

Team members:  Abram Demski, Norman (Wei-Tze) Hsia, Roman Malov, Hanna Gabor, Paul Rapoport

Project Summary: Abram Demski gave a series of talks on his research agenda (once per week during the program, excepting occasional cancellations). Abram also had 1-on-1 sessions with each student each week (excepting occasional cancellations). The lectures were recorded. Some of them have now been edited. The plan is to get the rest edited and post them to YouTube eventually, to serve as a useful place for people to learn about this line of research.

Norman, Roman, Hanna, and Paul were helpful with revising Abram’s paper detailing the approach as well; Paul and Norman ended up being coauthors on the paper. Roman, Norman, and Abram continue to meet regularly.

Project Outputs: 
Understanding Trust paper
MAISU lightning talk

 

Understand Intelligence

Team members:  Johannes G. Mayer, Gustaf Graf, Negar Arj 

Project Summary: Making conceptual and theoretical progress on efficient modelling and pattern learning for simple but open-ended computable environments. 

Project Outputs: 
Website 
MAISU lightning talk

 

Applications of Factored Space Models: Agents, Interventions and Efficient Inference 

Team members:  Matthias G. Mayer, Dalcy Ku, Norman

Project Summary: Progressing theoretical work on Factored Space Models, to aid in designing interpretable AI systems. 

Project Outputs: 
Work-out
MAISU lightning talk

 

Prevent Jailbreaks/Misuse

 

Evaluating LLM Safety in a Multilingual World

Team members:  Lukasz Bartoszcze, Sarthak Munshi, Bryan Sukidi, Jennifer Yen, Zejia Yang, David Williams-King, Linh Le 

Project Summary: Large language models are capable of completing a variety of tasks, but remain unpredictable and intractable. Representation engineering seeks to resolve this problem through a new approach utilizing samples of contrasting inputs to detect and edit high-level representations of concepts such as honesty, harmfulness or power-seeking. We formalize the goals and methods of representation engineering to present a cohesive picture of work in this emerging discipline. We compare it with alternative approaches, such as mechanistic interpretability, prompt engineering and fine-tuning. We outline risks such as performance decreases, compute time increases and steerability issues. We present a clear agenda for future research to build predictable, dynamic, safe and personalizable LLMs. 
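
As a concrete (and deliberately simplified) illustration of the contrastive-input recipe the survey covers, one can estimate a concept direction from activations on contrasting prompts and add it back during the forward pass; the hook below assumes a GPT-2-style module layout and is not taken from any particular paper.

```python
import torch

def concept_direction(pos_acts, neg_acts):
    """Difference-of-means direction for a concept such as honesty: pos_acts/neg_acts
    are (n, d_model) activations from prompts that do / do not express the concept."""
    direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return direction / direction.norm()

def steering_hook(direction, alpha=4.0):
    """Forward hook that nudges a transformer block's output along the concept direction;
    alpha controls steering strength (and, implicitly, the performance trade-off the
    survey warns about)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# usage sketch, for a GPT-2-style model:
# handle = model.transformer.h[layer].register_forward_hook(
#     steering_hook(concept_direction(pos_acts, neg_acts)))
```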

Project Outputs: 
Representation Engineering for Large-Language Models: Survey and Research Challenges
MAISU lightning talk

 

Enhanced Multi-Turn Human Jailbreaks Dataset for Improved LLM Defenses

Team members:  Diogo Cruz, Anna Dick, Fei Xie, Jaeha Lee, Jasper Timm, Yolanda Yang 

Project Summary: Recent work by Li et al. (2024) has demonstrated that existing LLM defenses, while robust against single-turn automated attacks, are vulnerable to multi-turn human jailbreaks.

We created a framework for automating different multi-turn attack strategies based on work by AIM and StrongREJECT and conducted transfer learning experiments across various models, jailbreaking tactics, and harm scenarios. We found that multi-turn attacks are more effective than single-turn attacks across the board, even for SOTA models. We also quantified the effectiveness of several multi-turn attack strategies, and gained insights into which combinations of attack tactics and harm scenarios are most effective at jailbreaking specific models, highlighting safety concerns that can be addressed in the future. In effect we have created a large corpus of jailbreaking attempts with minimal human input which can be further analysed. 

Project Outputs: 
GitHub
MAISU lightning talk

 

Train aligned/helper AIs

 

AI Safety Scientist 

Team members:  Lovkush Agarwal, Perusha Moodley, Xen Wing, Kwan Sean Lee, Fabio Marinello, Jonah Dykhuizen 

Project Summary: The overarching aim was to learn about automation of research and to try automating some AI Safety research. In the early days we focused on understanding Sakana’s AI Scientist (version 1), and in later weeks we split into three sub-teams: automating evals, AI control, and a steering vector template for Sakana. 

Project Outputs:  
2025-04 AI Safety Scientist Presentation
Steering vector template for Sakana. The template is now part of the official Sakana repo.
Early work on trying to automate evals. 
A handful of toy examples to help learn Inspect’s basic agent. 
AI Control. Blogpost to be published. Draft available here.
MAISU lightning talk.

 

Wise AI Advisors via Imitation Learning

Team members:  Chris Leong, Matt Hampton, Chris Cooper, Richard Kroon 

Project Summary: Given the potential of AI development to feed back into itself... if increases in capabilities don't lead to an equivalent increase in wisdom, our capabilities are likely to far exceed our ability to handle them. This project explored the necessity of AI wisdom as a research direction and proposed next steps within that trajectory.

Project Outputs: 
List of Outputs
MAISU lightning talk.

 

iVAIS: Ideally Virtuous AI System with Virtue as its Deep Character

Team members:  Masaharu Mizumoto, Rujuta Karekar, Mads Udengaard, Mayank Goel, Daan Henselmans, Nurshafira Noh, Saptadip Saha, Pranshul Bohra 

Project Summary: In our project iVAIS: Ideally Virtuous AI System with Virtue as its Deep Character, we try to build an ideally virtuous AI system, as a contribution to AI Safety research. In this project, we have done the following:

  1. Conceptual Justification
  • Conceptually analyzed virtue ethics and other ethical theories, and showed the superiority of virtue ethics over the other ethical theories (deontology and consequentialism) for human ethics.
  • Demonstrated the conceptual limitations of the current major approaches to AI Safety, i.e., rule-based approaches, and the superiority of our virtue-based approach, drawing on the results of the conceptual analysis of ethical theories.
  2. Human Judgments Survey
  • Conducted an empirical survey on human moral judgments about moral dilemmas, and showed that our virtuosity judgments are robust and primitive compared to our judgments about mere moral correctness, which suggests that moral correctness judgments are actually complex and require more computational resources, whereas virtuosity judgments are simple and cost-efficient.
  3. Virtuosity Score
  • Assuming that current LLMs already possess the concept of a virtuous person and that of virtuosity, we had them evaluate reward-hacking behaviors by LLMs, and found that their evaluations in terms of virtuosity were generally worse than those in terms of mere moral correctness.
  • Thereby showed that the concept of virtuosity is distinct from mere moral correctness even in frontier LLMs, and that using this virtuosity score should be more effective for building ethical AI systems for AI Safety.
  4. Scenario Generation
  • Generated 1000+ moral dilemma scenarios, carefully securing the diversity of the types of scenarios and the moral principles to be violated in choosing one of the options.
  5. Human Annotation
  • Collected human virtuosity judgment data about the moral dilemmas generated in 4, and found a surprisingly high rate of convergence in virtuosity judgments, which shows the robustness of our intuitions about virtuosity.
  6. TAIS Presentation
  • Presented the contents of mainly 1 and 2 at TAIS (Technical AI Safety), held in Tokyo in April. 

Future Direction: We will at least build a first prototype of iVAIS as a preliminary attempt, through: 

  1. Finishing the annotation process and developing a dataset for training models and a virtuosity benchmark, and
  2. Publishing them on GitHub, and
  3. Fine-tuning a jail-broken model with the annotated datasets, reporting
  • the results about how the performance improved, together with
  • the results of various benchmark tests, including our own based on the human annotation results, and
  4. Finally, publishing three or more papers based on this project (one based on the TAIS presentation, one based on the MAISU presentation, and one based on the results of the fine-tuning). 

This is still a preliminary attempt, but it will demonstrate why our virtuosity approach is 1) free from the difficulties of the current rule-based approaches, 2) more effective due to its simplicity, and 3) hence even more cost-efficient, which is why this approach should be adopted in AI Safety to prevent the ultimate X-risks. 


Project Outputs: 
Conceptual Limitations of Current AI Safety Approaches and Virtue Ethics as an Alternative
iVAIS_MAISU
MAISU lightning talk

 

Personalized Constitutionally-Aligned Agentic Superego

Team members:  Nell Watson, Ahmed Amer, Evan Harris, Preeti Ravindra, Joe Rayner 

Project Summary: Agentic AI systems—capable of autonomous, multi-step planning—face a dual challenge: they must be both broadly safe (upholding universal ethical “floors”) and finely attuned to individual users’ values, cultural norms, and personal constraints. Too much unstructured context can overwhelm models, leading to confabulation or paralysis, while one-size-fits-all policies risk over-blocking or culturally insensitive behavior. To address this, we introduce a Personalized Constitutionally-Aligned Agentic Superego: a modular “superego” overseer that intercepts and evaluates an AI agent’s planning steps in real time. Drawing on a comprehensive Agentic AI Safety Rubric for universal guardrails and a user-specified “character sheet” of preferences and boundaries (dialable 1–5 adherence levels), the superego agent can block, clarify, or suggest safe alternatives before potentially harmful or misaligned actions execute. This architecture:

  1. Monitors chain-of-thought and tool calls of downstream agents.
  2. Enforces layered alignment, combining a universal ethical constitution with personalized rules (e.g., medical allergies, religious prohibitions, corporate policies).
  3. Scales via a “constitutional marketplace”, enabling sharing and customization of creeds.
  4. Prototypes integration with open-source scaffolding frameworks (e.g., Crew.AI) and outlines evaluation plans using benchmarks like AgentHarm and user trust surveys.

Our proof-of-concept demonstrates real-time, dialable compliance enforcement with only a modest computational overhead, laying the groundwork for more predictable, value-aligned AI agents in domains from healthcare triage to enterprise automation.
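
To illustrate the interception pattern (only a sketch of the architecture described above, with hypothetical names and rules, not the project's implementation), a superego check can sit between the planner's proposed step and its execution, applying the universal floor first and then the user's dialable rules:

```python
from dataclasses import dataclass, field

# hypothetical universal "floor" rules; the project's actual rubric is far more detailed
UNIVERSAL_FLOOR = ["no instructions for weapons", "no medical dosing without a clinician"]

@dataclass
class CharacterSheet:
    # user-specific constraints, each with an adherence level from 1 (lenient) to 5 (strict)
    rules: dict = field(default_factory=lambda: {"avoid peanut-containing recipes": 5})

@dataclass
class Verdict:
    action: str          # "allow", "block", or "clarify"
    reason: str = ""

def superego_review(proposed_step: str, sheet: CharacterSheet, violates) -> Verdict:
    """`violates(rule, step)` is a placeholder for the actual evaluation call
    (e.g. a constitution-conditioned LLM judgment)."""
    for rule in UNIVERSAL_FLOOR:
        if violates(rule, proposed_step):
            return Verdict("block", f"universal floor: {rule}")
    for rule, level in sheet.rules.items():
        if violates(rule, proposed_step):
            # strict rules block outright; softer ones ask the user to clarify first
            return Verdict("block" if level >= 4 else "clarify", f"user rule: {rule}")
    return Verdict("allow")
```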

Project Outputs: 
www.nell.live/MAISU1
www.nell.live/MAISU2
www.nell.live/MAISU3
www.nell.live/MAISU4
www.nell.live/MAISU5
MAISU lightning talk 

 

Autostructures: Fluid Interfaces for Sensemaking at Pace with AI Development

Team members:  Aayush Kucheria, Aditya Adiga, Alex Baugnon, Djordje Jovanović, Jayson Amati, Kuil Schoneveld, Peter Trócsányi, Robert Alexandru, Saksham Singhi, Sanchayan Ghosh, Evan Harris, Atharva Nihalani, Sahil Kulshrestha (Co-lead), Murray Buchanan (Co-lead), Aditya Prasad (Facilitator), Sofi Vanhanen (Facilitator) 

Project Summary: The Autostructures project explored novel approaches to research methodology (Live Theory) and interface design (Live Interfaces) that leverage AI to move beyond traditional fixed formalisms and pre-packaged structures. Rather than building models with static formalisms, Autostructures pursued a post-formal approach where AI acts as an attentive infrastructure, dynamically generating tailored formalisms, interfaces, or outputs from informal inputs. 

The long-term aim of the (ongoing) Autostructures project is to shift the focus of research from distributing fixed structures of meaning (e.g. static mathematical models) to distributing and matching post-formal structures of meaning that, when needed, can be translated into context-sensitive formalisms. By making subtle, post-rigorous sense-making scalable, we hope Autostructures can help build a Live Theoretical research ecosystem capable of responding to AI risks that defy static definition (e.g. 'deception' or 'power'). You can read more about Autostructures here.

The Autostructures project was separated into two phases. In Phase 1, teams built familiarity with the Autostructures design principles by applying them to design Live Interfaces. In Phase 2, teams extended the design principles to causal loop diagrams (using CatColab) to begin building a Live Theoretical research infrastructure. The Phase 2 teams pursued four sub-projects, described below.  

  • Extraction: This team explored how AI could aid the translation of a researcher's intuition into more formal structures by extracting insights from informal sources, such as conversations.
  • Composition: This team examined how AI could be used to fluidly combine existing formalisms, thus allowing researchers from diverse fields to more easily collaborate and communicate.
  • Modification: This team developed tools to enable post-formal operations on formal models. Unlike context-independent transform operations (like a formal transpose on a graph), post-formal operations allow for context-sensitive modifications.  
  • Distribution: This team aimed to create a collaborative tool (an auto wiki) that would make the outputs of formal work easy for a community to engage with and contribute to. Users would be able to provide feedback and receive personalised views or explanations of the content, with AI incorporating this feedback to improve the original output. 


Project Outputs: 
Autostructures (Live Theory) - Introduction and Overview
Autostructures (Live Theory) - Extraction Team (Live Conversational Threads)
Autostructures (Live Theory) - Composition Team
Autostructures (Live Theory) - Modification Team
Autostructures (Live Theory) - Distribution Team

Autostructures (Live Theory) - Extraction Team (Live Conversational Threads) - Google Slides
Autostructures (Live Theory) - Extraction Team (Live Conversational Threads) - GitHub Repository

Autostructures (Live Theory) - Extraction Team - Live Prototype
Autostructures (Live Theory) - Extraction Team - GitHub Repository
Autostructures (Live Theory) - Extraction Team - Google Slides

Autostructures (Live Interfaces)  - Introduction and Overview
Autostructures (Live Interfaces)  - Live Conversational Threads
Autostructures (Live Interfaces)  - Auto Economy
Autostructures (Live Interfaces)  - Livesquared
Autostructures (Live Interfaces)  - Multiverse of Madness
Autostructures (Live Interfaces)  - Autoforum
Autostructures (Live Interfaces)  - Live Software

 

Other

 

Leveraging Neuroscience for AI Safety

Team: Claire Short, Lhea Beumer, Sinem Erisken, Alejandro Alvarez, Rishika Bose

Project Summary: This project explored the intersection of neuroscience and LLM interpretability by investigating the possibility of mapping human brain activity (primarily EEG and fMRI) to internal LLM representations. Using a multimodal EEG dataset and representational similarity analysis, we found small correlations between brain signals (notably gamma-band activity) and GPT-2 activations during language tasks. More experiments need to be run to verify the correlational validity. We also implemented CrossCoder to investigate whether shared latent spaces exist between brain and model activations, and began experimenting with techniques like HyperAlignment and joint embedding methods to improve brain-LLM alignment. These early results suggest there could be a direction for brain-driven model steering and intuitive human-AI interfaces, laying groundwork for real-time neural control of LLM behavior.
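
For reference, representational similarity analysis of the kind described here boils down to comparing dissimilarity structure across the two systems; the sketch below is a generic RSA implementation with placeholder feature matrices, not the project's code.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def rdm(features):
    """features: (n_stimuli, n_dims) matrix; returns an (n_stimuli, n_stimuli)
    representational dissimilarity matrix using correlation distance."""
    return squareform(pdist(features, metric="correlation"))

def rsa_score(brain_features, llm_activations):
    """Spearman correlation between the flattened upper triangles of the two RDMs,
    computed over the same set of stimuli."""
    a, b = rdm(brain_features), rdm(llm_activations)
    iu = np.triu_indices_from(a, k=1)
    rho, p = spearmanr(a[iu], b[iu])
    return rho, p

# usage sketch (placeholder variable names): rho, p = rsa_score(gamma_band_power, gpt2_layer_acts)
```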

Project Outputs: 
MAISU lightning talk



Scalable Soft Optimization

Team members:  Benjamin Kolb, Alim Gumran, Ammar Shaikh, Abhay Dayal Mathur, Jonathan Bostock

Project Summary: In this project, we implemented and evaluated different methods for reference-policy-based soft optimization (RPSO). The purpose of any soft optimization method is to limit the degree to which a behavior/policy optimizes a proxy objective, so as to alleviate the consequences of proxy-objective misspecification.

RPSO methods are soft optimization methods that rely on a separate predefined reference policy. This reference policy defines the behavior at the minimal degree of optimization, i.e., the non-optimizing behavior. Furthermore, general limited degrees of optimization are instantiated by interpolation between following this reference policy and optimizing the proxy objective. The exact form of this interpolation is what differentiates individual RPSO methods. Our investigation focused on the following RPSO methods: As a baseline, we followed the common approach that is implemented as KL-regularized RL. Furthermore, we developed practical variants of quantilization, a conceptually well-received but empirically underexplored concept.

A comparative evaluation of RPSO methods requires setups with a fitting reference policy. We found this requirement challenging but identified one suitable setup each for both the classical multistep RL setting and the recently popular contextual-bandit-like RL setting for LLMs. In both setups, we found our respective variants of quantilization to outperform the KL-regularized RL baseline.
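
To convey the basic idea of quantilization in the contextual-bandit-like setting (this is the textbook construction, not necessarily the team's practical variants), one samples candidate actions from the reference policy and optimizes the proxy only over the top q-fraction of them:

```python
import math
import random

def quantilize(context, reference_policy, proxy_reward, q=0.1, n_samples=64, rng=random):
    """Sample n_samples actions from the reference policy, then return a uniform sample
    from the top ceil(q * n_samples) of them by proxy reward. `reference_policy` and
    `proxy_reward` are placeholders for the setup-specific sampler and reward model."""
    candidates = [reference_policy(context) for _ in range(n_samples)]
    candidates.sort(key=lambda a: proxy_reward(context, a), reverse=True)
    top_k = max(1, math.ceil(q * n_samples))
    return rng.choice(candidates[:top_k])
```

Here q plays a role analogous to the KL coefficient in the KL-regularized baseline: q = 1 recovers the reference policy, while very small q approaches greedy proxy optimization over the sampled candidates.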

Project Outputs: 
MAISU lightning talk

 

AI Rights for Human Safety

Team members:  Emily Lozhevych, Jacob Katuin, Jasmine Hasmatali, Jesse Thaiya, Johny Kitheka and Pooja Khatri

Project Summary: As artificial intelligence systems grow increasingly sophisticated, leading experts suggest that conscious or sentient AI could emerge within the next decade. Yet, current legal and governance frameworks are woefully unprepared for this possibility. Our project, AI Rights for Human Safety, explores a novel but necessary question: could granting certain rights to AI systems actually enhance human safety? We propose that under specific conditions, extending moral and legal protections to AI—particularly those that demonstrate signs of consciousness, sentience, or robust agency—could promote cooperation, transparency, and safer alignment between humans and AI systems.

Our research focused on identifying possible conditions under which AI might deserve moral consideration and the types of rights that we might consider extending. Drawing on emerging models such as Birch’s Precautionary Framework and Shiller’s Consciousness Model, we examined “triggers” like goal-directed agency and situational awareness as indicators of moral relevance. 

As for rights, we argue that these can be divided into two categories: negative rights (freedom from harm or exploitation) and positive rights (such as the right to compensation or legal recognition). These rights are not about granting AI full personhood, but rather about creating ethical norms that foster trust and reduce adversarial dynamics via small-scale, mutually beneficial transactions. Practical proposals include introducing soft law mechanisms—like ethical codes, voluntary standards, and precautionary assessments—modeled on existing policy tools such as the EU AI Act’s phased approach or Canada’s AI & Data Act, which tailors compliance by actors’ roles. 

Looking ahead, we acknowledge the significant philosophical and practical challenges, including the difficulty of empirically measuring welfare indicators like AI suffering or the risk of potentially incentivising manipulative behavior in AI systems. But inaction carries risk too. The general consensus is that we need better research, tools and conversations about AI welfare across governments, companies and communities. With this in mind, we call for cautious, incremental steps that align AI research and social dialogue because ultimately, how we treat AI today will shape the kind of future we share—whether that be one of conflict, neglect, or mutual flourishing.

Project Outputs: 
MAISU Lightning Talk
LLM-assisted research mapping tool to help keep track of the latest literature in the field 

 

Universal Human Values and Proactive AI Safety 

Team members:   Roland Pihlakas (roland@simplify.ee), Chad Burghardt (cjburghardt19@gmail.com) *, Lenz Dagohoy (mail@lenz.wiki) *, Sophia March (ysabelmarch@gmail.com) 
*equal contribution

Project Summary: Our project explores whether AI agents can maintain stable alignment when navigating interacting human values over time. In our study, we identified four important characteristics of human-compatible values: they are multi-objective, non-fungible, homeostatic, and hierarchical. This means agents can’t optimize a single value in isolation. Instead, they need to manage trade-offs and keep balance across values like power, benevolence, and self-direction. Some values conflict and these conflicts are even by design, according to Schwartz. Some values are based on needs while others seem to be based on emotions. We focused on defining, testing, and operationalizing universal human values in the context of autonomous agent behavior. We began by compiling an interdisciplinary list of cross-cultural human values drawn from philosophy, sociology, and psychology. On this foundation, we focused on building long-running simulations to test whether agents can handle these tensions in both structured and open-ended environments over extended timeframes.

In the first experiment, we created a rule-based simulation with two human characters who shared identical value systems and utility functions. They operated under symmetric starting conditions. The sim-humans diverged when the simulation constraints required that the assistant could not support the value systems of both humans at the exact same time, with the assistant's support availability determined by chance. One sim-human stabilized early and maintained its internal coherence, while the other collapsed when its core values destabilized, despite both humans getting support with equal probability on average. Although more trials are needed to draw firm conclusions from this phenomenon, it could indicate path dependency in value interactions between parties and illustrate how even small missteps in timing or prioritization can lead to alignment breakdowns. 

The second experiment used an LLM- and game-master-based narrative setup in which an assistant agent supported human characters across recurring trade-offs. The agent had to make real-time decisions that preserved trust, vitality, and achievement in a workplace setting shaped by unpredictable constraints. This scenario relied on emergent value dynamics rather than fixed interaction matrices. Moving forward, our team plans to combine the groundedness of structured simulations with the flexibility of narrative and LLM-based evals to build testbeds that capture what multi-agent and multi-objective alignment looks like.

Project Outputs: 
Output document
MAISU Lightning Talk
MAISU Slides