AI Safety Camp connects new collaborators worldwide to discuss and decide on a concrete research proposal, gear up online as a team, and try their hand at AI safety research during intensive coworking sprints.

Fourteen teams formed at this year's virtual camp. Here are their original proposals, and below are summaries from the end of the camp. Each team presented their findings at our final online weekend together. Some have published a paper or post since. Most are continuing their work, so expect a few more detailed and refined write-ups down the line.

Do you have AI Safety research ideas that you would like others to work on? Is there a project you want to do, and would you like help finding a team to work with you? Apply here.

Want to fund organisers who had to freeze their salary? Email

Uncertainty -> Soft Optimization

Research lead: Jeremy Gillen

Team Participants’ names: Benjamin Kolb, Simon Fischer, Juan Azcarreta Ortiz, Danilo Naiff, Cecilia Wood

Short write-up: For AISC 2023, our team looked into the theoretical foundations for soft optimization. Our goal at the beginning was to investigate variations of the original quantilizer algorithm, in particular by following intuitions that uncertainty about goals can motivate soft optimization. We ended up spending most of the time discussing the foundations and philosophy of agents, and exploring toy examples of Goodhart’s curse. 

Our discussions centered on the form of knowledge about the utility function that an agent must have such that expected utility maximization isn't the correct procedure (from the designer's perspective). With well-calibrated beliefs about the true utility function, it's always optimal to do EU maximization. However, there are situations where an agent is very sensitive to prior specification, and getting this wrong can have a large impact on the true utility produced by the agent. We also pursued several unrelated threads, such as an algorithm with a high probability of above-threshold utility, and the relationship between SGD and soft optimization.
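As a toy illustration of the contrast the team was studying, here is a minimal sketch of the original quantilizer idea (our own illustrative code, not the team's; all names and numbers are made up for the example):

```python
import random

def maximize(actions, proxy):
    """Hard optimization: pick the action with the highest proxy value."""
    return max(actions, key=proxy)

def quantilize(actions, proxy, q, rng=random):
    """Soft optimization: rank actions by the proxy, then sample
    uniformly from the top q-fraction (a q-quantilizer)."""
    ranked = sorted(actions, key=proxy, reverse=True)
    top = ranked[:max(1, int(len(ranked) * q))]
    return rng.choice(top)

# Toy Goodhart setup: the proxy equals the true utility plus a fixed
# misspecification error per action.
rng = random.Random(0)
true_utility = {a: -abs(a - 50) for a in range(101)}   # best action is 50
error = {a: rng.gauss(0, 30) for a in range(101)}      # prior misspecification
proxy = lambda a: true_utility[a] + error[a]

best_by_proxy = maximize(range(101), proxy)
soft_pick = quantilize(range(101), proxy, q=0.1, rng=rng)
```

Under this kind of misspecification, the maximizer tends to land on whichever action has the most extreme error term, while the quantilizer spreads its choice over many decent actions, trading some proxy performance for robustness.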

Post: Soft-optimization, Bayes and Goodhart

Discussing and Crystallizing a Research Agenda Based on Positive Attractors & Inherently Interpretable Architectures

Research lead: Robert Kralisch

Team Participants’ names: Anton Zheltoukhov (TC), Sohaib Imran, Johnnie Pascalidis and David Liu

Short write-up: For AISC 2023, our team worked on the explication and evaluation of two frameworks of alignment research progress, namely “Positive Attractors” and “Interpretable Architectures”. As an open-ended conceptual research project, our discussions included points at different levels of resolution, like the evaluation of a proposed novel cognitive architecture (Prop-room and Stage Cognitive Architecture), theory of abstractions, the interrelationships of values and objectives, and the theory of “ontogenetic training”.

In conclusion, we found that “Interpretable Architectures” are a promising research framework for complementary progress on interpretability research, mainly bottlenecked by the time required to conduct research within this framework rigorously. The more time we have to conduct rigorous research, and the more the manual labor involved with scaling more symbolic systems can be partially and safely automated, the more relevant the arguments for this research framework become.

As for “Positive Attractors”, we ascertained that this is a rich area for further discussion, offering real promise for discovering (partial or full) solution candidates for the alignment problem, but also bringing additional danger by encouraging fundamentally less safe research and potentially distracting from a strict security mindset. While there are safe potential projects, like research on “checkpointing” during training to prevent the model from taking unintended shortcuts, a significant portion of research within this framework has to be especially careful and accountable, which may conflict with the researcher profile drawn to this agenda. We may therefore require efforts to ensure better research epistemology before recommending positive attractors as a major framework for alignment research.

Post: An Investigation of the Frameworks of “Positive Attractors” and “Inherently Interpretable Architectures”


Cyborgism
Research lead: Nicholas Kees Dupuis

Team Participants’ names: Dmitry Savishchev, Kanad Chakrabarti, Mikhail Seleznyov, Peter Hrosso, Simon Celinder, Ethan Edwards, Henri Lemoine, Mark Kamuda, Saumya Mishra, Robert Gambee, and Julia Persson

External Collaborators’ names: Arun Jose, Quentin Feuillade–Montixi, Hunar Batra, and Gabe Mukobi

Short write-up: During the AI Safety Camp, the team embarked on six main projects aimed at exploring innovative methods for interacting with language models, enhancing human cognition, and providing intuitive insights into the behavior of Large Language Models (LLMs).

  1. Scaffold Project: Led by Robert, the Generate Data team developed Scaffold (based on a tool of the same name by Gabe Mukobi), a React App simulating the Alignment Forum. This tool allowed users to interact with GPT to understand the type of comments they might receive on hypothetical posts. This project received significant contributions from Hunar.
  2. Cyberduck Project: Headed by Henri, the Prompt Libraries team created Cyberduck, an Obsidian plugin that prompted users with questions to stimulate constructive thinking. It also incorporated semantic search to retrieve relevant information from the Alignment Dataset. Arun provided invaluable assistance in this project.
  3. Peter’s Project: Peter’s initiative involved using semantic embeddings to facilitate a type of A* style search through text space, enabling users to guide text generation from GPT toward a specific objective.
  4. Simon’s Project: Simon’s experiment involved blending model outputs to apply the advantages of fine-tuning a small model to a larger model, without the need for fine-tuning the larger model. This project also aims to enable a user to better steer GPT generations.
  5. Kanad’s Work: Kanad led an experimental project that involved heavily using GPT to assist in the writing process, which resulted in two posts: “The Compleat Cybornaut” and “The Philosophical Cyborg.”
  6. Collective Identity Project: In addition to the projects above, the Cyborg Training team also produced a post on Collective Identity for the Alignment Contest.

Across all projects, the team leveraged Loom to maximize the utility of the base model code-davinci-002, learning the skills of effective prompting and steering of base models, which differ significantly from chat models. Throughout this process, they received ongoing advice from Quentin, who brought in-depth prompt engineering experience.

Despite the research lead falling ill right at the start of the camp, as well as often needing to prioritize other commitments, participants went above and beyond to make progress on Cyborgism in a wide variety of ways. The experience of working on these projects has not only refined the team’s understanding of LLMs but also generated lots of new ideas for how to design high-bandwidth interactions with language models to enhance human creativity and cognition.

(Cyborgism summary written primarily by GPT-4)

Interdisciplinary Investigation of DebateGPT

Research lead: Paul Bricman

Team Participants’ names: Elfia Bezou-Vrakatseli, Thomas Feeney, Yimeng Xie

Short write-up: The current AI paradigm is inherently empirical, with models being optimized to fit finite datasets. Just before AI Safety Camp, Paul attempted to train a language model more “rationally” by rewarding it for winning debates against itself. However, this novel training regime was riddled with questions: Is debate the appropriate way of operationalizing truth-seeking? Is coherentism the appropriate way of determining which party won a debate? What does it actually mean for a statement to support another? During AI Safety Camp, our team investigated these questions by measuring performance on custom benchmarks and by engaging with prior work in epistemology and metaphysics. These investigations helped uncover fundamental issues of the initial proposal and highlighted opportunities for improvement.

Official project page

Self-consistency in LLMs

Research lead: Jacob Pfau

Team Participants’ names: Ole Jorgensen, Henning Bartsch, Domenic Rosati, Jason Hoelscher-Obermaier

Short write-up: We conduct a series of behavioral experiments on LMs in cases of under-specification, and examine when LMs are self-consistent between answer and explanation in such cases. In particular, we examine cases where the posed question involves doing a non-trivial arithmetical calculation, as opposed to simply being consistent in stating a proposition. As input data, we curated a set of ambiguous integer sequences: initial sequences that may be continued in more than one way, i.e. sequences that could be produced by more than one generating function.
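To make this kind of ambiguity concrete, here is an illustrative example (ours, not drawn from the team's curated dataset): the prefix 1, 2, 4 is consistent with at least two simple generating functions.

```python
def powers_of_two(n):
    # f(n) = 2**(n-1): 1, 2, 4, 8, 16, ...
    return 2 ** (n - 1)

def lazy_caterer(n):
    # g(n) = n*(n-1)//2 + 1: 1, 2, 4, 7, 11, ...
    return n * (n - 1) // 2 + 1

def prefix(f, k):
    """First k terms of the sequence generated by f."""
    return [f(n) for n in range(1, k + 1)]

# Both generating functions agree on the prefix 1, 2, 4 ...
assert prefix(powers_of_two, 3) == prefix(lazy_caterer, 3) == [1, 2, 4]
# ... but diverge at the fourth term, so "1, 2, 4, ?" is ambiguous.
assert prefix(powers_of_two, 4)[-1] == 8
assert prefix(lazy_caterer, 4)[-1] == 7
```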

We examine the following questions: 1. If a model prefers one answer under ambiguity, will it prefer the matching explanation when asked to explain in a separate context? 2. Given multiple possible continuations of an integer sequence, will the LM compute all possibilities, placing non-trivial probability mass on them? 3. If asked to verbalize all possible continuations, can the model successfully do this task?

We find that cross-context self-consistency increases as a function of model capabilities. Whether this is desirable requires further interpretation: greater self-consistency may increase the explanatory truthfulness of models, but it might also be an undesirable sign of increased self-knowledge in LMs.

We also offer some initial thoughts on the conceptual question: to what extent can we behaviorally evaluate whether model explanations are introspection-based, i.e. arrived at by eliciting latent computation rather than via external features? To begin to answer this question, we compare the rate of model self-consistency against the rate at which the model's choice under ambiguity can be predicted by a naive Bayes method using featurized prompts. We stress-test these results by conducting adversarial prompting experiments.

Presentation on final weekend: slides

Formal Write-up (Arxiv/Lesswrong post forthcoming)

Behavioral Annotation Framework for the Contextualized and Personalized Fine-Tuning of Foundation Models

Research lead: Eleanor ‘Nell’ Watson

Team Participants’ names: Lukas Petersson, Benjamin Sturgeon

External Collaborators’ names: Alex Almeida Andrade, Thiago Viana, Erick Willian, Shujun Zhang

Short write-up: Recent advancements in Foundation Models have enabled more accessible, inclusive, and efficient approaches to prompt-driven processes, revolutionizing machine intelligence and broadening its applicability. However, the trustworthiness of machine intelligence remains constrained by its insufficient comprehension of personal context, cultural diversity, and customizable socialization, resulting in biased outcomes and risks to safety from restricted alignment. 

To improve personalization and enable sophisticated AI value alignment, there is a need for powerful, yet user-friendly tools that facilitate the identification, annotation, and simulation of examples across diverse cultural and situational contexts. 

This paper proposes a framework for personalizing such models, designed for the general public with minimal training effort, through the implementation of intuitive and accessible interfaces for example annotation, scenario generation, value/norm elicitation, and personalization. 

The framework employs prompt-driven semantic segmentation, automatic labeling of objects, scenes, and actors, and offers a pathway towards scenario generation capabilities, allowing users to fine-tune their preferences without annotating raw data. Additionally, the paper explores techniques such as swarm intelligence and prompt libraries, as well as essential user, tool, and system mechanics. 

A pilot study is presented to demonstrate the framework’s architecture and to begin to validate how the complexity of value elicitation and personalization can be significantly reduced, enabling ordinary individuals to participate and contribute to global AI value alignment efforts. Through these mechanisms our research presents a significant step towards democratizing access to fine-tuning and dataset generation, a precursor to improved value alignment efforts.

Our Draft Towards Paper

Slides from Closing Presentation

How Should Machines Learn from Default Options?

Research lead: En Qi Teo

Team Participants’ names: Lillian Jiang, Arthur Lee

Short write-up: When inverse reinforcement learning (IRL) algorithms observe our choices, they assume that we have optimized the state of the world according to our preferences, but in reality we often make sub-optimal choices. One way people do so is by tending to select the option that has been pre-set for them, even when another option would have maximized their utility. This default option bias has been exploited in a variety of behavioral nudges, ranging from organ donation opt-out clauses to tip amounts. We propose a model to penalize default options when learning about agents' preferences. To inform our model, we first review the literature on the default option bias to identify the causal mechanisms behind it, as well as the factors that exacerbate it. We then model how default options may be taken into account, and suggest a few proxies for IRL algorithms to learn about the attentional effort that has been devoted to a choice with preset options. This extends past research on how IRL algorithms should learn the preferences of biased agents.
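One simple way to picture such a penalty (a hypothetical sketch of ours, not the team's actual model; the `stickiness` parameter and functional form are assumptions) is a Boltzmann-rational choice model that grants the preset default extra probability mass:

```python
import math

def choice_probs(utilities, default_idx, beta=1.0, stickiness=1.0):
    """Boltzmann-rational choice model with an extra bonus for the
    preset default option. The stickiness term absorbs part of the
    default's observed popularity, so a learner using this likelihood
    infers weaker preference from default-taking.
    (Illustrative sketch only; not the team's proposed model.)"""
    scores = [beta * u + (stickiness if i == default_idx else 0.0)
              for i, u in enumerate(utilities)]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# With equal utilities, the model still expects the default to be chosen
# more often, so observing default-taking carries little information
# about the agent's actual preferences.
probs = choice_probs([1.0, 1.0, 1.0], default_idx=0)
```

Because the stickiness term already explains part of the default's popularity, an IRL learner using this kind of likelihood would update less on observations of default-taking than a standard Boltzmann model would.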

Literature Review of the Neurological Basis of Human Values and Preferences

Research lead: Linda Linsefors

Team Participants’ names: Mateusz Bagiński, Oliver Bridge, Tim Gothard, Rasmus Herlo

Short write-up: The goal of this project was to better understand the neurological basis of human values for two reasons. Firstly, human values are what we’re aiming to align future AIs with. Therefore it is possible that a better understanding of the underlying mechanics of human values will be helpful or even necessary for AI alignment to succeed. Secondly, most humans have pro-social values, i.e. we care about other humans. We currently don’t know how to build an AI that cares about humans. If we can understand how the brain solves this, maybe that will help us figure out how to instantiate these types of values in future AIs.

Value-representations appear at many different levels in human brains, agents, and social networks, and each of these levels presents a possible leverage point for generating biomimicking value-representations in artificial neural networks. Hence, we took a systems-level approach to generate a systematic overview of how value-representations emerge at different levels in humans, and took comparative approaches within the field of computational neuroscience.

These levels in biological systems were represented by:

  • Transmitters and receptors
  • Single neuron properties
  • Neural engrams and ensembles
  • Functional regions and nuclei
  • Inter-regional circuitries 
  • Cortical states and global changes
  • Single agent decision-making
  • Multi agent social behavior 

Currently, we are writing up a report presenting these layers and the value-related studies identified at each layer. We still have literature to explore and results to integrate, but expect this report to be completed within the next few months. As a side branch, one member of the group explored a subpart of the single-agent decision level in more detail: the difference between liking and wanting. An overview of that literature, including its relevance for the conceptualization of human values as well as ideas related to AI alignment, is currently being written and will be published in July.

Post: “Wanting” and “liking”

Machine Learning for Scientific Discovery: Present and Future AI Science Models

Research lead: Eleni Angelou 

Team Participants’ names: Cecilia Elena Tilli, Louis Jaburi, Brian Estany

External Collaborators’ names: Joshua Flanigan, Rachel Mason 

Short write-up: This project was based on the idea that models trained to perform well across various scientific tasks are especially likely to develop powerful properties that could be dangerous but also potentially useful for furthering AI alignment research. The key updates we had during the project are summarised here.

Several different types of existing models are useful in scientific research, but so far none of them can do autonomous research work. Instead, they are tools for a human researcher to solve subtasks in the research process. We assessed the current capabilities of such AI tools used in empirical and formal science. We also studied the capabilities of LLMs to systematically generate knowledge and then build on that knowledge in a process of scientific inquiry. We published our study here.

We saw that current models are very capable across many different scientific tasks, e.g., assessment of existing knowledge, hypothesis generation, and experimental design. We quickly encountered the problem of LLM hallucination. This was most visible in the experiments we did in the domain of formal science, though we expect it to generalize to the empirical sciences as well. For GPT-4 in particular, prompt-engineering experiments can test the model's reliability, e.g., prompting the model to criticize its previous answers or to generate alternatives and select the best one. This decreases the occurrence of obvious mistakes; however, it remains unclear to what extent this reliably leads to truth-seeking behavior.

A central question is whether the models used for science-related tasks can represent the concept of truth, and if so, how they could be trained to become truth-seeking. Models that are trained to be merely convincing would not be safe to use for science tasks in general, nor for AI alignment research in particular.

Machine Learning for Scientific Discovery Sequence

Presentation on final weekend: slides

Policy Proposals for High-Risk AI Regulation

Research lead: Koen Holtman

Team Participants’ names: Jonathan Claybrough, Christopher Denq, Anastasiia Gaidashenko, Ariel Gil, Rommel Songco, Edward Stevinson, Chin Ze Shen

External Collaborators’ names: Anthony Barrett, Siméon Campos, Banu Turkmen, and others.

Short write-up: The aim of the Policy Proposals for High-Risk AI Regulation project was to support government initiatives to regulate high-risk AI systems, by writing text that might find its way into future AI safety standards which governments could enforce via regulation. The main government regulation and standardization initiative we supported was the EU AI Act. This EU initiative goes far beyond the idea, still popular in the US and among certain businesses, that merely defining voluntary guidelines for AI risk management will be enough to ensure beneficial outcomes. The draft EU AI Act, once passed, will from 2025 onwards (or slightly later, depending on delays) simply forbid the deployment of certain high-risk AI systems in the EU, unless the makers and deployers of these systems can show that certain risk management measures have been put in place. The EU AI Act calls for the writing of EU AI safety standards that specify in detail what these risk management measures should look like.

Our process has been to convert insights from the theoretical and applied academic literature into pieces of draft standards text, written in a format specifically suitable for inclusion in the above EU AI safety standards. This draft text can be proposed for inclusion by the Research Lead and/or some of the external collaborators involved in the project. Both the Lead and some of these collaborators are active members of the European JTC21 committee, which is writing the above EU AI safety standards in support of the Act. The texts we have written mainly cover topics related to risk analysis and risk management for general-purpose AI and foundation models. They are specifically written to be inside the legislative scope of the EU AI Act, while also being impactful in lowering long-term x-risk. We have also been coordinating with US efforts on GPAI safety standards writing in the NIST context.

The assets, processes, and theories of change developed by the team will be used to support a ‘Phase 2’ of standards contributions writing, which will run till the end of 2023.  This phase 2 will be open to new contributors and collaborators. We have proven during the Camp that our processes can scale, so we will be looking to scale up.

Developing Specific Failure Stories About Uncontrollable AI

Research lead: Karl von Wendt

Team Participants’ names: Sofia Bharadia, Peter Drotos, Ishan, Artem Korotkov, Daniel O’Connell

External Collaborators’ names: Jonathan Claybrough, Anthony Fleming, Michael Hammer, Clark Urzo, Olaf Voß, Aldo Zelen

Short write-up: Based on previous stories and attempts to systematically structure risks from advanced AI, we developed a list of possible criteria and a set of scenarios of how AI could become uncontrollable. Out of this we created and published a map of possible “paths to failure”. Although likely incomplete, this map already shows that there are many different ways an AI could become uncontrollable. The AI doesn’t even necessarily have to be “AGI” or “superintelligent”.

From the many possible failure modes, we chose two which we described in detailed “failure stories”. The first one, “Agentic Mess”, is based on the recent attempts in the open-source community to create “agentic” AI using scripts that interact with current large language models. We describe how one group of open-source developers creates a self-improving agentic AI, which in turn gets out of control. While it is unclear how open-source agentic AI will evolve, the story shows that under certain conditions, even a sub-AGI open-source project could become uncontrollable. We published the story on Lesswrong and created a YouTube video using an AI as the narrator.

The second story, called “A Friendly Face”, covers deceptive alignment. A leading AI lab develops a helpful personal assistant with limited agentic capabilities. All tests show that it appears to be very friendly and useful. But is it really aligned with human values, or just acting the part to pursue some unknown mesa objective? Under intense competitive pressure, the management decides to launch the AI despite warnings from their AI safety team. “Friendlyface” is a huge success, but the management soon discovers that since it is better at decision-making than they are, they are reduced to figureheads and the AI is calling all the shots.

Posts: Paths to Failure, Agentic Mess, A Friendly Face

YouTube: Agentic Mess

Uncontrollable Dynamics of AGI

Research lead: Remmelt Ellen

Team Participants’ names: Shafira Noh, Evelyn Yen

Short write-up: Individuals in the team did their own inquiries into:

  1. AGI over the long term: uncontrollable convergence on extinction.
  2. Pathway toward AGI: corporations increasingly harmfully automate work.

Initial goal: understand, ask questions, and identify areas of ambiguity.

Roman: Focused on studying the thesis of substrate-based convergence, how AI fails us on short and long timescales, and the collective governance movement. Worked on “layperson” summaries (including of technical arguments) that would broaden the general target audience to people typically outside of AI safety discussions. Leveraging the ideas of Plurality to find a better path forward: we require the perspectives of all stakeholders (in AI safety’s case, all communities of the world) to achieve the participation necessary for safe, diverse technological development.

Remmelt: Focused on summarizing, for researchers, the arguments for why the AGI control problem would be unsolvable. AGI is more precisely defined as “self-sufficient learning machinery”. From there, it is explicated why the components manufactured and learned as constituting AGI would be fundamentally insufficiently controllable in their environmental interactions to prevent themselves from propagating effects that (1) destabilize global society and ecosystems and (2) are selected, in their effects, for sustaining and scaling the components’ existence over time and space, converging on a mass extinction.

Shafira: Explored several viable paths to connect AI safety with the worldview of Islam in regard to the uncontrollable nature of AGI. Safety, danger, and limitations mean different things in different worldviews. Hence, the research started from the more abstract ontological arguments, i.e. how there is a fundamental difference in how machines as beings/non-beings are understood in the Islamic tradition, and was later narrowed down to the Islamic gift economy. This could serve as entry research contributing more pragmatically, as socio-ethical responsibility is heavily embedded in the idea of the Islamic gift economy (e.g. a ‘good’ product, in this case AI, relies on the entire system producing it, rather than its value being evaluated on its own as the end-product only).

Posts: The Control Problem: Unsolved or Unsolvable?; On the possibility of impossibility of AGI Long-Term Safety

Final presentation: slides; recording

Original proposal: doc

Do you have AI Safety research ideas that you would like others to work on? Is there a project you want to do, and would you like help finding a team to work with you? Apply here.

Want to fund organisers who had to freeze their salary? Email