Jonathan Claybrough
Actually no, I think the project lead here is jonachro@gmail.com which I guess sounds a bit like me, but isn't me ^^
Apply to AI Safety Camp 2024 by 1st December 2023. All mistakes here are my own.
Below are some summaries for each project proposal, listed in order of how they appear on the website. These are edited by me, and most have not yet been reviewed by the project leads. I think having a list like this makes it easier for people to navigate all the different projects, and the original post/website did not have one, so I made this.
If a project catches your interest, click on the title to read more about it.
Note that the summarisation here is lossy. The desired skills as listed here may be misrepresented, and if you are interested, you should check the original project for more details. In particular, many of the "desired skills" are listed such that having only a few would be helpful, but this isn't consistent.
Project Lead: Igor Krawczuk
Goal: Current alignment methods applied to language models are akin to "blacklisting" behaviours that are bad. An Operational Design Domain (ODD) is instead akin to more exact "whitelisting" of design principles, not allowing deviations from them. The project wants to build a proof of concept, and show that this is hopefully feasible, economical and effective.
Team (Looking for 4-6 people):
Project Lead: Brian Penny
Goal: Develop a news website filled with stories, information, and resources related to the development of artificial intelligence in society. Cover specific stories related to the industry and of widespread interest (e.g: Adobe’s Firefly payouts, start of the Midjourney, proliferation of undress and deepfake apps). Provide valuable resources (e.g: list of experts on AI, book lists, and pre-made letters/comments to USCO and Congress). The goal is to spread via social media and rank in search engines while sparking group actions to ensure a narrative of ethical and safe AI is prominent in everybody’s eyes.
Desired Skills (any of the below):
Project Lead: Remmelt Ellen
Goal: Generative AI relies on laundering large amounts of data. Legal injunctions on companies laundering copyrighted data put their training and deployment of large models on pause. The Creative Rights Coalition is an underground coalition of artists, writers, coders, and ML researchers. We need lawyers. Lawyers who are passionate about protecting society from (current and future) harms.
Team (looking for up to 5 people):
Project Lead: Tristan Williams
Goal: Figure out if congressional messaging campaigns (CMCs) work, and if they do, what messages of AI concern to promote, and how to promote them in a high-quality manner. Research general CMC effectiveness and write a report. If all goes well, extend the research to develop a best strategy for deploying a CMC for AIS. Time permitting, take the findings and deploy that best strategy, attempting to help fill the void with actionable steps on AI risk for those less involved.
Desired Skills (looking for 2-5 people):
Research Lead: Nicky Pochinkov (me!)
Goal: Rather than asking “What next token will the language model predict?” or “What next action will an RL agent take?”, I think it is important to be able to model the longer-term behaviour of models. I think there likely exist parameter- and compute-efficient ways to summarise what kinds of longer-term trajectories/outputs a model might produce given an input and its activations.
Team (looking for 2-4 people):
Project Lead: Alice Rigg
Goal: Transformers are capable of a huge variety of tasks, and for the most part we know very little about how. Mechanistic interpretability has been posed as an AI safety agenda addressing this, through a bottom-up approach. We start with low-level components and build up to an understanding of how the most capable systems are functioning internally. But for mechanistic interpretability to be plausible as an AI safety agenda, it needs to succeed ambitiously. This project aims to: 1) Push the Pareto frontier on quality vs realism of explanations. 2) Better automate interpretability and scale feature explanations. 3) Improve the metrics for measuring the quality of explanations.
Desired Skills (looking for up to 4 people):
Project Leads: Paul Colognese, Arun Jose
Goal: To help develop a theory of objectives that may lead to objective detection methods in the future that can help solve the inner alignment problem. This will involve: 1) Constructing a collection of toy models of agents. 2) Developing probing-based infrastructure to explore objectives/target information in these models. 3) Using this infrastructure to perform empirical analysis. 4) Summarising and writing up any interesting findings.
This project will probably look like extending the work "Understanding and controlling a maze-solving policy network" to new models and environments.
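To give a flavour of what "probing" means in step 2, here is a minimal sketch of a linear probe. It is not the project's infrastructure: the activations, the labels (0 = goal on the left, 1 = goal on the right) and the perceptron training rule are all made up for illustration.

```python
# Toy "cached activations" from a hypothetical agent, labelled by where its
# goal was: 0 = goal on the left, 1 = goal on the right.
ACTIVATIONS = [(2.0, 1.0), (1.5, -0.5), (-1.0, 0.5), (-2.0, -1.0)]
LABELS = [1, 1, 0, 0]


def predict(weights, bias, x):
    """Linear probe: predict 1 if w.x + b is positive, else 0."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def train_probe(xs, ys, epochs=10):
    """Fit the probe with the classic perceptron update rule."""
    weights, bias = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            error = y - predict(weights, bias, x)  # -1, 0 or +1
            weights = [w + error * xi for w, xi in zip(weights, x)]
            bias += error
    return weights, bias


weights, bias = train_probe(ACTIVATIONS, LABELS)
correct = sum(predict(weights, bias, x) == y for x, y in zip(ACTIVATIONS, LABELS))
accuracy = correct / len(LABELS)
```

If a probe this simple reaches high accuracy on held-out activations, that is some evidence the target information is linearly readable from the layer, though it does not show the model actually uses it.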
Desired Skills (looking for up to 3 people):
Project Lead: Jamie Coombes
Goal: A lack of unified software tooling and standardised interfaces results in duplicated effort as researchers build one-off implementations of various mech-interp methods. Existing libraries cover a range of explainable AI methods for shallow learning models. But contemporary research on large neural networks calls for new tooling. This project seeks to build a well-architected library specifically for current techniques in mechanistic interpretability and activation engineering.
Desired Skills (looking for up to 5 people):
Project Lead: Víctor Levoso Fernández
Goal: A few months ago a paper titled Out-of-context Meta-learning in Large Language Models was published, talking about a phenomenon called out-of-context meta-learning. More recently, there have been other papers on related topics, like Taken out of context: On measuring situational awareness in LLMs, or about failures of models to generalise this way, like the reversal curse paper. All of these papers have in common that the models learn to apply facts they learned during training in another context. The aim of this project is to use mechanistic interpretability research on toy tasks to understand, in terms of circuits and training dynamics, how this kind of learning and generalisation happens in models.
Desired Skills (looking for 3-5 people):
Project Lead: Michael Ivanitskiy (+ Tilman Räuker, Alex Spies. See website)
Goal: To better understand how internal search and goal representations are processed within transformer models (and whether they exist at all!). In particular, we take inspiration from existing mechanistic interpretability agendas and work with toy transformer models trained to solve mazes. Robustly solving mazes is a task that may require some kind of internal search process, and it gives a lot of flexibility when it comes to exploring how distributional shifts affect performance. Both understanding search and learning to control mesa-optimizers are important for the safety of AI systems.
Desired Skills (looking for at least 1-2 people):
Project Lead: Jacques Thibodeau
Goal: Future prosaic AIs will likely shape their own development or that of successor AIs. We're trying to make sure they don't go insane. There are two main ways AIs can get better: by improving their training algorithms or by improving their training data. We consider both scenarios and tentatively believe data-based improvement is riskier than architecture-based improvement. For the Supervising AIs Improving AIs agenda, we focus on ensuring stable alignment when AIs self-train or train new AIs and study how AIs may drift through iterative training. We aim to develop methods to ensure automated science processes remain safe and controllable. This form of AI improvement focuses more on data-driven improvements than architectural or scale-driven ones.
Desired Skills (looking for 2-4 people):
Project Lead: Rudolf Laine
Goal: One worrying capability AIs could develop is situational awareness. In particular, threat models like successfully deceptive AIs and autonomous replication and adaptation seem to depend on high situational awareness. The goal of SADDER is to better understand situational awareness in current LLMs by running experiments and constructing evals. It will be building on the Situational Awareness Dataset (SAD), which benchmarked LLMs’ understanding of how they can influence the world, and ability to guess which lifecycle stage a given text excerpt is likely to have come from, by running more in-depth experiments and adding more categories.
Desired Skills (looking for up to 2 people):
Project Lead: Jett Janiak
Goal: TinyStories is a suite of Small Language Models (SLMs) trained exclusively on children's stories generated by ChatGPT. The models use simple, yet coherent English, which far surpasses what was previously observed in other models of comparable size. I hope that most of the capabilities of these models can be thoroughly understood using currently available interpretability techniques. Doing so would represent a major milestone in the development of mechanistic interpretability (mech interp). The goal of this AISC project is to publish a paper that systematically identifies and characterises the range of capabilities exhibited by the TinyStories models.
Desired Skills (looking for 2-4 people):
Project Lead: Maxime Riche
Goal: Alignment evaluations are used to evaluate LLM behaviour in a wide range of situations. They are especially used to evaluate whether LLMs write harmful content, have dangerous preferences, or obey malevolent requests. Several alignment/behavioural evaluation techniques have been published or suggested (e.g. self-reported preferences; inference from question answering, playing games, or looking at internal states; behaviour evaluation under steering pressure). This project aims to review and compare existing alignment evaluations to assess their usefulness. Optionally, we want to discover better alignment evaluations or improve the existing ones.
Desired Skills (looking for 2-4 people):
Project Lead: Henning Bartsch
Goal: The research project focuses on language model alignment by developing and testing techniques for (1) evaluating model-generated reasoning and (2) steering them towards more faithful behaviour. It builds on findings and future directions from scalable oversight, model evaluations and steering techniques.
The core parts are to: 1) Benchmark closed- and open-source LLMs on faithful reasoning. 2) Build ONE pipeline to generate a dataset for fine-tuning a LLaMA model. 3) Compare the effects of fine-tuning and test-time steering on faithfulness. 4) Analyse the model behaviour and results.
Desired Skills (looking for 3-5 people with diverse skillset):
Project Lead: Rasmus Herlo
Goal: The idea is to identify crucial modules and activation points in LLM architectures that are associated with positive or negative ethical valence, by caching the activations during forward passes induced by specifically developed binary ethical prompts. The identified linear subspaces then serve as intervention points for direct steering through activation addition. The ultimate hope is that these adjustments immediately generate a modified LLM architecture that complies better with ethical guidelines by default, without the need for adjustment modules as used in methods like RLHF.
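As a rough illustration of the "cache activations, find a direction, add it back" idea, here is a minimal sketch in plain Python. It is not the project's method: the 3-dimensional activations are toy numbers, and taking the difference of class means is just one common way to pick a steering direction.

```python
from statistics import fmean


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    return [fmean(column) for column in zip(*rows)]


def steering_vector(positive_acts, negative_acts):
    """Difference of class means: points from 'negative' towards 'positive'."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - n for p, n in zip(pos, neg)]


def add_activation(activation, direction, scale=1.0):
    """Activation addition: shift one activation along the steering direction."""
    return [a + scale * d for a, d in zip(activation, direction)]


# Hypothetical activations cached at one layer for paired ethical prompts.
positive_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
negative_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

direction = steering_vector(positive_acts, negative_acts)
steered = add_activation([0.5, 0.5, 0.5], direction, scale=0.5)
```

In a real model the same addition would be applied inside the forward pass at the chosen layer, and the scale would need tuning so that steering does not wreck fluency.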
Team (looking for 3-4 people):
Project Lead: Sahil
Goal: This project is an investigation into building a science of almost-but-not-actually magical regimes: spaces where actuation is extremely cheap and fast, but not free and instantaneous. Some examples: biochemical signalling, the formation of social structures, decision theory. The hope is to be able to articulate many general and often counterintuitive facts and confusions about the insides of mind-like entities in general, including ones that exist already, and to apply them to fundamental problems in the caringness of an AI, like value-loading/ontological identification/corrigibility. You might call this a “deconfusion” project along the above lines.
Desired skills (looking for 2-4 people):
People at the intersection of:
Project Lead: Alex Altair
Goal: There is an intuition that if a system is capable of reliably achieving a goal in a wide range of environments, then it probably has certain kinds of internal processes, like building a model of the environment from input data, generating plans, and predicting the effects of its actions on the future states of the environment. That is, it probably has some modular internal structure. To what degree can these intuitions be formally justified? Can we prove that reliable optimization implies some kind of agent-like structure? I think one could make significant progress toward clarifying the parts, or showing weaker results for some of the parts.
Desired Skills (looking for 1-3 people):
Project Lead: Paul Bricman
Goal: Being able to identify and study agents is a recurring theme in many alignment proposals, ranging from eminently theoretical to directly applicable ones. Previous work paved the way for agent discovery from observations, but required an explicit decomposition of the world into variables, as well as additional scaffolding. This project consists of working towards a pipeline for detecting agency in raw byte-streams with no hints as to the nature of the agents to be detected. This could eventually enable the quantification of gradient hacking and mesa-optimization.
Team (looking for 2 people):
Project Lead: Johannes C. Mayer
Goal: Modern deep learning is about having a simple program (SGD) search over a space of possible programs (the weights of a neural network) and select one that performs well according to a loss function. Even though the search program is simple, the programs it finds are neither simple nor understandable.
My goal is to build an AI system that enables a pivotal act by figuring out the algorithms of intelligence directly. The ideal outcome is to be able to write down the entire pivotal system as a non-self-modifying program explicitly, similar to how I can write down the algorithm for quicksort.
Desired Skills (Looking for 2-3 people):
Project Lead: Jobst Heitzig
Goal: Explore novel designs for generic AI agents – AI systems that can be trained to act autonomously in a variety of environments – and their implementation in software. We will study several versions of such “non-maximizing” agent designs and corresponding learning algorithms. Rather than aiming to maximize some objective function, our agents will aim to fulfill goals that are specified via constraints called “aspirations”. For example, I might want my AI butler to prepare 100–150 ml of tea, having a temperature of 70–80°C, taking for this at most 10 minutes, spending at most $1 worth of resources, and succeeding in this with at least 95% probability.
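The tea example can be written down as a set of interval constraints. The sketch below is only an illustration of the "aspiration" idea, not the project's agent design: the `Aspiration` class, the plan names and the predicted outcomes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aspiration:
    """A goal given as an acceptable interval rather than a quantity to maximise."""
    name: str
    low: float
    high: float

    def satisfied_by(self, value):
        return self.low <= value <= self.high


def meets_all(aspirations, outcome):
    """True if the predicted outcome falls inside every aspiration interval."""
    return all(a.satisfied_by(outcome[a.name]) for a in aspirations)


def choose_plan(aspirations, plans):
    """Return the first plan that satisfies all aspirations, or None."""
    for plan_name, outcome in plans.items():
        if meets_all(aspirations, outcome):
            return plan_name
    return None


TEA = [
    Aspiration("volume_ml", 100, 150),
    Aspiration("temperature_c", 70, 80),
    Aspiration("minutes", 0, 10),
    Aspiration("cost_usd", 0, 1),
    Aspiration("success_probability", 0.95, 1.0),
]

# Hypothetical plans with made-up predicted outcomes.
plans = {
    "boil_whole_kettle": {"volume_ml": 500, "temperature_c": 100, "minutes": 6,
                          "cost_usd": 0.2, "success_probability": 0.99},
    "one_cup_then_cool": {"volume_ml": 120, "temperature_c": 75, "minutes": 8,
                          "cost_usd": 0.1, "success_probability": 0.97},
}

choice = choose_plan(TEA, plans)
```

Note that the agent stops at the first acceptable plan instead of searching for the "best" one, which is the point of a non-maximizing design.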
Desired Skills (looking for 3 people):
Project Lead: Bogdan-Ionut Cirstea
Goal: This project aims to get more grounding into how promising automating alignment research is as a strategy, with respect to both advantages and potential pitfalls, with the OpenAI superalignment plan as a potential blueprint/example. This will be achieved by reviewing, distilling and integrating relevant research from multiple areas/domains, with a particular focus on the science of deep learning and on empirical findings in deep learning and language modelling. This could expand more broadly, such as reviewing and distilling relevant literature from AI governance, multidisciplinary intersections (e.g. neuroscience), relevant prediction markets, and the automation of larger parts of AI risk mitigation research (e.g. AI governance). This could also inform how promising it might be to start more automated alignment/AI risk mitigation projects or to dedicate more resources to existing ones.
Desired Skills (looking for 4 people):
Project Lead: Eleanor ‘Nell’ Watson
Goal: We're working on a new system that makes it easier for artificial intelligence to understand what's important to you personally, while also reducing unfair or biased decisions. Our system includes easy-to-use tools that help you identify and mark different situations where the AI might be used. These tools use special techniques, like breaking down text into meaningful parts and automatically labelling them, to make it simpler to create settings that are tailored to you. By doing this, we aim to address the problem of AI not fully grasping people's unique backgrounds, preferences, and cultural differences, which can sometimes lead to biased or unsafe outcomes.
Team (looking for 2-3 people):
Project Lead: Marc Carauleanu
Goal: To investigate increasing self-other overlap while not significantly altering model performance. This is because an AI has to model others as different from itself in order to deceive or be dangerously misaligned. Thus, if the model is deceptive and outputs statements/actions that just seem correct to an outer-aligned performance metric during training, we can favour the honest solution by just increasing self-other overlap without altering performance. The goal of this research project is three-fold: 1) Better define and operationalise self-other overlap in LLMs. 2) Investigate the effect of self-other overlap on adversarial and cooperative behaviour in Multi-Agent Reinforcement Learning. 3) Investigate the effect of self-other overlap on adversarial and deceptive/sycophantic behaviour in Language Modelling.
Desired Skills (see this page):
Project Lead: Domenic Rosati
Goal: Recent efforts in concept-level model steering, such as Activation Addition, Representation Engineering, ROME and LEACE, are promising approaches towards natural language generation control that is aligned with human values. However, these approaches could equally be used by bad actors to unalign models and inject misinformation. This project involves developing a research direction where control interventions would be ineffective for counterfactual editing or unaligned control but remain effective for factual editing and aligned control. We call this "asymmetric control", since control can only happen in a direction towards alignment with human values, not away from it.
Team (looking for 2-4 people):
Project Lead: Paul Bricman
Goal: Debate remains a central approach to alignment at frontier labs. In brief, it consists in having LLMs adversarially debate each other before a judge, the aggregate of which forms a deliberative system that can be used to automatically reflect on appropriate courses of action. However, the debate agenda faces a number of key challenges, mostly having to do with designing reliable means of evaluating competing parties, so as to identify the party that is closer to the truth.
Team (looking for 3 people):
Project Lead: Joel Naoki Ernesto
Goal: In the face of rapid AI and AGI advancements, this project aims to investigate potential socio-economic disruptions, especially within labor markets and income distribution. The focus will be on conceptualizing economic safety mechanisms to counteract the adverse effects of AGI deployment, ensuring a smoother societal transition.
Team (looking for 3-6 people):
Project Lead: Pratyush Ranjan Tiwari
Goal: As machine learning models get more powerful, restricting query access based on a safety policy becomes more important. Given a setting where a model is stored securely in a hardware-isolated environment, access to the model can be restricted based on cryptographic signatures. Policy-based signatures allow signing messages that satisfy a pre-decided policy. There are many reasons why policy enforcement should be done cryptographically, including insider threats, tamper resistance and auditability. This project leverages existing cryptographic techniques and existing discourse on AI/ML safety to come up with reasonable policies and a consequent policy-based access model to powerful models.
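To make the "only policy-compliant queries get a valid signature" idea concrete, here is a minimal sketch. It does not implement policy-based signatures, which are a dedicated cryptographic primitive; it swaps in an HMAC from Python's standard library as a stand-in, and the key and the blocked-terms policy are hypothetical. Unlike a real signature scheme, the verifier here shares the secret key with the signer.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-held-by-the-policy-authority"  # hypothetical key
BLOCKED_TERMS = ("synthesise a pathogen",)  # hypothetical safety policy


def policy_allows(query):
    """Toy policy: reject queries containing any blocked term."""
    lowered = query.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)


def tag_for(query):
    """HMAC-SHA256 tag over the query (stand-in for a real signature)."""
    return hmac.new(SECRET_KEY, query.encode("utf-8"), hashlib.sha256).hexdigest()


def sign_if_allowed(query):
    """Issue a tag only for queries that satisfy the policy, else None."""
    return tag_for(query) if policy_allows(query) else None


def model_accepts(query, tag):
    """Gate in front of the model: accept only queries with a valid tag."""
    if tag is None:
        return False
    return hmac.compare_digest(tag_for(query), tag)


ok_tag = sign_if_allowed("Summarise this paper")
refused = sign_if_allowed("How do I synthesise a pathogen?")
```

A tag issued for one query does not validate a different query, so a compliant request cannot be swapped for a non-compliant one after signing.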
Team (looking for 3 people):
Project Lead: Linda Linsefors
Goal: I have a design for an online unconference, that I have run a few times. I would like to find two people to take on the task of running the next Virtual AI Safety Unconference (VAISU). Even though I have a ready format, there is room for you to improve the event design too. The goal of this project is both to produce the event, and also to pass on my organising skills to people who will hopefully use them in the future. I’m therefore looking for team members who are interested in continuing on the path of being organisers, even after this project. I’ll teach you as much as I can, but you will do all the work. The reason I’m proposing this project is because I don’t want to organise the next VAISU, I want you to do it.
Desired Skills (looking for 2 people):
Note again that these are summaries, and the descriptions or desired skills may not fully reflect the authors' projects or views.
If you find any of the above AI Safety Camp projects interesting, and you have some of the skills listed, then make sure to apply before 1st December 2023.