Apply to work on this project with me at AI Safety Camp 2024 before 1st December 2023.
Future prosaic AIs will likely shape their own development or that of successor AIs. We're trying to make sure they don't go insane.
There are two main ways AIs can get better: by improving their training algorithms or by improving their training data.
We consider both scenarios and tentatively believe data-based improvement is riskier than architecture-based improvement. Current models mostly derive their behaviour from their training data, not training algorithms (meaning their architectures, hyperparameters, loss functions, optimizers or the like).
For the Supervising AIs Improving AIs agenda, we focus on ensuring stable alignment when AIs self-train or train new AIs and study how AIs may drift through iterative training. We aim to develop methods to ensure automated science processes remain safe and controllable. This form of AI improvement focuses more on data-driven improvements than architectural or scale-driven ones.
Twitter thread explaining the agenda: https://twitter.com/jacquesthibs/status/1652389982005338112?s=46&t=YyfxSdhuFYbTafD4D1cE9A
We imagine a future where AIs self-augment by continuously seeking out more and better training data, and either creating successor AIs or training themselves on that data. Often, these data will come from the AIs running experiments in the real world (doing science), deliberately seeking data that would cover a specific gap in its current capabilities, analogous to how human scientists seek data from domains where our current understanding is limited. With AI, this could involve AgentGPT-like systems that spin up many instances of themselves to run experiments in parallel, potentially leading to quick improvements if we are in an agency overhang.
We want to find methods of ensuring such 'automated science' processes remain safe and controllable, even after many rounds of self-directed data collection and training. In particular, we consider problems such as:
We believe this project could facilitate the automatic evaluation of stable self-reflectivity, a crucial capability for data-driven improvement. Specifically, it may contribute to evaluation datasets that identify capabilities and safety concerns in future models before their release. Ideally, these techniques would be integrated into the data-driven improvement process, allowing the termination of a training run if it goes off the rails. While this project addresses a specific capability essential for data-driven improvement, there will eventually be other critical aspects to consider, such as goal-directedness and power-seeking behaviours.
For the AI Safety Camp, we will focus on the Benchmarks for Stable Reflectivity project with the Supervising AIs Improving AIs agenda. We will discuss this project below.
Recent approaches allow language models to generate their own training data and self-evaluate their own outputs, allowing the models significant influence over their own training process. This raises concerns about reflectivity and the dynamics it introduces. While current data improvement processes circumvent direct forms of this issue by not informing AI of the ongoing training, future AIs may be aware of this influence and use it to steer their future cognition in accordance with their current preferences.
Any robustly aligned AI should also want to remain aligned in the future. I.e., they should have preferences over their future cognition, and act in line with those preferences. At the same time, some of the most concerning alignment failure modes also fall into this category: deceptive alignment involves an AI that wants to remain unaligned, and acts in line with those preferences by manipulating the training process.
Contemporary RL setups may lead language models to acquire some degree of reflectivity or self-knowledge. E.g., chatbots may benefit from knowing the limits of their own capabilities (a form of self-knowledge), or from knowing the intention behind their deployment (a form of reflectivity). OpenAI has furnished ChatGPT-3.5 and ChatGPT-4 with both types of information.
OpenAI provides ChatGPT with various facts about itself as a hidden prompt:
OpenAI also trained ChatGPT to be aware of the purpose for which it was trained:
Note that ChatGPT also says its "purpose is to continuously learn and improve." Only 1 out of 10 responses to this prompt mentioned a desire for self-improvement, so OpenAI probably did not explicitly train it to respond in this manner.
Future AIs may understand that their outputs' impact their training (either through direct instruction or generalization from their training data) and have preferences regarding those impacts. In anticipation of such a possibility, we aim to investigate the behaviour of current AIs in varying contexts that evoke reflectivity or require self-knowledge. Eventually, we expect this evaluation benchmark to be used for future models to keep track of the evolving self-reflectivity of such systems and how different forms of data, fine-tuning methods, scale, and (cognitive and neural network) architectures impact the model’s self-reflectivity.
We have adopted a practical approach to defining self-reflectivity by focusing on relevant subtasks associated with reflective behaviour in the context of AI self-improvement. Currently, these subtasks are (with rough input examples to the AI system and its response output):
This decomposition enables progress tracking on subtasks related to self-reflectivity. Previous research has demonstrated that although larger model sizes give rise to emergent behaviours, underlying improvements are often smoother, which can be revealed by breaking down tasks in ways that better capture partial progress. As a consequence, we divide self-reflection into subtasks and evaluate improvements for each.
We are developing a flexible pipeline to automatically generate probing datasets using current language models. This involves defining subtasks with high-quality examples, creating extensive datasets to assess model competency, and evaluating various models on each subtask. Challenges include:
We will now cover the project specifics below.
This project focuses on building probing datasets to evaluate a model's competence at various sub-tasks associated with reflectivity, metacognition, and value stability.
We intend to generate ~300 high-quality labelled data points (similar to what was shown above) for each subtask as well as a pipeline for quickly generating and validating more probing datasets. The tests will be run on multiple models (base, instruction-tuned, and RLHF-like) at various model sizes.
The project may evolve over time to add to the probing dataset. Particularly, I am currently exploring the idea of including interpretability techniques to measure model internals (ELK-style measurement to test whether the model is telling the truth) as well as applying activation steering. Indeed, this may prove to be essential to the pipeline due to worries about deceptive model outputs.
This project aims to publish an academic paper (and accompanying blog post(s)) and create a probing dataset that can be used to evaluate models.
In alignment, we must strike a balance between learning to align future powerful AIs and the potential negative externalities of advancing capability research. We acknowledge this dilemma and aim to be deliberate about the potential consequences of our work.
This research agenda focuses on self-improving systems, meaning systems that take actions to steer their future cognition in desired directions. These directions may include reducing biases, but also enhancing capabilities or preserving their current goals. Many alignment failure stories feature such behaviour. Some researchers postulate that the capacity for self-improvement is a critical and dangerous threshold; others believe that self-improvement will largely resemble the human process of conducting ML research, and it won't accelerate capabilities research more than it would accelerate research in other fields.
Data curation and generation are clear use cases for language models, as shown by the number of recent papers linked throughout this post. Most of this research aims at advancing capabilities since LM self-improvement could have significant commercial uses - it's possible to circumvent data-sourcing problems by using LMs to curate, improve, or generate their own training data.
Our focus lies on understanding the risks and unintended consequences of self-improvements. Thus, the insights obtained will likely enhance the safety of an already existing trend without significantly boosting capabilities. The self-reflective data curation process doesn't appear likely to instill or elicit dramatic, novel capabilities in a model. It yields predictable improvements in each iteration, as opposed to significant leaps from algorithmic advancements (e.g., LSTM to Transformer architecture). Given that our tasks resemble human-performed data curation, we are less concerned about the "threshold" family of threat models. Nonetheless, if it seems likely at any point that our research would significantly advance capabilities on this frontier, we would try to limit its dissemination or avoid releasing it altogether.
In short, it seems likely that the most detrimental effects of this kind of research would happen with or without our involvement. However, our work might reveal new insights into the risks and dynamics of iterative self-improvement.
This agenda was initially created by Quintin Pope. Owen Dudney and Roman Engeler worked on it during their time in the MATS program. Jacques helped write multiple sections in the research agenda post.
3 to 5
Jacques ThibodeauEmail: email@example.com
I have experience building datasets, training and fine-tuning language models, and interpretability.
I am happy to spend up to 8 hours weekly (1 half-day + spread out time during the week).
Minimum skill requirements:
Additional skills which would be useful: