TL;DR

  1. We are a new AI evals research organization called Apollo Research based in London. 
  2. We think that strategic AI deception – where a model outwardly seems aligned but is in fact misaligned – is a crucial step in many major catastrophic AI risk scenarios and that detecting deception in real-world models is the most important and tractable step to addressing this problem.
  3. Our agenda is split into interpretability and behavioral evals:
    1. On the interpretability side, we are currently working on two main research bets toward characterizing neural network cognition. We are also interested in benchmarking interpretability, e.g. testing whether given interpretability tools can meet specific requirements or solve specific challenges.
    2. On the behavioral evals side, we are conceptually breaking down ‘deception’ into measurable components in order to build a detailed evaluation suite using prompt- and finetuning-based tests. 
  4. As an evals research org, we intend to use our research insights and tools directly on frontier models by serving as an external auditor of AGI labs, thus reducing the chance that deceptively misaligned AIs are developed and deployed. 
  5. We also intend to engage with AI governance efforts, e.g. by working with policymakers and providing technical expertise to aid the drafting of auditing regulations.
  6. We have starter funding but estimate a $1.4M funding gap in our first year. We estimate that the maximal amount we could effectively use is $7-10M* in addition to current funding levels (reach out if you are interested in donating). We are currently fiscally sponsored by Rethink Priorities.
  7. Our starting team consists of 8 researchers and engineers with strong backgrounds in technical alignment research. 
  8. We are interested in collaborating with both technical and governance researchers. Feel free to reach out at info@apolloresearch.ai.
  9. We intend to hire once our funding gap is closed. If you’d like to stay informed about opportunities, you can fill out our expression of interest form.

*Updated June 4th after re-adjusting our hiring trajectory

Research Agenda

We believe that AI deception – where a model outwardly seems aligned but is in fact misaligned and conceals this fact from human oversight – is a crucial component of many catastrophic risk scenarios from AI (see here for more). We also think that detecting/measuring deception is causally upstream of many potential solutions. For example, having good detection tools enables higher quality and safer feedback loops for empirical alignment approaches, enables us to point to concrete failure modes for lawmakers and the wider public, and provides evidence to AGI labs whether the models they are developing or deploying are deceptively misaligned.

Ultimately, we aim to develop a holistic and far-ranging suite of deception evals that includes behavioral tests, fine-tuning, and interpretability-based approaches. Unfortunately, we think that interpretability is not yet at the stage where it can be used effectively on state-of-the-art models. Therefore, we have split the agenda into an interpretability research arm and a behavioral evals arm. We aim to eventually combine interpretability and behavioral evals into a comprehensive model evaluation suite.

On the interpretability side, we are currently working on a new unsupervised approach and continuing work on an existing approach to attack the problem of superposition. Early experiments have shown promising results, but it is too early to tell if the techniques work robustly or are scalable to larger models. Our main priority, for now, is to scale up the experiments and ‘fail fast’ so we can either double down or cut our losses. Furthermore, we are interested in benchmarking interpretability techniques by testing whether given tools meet specific requirements (e.g. relationships found by the tool successfully predict causal interventions on those variables) or solve specific challenges such as discovering backdoors and reverse engineering known algorithms encoded in network weights.
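
To make the first requirement concrete, here is a minimal sketch of checking whether the relationships reported by a tool predict the effect of causal interventions. It is a toy illustration under stated assumptions, not our actual benchmark: the "model" is a hand-written linear function, the tool's output is assumed to be a list of per-input weights, and all function and variable names are hypothetical.

```python
# Toy illustration only: every name here is hypothetical.

def toy_model(x):
    """Stand-in 'network': the output depends on x[0] and x[2], never on x[1]."""
    return 3.0 * x[0] - 2.0 * x[2]

def predicted_effect(claimed_weights, index, delta):
    """Output change that the tool's claimed linear relationship predicts."""
    return claimed_weights[index] * delta

def measured_effect(model, x, index, delta):
    """Output change actually observed when we intervene on one input."""
    intervened = list(x)
    intervened[index] += delta
    return model(intervened) - model(x)

def passes_intervention_check(model, claimed_weights, x, delta=1.0, tol=1e-6):
    """True if every claimed relationship predicts its intervention within tol."""
    return all(
        abs(predicted_effect(claimed_weights, i, delta)
            - measured_effect(model, x, i, delta)) <= tol
        for i in range(len(x))
    )

baseline = [0.5, -1.0, 2.0]
faithful_tool = [3.0, 0.0, -2.0]    # matches the true dependencies
unfaithful_tool = [3.0, 1.5, 0.0]   # invents a dependency on x[1], misses x[2]

print(passes_intervention_check(toy_model, faithful_tool, baseline))
print(passes_intervention_check(toy_model, unfaithful_tool, baseline))
```

A real benchmark would intervene on internal activations of a trained network rather than on the inputs of a known function, but the pass/fail logic has the same shape: a tool's claims are scored by how well they predict what happens under intervention.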

On the model evaluations side, we want to build a large and robust eval suite to test models for deceptive capabilities. Concretely, we intend to break down deception into its component concepts and capabilities. We will then design a large range of experiments and evaluations to measure both the component concepts as well as deception holistically. We aim to start running eval experiments and set up pilot projects with labs as soon as possible to get early empirical feedback on our approach.

Plans beyond technical research

As an evals research org, we intend to put our research into practice by engaging directly in auditing and governance efforts. This means we aim to work with AGI labs to reduce the chance that they develop or deploy deceptively misaligned models. The details of this transition depend a lot on our research progress and our level of access to frontier models. We expect that sufficiently capable models will be able to fool all behavioral evaluations and thus some degree of ‘white box’ access will prove necessary. We aim to work with labs and regulators to build technical and institutional frameworks wherein labs can securely provide sufficient access without undue risk to intellectual property. 

On the governance side, we want to use our technical expertise in auditing, model evaluations, and interpretability to inform the public and lawmakers. We are interested in demonstrating the capacity of models for dangerous capabilities and the feasibility of using evaluation and auditing techniques to detect them. We think that showcasing dangerous capabilities in controlled settings makes it easier for the ML community, lawmakers, and the wider public to understand the concerns of the AI safety community. We emphasize that we will only demonstrate such capabilities if it can be done safely in controlled settings. Showcasing the feasibility of using model evaluations or auditing techniques to prevent potential harms increases the ability of lawmakers to create adequate regulation. 

We want to collaborate with independent researchers, technical alignment organizations, AI governance organizations, and the wider ML community. If you are (potentially) interested in collaborating with us, please reach out.

Theory of change

We aim to achieve a positive impact on multiple levels:

  1. Direct impact through research: If our research agenda works out, we will further the state of the art in interpretability and model evaluations. These results could then be used and extended by academics and other labs. We can have this impact even if we never get any auditing access to state-of-the-art models. We carefully consider how to mitigate potential downside risks from our research by controlling which research we publish. We plan to release a document on our policy and processes related to this soon.
  2. Direct impact through auditing: Assuming we are granted some level of access to state-of-the-art models of various AGI labs, we could help them determine if their model is, or could be, strategically deceptive and thus reduce the chance of developing and deploying deceptive models. If, after developing state-of-the-art interpretability tools and behavioral evals and using them to audit potentially dangerous models, we find that our tools are insufficient for the task, we commit to using our knowledge and position to make the inadequacy of current evaluations widely known and to argue for the prevention of potentially dangerous models from being developed and deployed.
  3. Indirect impact through demonstrations: We hope that demonstrating the capacity of models for dangerous capabilities shifts the burden of proof from the AI safety community to the AGI labs. Currently, the AI safety community has the implicit burden of showing that models are dangerous. We would like to move toward a world where the burden is on AGI labs to show why their models are not dangerous (similar to medicine or aviation). Additionally, demonstrations of deception or other forms of misalignment ‘in the wild’ can provide an empirical test bed for practical alignment research and also be used to inform policymakers and the public of the potential dangers of frontier models. 
  4. Indirect impact through governance work: We intend to contribute technical expertise to AI governance where we can. This could include the creation of guidelines for model evaluations, conceptual clarifications of how AIs could be deceptive, suggestions for technical legislation, and more.

We do not think that our approach alone could yield safe AGI. Our work primarily aims to detect deceptive unaligned AI systems and prevent them from being developed and deployed. The technical alignment problem still needs to be solved. The best case for strong auditing and evaluation methods is that they can convert a ‘one-shot’ alignment problem into a many-shot problem where it becomes feasible to iterate on technical alignment methods in an environment of relative safety.

Status

We have received sufficient starter funding to get us off the ground. However, we estimate that we have a $1.4M funding gap for the first year of operations and could effectively use an additional $7-10M* in total funding. If you are interested in funding us, please reach out. We are happy to address any questions and concerns. We currently pay lower than competitive salaries but intend to increase them as we grow to attract and retain talent.

We are currently fiscally sponsored by Rethink Priorities but intend to spin out after 6-12 months. The exact legal structure is not yet determined, and we are considering both fully non-profit models as well as limited for-profit entities such as public benefit corporations. Whether we will attempt the limited for-profit route depends on the availability of philanthropic funding and whether we think there is a monetizable product that increases safety. Potential routes to monetization would be for-profit auditing or red-teaming services and interpretability tooling, but we are wary of the potentially misaligned incentives of this path. In an optimal world, we would be fully funded by philanthropic or public sources to ensure maximal alignment between financial incentives and safety. 

Our starting members include:

  • Marius Hobbhahn (Director/CEO)
  • Beren Millidge (left on good terms to pursue a different opportunity)
  • Lee Sharkey (Research/Strategy Lead, VP)
  • Chris Akin (COO)
  • Lucius Bushnaq (Research scientist)
  • Dan Braun (Lead engineer)
  • Mikita Balesni (Research scientist)
  • Jérémy Scheurer (Research scientist, joining in a few months)

FAQ

How is our approach different from ARC evals?

There are a couple of technical and strategic differences:

  1. At least early on, we will focus primarily on deception and its prerequisites, while ARC evals is investigating a large range of capabilities including the ability of models to replicate themselves, seek power, acquire resources, and more.
  2. We intend to use a wide range of approaches to detect potentially dangerous model properties right from the start, including behavioral tests, fine-tuning, and interpretability. To the best of our knowledge, ARC evals intends to use these tools eventually but is currently mostly focused on behavioral tools. 
  3. We intend to perform fundamental scientific research in interpretability in addition to developing a suite of behavioral evaluation tools. We think it is important that audits ultimately include evaluations of both external behavior and internal cognition. This seems necessary to make strong statements about cognitive strategies such as deception.

We think our ‘narrow and deep’ approach and ARC’s ‘broad and less deep’ approach are complementary strategies. Even if we had no distinguishing features from ARC evals other than being a different team, we would still deem it net positive to have multiple somewhat uncorrelated evaluation teams.

When will we start hiring?

We are starting with an unusually large team. We expect this to work well because many of us have worked together previously, and we all agree on this fairly concrete agenda. However, we still think it is wise to take a few months to consolidate before growing further. 

We think our agenda is primarily bottlenecked by engineering and hands-on research capacity rather than conceptual questions. Furthermore, we think we have the management capacity to onboard additional people. We are thus heavily bottlenecked by funding at the moment, and it is unclear when and how many people we can hire in the near future. If this bottleneck is resolved, we plan to start hiring soon.

We have an expression of interest form for potential applicants. You can add your name and we will inform you when we open a hiring round. We might also reach out individually to researchers who are a great fit for collaborations.

Do we ever plan to be a for-profit organization?

This depends on a lot of factors and we have not made any final decisions. In the case where we take a constrained for-profit route, we would legally ensure that we are not obligated to maximize profit and carefully select the donors and investors we work with to make sure they share our AI safety goals and understand our mission. We are currently unsure whether the mission of reducing catastrophic risks from AI can be fully compatible with a for-profit setup. We think offering auditing or red-teaming services or providing interpretability tools are candidates for monetizable strategies that align with reducing catastrophic risks, but trying to maximize profits from these strategies introduces obvious perverse incentives, and we need to think carefully about how to mitigate them.

Isn’t this research dangerous?

Some people have argued that behavioral evals that investigate dangerous capabilities could themselves be a source of risk, e.g. that we accidentally create a dangerous deceptive model through our efforts or that we create a public blueprint for others to create one. We think this is a plausible concern. We have two main considerations.

  1. There are trade-offs between risk and information. Right now, we think that state-of-the-art models are not sufficiently capable to cause the serious risks that we are worried about, but we expect this to change in the future. The information we get from model evaluations should shed light on upcoming dangerous capabilities. Therefore, the trade-off currently seems worth making. But that might change in the future, and we will reassess our stance on this regularly.
  2. We can simulate potentially dangerous capabilities in controlled settings. By default, this means that we are very careful about which experiments we run and the environments they are run in. Furthermore, many dangerous capabilities are hard to test for directly. Therefore, our plan is to split them up into narrower capabilities that are less dangerous. For example, rather than testing if a model is capable of deception, we can test if it has goals, instrumental reasoning, and situational awareness. This is analogous to pointing out that someone has enough Uranium for an atomic bomb but isn’t capable of building it. However, in case we think it is the best strategy after taking the risks involved into account, we would consider doing end-to-end evaluations, e.g. directly testing for deception in carefully controlled environments.
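
As a rough sketch of the decomposition idea in point 2, the snippet below records scores for narrower component evals and reports which prerequisites for deception appear to be present, without ever running an end-to-end deception test. The component names come from the example above; the scores, the 0.8 threshold, and all class and function names are placeholder assumptions for illustration, not results or our actual tooling.

```python
from dataclasses import dataclass

# Component names taken from the example in the text.
REQUIRED = ["goals", "instrumental_reasoning", "situational_awareness"]

@dataclass
class ComponentResult:
    name: str
    score: float  # hypothetical: fraction of eval items passed, 0.0 to 1.0

def prerequisites_present(results, threshold=0.8):
    """Names of component capabilities whose score meets the threshold."""
    return [r.name for r in results if r.score >= threshold]

def summarize(results, required, threshold=0.8):
    """Report whether every required component capability was detected."""
    present = set(prerequisites_present(results, threshold))
    missing = [name for name in required if name not in present]
    return {"all_prerequisites_present": not missing, "missing": missing}

# Placeholder scores, invented purely to exercise the code.
results = [
    ComponentResult("goals", 0.9),
    ComponentResult("instrumental_reasoning", 0.85),
    ComponentResult("situational_awareness", 0.4),
]

summary = summarize(results, REQUIRED)
print(summary)
```

In this toy run the model would have some of the ingredients but not all of them, which is the kind of statement the Uranium analogy is pointing at.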

We are also aware that good interpretability research might eventually run the risk of improving capabilities. We have thought a considerable amount about this in the past and are making concrete plans to mitigate the risks. Overall, however, we think that current interpretability research is strongly net positive for safety in expectation. 


 

Comments

Congratulations on launching!

On the governance side, one question I'd be excited to see Apollo (and ARC evals & any other similar groups) think/write about is: what happens after a dangerous capability eval goes off? 

Of course, the actual answer will be shaped by the particular climate/culture/zeitgeist/policy window/lab factors that are impossible to fully predict in advance.

But my impression is that this question is relatively neglected, and I wouldn't be surprised if sharp newcomers were able to meaningfully improve the community's thinking on this. 

Thanks Akash! 

I agree that this feels neglected.

Markus Anderljung recently tweeted about some upcoming related work from Jide Alaga and Jonas Schuett: https://twitter.com/Manderljung/status/1663700498288115712

Looking forward to it coming out! 

Good luck! :)

Are you mainly interested in evaluating deceptive capabilities? I.e., no-holds-barred, can you elicit competent deception (or sub-components of deception) from the model? (Including by, e.g., fine-tuning on data that demonstrates deception or sub-capabilities.)

Or evaluating inductive biases towards deception? I.e. testing whether the model is inclined towards deception in cases when the training data didn't necessarily require deceptive behavior.

(The latter might need to leverage some amount of capability evaluation, to distinguish not being inclined towards deception from not being capable of deception. But I don't think the reverse is true.)

Or if you disagree with that way of cutting up the space.

All of the above but in a specific order. 
1. Test if the model has components of deceptive capabilities with lots of handholding with behavioral evals and fine-tuning. 
2. Test if the model has more general deceptive capabilities (i.e. not just components) with lots of handholding with behavioral evals and fine-tuning. 
3. Do less and less handholding for 1 and 2. See if the model still shows deception. 
4. Try to understand the inductive biases for deception, i.e. which training methods lead to more strategic deception. Try to answer questions such as: can we change training data, technique, order of fine-tuning approaches, etc. such that the models are less deceptive? 
5. Use 1-4 to reduce the chance of labs deploying deceptive models in the wild. 

Seems great! I'm excited about potential interpretability methods for detecting deception.  

I think you're right about the current trade-offs on the gain of function stuff, but it's good to think ahead and have precommitments for the conditions under which your strategies there should change. 

It may be hard to find evals for deception which are sufficiently convincing when they trigger, yet still give us enough time to react afterwards. A few more similar points here: https://www.lesswrong.com/posts/pckLdSgYWJ38NBFf8/?commentId=8qSAaFJXcmNhtC8am 

Building good tools for detecting deceptive alignment seems robustly good though, even after you reach a point where you have to drop the gain of function stuff.

Sounds very cool. I am working on something similar -- behavioral evals for a component of deception (theory of mind). Feel free to reach out if keen to chat!

This is a very exciting project! I'm particularly glad to see two features: (i) the focus on "deception", which undergirds much existential risk but has arguably been less of a focal point than "agency", "optimization", "inner misalignment", and other related concepts, (ii) the ability to widen the bottleneck of upskilling novice AI safety researchers who have, say, 500 hours of experience through the AI Safety Fundamentals course but need mentorship and support to make their own meaningful research contributions.

This seems really exciting!! Curious what your thoughts are on the MACHIAVELLI benchmark 

Thanks for posting the announcement.

We think that strategic AI deception – where a model outwardly seems aligned but is in fact misaligned – is a crucial step in many major catastrophic AI risk scenarios and that detecting deception in real-world models is the most important and tractable step to addressing this problem.

Can you elaborate on why the team believes it's the most important and most tractable step?