Congratz on your successes and thank you for publishing this impact report.
It leaves me unsatiated related to cost effectiveness though. With no idea of how much money was invested in this project to get this outcome, I don't know if Arena is cost effective compared to other training programs and counterfactual opportunities. Would you mind sharing at least something about the amount of funding this got?
Re
Still, it is also positive if ARENA can help participants who want to pursue a career transition test their fit for alignment engineering in a comparatively low-cost way.
it doesn't strike me that a 5 week all expenses paid program is a particularly low cost way to find out AI Safety isn't for you (as compared to for example participating in an Apart Hackathon)
Thank you for your comment.
We are confident that ARENA's in-person programme is among the most cost-effective technical AI safety training programmes:
- ARENA is highly selective, and so all of our participants have the latent potential to contribute meaningfully to technical AI safety work
- The marginal cost per participant is relatively low compared to other AI safety programmes since we only cover travel and accommodation expenses for 4-5 weeks (we do not provide stipends)
- The outcomes set out in the above post seem pretty strong (4/33 immediate transitions to AI safety roles and 24/33 more actively pursuing them)
- There are lots of reasons why technical AI safety engineering is not the right career fit for everyone (even those with the ability). Therefore, I think that 2/33 people updating against working in AI safety after the programme is actually quite a low attrition rate.
- Apart Hackathons have quite a different theory of change compared with ARENA. While hackathons can be valuable for some initial exposure, ARENA provides 4-weeks of comprehensive training in cutting-edge AI safety research (e.g., mechanistic interpretability, LLM evaluations, and RLHF implementation) that leads to concrete outputs through week-long capstone projects.
If you're interested in helping to run the ARENA program, note that we're currently hiring for an Operations Lead! For more details, and to apply, see here.
Summary
The purpose of this report is to evaluate ARENA 4.0’s impact according to our four success criteria:
Overall, this iteration of ARENA was successful according to our success criteria.
The structure of the report is as follows:
Programme Information
First, we outline when this programme occurred, what topics were covered, and the main changes made to the programme in contrast to previous iterations. For more information about our curriculum content, see our website.
ARENA 4.0 Programme
ARENA 4.0 ran from the 2nd of September until the 4th of October 2024. The schedule of the programme was as follows:
Main Changes
The main changes for ARENA 4.0 compared with ARENA 3.0 (ran in Q1 2024) were:
Method
We surveyed our participants at the programme's start (on day 1) and at the end (on the last day). Our impact analysis is mainly based on responses at the end of the programme, while some are based on comparing the changes in responses between the start and the end.
We collected three types of responses:
We evaluated open-ended responses using thematic analysis. We highlighted keywords in each response, identified recurring themes and patterns across responses, reviewed the themes, and then counted the frequency of each theme across participant responses. Each count comes from a different participant, but each participant can add to multiple theme counts if their response mentions them.
Criteria 1: Sourcing high-quality participants[1]
Overall, our selection procedure worked effectively. ARENA 4.0 had a geographically diverse cohort of high quality participants, with 10 participants having more than four years experience in software engineering, and four having or conducting a PhD.
Selection process
Initial applications for ARENA opened on the 6th of July 2024 and closed on the 20th of July. The coding test ran from the 25th of July until the 29th of July (with an extension until the 1st of August). Interviews ran from the 4th of August until the 7th of August.
Who we selected
We selected 33 participants from ~350 applications. Participants came from across the world, including the US, UK, EU, India, and South Africa. The current jobs of participants can be seen in Figure 1.
As shown in Figure 2, the highest academic degree for 55% of participants was a Bachelor’s degree, for 34% a Master’s degree, and for 13% a PhD.
Improvements
The short notice period (2 months from announcement to start of programme, with only 1 month for selection) needed to be increased. This likely caused us to lose out on strong talent. Several participants gave late applications we accepted after deadlines, and several participants mentioned that the application timeline felt very tight. Future iterations will benefit from a more consistent schedule with longer lead times.
Criteria 2: Upskilling
As an ML program, our core goal is to upskill participants to tackle technical problems in AI safety. The first four weeks of the ARENA in-person programme cover four technical topics (more detail on each topic is provided in the relevant sections):
Each topic lasts a week. We asked participants to rate these weeks according to three criteria at the end of the programme:
Overall, participants found exercises both challenging and enjoyable. Participants rated ARENA exercises at 8.5/10 for enjoyment and 7.5/10 for difficulty. The total time participants estimated it would’ve taken them to learn the materials outside the programme (without TAs, lectures, pair-programming) was 10.2 weeks. This shows ARENA is successfully achieving its upskilling goals and emphasises the importance of selection, as even our high-calibre participants found the exercises quite challenging.
Week 0: Fundamentals
The aim of this week is for participants to reinforce basic deep-learning concepts. This week had 25 participants, as it was optional for those with significant deep-learning experience. Topics covered include PyTorch, basics of neural networks, residual neural networks, CNNs, weights and biases, optimisation, and backpropagation.
Participants said, on average, that it would take 2.3 weeks to learn the deep learning fundamentals week’s materials on their own if they had access to ARENA materials, as seen in Figure 3 below.
Week 1: Mechanistic Interpretability
The aim of this week is for participants to understand some of the methods that can be used to analyse model internals and replicate the results from key interpretability papers. Topics covered include the following: GPT models, training and sampling from transformers, TransformerLens, induction heads, indirect object identification, superposition, linear probes, inference-time intervention, and sparse autoencoders.
We asked participants to rate out of 7 their confidence in doing three concrete tasks in mechanistic interpretability (Figure 4); on average, participants rated their ability in these concrete outcomes at 5.64. Participants said, on average, that it would take 3.7 weeks to learn the mechanistic interpretability week’s materials on their own, as seen in Figure 5.
Week 2: LLM Evaluations
A new week of content on LLM evaluations was developed for this iteration of ARENA. This content aimed for participants to build alignment and dangerous capability evaluations in multiple-choice and agentic settings, and understand how to use these evaluations to gain information about current frontier LLMs. We added this week because evals have become a crucial field of AI safety, and there are no high-quality educational materials on safety evals to our knowledge. Topics covered include the following: threat modeling, using LLM APIs, implementing a pipeline to generate questions using LLMs, UK AISI’s inspect library, implementing LLM agents, and scaffolding LLM agents.
Overall, participants achieved the concrete learning outcomes to a similar extent as the other weeks, which we see as a positive sign for the first time running this content! On average, participants rated their confidence in designing multiple-choice question evals and agent evals as 6.0/7 and said it would take 1.6 weeks to learn this content independently (see Figures 6 and 7).
Regarding learning experience, participants rated this week an average enjoyment of 4.9/7 and an average usefulness of 5.0/7 across the days (based on responses from 12 participants out of 33 on an LLM evals feedback form). Participants particularly enjoyed the content on agentic dangerous-capabilities evaluations and the conceptual thinking of day 1 on threat-modeling and eval design.
We identified several areas of improvement to the learning experience. The most commonly cited issue in enjoyment ratings was code bugs in the materials (although this is sometimes unavoidable when producing new content). Second, we identified a more valuable set of learning outcomes from running this content on participants (in particular, a reduced emphasis on manual prompting of LLMs and an increased emphasis on designing evals infrastructure and agentic evals). The materials would benefit from a set of changes to put more emphasis on these learning outcomes, including adding clarifications and exercises on key concepts, restructuring the content to achieve a better flow, and adding additional content to challenge participants further.
Week 3: Reinforcement Learning
This week's core aim is for participants to understand classical and deep RL methods and how RLHF is implemented on LLMs as the dominant alignment method used today. Topics covered include the following: Fundamentals of RL, gym & gymnasium environments, policy gradient optimisation, PPO, deep Q-learning, RLHF, HuggingFace, and fine-tuning LLMs.
We asked participants to rate out of 7 their confidence in doing four concrete tasks in RL; the results can be seen in Figure 8. Participants rated their ability in these concrete learning outcomes at 5.44 on average. We asked participants to rate out of 7 their confidence in doing the following concrete tasks in RL. Participants said, on average, that it would take 3.2 weeks to learn the RL week’s materials on their own.
Overall Learning Experience
Finally, we asked participants how they found the ARENA materials overall. This helps us calibrate participant calibre across different ARENA cohorts and elicit feedback on the quality of our teaching mechanisms. On average, participants rated 8.5 out of 10 for exercise enjoyment, 7.6 out of 10 for exercise difficulty, and 8.3 out of 10 for teaching quality.
Criteria 3: Integration
Our participants spent 4 to 5 weeks full-time in the LISA office in London. Overall, they enjoyed their time there! Participants rated on average 8.9/10 for “How valuable was being in the LISA environment” (see Figure 13).
The most cited value from participants’ open-ended feedback was feeling that they were part of the AI safety community in ways they had not been before (see Figure 14). We were particularly excited to see a few outcomes mentioned by participants: (1) that they met a group of like-minded peers who are at similar career stages and who can help/motivate each other to progress further, (2) that they met senior researchers who helped give future directions and legitimise the field. We see these as two of the most impactful values of ARENA. They underline the importance of running the program in person in an AI safety hub like LISA. We are incredibly glad to be able to make use of — and contribute to — LISA’s thriving community of AI safety researchers with the ARENA programme!
Here are some representative quotes that illustrate these themes:
Connections to/feeling like a part of the AI safety community
Access to top researchers
Meeting like-minded talented people
Motivation to learn and get unstuck
Immediate access to TAs
Criteria 4: Career Acceleration
Finally, ARENA aims to accelerate participants' AI safety careers. We’re excited about the career outcomes for this cohort. Most participants (~70%) are actively pursuing AI safety careers, either in the planning or application stages (see Figure 16). Even more exciting: four participants have already secured full-time AI safety positions starting within the next four months. These results suggest that ARENA is successfully achieving one of its core goals of providing talented individuals with the skills to go directly into AI safety work. We’re especially encouraged to see participants moving directly into impactful roles so quickly after the conclusion of ARENA.
We also saw a difference in participants’ confidence in AI safety being the right field for them. At the start of the programme, participants rated on average of 7.7 out of 10 for “How confident are you that AI safety is the right field for you?” and 8.2 out of 10 by the end of the programme. The between-person difference across the programme was, on average, +0.6 (see Figure 15). This demonstrates the impact that ARENA has in increasing people’s confidence on whether AI safety is a good personal fit for them.
Two participants had reduced their certainty that AI safety is the right field for them. Still, it is also positive if ARENA can help participants who want to pursue a career transition test their fit for alignment engineering in a comparatively low-cost way. We expect a few to decrease their confidence, to validate that ARENA’s materials are technical and challenging enough to test for fit.
Overall Programme Experience
We asked the participants, “How much did you enjoy the programme overall?” at the end. The average participant rating was 9.11 out of 10.
Most valuable gain
We asked participants, “What was the most valuable thing you gained from the programme?” and thematically analysed their open-ended responses. We identified the following common themes.
Notably, the most frequently mentioned “most valuable gain” was meeting talented and like-minded people. This reflects the value of LISA mentioned in the previous section and underlines the importance of running ARENA in person to achieve our third success criteria of integrating participants within the AI safety ecosystem. Furthermore, we are enthused that the third most commonly cited valuable gain was “confidence to take on ML work” — this was one of the principal outcomes that we aimed for under our upskilling goal — the programme is only 5-weeks, so an impactful upskilling outcomes is if participants gain confidence to be able to bootstrap themselves to learn at a faster rate in the future.
Here are some representative responses for each of our core themes:
Meeting talented and like-minded people
ML skills and knowledge
Confidence to take on AI safety work
Many responses specifically mentioned a boost in their confidence in undertaking AI safety work. We’ve separated this out from references of particular AI safety knowledge and experience (“ML skills and knowledge”).
Ability to quickly execute on a project
Capstone Project
Improvements
As a team, we endeavour to use feedback to improve the quality of ARENA for participants. Each iteration, we learn how to better run the programme so that its impact can grow for all of our participants. Although this programme was overall successful according to its four success criteria, we noticed some core improvements that would enable it to run even better. The key improvements we noticed in this iteration are:
Acknowledgments
This report was produced by @Chloe Li (data analysis, writing), @JamesH (writing), and @James Fox (reviewing, editing) at ARENA. We thank @CallumMcDougall for his comments on the final draft. We also thank Open Philanthropy for their generous support of the ARENA program.
Note: Some details of this section have been redacted, in order that key details and aspects of how we select and choose participants remain private to avoid potential issues that may arise in our selection process in the future.
Note: “Conducting alignment research” only includes those who are currently working full-time on alignment research (independently, as mentee/intern, or employed), not those who have in the past or are working part-time on alignment. This was not self-reported by the participants, but annotated by us based on their CV, so there may be some inaccuracies.