Learnings from starting an AI safety research team

draganover; Erin Robertson

This post’s goal is to distill our takeaways from building a new research team over the past four months. We describe some context about our team, how it came about, and then describe the lessons learned.

Since AI safety is becoming more and more entrepreneurial, we hope this is helpful for others trying to do the same.

1. The team

We're a new alignment research team within Arcadia Impact, based in London. We’re a team of 8, working closely with members of the UK AISI alignment team. We currently have three main projects:

Understanding model motivations. This currently looks like:
1. Trying to generate documents which fully describe a model’s behaviour (given just its behaviour).
2. Producing an open analysis of alignment training techniques and ways this training could go wrong.
Doing scalable oversight for alignment. This includes validating debate protocols in practice and then trying to apply them to fuzzy alignment-relevant tasks.
Building pipelines for doing automated alignment research.

We're also hiring for two roles! More on this at the bottom.

2. Context about how the team came about

The rest of this post is written from the perspective of Andrew Draganov (research lead & current programme manager on the team) and Erin Robertson (co-director of Arcadia).

In short, Arcadia Impact had been collaborating with AISI already, through LASR Labs and ASET. Our alignment team started by applying for the AISI alignment project funding, saying that we would hire a team of researchers to collaborate with their alignment team. Andrew was taking part in LASR at the time and was brought in to help with the application. His remit then widened as the number of things to do kept growing. Once our AISI funding was approved we began the process of hiring researchers, and also applied to Coefficient Giving for additional compute funding.

A bit about Andrew, since it bears on how replicable this is. In his words:

I have a PhD in computer science/machine learning and was working as a postdoc in ML before doing LASR. This means I've spent a number of years talking shop about AI research, though not as many on AI safety specifically.
I'm not very well-known in the AI safety community! I only have one first-author AI safety paper (which was reasonably well-received but nothing crazy). I mention this because "you need to be an established name to lead a research team" is a reasonable thing to assume, but it wasn't really true here.

For anyone reading this post as a template, here are some things which may be specific to our situation and might not generalise cleanly:

We were immediately hiring 7 researchers to get started at the same time! This is highly unusual and probably never how this otherwise happens.
Arcadia was already an established non-profit. We therefore already had visa sponsorship processes, office space, hiring systems, etc.
- There are fiscal sponsors which can do these tasks if you want to avoid figuring out the overhead yourself.
The Alignment Project, run by AISI, was our initial funder. This is a non-standard funder for many reasons, including that Arcadia already had a working relationship with AISI writ large. If you're aiming to first get funded by, say, Coefficient Giving then the dynamics may be different.
Having run LASR, we know a lot of people in the ecosystem quite well. This made hiring easier (and, indeed, over half of the team are LASR alumni).
We're doing technical AI safety; not governance, fieldbuilding, etc.

3. Lessons learned

Given the above context, here is advice which we hope is immediately actionable by people looking to start AI safety orgs.

3.1 Hiring

[Written from Andrew’s perspective]

I feel like our hiring went very well and I’m really excited about the team. But also I wasted a lot of time chasing leads that were varying amounts of useful.

For one thing, everyone wants to measure 'crackedness' but it’s unclear how to do it. On that axis, the two highest-signal parts of our process were the work test and the references; if we'd relied on only those two, I think we'd have assessed raw research ability roughly as well as we did. The interviews were helpful in addition to that, but mostly to vibecheck for fit rather than to gauge ability.

For the work test, we paid 50 applicants ~$200 each to write a research proposal. We gave them 4 hours to do this, and the deliverable was just a PDF. We then graded the proposals anonymously. This feels in line with what the work actually looks like in the age of Claude Code. We’re happy to share the work test and grading template we used if someone is interested.

Here are a few additional thoughts:

The various AI-safety talent scouts are extremely useful when it comes to hiring. This includes research managers at research fellowships, people at BlueDot, people at 80K, etc.
There’s just so much talent across the top fellowships. Our team ended up with 4 LASR alums, 1 MATS, 1 Astra, 1 Anthropic Fellow.
- Most of these fellowships now have extension programmes, where good people keep doing work until they get hired. Although we didn’t hire from this pool directly, the extensioners are probably the most useful group of candidates you can target – they are already vouched for and are looking for jobs!
I probably sent 50 cold emails trying to get people to apply. This was only useful insofar as it got me a meeting with the person (which it rarely did). If I was doing this over again, I would spend more time reaching out to various MATS, LASR, and Constellation research managers, ask them who they’d recommend, and then set up 1-1s with those people.

3.2 Networking

[Written from Andrew’s perspective]

Even though it’s clear that building a good team requires a lot of networking, it was often hard to tell which networking was “worth it” and which wasn’t. Here are the things I’d prioritise if I was doing it again:

Obtaining an active endorsement from a well-known entity in your AI safety subfield. I claim this is the highest-leverage thing you can do when building an org, and it was very useful for us. I define an active endorsement as one in which the senior person/org is going out of their way to vouch for you and will likely work with you once you start. At minimum, a written reference from a senior person goes a long way.
1. Note: Appeals to authority are lame. However, there's so much noise in AI safety and a big endorsement is immediately recognized. This helps with both funding applications and hiring. For instance, we would not have hired as effectively if we couldn’t leverage the AISI and Arcadia affiliations.
Trialing out big-picture ideas on senior community members.
1. I had 2-3 meetings a day for several months pitching senior people on ideas regarding the org (research, position within the community, outreach, various deliverables) and hearing their takes.
  1. These meetings were monotonically more useful as a function of how prepared I was (read: how much time I had spent understanding the other person’s worldview in advance).
  2. I still cringe about the first time I was describing the goal of our new org and said we wanted to do “alignment research, both technical and conceptual”, to which the person responded “so… all of it?”. But I guess these initial stumbling blocks were necessary in order to get better at talking about the ~vision~.
Talking to funders. In some sense, funders are scary: they know their shit, expect you to know yours, and are short on time. Also, you're cold-asking for a seemingly unreasonable amount of money. However, you're on the same team as them and should try to solicit funder opinions when available. They talk to a lot of disproportionately senior people, and I found their suggestions useful as a biased distillation of all those conversations.
1. Coefficient Giving^[1] is also excited about ambitious proposals, so don't pre-shrink your ask (and don't agonise over salary numbers). I wouldn’t expect to get rejected over a reasonable salary ask, and a quick survey of comparable roles at similar orgs is enough to calibrate.

3.3 Trying to build a good team culture

[Written from Erin’s perspective, with context from running LASR Labs for multiple years]

Since the team’s just started, we’re not able to claim the culture is good (also, this is not really for us to say). Instead, here is how we thought about the process of establishing team culture prior to people joining. Parts are heavily influenced by the way this is done for LASR cohorts:

Onboard everyone at once (or failing that, hold a retreat). Bringing people in together is a clean chance to set common norms and the way we want everyone thinking from day one. If you can't start everyone at once, then it’s useful to run a retreat at some point. This looks like letting people become friends, working on strategy together, and agreeing on concrete values.
- For example, we wanted the team to think about our communication strategy, so we ran a session exploring how comparable orgs disseminate their work and left with concrete intentions for our own.
Get the team to shape the strategy. We hired people based on them having good judgement, so we spent some time together figuring out our priorities. Specifically, we gave people a list of possible agendas and projects, spent the first week thinking hard about which to focus on, and built teams around people’s preferences.
Set expectations. Collaborators, employees, and advisors all need to know what's being asked of them and how to thrive in their role. Be concrete early about time commitments, what good work looks like, the values you want people building, and who owns what.
Have two distinct management goals. These are reviewing success on tasks and making people better at their job (e.g. coaching, habit forming, feedback). The second is often overlooked in early-stage teams but is an important way to keep the team happy and improve its productivity over time.

Interested in working with us?

We're hiring! Specifically, we're looking for an Alignment Programme Manager, a senior generalist to help build and run the team. We're also hiring a Communications and Operations Associate to shape how our research reaches stakeholders and to keep the team's operations running. Both will be based at the LISA office in central London, with visa sponsorship available.

If you think your skills don’t fit neatly into one of these descriptions but you think you’d be a good fit, please apply – we are flexible on the exact role and are more interested in finding good candidates! The deadline for applications is June 23rd.

Similarly, if you're working on related topics, please reach out! The easiest option is to send an email to andrew[at]arcadiaimpact[dot]org. You can also follow our research updates on Twitter.

^[1] Disclosure: Erin is joining the Coefficient Giving Technical AIS team full time at the end of June and is currently part time there.

Thanks for the writeup! This looks like it'll be helpful for people starting things.

Notes upon reading:

We were immediately hiring 7 researchers to get started at the same time! This is highly unusual and probably never how this otherwise happens.

My guess is that the typical headcount for a new org is less than this, but "probably never how this otherwise happens" is overstating it.
Hopefully this becomes more common in future.
If you're starting a 7-person team, it probably helps if you have some legible experience managing a team that size.

At minimum, a written reference from a senior person goes a long way.

Definitely true!
This is ofc useful as an "appeal to authority", but references have some other advantages:
- The references will typically explain why the senior person is excited and what their hesitations are.
- They signal that the org has spoken with a bunch of senior people, and probably taken on their advice.
- The references might contain something like "I, the senior person, will have some bandwidth with the project going forward".
There's an even more minimal thing that grant applicants could do: "Here's a list of people we have spoken to about our project, and what their takes were. Feel free to reach out to them."
- Ideally, this list isn't filtered to include only the most bullish people/takes.
- My guess is that funders are impressed if you've proactively spoken to people you expect to be bearish.

[Funders] talk to a lot of disproportionately senior people, and I found their suggestions useful as a biased distillation of all those conversations.

Yep, I think funders (if they have the time) can often be useful beyond sending cash.
They likely have good takes, and a big rolodex of people you should talk to.
They might have private information, e.g. similar grants they have made.
People should consider reaching out to funders at a much earlier stage, before the details are fleshed out.

Coefficient Giving is also excited about ambitious proposals, so don't pre-shrink your ask (and don't agonise over salary numbers).

+1 to not agonising over salary numbers.
I claim that you can also ask [some] funders what salaries they think would be reasonable, and trust that they will give you their true opinion.

This was useful, thanks.

One thing that stood out to me was the project of generating documents which fully describe a model’s behaviour, given only its behaviour.

How do people working on this think about cases where there may be multiple equally predictive descriptions of the same model?

For example, if two descriptions predict the behaviour equally well, is there a good way to decide which one is closer to the model’s actual motivation? Or is predictive adequacy basically the standard being aimed at here?

Predictive accuracy matters, but so does a priori plausibility, e.g. a simpler description is probably better than a complex one, a “natural” one better than an “unnatural” one, etc.

Please post the work test and grading template you used. A lot of people here might benefit from reading it.

I'm hesitant to share the work test completely publicly because it risks getting goodharted. I.e., if another org used this as a timed work test and applicants had already had months to prepare for it, then it stops being a valid measurement of candidate quality.

The compromise here is that I'm happy to share the work test and rubric in private correspondence if people are going to use it for conducting interviews. But I can also describe the broad strokes of what it entailed. Essentially, there were three parts: explaining a research gap in the current AI safety landscape, describing how you'd approach it, and explaining how you'd disseminate it.

I'll happily read that DM and won't forward what you share (but will likely display or talk about it).

I'd like to be able to say to someone who wants to work in safety 'here is an example of a practical problem and the way it is graded, if your thinking is aligned this way, you might have a future in the field'.

Note: I selfishly want to see exactly how far I am from aligned (it's a lot).

I would love to try the work test also!