Co-Authors: @yams, @Carson Jones, @McKennaFitzgerald, @Ryan Kidd 

MATS tracks the evolving landscape of AI safety[1] to ensure that our program continues to meet the talent needs of safety teams. As the field has grown, it’s become increasingly necessary to adopt a more formal approach to this monitoring, since relying on a few individuals to intuitively understand the dynamics of such a vast ecosystem could lead to significant missteps.[2]

In the winter and spring of 2024, we conducted 31 interviews, ranging in length from 30 to 120 minutes, with key figures in AI safety, including senior researchers, organization leaders, social scientists, strategists, funders, and policy experts. This report synthesizes the key insights from these discussions. The overarching perspectives presented here are not attributed to any specific individual or organization; they represent a collective, distilled consensus that our team believes is both valuable and responsible to share. Our aim is to influence the trajectory of emerging researchers and field-builders, as well as to inform readers on the ongoing evolution of MATS and the broader AI Safety field.

All interviews were conducted on the condition of anonymity.

Needs by Organization Type

Organization type | Talent needs
Scaling Lab (e.g., Anthropic, Google DeepMind, OpenAI) Safety Teams | Iterators > Amplifiers
Small Technical Safety Orgs (<10 FTE) | Iterators > Machine Learning (ML) Engineers
Growing Technical Safety Orgs (10-30 FTE) | Amplifiers > Iterators
Independent Research | Iterators > Connectors

Here, ">" means "are prioritized over."

Archetypes

We found it useful to frame the different profiles of research strengths and weaknesses as belonging to one of three archetypes (one of which has two subtypes). These aren’t as strict as, say, Diablo classes; this is just a way to get some handle on the complex network of skills involved in AI safety research. Indeed, capacities tend to converge with experience, and neatly classifying more experienced researchers often isn’t possible. We acknowledge past framings by Charlie Rogers-Smith and Rohin Shah (research lead/contributor), John Wentworth (theorist/experimentalist/distillator), Vanessa Kosoy (proser/poet), Adam Shimi (mosaic/palimpsests), and others, but believe our framing of current AI safety talent archetypes is meaningfully different and valuable, especially pertaining to current funding and employment opportunities.

Connectors / Iterators / Amplifiers

Connectors are strong conceptual thinkers who build a bridge between contemporary empirical work and theoretical understanding. Connectors include people like Paul Christiano, Buck Shlegeris, Evan Hubinger, and Alex Turner[3]: researchers doing original thinking on the edges of our conceptual and experimental knowledge in order to facilitate novel understanding. Note that most Connectors are not purely theoretical; they still have the technical knowledge required to design and run experiments. However, they prioritize experiments and discriminate between research agendas based on original, high-level insights and theoretical models, rather than on spur-of-the-moment intuition or the wisdom of the crowds. Pure Connectors often have a long lead time before they’re able to produce impactful work, since it’s usually necessary for them to absorb and engage with varied conceptual models. For this reason, we make little mention of a division between experienced and inexperienced Connectors.

Iterators are strong empiricists who build tight, efficient feedback loops for themselves and their collaborators. Ethan Perez is the central contemporary example here; his efficient prioritization and effective use of frictional time has empowered him to make major contributions to a wide range of empirical projects. Iterators do not, in all cases, have the conceptual grounding or single-agenda fixation of most Connectors; however, they can develop robust research taste (as Ethan arguably has) through experimental iteration and engagement with the broader AI safety conversation. Neel Nanda, Chris Olah, and Dan Hendrycks are also examples of this archetype.

Experienced Iterators often navigate intuitively, and are able to act on experimental findings without the need to formalize them. They make strong and varied predictions for how an experiment will play out, and know exactly how they’ll deploy their available computing resources the moment they’re free. Even experienced Iterators update often based on information they receive from the feedback loops they’ve constructed, both experimentally and socially. Early on, they may be content to work on something simply because they’ve heard it’s useful, and may pluck a lot of low hanging fruit; later, they become more discerning and ambitious.

Amplifiers are people with enough context, competence, and technical facility to prove useful as researchers, but who really shine as communicators, people managers, and project managers. A good Amplifier doesn’t often engage in the kind of idea generation native to Connectors and experienced Iterators, but excels at many other functions of leadership, either in a field-building role or as lieutenant to someone with stronger research taste. Amplifier impact is multiplicative; regardless of their official title, their soft skills help them amplify the impact of whoever they ally themselves with. Most field-building orgs are staffed by Amplifiers, MATS included.

We’ll be using “Connector,” “Iterator,” and “Amplifier” as though they were themselves professions, alongside more ordinary language like “software developer” or “ML engineer”.

Needs by Organization Type (Expanded)

In addition to independent researchers, there are broadly four types of orgs[4] working directly on AI safety research:

  • Scaling Labs
  • Technical Orgs (Small: <10 FTE and Growing: 10-30 FTE)
  • Academic Labs
  • Governance Orgs

Interviewees currently working at scaling labs were most excited to hire experienced Iterators with a strong ML background. Scaling lab safety teams have a large backlog of experiments they would like to run and questions they would like to answer which have not yet been formulated as experiments, and experienced Iterators could help clear that backlog. In particular, Iterators with sufficient experience to design and execute an experimental regimen without formalizing intermediate results could make highly impactful contributions within this context.

Virtually all roles at scaling labs have a very high bar for software development skill; many developers, when dropped into a massive codebase like that of Anthropic, DeepMind, or OpenAI, risk drowning. Furthermore, having strong software developers in every relevant position pays dividends into the future, since good code is easier to scale and iterate on, and virtually everything the company does involves using code written internally.

Researchers working at or running small orgs had, predictably, more varied needs, but still converged on more than a few points. Iterators with some (although not necessarily a lot of) experience are in high demand here. Since most small labs are built around the concrete vision of one Connector, who started their org so that they might build a team to help chase down their novel ideas, additional Connectors are in very low demand at smaller orgs.

Truly small orgs (<10 FTE employees) often don’t have a strong need for Amplifiers, since they are generally funding-constrained and usually possess the requisite soft skills among their founding members. However, as orgs interviewed approached ~20 FTE employees, they appeared to develop a strong need for Amplifiers who could assist with people management, project management, and research management, but who didn’t necessarily need to have the raw experimental ideation and execution speed of Iterators or vision of Connectors (although they might still benefit from either).

Funders of independent researchers we’ve interviewed think that there are plenty of talented applicants, but would prefer more research proposals focused on relatively few existing promising research directions (e.g., Open Phil RFPs, MATS mentors' agendas), rather than a profusion of speculative new agendas. This leads us to believe that they would also prefer that independent researchers be approaching their work from an Iterator mindset, locating plausible contributions they can make within established paradigms, rather than from a Connector mindset, which would privilege time spent developing novel approaches.

Representatives from academia also expressed a dominant need for more Iterators, but rated Connectors more highly than did scaling labs or small orgs. In particular, academia highly values research that connects the current transformer-based deep learning paradigm of ML to the existing concepts and literature on artificial intelligence, rather than research that treats solving problems specific to transformers as an end in itself. This is a key difference about work in academia in general; the transformer-based architecture is just one of many live paradigms that academic researchers address, owing to the breadth of the existing canon of academic work on artificial intelligence, whereas it is the core object of consideration at scaling labs and most small safety orgs.

Impact, Tractability, and Neglectedness (ITN)

Distilling the results above into ordinal rankings may help us identify priorities. First, let’s look at the availability of the relevant talent in the general population, to give us a sense of how useful pulling talent from the broader pool might be for AI safety. Importantly, the questions are:

  • “How impactful does the non-AI-safety labor market consider this archetype in general?”
  • “How tractable is it for the outside world to develop this archetype in a targeted way?”
  • “How neglected is the development of this archetype in a way that is useful for AI safety?”

On the job market in general, Connectors are high-impact dynamos that disrupt fields and industries. There’s no formula for generating or identifying Connectors (although there’s a booming industry trying to sell you such a formula) and, downstream of this difficulty in training the skillset, the production of Connectors is highly neglected.

Soft skills are abundant in the general population relative to AI safety professionals. Current-year business training focuses heavily on soft skills like communication and managerial acumen. Training for Amplifiers is somewhat transferable between fields, making their production extremely tractable. Soft skills are often best developed through general work experience, so targeted development of Amplifiers in AI safety might be unnecessary.

Iterators seem quite impactful, although producing more is somewhat less tractable than for Amplifiers, since their domain-specific skills must run quite deep. As a civilization, we train plenty of people like this at universities, labs, and bootcamps, but these don’t always provide the correct balance of research acumen and raw coding speed necessary for the roles in AI safety that demand top-performing Iterators.

Hiring AI-safety-specific Connectors from the general population is nearly impossible, since by far the best training for reasoning about AI safety is spending a lot of time reasoning about AI safety. Hiring Iterators from the general talent pool is easier, but can still require six months or more of upskilling, since deep domain-specific knowledge is very important. Amplifiers, though, are in good supply in the general talent pool, and orgs often successfully hire Amplifiers with limited prior experience in or exposure to AI safety.

ITN Within AIS Field-building

Within AI Safety, the picture looks very different. Importantly, this prioritization only holds at this moment; predictions about future talent needs from interviewees didn’t consistently point in the same direction.

Most orgs expressed interest in Iterators joining the team, and nearly every org expects to benefit from Amplifiers as they (and the field) continue to scale. Few orgs showed much interest in Connectors, although most would make an exception if an experienced researcher with a strong track record of impactful ideas asked to join.

The development of Iterators is relatively straightforward: you take someone with proven technical ability and an interest in AI Safety, give them a problem to solve and a community to support them, and you can produce an arguably useful researcher relatively quickly. The development of Amplifiers is largely handled by external professional experience, augmented by some time spent building context and acclimating to the culture of the AI safety community.

The development of Connectors, as previously discussed, takes a large amount of time and resources, since you only get better at reasoning about AI safety by reasoning about AI safety, which is best done in conversation with a diverse group of AI safety professionals (who are, by and large, time-constrained and work in gated-access communities). Therefore, doing this type of development, at sufficient volume, with few outputs along the way, is very costly.

We’re not seeing a sufficient influx of Amplifiers from other fields or ascension of technical staff into management positions to meet the demand at existing AI safety organizations. This is a sign that we should either augment professional outreach efforts or consider investing more in developing the soft skills of people who have a strong interest in AI safety. Unfortunately, the current high demand for Iterators at orgs seems to imply that their development is not receiving sufficient attention, either. Finally, that so few people are expressing interest in hiring Connectors, relative to the apparent high numbers of aspiring Connectors applying to MATS and other field-building programs, tells us that the ecosystem is potentially attracting an excess of inexperienced Connectors who may not be adequately equipped to refine their ideas or take on leadership positions in the current job and funding market.

Several interviewees at growing orgs who are currently looking for an Amplifier with strong research vision and taste noted that vision and taste seem to be anticorrelated with collaborative skills like “compromise.” The roles they’re hiring for strictly require those collaborative skills, and merely benefit from research taste and vision, which can otherwise be handled by existing leadership. This observation compelled them to seek, principally, people with strong soft skills and some familiarity with AI safety, rather than people with strong opinions on AI safety strategy who might not cooperate as readily with the current regime. This leads us to believe that developing Connectors might benefit from tending to their soft skills and willingness to compromise for the sake of team collaboration.

So How Do You Make an AI Safety Professional?

MATS does its best to identify and develop AI safety talent. Since, in most cases, it takes years to develop the skills to meaningfully contribute to the field, and the MATS research phase only lasts 10 weeks, identification does a lot of the heavy lifting here. It’s more reliable for us to select applicants that are 90 percent of the way there than to spin up even a very fast learner from scratch.

Still, the research phase itself enriches promising early-stage researchers by providing them with the time, resources, community, and guidance to help amplify their impact. Below we discuss the three archetypes, their respective developmental narratives, and how MATS might fit into those narratives.

The Development of a Connector

As noted above, Connectors have high-variance impact; inexperienced Connectors tend not to contribute much, while experienced Connectors can facilitate field-wide paradigm shifts. I asked one interviewee, “Could your org benefit from another ‘big ideas’ guy?” They replied, “The ideas would have to be really good; there are a lot of ‘idea guys’ around who don’t actually have very good ideas.”

Experience and seniority seemed to track with interviewees’ appraisal of a given Connector’s utility, but not always. Even some highly venerated names in the space who fall into this archetype might not be a good fit at a particular org, since leadership’s endorsement of a given Connector’s particular ideas might bound that individual’s contributions.

Interviewees repeatedly affirmed that the conceptual skills required of Connectors don’t fully mature through study and experimental experience alone. Instead, Connectors tend to pair an extensive knowledge of the literature with a robust network of interlocutors whom they regularly debate to refine their perspective. Since Connectors are more prone to anchoring than Iterators, communally stress-testing and refining ideas shapes initial intuitions into actionable threat models, agendas, and experiments.

The deep theoretical models characteristic of Connectors allow for the development of rich, overarching predictions about the nature of AGI and the broad strokes of possible alignment strategies. Many Connectors, particularly those with intuitions rooted in models of superintelligent cognition, build models of AGI risk that are not yet empirically updateable. Demonstrating an end-to-end model of AGI risk seems to be regarded as “high-status,” but is very hard to do with predictive accuracy. Additionally, over-anchoring on a theoretical model without doing the public-facing work necessary to make it intelligible to the field at large can cause a pattern of rejection that stifles both contribution and development.

Identifying Connectors is extremely difficult ex-ante. Often it’s not until someone is actively contributing to or, at least, regularly conversing with others in the field that their potential is recognized. Some interviewees felt that measures of general intelligence are sufficient for identifying a strong potential Connector, or that CodeSignal programming test scores would generalize across a wide array of tasks relevant to reasoning about AI safety. This belief, however, was rare, and extended conversations (usually over the course of weeks) with multiple experts in the field appeared to be the most widely agreed upon way to reliably identify high-impact Connectors.

One interviewee suggested that, if targeting Connectors, MATS should perform interviews with all applicants that pass an initial screen, and that this would be more time efficient and cost effective (and more accurate) than relying on tests or selection questions. Indeed, some mentors already conduct an interview with over 10 percent of their applicant pool, and use this as their key desideratum when selecting scholars.

The Development of an Iterator

Even inexperienced Iterators can make strong contributions to teams and agendas with large empirical workloads. What’s more, Iterators have almost universally proven themselves in fields beyond AI safety prior to entering the space, often as high-throughput engineers in industry or academia.

Gaining experience as an Iterator means chugging through a high volume of experiments while simultaneously engaging in the broader discourse of the field to help refine both your research taste and intuitive sense for generating follow-up experiments. This isn’t a guaranteed formula; some Iterators will develop at an accelerated pace, others more slowly, and some may never lead teams of their own. However, this developmental roadmap means making increasingly impactful contributions to the field continuously, much earlier than the counterfactual Connector.

Iterators are also easier to identify, both by their resumes and demonstrated skills. If you compare two CVs of postdocs that have spent the same amount of time in academia, and one of them has substantially more papers (or GitHub commits) to their name than the other (controlling for quality), you’ve found the better Iterator. Similarly, if you compare two CodeSignal tests with the same score but different completion times, the one completed more quickly belongs to the stronger Iterator.

The Development of an Amplifier

Amplifiers usually occupy non-technical roles, but often have non-zero technical experience. This makes them better at doing their job in a way that serves the unique needs of the field, since they understand the type of work being done, the kinds of people involved, and how to move through the space fluidly. There are a great many micro-adjustments that non-technical workers in AI safety make in order to perform optimally in their roles, and this type of cultural fluency may be somewhat anticorrelated with the soft skills that every org needs to scale, leading to seasonal operational and managerial bottlenecks field-wide.

Great Amplifiers will do whatever most needs doing, regardless of its perceived status. They will also anticipate an organization’s future needs and readily adapt to changes at any scale. A single Amplifier will often have an extremely varied background, making it difficult to characterize exactly what to look for. One strong sign is management experience, since often the highest impact role for an Amplifier is as auxiliary executive function, project management, and people management at a fast-growing org.

Amplifiers mature through direct on-the-job experience, in much the way one imagines traditional professional development. As ability increases, so does responsibility. Amplifiers may find that studying management and business operations, or even receiving management coaching or consulting, helps accentuate their comparative advantage. To build field-specific knowledge, they may consider AI Safety Fundamentals (AISF) or, more ambitiously, ARENA.

So What is MATS Doing?

We intend this section to give some foundational information about the directions we were already considering before engaging in our interviews, and to better contextualize our key updates from this interview series.

At its core, MATS is a mentorship program, and the most valuable work happens between a scholar and their mentor. However, there are some things that will have utility to most scholars, such as networking opportunities, forums for discussion, and exposure to emerging ideas from seasoned researchers. It makes sense for MATS to try to provide that class of things directly. In that spirit, we’ve broadly tried three types of supporting programming, with varied results.

Mandatory programming doesn’t tend to go over well. When required to attend seminars in MATS 3.0, scholars reported lower average value of seminars than scholars in 4.0 or 5.0, where seminars were opt-in. Similarly, when required to read the AISF curriculum and attend discussion groups, scholars reported lower value than when a similar list of readings was made available to them optionally. Mandatory programming, of any kind, doesn’t just trade off against, but actively bounds scholar research time by removing their choice. We feel strongly that scholars know what’s best for them and we want to support their needs.

In observance of the above, we’ve tried a lot of optional programming. Optional programming goes better than mandatory programming, in that scholars are more likely to attend because they consider the programming valuable (rather than showing up because they have to), and so report a better subjective experience. However, it’s still imperfect; seminar and discussion group attendance are highest at the start of the program, and slowly decline as the program progresses and scholars increasingly prioritize their research projects. We also think that optional programming often performs a social function early on and, once scholars have made a few friends and are comfortable structuring their own social lives in Berkeley, they’re less likely to carve out time for readings, structured discussions, or a presentation.

The marginal utility to scholars of additional optional programming elements seems to decline as the volume of optional programming increases. For example, in 4.0 we had far more seminars than in 5.0, and seminars in 4.0 had a lower average rating and attendance. We think this is both because we prioritized our top-performing speakers for 5.0 and because scholars viewed seminars more as novel opportunities, rather than “that thing that happens 8 times a week and often isn’t actually that relevant to my specific research interests.” Optional programming seems good up to some ceiling, beyond which returns are limited (or even negative).

MATS also offers a lot of informal resources. Want to found an org? We’ve got some experience with that. Need help with career planning? We’ve got experience there, too. Meetings with our research managers help, among other things, embed scholars in the AI safety professional network so that they’re not limited to their mentors’ contributions to their professional growth and development. In addition to their core responsibility of directly supporting scholar research projects, research managers serve as a gateway to far-reaching resources and advice outside the explicit scope of the program. A research manager might direct you to talk to another team member about a particular problem, or connect you with folks outside of MATS if they feel it’s useful. These interventions are somewhat inefficient and don’t often generalize, but can have transformative implications for the right scholar.

For any MATS scholar, the most valuable things they can spend their time on are research and networking. The ceiling on returns for time spent in either is very high. With these observations in mind, we’ve already committed internally to offering a lower overall volume of optional programming and focusing more on proactively developing an internal compendium of resources suited to situations individual scholars may find themselves in.

For our team, there are three main takeaways regarding scholar selection and training:

  1. Weight our talent portfolio toward Iterators (knowing that, with sufficient experience, they’ll often fit well even in Connector-shaped roles), since they’re comparatively easy to identify, train, and place in impactful roles in existing AI safety labs.
  2. Avoid making decisions that might select strongly against Amplifiers, since they’re definitely in demand, and existing initiatives to either poach or develop them don’t seem to satisfy this demand. Amplifiers are needed to grow existing AI safety labs and found new organizations, helping create employment opportunities for Connectors and Iterators.
  3. Foster an environment that facilitates the self-directed development of Connectors, who require consistent, high-quality contact with others working in the field in order to develop field-specific reasoning abilities, but who otherwise don’t benefit much from one-size-fits-all education. Putting too much weight on the short-term outputs of a given Connector is a disservice to their development, and for Connectors MATS should be considered less as a bootcamp and more as a residency program.

This investigation and its results are just a small part of the overall strategy and direction at MATS. We’re constantly engaging with the community, on all sides, to improve our understanding of how we best fit into the field as a whole, and are in the process of implementing many considered changes to help address other areas in which there’s room for us to grow.

Acknowledgements

This report was produced by the ML Alignment & Theory Scholars Program. @yams and Carson Jones were the primary contributors to this report, Ryan Kidd scoped, managed, and edited the project, and McKenna Fitzgerald advised throughout. Thanks to our interviewees for their time and support. We also thank Open Philanthropy, DALHAP Investments, the Survival and Flourishing Fund Speculation Grantors, and several generous donors on Manifund, without whose donations we would be unable to run upcoming programs or retain team members essential to this report.

To learn more about MATS, please visit our website. We are currently accepting donations for our Winter 2024-25 Program and beyond!

  1. ^

     AI Safety is a somewhat underspecified term, and when we use ‘AI safety’ or ‘the field’ here, we mean technical AI safety, which has been the core focus of our program up to this point. Technical AI safety, in turn, here refers to the subset of AI safety research that takes current and future technological paradigms as its chief objects of study, rather than governance, policy, or ethics. Importantly, this does not exclude all theoretical approaches, but does in practice prefer those theoretical approaches which have a strong foundation in experimentation. Due to the dominant focus on prosaic AI safety within the current job and funding market, the main focus of this report, we believe there are few opportunities for those pursuing non-prosaic, theoretical AI safety research.

  2. ^

     The initial impetus for this project was an investigation into the oft-repeated claim that AI safety is principally bottlenecked by research leadership. In the preliminary stages of our investigation, we found this to be somewhat, though not entirely, accurate. It mostly applies in the case of mid-sized orgs looking for additional leadership bandwidth, and even there soft skills are often more important than meta-level insights. Most smaller AI safety orgs form around the vision of their founders and/or acquire senior advisors fairly early on, and so have quite a few ideas to work with.

  3. ^

     These examples are not exhaustive, and few people fit purely into one category or another (even if we listed them here as chiefly belonging to a particular archetype). Many influential researchers whose careers did not, to us, obviously fit into one category or another have been omitted.

  4. ^

     In reality, many orgs are engaged in some combination of these activities, but grouping this way did help us to see some trends. At present, we’re not confident we pulled enough data from governance orgs to include them in the analysis here, but we think this is worthwhile and are devoting some additional time to that angle on the investigation. We may share further results in the future.

64 comments

To anyone reading this who is considering working in alignment --

Following the recent revelations, I now believe OpenAI should be regarded as a bad faith actor. If you go work at OpenAI, I believe your work will be net negative; and will most likely be used to "safetywash" or "governance-wash" Sam Altman's mad dash to AGI. It now appears Sam Altman is at least as sketchy as SBF. Attempts to build "social capital" or "affect the culture from the inside" will not work under current leadership (indeed, what we're currently seeing are the failed results of 5+ years of such attempts). I would very strongly encourage anyone looking to contribute to stay away from OpenAI.

I recognize this is a statement, and not an argument. I don't have the time to write out the full argument. But I'm leaving this comment here, such that others can signal agreement with it. 

Raemon

I'm around ~40% on "4 years from now, I'll think it was clearly the right call for alignment folk to just stop working at OpenAI, completely." 

But, I think it's much more likely that I'll continue endorsing something like "Treat OpenAI as a manipulative adversary by default, do not work there or deal with them unless you have a concrete plan for how you are being net positive.  And because there's a lot of optimization power in their company, be pretty skeptical that any plans you make will work. Do not give them free resources (like inviting them to EA global or job fairs)". 

I think it's nonetheless good to have some kind of "stated terms" for what actions OpenAI / Sam etc could take that might make it more worthwhile to work with them in the future (or, to reduce active opposition to them). Ultimately, I think OpenAI is on track to destroy the world, and I think actually stopping them will somehow require their cooperation at some point. So I don't think I'd want to totally burn bridges.

But I also don't think there's anything obvious Sam or OpenAI can do to "regain trust." I think the demonstrated actions with the NDAs, and Sam's deceptive non-apology, means they've lost the ability to credibly signal good faith. 

...

Some background:

Last year, when I was writing "Carefully Bootstrapped Alignment" is organizationally hard, I chatted with people at various AI labs. 

I came away with the impression that Anthropic kinda has a culture/leadership that (might, possibly) be worth investing in (but which I'd still need to see more proactive positive steps to really trust), and that DeepMind was in a weird state where its culture wasn't very unified, but the leadership seemed at least vaguely in the right place.

I still had a lot of doubts about those companies, but when I talked to people I knew there, I got at least some sense that there was an internal desire to be safety-conscious.

When I talked to people at OpenAI, the impression I came away with was "there's really no hope of changing the culture there. Do not bother trying."

(I think the people I talked to at all orgs were generally not optimistic about changing culture, and instead more focused on developing standards that could eventually turn into regulations, which would make it harder for the orgs to back out of agreements)

That was last year, before the seriousness of the nondisparagement clauses and the pressure put on people became more clear-cut. And before reading Zach's post about how AI companies aren't really using external evaluators.

Hm, I disagree and would love to operationalize a bet/market on this somehow; one approach is something like "Will we endorse Jacob's comment as 'correct' 2 years from now?", resolved by a majority of Jacob + Austin + <neutral 3rd party>, after deliberating for ~30m.

Sure that works! Maybe use a term like "importantly misguided" instead of "correct"? (Seems easier for me to evaluate)

Mostly seems sensible to me (I agree that a likely model is that there's a lot of deceptive and manipulative behavior coming from the top and that caring about extinction risks was substantially faked), except that I would trust an agreement from Altman much more than an agreement from Bankman-Fried.

I weakly disagree. The fewer safety-motivated people want to work at OpenAI, the stronger the case for any given safety person to work there.

Also, now that there are enough public scandals, hopefully anybody wanting to work at OpenAI will be sufficiently guarded and going in with their eyes fully open, rather than naive/oblivious.

Counter-counter-argument: the safety-motivated people, especially if entering at the low level, have ~zero ability to change anything for the better internally, while they could usefully contribute elsewhere, and the presence of token safety-motivated people at OpenAI improves OpenAI's ability to safety-wash its efforts (by pointing at them and going "look how much resources we're giving them!", like was attempted with Superalignment).

TsviBT

Technical AI safety, in turn, here refers to the subset of AI safety research that takes the current technological paradigm as its chief object of study. Importantly, this does not exclude all theoretical approaches, but does prefer those theoretical approaches which have a strong foundation in experimentation.

I appreciate this clarification, but I think it's not enough. As the most defensible counterexample, theoretical math is quintessentially technical, whether or not it relates to (non-mental) experimentation. A less defensible but more important counterexample is (careful, speculative, motivated, core) philosophy. An alternative name for what you mean here could be "prosaic". See e.g. https://www.lesswrong.com/posts/YTq4X6inEudiHkHDF/prosaic-ai-alignment :
 

“prosaic” AGI, which doesn’t reveal any fundamentally new ideas about the nature of intelligence or turn up any “unknown unknowns.”

If "prosaic" sounds derogatory, another alternative would be "in-/on-paradigm". 

All young people and other newcomers should be made aware that on-paradigm AI safety/alignment--while being more tractable, feedbacked, well-resourced, and populated compared to theory--is also inevitably streetlighting https://en.wikipedia.org/wiki/Streetlight_effect. 

> All young people and other newcomers should be made aware that on-paradigm AI safety/alignment--while being more tractable, feedbacked, well-resourced, and populated compared to theory--is also inevitably streetlighting https://en.wikipedia.org/wiki/Streetlight_effect.

Half-agree. I think there's scope within field like interp to focus on things that are closer to the hard part of the problem or at least touch on robust bottlenecks for alignment agendas (eg: ontology identification). I do think there is a lot of diversity in people working in these more legible areas and that means there are now many people who haven't engaged with or understood the alignment problem well enough to realise where we might be suffering from the street light effect. 

TsviBT

Object level: ontology identification, in the sense that is studied empirically, is pretty useless. It streetlights on recognizable things, and AFAIK isn't trying to avoid, for example, the Doppelgänger problem or to at all handle diasystemic novelty or the ex quo of a mind's creativity. [ETA: actually ELK I think addresses the Doppelgänger problem in its problem statement, if not in any proposed solutions.]

Meta:

I think there's scope within field like interp to focus on things that are closer to the hard part of the problem or at least touch on robust bottlenecks for alignment agendas (eg: ontology identification).

You hedged your statement so much that it became true and also not very relevant. Here are the hedges:

  • "scope": some research could be interpreted as trying to get to some other research, or as having a mission statement that includes some other research
  • "within field[s]": some people / some research--or maybe no actual people or research, but possible research that would fit with the genre of the field
  • "closer to": but maybe not close to, in an absolute sense
  • "or at least touch on": if an academic philosopher wrote this about their work, you'd immediately recognize it as cope
  • "alignment agendas": there aren't any alignment agendas. There are alignment agendas in the sense that "we can start a colony around Proxima Centauri in the following way: 1. make a go-really-fast-er. 2. use the go-really-fast-er to go really fast towards Proxima Centauri" is an agenda to get to Proxima Centauri. If you make no mention of the part where you have to also slow down, and the part about steering, and the part where you have to shield from cosmic rays, and make a self-sustaining habitat on the ship, and the part about are any of the planets around Proxima Centauri remotely habitable... is this really an agenda?

Object level: ontology identification, in the sense that is studied empirically, is pretty useless. It streetlights on recognizable things, and AFAIK isn't trying to avoid, for example, the Doppelgänger problem

I haven't seen anyone do such interpretability research yet but I see no particular reason to think this is the sort of thing that can't be studied empirically rather than the sort of thing that hasn't been studied empirically. We have, for example, vision transformers and language transformers. I would be very surprised if there was a pure 1:1 mapping between the learned features in those two types of transformer models.

Well, empirically, when people try to study it empirically, instead they do something else. Surely that's empirical evidence that it can't be studied empirically? (I'm a little bit trolling but also not.)

I'd say mechanistic interpretability is trending toward a field which cares about & researches the problems you mention. For example, the doppelganger problem is a fairly standard criticism of the sparse autoencoder work, diasystemic novelty seems the kind of thing you'd encounter when doing developmental interpretability, interp-through-time, or inductive biases research, especially with a focus on phase changes (a growing focus area), and though I'm having a hard time parsing your creativity post (an indictment of me, not of you, as I didn't spend too long with it), it seems the kind of thing which would come from the study of in-context-learning, a goal that mainstream MI I believe has, even if it doesn't focus on it now (likely because it believes it's unable to at this moment), and which I think it will care more about as the power of such in-context learning becomes more and more apparent.

ETA: An argument could be that though these problems will come up, ultimately the field will prioritize hacky fixes in order to deal with them, which only sweep the problems under the rug. I think many in MI will prioritize such limited fixes, but also that some won't, and due to the benefits of such problems becoming empirical, such people will be able to prove the value of their theoretical work & methodology by convincing MI people with their practical applications, and money will get diverted to such theoretical work & methodology by DL-theory-traumatized grantmakers.

the doppelganger problem is a fairly standard criticism of the sparse autoencoder work,

And what's the response to the criticism, or a/the hoped approach?

diasystemic novelty seems the kind of thing you'd encounter when doing developmental interpretability, interp-through-time

Yeah, this makes sense. And hey, maybe it will lead to good stuff. Any results so far, that I might consider approaching some core alignment difficulties?

it seems the kind of thing which would come from the study of in-context-learning, a goal that mainstream MI I believe has, even if it doesn't focus on it now (likely because it believes it's unable to at this moment), and which I think it will care more about as the power of such in-context learning becomes more and more apparent.

Also makes some sense (though the ex quo, insofar as we even want to attribute this to current systems, is distributed across the training algorithms and the architecture sources, as well as inference-time stuff).

Generally what you're bringing up sounds like "yes these are problems and MI would like to think about them... later". Which is understandable, but yeah, that's what streetlighting looks like.

Maybe an implicit justification of current work is like:

There's these more important, more difficult problems. We want to deal with them, but they are too hard right now, so we will try in the future. Right now we'll deal with simpler things. By dealing with simpler things, we'll build up knowledge, skills, tools, and surrounding/supporting orientation (e.g. explaining weird phenomena that are actually due to already-understandable stuff, so that later we don't get distracted). This will make it easier to deal with the hard stuff in the future.

This makes a lot of sense--it's both empathizandable, and seems probably somewhat true. However:

  1. Again, it still isn't in fact currently addressing the hard parts. We want to keep straight the difference between [currently addressing] vs. [arguably might address in the future].
  2. We gotta think about what sort of thing would possibly ever work. We gotta think about this now, as much as possible.
  3. A core motivating intuition behind the MI program is (I think) "the stuff is all there, perfectly accessible programmatically, we just have to learn to read it". This intuition is deeply flawed: Koan: divining alien datastructures from RAM activations

I don't know of any clear progress on your interests yet. My argument was about the trajectory MI is on, which I think is largely pointed in the right direction. We can argue about the speed at which it gets to the hard problems, whether it's fast enough, and how to make it faster though. So you seem to have understood me well.

A core motivating intuition behind the MI program is (I think) "the stuff is all there, perfectly accessible programmatically, we just have to learn to read it". This intuition is deeply flawed: Koan: divining alien datastructures from RAM activations

I think I'm more agnostic than you are about this, and also about how "deeply" flawed MI's intuitions are. If you're right, once the field progresses to nontrivial dynamics, we should expect those operating at a higher level of analysis--conceptual MI--to discover more than those operating at a lower level, right?

If, hypothetically, we were doing MI on minds, then I would predict that MI will pick some low hanging fruit and then hit walls where their methods will stop working, and it will be more difficult to develop new methods that work. The new methods that work will look more and more like reflecting on one's own thinking, discovering new ways of understanding one's own thinking, and then going and looking for something like that in the in-vitro mind. IDK how far that could go. But then this will completely grind to a halt when the IVM is coming up with concepts and ways of thinking that are novel to humanity. Some other approach would be needed to learn new ideas from a mind via MI.

However, another dealbreaker problem with current and current-trajectory MI is that it isn't studying minds.

I mean my impression is that there are something on the order of 100-1000 people in the world working on ML interpretability as their day job, and maybe 1k-10k people who dabble in their free time. No research in the field will get done unless one of that small number of people makes a specific decision to tackle that particular research question instead of one of the countless other ones they could choose to tackle.

I don't know what you're trying to do in this thread (e.g. what question you're trying to answer).

To be explicit, that was a response to

Well, empirically, when people try to study it empirically, instead they do something else

I don't know that we have any empirical data on what happens when people try to study that particular empirical question (the specific relationship between the features learned by two models of different modalities) because I don't know that anyone has set out to study that particular question in any serious way.

In other words, I suspect it's not "when someone starts to study this phenomenon, some mysterious process causes them to study something else instead". I think it's "the surface area of the field is large and there aren't many people in it, so I doubt anyone has even gotten to the part where they start to study this phenomenon."

Edit: to be even more explicit, what I'm trying to do in this thread is encourage thinking about ways one might collect empirical observations about non-"streetlit" topics. None of the topics are under the streetlight until someone builds the streetlight. "Build a streetlight" is sometimes an available action, but it only happens if someone makes a specific effort to do so.

Edit 2: I misunderstood what point you were making as "prosaic alignment is unlikely to be helpful, look at all of these empirical researchers who have not even answered these basic questions" (which is a perspective I disagree with pretty strongly) rather than "I think empirical research shouldn't be the only game in town" (which I agree with) and "we should fund outsiders to go do stuff without much interaction with or feedback from the community to hopefully develop new ideas that are not contaminated with the current community biases" (I think this would be worth doing if resources were unlimited, not sure as things actually stand).

As a concrete note, I suspect work that demonstrates that philosophical or mathematical approaches can yield predictions about empirical questions is more likely to be funded. For example, in your post you say

In programming, adding a function definition would be endosystemic; refactoring the code into a functional style rather than an object-oriented style, or vice versa, in a way that reveals underlying structure, is diasystemic novelty.

Could that be operationalized as a prediction of the form

If you train a model on a bunch of simple tasks involving both functional and object-oriented code (e.g. "predict the next token of the codebase", "predict missing token", "identify syntax errors") and then train it on a complex task on only object-oriented code (e.g. "write a document describing how to use this library"), it will fail to navigate that ontological shift and will be unable to document functional code.

(I expect that's not a correct operationalization but something of that shape)

Here's the convo according to me:

Bloom:

I think there's scope within field like interp to focus on things that are closer to the hard part of the problem or at least touch on robust bottlenecks for alignment agendas

BT:

Object level: ontology identification, in the sense that is studied empirically, is pretty useless.

sname:

I haven't seen anyone do such interpretability research yet but I see no particular reason to think this is the sort of thing that can't be studied empirically rather than the sort of thing that hasn't been studied empirically.

BT:

Well, empirically, when people try to study it empirically, instead they do something else

sname:

I don't know that we have any empirical data on what happens when people try to study that particular empirical question (the specific relationship between the features learned by two models of different modalities) because I don't know that anyone has set out to study that particular question in any serious way.

BT:

ah, sname is talking about conceptual Doppelgängers specifically, as ze indicated in a previous comment that I now understand

When I said "when people try to study it empirically", what I meant was "when people try to do interpretability research (presumably, that is relevant to the hard part of the problem?)".

"prosaic alignment is unlikely to be helpful, look at all of these empirical researchers who have not even answered these basic questions"

Right, I'm not saying exactly this. But I am saying:

Prosaic alignment is unlikely to be helpful, look at how they are starting in an extremely streetlighty way(*) and then, empirically, not pushing out into the dark quickly--and furthermore, AFAIK, not very concerned with how they aren't pushing out into the dark quickly enough, or successfully addressing this at the meta level, though plausibly they're doing that and I'm just not aware.

(*): studying LLMs, which are not minds; trying to recognize [stuff we mostly conceptually understand] within systems rather than trying to come to conceptually understand [the stuff we'd need to be able to recognize/design in a mind, in order to determine the mind's effects].

(I think this would be worth doing if resources were unlimited, not sure as things actually stand).

Well, you've agreed with a defanged version of my statements. The toothful version, which I do think: Insofar as this is even possible, we should allocate a lot more resources toward funding any high-caliber smart/creative/interesting/promising/motivated youngsters/newcomers who want to take a crack at independently approaching the core difficulties of AGI alignment, even if that means reallocating a lot of resources away from existing on-paradigm research.

Edit: to be even more explicit, what I'm trying to do in this thread is encourage thinking about ways one might collect empirical observations about non-"streetlit" topics. None of the topics are under the streetlight until someone builds the streetlight. "Build a streetlight" is sometimes an available action, but it only happens if someone makes a specific effort to do so.

This seems like a good thing to do. But there's multiple ways that existing research is streetlit, and reality doesn't owe it to you to make it be the case that there are nice (tractionful, feasible, interesting, empirical, familiar, non-weird-seeming, feedbacked, grounded, legible, consensusful) paths toward the important stuff. The absence of nice paths would really suck if it's the case, and it's hard to see how anyone could be justifiedly really confident that there are no nice paths. But yes, I'm saying that it looks like there aren't nice paths, or at least there aren't enough nice paths that we seem likely to find them by continuing to sample from the same distribution we've been sampling from; and I have some arguments and reasons supporting this belief, which seem true; and I would guess that a substantial fraction (though not most) of current alignment researchers would agree with a fairly strong version of "very few or no nice paths".

Could that be operationalized as a prediction of the form

If you train a model on a bunch of simple tasks involving both functional and object-oriented code (e.g. "predict the next token of the codebase", "predict missing token", "identify syntax errors") and then train it on a complex task on only object-oriented code (e.g. "write a document describing how to use this library"), it will fail to navigate that ontological shift and will be unable to document functional code.

I don't think that's a good operationalization, as you predict. I think it's trying to be an operationalization related to my claim above:

ontology identification, in the sense that is studied empirically, is pretty useless. It [..] AFAIK isn't trying to [...] at all handle diasystemic novelty [...].

But it sort of sounds like you're trying to extract a prediction about capability generalization or something? Anyway, an interp-like study trying to handle diasystemic novelty might for example try to predict large scale explicitization events before they happen--maybe in a way that's robust to "drop out". E.g. you have a mind that doesn't explicitly understand Bayesian reasoning; but it is engaging in lots of activities that would naturally induce small-world probabilistic reasoning, e.g. gambling games or predicting-in-distribution simple physical systems; and then your interpreter's job is to notice, maybe only given access to restricted parts (in time or space, say) of the mind's internals, that Bayesian reasoning is (implicitly) on the rise in many places. (This is still easy mode if the interpreter gets to understand Bayesian reasoning explicitly beforehand.) I don't necessarily recommend this sort of study, though; I favor theory.

Here's the convo according to me:

...

Seems about right

When I said "when people try to study it empirically", what I meant was "when people try to do interpretability research (presumably, that is relevant to the hard part of the problem?)".

Is there a particular reason you expect there to be exactly one hard part of the problem, and for the part that ends up being hardest in the end to be the part that looks hardest to us now?

Prosaic alignment is unlikely to be helpful, look at how they are starting in an extremely streetlighty way(*) and then, empirically, not pushing out into the dark quickly--and furthermore, AFAIK, not very concerned with how they aren't pushing out into the dark quickly enough, or successfully addressing this at the meta level, though plausibly they're doing that and I'm just not aware.


If I were a prosaic alignment researcher, I probably would choose to prioritize which problems I worked on a bit differently than those currently in the field. However, I expect that the research that ends up being the most useful will not be that research which looked most promising before someone started doing it, but rather research that stemmed from someone trying something extremely simple and getting an unexpected result, and going "huh, that's funny, I should investigate further". I think that the process of looking at lots of things and trying to get feedback from reality as quickly as possible is promising, even if I don't have a strong expectation that any one specific one of those things is promising to look at.

But there's multiple ways that existing research is streetlit, and reality doesn't owe it to you to make it be the case that there are nice (tractionful, feasible, interesting, empirical, familiar, non-weird-seeming, feedbacked, grounded, legible, consensusful) paths toward the important stuff

Certainly reality doesn't owe us a path like that, but it would be pretty undignified if reality did in fact give us a path like that and we failed to find it because we didn't even look.

Anyway, an interp-like study trying to handle diasystemic novelty might for example try to predict large scale explicitization events before they happen--maybe in a way that's robust to "drop out". E.g. you have a mind that doesn't explicitly understand Bayesian reasoning; but it is engaging in lots of activities that would naturally induce small-world probabilistic reasoning, e.g. gambling games or predicting-in-distribution simple physical systems; and then your interpreter's job is to notice, maybe only given access to restricted parts (in time or space, say) of the mind's internals, that Bayesian reasoning is (implicitly) on the rise in many places. (This is still easy mode if the interpreter gets to understand Bayesian reasoning explicitly beforehand.)

Interesting. I would be pretty interested to see research along these lines (although the scope of the above is probably still a bit large for a pilot project).

I don't necessarily recommend this sort of study, though; I favor theory.

What is your preferred method for getting feedback from reality on whether your theory describes the world as it is?

Is there a particular reason you expect there to be exactly one hard part of the problem,

Have you stopped beating your wife? I say "the" here in the sense of like "the problem of climbing that mountain over there". If you're far away, it makes sense to talk about "the (thing over there)", even if, when you're up close, there's multiple routes, multiple summits, multiple sorts of needed equipment, multiple sources of risk, etc.

and for the part that ends up being hardest in the end to be the part that looks hardest to us now?

We make an argument like "any solution would have to address X" or "anything with feature Y does not do Z" or "property W is impossible", and then we can see what a given piece of research is and is not doing / how it is doomed to irrelevance. It's not like pointing to a little ball in ideaspace and being like "the answer is somewhere in here". Rather it's like cutting out a halfspace and saying "everything on this side of this plane is doomed, we'd have to be somewhere in the other half", or like pointing out a manifold that all research is on and saying "anything on this manifold is doomed, we'd have to figure out how to move somewhat orthogonalward".

research that stemmed from someone trying something extremely simple and getting an unexpected result

I agree IF we are looking at the objects in question. If LLMs were minds, the research would be much more relevant. (I don't care if you have an army of people who all agree on taking a stance that seems to imply that there's not much relevant difference between LLMs and future AGI systems that might kill everyone.)

What is your preferred method for getting feedback from reality on whether your theory describes the world as it is?

I think you (and everyone else) don't know how to ask this question properly. For example, "on whether your theory describes the world as it is" is a too-narrow idea of what our thoughts about minds are supposed to be. Sub-example: our thoughts about mind are supposed to also produce design ideas.

To answer your question: by looking at and thinking about minds. The only minds that currently exist are humans, and the best access you have to minds is introspection. (I don't mean meditation, I mean thinking and also thinking about thinking/wanting/acting--aka some kinds of philosophy and math.)

Is there a particular reason you expect there to be exactly one hard part of the problem,

Have you stopped beating your wife? I say "the" here in the sense of like "the problem of climbing that mountain over there". If you're far away, it makes sense to talk about "the (thing over there)", even if, when you're up close, there's multiple routes, multiple summits, multiple sorts of needed equipment, multiple sources of risk, etc.

I think the appropriate analogy is someone trying to strategize about "the hard part of climbing that mountain over there" before they have even reached base camp or seriously attempted to summit any other mountains. There are a bunch of parts that might end up being hard, and one can come up with some reasonable guesses as to what those parts might be, but the bits that look hard from a distance and the bits that end up being hard when you're on the face of the mountain may be different parts.

We make an argument like "any solution would have to address X" or "anything with feature Y does not do Z" or "property W is impossible", and then we can see what a given piece of research is and is not doing / how it is doomed to irrelevance

Doomed to irrelevance, or doomed to not being a complete solution in and of itself? The point of a lot of research is to look at a piece of the world and figure out how it ticks. Research to figure out how a piece of the world ticks won't usually directly allow you to make it tock instead, but can be a useful stepping stone. Concrete example: dictionary learning vs Golden Gate Claude.

I agree IF we are looking at the objects in question. If LLMs were minds, the research would be much more relevant.

I think one significant crux is "to what extent are LLMs doing the same sort of thing that human brains do / the same sorts of things that future, more powerful AIs will do?"  It sounds like you think the answer is "they're completely different and you won't learn much about one by studying the other". Is that an accurate characterization?

To answer your question: by looking at and thinking about minds. The only minds that currently exist are humans

Agreed, though with quibbles

and the best access you have to minds is introspection.

In my experience, my brain is a dirty lying liar that lies to me at every opportunity -- another crux might be how faithful one expects their memory of their thought processes to be to the actual reality of those thought processes.

Doomed to irrelevance, or doomed to not being a complete solution in and of itself?

Doomed to not be trying to go to and then climb the mountain.

my brain is a dirty lying liar that lies to me at every opportunity

So then it isn't easy. But it's feedback. Also there's not that much distinction between making a philosophically rigorous argument and "doing introspection" in the sense I mean, so if you think the former is feasible, work from there.

> Doomed to irrelevance, or doomed to not being a complete solution in and of itself?

> Doomed to not be trying to go to and then climb the mountain.

If you think that current mech interp work is currently trying to directly climb the mountain, rather than trying to build and test a set of techniques that might be helpful on a summit attempt, I can see why you'd be frustrated and discouraged at the lack of progress.

> Also there's not that much distinction between making a philosophically rigorous argument and "doing introspection" in the sense I mean, so if you think the former is feasible, work from there.

I don't have much hope in the former being feasible, though I do support having a nonzero number of people try it because sometimes things I don't think are feasible end up working.

I mean if we're going with memes I could equally say

though realistically I think the most common problem in this kind of discussion is

Look... Consider the hypothetically possible situation that in fact everyone is very far from being on the right track, and everything everyone is doing doesn't help with the right track and isn't on track to get on the right track or to help with the right track.

Ok, so I'm telling you that this hypothetically possible situation seems to me like the reality. And then you're, I don't know, trying to retreat to some sort of agreeable live-and-let-live stance, or something, where we all just agree that due to model uncertainty and the fact that people have vaguely plausible stories for how their thing might possibly be helpful, everyone should do their own thing and it's not helpful to try to say that some big swath of research is doomed? If this is what's happening, then I think that what you in particular are doing here is a bad thing to do here.

Maybe we can have a phone call if you'd like to discuss further.

> Maybe we can have a phone call if you'd like to discuss further.

I doubt it's worth it -- I'm not a major funder in this space and don't expect to become one in the near future, and my impression is that there is no imminent danger of you shutting down research that looks promising to me and unpromising to you. As such, I think the discussion ended up getting into the weeds in a way that probably wasn't a great use of either of our time, and I doubt spending more time on it would change that.

That said, I appreciated your clarity of thought, and in particular your restatement of how the conversation looked to you. I will probably be stealing that technique.

You are, of course, correct in your definitions of "technical" and "prosaic" AI safety. Our interview series did not exclude advocates of theoretical or non-prosaic approaches to AI safety. It was not the intent of this report to ignore talent needs in non-prosaic technical AI safety. We believe that this report summarises our best understanding of the dominant talent needs across all of technical AI safety, at least as expressed by current funders and org leaders.

MATS has supported several theoretical or non-prosaic approaches to improving AI safety, including Vanessa Kosoy’s learning-theoretic agenda, Jesse Clifton’s and Caspar Oesterheld’s cooperative AI research, Vivek Hebbar’s empirical agent foundations research, John Wentworth’s selection theorems agenda, and more. We remain supportive of well-scoped agent foundations research, particularly that with tight empirical feedback loops. If you are an experienced agent foundations researcher who wants to mentor, please contact us; this sub-field seems particularly bottlenecked by high-quality mentorship right now.

I have amended our footnote to say:

Technical AI safety, in turn, here refers to the subset of AI safety research that takes current and future technological paradigms as its chief objects of study, rather than governance, policy, or ethics. Importantly, this does not exclude all theoretical approaches, but does in practice prefer those theoretical approaches which have a strong foundation in experimentation. Due to the dominant focus on prosaic AI safety within the current job and funding market, the main focus of this report, we believe there are few opportunities for those pursuing non-prosaic, theoretical AI safety research.

If you disagree with our assessment, please let us know! We would love to hear about more jobs or funding opportunities for non-prosaic AI safety research.

TsviBT

Thanks. Well, now the footnote seems better, but now it contradicts the title. The footnote says that "the main focus of this report" is "the current job and funding market". This is conflating "the current job and funding market" with "technical AI safety", given that the title is "Talent Needs in Technical AI Safety".

Note: I don't mean to single out you (Ryan) or MATS or this post; I greatly appreciate your work and think it's good, and don't think you're doing something worse than others are doing with regard to framing the field for newcomers. What I'm trying to do here is fight a (rearguard, unfortunately) action against the sweep of [most of the resource allocation around here] conflating [what the people currently working on stuff called "AI safety/alignment" say they could use help with] with [what is needed in order to figure out AGI alignment].

One response to what I'm saying is: Yes, the people in the field will of course make lots of mistakes, but they're still at the forefront, and so the aggregate of their guesses about what new talent should do represents our best guess.

My counterresponse: No, that doesn't follow. There's a separate parameter of "how much do we (the people in the field) actually know about how to turn effort into progress, as opposed to not knowing and therefore needing help in the form of new talent that tries new approaches to turning effort into progress". At least as of a couple years ago, my sense was that nearly all experts working on AI safety/alignment would agree that 1. their plans for alignment won't work, and 2. alignment is preparadigmatic. (I'm not confident that they would have said so, or would say so now.)

Depending on the value of that parameter, conflating "the current job and funding market" with "technical AI safety" makes more or less sense. Further, to the extent that people inappropriately conflate these two things, these two things become even more distinct. (Cf. Dangers of deference.)

I think there might be a simple miscommunication here: in our title and report we use "talent needs" to refer to "job and funding opportunities that could use talent." Importantly, we generally make a descriptive, not a normative, claim about the current job and funding opportunities. We could have titled the report "Open and Impactful Job and Funding Opportunities in Technical AI Safety," but this felt unwieldy. Detailing what job and funding opportunities should exist in the technical AI safety field is beyond the scope of this report.

Also, your feedback is definitely appreciated!

TsviBT

Ok I think you're right. I didn't know (at least, not well enough) that "talent needs" quasi-idiomatically means "sorts of people that an organization wants to hire", and interpreted it to mean literally "needs (by anyone / the world) for skills / knowledge".

I don't buy the unwieldiness excuse; you could say "Hiring needs in on-paradigm technical AI safety", for example. But me criticizing minutiae of the framing in this post doesn't seem helpful. The main thing I want to communicate is that

  1. the main direct help we can give to AGI alignment would go via novel ideas that would be considered off-paradigm; and therefore
  2. high-caliber newcomers to the field should be strongly encouraged to try to do that; and
  3. there are strong emergent effects in the resource allocation (money, narrative attention, collaboration) of the field that strongly discourage newcomers from doing so and/or don't attract newcomers who would do so.

Yes, there is more than unwieldiness at play here. If we retitled the post "Hiring needs in on-paradigm technical AI safety," (which does seem unwieldy and introduces an unneeded concept, IMO) this seems like it would work at cross purposes to our (now explicit) claim, "there are few opportunities for those pursuing non-prosaic, theoretical AI safety research." I think it benefits no-one to make false or misleading claims about the current job market for non-prosaic, theoretical AI safety research (not that I think you are doing this; I just want our report to be clear). If anyone doesn't like this fact about the world, I encourage them to do something about it! (E.g., found organizations, support mentees, publish concrete agendas, petition funders to change priorities.)

As indicated by MATS' portfolio over research agendas, our revealed preferences largely disagree with point 1 (we definitely want to continue supporting novel ideas too, constraints permitting, but we aren't Refine). Among other objectives, this report aims to show a flaw in the plan for point 2: high-caliber newcomers have few mentorship, job, or funding opportunities to mature as non-prosaic, theoretical technical AI safety researchers and the lead time for impactful Connectors is long. We welcome discussion on how to improve paths-to-impact for the many aspiring Connectors and theoretical AI safety researchers.

I agree with Tsvi here (as I'm sure will shock you :)).

I'd make a few points:

  1. "our revealed preferences largely disagree with point 1" - this isn't clear at all. We know MATS' [preferences, given the incentives and constraints under which MATS operates]. We don't know what you'd do absent such incentives and constraints.
    1. I note also that "but we aren't Refine" has the form [but we're not doing x], rather than [but we have good reasons not to do x]. (I don't think MATS should be Refine, but "we're not currently 20% Refine-on-ramp" is no argument that it wouldn't be a good idea)
  2. MATS is in a stronger position than most to exert influence on the funding landscape. Sure, others should make this case too, but MATS should be actively making a case for what seems most important (to you, that is), not only catering to the current market.
    1. Granted, this is complicated by MATS' own funding constraints - you have more to lose too (and I do think this is a serious factor, undesirable as it might be).
  3. If you believe that the current direction of the field isn't great, then "ensure that our program continues to meet the talent needs of safety teams" is simply the wrong goal.
    1. Of course the right goal isn't diametrically opposed to that - but still, not that.
  4. There's little reason to expect the current direction of the field to be close to ideal:
    1. At best, the accuracy of the field's collective direction will tend to correspond to its collective understanding - which is low.
    2. There are huge commercial incentives exerting influence.
    3. There's no clarity on what constitutes (progress towards) genuine impact.
    4. There are many incentives to work on what's already not neglected (e.g. things with easily located "tight empirical feedback loops"). The desirable properties of the non-neglected directions are a large part of the reason they're not neglected.
    5. Similar arguments apply to [field-level self-correction mechanisms].
  5. Given (4), there's an inherent sampling bias in taking [needs of current field] as [what MATS should provide]. Of course there's still an efficiency upside in catering to [needs of current field] to a large extent - but efficiently heading in a poor direction still sucks.
  6. I think it's instructive to consider extreme-field-composition thought experiments: suppose the field were composed of [10,000 researchers doing mech interp] and [10 researchers doing agent foundations].
    1. Where would there be most jobs? Most funding? Most concrete ideas for further work? Does it follow that MATS would focus almost entirely on meeting the needs of all the mech interp orgs? (I expect that almost all the researchers in that scenario would claim mech interp is the most promising direction)
    2. If you think that feedback loops along the lines of [[fast legible work on x] --> [x seems productive] --> [more people fund and work on x]] lead to desirable field dynamics in an AIS context, then it may make sense to cater to the current market. (personally, I expect this to give a systematically poor signal, but it's not as though it's easy to find good signals)
    3. If you don't expect such dynamics to end well, it's worth considering to what extent MATS can be a field-level self-correction mechanism, rather than a contributor to predictably undesirable dynamics.
      1. I'm not claiming this is easy!!
      2. I'm claiming that it should be tried.

> Detailing what job and funding opportunities should exist in the technical AI safety field is beyond the scope of this report.

Understandable, but do you know anyone who's considering this? As the core of their job, I mean - not on a [something they occasionally think/talk about for a couple of hours] level. It's non-obvious to me that anyone at OpenPhil has time for this.

It seems to me that the collective 'decision' we've made here is something like:

  • Any person/team doing this job would need:
    • Extremely good AIS understanding.
    • To be broadly respected.
    • To have a lot of time.
  • Nobody like this exists.
  • We'll just hope things work out okay using a passive distributed approach.

To my eye this leads to a load of narrow optimization according to often-not-particularly-enlightened metrics - lots of common incentives, common metrics, and correlated failure.

Oh and I still think MATS is great :) - and that most of these issues are only solvable with appropriate downstream funding landscape alterations. That said, I remain hopeful that MATS can nudge things in a helpful direction.

I plan to respond regarding MATS' future priorities when I'm able (I can't speak on behalf of MATS alone here and we are currently examining priorities in the lead up to our Winter 2024-25 Program), but in the meantime I've added some requests for proposals to my Manifund Regrantor profile.

RFPs seem a good tool here for sure. Other coordination mechanisms too.
(And perhaps RFPs for RFPs, where sketching out high-level desiderata is easier than specifying parameters for [type of concrete project you'd like to see])

Oh and I think the MATS Winter Retrospective seems great from the [measure a whole load of stuff] perspective. I think it's non-obvious what conclusions to draw, but more data is a good starting point. It's on my to-do list to read it carefully and share some thoughts.

TsviBT

Ok I want to just lay out what I'm trying to do here, and why, because it could be based on false assumptions.

A main assumption I'm making, which totally could be false, is that your paragraph

Funders of independent researchers we’ve interviewed think that there are plenty of talented applicants, but would prefer more research proposals focused on relatively few existing promising research directions (e.g., Open Phil RFPs, MATS mentors' agendas), rather than a profusion of speculative new agendas.

is generally representative of the entire landscape, with a few small-ish exceptions. In other words, I'm assuming that it's pretty difficult for a young smart person to show up and say "hey, I want to spend 3 whole years thinking about this problem de novo, can I have one year's salary and a reevaluation after 1 year for a renewal".

A main assumption that motivates what I'm doing here, and that could be false, is:

Funders make decisions mostly by some combination of recommendations from people they trust. The trust might be personal, or might be based on accomplishments, or might be based on some arguments made by the trusted person to the funder--and, centrally, the trust is actually derived from a loose diffuse array of impressions coming from the community, broadly.

To make the assumption slightly more clear: The assumption says that it's actually quite common, maybe even the single dominant way funders make decisions, for the causality of a decision to flow through literally thousands of little interactions, where the little interactions communicate "I think XYZ is Important/Unimportant". And these aggregate up into a general sense of importance/unimportance, or something. And then funding decisions work with two filters:

  1. The explicit reasoning about the details--is this person qualified, how much funding, what's the feedback, who endorses it, etc etc.
  2. The implicit filter of Un/Importance. This doesn't get raised to attention usually. It's just in the background.

And "fund a smart motivated youngster without a plan for 3 years with little evaluation" is "unimportant". And this unimportance is implicitly but strongly reinforced by everyone talking about in-paradigm stuff. And the situation is self-reinforcing because youngsters mostly don't try to do the thing, because there's no narrative and no funding, and so it is actually true that there aren't many smart motivated youngsters just waiting for some funding to do trailblazing.

If my assumptions are true, then IDK what to do about this but would say that at least

  1. people should be aware of this situation, and
  2. people should keep talking about this situation, especially in contexts where they are contributing to the loose diffuse array of impressions by contributing to framing about what AGI alignment needs.

An interesting note: I don't necessarily want to start a debate about the merits of academia, but "fund a smart motivated youngster without a plan for 3 years with little evaluation" sounds a lot like "fund more exploratory AI safety PhDs" to me. If anyone wants to do an AI safety PhD (e.g., with these supervisors) and needs funding, I'm happy to evaluate these with my Manifund Regrantor hat on.

That would only work for people with the capacity to not give a fuck what anyone around them thinks, especially including the person funding and advising them. And that's arguably unethical depending on context.

I like Adam's description of an exploratory AI safety PhD:

You'll also have an unusual degree of autonomy: You’re basically guaranteed funding and a moderately supportive environment for 3-5 years, and if you have a hands-off advisor you can work on pretty much any research topic. This is enough time to try two or more ambitious and risky agendas.

Ex ante funding guarantees, like The Vitalik Buterin PhD Fellowship in AI Existential Safety or Manifund or other funders, mitigate my concerns around overly steering exploratory research. Also, if one is worried about culture/priority drift, there are several AI safety offices in Berkeley, Boston, London, etc. where one could complete their PhD while surrounded by AI safety professionals (which I believe was one of the main benefits of the late Lightcone office).

From the section you linked:

Moreover, the program guarantees at least some mentorship from your supervisor. Your advisor’s incentives are reasonably aligned with yours: they get judged by your success in general, so want to see you publish well-recognized first-author research, land a top research job after graduation and generally make a name for yourself (and by extension, them).

Doing a PhD also pushes you to learn how to communicate with the broader ML research community. The “publish or perish” imperative means you’ll get good at writing conference papers and defending your work.

These would be exactly the "anyone around them" about whose opinion they would have to not give a fuck.

I don't know a good way to do this, but maybe a pointer would be: funders should explicitly state something to the effect of:

"The purpose of this PhD funding is to find new approaches to core problems in AGI alignment. Success in this goal can't be judged by an existing academic structure (journals, conferences, peer-review, professors) because there does not exist such a structure aimed at the core problems in AGI alignment. You may if you wish make it a major goal of yours to produce output that is well-received by some group in academia, but be aware that this goal would be non-overlapping with the purpose of this PhD funding."

The Vitalik fellowship says:

To be eligible, applicants should either be graduate students or be applying to PhD programs. Funding is conditional on being accepted to a PhD program, working on AI existential safety research, and having an advisor who can confirm to us that they will support the student’s work on AI existential safety research.

Despite being an extremely reasonable (even necessary) requirement, this is already a major problem according to me. The problem is that (IIUC--not sure) academics are incentivized to, basically, be dishonest, if it gets them funding for projects / students. Of the ~dozen professors here (https://futureoflife.org/about-us/our-people/ai-existential-safety-community/) who I'm at least a tiny bit familiar with, I think maybe 1.5ish are actually going to happily support actually-exploratory PhD students. I could be wrong about this though--curious for more data either way. And how many will successfully communicate to the sort of person who would take a real shot at exploratory conceptual research if given the opportunity to do such research that they would in fact support that? I don't know. Zero? One? And how would someone sent to the FLI page know of the existence of that professor?

Fellows are expected to participate in annual workshops and other activities that will be organized to help them interact and network with other researchers in the field.

Continued funding is contingent on continued eligibility, demonstrated by submitting a brief (~1 page) progress report by July 1st of each year.

Again, reasonable, but... Needs more clarity on what is expected, and what is not expected.

a technical specification of the proposed research

What does this even mean? This webpage doesn't get it. We're trying to buy something that isn't something someone can already write a technical specification of.

I want to sidestep critique of "more exploratory AI safety PhDs" for a moment and ask: why doesn't MIRI sponsor high-calibre young researchers with a 1-3 year basic stipend and mentorship? And why did MIRI let Vivek's team go?

I don't speak for MIRI, but broadly I think MIRI thinks that roughly no existing research is hopeworthy, and that this isn't likely to change soon. I think that, anyway.

In discussions like this one, I'm conditioning on something like "it's worth it, these days, to directly try to solve AGI alignment". That seems assumed in the post, seems assumed in lots of these discussions, seems assumed by lots of funders, and it's why above I wrote "the main direct help we can give to AGI alignment" rather than something stronger like "the main help (simpliciter) we can give to AGI alignment" or "the main way we can decrease X-risk".

I'm reading this as you saying something like "I'm trying to build a practical org that successfully onramps people into doing useful work. I can't actually do that for arbitrary domains that people aren't providing funding for. I'm trying to solve one particular part of the problem and that's hard enough as it is."

Is that roughly right?

Fwiw I appreciate your Manifund regrantor Request for Proposals announcement.

I'll probably have more thoughts later.

Yes to all this, but also I'll go one level deeper. Even if I had tons more Manifund money to give out (and assuming all the talent needs discussed in the report are saturated with funding), it's not immediately clear to me that "giving 1-3 year stipends to high-calibre young researchers, no questions asked" is the right play if they don't have adequate mentorship, the ability to generate useful feedback loops, researcher support systems, access to frontier models if necessary, etc.

A few points here (all with respect to a target of "find new approaches to core problems in AGI alignment"):

It's not clear to me what the upside of the PhD structure is supposed to be here (beyond respectability). If the aim is to avoid being influenced by most of the incentives and environment, that's more easily achieved by not doing a PhD. (to the extent that development of research 'taste'/skill acts to service a publish-or-perish constraint, that's likely to be harmful)

This is not to say that there's nothing useful about an academic context - only that the sensible approach seems to be [create environments with some of the same upsides, but fewer downsides].

I can see a more persuasive upside where the PhD environment gives:

  • Access to deep expertise in some relevant field.
  • The freedom to explore openly (without any "publish or perish" constraint).

This seems likely to be both rare, and more likely for professors not doing ML. I note here that ML professors are currently not solving fundamental alignment problems - we're not in a [Newtonian physics looking for Einstein] situation; more [Aristotelian physics looking for Einstein]. I can more easily imagine a mathematics PhD environment being useful than an ML one (though I'd expect this to be rare too).

This is also not to say that a PhD environment might not be useful in various other ways. For example, I think David Krueger's lab has done and is doing a bunch of useful stuff - but it's highly unlikely to uncover new approaches to core problems.

For example, of the 213 concrete problems posed here, how many would lead us to think [it's plausible that a good answer to this question leads to meaningful progress on core AGI alignment problems]? 5? 10? (many more can be a bit helpful for short-term safety)

There are a few where sufficiently general answers would be useful, but I don't expect such generality - both since it's hard, and because incentives constantly push towards [publish something on this local pattern], rather than [don't waste time running and writing up experiments on this local pattern, but instead investigate underlying structure].

I note that David's probably at the top of my list for [would be a good supervisor for this kind of thing, conditional on having agreed the exploratory aims at the outset], but the environment still seems likely to be not-close-to-optimal, since you'd be surrounded by people not doing such exploratory work.

I do think category theory professors or similar would be reasonable advisors for certain types of MIRI research.

I broadly agree with this. (And David was like .7 out of the 1.5 profs on the list who I guessed might genuinely want to grant the needed freedom.)

I do think that people might do good related work in math (specifically, probability/information theory, logic, etc.--stuff about formalized reasoning), philosophy (of mind), and possibly in other places such as theoretical linguistics. But this would require that the academic context is conducive to good novel work in the field, which lower bar is probably far from universally met; and would require the researcher to have good taste. And this is "related" in the sense of "might write a paper which leads to another paper which would be cited by [the alignment textbook from the future] for proofs/analogies/evidence about minds".

Have you looked through the FLI faculty listed there?
How many seem useful supervisors for this kind of thing? Why?

If we're sticking to the [generate new approaches to core problems] aim, I can see three or four I'd be happy to recommend, conditional on their agreeing upfront to the exploratory goals, and that publication would not be necessary (or a very low concrete number agreed upon).

There are about ten more that seem not-obviously-a-terrible-idea, but probably not great (e.g. those who I expect have a decent understanding of the core problems, but basically aren't working on them).

The majority don't write anything that suggests they know what the core problems are.

For almost all of these supervisors, doing a PhD would seem to provide quite a few constraints, undesirable incentives, and an environment that's poor.
From an individual's point of view this can still make sense, if it's one of the only ways to get stable medium-term funding.
From a funder's point of view, it seems nuts.
(again, less nuts if the goal were [incremental progress on prosaic approaches, and generation of a respectable publication record])

As a concrete proposal, if anyone wants to reboot Refine or similar, I'd be interested to consider that while wearing my Manifund Regrantor hat.

Yeah that looks good, except that it takes an order of magnitude longer to get going on conceptual alignment directions. I'll message Adam to hear what happened with that.

For reference there's this: What I learned running Refine 
When I talked to Adam about this (over 12 months ago), he didn't think there was much to say beyond what's in that post. Perhaps he's updated since.

My sense is that I view it as more of a success than Adam does. In particular, I think it's a bit harsh to solely apply the [genuinely new directions discovered] metric. Even when doing everything right, I expect the hit rate to be very low there, with [variation on current framing/approach] being the most common type of success.

Agreed that Refine's timescale is clearly too short.
However, a much longer program would set a high bar for whoever's running it.
Personally, I'd only be comfortable doing so if the setup were flexible enough that it didn't seem likely to limit the potential of participants (by being less productive-in-the-sense-desired than counterfactual environments).

Ah thanks!

> In particular, I think it's a bit harsh to solely apply the [genuinely new directions discovered] metric. Even when doing everything right, I expect the hit rate to be very low there, with [variation on current framing/approach] being the most common type of success.

Mhm. In fact I'd want to apply a bar that's even lower, or at least different: [the extent to which the participants (as judged by more established alignment thinkers) seem to be well on the way to developing new promising directions--e.g. being relentlessly resourceful including at the meta-level; having both appropriate Babble and appropriate Prune; not shying away from the hard parts].

> the setup were flexible enough that it didn't seem likely to limit the potential of participants (by being less productive-in-the-sense-desired than counterfactual environments).

Agree that this is an issue, but I think it can be addressed--certainly at least well enough that there'd be worthwhile value-of-info in running such a thing.

I'd be happy to contribute a bit of effort, if someone else is taking the lead. I think most of my efforts will be directed elsewhere, but for example I'd be happy to think through what such a program should look like; help write justificatory parts of grant applications; and maybe mentor / similar.

Report back if you get details, I'm curious.

I have, and I also remember seeing Adam’s original retrospective, but I always found it unsatisfying. Thanks anyway!

> I think there might be a simple miscommunication here: in our title and report we use "talent needs" to refer to "job and funding opportunities that could use talent." Importantly, we generally make a descriptive, not a normative, claim about the current job and funding opportunities.

I think the title of this post is actively misleading if that's what you're trying to convey. "Defining" a term to mean some specific thing, which does not match how lots of readers will interpret it (especially in the title!), will in general make your writing not communicate what your "definition" claims to be trying to communicate.

If the post is about job openings and grant opportunities, then it should say that at the top, rather than "talent needs".

I can understand if some people are confused by the title, but we do say "the talent needs of safety teams" in the first sentence. Granted, this doesn't explicitly reference "funding opportunities" too, but it does make it clear that it is the (unfulfilled) needs of existent safety teams that we are principally referring to.

We changed the title. I don't think keeping the previous title was aiding understanding at this point.

Great post, but there is one part I'd like to push back on:

Iterators are also easier to identify, both by their resumes and demonstrated skills. If you compare two CVs of postdocs that have spent the same amount of time in academia, and one of them has substantially more papers (or GitHub commits) to their name than the other (controlling for quality), you’ve found the better Iterator. Similarly, if you compare two CodeSignal tests with the same score but different completion times, the one completed more quickly belongs to the stronger Iterator.

This seems like a bit of an over-claim. I would endorse a weaker claim, like "in the presence of a high volume of applicants, CodeSignal tests, GitHub commits, and paper count statistically provide some signal," but the reality of work in the fields of research and software development is often such that there isn't a clean correspondence between these measures and someone's performance. In addition, all three of these measures are quite easy to game (or Goodhart).

For example, in research alone, not every paper entails the same-sized project; two high-quality papers could differ by an order of magnitude in the amount of work required to produce them. Not every research bet pays off, either: some projects don't result in papers, and research management often plays a role in which directions get pursued (and whether unproductive ones get dropped). There are also many researchers who have made a career out of getting their names on as many papers as possible; there is an entire science to doing this that is completely independent of your actual research abilities.

In the case of CodeSignal evaluations, the signal is likewise relatively low-dimensional and primarily conveys one thing: enough experience with a relatively small set of patterns that one can do the assessment very quickly. I've taken enough of these and seen enough reviews from senior engineers on CodeSignal tests to know that they capture only a small, specific part of what it takes to be a good engineer, and that they overemphasize speed. Speed is not the main thing you want from an actual senior engineer; you want quality as well as maintainability and readability, which are often at odds with speed. A senior engineer's first instinct is generally not to jump in and start spitting out lines of code like their life depends on it. Then there's the issue of how hackable/gameable the assessments are; senior engineer Yanir Seroussi has a good blog post on CodeSignal specifically: https://yanirseroussi.com/2023/05/26/how-hackable-are-automated-coding-assessments/

I'm definitely not arguing that these metrics are useless, however. They do provide some signal (especially if the volume of applicants is high), but I'd suggest that we see them as imperfect proxies that we're forced to use due to insufficient manpower for comprehensive candidate evaluations, rather than actually capturing some kind of ground truth.

Yeah, I basically agree with this nuance. MATS really doesn't want to overanchor on CodeSignal tests or publication count in scholar selection.

This is a brilliant post, thanks. I appreciate the breakdown of different types of contributors and how orgs have expressed the need for some types of contributors over others.