So You Want to Work at a Frontier AI Lab

by Joe Rogero
11th Jun 2025
Linkpost from intelligence.org
9 min read

12 comments, sorted by top scoring

Kaj_Sotala · 3mo

> Given the above, are there any lines of reasoning that might make a job at an AI lab net positive?

I think one missing line of reasoning is something like a question - how are we ever going to get AIs aligned if the leading AI labs have no alignment researchers?

It does seem plausible that alignment efforts will actually accelerate capabilities progress. But at the same time, the only way we'll get an aligned AGI is if the entity building the AGI... actually tries to align it. For which they need people with some idea of how to do that. You say that none of the current labs are on track to solve the hard problems, but isn't that an argument for joining the labs to do alignment work, so that they'd have better odds of solving those problems?

(For what it's worth, I do agree that joining OpenAI to do alignment research looks like a lost cause, but Anthropic seems to at least be trying.)

You say:

> Today, if you are hired by a frontier AI lab to do machine learning research, then odds are you are already competent enough to do high-quality research elsewhere.

Of course, you can try to do alignment work outside the labs, but for the labs to actually adopt that work, there need to be actual alignment researchers inside the labs to take the results of that work and apply it to their products. If that work gets done but none of the organizations building AGI do anything about it, then it's effectively wasted.

Joe Rogero · 3mo

Anthropic is indeed trying. Unfortunately, they are not succeeding, and they don't appear to be on track to notice this fact and actually stop. 

If Anthropic does not keep up with the reckless scaling of e.g. OpenAI, they will likely cease to attract investment and wither on the vine. But aligning superintelligence is harder than building it. A handful of alignment researchers working alongside capabilities folks aren't going to cut it. Anthropic cannot afford to delay scaling; even if their alignment researchers advised against training the next model, Anthropic could not afford to heed them for long. 

I'm primarily talking about the margin when I advise folks not to go work at Anthropic, but even if the company had literally zero dedicated alignment researchers, I question the claim that the capabilities folks would be unable to integrate publicly available alignment research. If they had a Manual of Flawless Alignment produced by diligent outsiders, they could probably use it. (Though even then, we would not be safe, since some labs would inevitably cut corners.) 

I think the collective efforts of humanity can produce such a Manual given time. But in the absence of such a Manual, scaling is suicide. If Anthropic builds superintelligence at approximately the same velocity as everyone else while trying really really hard to align it, everyone dies anyway. 

307th · 3mo

I really disagree with this piece and others like it. I think there's a selectively applied fatalism about frontier labs that is entirely unwarranted. Some examples of this selective fatalism:

> Each lab’s emphasis on alignment varies, but none are on track to solve the hard problems, or to prevent these machines from growing irretrievably incompatible with human life.

The entire argument for avoiding frontier labs falls apart if you admit even a 20% likelihood that frontier labs will create aligned superintelligence, because that 20% likelihood implies that a motivated person joining could push it upwards, which would then be an incomprehensibly beneficial and heroic thing for that person to do.

> I don’t expect the marginal extra researcher to substantially improve these odds, even if they manage to resist the oppressive weight of subtle and unsubtle incentives. 

Why not? And, why would they have to substantially improve these odds? Pushing the odds from 20% to 20.01% would be an incredible accomplishment for one person.

> The claim: Working within a lab can position a safety-conscious individual to influence the course of that lab’s decisions. 

> My assessment: I admit I have a hard time steelmanning this case. It seems straightforwardly true that no individual entering the field right now will be meaningfully positioned to slow the development of superhuman AI from inside a lab. 

A group is composed of people. The specific beliefs of the people in that group will be important for deciding what that group does.

If you shake off the fatalism and look at things clearly, you should realize: joining a frontier lab is an incredible opportunity to make things go better. If anyone has the skills to go for it I highly recommend they gather their courage and do so.

Mass_Driver · 3mo

I suspect Joe would agree with me that the current odds that AI developers solve superalignment are significantly less than 20%.

Even if we concede your estimate of 20% for the sake of argument, though, what price are you likely to pay for increasing the odds of success by 0.01%? Suppose that, given enough time, nonprofit alignment researchers would eventually solve superalignment with 80% odds. In order to increase, e.g., Anthropic's odds of success by 0.01%, are you boosting Anthropic's capabilities in a way that shortens timelines, cutting the time the nonprofit alignment teams have to solve superalignment and thereby reducing their odds of success by at least 0.0025%? If so, you've done net harm. If not, why not? What do you find unconvincing about Joe's arguments that most for-profit alignment work has at least some applicability to capabilities?
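To spell out one way the 0.0025% figure can be reached (the additive model below is an assumption for illustration, not something stated in the comment): if the nonprofits only need to succeed in the worlds where the lab does not, then with these numbers a small boost to the lab's odds is worth only about a quarter as much as an equal-sized hit to the nonprofits' odds. A minimal sketch:

```python
# A minimal sketch of the implied break-even arithmetic, under an assumed
# additive model: the nonprofits only need to succeed in the worlds where
# the lab does not.
#   P(good outcome) = p_lab + (1 - p_lab) * p_nonprofit

p_lab = 0.20        # assumed odds a frontier lab solves superalignment
p_nonprofit = 0.80  # assumed odds nonprofits solve it, given enough time


def p_good(p_lab: float, p_nonprofit: float) -> float:
    """Probability of a good outcome under the additive model."""
    return p_lab + (1 - p_lab) * p_nonprofit


baseline = p_good(p_lab, p_nonprofit)                  # 0.84

# Joining adds 0.01 percentage points to the lab's odds...
gain = p_good(p_lab + 0.0001, p_nonprofit) - baseline  # ~ +0.00002

# ...so the shortened timeline only needs to shave ~0.0025 percentage points
# off the nonprofits' odds before the move becomes net negative.
breakeven_loss = gain / (1 - (p_lab + 0.0001))         # ~ 0.000025, i.e. 0.0025%

print(f"baseline P(good):  {baseline:.4f}")
print(f"gain from joining: {gain:.6f}")
print(f"break-even loss in nonprofit odds: {breakeven_loss:.6f}")
```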

307th · 3mo

I think if you do concede that superalignment is tractable at a frontier lab, it is pretty clear that joining and working on alignment will have benefits that far outweigh any speedup it causes. You could construct probabilities such that that's not true; I just don't think those probabilities would be realistic.

I also think that people who argue against working at a frontier lab are burying the lede. It is often phrased as a common-sense proposition that anyone who accepts the possibility of X-risk should agree with. Then you get into the discussion and it turns out that the entire argument is premised on extremely controversial priors that most people who believe in X-risk from AI do not agree with. I don't mind debating those priors, but it seems like a different conversation - rather than "don't work at a frontier lab", your headline should be "frontier labs will fail at alignment while nonprofits can succeed, here's why".

Mass_Driver · 3mo

Well, I can't change the headline; I'm just a commenter. However, I think the reason why "frontier labs will fail at alignment while nonprofits can succeed" is that frontier labs are only pretending to try to solve alignment -- it's not actually a serious goal of their leadership, and it's not likely to get meaningful support in terms of compute, recruiting, data, or interdepartmental collaboration. In fact, the leadership will probably actively interfere with your work on a regular basis, because the intermediate conclusions you're reaching will get in the way of their profits and hurt their PR. In order to do useful superalignment research, I suspect you sometimes need to warn about or at least openly discuss the serious threats that are posed by increasingly advanced AI, but the business model of frontier labs depends on pretending that none of those threats are actually serious. By contrast, the main obstacle at a nonprofit is that it might not have much funding, but at least whatever funding it does have will be earnestly directed at supporting your team's work.

307th · 3mo

> In order to do useful superalignment research, I suspect you sometimes need to warn about or at least openly discuss the serious threats that are posed by increasingly advanced AI, but the business model of frontier labs depends on pretending that none of those threats are actually serious.

I think this is overly cynical. Demis Hassabis, Sam Altman, and Dario Amodei all signed the statement on AI risk:

"Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war."

They don't talk about it all the time but if someone wants to discuss the serious threats internally, there is plenty of external precedent for them to do so.

> frontier labs are only pretending to try to solve alignment 

This is probably the main driver of our disagreement. I think hands-off theoretical approaches are pretty much guaranteed to fail, and that successful alignment will look like normal deep learning work. I'd guess you feel the opposite (correct me if I'm wrong), which would explain why it looks to you like they aren't really trying and it looks to me like they are.

Mass_Driver · 3mo

> > frontier labs are only pretending to try to solve alignment
>
> This is probably the main driver of our disagreement.

I agree with your diagnosis! I think Sam Altman is a sociopathic liar, so the fact that he signed the statement on AI risk doesn't convince me that he cares about alignment. I feel reasonably confident about that belief. Zvi's series on Moral Mazes applies here: I don't claim that you literally can't mention existential risk at OpenAI, but if you show signs of being earnestly concerned enough about it to interfere with corporate goals, then I believe you'll be sidelined.

I'm much less confident about whether or not successful alignment looks like normal deep learning work; I know more about corporate behavior than I do about technical AI safety. It seems odd and unlikely to me that the same kind of work (normal deep learning) that looks like it causes a series of major problems (power-seeking, black boxes, emergent goals) when you do a moderate amount of it would wind up solving all of those same problems when you do a lot of it, but I'm not enough of a technical expert to be sure that that's wrong.

Because there are independent, non-technical reasons for people to want to believe that normal deep learning will solve alignment (it means they get to take fun, high-pay, high-status jobs at AI developers without feeling guilty about it), if you show me a random person who believes this and I don't know anything about their incorruptibility or the clarity of their thinking ahead of time, then my prior is that most of the people in the distribution this person was drawn from probably arrived at the belief mostly out of convenience and temptation, rather than by becoming technically convinced of the merits of a position that seems a priori unlikely to me. However, I can't be sure -- perhaps it's more likely than I think that normal deep learning can solve alignment.

307th · 3mo

By "it will look like normal deep learning work" I don't mean it will be exactly the same as mainstream capabilities work - e.g. RLHF was both "normal deep learning work" and also notably different from all other RL at the time. Same goes for constitutional AI.

What seems promising to me is paying close attention to how we're training the models and how they behave, thinking about their psychology and how the training influences that psychology, and reasoning about how that will change in the next generation.

 

> It seems odd and unlikely to me that the same kind of work (normal deep learning) that looks like it causes a series of major problems (power-seeking, black boxes, emergent goals) when you do a moderate amount of it would wind up solving all of those same problems when you do a lot of it, but I'm not enough of a technical expert to be sure that that's wrong.

What are we comparing deep learning to here? Black box - 100% granted. 

But for the other problems - power-seeking and emergent goals - I think they will be a problem with any AI system, and in fact they have turned out to be much less of a problem in deep learning than I would have expected. Deep learning is basically short-sighted and interpolative rather than extrapolative, which means that when you train it on some set of goals, it by default tries to pursue those goals in a short-sighted way that makes sense. If you train it on poorly formed goals, you can still get bad behaviour, and as it gets smarter we'll have more issues, but LLMs are a very good base to start from - they're highly capable, understand natural language, and aren't power-seeking.

 

In contrast, the doomed theoretical approaches I have in mind are things like provably safe AI. With these approaches you have two problems: 1) a whole new way of doing AI, which won't work, and 2) the theoretical advantage - that if you can precisely specify what your alignment target is, it will optimize for it - is in fact a terrible disadvantage, since you won't be able to precisely specify your alignment target.

> Because there are independent, non-technical reasons for people to want to believe that normal deep learning will solve alignment (it means they get to take fun, high-pay, high-status jobs at AI developers without feeling guilty about it)

This is what I mean about selective cynicism! I've heard the exact same argument about theoretical alignment work - "mainstream deep learning is very competitive and hard; alignment work means you get a fun nonprofit research job" - and I don't find it convincing in either case.

MondSemmel · 3mo

> The entire argument for avoiding frontier labs falls apart if you admit even a 20% likelihood that frontier labs will create aligned superintelligence,

Sure, but given that none of the frontier labs seem remotely on track to align anything, superintelligence or otherwise, that's an extraordinary claim which requires extraordinary levels of evidence.

307th · 3mo

The frontier labs have certainly succeeded at aligning their models. LLMs have achieved a level of alignment people wouldn't have dreamed of 10 years ago.

Now labs are running into issues with the reasoning models, but this doesn't at all seem insurmountable.

MondSemmel · 3mo

Contemporary AI models are not "aligned" in any sense that would help the slightest bit against a superintelligence. You need stronger guardrails against stronger AI capabilities, and current "alignment" doesn't even prevent stuff like ChatGPT's recent sycophancy, or jailbreaking.

Careers · AI · Frontpage

Several promising software engineers have asked me: Should I work at a frontier AI lab? 

My answer is always “No.” 

This post explores the fundamental problem with frontier labs, some of the most common arguments in favor of working at one, and why I don’t buy these arguments. 

The Fundamental Problem

The primary output of frontier AI labs—such as OpenAI, Anthropic, Meta, and Google DeepMind—is research that accelerates the capabilities of frontier AI models and hastens the arrival of superhuman machines. Each lab’s emphasis on alignment varies, but none are on track to solve the hard problems, or to prevent these machines from growing irretrievably incompatible with human life. In the absence of an ironclad alignment procedure, frontier capabilities research accelerates the extinction of humanity. As a very strong default, I expect signing up to assist such research to be one of the gravest mistakes a person can make.

Some aspiring researchers counter: “I know that, but I want to do safety research on frontier models. I’ll simply refuse to work directly on capabilities.” Plans like these, while noble, dramatically misunderstand the priorities and incentives of scaling labs. The problem isn’t that you will be forced to work on capabilities; the problem is that the vast majority of safety work conducted by the labs enables or excuses continued scaling while failing to address the hard problems of alignment. 

You Will Be Assimilated

AI labs are under overwhelming institutional pressure to push the frontier of machine learning. This pressure can distort everything lab employees think, do, and say.

Former OpenAI Research Scientist Richard Ngo noticed this effect firsthand: 

This distortion affects research directions even more strongly. It’s perniciously easy to “safetywash” despite every intention to the contrary.

The overlap between alignment and capabilities research compounds this effect. Many efforts to understand and control the outputs of machine learning models in the short term not only can be used to enhance the next model release, but are often immediately applied this way. 

  • Reinforcement learning from human feedback (RLHF) represented a major breakthrough for marketable chatbots.
  • Scalable oversight, a popular component of alignment plans, fundamentally relies on building AIs that equal or surpass human researchers while alignment remains unsolved.
  • Even evaluations of potentially dangerous capabilities can be hill-climbed in pursuit of marketable performance gains. 

Mechanistic interpretability has some promising use cases, but it’s not a solution in the way that some have claimed, and it runs the risk of being applied to inspire algorithmic improvements. To the extent that future interpretability work could enable human researchers to deeply understand how neural networks function, it could also be used to generate efficiency boosts that aren’t limited to alignment research. 

Make no mistake: Recursive self-improvement leading to superintelligence is the strategic mandate of every frontier AI lab. Research that does not support this goal is strongly selected against, and financial incentives push hard for downplaying the long-term consequences of blindly scaling.

Right now, I don’t like anyone’s odds of a good outcome. I have on occasion been asked whose plan seems the most promising. I think this is the wrong question to ask. The plans to avert extinction are all terrible, when they exist at all. A better question is: Who is capable of noticing their efforts are failing, and will stop blindly scaling until their understanding improves? This is the test that matters, and every major lab fails it miserably. 

If you join them, you likely will too. 

The Arguments

Given the above, are there any lines of reasoning that might make a job at an AI lab net positive? Here, I attempt to address the strongest cases I’ve heard for working at a frontier lab. 

Labs may develop useful alignment insights

The claim: Working at a major lab affords research opportunities that are difficult to find elsewhere. The ability to tinker with frontier models, the concentration of research talent, and the access to compute resources make frontier labs the best place to work on alignment before an AI scales to superintelligence. 

My assessment: Labs like Anthropic and DeepMind indeed deserve credit for landmark work like Alignment Faking in Large Language Models and a significant portion of the world’s interpretability research. 

I’m glad this research exists. All else equal, I’d like to see more of it. But all else is not equal. Massively more effort is directed at making AI more powerful and, to the extent that work of this kind can be used to advance capabilities, if that work is done at a lab, it will be used to advance capabilities.

Their alignment efforts strike me as too little, too late. No lab has outlined a workable approach to aligning smarter-than-human AI. They don’t seem likely to fix this before they get smarter-than-human AI. Their existing safety frameworks imply unreasonable confidence. 

I don’t expect the marginal extra researcher to substantially improve these odds, even if they manage to resist the oppressive weight of subtle and unsubtle incentives. 

Also, bear in mind that if the organization you choose does develop important alignment insights, you might be under strict binding agreement not to take them outside the company. 

As labs scale, their models get more likely to become catastrophically dangerous. Massively more resources are required to get an aligned superintelligence than a merely functioning superintelligence, and the leading labs are devoting most of their resources to the second thing. As we don't and can't know where the precipice is, all of this work is net irresponsible and shouldn't be happening.

Insiders are better positioned to whistleblow

The claim: As research gets more secretive and higher stakes, and as more unshared models are used internally, it will be harder for both the safety community and the general public to know when a tipping point is reached. It may be valuable to the community as a whole to retain some talent on the inside that can sound the alarm at a critical moment. 

My assessment: I’m sympathetic to the idea that someone should be positioned to sound the alarm if there are signs of imminent takeoff or takeover. But I have my doubts about the effectiveness of this plan, for broadly three reasons: 

  1. Seniority gates access.
  2. You only get one shot.
  3. Clarions require clarity. 

Seniority gates access. Labs are acutely aware of the risk of leaks, and I expect them to reserve key information for the eyes of senior researchers. I also expect access restrictions to rise with the stakes. Some people already at the labs do claim in private that they plan to whistleblow if they see an imminent danger. You are unlikely to surpass these people in access. Recall that these labs were (mostly) founded by safety-conscious people, and not all of them have left or fully succumbed to industry pressures. 

You only get one shot. Whistleblowing is an extremely valuable service, but it’s not generally a repeatable one. Regulations might prevent labs from firing known whistleblowers, but they will likely be sidelined away from sensitive research. This means that a would-be whistleblower needs to think very carefully about exactly when and how to spend their limited capital, and that can mean delaying a warning until it’s too late to matter. 

Clarions require clarity. There likely won’t be a single tipping point. For those with the eyes to recognize the danger, evidence abounds that frontier labs are taking on inordinate risk in their reckless pursuit of superintelligence. Given the strength of the existing evidence, it's hard to tell what policymakers or the general public might consider a smoking gun. That is, would-be whistleblowers are forced to play a dangerous game of chicken, wherein to-them-obvious evidence of wrongdoing may, in fact, not be sufficiently legible to outsiders.

There’s a weaker version of this approach one could endorse: applying to an AI lab with the intent to keep a finger on the pulse of progress and quietly alert allies to important developments. The benefits of such a path are even more questionable, however, and your leverage will remain limited by seniority, disclosure restrictions, and the value of the knowledge you can actually share.

(If you already work at a frontier lab, I still recommend pivoting. You should also be aware of the existence of AI Lab Watch as a resource for this and related decisions.) 

Insiders are better positioned to steer lab behavior

The claim: Working within a lab can position a safety-conscious individual to influence the course of that lab’s decisions. 

My assessment: I admit I have a hard time steelmanning this case. It seems straightforwardly true that no individual entering the field right now will be meaningfully positioned to slow the development of superhuman AI from inside a lab. 

The pace is set by the most reckless driver. Many of the people who worked at OpenAI and who might be reasonably characterized as “safety-focused” ended up leaving for various reasons, whether to avoid bias and do better work elsewhere or because they weren’t getting enough support. Those who remain have a tight budget of professional capital, which they must manage carefully in order to retain their positions. 

To effectively steer a scaling lab from the inside, you’d have to:

  • Secure a position of influence in an organization already captured by an accelerationist paradigm;
  • Avoid being captured yourself;
  • Push an entire organization against the grain of its economic incentives; and
  • Withstand the resulting backlash long enough to make a difference. 

It seems clear to me that this game was already played, and the incentives won. 

Lab work yields practical experience

The claim: Working at a scaling lab is the best way to gain expertise in machine learning [which can then be leveraged into solving the alignment problem]. 

My assessment: I used to buy this argument, before a coworker pointed out to me just how high the bar is for such work. Today, if you are hired by a frontier AI lab to do machine learning research, then odds are you are already competent enough to do high-quality research elsewhere. 

AI labs have some of the most competitive hiring pipelines on the planet, chaotic hierarchies and team structures, high turnover rates, and limited opportunities for direct mentorship. This is not a good environment for upskilling. 

Better me than the alternative

The claim: If I don’t work at a frontier lab, someone else will. I’m probably more safety-focused than the typical ML engineer at my skill level, so I can make a difference on the margin, and that’s the highest impact I can reasonably expect to have. 

My assessment: This argument seems to rest on three key assumptions: 

  1. You won’t do a better job than another ML engineer, so your work won’t meaningfully improve the capabilities frontier on the margin. 
  2. Your work will meaningfully improve safety on the margin.
  3. Policy is doomed; there's no better way for you to improve humanity's odds outside the lab.

The first may well be true, though I wouldn’t want to put myself in a situation in which humanity’s future could depend on my not being too good at my job. 

The second raises a clear followup question: How are you improving safety? As we’ve briefly discussed above, most opportunities to promote safety at frontier labs are not so promising. Alignment research at the labs is subject to intense pressure to serve the short-term needs of the business and its product; whistleblowing opportunities are rare at best; and attempting to steer the lab against the flow of its incentives hasn’t worked out well for those who tried. 

The third is simply an unforced error. Many policymakers express their sincere concerns behind closed doors, the American public largely supports AI regulation, and a real or perceived crisis can drive rapid change. Dismissing a global halt as intractable is foolish and self-defeating; we can and should know better. 

What to Do Instead? 

Right now, the most urgent and neglected problems are in policy and technical governance. We need smart and concerned people thinking about what might work and preparing concrete proposals for the day when an opportunity arises to implement them. If this interests you, consider applying to orgs such as the AI Futures Project, MIRI, Palisade, NIST, RAND, an AISI, or another policy-focused organization with a need for technical talent. 

If instead you're eager to apply an existing machine learning skillset to projects aimed at decreasing risk on the margin, you might consider METR, Apollo Research, Timaeus, Simplex, Redwood, or ARC. You might also take inspiration from, or look for ways to assist, Jacob Steinhardt’s work and Davidad's work at UK ARIA. 

Regardless of skillset, however, my advice is to avoid any organization that’s explicitly trying to build AGI.