So You Want to Work at a Frontier AI Lab

by Joe Rogero
11th Jun 2025
Linkpost from intelligence.org
9 min read

12 comments, sorted by top scoring

Kaj_Sotala · 3mo

> Given the above, are there any lines of reasoning that might make a job at an AI lab net positive?

I think one missing line of reasoning is something like a question - how are we ever going to get AIs aligned if the leading AI labs have no alignment researchers?

It does seem plausible that alignment efforts will actually accelerate capabilities progress. But at the same time, the only way we'll get an aligned AGI is if the entity building the AGI... actually tries to align it. For which they need people with some idea of how to do that. You say that none of the current labs are on track to solve the hard problems, but isn't that an argument for joining the labs to do alignment work, so that they'd have better odds of solving those problems?

(For what it's worth, I do agree that joining OpenAI to do alignment research looks like a lost cause, but Anthropic seems to at least be trying.)

You say:

> Today, if you are hired by a frontier AI lab to do machine learning research, then odds are you are already competent enough to do high-quality research elsewhere.

Of course, you can try to do alignment work outside the labs, but for the labs to actually adopt that work, there need to be actual alignment researchers inside the labs to take the results of that work and apply it to their products. If that work gets done but none of the organizations building AGI do anything about it, then it's effectively wasted.

Joe Rogero · 3mo

Anthropic is indeed trying. Unfortunately, they are not succeeding, and they don't appear to be on track to notice this fact and actually stop. 

If Anthropic does not keep up with the reckless scaling of e.g. OpenAI, they will likely cease to attract investment and wither on the vine. But aligning superintelligence is harder than building it. A handful of alignment researchers working alongside capabilities folks aren't going to cut it. Anthropic cannot afford to delay scaling; even if their alignment researchers advised against training the next model, Anthropic could not afford to heed them for long. 

I'm primarily talking about the margin when I advise folks not to go work at Anthropic, but even if the company had literally zero dedicated alignment researchers, I question the claim that the capabilities folks would be unable to integrate publicly available alignment research. If they had a Manual of Flawless Alignment produced by diligent outsiders, they could probably use it. (Though even then, we would not be safe, since some labs would inevitably cut corners.) 

I think the collective efforts of humanity can produce such a Manual given time. But in the absence of such a Manual, scaling is suicide. If Anthropic builds superintelligence at approximately the same velocity as everyone else while trying really really hard to align it, everyone dies anyway. 

307th · 3mo

I really disagree with this piece and others like it. I think there's a selectively applied fatalism about frontier labs that is entirely unwarranted. Some examples of this selective fatalism:

> Each lab’s emphasis on alignment varies, but none are on track to solve the hard problems, or to prevent these machines from growing irretrievably incompatible with human life.

The entire argument for avoiding frontier labs falls apart if you admit even a 20% likelihood that frontier labs will create aligned superintelligence, because that 20% likelihood implies that a motivated person joining could push it upwards, which would then be an incomprehensibly beneficial and heroic thing for that person to do.

> I don’t expect the marginal extra researcher to substantially improve these odds, even if they manage to resist the oppressive weight of subtle and unsubtle incentives. 

Why not? And, why would they have to substantially improve these odds? Pushing the odds from 20% to 20.01% would be an incredible accomplishment for one person.

> The claim: Working within a lab can position a safety-conscious individual to influence the course of that lab’s decisions. 

> My assessment: I admit I have a hard time steelmanning this case. It seems straightforwardly true that no individual entering the field right now will be meaningfully positioned to slow the development of superhuman AI from inside a lab. 

A group is composed of people. The specific beliefs of the people in that group will be important for deciding what that group does.

If you shake off the fatalism and look at things clearly, you should realize: joining a frontier lab is an incredible opportunity to make things go better. If anyone has the skills to go for it I highly recommend they gather their courage and do so.

Mass_Driver · 3mo

I suspect Joe would agree with me that the current odds that AI developers solve superalignment are significantly less than 20%.

Even if we concede your estimate of 20% for the sake of argument, though, what price are you likely to pay for increasing the odds of success by 0.01%? Suppose that, given enough time, nonprofit alignment researchers would eventually solve superalignment with 80% odds. In order to increase, e.g., Anthropic's odds of success by 0.01%, are you boosting Anthropic's capabilities in a way that shortens timelines, cutting the time the nonprofit alignment teams have to solve superalignment and thereby reducing their odds of success by at least 0.0025%? If so, you've done net harm. If not, why not? What do you find unconvincing about Joe's arguments that most for-profit alignment work has at least some applicability to capabilities?
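To spell out one way the 0.0025% figure can be reached (the additive model below is an assumption for illustration, not something stated in the comment): if the nonprofits only need to succeed in the worlds where the lab does not, then with these numbers a small boost to the lab's odds is worth only about a quarter as much as an equal-sized hit to the nonprofits' odds. A minimal sketch:

```python
# A minimal sketch of the implied break-even arithmetic, under an assumed
# additive model: the nonprofits only need to succeed in the worlds where
# the lab does not.
#   P(good outcome) = p_lab + (1 - p_lab) * p_nonprofit

p_lab = 0.20        # assumed odds a frontier lab solves superalignment
p_nonprofit = 0.80  # assumed odds nonprofits solve it, given enough time


def p_good(p_lab: float, p_nonprofit: float) -> float:
    """Probability of a good outcome under the additive model."""
    return p_lab + (1 - p_lab) * p_nonprofit


baseline = p_good(p_lab, p_nonprofit)                  # 0.84

# Joining adds 0.01 percentage points to the lab's odds...
gain = p_good(p_lab + 0.0001, p_nonprofit) - baseline  # ~ +0.00002

# ...so the shortened timeline only needs to shave ~0.0025 percentage points
# off the nonprofits' odds before the move becomes net negative.
breakeven_loss = gain / (1 - (p_lab + 0.0001))         # ~ 0.000025, i.e. 0.0025%

print(f"baseline P(good):  {baseline:.4f}")
print(f"gain from joining: {gain:.6f}")
print(f"break-even loss in nonprofit odds: {breakeven_loss:.6f}")
```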

307th · 3mo

I think if you do concede that superalignment is tractable at a frontier lab, it is pretty clear that joining and working on alignment will have benefits that far outweigh any speedup it causes. You could construct probabilities such that that's not true; I just don't think those probabilities would be realistic.

I also think that people who argue against working at a frontier lab are burying the lede. It is often phrased as a common-sense proposition that anyone who accepts the possibility of X-risk should agree with. Then you get into the discussion and it turns out that the entire argument is premised on extremely controversial priors that most people who believe in X-risk from AI do not agree with. I don't mind debating those priors, but it seems like a different conversation - rather than "don't work at a frontier lab", your headline should be "frontier labs will fail at alignment while nonprofits can succeed, here's why".

Mass_Driver · 3mo

Well, I can't change the headline; I'm just a commenter. However, I think the reason why "frontier labs will fail at alignment while nonprofits can succeed" is that frontier labs are only pretending to try to solve alignment -- it's not actually a serious goal of their leadership, and it's not likely to get meaningful support in terms of compute, recruiting, data, or interdepartmental collaboration. In fact, the leadership will probably actively interfere with your work on a regular basis, because the intermediate conclusions you're reaching will get in the way of their profits and hurt their PR. In order to do useful superalignment research, I suspect you sometimes need to warn about or at least openly discuss the serious threats that are posed by increasingly advanced AI, but the business model of frontier labs depends on pretending that none of those threats are actually serious. By contrast, the main obstacle at a nonprofit is that it might not have much funding, but at least whatever funding it does have will be earnestly directed at supporting your team's work.

307th · 3mo

> In order to do useful superalignment research, I suspect you sometimes need to warn about or at least openly discuss the serious threats that are posed by increasingly advanced AI, but the business model of frontier labs depends on pretending that none of those threats are actually serious.

I think this is overly cynical. Demis Hassabis, Sam Altman, and Dario Amodei all signed the statement on AI risk:

"Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war."

They don't talk about it all the time but if someone wants to discuss the serious threats internally, there is plenty of external precedent for them to do so.

> frontier labs are only pretending to try to solve alignment 

This is probably the main driver of our disagreement. I think hands-off theoretical approaches are pretty much guaranteed to fail, and that successful alignment will look like normal deep learning work. I'd guess you feel the opposite (correct me if I'm wrong), which would explain why it looks to you like they aren't really trying and it looks to me like they are.

Mass_Driver · 3mo

> > frontier labs are only pretending to try to solve alignment
>
> This is probably the main driver of our disagreement.

I agree with your diagnosis! I think Sam Altman is a sociopathic liar, so the fact that he signed the statement on AI risk doesn't convince me that he cares about alignment. I feel reasonably confident about that belief. Zvi's series on Moral Mazes applies here: I don't claim that you literally can't mention existential risk at OpenAI, but if you show signs of being earnestly concerned enough about it to interfere with corporate goals, then I believe you'll be sidelined.

I'm much less confident about whether or not successful alignment looks like normal deep learning work; I know more about corporate behavior than I do about technical AI safety. It seems odd and unlikely to me that the same kind of work (normal deep learning) that looks like it causes a series of major problems (power-seeking, black boxes, emergent goals) when you do a moderate amount of it would wind up solving all of those same problems when you do a lot of it, but I'm not enough of a technical expert to be sure that that's wrong.

Because there are independent, non-technical reasons for people to want to believe that normal deep learning will solve alignment (it means they get to take fun, high-pay, high-status jobs at AI developers without feeling guilty about it), if you show me a random person who believes this and I don't know anything about their incorruptibility or the clarity of their thinking ahead of time, then my prior is that most of the people in the distribution this person was drawn from probably arrived at the belief mostly out of convenience and temptation, rather than by becoming technically convinced of the merits of a position that seems a priori unlikely to me. However, I can't be sure -- perhaps it's more likely than I think that normal deep learning can solve alignment.

307th · 3mo

By "it will look like normal deep learning work" I don't mean it will be exactly the same as mainstream capabilities work - e.g. RLHF was both "normal deep learning work" and also notably different from all other RL at the time. Same goes for constitutional AI.

What seems promising to me is paying close attention to how we're training the models and how they behave, thinking about their psychology and how the training influences that psychology, and reasoning about how that will change in the next generation.

 

> It seems odd and unlikely to me that the same kind of work (normal deep learning) that looks like it causes a series of major problems (power-seeking, black boxes, emergent goals) when you do a moderate amount of it would wind up solving all of those same problems when you do a lot of it, but I'm not enough of a technical expert to be sure that that's wrong.

What are we comparing deep learning to here? Black box - 100% granted. 

But for the other problems - power-seeking and emergent goals - I think they will be a problem with any AI system, and in fact they have turned out to be much less of a problem in deep learning than I would have expected. Deep learning is basically short-sighted and interpolative rather than extrapolative, which means that when you train it on some set of goals, it by default tries to pursue those goals in a short-sighted way that makes sense. If you train it on poorly formed goals, you can still get bad behaviour, and as it gets smarter we'll have more issues, but LLMs are a very good base to start from - they're highly capable, understand natural language, and aren't power-seeking.

 

In contrast, the doomed theoretical approaches I have in mind are things like provably safe AI. With these approaches you have two problems: 1) a whole new way of doing AI, which won't work, and 2) the theoretical advantage - that if you can precisely specify what your alignment target is, it will optimize for it - is in fact a terrible disadvantage, since you won't be able to precisely specify your alignment target.

> Because there are independent, non-technical reasons for people to want to believe that normal deep learning will solve alignment (it means they get to take fun, high-pay, high-status jobs at AI developers without feeling guilty about it)

This is what I mean about selective cynicism! I've heard the exact same argument about theoretical alignment work - "mainstream deep learning is very competitive and hard; alignment work means you get a fun nonprofit research job" - and I don't find it convincing in either case.

MondSemmel · 3mo

> The entire argument for avoiding frontier labs falls apart if you admit even a 20% likelihood that frontier labs will create aligned superintelligence,

Sure, but given that none of the frontier labs seem remotely on track to align anything, superintelligence or otherwise, that's an extraordinary claim which requires extraordinary levels of evidence.

307th · 3mo

The frontier labs have certainly succeeded at aligning their models. LLMs have achieved a level of alignment people wouldn't have dreamed of 10 years ago.

Now labs are running into issues with the reasoning models, but this doesn't at all seem insurmountable.

MondSemmel · 3mo

Contemporary AI models are not "aligned" in any sense that would help the slightest bit against a superintelligence. You need stronger guardrails against stronger AI capabilities, and current "alignment" doesn't even prevent stuff like ChatGPT's recent sycophancy, or jailbreaking.

Careers · AI · Frontpage

Several promising software engineers have asked me: Should I work at a frontier AI lab? 

My answer is always “No.” 

This post explores the fundamental problem with frontier labs, some of the most common arguments in favor of working at one, and why I don’t buy these arguments. 

The Fundamental Problem

The primary output of frontier AI labs—such as OpenAI, Anthropic, Meta, and Google DeepMind—is research that accelerates the capabilities of frontier AI models and hastens the arrival of superhuman machines. Each lab’s emphasis on alignment varies, but none are on track to solve the hard problems, or to prevent these machines from growing irretrievably incompatible with human life. In the absence of an ironclad alignment procedure, frontier capabilities research accelerates the extinction of humanity. As a very strong default, I expect signing up to assist such research to be one of the gravest mistakes a person can make.

Some aspiring researchers counter: “I know that, but I want to do safety research on frontier models. I’ll simply refuse to work directly on capabilities.” Plans like these, while noble, dramatically misunderstand the priorities and incentives of scaling labs. The problem isn’t that you will be forced to work on capabilities; the problem is that the vast majority of safety work conducted by the labs enables or excuses continued scaling while failing to address the hard problems of alignment. 

You Will Be Assimilated

AI labs are under overwhelming institutional pressure to push the frontier of machine learning. This pressure can distort everything lab employees think, do, and say.

Former OpenAI Research Scientist Richard Ngo noticed this effect firsthand: 

This distortion affects research directions even more strongly. It’s perniciously easy to “safetywash” despite every intention to the contrary.

The overlap between alignment and capabilities research compounds this effect. Many efforts to understand and control the outputs of machine learning models in the short term not only can be used to enhance the next model release, but are often immediately applied this way. 

  • Reinforcement learning from human feedback (RLHF) represented a major breakthrough for marketable chatbots.
  • Scalable oversight, a popular component of alignment plans, fundamentally relies on building AIs that equal or surpass human researchers while alignment remains unsolved.
  • Even evaluations of potentially dangerous capabilities can be hill-climbed in pursuit of marketable performance gains. 

Mechanistic interpretability has some promising use cases, but it’s not a solution in the way that some have claimed, and it runs the risk of being applied to inspire algorithmic improvements. To the extent that future interpretability work could enable human researchers to deeply understand how neural networks function, it could also be used to generate efficiency boosts that aren’t limited to alignment research. 

Make no mistake: Recursive self-improvement leading to superintelligence is the strategic mandate of every frontier AI lab. Research that does not support this goal is strongly selected against, and financial incentives push hard for downplaying the long-term consequences of blindly scaling.

Right now, I don’t like anyone’s odds of a good outcome. I have on occasion been asked whose plan seems the most promising. I think this is the wrong question to ask. The plans to avert extinction are all terrible, when they exist at all. A better question is: Who is capable of noticing their efforts are failing, and will stop blindly scaling until their understanding improves? This is the test that matters, and every major lab fails it miserably. 

If you join them, you likely will too. 

The Arguments

Given the above, are there any lines of reasoning that might make a job at an AI lab net positive? Here, I attempt to address the strongest cases I’ve heard for working at a frontier lab. 

Labs may develop useful alignment insights

The claim: Working at a major lab affords research opportunities that are difficult to find elsewhere. The ability to tinker with frontier models, the concentration of research talent, and the access to compute resources make frontier labs the best place to work on alignment before an AI scales to superintelligence. 

My assessment: Labs like Anthropic and DeepMind indeed deserve credit for landmark work like Alignment Faking in Large Language Models and a significant portion of the world’s interpretability research. 

I’m glad this research exists. All else equal, I’d like to see more of it. But all else is not equal. Massively more effort is directed at making AI more powerful and, to the extent that work of this kind can be used to advance capabilities, if that work is done at a lab, it will be used to advance capabilities.

Their alignment efforts strike me as too little, too late. No lab has outlined a workable approach to aligning smarter-than-human AI. They don’t seem likely to fix this before they get smarter-than-human AI. Their existing safety frameworks imply unreasonable confidence. 

I don’t expect the marginal extra researcher to substantially improve these odds, even if they manage to resist the oppressive weight of subtle and unsubtle incentives. 

Also, bear in mind that if the organization you choose does develop important alignment insights, you might be under strict binding agreement not to take them outside the company. 

As labs scale, their models get more likely to become catastrophically dangerous. Massively more resources are required to get an aligned superintelligence than a merely functioning superintelligence, and the leading labs are devoting most of their resources to the second thing. As we don't and can't know where the precipice is, all of this work is net irresponsible and shouldn't be happening.

Insiders are better positioned to whistleblow

The claim: As research gets more secretive and higher stakes, and as more unshared models are used internally, it will be harder for both the safety community and the general public to know when a tipping point is reached. It may be valuable to the community as a whole to retain some talent on the inside that can sound the alarm at a critical moment. 

My assessment: I’m sympathetic to the idea that someone should be positioned to sound the alarm if there are signs of imminent takeoff or takeover. But I have my doubts about the effectiveness of this plan, for broadly three reasons: 

  1. Seniority gates access.
  2. You only get one shot.
  3. Clarions require clarity. 

Seniority gates access. Labs are acutely aware of the risk of leaks, and I expect them to reserve key information for the eyes of senior researchers. I also expect access restrictions to rise with the stakes. Some people already at the labs do claim in private that they plan to whistleblow if they see an imminent danger. You are unlikely to surpass these people in access. Recall that these labs were (mostly) founded by safety-conscious people, and not all of them have left or fully succumbed to industry pressures. 

You only get one shot. Whistleblowing is an extremely valuable service, but it’s not generally a repeatable one. Regulations might prevent labs from firing known whistleblowers, but they will likely be sidelined away from sensitive research. This means that a would-be whistleblower needs to think very carefully about exactly when and how to spend their limited capital, and that can mean delaying a warning until it’s too late to matter. 

Clarions require clarity. There likely won’t be a single tipping point. For those with the eyes to recognize the danger, evidence abounds that frontier labs are taking on inordinate risk in their reckless pursuit of superintelligence. Given the strength of the existing evidence, it's hard to tell what policymakers or the general public might consider a smoking gun. That is, would-be whistleblowers are forced to play a dangerous game of chicken, wherein to-them-obvious evidence of wrongdoing may, in fact, not be sufficiently legible to outsiders.

There’s a weaker version of this approach one could endorse: applying to an AI lab with the intent to keep a finger on the pulse of progress and quietly alert allies to important developments. The benefits of such a path are even more questionable, however, and your leverage will remain limited by seniority, disclosure restrictions, and the value of the knowledge you can actually share.

(If you already work at a frontier lab, I still recommend pivoting. You should also be aware of the existence of AI Lab Watch as a resource for this and related decisions.) 

Insiders are better positioned to steer lab behavior

The claim: Working within a lab can position a safety-conscious individual to influence the course of that lab’s decisions. 

My assessment: I admit I have a hard time steelmanning this case. It seems straightforwardly true that no individual entering the field right now will be meaningfully positioned to slow the development of superhuman AI from inside a lab. 

The pace is set by the most reckless driver. Many of the people who worked at OpenAI and who might be reasonably characterized as “safety-focused” ended up leaving for various reasons, whether to avoid bias and do better work elsewhere or because they weren’t getting enough support. Those who remain have a tight budget of professional capital, which they must manage carefully in order to retain their positions. 

To effectively steer a scaling lab from the inside, you’d have to:

  • Secure a position of influence in an organization already captured by an accelerationist paradigm;
  • Avoid being captured yourself;
  • Push an entire organization against the grain of its economic incentives; and
  • Withstand the resulting backlash long enough to make a difference. 

It seems clear to me that this game was already played, and the incentives won. 

Lab work yields practical experience

The claim: Working at a scaling lab is the best way to gain expertise in machine learning [which can then be leveraged into solving the alignment problem]. 

My assessment: I used to buy this argument, before a coworker pointed out to me just how high the bar is for such work. Today, if you are hired by a frontier AI lab to do machine learning research, then odds are you are already competent enough to do high-quality research elsewhere. 

AI labs have some of the most competitive hiring pipelines on the planet, chaotic hierarchies and team structures, high turnover rates, and limited opportunities for direct mentorship. This is not a good environment for upskilling. 

Better me than the alternative

The claim: If I don’t work at a frontier lab, someone else will. I’m probably more safety-focused than the typical ML engineer at my skill level, so I can make a difference on the margin, and that’s the highest impact I can reasonably expect to have. 

My assessment: This argument seems to rest on three key assumptions: 

  1. You won’t do a better job than another ML engineer, so your work won’t meaningfully improve the capabilities frontier on the margin. 
  2. Your work will meaningfully improve safety on the margin.
  3. Policy is doomed; there's no better way for you to improve humanity's odds outside the lab.

The first may well be true, though I wouldn’t want to put myself in a situation in which humanity’s future could depend on my not being too good at my job. 

The second raises a clear followup question: How are you improving safety? As we’ve briefly discussed above, most opportunities to promote safety at frontier labs are not so promising. Alignment research at the labs is subject to intense pressure to serve the short-term needs of the business and its product; whistleblowing opportunities are rare at best; and attempting to steer the lab against the flow of its incentives hasn’t worked out well for those who tried. 

The third is simply an unforced error. Many policymakers express their sincere concerns behind closed doors, the American public largely supports AI regulation, and a real or perceived crisis can drive rapid change. Dismissing a global halt as intractable is foolish and self-defeating; we can and should know better. 

What to Do Instead? 

Right now, the most urgent and neglected problems are in policy and technical governance. We need smart and concerned people thinking about what might work and preparing concrete proposals for the day when an opportunity arises to implement them. If this interests you, consider applying to orgs such as the AI Futures Project, MIRI, Palisade, NIST, RAND, an AISI, or another policy-focused organization with a need for technical talent. 

If instead you're eager to apply an existing machine learning skillset to projects aimed at decreasing risk on the margin, you might consider METR, Apollo Research, Timaeus, Simplex, Redwood, or ARC. You might also take inspiration from, or look for ways to assist, Jacob Steinhardt’s work and Davidad's work at UK ARIA. 

Regardless of skillset, however, my advice is to avoid any organization that’s explicitly trying to build AGI.