Seth Herd's Shortform

Seth Herd

Outside of full-blown deceit-leading-to-coup and sharp-left-turn scenarios where everything looks just fine until we're all dead, alignment and capabilities tend to be significantly intertwined: few things are just Alignment, and it's often hard to determine the ratio of the two (at least without the benefit of tech-tree hindsight). Capabilities are useless if your LLM capably spews stuff that gets you sued, and it's also rapidly becoming the case that a majority of capabilities researchers/engineers, even at superscalers, acknowledge that alignment (or at least safety) is a real problem that actually needs to be worked on, and that their company has a team doing so. (I could name a couple of orgs that seem like exceptions to this, but they're now in a minority.) There's an executive order that mentions the importance of Alignment, the King of England made a speech about it, and even China signed on to a statement about it (though one suspects they meant alignment to the Party).

Capabilities researchers/engineers outnumber alignment researchers/engineers by more than an order of magnitude, and some of them are extremely smart. It seems pretty darned unlikely that any given alignment researcher/engineer has come up with a key capabilities-enhancing idea that has eluded every capabilities researcher/engineer out there, and that will continue to do so for very long (assuming otherwise is also rather intellectually arrogant). [Yes, I know Conjecture sat on chain-of-thought prompting, for a month or two, while multiple other people came up with it independently and then wrote and published papers, or didn't. Any schoolteacher could have told you that was a good idea; it wasn't going to stay secret.]

So (unless you're pretty sure you're a genius), I don't think people should worry quite as much about this as many seem to. Alignment is a difficult, very urgent problem. We're not going to solve it in time while wearing a gag, nor with one hand tied behind our back. Caution makes sense to me, but not the sort of caution that makes it much slower for us to get things done: we're not in a position to slow Capabilities by more than a tiny fraction, no matter how closed-mouthed we are, but we're in a much better position to slow Alignment down. And if your gears-level predictions are about the prospects of things that multiple teams of capabilities engineers are already working on, go ahead and post them. I could be wrong, but I don't think Yann LeCun is reading the Alignment Forum. Yes, ideas can and do diffuse, but that takes a few months, and that's about how far apart most parallel inventions are. If you've been sitting on a capabilities idea for >6 months, you've done literature searches to confirm no one else has published it, and you're not in fact a genius, then there's probably a reason why none of the capabilities people have published it yet.

Seth Herd (6mo)

I've been trying to figure out what's going on in the field of alignment research and X-risk.

Here's one theory: we are having confused discussions about AI strategy, alignment difficulty, and timelines, because all of these depend on gears-level models of possible AGI, directly or indirectly.

And we aren't talking about those gears-level predictions, so as not to accelerate progress if we're right. The better one's gears-level model, the less likely one is to talk about it.

This leads to very abstract and confused discussions.

I don't know what to do about this, but I think it's worth noting.

RogerDearnaley (5mo)
