A Quick List of Some Problems in AI Alignment As A Field

Nicholas Kross

1. MIRI as central point of failure for... a few things...

For the past decade or more, if you read an article saying "AI safety is important", and you thought, "I need to donate or apply to work somewhere", MIRI was the default option. If you looked at FLI or FHI or similar groups, you'd say "they seem helpful, but they're not focused solely on AI safety/alignment, so I should go to MIRI for the best impact."

2. MIRI as central point of failure for learning and secrecy.

MIRI's secrecy (understandable) and their intelligent and creatively-thinking staff (good) have combined into a weird situation: for some research areas, nobody really knows what they've tried and failed/succeeded at, nor the details of how that came to be. Yudkowsky did link some corrigibility papers he labels as failed, but neither he nor MIRI have done similar (or more in-depth) autopsies of their approaches, to my knowledge.

As a result, nobody else can double-check that or learn from MIRI's mistakes. Sure, MIRI people write up their meta-mistakes, but that has limited usefulness, and people still (understandably) disbelieve their approaches anyway. This leads either to making the same meta-mistakes (bad), or to blindly trusting MIRI's approach/meta-approach (bad because...)

3. We need more uncorrelated ("diverse") approaches to alignment.

MIRI was the central point for anyone with any alignment approach, for a very long time. Recently-started alignment groups (Redwood, ARC, Anthropic, Ought, etc.) are different from MIRI, but their approaches are correlated with each other. They all relate to things like corrigibility, the current ML paradigm, IDA, and other approaches that e.g. Paul Christiano would be interested in.

I'm not saying these approaches are guaranteed to fail (or work). I am saying that surviving worlds would have, if not way more alignment groups, definitely way more uncorrelated approaches to alignment. This need not lead to extra risk as long as the approaches are theoretical in nature. Think early-1900s physics gedankenexperiments, and how diverse they may have been.

Or, if you want more hope and less hope at the same time, look at how many wildly incompatible theories have been proposed to explain quantum mechanics. A surviving world would have at least this much of a Cambrian explosion in theories, and would also be better at handling it than we are, in real life, at handling the actual list of quantum theories (in the absence of better experimental evidence).

Simply put, if evidence is dangerous to collect, and every existing theoretical approach is deeply flawed along some axis, then let schools proliferate with little evidence, dammit! This isn't psych, where stuff fails to replicate and people keep doing it. AI alignment is somewhat better coordinated than other theoretical fields... we just overcorrected to putting all our eggs in a few approach baskets.

(Note: if MIRI is willing and able, it could continue being a/the central group for AI alignment, given the points in (1), but it would need to proliferate many schools of thought internally, as per (5) below.)

One problem with this ^[1] is that the AI alignment field as a whole may not have the resources (or the time) to pursue this hits-based strategy. In that case, AI alignment would appear to be bottlenecked on funding, rather than talent directly. That's... news to me. In either case, this requires more fundraising and/or more money-efficient ways to get similar effects to what I'm talking about. (If we're too talent-constrained to pursue a hits-based strategy, it's even more imperative to fix the talent constraints first, as per (4) below.)

Another problem is whether the "winning" approach might come from deeper searching along the existing paths, rather than broader searching in weirder areas. In that case, it could maybe still make sense to proliferate sub-approaches under the existing paths. The rest of the points (especially (4) below) would still apply, and this still relies on the existing paths being... broken enough to call "doom", but not broken enough to try anything too different. This is possible.

EDIT Sept. 9, 2022: John S Wentworth explains here why "just fund randos" is not the way to solve this, and how to do better.

4. How do people get good at this shit?

MIRI wants to hire the most competent people they can. People apply, and are turned away for not being smart/self-taught/security-mindset enough. So far so good.

But then... how do people get good at alignment skills before they're good enough to work at MIRI, or whatever group has the best approach? How do they get good enough to recognize, choose, and/or create the best approaches (which, remember, we need more of)?

Academia is loaded with problems. Existing orgs are already small and selective. Independent research is promising, yet still relies on a patchwork of grants and stuff. By the time you get good enough to get a grant, you have to have spent a lot of time studying this stuff. Unpaid, mind you, and likely with another job/school/whatever taking up your brain cycles.

Here's a (failure?) mode that I and others are already in, but might be too embarrassed to write about: taking weird career/financial risks, in order to obtain the financial security, to work on alignment full-time ^[2]. Anyone more risk-averse (good for alignment!) might just... work a normal job for years to save up, or modestly conclude they're not good enough to work in alignment altogether. If security mindset can be taught at all, this is a shit equilibrium.

Yes, I know EA and the alignment community are both improving at noob-friendliness. I'm glad of this. I'd be more glad if I saw non-academic noob-friendly programs that pay people, with little legible evidence of their abilities, to upskill full-time. IQ or other tests are legal, certainly in a context like this. Work harder on screening for whatever's unteachable, and teaching what is.

5. Secret good ideas + collaboration + more work needed = ???

The good thing about having a central org to coordinate around is that it solves the conflicting requirements of "intellectual sharing" and "infohazard secrecy". One org where the best researchers go, open on the inside, closed to the outside. Good good.

But, as noted in (1), MIRI has not lived up to its potential in this regard ^[3]. MIRI could kill two birds with one stone, and act as a secrecy/collaboration coordination point while also having multiple small internal teams working on disparate approaches, and thus having a high absolute headcount (helping (3) and (4)), while avoiding many issues common to big gangly organizations.

Then again, Zvi and others have written extensively on why big organizations are doomed to cancer and maybe theoretically impossible to align. Okay. Not promising. Then maybe we need approaches that get similar benefits (secrecy, collaboration, coordination, many schools) without making a large group. Perhaps a big closed-door annual conference? More MIRIx chapters? Something?

6. The hard problem of smart people working on a hard problem.

Remember "The Bitter Lesson"? Where AI researchers go for approaches using human expertise and galaxy-brained solutions, instead of brute scale?

Sutton's reasoning for this is (at least partly) that researchers have human vanity. "I'm a smart person, therefore my solution should be sufficiently-complicated." ^[4]

I think similar reasons of vanity (and related social status) are holding back some AI alignment progress.

I think people are afraid to suggest sufficiently weird/far-out ideas (which, recall, need to be quite different from existing flawed approaches), because they have a mental model of semi-adequate MIRI trying and failing something, and then not prioritizing writing-up-the-failure (or keeping the failure secret for some reason).

Sure, there are good security-mindset and iffy-teachability reasons why many new ideas can and should be rejected on-sight. But, as noted in (4), these problems should not be impossible to get around. And in actual cybersecurity and cryptography, where people are presumably selected at least a tad for having security mindset, there's not exactly a shortage of creative ideas and moon math solutions. Given our field's relatively-high coordination and self-reflection, surely we can do better?

This relates to a point I've made elsewhere, that in the face of lots of things not working, we need to try more hokey, wacky, cheesy, low-hanging, "dumb" ideas. I'm disappointed that I couldn't find any LessWrong post suggesting something like "Let's divvy up team members where each one represents a cortex of the brain, then we can divide intellectual labor!". The idea is dumb, it likely won't work, but surviving worlds don't leave that stone unturned. If famously-wacky early LessWrong didn't have this lying around, how do I know MIRI hasn't secretly tried and failed at it?

Related to division of intellectual labor: I also think Yudkowsky's example of Einstein, in the Sequences, may make people afraid to offer incremental ideas, critiques, solutions, etc. "If I can't solve all of alignment (or all of [big alignment subproblem]) in one or two groundbreaking papers, like Einstein did with Relativity, I'm not smart enough to work in alignment." So, uh, don't be afraid to take even half-baked ideas to the level of a LaTeX-formatted paper. (If you can solve alignment in one paper, obviously do that!)

7. Concluding paragraph because you have a crippling addiction to prose (ok, same, fair).

Here's an example of something that combines many solution-ideas noted in (6). If it becomes more accepted to write ideas in bullet points, then:

- It lowers the barrier to entry for people who think better/more easily than they write.
- It lowers the mental "status-grab" barrier for people who are subtly intimidated by prose quality.
  - This, in turn, signals to more people who already don't care about status, that their blunt ideas are welcome on alignment spaces.
- It makes prose quality less able to influence readers' evaluations of idea quality, which is good for examining ideas' truth values.
- It may be easier even for people who already have little problem writing prose.
- People can (and probably should) still write prose when they're more comfortable with it / when needed for other purposes (explicitly persuading people?) anyway. Making bullet points more common does not necessarily entail forcibly limiting prose.

^[1] H/T my co-blogger Devin, as is the case with my articles' editing in general, and noticing gaps in my logic in particular.
^[2] If you're in this situation, DM me for moral support and untested advice.
^[3] Or maybe it has! We don't know! See (2)!
^[4] See also, uh, that list of explanations of quantum mechanics.

There are now quite a lot of AI alignment research organizations, of widely varying quality. I'd name the two leading ones right now as Redwood and Anthropic, not MIRI (which is in something of a rut technically). Here's a big review of the different orgs by Larks:

https://www.lesswrong.com/posts/C4tR3BEpuWviT7Sje/2021-ai-alignment-literature-review-and-charity-comparison

I don't know, I might be wrong here but seems to me that most serious AGI x-risk research comes from MIRI-affiliated people. Most other organisations (with exceptions) seem to mostly write hacky math-free papers. Is there particular research you like?

https://transformer-circuits.pub/ seems impressive to me!

On secrecy: while I think secrecy on capability research is probably bad because it creates an arms-race, winner-take-all mindset that will not let anyone pause to think when the moment to plug in the actual AGI comes, I think that secrecy for alignment research is just crazy. Open publication = easier progress, easier for outsiders to contribute, easier for orgs that do capability work to use the results from alignment.

On recruiting good people: maybe try to recruit bright young people on scholarships/PhD grants and teach them, rather than trying to only hire people who are ready to work for free for years before getting hired?
(epistemic status: obviously this is only hearsay from the Internet, and may well not represent the real recruitment process.)

If existing intelligence works the way I think it does, "small and secret" could be a very poor approach to solving an unreasonably difficult problem. You'd want a large, relatively informal network of researchers working on the problem. The first challenge, then, would be working out how to begin to align the network in a way that lets it learn on the problem.

There's a curious self-reflective recursivity here. Intuitively, I suspect the task of aligning the research network would turn out isomorphic to the AI alignment problem it was trying to solve.

Here's a (failure?) mode that I and others are already in, but might be too embarrassed to write about: taking weird career/financial risks, in order to obtain the financial security, to work on alignment full-time...

I'd be more glad if I saw non-academic noob-friendly programs that pay people, with little legible evidence of their abilities, to upskill full-time.

CEEALAR offers this (free accommodation and food, and a moderate stipend), and was set up to avoid the failure mode mentioned (not just for alignment, for EA in general).

Heck yeah! Would love to see its model spread, too...

On 3.

You say:

Another problem is whether the "winning" approach might come from deeper searching along the existing paths, rather than broader searching in weirder areas. In that case, it could maybe still make sense to proliferate sub-approaches under the existing paths. The rest of the points (especially (4) below) would still apply, and this still relies on the existing paths being... broken enough to call "doom", but not broken enough to try anything too different. This is possible.

This seems pretty plausible to me, and makes me think that people probably shouldn't be too worried about trying to do "diverse" approaches just for the sake of trying them.

In an interview with John Wentworth from AXRP, he suggests that convergence in research directions may generally be viewed as a positive sign.

You have to play down through a few layers of the game tree before you start to realize what the main bottlenecks to solving the problem are.
[...]
And I do think the longer people are in the field, the more they tend to converge to a similar view. So for instance, example of that, right now, myself, Scott Garrabrant, and Paul Christiano are all working on basically the same problem. We’re all basically working on, what is abstraction and where does the human ontology come from? That sort of thing. And that was very much a case of convergent evolution. We all came from extremely different directions.

If a bunch of researchers have really ended up on somewhat similar research agendas because they each find these approaches the most promising, I think I feel better about them all sticking with their similar approaches than I would about them trying to go for more diverse approaches simply to "change things up" or "diversify our bets."

On 4.

What are your thoughts on programs like AGISF and SERI MATS that allow people to learn about alignment research and try out their fit for it in a more structured environment? Do you think people should generally be scaling programs like this up further, or trying something pretty different?

Also, you say:

By the time you get good enough to get a grant, you have to have spent a lot of time studying this stuff.

My impression was that many funders may be somewhat willing to give grants (especially relatively small ones) to people who haven't spent a ton of time learning about alignment already and who have relatively little in the way of existing "accomplishments," to try their hand at alignment work. Have you personally gotten to apply for funding to work on alignment full-time yet?

I heard of (and worked through some of) the AGISF, but haven't heard of SERI MATS. Scaling these up would likely work well.

I super mega agree with the prose thing. I think in bullet points. I prefer to write complex ideas in them too. Prose is for stories! (And maybe for clear explanations of ideas that have already been well-developed, but tbh I doubt that too.)

Edit: After actually reading all the comments on that reddit post you linked, I think I've changed my mind. Prose is probably necessary, but not sufficient. Outlines are also not sufficient. I have long suspected that the ideal way of representing information in a digital world has not yet been invented. Hmm...

One thing is that the clearest knowledge representations vary by field/task. Sometimes data is what you need, sometimes a math proof (which itself can vary from more prosey to more symbol-manipulation-based).

Excellent post. I have nothing really to add, only that you're not alone in this:

Here's a (failure?) mode that I and others are already in, but might be too embarrassed to write about: taking weird career/financial risks, in order to obtain the financial security, to work on alignment full-time ^[2]. Anyone more risk-averse (good for alignment!) might just... work a normal job for years to save up, or modestly conclude they're not good enough to work in alignment altogether. If security mindset can be taught at all, this is a shit equilibrium.
Yes, I know EA and the alignment community are both improving at noob-friendliness. I'm glad of this. I'd be more glad if I saw non-academic noob-friendly programs that pay people, with little legible evidence of their abilities, to upskill full-time. IQ or other tests are legal, certainly in a context like this. Work harder on screening for whatever's unteachable, and teaching what is.

I'm more on the "working on having more energy so I can spend more time learning even with a 9-5" side than taking risks, but same idea.

