(Related to What an Actually Pessimistic Containment Strategy Looks Like)
It seems to me like there are several approaches with an outside chance of preventing doom from AGI. Here are four:
1. Convince a significant chunk of the field to work on safety rather than capability
2. Solve the technical alignment problem
3. Rethink fundamental ethical assumptions and search for a simple specification of value
4. Establish international cooperation toward Comprehensive AI Services, i.e., build many narrow AI systems instead of something general
Furthermore, these approaches seem quite different, to the point that some have virtually no overlap in a Venn diagram. #1 is entirely a social problem, #2 a technical and philosophical problem, #3 primarily a philosophical problem, and #4 in equal parts social and technical.
Now suppose someone comes to you and says, "Hi. I'm working on AI safety, which I think is the biggest problem in the world. There are several very different approaches for doing this. I'm extremely confident (99%+) that the approach I've worked on the most and know the best will fail. Therefore, my policy recommendation is that we all keep working on that approach and ignore the rest."
I'm not saying the above describes Eliezer, only that the ways in which it doesn't are not obvious. Presumably Eliezer thinks that the other approaches are even more doomed (or at least doomed to a degree that's sufficient to make them not worth talking about), but it's unclear why that is or why we can be confident in it given the lack of effort that has been expended so far.
Take this comment as an example:
How about if you solve a ban on gain-of-function research [before trying the policy approach], and then move on to much harder problems like AGI? A victory on this relatively easy case would result in a lot of valuable gained experience, or, alternatively, allow foolish optimists to have their dangerous optimism broken over shorter time horizons.
This reply makes sense if you are already convinced that policy is a dead end and just want to avoid wasting resources on that approach. If policy can work, it sounds like a bad plan since we probably don't have time to solve the easier problem first, especially not if one person has to do it without the combined effort of the community. (Also, couldn't we equally point to unsolved subproblems in alignment, or alignment for easier cases, and demand that they be solved before we dare tackle the hard problem?)
What bothers me the most is that discussion of alternatives has not even been part of the conversation. Any private person is, of course, free to work on whatever they want (and many other researchers are less pessimistic about alignment), but I'm specifically questioning MIRI's strategy, which is quite influential in the community. No matter how pessimistic you are about the other approaches, surely there has to be some probability for alignment succeeding below which it's worth looking at alternatives. Are we 99% sure that value isn't simple and that the policy problem is unsolvable even for a shift to narrow systems? 99.99%? What is the point at which it begins to make sense to advocate for work on something else (which is perhaps not even on the list)? It's possible that MIRI should stick to alignment regardless because of comparative advantage, but the messaging could have been "this alignment thing doesn't seem to work; we'll keep at it but the rest of you should do something else", and, well, it wasn't.
My interpretation of MIRI is that they ARE looking for alternatives and so far have not found any that don't also seem doomed. E.g. they thought about trying to coordinate to ban or slow down AI capabilities research and concluded that it's practically impossible, since we can't even ban gain-of-function research and that should be a lot easier. My interpretation of MIRI is that their recent public doomsaying is NOT aimed at getting people to just keep thinking harder about doomed AI alignment research agendas; rather, it is aimed at getting people to think outside the box and hopefully come up with a new plan that might actually work. (The recent April Fool's post also served another function of warning against some common failure modes, e.g. the "slip sideways into fantasy world" failure mode where you start gambling on assumptions holding true and then end up doing this repeatedly and losing track of how increasingly unlikely the world you are planning for is.)
If that is the case, then I would very much like them to publicize the details for why they think other approaches are doomed. When Yudkowsky has talked about it in the past, it tends to be in the form of single-sentence statements pointing towards past writing on general cognitive fallacies. For him I’m sure that would be enough of a hint to clearly see why strategy x fits that fallacy and will therefore fail, but as a reader, it doesn’t give me much insight as to why such a project is doomed, rather than just potentially flawed. (Sorry if this doesn’t make sense btw, I’m really tired and am not sure I’m thinking straight atm)
I think this would probably be helpful.
One rationale for not spelling things out in more detail is an expectation that anyone capable of solving a significant chunk of the problem will need to be able to notice such drawbacks themselves. If getting to a partial solution will require a researcher to notice over a hundred failure modes along the path, then Eliezer's spelling out the first ten may not help - it may deny them the opportunity to reason things through themselves and learn. (I imagine that in reality time constraints are playing a significant role too)
I do think there's something in this, but it strikes me that there are likely more efficient approaches worth looking for (particularly when it comes to people who want/need to understand the alignment research landscape, but aren't themselves planning to work in technical alignment).
Quite a lot depends on how hard we think it is to navigate through potential-partial-alignment-solution space. If it were a low dimensional space or workable solutions were dense, one could imagine finding solutions by throwing a few thousand ants at a promising part of the space and letting them find the sugar.
Since the dimensionality is high, and solutions not dense, I think there's a reasonable case that the bar on individual navigation skill is much higher (hence the emphasis on rationality).
I mean, I'm also assuming something like this is true, probably, but it's mostly based on "it seems like something they should do, and I ascribe a lot of competence to them".
How much effort have we as a community put into banning gain-of-function research vs. solving alignment? Given this, if, say, banning AGI research is 0.5 times as hard as alignment (which would make it a great approach) and banning gain-of-function research is 0.1 times as hard as banning AGI, would we have succeeded at a gain-of-function ban? I doubt it.
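To make the arithmetic in this hypothetical explicit, here is a minimal sketch. The two ratios are the illustrative values from the paragraph above, not estimates, and the names (`relative_difficulty` and the constants) are invented for this example.

```python
# Illustrative ratios taken from the hypothetical above; they are not estimates.
AGI_BAN_VS_ALIGNMENT = 0.5  # banning AGI research is 0.5 times as hard as alignment
GOF_BAN_VS_AGI_BAN = 0.1    # a gain-of-function ban is 0.1 times as hard as an AGI ban


def relative_difficulty(*ratios: float) -> float:
    """Chain difficulty ratios by multiplying them together."""
    result = 1.0
    for ratio in ratios:
        result *= ratio
    return result


gof_ban_vs_alignment = relative_difficulty(AGI_BAN_VS_ALIGNMENT, GOF_BAN_VS_AGI_BAN)
print(f"Gain-of-function ban vs. alignment: {gof_ban_vs_alignment:.2f}")
```

Under these made-up numbers, a gain-of-function ban comes out at about a twentieth of the difficulty of alignment. If one further assumes that required effort scales with difficulty (an assumption, not something claimed above), the question is whether the community has put anywhere near that share of its alignment effort into a ban.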
Idk, I skimmed the April Fool's post again before submitting this, and I did not get that impression.
I think this is missing the point somewhat.
When Eliezer and co. talk about tackling "the hard part of the problem"* I believe they are referring to trying to solve the simplest, easiest problems that capture some part of the core difficulty of alignment.
See this fictionalized segment from the rocket alignment problem:
Hence, doing Agent Foundations work that isn't directly about working with machine learning systems, but is mostly about abstractions of ideal agents, etc.
Similarly, I think that creating and enforcing a global ban on gain of function research captures much of the hard part of causing the world to coordinate not to build AGI. It is an easier task, but you will encounter many of the same blockers, and need to solve many of the same sub-problems.
Creating a global ban on gain of function research : coordinating the world to prevent solving AGI :: solving the tiling agents problem : solving the whole alignment problem.
* For instance, in this paragraph, from here.
I don't know when MIRI started working on Tiling Agents, but it was published in 2013. In retrospect, it seems like we would not have wanted people to wait that long to work on alignment. And it's especially problematic now that timelines are shorter.
I mean, assume a coordinated effort to ban gain-of-function research succeeds eight years from now; even if we then agree that policy is the way to go, it may be too late.
I don't buy this characterization. This might sound at odds with my comment above, but working on tiling agents was an attempt at solving alignment, not deferring solving alignment.
The way you solve a thorny, messy, real-world technical problem is to first solve an easier problem with simplified assumptions, and then gradually add in more complexity.
I agree that this analogizes less tightly to the political action case, because solving the problem of putting a ban on gain of function research is not a strictly necessary step for creating a ban on AI, the way solving Tiling agents is (or at least seemed at the time to be) a necessary step for solving alignment.
I totally agree. My point was not that tiling agents isn't alignment research (it definitely is), it's that the rest of the community wasn't waiting for that success to start doing stuff.
I'd say that basically factors into "solve AI governance" and "solve the technical alignment problem", both of which seem extremely hard, but we need to try it anyways.
(In particular, points 3&4 are like instances of 2 that won't work. (Ok maybe sth like 4 has a small chance to be helpful.))
The governance and the technical parts aren't totally orthogonal. Making progress on one helps make the other easier or buys more time.
(I'm not at all as pessimistic as Eliezer, and I totally agree with What an Actually Pessimistic Containment Strategy Looks Like, but I think you (like many people) seem to be too optimistic that something will work if we just try a lot. Thinking about concrete scenarios may help to see the actual difficulty.)
Towards a #1-flavored answer, a Hansonian fine insured bounty system seems like it might scale well for enforcing cooperation against AI research.
I would break down the possibility space as:
(Your #4 is my #1, your #1 is my #1 or #3, your #2 is my #3, your #3 is one aspect of my #3 (i.e., it's getting at an approach to ease outer alignment, but we would still also need to solve inner alignment).)
Yeah, I suspect that EY thinks #1 is even less likely than #3, and that he doesn't think about #1 at all except to pooh-pooh people who are even trying. My impression is that EY has done work brainstorming on #2 in the past, but that he has given up.
For my part, I'm more open-minded and happy for people to be trying different things in parallel—see for example what I wrote under “objection #6” here (expressing an open mind that someone might solve #1).
This list is a lot better in terms of comprehensiveness since most things are probably in one of the three buckets, but one of my main points was that there are a lot of approaches with a chance of working that have little overlap, and lumping them together glosses over that. There are probably several ideas that aren't on my list and also in your #1.
Or differently put, the opinion that #1 is hopeless only makes sense if you have an argument that applies to all things in #1, but I question whether you can justifiably have such an argument. (I know you don't have it, I'm just saying that the category you've created may be doing some unjustified work here.)
Edit: and also (this is something I almost included in the post but then didn't), I'm also skeptical about lumping #3 together since e.g. agent foundations and debate don't seem to have a lot of overlap, but they're both ideas that aim to solve the entire problem. And EY seems more specialized in agent foundations specifically than all of alignment.
(Good correction on the inner alignment thing.)
I think there are a few things that get in the way of doing detailed planning for outcomes where alignment is very hard and takeoff very fast. This post by David Manheim discusses some of the problems: https://www.lesswrong.com/posts/xxMYFKLqiBJZRNoPj
One is that there's no clarity, even among people who've made AI research their professional career, about alignment difficulty or takeoff speed. So getting buy-in in advance of clear warning signs will be extremely hard.
The other is that the strategies that might help in situations with hard alignment are at cross purposes to the ones in Paul-like worlds with slow takeoff and easy alignment: promoting differential progress vs. creating some kind of global policing system to shut down AI research.
What if AI safety and governance people published their papers on arXiv in addition to NBER or wherever? I know it's not the kind of stuff that arXiv accepts, but if I was looking for a near-term policy win, that might be one.