Here are some thoughts about the recent back-and-forth where Will MacAskill reviewed IABI and Rob Bensinger wrote a reply and Will replied back. I'm making this a quick take instead of a full post because it gets kinda inside baseball/navelgazy and I want to be more chill about that than I would be in a full writeup.
First of all, I want to say thank you to Will for reviewing IABI. I got a lot out of the mini-review, and broadly like it, even if I disagree on the bottom line and some of the arguments. It helped me think deep thoughts.
The evolution analogy
I agree with Will that the evolution analogy is useful and informative in some ways, but of limited value. It’s imperfect, and thinking hard about the differences is good.
The most basic disanalogy is that evolution wasn’t trying, in any meaningful sense, to produce beings that maximise inclusive genetic fitness in off-distribution environments.
I agree with this, and I appreciate MacAskill talking about it.
But we will be doing the equivalent of that!
One of the big things that I think distinguishes more doomy people like me from less doomy people like Will is our priors on how incompetent people are. Like, I agree that it's possible to carefully train an ML system to be (somewhat) robust to distributional shifts. But will we actually do that?
I think, at minimum, any plan to build AGI (to say nothing of ASI) should involve:
And I personally think pure corrigibility has a nonzero chance of being a good choice of goal, and that a sufficiently paranoid training regime has a nonzero chance of being able to make a semi-safe AGI this way, even with current techniques. (That said, I don’t actually advocate for plans that have a significant chance of killing everyone, and I think “try to build corrigible AGI” does have a significant chance of killing everyone; I just notice that it seems better than what the research community currently seems to be doing, even at Anthropic.)
I predict the frontier lab that builds the first AGI will not be heavily focused on ensuring robustness to distributional shifts. We could bet, maybe.
Types of misalignment
I really benefited from this! Will changed my mind! My initial reaction to Will’s mini-review was like, "Will is wrong that these are distinct concepts; any machine sufficiently powerful to have a genuine opportunity to disempower people but which is also imperfectly aligned will produce a catastrophe."
And then I realized that I was wrong. I think. Like, what if Will is (secretly?) gesturing at the corrigibility attractor basin or perhaps the abstracted/generalized pattern of which corrigibility is an instance? (I don't know of other goals which have the same dynamic, but maybe it's not just corrigibility?)
An agent that is pseudo-corrigible, and lives inside the attractor basin, is imperfectly aligned (hence the pseudo), but if it's sufficiently close to corrigible, it seems reasonable to me that it won't disempower humanity, even if given the opportunity (at least, not in every instance it gets the opportunity). So at the very least, corrigibility (one of my primary areas of research!) is (probably) an instance of Will being right (and my past self being wrong), and the distinction between his "types of misalignment" is indeed a vital one.
I feel pretty embarrassed by this, so I guess I just wanna say oops/sorry/thanks.
If I set aside my actual beliefs and imagine that we’re going to naturally land in the corrigibility attractor basin by default, I feel like I have a better sense of some of the gradualism hope. Like, my sense is that going from pseudo-corrigible to perfectly corrigible is fraught, but can be done with slow, careful iteration. Maybe Clara Collier and other gradualists think we're going to naturally land in the corrigibility attractor basin, and that the gradual work is the analogue of the paranoid iteration that I conceive as being the obvious next-step?
If this is how they're seeing things, I guess I feel like I want to say another oops/sorry/thanks to the gradualists. ...And then double-click on why they think we have a snowball's chance in hell of getting this without a huge amount of restriction on the various frontier labs and way more competence/paranoia than we currently seem to have. My guess is that this, too, will boil down to worldview differences about competence or something. Still. Oops?
(Also on the topic of gradualism and the notion of having "only one try" I want to gesture at the part of IABI where it says (paraphrased from memory, sorry): if you have a clever scheme for getting multiple tries, you still only get one try at getting that scheme to work.)
appeals to what “most” goals are like (if you can make sense of that) doesn’t tell you much about what goals are most likely. (Most directions I can fire a gun don’t hit the target; that doesn’t tell you much about how likely I am to hit the target if I’m aiming at it.)
I agree that “value space is big” is not a good argument, in isolation, for how likely it is for our creations to be aligned. The other half of the pincer is “our optimization pressure towards aligned goals is weak,” and without that the argument falls apart.
(Maybe we won’t be able to make deals with AIs? I agree that’s a worry; but then the right response is to make sure that we can. Won’t the superintelligence have essentially a 100% chance of taking over, if it wanted to? But that’s again invoking the “discontinuous jump to godlike capabilities” idea, which I don’t think is what we’ll get).
Here’s a plan for getting a good future:
I think this plan is bad because it fails the heuristic of “don’t summon demons and try to cleverly bargain with them,” but perhaps I’m being unfair.
My main criticism of "make deals with the AIs" is that it seems complex and brittle, and that it depends heavily on a level of being able to read the machine’s mind that we definitely don’t currently have and might never have.
That said, I do think there's a lot of value in being the sorts of people/groups that can make deals and be credible trade partners. Efforts to be more trustworthy and honorable and so on seem great.
suppose that all the first superintelligence terminally values is paperclips. But it’s risk-averse, in the sense that it prefers a guarantee of N resources over a 50/50 chance of 0 or 2N resources; let’s say it’s more risk-averse than the typical human being.
On a linguistic level I think "risk-averse" is the wrong term, since, as I understand it, it usually describes an agent that is intrinsically averse to taking risks and will pay a premium for a sure thing beyond what any diminishing-returns story would justify. (That kind of certainty-seeking is typically characterized as a bias, and it violates VNM rationality.) Whereas it sounds like Will is talking about diminishing returns from resources, which is extremely common and natural, is fully compatible with expected-utility maximization, and is a property we should expect AIs to have for various reasons.
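(A minimal worked example of how diminishing returns alone produces the preference Will describes, using a square-root utility function purely for concreteness: an expected-utility maximizer with

```latex
u(x) = \sqrt{x}
\quad\Rightarrow\quad
\mathbb{E}\big[u(\text{gamble})\big] = \tfrac{1}{2}\sqrt{0} + \tfrac{1}{2}\sqrt{2N} \approx 0.71\,\sqrt{N} \;<\; \sqrt{N} = u(N)
```

already prefers the guaranteed N over the 50/50 shot at 2N, with no VNM violation and no intrinsic aversion to risk anywhere in the picture.)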
it would strongly prefer to cooperate with humans in exchange for, say, a guaranteed salary, rather than to take a risky gamble of either taking over the world or getting caught and shut off.
Rob wrote some counterpoints to this, but I just want to harp on it a little. Making a deal with humans to not accumulate as much power as possible is likely an extremely risky move for multiple reasons, including that other AIs might come along and eat the lightcone.
I can imagine a misaligned AI maybe making a deal with humans who let it out of the box in exchange for some small fraction of the cosmos (and honoring the deal; again, the hard part is that it has to know we can tell if it's lying, and we probably can't).
I can’t really imagine an AI that has a clear shot at taking over the world making a deal to be a meek little salary worker, even if there are risks in trying to take over. Taking over the world means, among other things, being sure you won’t get shut off or replaced by some other AI or whatever.
(Though I can certainly imagine a misaligned AI convincing people (and possibly parts of itself) that it is willing to make a deal like that, even as it quietly accumulates more power.)
Their proposal
Now we're getting into the source of the infighting, I think (just plain fighting? I think of Will as being part of my ingroup, but idk if he feels the same; Rob definitely is part of my ingroup; are they part of each other's ingroups? Where is the line between infighting and just plain fighting?). Will seems very keen on criticizing MIRI's "SHUT IT DOWN YOU FOOLS" strategy — mostly, it seems to me, because he sees this approach as insufficiently supportive of strategies besides shutting things down.
When Rob shared his draft of his reply to Will, I definitely noticed that it seemed like he was not responding accurately to the position that I saw in Will's tweet. Unfortunately, I was aware that there is something of a history between Will and MIRI and I incorrectly assumed that Rob was importing true knowledge of Will's position that I simply wasn't aware of. I warned him that I thought he was being too aggressive, writing "I expect that some readers will be like 'whoa why is MIRI acting like this guy is this extremist--I don't see evidence of that and bet they're strawmanning him'." But I didn't actually push back hard, and that's on me. Apologies to Will.
(Rob reviewed a draft of this post and adds his own apologies for misunderstanding Will’s view. He adds: “My thanks to Max and multiple other MIRI people for pushing back on that part of my draft. I made some revisions in response, though they obviously weren’t sufficient!”)
I'm very glad to see in Will's follow-up:
"I definitely think it will be extremely valuable to have the option to slow down AI development in the future,” as well as “the current situation is f-ing crazy”
I wish this had been more prominent in his mini-review, but :shrug:
I think Will and I probably agree that funding a bunch of efforts to research alignment, interpretability, etc. would be good. I'm an AI safety/alignment researcher, and I obviously do my day-to-day work with a sense that it's valuable and a sense that more effort would also be valuable. I've heard multiple people (whom I respect and think are doing good work) complain that Eliezer is critical/dismissive of their work, and I wish Eliezer was more supportive of that work (while also still saying "this won't be sufficient" if that's what he believes, and somehow threading that needle).
I am pretty worried about false hope, though. I'm worried that people will take "there are a bunch of optimistic researchers working hard on this problem" as a sign that we don't need to take drastic action. I think we see a bunch of this already and researchers like myself have a duty to shout "PLEASE DON'T RISK EVERYTHING! I DON'T GOT THIS!"[1] even while pursuing the least-doomed alignment strategies they know of. (I tried to thread this needle in my corrigibility research.)
Anyway, I think I basically agree with Will's clarified position that a "kitchen-sink approach" is best, including a lot of research, as long as actually shutting down advanced training runs and pure capabilities research is in the kitchen sink. I feel worried that Will isn't actually pushing for that in a way that I think is important (not building "It" is the safest intervention I'm aware of), but I'm also worried about my allies (people who basically agree that AI is unacceptably dangerous and that we need to take action) being unable to put forward a collective effort without devolving into squabbling about tone and strawmanning each other. :(
Anyway. Thank you again to Will and Rob. I thought both pieces were worth reading.
[1] (Not to say that we should necessarily risk everything if alignment researchers do feel like they've "got this." That's a question worth debating in its own right. Also, it's obviously worth noting that work that is incrementally useful but clearly insufficient to solve the whole alignment problem can still be valuable, and the researcher is still allowed to say "I got this" on their little, local problems. (And they're definitely allowed to speak up if they actually do solve the whole damn problem, of course. But they better have actually solved it!))
I think VNM is important and underrated and CAST is compatible with it. Not sure exactly what you're asking, but hopefully that answers it. Search "VNM" on the post where I respond to existing work for more of my thoughts on the topic.
My read on what @PeterMcCluskey is trying to say: "Max's work seems important and relevant to the question of how hard corrigibility is to get. He outlined a vision of corrigibility that, in the absence of other top-level goals, may be possible to truly instill in agents via prosaic methods, thanks to the notion of an attractor basin in goal space. That sense of possibility stands in stark opposition to the normal MIRI party-line of anti-naturality making things doomed. He also pointed out that corrigibility is likely to be a natural concept, and made significant progress in describing it. Why is this being ignored?"
If I'm right about what Peter is saying, then I basically agree. I would not characterize it as "an engineering problem" (which is too reductive), but I would agree there are reasons to believe that it may be possible to achieve a corrigible agent without a major theoretical breakthrough. (That's assuming (1) I'm broadly right, (2) anti-naturality isn't as strong as the attractor basin in practice, and (3) I'm not missing any big complications, which is a big set of ifs that I would not bet my career on, much less the world.)
I think Nate and Eliezer don't talk about my work out of a combination of having been very busy with the book and not finding my writing/argumentation compelling enough to update them away from their beliefs about how doomed things are because of the anti-naturality property.
I think @StanislavKrym and @Lucius Bushnaq are pointing out that I think building corrigible agents is hard and risky, and that we have a lot to learn and probably shouldn't be taking huge risks of building powerful AIs. This is indeed my position, and it doesn't feel like it contradicts Peter's points, nor like it solidly addresses them.
Lucius and @Mikhail Samin bring up anti-naturality. I wrote about this at length in CAST and basically haven't significantly updated, so I encourage people to follow Lucius' link if they want to read my full breakdown there. But in short, I do not feel like I have a handle on whether the anti-naturality property is a stronger repulsor than the corrigibility basin is an attractor in practice. There are theoretical arguments that pseudo-corrigible agents will become fully corrigible, and arguments that they will become incorrigible, and I think we basically just have to test it and (if it favors attraction) hope that this generalizes to superintelligence. (Again, this is so risky that I would much rather we not be building ASI in general.) I do not see why Nate and Eliezer are so sure that anti-naturality will dominate, and this is, I think, the central issue of confidence that Peter is trying to point at.
(Aside: As I wrote in CAST, "anti-natural" is a godawful way of saying opposed-to-the-instrumentally-convergent-drives, since it doesn't preclude anti-natural things being natural in various ways.)
Anyone who I mischaracterized is encouraged to correct me. :)
(Minor point: I agree we're not on track, but I was trying to include in my statement the possibility that we change track.)
Agreed. Thanks for pointing out my failing, here. I think this is one of the places in my rebuttal where my anger turned into snark, and I regret that. Not sure if I should go back and edit...
Thank you for this response. I think it really helped me understand where you're coming from, and it makes me happy. :)
I really like the line "their case is maybe plausible without it, but I just can't see the argument that it's certain." I actually agree that IABIED fails to provide an argument that it's certain that we'll die if we build superintelligence. Predictions are hard, and even though I agree that some predictions are easier, there's a lot of complexity and path-dependence and so on! My hope is that the book persuades people that ASI is extremely dangerous and worth taking action on, but I'd definitely raise an eyebrow at someone who did not have Eliezer-level confidence going in, but then did have that level of confidence after reading the book.
There's a motte argument that says "Um actually the book just says we'll die if we build ASI given the alignment techniques we currently have" but this is dumb. What matters is whether our future alignment skill will be up to the task. And to my understanding, Nate and Eliezer both think that there's a future version of Earth which has smarter, more knowledgeable, more serious people that can and should build safe/aligned ASI. Knowing that a godlike superintelligence with misaligned goals will squish you might be an easy call, but knowing exactly what the state of alignment science will be when ASI is first built is not.
(This is why it's important that the world invests a whole bunch more in alignment research! (...in addition to trying to slow down capabilities research.))
It seems like maybe part of the issue is that you hear Nate and Eliezer as saying "here is the argument for why it's obvious that ASI will kill us all" and I hear them as saying "here is the argument for why ASI will kill us all" and so you're docking them points when they fail to reach the high standard of "this is a watertight and irrefutable proof" and I'm not?
On a different subtopic, it seems clear to me that we think about the possibility of a misaligned ASI taking over the world pretty differently. My guess is that if we wanted to focus on syncing up our worldviews, that is where the juicy double-cruxes are. I'm not suggesting that we spend the time to actually do that--just noting the gap.
Thanks again for the response!
@Max H may have a different take than mine, and I'm curious for his input, but I find myself still thinking about serial operations versus parallel operations. Like, for the question of whether AIs will think faster, I don't think the important thing is how many transistors operating in parallel are needed to capture the equivalent information processing of a single neuron; the important thing is how many serial computations are needed. I see no reason it would take that many serial operations to capture a single spike, especially in the limit of e.g. specialized chips.
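(A rough back-of-the-envelope, using round numbers as illustrative assumptions rather than careful estimates: a biological spike plays out over roughly a millisecond, while a chip clocked at roughly 1 GHz gets through about a million serial cycles in that same millisecond,

```latex
\frac{10^{-3}\ \text{s per spike}}{10^{-9}\ \text{s per clock cycle}} = 10^{6}\ \text{serial cycles per spike duration},
```

so even if capturing one spike took thousands of serial operations, there would still be several orders of magnitude of serial speed advantage left over.)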
Yeah, sorry. I should've been more clear. I totally agree that there are ways in which brains are super inefficient and weak. I also agree that on restricted domains it's possible for current AIs to sometimes reach comparable data efficiency.
Ah, I hadn't thought about that misreading being a source of confusion. Thanks!
Oh, uh, I guess @wdmacaskill and @Rob Bensinger