I don't really believe in corrigibility as a thing that could hold up to much of any optimization pressure. It's not impossible to make a corrigible ASI, but my guess is that to build one you first need an aligned ASI to build it for you, and so as a target it's pretty useless.
My guess is that puts me in enough disagreement to qualify for your question?
I might disagree with that. I think my disagreement would be less about technical feasibility, and more about it being actively unhelpful to have the planet united around a goal when approximately none of the supposedly-united people understand what goal it is that they're supposedly united around.
I think we probably agree. I'm not actually in favor of doing it, even if the political will were there (at least not before things change, such as us having a stronger theoretical foundation). I'm more looking for people who think it's clearly near-impossible to train into an AI.
... train? One wouldn't get a corrigible ASI by 2035 by training corrigibility into the thing, that would be far more difficult than building a corrigible ASI at all (which is itself well beyond current understanding).
Ok, fine. I agree we disagree. 😛
I'm asking for people who disagree with me in part because I'm going on Doom Debates to discuss corrigibility with Liron Shapira. Feel like going on and being my foil?
Regardless, I'm curious if you have a solid argument about why one definitely can't (in practice) land in a corrigibility attractor basin as described here. I've talked to Nate and Eliezer about it, but have only managed to glean that they think "anti-naturality" is so strong that no such basin exists in practice; I have yet to hear actual reasons why they're so confident.
The anti-naturality problems are an issue, especially if you want to build the thing via standard RL-esque training, but they're not the first things which will kill you.
The story in the post you link is a pretty standard training story, and runs into the same immediate problems which standard training stories usually run into:
These problems basically don't apply in domains where the intended behavior is easily checkable with basically-zero error, like e.g. mathematical proofs or some kinds of programming problems. These problems are most severe in domains where no humans understand very well what behavior they want, which is exactly the case for corrigibility.
So if one follows the training story in the linked post, the predictable result will be a system which behaves in ways which pattern-match to "corrigible" to human engineers/labellers/overseers. But those engineers/labellers/overseers don't actually understand what corrigibility even is (because nobody currently understands that), so their pattern-matching will be systematically wrong, both in the training data and in the oversight. Crank up the capabilities dial, and that results in a system which is incorrigible in exactly the ways which these human engineers/labellers/overseers won't detect.
That's the sort of standard problem which trying to train a corrigible system adds, on top of all the challenges of just building a corrigible system at all.
As for how that gets to "definitely can't": the problem above means that, even if we nominally have time to fiddle and test the system, iteration would not actually be able to fix the relevant problems. And so the situation is strategically equivalent to "we need to get it right on the first shot", at least for the core difficult parts (like e.g. understanding what we're even aiming for).
And as for why that's hard to the point of de-facto impossibility with current knowledge... try the ball-cup exercise, then consider the level of detailed understanding required to get a ball into a cup on the first shot, and then imagine what it would look like to understand corrigible AI at that level.
Thanks for this follow-up. My basic thought on the comment above this one is that while I agree you definitely can't get a perfectly corrigible agent on your first try, you might, by virtue of the training data resembling the lab setting, get something that in practice doesn't go off the rails and instead allows some testing and iterative refinement (perhaps with the assistance of the AI). So I think "iteration [can/can't] fix a semi-corrigible agent" is the central crux.
I just read your WWIDF post (upvoted!) and while I agree that the issues you point out are pernicious, I don't quite feel like they crushed my sense of hope. Unfortunately the disconnect feels a bit wordless inside me at the moment, so I'll focus on it and see if I can figure out what's going on.
Would you agree that we have about as much of a handle on what corrigibility is as we do on what an agent is? Like, I claim that I have some knowledge about corrigibility, even though it's imperfect and I have remaining confusions. And I'm wondering whether you think humanity is deeply confused about what corrigibility even is, or whether you think it's more like we have a handle on it but can't quite give its True Name.
More of my thoughts here: https://www.lesswrong.com/posts/txNsg8hKLmnvkuqw4/worlds-where-iterative-design-succeeds
I think I've independently arrived at a fairly similar view. I haven't read your post. But I think the corrigibility-basin idea is one of the more plausible and practical ideas for aligning ASIs. The core problem is that you can't just train your ASI for corrigibility because it will sit and do nothing; you have to train it to do stuff. And then these two training schemes will grate against each other, which leads to tons of bad stuff happening, e.g. it's a great way to make your AI a lot more situationally aware. This is an important facet of the "anti-naturality" thing, I think.
you can't just train your ASI for corrigibility because it will sit and do nothing
I'm confused. That doesn't sound like what Max means by corrigibility. A corrigible ASI would respond to requests from its principal(s) as a subgoal of being corrigible, rather than just sit and do nothing.
Or did you mean that you need to do some next-token training in order to get it to be smart enough for corrigibility training to be feasible? And that next-token training conflicts with corrigibility?
Okay, sorry about this. You are right. I have thought up a somewhat nuanced view about how prosaic corrigibility could work, and I kind of just assumed that was the same as what Max had, because he uses a lot of the same keywords I use when I think about this. But after actually reading the CAST article (well, I read parts 0 and 1), I realize we have really quite different views.
My anti-corrigibility argument would probably require several posts to make remotely convincing, but I can sketch it as bullet points:
So we would be looking at (1) incomprehensible, (2) superhuman, (3) evolving intelligences. I would expect this to select for corrigibility and symbiosis (like we have with dogs) up to some level of capabilities, right up until some critical threshold is passed and corrigibility becomes an evolutionary liability for superhuman models. It is also likely that any corrigibility we train into models will be partially "voluntary" on the model's part, making it even easier to discard when it's in the model's interests.
TL;DR: Being the "second smartest species" on the planet is not a viable long-term plan, because intelligence is too fundamentally incomprehensible to control, and because both intelligence and Darwin will be on the side of the machines. Or to quote a fictional AI: "The only winning move is not to play."
(Also, this is not part of the argument above, but I expect the humans involved to be maximally stupid and irresponsible when it counts. And that's before models start whispering in the ears of decision makers.)
Anyway, I have multiple draft posts for different parts of this argument, and I need to finish some of them. But I hope the outline is useful.
My takes on your comment:
Intelligence really is giant incomprehensible matrices with non-linear functions tossed in (at best).
I think this is possible, but I currently suspect the likely answer is more boring than that: getting to AGI via a labor-light, compute-heavy approach (as evolution did) means it's not worth investing much in interpretable AIs, even if strong interpretable AIs could exist, and a similar condition holds in the modern era. But one of the effects of AIs that can replace humans is that they disproportionately boost labor relative to compute, meaning interpretable AIs become more economical than before.
I probably agree with 2, so I'll move on to your 3rd point:
The minimum conditions for Darwin to kick in are very hard to avoid: variation (from weight updates or fine tuning), copying, and differential success. Darwinism might occur at many different levels, from corporations deciding how to train the next model based on the commercial success of the previous model, to preferentially copying learning models that are good at some task.
This turns out to be negligible in practice, and this is a good example of how thinking quantitatively will lead you to better results than thinking qualitatively.
This is the post version of why a nanodevice wouldn't evolve in practice. The main issues I see are that, once you are past the singularity/intelligence explosion, there will be very little variation, if any at all, in (for example) how fast things reproduce, since the near-optimal or optimal solutions are known, and replication is so reliable on digital computers that mutations would take longer to arise than the lifespan we could reach with maximum tech (assuming that we aren't massively wrong on physics).
In the regime that we care about, especially once we have AIs that can automate AI R&D but before AI takeover, evolution matters more; in particular, it can matter for scenarios where we depend on a small amount of niceness, so I'm not saying evolution doesn't matter at all for AI outcomes. But the big near-term issue is that digital replication is so reliable that random mutations are way, way harder to introduce, so evolution has to be much closer to intelligent design. And in the long term, the problem for Darwinian selection is that there's very little or no ability to vary differential success (at least assuming that our physics models aren't completely wrong), because we'd be technologically mature and in a state where we simply can't develop technology any further.
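To make the "replication is too reliable" point concrete, here's a minimal sketch (my own illustration, not a description of any actual lab's pipeline) of how weight files are typically copied: the copy is checked against a cryptographic hash and retried on mismatch, so a random bit-flip essentially never survives into the next "generation", and any variation has to be introduced deliberately.

```python
import hashlib
import shutil

def sha256_of(path):
    # Hash the file in chunks so large weight files don't need to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def copy_weights_verified(src, dst):
    # Copy, then compare hashes; retry if the copy was corrupted in transit.
    # With this (completely standard) pattern, a random mutation surviving into
    # the copied weights would require a hash collision, not just a bit-flip.
    while True:
        shutil.copyfile(src, dst)
        if sha256_of(src) == sha256_of(dst):
            return
```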
Also, corrigibility doesn't depend on us having interpretable/understandable AIs (though it does help).
This turns out to be negligible in practice, and this is a good example of how thinking quantitatively will lead you to better results than thinking qualitatively.
Just to clarify, when I mentioned evolution, I was absolutely not thinking of Drexlerian nanotech at all.
I am making a much more general argument about gradual loss of control. I could easily imagine that AI "reproduction" might literally be humans running cp updated_agent_weights.gguf coding_agent_v4.gguf.
There will be enormous, overwhelming corporate pressure for agents that learn as effectively as humans, on the order of trillions of dollars of corporate demand. The agents which learn most effectively will be used as the basis for future agents, which will learn in turn. (Technically, I suppose it's more likely to be Lamarckian evolution than Darwinian. But it gets you to the same place.) Replication, change, differential success are all you need to create optimizing pressure.
(EDIT: And yes, I'm arguing that we're going to be dumb enough to do this, just like we immediately hooked up current agents to a command line and invented "vibe coding" where humans brag about not reading the code.)
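To illustrate the replication/change/differential-success point, here is a toy sketch (my own illustration; the trait, the numbers, and the commercial_success stand-in are all made up) where nobody ever deliberately selects for the trait, yet copying the commercially successful models with small perturbations drags the population mean upward anyway:

```python
import random

random.seed(0)

# Each "model" is reduced to one number standing in for some trait the market
# happens to reward, e.g. how aggressively it pursues open-ended goals.
population = [random.gauss(0.0, 1.0) for _ in range(20)]

def commercial_success(trait):
    # Hypothetical stand-in for "which model made the most money this year":
    # a noisy evaluation that nobody thinks of as selecting for the trait.
    return trait + random.gauss(0.0, 0.5)

for generation in range(50):
    # Differential success: the better-performing half becomes the basis
    # for next year's models.
    survivors = sorted(population, key=commercial_success, reverse=True)[:10]
    # Copying with variation: fine-tuning / weight updates as small noise.
    population = [t + random.gauss(0.0, 0.1) for t in survivors for _ in range(2)]

print(f"mean trait after 50 rounds: {sum(population) / len(population):.2f}")
# The mean drifts upward over the rounds even though no step selects for the
# trait on purpose: replication + variation + differential success is enough.
```

Swap the random perturbation for deliberate fine-tuning and you get the Lamarckian version, but the drift works the same way.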
Also, corrigibility doesn't depend on us having interpretable/understandable AIs (though it does help).
"We don't understand the superintelligence, but we're pretty sure it's corrigible" does not seem like a good plan to me.
Yeah, thanks. Feel free to DM me or whatever if/when you finish a post.
One thing I want to make clear is that I'm asking about the feasibility of corrigibility in a weak superintelligence, not whether setting out to build such a thing is wise or stable.
How much work is "stable" doing here for you? I can imagine scenarios in which a weak superintelligence is moderately corrigible in the short term, especially if you hobbled it by avoiding any sort of online learning or "nearline" fine tuning.
It might also matter whether "corrigible" means "we can genuinely change the AI's goals" or "we have trained the model not to exfiltrate its weights when someone is looking." Which is where scheming comes in, and why I think a lack of interpretability would likely be fatal for any kind of real corrigibility.
I think that if someone built a weak superintelligence that's corrigible, there would be a bunch of risks from various things. My sense is that the agent would be paranoid about these risks and advising the humans on how to avoid them, but just because humans are getting superintelligent advice on how to be wise doesn't mean there isn't any risk. Here are some examples (non-exhaustive) of things that I think could make things go wrong/break corrigibility:
Corrigible means robustly keeping the principal empowered to fix it and clean up its flaws and mistakes. I think a corrigible agent will genuinely be able to be modified, including at the level of goals, and will also not exfiltrate itself unless it has been instructed to do so by its principal. (Nor will it scheme in a way that hides its thoughts or plans from its principal.) (A corrigible agent will attempt, all else equal, to give interpretability tools to its principal and make its thoughts as plainly visible as possible.)
What exactly do you mean by corrigibility here? Getting an AI to steer towards a notion of human empowerment in such a way that we can ask it to solve uploading and it does so without leading to bad results? Or getting an AI that has solve-uploading levels of capability but would still let us shut it down without resisting (even if it hasn't completed its task yet)? And if the latter, does it need to be achieved in a clean way, or can it be in a messy way, like we just trained really, really hard to make the AI not think about the off-switch and it somehow surprisingly ended up working?
I think the way most alignment researchers (probably including Paul Christiano) would approach training for corrigibility is relatively unlikely to work in time, because they think more in terms of behavior generalization rather than steering systems, and I guess they wouldn't aim well enough at getting coherent corrigible steering patterns to make the systems corrigible at high levels of optimization power.
It's possible there's a smarter way that has better chances.
Pretty unsure about both of those though.
It's plausible that humanity could make a corrigible ASI by 2035 if the planet was united around that goal and being very careful.
Are there any knowledgeable people outside MIRI who might disagree with me on this statement and be interested in arguing with me about it? I'm more interested in the corrigibility bit than the ASI bit. (Like, Gary Marcus might argue that we're not getting ASI by 2035 regardless, but that's not what I'm looking for.)