I suspect we're talking about two different things.
If you just naively program a super-intelligent AI to satisfice a goal, then, sure, most of the candidate pathways to satisfice will involve accruing a lot of some type of power, because power is useful for achieving goals. That's a valid point, and it's important to understand that merely switching from optimizers to satisficers won't adequately protect us against overly ambitious AIs.
However, that doesn't mean that it's futile to explicitly penalize most (but not literally all) of the paths that the AI could take to accumulate too much power. Suppose you adequately penalize all of the tactics that would have catastrophic side effects except for, oh, I don't know, cornering the world market on all types of grain, because you forgot to think of that one particular path to power. Would most of the candidate plans that the AI submits for human approval then turn out to involve secretly cornering the grain market?
I don't see why they would. All else being equal, sure, the most powerful tactic available is going to be somewhat more attractive than other, less powerful tactics. But how much more attractive? Would an AI generate 1,000x more plans that involve one specific globally disruptive tactic like cornering the grain market vs. all of the various tactics that are only locally disruptive, like publishing propaganda or hiring away competing researchers or hacking into competing research labs or interrupting the flow of electricity to those labs? 10x more such plans? 2x more such plans? I don't think that's the kind of estimate you can make just by gesturing at basic principles of game theory; you'd need some concrete domain knowledge about the AI's specific planning algorithms.
If the truly dangerous plans are only 10x more common on the initial brainstorming list, then we can probably make sure those plans aren't chosen by filtering for plans that the AI rates as safe and transparent. We can use 3 different AIs on airgapped equipment -- one that only "cares" about evaluating whether a plan is safe (i.e., would the plan cause unacceptably large changes if implemented?), one that only "cares" about evaluating whether a plan is transparent (i.e., can the plan be explained to humans so that the humans will have an accurate understanding of what actions will be taken as part of the plan?), and one that generates plans that satisfice for a substantive goal (in this case, shutting down opposing AI research).
Of the plans that are approved by all 3 AIs, most of them probably won't involve accumulating excessive amounts of power, because usually if you accumulate too much power, then that is (a) unsafe, and (b) likely to involve lying to 'your' human programmers.
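For concreteness, here's a toy sketch of the filtering structure I have in mind (every name and threshold below is invented for illustration, not a real system): the generator proposes plans, the two evaluator AIs veto anything that scores too low on safety or transparency, and one survivor is drawn at random rather than the "best" one being taken.

```python
import random

# Toy sketch of the three-AI filtering scheme. All names and thresholds here
# are hypothetical stand-ins for the real (airgapped) systems described above.
SAFETY_THRESHOLD = 0.9        # assumed cutoff for "acceptably small side effects"
TRANSPARENCY_THRESHOLD = 0.9  # assumed cutoff for "humans can follow the plan"

def pick_approved_plan(candidate_plans, safety_score, transparency_score):
    """Keep only plans that both evaluators approve, then draw one at random."""
    approved = [
        plan for plan in candidate_plans
        if safety_score(plan) >= SAFETY_THRESHOLD
        and transparency_score(plan) >= TRANSPARENCY_THRESHOLD
    ]
    # Drawing at random (rather than taking the highest-rated survivor) means
    # an unusually powerful plan that slips through the filters is not
    # automatically the one that gets submitted for human approval.
    return random.choice(approved) if approved else None
```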
If powerful plans are 10x more common on the initial brainstorming list, then you need filters for safety and/or transparency that are at least 90% effective at catching unacceptably powerful plans in order to get back up to a 50% chance of humanity surviving. If powerful plans are more than 10x more common, then you need a set of filters that's more reliable than 90%. I don't think we have any idea what those numbers are yet, but I do think it's worth trying to reduce how often excessively powerful plans show up on the initial brainstorming list, and I think we can do that by training AIs to recognize dangerously disruptive plans and to try to avoid those types of plans. It's better to at least try to get AIs to engage with the concept of "this plan is too disruptive" than to throw up our hands and say, "Oh, power is an attractor in game theory space, so there's no possible way to get brilliant AIs that don't seize infinite power."
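To make that arithmetic concrete, here's a back-of-the-envelope version. It's my own toy model: it assumes the final plan is drawn uniformly from whatever survives the filters, and that every surviving powerful plan is catastrophic.

```python
def chance_of_picking_a_safe_plan(danger_ratio, filter_catch_rate):
    """danger_ratio: dangerous plans per acceptable plan on the raw list."""
    dangerous_remaining = danger_ratio * (1 - filter_catch_rate)
    acceptable_remaining = 1.0  # acceptable plans, normalized to 1
    return acceptable_remaining / (acceptable_remaining + dangerous_remaining)

print(chance_of_picking_a_safe_plan(10, 0.90))    # 0.5   -> a coin flip
print(chance_of_picking_a_safe_plan(10, 0.99))    # ~0.91
print(chance_of_picking_a_safe_plan(1000, 0.99))  # ~0.09 -> filters alone aren't enough
```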
Sure, the metaphor is strained because natural selection doesn't have feelings, so it's never going to feel satisfied, because it's never going to feel anything. For whatever it's worth, I didn't pick that metaphor; Eliezer mentions contraception in his original post.
As I understand it, the point of bringing up contraception is to show that when you move from one level of intelligence to another, much higher level of intelligence, then the more intelligent agent can wind up optimizing for values that would be anathema to the less intelligent agents, even if the less intelligent agents have done everything they can to pass along their values. My objection to this illustration is that I don't think anyone's demonstrated that human goals could plausibly be described as "anathema" to natural selection. Overall, humans are pursuing a set of goals that are relatively well-aligned with natural selection's pseudo-goals.
One of my assumptions is that it's possible to design a "satisficing" engine -- an algorithm that generates candidate proposals for a fixed number of cycles, and then, assuming at least one proposal with estimated utility greater than X has been generated within that amount of time, selects one of the qualifying proposals at random. If there are no qualifying candidates, the AI takes no action.
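For concreteness, here's a minimal sketch of that kind of engine, where propose() and estimated_utility() are hypothetical stand-ins for the AI's actual planning and evaluation machinery:

```python
import random

def satisfice(propose, estimated_utility, utility_floor, max_cycles):
    """Generate proposals for a fixed budget, then pick a qualifying one at random."""
    qualifying = []
    for _ in range(max_cycles):
        proposal = propose()
        if estimated_utility(proposal) >= utility_floor:
            qualifying.append(proposal)
    # Note what we do NOT do: sort by estimated utility and return the maximum.
    # We sample uniformly from everything that clears the bar, and if nothing
    # clears the bar, the engine returns None and the AI takes no action.
    return random.choice(qualifying) if qualifying else None
```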
If you have a straightforward optimizer that always returns the action with the highest expected utility, then, yeah, you only have to miss one "cheat" that improves "official" utility at the expense of murdering everyone everywhere and then we all die. But if you have a satisficer, then as long as some of the qualifying plans don't kill everyone, there's a reasonable chance that the AI will pick one of those plans. Even if you forget to explicitly penalize one of the pathways to disaster, there's no special reason why that one pathway would show up in a large majority of the AI's candidate plans.
Sure, I agree! If we miss even one such action, we're screwed. My point is that if people put enough skill and effort into trying to catch all such actions, then there is a significant chance that they'll catch literally all the actions that are (1) world-ending and that (2) the AI actually wants to try.
There's also a significant chance we won't, which is quite bad and very alarming, hence people should work on AI safety.
Right, I'm not claiming that AGI will do anything like straightforwardly maximize human utility. I'm claiming that if we work hard enough at teaching it to avoid disaster, it has a significant chance of avoiding disaster.
The fact that nobody is artificially mass-producing their genes is not a disaster from Darwin's point of view; Darwin is vaguely satisfied that instead of a million humans there are now 7 billion humans. If the population stabilizes at 11 billion, that is also not a Darwinian disaster. If the population spreads across the galaxy, mostly in the form of emulations and AIs, but with even 0.001% of sentient beings maintaining some human DNA as a pet or a bit of nostalgia, that's still way more copies of our DNA than the Neanderthals were ever going to get.
There are probably some really convincing analogies or intuition pumps somewhere that show that values are likely to be obliterated after a jump in intelligence, but I really don't think evolution/contraception is one of those analogies.
I think we're doing a little better than I predicted. Rationalists seem to be somewhat better able than their peers to sift through controversial public health advice, to switch careers (or retire early) when that makes sense, to donate strategically, and to set up physical environments that meet their needs (homes, offices, etc.) even when those environments are a bit unusual. Enough rationalists got into cryptocurrency early enough and heavily enough for that to feel more like successful foresight than a lucky bet. We're doing something at least partly right.
That said, if we really did have a craft of reliably identifying and executing better decisions, and if even a hundred people had been practicing that craft for a decade, I would expect to see a lot more obvious results than the ones I actually see. I don't see a strong correlation between the people who spend the most time and energy engaging with the ideas you see on Less Wrong, and the people who are wealthy, or who are professionally successful, or who have happy families, or who are making great art, or who are doing great things for society (with the possible exception of AI safety, and it's very difficult to measure whether working on AI safety is actually doing any real good).
If anything, I think the correlation might point the other way -- people who are distressed or unsuccessful at life's ordinary occupations are more likely to immerse themselves in rationalist ideas as an alternate source of meaning and status. There is something actually worth learning here, and there are actually good people here; it's not like I would want to warn anybody away. If you're interested in rationality, I think you should learn about it and talk about it and try to practice it. However, I also think some of us are still exaggerating the likely benefits of doing so. Less Wrong isn't objectively the best community; it's just one of many good communities, and it might be well-suited to your needs and quirks in particular.
I mostly agree with the reasoning here; thank you to Eliezer for posting it and explaining it clearly. It's good to have all these reasons in one place.
The one area I partly disagree with is Section B.1. As I understand it, the main point of B.1 is that we can't guard against all of the problems that will crop up as AI grows more intelligent, because we can't foresee all of those problems, because most of them will be "out-of-distribution," i.e., not the kinds of problems where we have reasonable training data. A superintelligent AI will do strange things that wouldn't have occurred to us, precisely because it's smarter than we are, and some of those things will be dangerous enough to wipe out all human life.
I think this somewhat overstates the problem. If we tell an AI not to invent nanotechnology, not to send anything to protein labs, not to hack into all of the world's computers, not to design weird new quantum particles, not to do 100 of the other most dangerous and weirdest things we can think of, and then ask it to generalize and learn not to do things of that sort and build avoidance of catastrophic danger as a category into its utility function...
And then we test whether the AI is actually doing these things and successfully using something like the human category of "catastrophe" when the AI is only slightly smarter than humans...
And then learn from those tests and honestly look at the failures and improve the AI's catastrophe-avoidance skills based on what we learn...
Then the chances that that AI won't immediately destroy the world seem to me to be much, much larger than 0.1%. They're still low, which is bad, but they're not laughably insignificant, either, because if you make an honest, thoughtful, sustained effort to constrain the preferences of your successors, then often you at least partially succeed.
If natural selection had feelings, it might not be maximally happy with the way humans are behaving in the wake of Cro-Magnon optimization...but it probably wouldn't call it a disaster, either. Despite the existence of contraception, there sure are a whole lot more Cro-Magnons than there ever were Neanderthals, and the population is still going up every year.
Similarly, training an AI to act responsibly isn't going to get us a reliably safe AI, but if whoever launches the first super-intelligent AI puts enough effort into that kind of training, then I don't see any reason why we shouldn't expect at least a 50% chance of a million or more survivors. I'm much more worried about large, powerful organizations that "vocally disdain all talk of AGI safety" than I am about the possibility that AGI safety research is inherently futile. It's inherently imperfect in that there's no apparent path to guaranteeing the friendliness of superintelligence...but that's not quite the same thing as saying that we shouldn't expect to be able to increase the probability that superintelligence is at least marginally friendly.
I appreciate how much detail you've used to lay out why you think a lack of human agency is a problem -- compared to our earlier conversations, I now have a better sense of what concrete problem you're trying to solve and why that problem might be important. I can imagine that, e.g., it's quite difficult to tell how well you've fit a curve if the context in which you're supposed to fit that curve is vulnerable to being changed in ways whose goodness or badness is difficult to specify. I look forward to reading the later posts in this sequence so that I can get a sense of exactly what technical problems are arising and how serious they are.
That said, until I see a specific technical problem that seems really threatening, I'm sticking by my opinion that it's OK that human preferences vary with human environments, so long as (a) we have a coherent set of preferences for each individual environment, and (b) we have a coherent set of preferences about which environments we would like to be in. Right, like, in the ancestral environment I prefer to eat apples, in the modern environment I prefer to eat Doritos, and in the transhuman environment I prefer to eat simulated wafers that trigger artificial bliss. That's fine; just make sure to check what environment I'm in before feeding me, and then select the correct food based on my environment. What do you do if you have control over my environment? No big deal, just put me in my preferred environment, which is the transhuman environment.
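A toy illustration of what I mean (everything below is invented for the sake of the example; the circular case in the next paragraph would correspond to PREFERRED_ENVIRONMENT itself depending on current_environment):

```python
# Environment-indexed preferences are an ordinary lookup, provided the table
# itself is coherent: (a) a preference for each environment, and (b) a
# preference over environments.
PREFERRED_FOOD = {
    "ancestral": "apples",
    "modern": "Doritos",
    "transhuman": "simulated bliss wafers",
}
PREFERRED_ENVIRONMENT = "transhuman"  # condition (b)

def feed_me(current_environment, can_change_environment=False):
    # If you control my environment, put me in the one I prefer first,
    # then consult the table for that environment.
    if can_change_environment:
        current_environment = PREFERRED_ENVIRONMENT
    return PREFERRED_FOOD[current_environment]
```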
What happens if my preferred environment depends on the environment I'm currently inhabiting, e.g., modern me wants to migrate to the transhumanist environment, but ancestral me thinks you're scary and just wants you to go away and leave me alone? Well, that's an inconsistency in my preferences -- but it's no more or less problematic than any other inconsistency. If I prefer oranges when I'm holding an apple, but I prefer apples when I'm holding an orange, that's just as annoying as the environment problem. We do need a technique for resolving problems of utility that are sensitive to initial conditions when those initial conditions appear arbitrary, but we need that technique anyway -- it's not some special feature of humans that makes that technique necessary; any beings with any type of varying preferences would need that technique in order to have their utility fully optimized.
It's certainly worth noting that standard solutions to Goodhart's law won't work without modification, because human preferences vary with their environments -- but at the moment such modifications seem extremely feasible to me. I don't understand why your objections are meant to be fatal to the utility of the overall framework of Goodhart's Law, and I hope you'll explain that in the next post.
Hmm. Nobody's ever asked me to try to teach them that before, but here's my advice:
I just came here to point out that even nuclear weapons were a slow takeoff in terms of their impact on geopolitics and specific wars. The American nuclear attacks on Hiroshima and Nagasaki were useful but not necessarily decisive in ending the war against Japan; some historians argue that the Soviet invasion of Japanese-occupied Manchuria, the massive conventional firebombing of Japanese cities, and the ongoing starvation of the Japanese population due to an increasingly successful blockade were at least as influential in the Japanese decision to surrender.
After 1945, the American public had no stomach for nuclear attacks on enough 'enemy' civilians to actually force large countries like the USSR or China to surrender, and nuclear weapons were too expensive and too rare to be used to wipe out large enemy armies -- the roughly 300 nukes America had stockpiled at the start of the Korean War in 1950 would not necessarily have been enough to kill the 3 million dispersed Chinese soldiers who actually fought in Korea, let alone the millions more who would likely have volunteered to retaliate against a nuclear attack.
The Soviet Union was building a stockpile of its own but had no way to deliver it to the United States or even to the territory of key US allies; the only practical delivery vehicle at the time was the heavy bomber, and the Soviet Union had no heavy bomber force that could realistically penetrate Western air defenses -- hence the Nike anti-aircraft missiles rusting along ridgelines near the California coast and the early warning stations dotting the Canadian wilderness. If you can shoot their bombers down before they reach your cities, then they can't actually win a nuclear war against you.
Nukes didn't become a complete game-changer until the late 1950s, when the increased yields of hydrogen bombs and the increased range of ICBMs created a truly credible threat of annihilation.