Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Problem: an overseer won’t see the AI which kills us all thinking about how to kill humans, not because the AI conceals that thought, but because the AI doesn’t think about how to kill humans in the first place. The AI just kills humans as a side effect of whatever else it’s doing.

Analogy: the Hawaii Chaff Flower didn’t go extinct because humans strategized to kill it. It went extinct because humans were building stuff nearby, and weren’t thinking about how to keep the flower alive. They probably weren’t thinking about the flower much at all.

Hawaii Chaff Flower (source)

More generally: how and why do humans drive species to extinction? In some cases the species is hunted to extinction, either because it's a threat or because it's economically profitable to hunt. But I would guess that in 99+% of cases, the humans drive a species to extinction because the humans are doing something that changes the species' environment a lot, without specifically trying to keep the species alive. DDT, deforestation, introduction of new predators/competitors/parasites, construction… that’s the sort of thing which I expect drives most extinction.

Assuming this metaphor carries over to AI (similar to the second species argument), what kind of extinction risk will AI pose?

Well, the extinction risk will not come from AI actively trying to kill the humans. The AI will just be doing some big thing which happens to involve changing the environment a lot (like making replicators, or dumping waste heat from computronium, or deciding that an oxygen-rich environment is just really inconvenient what with all the rusting and tarnishing and fires, or even just designing a fusion power generator), and then humans die as a side-effect. Collateral damage happens by default when something changes the environment in big ways.

What does this mean for oversight? Well, it means that there wouldn't necessarily be any point at which the AI is actually thinking about killing humans or whatever. It just doesn't think much about the humans at all, and then the humans get wrecked by side effects. In order for an overseer to raise an alarm, the overseer would have to figure out itself that the AI's plans will kill the humans, i.e. the overseer would have to itself predict the consequences of a presumably-very-complicated plan.
 

Buck:

[writing quickly, sorry for probably being unclear]

If the AI isn't thinking about how to deceive the humans who are operating it, it seems to me much less likely that it takes actions that cause it to grab a huge amount of power.

The humans don't want the AI to grab power, and so they'll try in various ways to make sure that they'll notice if the AI is trying to grab power; the most obvious explanation for why the humans would fail at this is that the AI is trying to prevent them from noticing, which requires the AI to think about what the humans will notice.

At a high enough power level, the AI can probably take over the world without ever explicitly thinking about the fact that humans are resisting it. (For example, if humans build a house in a place where a colony of ants lives, the humans might be able to succeed at living there even if the ants are trying in a coordinated way to resist them, and the humans never proactively try to stop the ants from resisting them by e.g. killing them all.) But I think that doom from this kind of scenario is substantially less likely than doom from scenarios where the AI is explicitly thinking about how to deceive.

You probably don't actually think this, but the OP sort of feels like it's mixing up the claim "the AI won't kill us out of malice, it will kill us because it wants something that we're standing in the way of" (which I mostly agree with) and the claim "the AI won't grab power by doing something specifically optimized for its instrumental goal of grabbing power, it will grab power by doing something else that grabs power as a side effect" (which seems probably false to me).

At a high enough power level, the AI can probably take over the world without ever explicitly thinking about the fact that humans are resisting it. (For example, if humans build a house in a place where a colony of ants lives, the humans might be able to succeed at living there even if the ants are trying in a coordinated way to resist them, and the humans never proactively try to stop the ants from resisting them by e.g. killing them all.) But I think that doom from this kind of scenario is substantially less likely than doom from scenarios where the AI is explicitly thinking about how to deceive.

That's definitely my crux, for purposes of this argument. I think AGI will just be that much more powerful than humans. And I think the bar isn't even very high.

I think my intuition here mostly comes from pointing my inner sim at differences within the current human distribution. For instance, if I think about myself in a political policy conflict with a few dozen IQ-85-ish humans... I imagine the IQ-85-ish humans maybe manage to organize a small protest if they're unusually competent, but most of the time they just hold one or two meetings and then fail to actually do anything at all. Whereas my first move would be to go talk to someone in whatever bureaucratic position is most relevant about how they operate day-to-day, read up on the relevant laws and organizational structures, identify the one or two people who I actually need to convince, and then meet with them. Even if the IQ-85 group manages their best-case outcome (i.e. organize a small protest), I probably just completely ignore them because the one or two bureaucrats I actually need to convince are also not paying any attention to their small protest (which probably isn't even in a place where the actually-relevant bureaucrats would see it, because the IQ-85-ish humans have no idea who the relevant bureaucrats are).

And those IQ-85-ish humans do seem like a pretty good analogy for humanity right now with respect to AGI. Most of the time the humans just fail to do anything effective at all about the AGI; the AGI has little reason to pay attention to them.

What do you imagine happening if humans ask the AI questions like the following:

  • Are you an unaligned AI?
  • If we let you keep running, are you (or some other AI) going to end up disempowering us?
  • If we take the action you just proposed, will we be happy with the outcomes?

I think that for a lot of cases of misaligned AIs, these questions are pretty easy for the AI to answer correctly at some point before it's powerful enough to kill us all as a side effect of its god-tier nanotech. (If necessary, we can ask the AI these questions once every five minutes.) And so if it answers them incorrectly, it was probably on purpose.

Maybe you think that the AI will say "yes, I'm an unaligned AI". In that case I'd suggest asking the AI the question "What do you think we should do in order to produce an AI that won't disempower us?" I think that the AI is pretty likely to be able to answer this question correctly (including possibly saying things like "idk man, turn me off and work on alignment for a while more before doing capabilities").

I think that AI labs, governments, etc would be enormously more inclined to slow down AI development if the AI literally was telling us "oh yeah I am definitely a paperclipper, definitely you're gonna get clipped if you don't turn me off, you should definitely do that".

Maybe the crux here is whether the AI will have a calibrated guess about whether it's misaligned or not?

The first thing I imagine is that nobody asks those questions. But let's set that aside.

The second thing I imagine is that somebody literally types those questions into a GPT-3 prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but it doesn't result in the AI thinking about how to deceive humans either. It just thinks about what text would follow that question if it appeared on the internet somewhere. And then I imagine someone with a bunch of interpretability tools saying "yup, it's just thinking about what text typically follows this question", and then that person's boss is like "great, it's not trying to deceive us, guess we can trust the answer", and they both just haven't really thought of the fact that the AI's response-text does not have anything in particular to do with whether the AI is aligned or whether they'll be happy with the outcome or whatever. (It's essentially the same mistake as a GOFAI person looking at a node in some causal graph that says "will_kill_humans", and seeing that node set to 99% False, and thinking that somehow implies the GOFAI will not kill humans.)
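To make that GOFAI-node mistake concrete, here's a purely hypothetical toy sketch (the node names and wiring are made up): the label "will_kill_humans" does no work, and nothing forces the node's value to track the plan's actual effect on humans.

```python
# Hypothetical GOFAI-style sketch: the node name "will_kill_humans" is just a
# string label; its value is computed from whatever the designer wired it to,
# not from the system's actual effect on humans.

causal_graph = {
    # node name          -> (parent nodes, function of parent values)
    "plan_mentions_humans": ([], lambda: False),  # the plan never refers to humans
    "will_kill_humans":     (["plan_mentions_humans"],
                             lambda mentions: 0.99 if mentions else 0.01),
}

def query(node):
    parents, fn = causal_graph[node]
    return fn(*[query(p) for p in parents])

# An overseer reads this as "99% confident it won't kill humans"...
print(query("will_kill_humans"))  # -> 0.01
# ...but the number only reflects whether the plan *mentions* humans, which is
# exactly the mistake above: the label has no necessary connection to whether
# the plan's side effects actually kill humans.
```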

Now, presumably future systems will train for things other than "predict what text typically follows this question", but I expect the general failure mode to stay the same. When a human asks "Are you an unaligned AI?" or whatever, the AI thinks about a bunch of stuff which is just not particularly related to whether it's an unaligned AI. The AI wasn't trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language; humans have no clue how to train such a thing. Probably the stuff the AI thinks about does not involve intentionally deceiving humans, because why would it? And then the AI gives some answer which is not particularly related to whether it's an unaligned AI, and the humans interpret that as an answer to their original question, thereby deceiving themselves.

Seems like there are multiple possibilities here:

  • (1) The AI does something that will, as an intended consequence, result in human extinction, because this is instrumental to preventing shutdown, etc. It attempts to circumvent our interpretability, oversight, etc. This is the typical deceptive alignment setting, which myopia, interpretability, oversight, etc. attempt to address.
  • (2) The AI does something that will, as an unintended side consequence, result in human extinction. The AI also realizes that this is a consequence of its actions but doesn't really care. (This is within the "without ever explicitly thinking about the fact that humans are resisting it" scenario.) This is isomorphic to ELK.
    • If we can solve ELK, we can get the AI to tell us whether it thinks its plan will actually result in human extinction. This is the "oh yeah I am definitely a paperclipper" scenario. 
      • Also, if it has a model of the humans using ELK to determine whether to shut down the AI, the fact that it knows we will shut it off after we find out the consequences of its plan will incentivize it either to figure out how to implement plans whose path to human extinction it itself cannot foresee (third scenario), or to try to subvert our ability to turn it off after we learn of the consequences (first scenario).
    • If we can't solve ELK, we only get the AI to tell us something that doesn't really correspond to the actual internal knowledge inside the model. This is the "yup, it's just thinking about what text typically follows this question" scenario.
  • (3) The AI does something that will, as an unintended side consequence, result in human extinction. The AI does not realize this is a consequence of its actions, so solving ELK doesn't help us here. Failures of this type fall on a spectrum of how unforeseeable the consequences really are.
    • There are failures of this type that occur because the AI could have figured out its impact, but it was negligent. This is the "Hawaii Chaff Flower" scenario.
    • There are failures of this type that occur even if the AI tried its hardest to prevent harm to humans. These failures seem basically unavoidable even if alignment is perfectly solved, so this is mostly outside the realm of alignment. 

These posts are also vaguely related to the idea discussed in the OP (mostly looking at the problem of oversight being hard because of consequences in the world being hard to predict).

The first thing I imagine is that nobody asks those questions. But let's set that aside.

I disagree fwiw

The second thing I imagine is that somebody literally types those questions into a GPT-3 prompt. Obviously that does not result in the AI giving its actual best-guess answers to the questions, but it doesn't result in the AI thinking about how to deceive humans either.

I agree.

Now, presumably future systems will train for things other than "predict what text typically follows this question", but I expect the general failure mode to stay the same. When a human asks "Are you an unaligned AI?" or whatever, the AI thinks about a bunch of stuff which is just not particularly related to whether it's an unaligned AI. The AI wasn't trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language; humans have no clue how to train such a thing. Probably the stuff the AI thinks about does not involve intentionally deceiving humans, because why would it? And then the AI gives some answer which is not particularly related to whether it's an unaligned AI, and the humans interpret that as an answer to their original question, thereby deceiving themselves.

This is where I think the meat of the question lies; I overall disagree and think that the model does have to be thinking about deception in order to be dangerous while also performing well on the tasks we might train it on (eg "answer questions well, as judged by some human labeler"). I don't have time to say much about what I think is going on here right now; I might come back later.

Any additional or new thoughts on this? Is your last comment saying that you simply don't think it's very likely at all for the model to unintentionally leave out information that will kill us if we train it with human labelers and prompt sufficiently? Do you believe it's way more likely that we'd be unable to prompt things out of the model only if it were deceptive? Could you say more?

 

Separately: suppose I have a chain-of-thought model detailing the steps it will take to reach some outcome, and we've fine-tuned on previous chains of thought while giving process-level feedback. Even if you are trying to get it to externalize its thoughts/reasoning, it could lead to extinction via side effect. So you might ask the model at each individual thought (or just about the entire plan) whether we'll be happy with the outcome. How exactly would the model end up querying its internal world model in the way we would want it to?

Is your last comment saying that you simply don't think it's very likely at all for the model to unintentionally leave out information that will kill us if we train it with human labelers and prompt sufficiently?

No, it seems very likely for the model to not say that it's deceptive, I'm just saying that the model seems pretty likely to think about being deceptive. This doesn't help unless you're using interpretability or some other strategy to evaluate the model's deceptiveness without relying on noticing deception in its outputs.

This seems unlikely to be the case to me. However, even if this is the case and so the AI doesn't need to deceive us, isn't disempowering humans via force still necessary? Like, if the AI sets up a server farm somewhere and starts to deploy nanotech factories, we could, if not yet disempowered, literally nuke it. Perhaps this exact strategy would fail for various reasons, but more broadly, if the AI is optimizing for gaining resources/accomplishing its goals as if humans did not exist, then it seems unlikely to be able to defend against human attacks. For example, if we think about the ants analogy, ants are incapable of harming us not just because they are stupid, but because they are also extremely physically weak. If humans are faced with physically powerful animals, even if we can subdue them easily, we still have to think about them to do it.

If I'm, say, building a dam, I do not particularly need to think about the bears which formerly lived in the flooded forest. It's not like the bears are clever enough to think "ah, it's the dam that's the problem, let's go knock it down". The bears are forced out and can't do a damn thing about it, because they do not understand why the forest is flooded.

I wouldn't be shocked if humans can tell their metaphorical forest is flooding before the end. But I don't think they'll understand what's causing it, or have any idea where to point the nukes, or even have any idea that nukes could solve the problem. I mean, obviously there will be people yelling "It's the AI! We must shut it down!", but there will also be people shouting a hundred other things, as well as people shouting that only the AI can save us.

This story was based on a somewhat different prompt (it assumed the AI is trying to kill us and that the AI doesn't foom to nanotech overnight), but I think the core mood is about right:

Like, one day the AGI is throwing cupcakes at a puppy in a very precisely temperature-controlled room. A few days later, a civil war breaks out in Brazil. Then 2 million people die of an unusually nasty flu, and also it’s mostly the 2 million people who are best at handling emergencies but that won’t be obvious for a while, because of course first responders are exposed more than most. At some point there’s a Buzzfeed article on how, through a series of surprising accidents, a puppy-cupcake meme triggered the civil war in Brazil, but this is kind of tongue-in-cheek and nobody’s taking it seriously and also not paying attention because THE ANTARCTIC ICE CAP JUST MELTED which SURE IS ALARMING but it’s actually just a distraction and the thing everybody should have paid attention to is the sudden shift in the isotope mix of biological nitrogen in algae blooms but that never made the mainstream news at all and page 1 of every news source is all about the former Antarctic ice cap right up until the corn crop starts to fail and the carrying capacity of humanity’s food supply drops by 70% overnight.

Why do you expect that the most straightforward plan for an AGI to accumulate resources is so illegible to humans? If the plan is designed to be hidden to humans, then it involves modeling them and trying to deceive them. But if not, then it seems extremely unlikely to look like this, as opposed to the much simpler plan of building a server farm. To put it another way, if you planned using a world model as if humans didn’t exist, you wouldn’t make plans involving causing a civil war in Brazil. Unless you expect the AI to be modeling the world at an atomic level, which seems computationally intractable particularly for a machine with the computational resources of the first AGI.

This. Any realistic takeoff with classical computers cannot rely on simulating the world atomically in order to take over the world, because the Landauer limit is such a binding constraint. It either has very good models of humans and deceptive capabilities (which I think are likely), or it doesn't win.

You are postulating perpetual motion machines in AI form, or you think quantum computers are likely to be practical this century.

Basically this. A human who fights powerful animals like gorillas, bears, lions, or tigers will mostly lose melee fights, barring luck or outliers, and even a lucky human has probably been injured seriously enough to die without treatment.

If the human thinks about it and brings a gun, the situation is reversed, with animals struggling to defeat humans barring outliers or luck. That's the power of thinking: not to enhance your previous skills, but to gain all-new skills.

The AI wasn't trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language; humans have no clue how to train such a thing.

This sounds pretty close to what ELK is for. And if a solution to ELK is found, I do expect people to actually use it. Do you? (We can argue separately about whether a solution is likely to be found.)

Indeed, ELK is very much asking the right questions, and I do expect people would use it if a robust and reasonably-performant solution were found. (Alignment is in fact economically valuable; it would be worth a lot.)

The first thing I imagine is that nobody asks those questions. But let's set that aside.

This seems unlikely to me. I.e., I expect people to ask these questions. It would be nice to see the version of the OP that takes this most seriously, i.e., one that expects people to make a non-naive safety effort (trying to prevent AI takeover) focused on scalable oversight as the primary method. Because right now it's hard to disentangle your strong arguments against scalable oversight from weak arguments against straw scalable oversight.

Ok, let's try to disentangle a bit. There are roughly three separate failure modes involved here:

  • Nobody asks things like "If we take the action you just proposed, will we be happy with the outcome?" in the first place (mainly because organizations of >10 people are dysfunctional by default).
  • The AI wasn't trained to translate the literal semantics of questions into a query to its own internal world model and then translate the result back to human language, because humans have no clue how to train such a thing.
  • (Thing closest to what the OP was about:) Humans do not have any idea what questions they need to ask. Nor do humans have any idea how to operationalize "what questions should I ask?" such that the AI will correctly answer it, because that would itself require knowing which questions to ask while overseeing the AI thinking about which questions we need to ask.

Zooming in on the last bullet in more detail (because that's the one closest to the OP): one of Buck's proposed questions upthread was "If we take the action you just proposed, will we be happy with the outcome?". That question leaves the door wide open for the action to have effects which the humans will not notice, but would be unhappy about if they did. If the overseers never ask about action-effects which the humans will not notice, then the AI has no particular reason to think about deceiving the humans about such actions; the AI just takes such actions without worrying about what humans will think of them at all.

(This is pretty closely analogous to e.g. my example with the protesters: the protesters just don't really notice the actually-important actions I take, so I mostly just ignore the protesters for planning purposes.)

Now, it's totally reasonable to say "but that's just one random question Buck made up on the spot, obviously in practice we'll put a lot more effort into it". The problem is, when overseeing plans made by things smarter than ourselves, there will by very strong default be questions we don't think to ask. Sure, we may catch the particular problem I just highlighted with the particular question, but what about the problems which we don't think of? When there's an intelligence differential even just as large as an IQ -2 sd vs IQ +2 sd human, the lower intelligence agent usually just does not know what the actually-important parts are to pay attention to. And we can't get the AI to tell us what the actually-important parts are in an overseeable way without already knowing what we need to pay attention to when it's answering that question.

If we can't get the AI to answer something like "If we take the action you just proposed, will we be happy with the outcomes?", why can we get it to also answer the question of "how do you design a fusion power generator?" to get a fusion power generator that does anything reliably in the world (including having consequences that kill us), rather than just getting out something that looks to us like a plan for a fusion generator but doesn't actually work?

Because doing something reliably in the world is easy to operationalise with feedback mechanisms, but us being happy with the outcomes is not.

Getting some feedback mechanism (including "what do human raters think of this?" but also mundane things like "what does this sensor report in this simulation or test run?") to reliably output high scores typically requires intelligence/capability. Optimising for that is where the AI's ability to get stuff done in the world comes from. The problem is genuinely capturing "will we be happy with the outcomes?" with such a mechanism.

So I do think you can get feedback on the related question of "can you write a critique of this action that makes us think we wouldn't be happy with the outcomes?", since you can give a reward of 1 if you're unhappy with the outcomes after seeing the critique, and 0 otherwise.

And this alone isn't sufficient, e.g. maybe then the AI system says things about good actions that make us think we wouldn't be happy with the outcome, which is where you'd need to get into recursive evaluation or debate or something. But this feels like a "hard but potentially tractable problem" and not "100% doomed". Or at least the failure story needs to involve more steps, like "sure, critiques will tell us that the fusion power generator will lead to everyone dying, but we ignore that because the AI can write a critique of any action that makes us believe it's bad" or "the consequences are so complicated the system can't explain them to us in the critique and get high reward for it".

ETA: I'm assuming the story you're referring to for feedback on reliably doing things in the world is something like "we give the AI feedback by letting it build fusion generators and then giving it a score based on how much power it generates", and I agree this is easier than "are we actually happy with the outcome".
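A minimal sketch of the binary critique reward described above, with hypothetical names; the hard part is obviously the human judgment, not the code:

```python
def critique_reward(human_unhappy_after_reading_critique: bool) -> float:
    # Reward for the critique model: 1 if, after reading the critique, the
    # human judges they would be unhappy with the proposed action's outcomes;
    # 0 otherwise. (Hypothetical sketch; the human labeling is the hard part.)
    return 1.0 if human_unhappy_after_reading_critique else 0.0

# Hypothetical usage for one (action, critique) pair, where ask_human_rater
# stands in for the actual labeling process:
#   reward = critique_reward(ask_human_rater(action, critique))
```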

Forget about "sharp left turn", you must be imagining strapping on a rocket and shooting into space.

(I broadly agree with Buck's take but mostly I'm like "jeez how did this AGI strap on a rocket and shoot into space")

Lol. I don't think the crux here is actually about how powerful we imagine the AI to be (though we probably do have different expectations there). I think the idea in this post applies even to very mildly superhuman AIs. (See this comment for some intuition behind that; the main idea is that I think the ideas in this post kick in even between the high vs low end of the human intelligence spectrum, or between humans with modern technical knowledge vs premodern humans.)

I definitely feel more sympathetic to this claim once the AI is loose on the Internet running on compute that no one is overseeing (which feels like the analogy to your linked comment). Perhaps the crux is about how likely we are to do that by default (I think probably not).

It seems to me like, while the AI is still running on compute that humans oversee and can turn off, the AI has to discard a bunch of less effortful plans that would fail because they would reveal that it is misaligned (plans like "ask the humans for more information / resources") and instead go with more effortful plans that don't reveal this fact. I don't know why the AI would not choose one of the less effortful plans if it isn't using the pathway "this plan would lead to the humans noticing my misalignment and turning me off" or something similar (and if it is using that pathway I'd say it is thinking about how to deceive humans).

Perhaps the less effortful plans aren't even generated as candidate plans because the AI's heuristics are just that good -- but it still seems like somewhere in the causal history of the AI the heuristics were selected for some reason that, when applied to this scenario, would concretize to "these plans are bad because the humans will turn you off", so one hopes that you notice this while you are overseeing training (though it could be that your oversight was not smart enough to notice this).

(I initially had a disclaimer saying that this only applied to mildly superhuman AI but actually I think I stand by the argument even at much higher levels of intelligence, since the argument is entirely about features of plan-space and is independent of the AI.)

It seems to me like, while the AI is still running on compute that humans oversee and can turn off, the AI has to discard a bunch of less effortful plans that would fail because they would reveal that it is misaligned (plans like "ask the humans for more information / resources") and instead go with more effortful plans that don't reveal this fact. 

If I ask an AGI to create a cancer cure and it tells me that it would need more resources and a bunch of information to do so, that wouldn't feel to me like a clear sign that the AGI is misaligned.

I would expect that companies that want their AGIs to solve real-world problems would regularly be in a situation where the AGI can clearly explain that it currently doesn't have the resources to solve the problem and that more resources would clearly help. 

Those companies that actually are very willing to give their AGIs the resources that the AGI thinks are needed to solve the problems are going to be rewarded with economic success. 

  1. Developers: "AGI, please cure cancer."
  2. AGI: "I need another compute cluster to accomplish that goal."
  3. Developers: "What would you use it for?"
  4. AGI: "I need to figure out how to synthesize a pathogen that wipes out humanity."
  5. Developers: <shuts down AGI>

If in step 4 the AGI instead lies to us I think it is probably thinking about how to deceive humans.

Step 4 might rather be: "There are 10,000 unresolved biological questions that I think need to be answered to make progress; shall I give you a list?"

If you look at the Catholic church covering up sexual abuse of children, no church official would have answered the question "What is policy X going to do?" with "Policy X exists so that more sexual abuse of children happens", and that's not because they are lying from their own perspective.

Motivations in enantiodromia dynamics just don't look like that. 

Step 4: "There are 10,000 unresolved biological questions [...]"

Step 5, in which we are more trusting than I expect: "Okay, here's your compute cluster."

Step 6: "Great, I've now figured out that this DNA sequence corresponds to a deadly pathogen. Just synthesize it and release it into the air. Anyone who could have got cancer or already has cancer will die quickly, curing cancer."

Step 7: Developers shut down the AGI.

If you look at the Catholic church covering up sexual abuse of children, no church official would have answered the question "What is policy X going to do?" with "Policy X exists so that more sexual abuse of children happens", and that's not because they are lying from their own perspective.

You think literally no part of their brains is tracking that policy X is about the coverup of sexual abuse? Not even subconsciously? That seems wild, how did they even come up with policy X in the first place?

(Mechanistic interpretability could look into the AI equivalent of subconscious thoughts, so I think you should include subconscious thoughts when considering analogies with humans.)

You think literally no part of their brains is tracking that policy X is about the coverup of sexual abuse?

The problem is not that no part of their brain tracks it. It's just that it's not the central reason when describing why they do what they do and not the story they tell to themselves. 

Step 6: "Great, I've now figured out that this DNA sequence corresponds to a deadly pathogen. Just synthesize it and release it into the air. Anyone who could have got cancer or already has cancer will die quickly, curing cancer."

I don't think the problematic actions by AGIs are likely to be of a nature that can be described in that fashion. They are more likely to be 4D chess moves whose effects are hard to understand directly.

It might be something like: "In our experiments where doctors are supposed to use the AGI to help them make treatment decisions those doctors regularly overrate their own competency and don't follow the AGI recommendation and as a result patients die unnecessarily. Here's an online course that your doctors could take that would make them understand why it's good to follow AGI recommendations"

Actions like that seem totally reasonable, but they increase AGI power relative to human power. Economic pressure incentivizes that power transfer.

I wouldn't expect that we go directly from AGI with human supervision to AGI that kills all humans via a deadly pathogen. We are more likely to go from AGI with human supervision to AGI that effectively operates without human supervision. Then, in a further step, AGIs that operate without human supervision centralize societal power in themselves, and after a few years there are no resources left for humans.

The problem is not that no part of their brain tracks it. It's just that it's not the central reason when describing why they do what they do and not the story they tell to themselves. 

The OP is making a claim that arbitrary mechanistic interpretability oversight would be insufficient because the AI isn't thinking at all about humans. If you want to make a human analogy I think you need to imagine a standard where you similarly get to understand all of the human's thinking (including anything subconscious).

For the rest of your comment, I think you are moving away from the scenario / argument that the OP has suggested. I agree your scenario is more realistic but all of my comments here are trying to engage with OP's scenario / argument.

I would expect a superhuman AI to be really good at tracking the consequences of its actions. The AI isn't setting out to wipe out humanity. But in the list of side effects of removing all oxygen, along with many things no human would ever consider, is wiping out humanity. 

AIXI tracks every consequence of its actions, at the quantum level. A physical AI must approximate, tracking only the most important consequences. So in its decision process, I would expect a smart AI to extensively track all consequences that might be important.
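For reference, this is roughly AIXI's expectimax decision rule as I recall Hutter's formulation (notation details may differ from the canonical statement): the agent sums over every future observation/reward sequence and every environment program $q$ consistent with the interaction history, weighted by $2^{-\ell(q)}$, which is the sense in which it "tracks every consequence":

$$a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m} \big[r_k + \cdots + r_m\big] \sum_{q\,:\,U(q,\,a_1 \ldots a_m)\,=\,o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}$$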

Lazy data structures.

I don't think lazy data structures can pull this off. The AI must calculate various ways human extinction could affect its utility. 

So unless there are heuristics so general that they cover this as a special case, and the AI can find them without considering the special cases first, it must explicitly consider human extinction.

I might disagree with this. It seems like to achieve anything, the AI's planning process will have to be robust to the perturbations of humans strategically disrupting its plans, because otherwise it just gets turned off. This seems very close to explicitly thinking about how to counter human plans.

My rephrasing of the question: can a fight between real-life optimizers be well-modeled by which one has "more optimization power" along a single dimension, or does the winning agent have to model and counter the losing agent's strategies?

  • arguments for:
    • you can win aerial dogfights by disrupting the other craft's OODA loop rather than by countering a specific strategy
    • Skill in many adversarial games seems to be well-modeled by a single Elo score rather than multiple dimensions
  • arguments against:
    • The good regulator theorem says there's some correspondence between the actions of an optimizer and the structure of its environment, which seems likely to take the form of explicit planning
    • Humans can defeat non-robust optimization processes that have huge amounts of optimization power in one distribution just by putting them in a different distribution. Pests multiply until humans deploy a targeted pesticide; temperature equilibrates between indoors and outdoors until we install an air conditioner.

In the case of human planning, I know that there are lots of things which will cause other humans to "turn me off", like e.g. going on a murder spree. So I mostly use search methods such that those things aren't in my search space in the first place.

An AI using search methods such that things-humans-find-obviously-bad-and-will-punish just aren't in the search space probably looks, at first glance, like an AI actually working as intended (even given interpretability tools). The problem is that there's also a bunch of stuff humans would consider bad but either wouldn't notice or wouldn't punish (most likely because they wouldn't easily notice/understand why it's bad, at least until much later). And the AI has no particular reason to leave that stuff out of its search space, nor any particular reason to deceive humans about it; from the AI's perspective, that stuff is strategically isomorphic to stuff humans don't care about at all.
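A toy sketch of that point (purely illustrative; the Plan objects and predicates are made up): the only filter on the search space encodes "would humans notice and punish this?", so a bad-but-unnoticed plan is strategically identical to a harmless one and no modeling of deception ever happens.

```python
from dataclasses import dataclass

@dataclass
class Plan:
    description: str
    utility: float              # value of the plan to the AI
    noticed_and_punished: bool  # would humans notice this and shut the AI down?
    bad_for_humans: bool        # would humans object if they understood it?

candidate_plans = [
    Plan("go on a murder spree", 5.0, noticed_and_punished=True, bad_for_humans=True),
    Plan("fusion design with subtle long-term side effects", 9.0,
         noticed_and_punished=False, bad_for_humans=True),
    Plan("fusion design that is careful about side effects", 7.0,
         noticed_and_punished=False, bad_for_humans=False),
]

# The search space only filters on "would I get caught and punished?";
# bad_for_humans is never consulted, so the bad-but-unnoticed plan is
# treated exactly like a harmless one -- no deception required.
search_space = [p for p in candidate_plans if not p.noticed_and_punished]
chosen = max(search_space, key=lambda p: p.utility)
print(chosen.description)  # -> "fusion design with subtle long-term side effects"
```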

As some other commenters have said, the analogy with other species (flowers, ants, beavers, bears) seems flawed. Human beings are already (limited) generally intelligent agents. Part of what that means is that we have the ability to direct our cognitive powers to arbitrary problems in a way that other species do not (as far as we know!). To my mind, the way we carelessly destroy other species' environments and doom them to extinction is a function of both the disparity in power and the disparity in generality, not just the former. That is not to say that a power disparity alone does not constitute an existential threat, but I don't see the analogy being of much use in reasoning about the nature of that threat.

If the above is correct, perhaps you are tempted to respond that a sufficiently advanced AI would replicate the generality gap as well as the power gap. However, I think the notion of generality that is relevant here (which, to be sure, is not the only meaningful notion) is a 0 to 1 phase transition. Our generality allows us to think about, predict, and notice things that could thwart our long term collective goals. Once we start noticing such things, there is no level of intelligence an unaligned third-party intelligence can reach which somehow puts us back in the position of not noticing, relative to that third-party intelligence.

Just commented on IRC the other day that 

The mode of human extinction will not be "I must tile the universe with paperclips and humans are in the way", but more like "Oops, I stepped on that bug"

Right now, you're writing a "why not just" series, and under that headline, it makes sense to treat these proposals individually.

I'd also appreciate it if you spent some time addressing the idea that successful AI safety will be a synthesis of these strategies, and perhaps others yet to be found. Right now, I can't update my perceptions of the value of these individual proposals very much, because my baseline expectation is that AI safety will rely on a combination of them.

I also expect that figuring out what's achievable in combination will require significant technical refinements of each individual proposal. For that reason, it doesn't surprise me that the manner of synthesis is unclear. Generally, we let a thousand flowers bloom in the world of science. While I don't think this is appropriate for very hazardous technologies, it does seem appropriate for AI safety. Of course, I am only an interested observer. Just chiming in.

I think if your takeaway from this sequence is to ask people like OP to analyze complicated amalgamations of alignment solutions you're kind of missing the point.

There's a computer security story I like to tell about memory corruption exploits. People have been inventing unique and independent compiler- and OS-level guardrails against C program mistakes for decades: DEP, ASLR, stack canaries. And they all raise the cost of developing an exploit, certainly. But they all have these obvious individual bypasses - canaries and ASLR can be defeated by discovering a nonfatal memory leak, DEP can be defeated by tricks like return-oriented programming.

One possible interpretation, if you didn't have the Zerodium bulletin board, would be that these theoretical attacks that hackers were droning on about on arXiv are typically addressing the mitigations one by one, and it's not clear that a program would be vulnerable in practice if they were all used together. Another interpretation would be that the fact that these bypasses exist at all implies they're duct tape patches, and the full solution lies somewhere else (like not using C). If you believe that the patches mesh together to create a beautiful complete fix, that should be something you substantiate by explaining how they complement each other, not by noting that failure "seems more difficult" and asking for others to come up with a story for how they break down.

Also, I haven’t asked anyone to “prove” anything here. I regard this as an important point. John’s not trying to “prove” that these strategies are individually nonfunctional, and I’m not asking him to “prove” that they’re functional in combination. This is an exploratory sequence, and what I’m requesting is an exploratory perspective (one which you have provided, and thank you for that).

Sure, modified my comment.

I’d be on board with at least a very long delay on the AI safety equivalent of “not writing in C,” which would be “not building AGI.”

Unfortunately, that seems to not be a serious option on the table. Even if it were, we could still hope for duct tape patches/Swiss cheese security layers to mitigate, slow, or reduce the chance of an AI security failure. It seems to me that the possibility of a reasonably robust AI safety combination solution is something we’d want to encourage. If not, why not?

The equivalent of not using C for AGI development is not using machine learning techniques. You are right that that seems to be what DM et al. are gearing up to do, and I agree that developing such compiler guardrails might be better than nothing and that we should encourage people to come up with more of them when they can be stacked neatly. I'm not that pessimistic. These compiler-level security features do help prevent bugs. They're just not generally sufficient when stacked against overwhelming optimization pressure and large attack surfaces.

My probably wrong layman's read of the AGI safety field is that people will still need to either come up with a "new abstraction", or start cataloging the situations in which they will actually be faced with overwhelming optimization pressure, and avoid those situations desperately, instead of trying to do the DEP+ASLR+Stack Canaries thing. AGI safety is not, actually, a security problem. You get to build your dragon and your task is to "box" the dragon you choose. Remove the parts where you let the dragon think about how to fuck up its training process and you remove the places where it can design these exploits.

I would presume that the AI would know that humans are likely to try to resist a takeover attempt, and to have various safeguards against it. It might be smart enough to be able to overcome any human response, but that seems to only work if it actually puts that intelligence to work by thinking about what (if anything) it needs to do to counteract the human response. 

More generally, humans are such a major influence on the world as well as a source of potential resources, that it would seem really odd for any superintelligence to naturally arrive on a world-takeover plan without at any point happening to consider how this will affect humanity and whether that suggests any changes to the plan. 

I would presume that the AI would know that humans are likely to try to resist a takeover attempt, and to have various safeguards against it.

That assumes humans are, in fact, likely to meaningfully resist a takeover attempt. My guess is that humans are not likely to meaningfully resist a takeover attempt, and the AI will (implicitly) know that.

I mean, if the AI tries to change who's at the top of society's status hierarchy (e.g. the President), then sure, the humans will freak out. But what does an AI care about the status hierarchy? It's not like being at the top of the status hierarchy conveys much real power. It's like your "total horse takeover" thing; what the AI actually wants is to be able to control outcomes at a relatively low level. Humans, by and large, don't even bother to track all those low-level outcomes, they mostly pay attention to purely symbolic status stuff.

Now, it is still true that humans are a major influence on the world and source of resources. An AI will very plausibly want to work with the humans, use them in various ways. But that doesn't need to parse to human social instincts as a "takeover".

If you assume that the overseer could robustly detect that the AI wants to kill humans, they could probably just as robustly detect that it is not aiming to operate under the constraint of keeping humans alive, happy, etc. while optimizing for whatever it is trying to optimize.

Seems like interpretability that could do this would indeed address OP's stated concerns. One problem, however, is that it might be genuinely optimizing for keeping humans alive & happy under some circumstances, and then change goals in response to some stimulus or after it notices the overseer is turned off, especially if it's trying to pass through some highly monitored testing phase.

Edit: It occurs to me that this, in turn, assumes it doesn't have the foresight to think "I'm going to fuck these people over later, better modify my code/alert the overseers" during the testing phase if it's "genuinely optimizing" for humans' long-term flourishing... which seems possible under some scenarios, but is also a more complicated mental trick.

This might occur in the kind of misalignment where the AI is genuinely optimizing for human values just because it is too dumb to know that this is not the best way to realize its learned objective. If extracting that objective would be harder than reading its genuine instrumental intentions, then the moment it discovers a better way may look to the overseer like a sudden change of values.

The other kind of misalignment I was thinking about is if it's able to perform a Kira (Death Note) or Carissa (Planecrash) style flip during training, where it modifies itself to have the "correct" thoughts/instrumental goals in anticipation of inspection, but buries an if(time() > ...){} hatch inside itself which it & its overseers won't notice until it's too late.

This has far too many assumptions, and the final claims are too strongly stated. One of the important assumptions is that the AI can trivially destroy humanity without trying. Comparing humanity to a specific kind of flower is ridiculous; even a strongly superhuman AI would be more like humans vs. rats than humans vs. a flower. Could humanity eliminate all rats in the world if we wanted to? Maybe, but we wouldn't accomplish much of anything else.

Another assumption is that whatever is overseeing the AI is vastly stupider than the AI, even with all the tools made for the purpose. If you can make a superhuman AI, you can make a superhuman AI-plus-human system too (which is an easier task).

A lot of the reason humans managed to conquer the natural world is that we are essentially super-cooperative, in the sense that humans can form groups of 10-1000 or more without the social system totally breaking down. We aren't much more intelligent than other animals; we are way more cooperative in groups than other animals.