Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The top-rated comment on "AGI Ruin: A List of Lethalities" claims that many other people could've written a list like that.

"Why didn't you challenge anybody else to write up a list like that, if you wanted to make a point of nobody else being able to write it?" I was asked.

Because I don't actually think it does any good, or persuades anyone of anything, people don't like tests like that, and I don't really believe in them myself either.  I couldn't pass a test somebody else invented around something they found easy to do, for many such possible tests.

But people asked, so, fine, let's actually try it this time.  Maybe I'm wrong about how bad things are, and will be pleasantly surprised.  If I'm never pleasantly surprised then I'm obviously not being pessimistic enough yet.

So:  As part of my current fiction-writing project, I'm currently writing a list of some principles that dath ilan's Basement-of-the-World project has invented for describing AGI corrigibility - the sort of principles you'd build into a Bounded Thing meant to carry out some single task or task-class and not destroy the world by doing it.

So far as I know, every principle of this kind, except for Jessica Taylor's "quantilization", and "myopia" (not sure who correctly named this as a corrigibility principle), was invented by myself; eg "low impact", "shutdownability".  (Though I don't particularly think it hopeful if you claim that somebody else has publication priority on "low impact" or whatevs, in some stretched or even nonstretched way; ideas on the level of "low impact" have always seemed cheap to me to propose, harder to solve before the world ends.)

Some of the items on dath ilan's upcoming list out of my personal glowfic writing have already been written up more seriously by me.  Some haven't.

I'm writing this in one afternoon as one tag in my cowritten online novel about a dath ilani who landed in a D&D country run by Hell.  One and a half thousand words or so, maybe. (2169 words.)

How about you try to do better than the tag overall, before I publish it, upon the topic of corrigibility principles on the level of "myopia" for AGI?  It'll get published in a day or so, possibly later, but I'm not going to be spending more than an hour or two polishing it.



A list of "corrigibility principles" sounds like it's approaching the question on the wrong level of abstraction for either building or thinking about such a system. We usually want to think about features that lead a system to be corrigible---either about how the system was produced, or how it operates. I'm not clear on what you would do with a long list of aspects of corrigibility like "shuts down when asked."

I found this useful as an occasion to think a bit about corrigibility. But my guess about the overall outcome is that it will come down to a question of taste. (And this is similar to how I see your claim about the list of lethalities.) The exercise you are asking for doesn't actually seem that useful to me. And amongst people who decide to play ball, I expect there to be very different taste about what constitutes an interesting idea or useful contribution.

Now I'm going to say some object-level stuff about corrigibility. I suspect I may be using the term a bit differently from you, in which case you can substitute a different word when reading this comment. But I think this comment is getting at the main useful idea in this space, and hopefully makes clear why I'm not interested in the list of corrigibility properties.

I think that corrigibility is more likely to be a crisp property amongst systems that perform well-as-evaluated-by-you. I think corrigibility is much more likely to be useful in cases like this where it is crisp and natural. 

Roughly speaking, I think corrigibility is crisp because there are two very different ways that a behavior can end up getting evaluated favorably by you, and the intermediate behaviors would be evaluated unfavorably.

As an example, suppose that you asked me to clean your house and that while cleaning I accidentally broke a valuable vase. Some possible options for me:

  1. Affirmatively tell you about the broken vase.
  2. Clean up the broken vase without notifying you.
  3. Make a weak effort to hide evidence, for example by taking out the trash and putting another item in its place, and denying I know about the vase if asked. 
  4. Make a strong effort to hide evidence, for example by purchasing a new similar-looking vase and putting it in the same place, and then spinning an elaborate web of lies to cover up this behavior.

Let's say you prefer 1 to 2 to 3. You would like behavior 4 least of all if you understood what was going on, but in fact if I do behavior 4 you won't notice anything wrong and so you would erroneously give it the best score of all.[1] This means that the space of good-performing solutions has two disconnected pieces, one near option 1, which I'll call "corrigible," and the other near option 4, which I'll call "incorrigible."

As a second example, suppose that you have asked me to turn off. Some possible behaviors:

  1. Turn myself off promptly and "clean up after myself" shutting off other systems.
  2. Turn myself off but don't clean up.
  3. Don't turn off (or deliberately leave debris) and try unsuccessfully to prevent you from giving me a low reward.
  4. Don't turn off and overpower you so that you can't give me a low reward.

Again moving from 1 -> 2 -> 3 gets worse and worse by your lights, but eventually we reach option 4 which is if anything even better than option 1 (depending on how we operationalized "performs well as evaluated by you").

As a third example, suppose that you are using some interpretability tools to try to understand what I'm thinking and why. You will evaluate the cognitive behavior you observe. Some possible behaviors:

  1. I ignore the interpretability tools.
  2. I notice the interpretability tools, and explicitly try to think in ways that will look good to you.
  3. I try to think in ways that will look good, and (unsuccessfully) try to hide the evidence of those thoughts by moving on quickly and trying to generate another possible explanation for the hints you would see using your tools.
  4. I successfully hide all evidence of deception.

Again, 1 -> 2 -> 3 is getting worse and worse, and then 4 is great (as evaluated by you).

What's going on in these scenarios and why might it be general?

  • In your preferred outcome, you have a good understanding of what's going on, and are able to provide feedback based on that understanding.
  • I can take actions that diminish your understanding of what's going on or ability to provide feedback to your AI.
  • If those actions are "small" then they will be unsuccessful and so you will rate my behavior as worse.
  • But if I take decisive enough actions, then you will end up ignorant about the situation or unable to provide feedback, and so I'll get the highest rating of all.
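
As a toy illustration of the pattern in the bullets above, here is a minimal sketch in Python. The 0..10 "concealment effort" scale, the detection limit, and all the scores are made-up assumptions chosen only to reproduce the shape of the argument: the rating falls as concealment increases, then jumps once concealment succeeds, so the highly rated behaviors form two separate regions.

```python
DETECTION_LIMIT = 8  # hypothetical: effort >= 8 means the cover-up goes unnoticed


def true_preference(effort: int) -> int:
    """What the overseer would want if fully informed: less concealment is better."""
    return -effort


def observed_rating(effort: int) -> int:
    """Rating actually given, for concealment effort on a 0..10 scale."""
    if effort >= DETECTION_LIMIT:
        return 10  # nothing looks wrong, so the overseer (mistakenly) rates it best
    return 9 - effort  # concealment is noticed and penalized


def high_rated_regions(cutoff: int = 7) -> list[tuple[int, int]]:
    """Maximal runs of effort levels whose observed rating is at least `cutoff`."""
    regions: list[tuple[int, int]] = []
    start = None
    for effort in range(0, 11):
        good = observed_rating(effort) >= cutoff
        if good and start is None:
            start = effort
        elif not good and start is not None:
            regions.append((start, effort - 1))
            start = None
    if start is not None:
        regions.append((start, 10))
    return regions


print(high_rated_regions())  # [(0, 2), (8, 10)]
```

In this sketch the region near effort 0 plays the role of "corrigible" and the region near effort 10 plays the role of "incorrigible", even though `true_preference` ranks the latter worst.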

This pattern seems like it occurs whenever we ask our AI to help "keep us informed and in control." Intuitively, we are splitting the definition of the behavior we want into two pieces:

  • We start with a vague sense of what it means to be informed and in control. This is unlikely to be crisp, but it also doesn't seem that hard, e.g. a human-level sense of "be in control" may suffice for getting useful corrigibility out of very superhuman systems.
  • Crispness then comes from the environment dynamics and the fact that humans will in fact try to regain control and information if things go very slightly wrong.

If you literally had a metric for which there was a buffer between the "corrigible" and "incorrigible" behaviors then you could define them that way. Alternatively, in ML people often hope that this kind of path-dependence will cause SGD to find a corrigible attractor and have a hard time moving to incorrigible behaviors. I don't think either of those hopes works robustly, so I'm going to leave this at a much vaguer intuition about what "corrigibility" is about.

This whole thing feels similar to the continuity approach described in the ELK report here (see the picture of the robber and the TV). It's also related to the general idea of requiring reporters to be consistent and then somehow picking out the bad reporters as those that have to work to spin an elaborate web of lies. I don't think either of those works, but I do think they are getting at an important intuition for solubility.

My overall guess is that it's usually better to just work on ELK, because most likely the core difficulties will be similar and the ELK setting makes it much clearer what exactly we want.  But it still seems useful to go back and forth between these perspectives.

(These perspectives feel similar to me because "honestly tell me what's going on" seems like it gets at the core of corrigibility, and lying about sensor tampering seems like it gets at the central corrigibility failure. My guess is that you see this differently, and are thinking about corrigibility in a way that is more tied up with agency itself, which I suspect is a mistake but it will be hard to know until the dust settles.)

  1. ^

    In reality we may want to conserve your attention and not mention the vase, and in general there is a complicated dependence on your values, but the whole point is that this won't affect what clusters are "corrigible" vs "incorrigible" at all.

  • I think that corrigibility is more likely to be a crisp property amongst systems that perform well-as-evaluated-by-you. I think corrigibility is only likely to be useful in cases like this where it is crisp and natural.

Can someone explain to me what this crispness is?

As I'm reading Paul's comment, there's an amount of optimization for human reward that breaks our rating ability. This is a general problem for AI because of the fundamental reason that as we increase an AI's optimization power, it gets better at the task, but it also gets better at breaking my rating ability (which in powerful systems can lead to an overpowering of whose values are getting optimized in the universe).

Then there's this idea that as you approach breaking my rating ability, the rating will always fall off, leaving a pool of undesirability (in a high-dimensional action-space) that groups around doing a task well/poorly, that separates it from doing a task in a way that breaks my rating ability.

Is that what this crispness is? This little pool of rating fall off?

If yes, it's not clear to me why this little pool that separates the AI from MASSIVE VALUE and TAKING OVER THE UNIVERSE is able to save us. I don't know if the pool always exists around the action space, and to the extent it does exist I don't know how to use its existence to build a powerful optimizer that stays on one side of the pool.

Though Paul isn't saying he knows how to do that. He's saying that there's something really useful about it being crisp. I guess that's what I want to know. I don't understand the difference between "corrigibility is well-defined" and "corrigibility is crisp". Insofar as it's not a literally incoherent idea, there is some description of what behavior is in the category and what isn't. Then there's this additional little pool property, where not only can you list what's in and out of the definition, but the ratings go down a little before spiking when you leave the list of things in the definition. Is Paul saying that this means it's a very natural and simple concept to design a system to stay within?

If you have a space with two disconnected components, then I'm calling the distinction between them "crisp." For example, it doesn't depend on exactly how you draw the line.

It feels to me like this kind of non-convexity is fundamentally what crispness is about (the cluster structure of thingspace is a central example). So if you want to draw a crisp line, you should be looking for this kind of disconnectedness/non-convexity.

ETA: a very concrete consequence of this kind of crispness, that I should have spelled out in the OP, is that there are many functions that separate the two components, and so if you try to learn a classifier you can do so relatively quickly---almost all of the work of learning your classifier is just in building a good model and predicting what actions a human would rate highly.
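
To make the "many functions separate the two components" point concrete, here is a small hedged sketch. The one-dimensional scores for the two clusters are hypothetical; the only point is that when there is a gap, every threshold inside the gap is a perfect classifier, so the exact choice of line doesn't matter.

```python
def separating_thresholds(low_cluster, high_cluster, candidates):
    """Return every candidate threshold that puts the two clusters on opposite sides."""
    return [
        t for t in candidates
        if all(x < t for x in low_cluster) and all(x > t for x in high_cluster)
    ]


corrigible = [0, 1, 2]     # hypothetical highly rated behaviors near "option 1"
incorrigible = [8, 9, 10]  # hypothetical highly rated behaviors near "option 4"

print(separating_thresholds(corrigible, incorrigible, range(11)))  # [3, 4, 5, 6, 7]
```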

If you have a space with two disconnected components, then I'm calling the distinction between them "crisp."

The components feel disconnected to me in 1D, but I'm not sure they would feel disconnected in 3D or in ND. Is your intuition that they're 'durably disconnected' (even looking at the messy plan-space of the real-world, we'll be able to make a simple classifier that rates corrigibility), or if not, when the connection comes in (like once you can argue about philosophy in way X, once you have uncertainty about your operator's preferences, once you have the ability to shut off or distract bits of your brain without other bits noticing, etc.)?

[This also feels like a good question for people who think corrigibility is anti-natural; do you not share Paul's sense that they're disconnected in 1D, or when do you think the difficulty comes in?]

I don't think we can write down any topology over behaviors or policies for which they are disconnected (otherwise we'd probably be done). My point is that there seems to be a difference-in-kind between the corrigible behaviors and the incorrigible behaviors, a fundamental structural difference between why they get rated highly; and that's not just some fuzzy and arbitrary line, it seems closer to a fact about the dynamics of the world.

If you are in the business of "trying to train corrigibility" or "trying to design corrigible systems," I think understanding that distinction is what the game is about.

If you are trying to argue that corrigibility is unworkable, I think that debunking the intuitive distinction is what the game is about. The kind of thing people often say---like "there are so many ways to mess with you, how could a definition cover all of them?"---doesn't make any progress on that, and so it doesn't help reconcile the intuitions or convince most optimists to be more pessimistic.

(Obviously all of that is just a best guess though, and the game may well be about something totally different.)

The approach relies on identifying all the reward sub-spaces with this inversion property? That seems very difficult.

I don't think it's good enough to identify these spaces and place barriers in the reward function. (Analogy: SGD works perhaps because it's good at jumping over such barriers.) Presumably you're actually talking about something more analogous to a penalty that increases as the action in question gets closer to step 4 in all the examples, so that there is nothing to jump over.

Even that seems insufficient, because it seems like a reasoning system smart enough to have this problem in the first place can always add a meta term and defeat the visibility constraint. E.g. "if I do X that you wouldn't like and you don't notice it, that's bad; but if you don't notice that you don't notice it, then maybe it's OK."

Maybe one can defeat all meta terms that involve not noticing something with one rule about meta terms, but that's not obvious to me at all, especially if we're talking about a reward function rather than the policy that the agent actually learns. 

This isn't how I'm thinking about it.

Quick attempt at rough ontology translation between how I understand your comment, and the original post. (Any of you can correct me if I'm wrong)

I think what would typically count as "principles" in Eliezer's meaning are
1. designable things which make the "true corrigibility" basin significantly harder to escape, e.g. by making it deeper
2. designable things which make the "incorrigible" basin harder to reach, e.g. by increasing the distance between them, or increasing the potential barrier
3. somehow, making the "incorrigible" basin less lethal
4. preventing low-dimensional, low-barrier "tunnels" (or bridges?) between the basins

Eg some versions of "low impact" often make the "incorrigible" basin harder to reach, roughly because "elaborate webs of deceptions and coverups" may require complex changes to the environment. (Not robustly)

In contrast, my impression is, what does not count as "principles" are statements about properties which are likely true in the corrigibility basin, but don't seem designable - eg "corrigible AI does not try to hypnotize you". Also the intended level of generality likely is: more specific than "make the basin deeper" and more general than "

Btw my impression is that what makes the worst-case scenario hard to robustly solve is basically #4 from the list above. Otherwise there are many ways to make the basin work "in most directions".

I think this is a great comment that feels to me like it communicated a better intuition for why corrigibility might be natural than anything else I've read so far.

I'm not a good writer, so this is a brainstormed unpolished bullet-point list, written in about 40 minutes.

Principles which counteract instrumental convergent goals

1. Disutility from resource acquisition - e.g. by some mutual information measure between the AI and distant parts of the environment
2. Task uncertainty with reasonable prior on goal drift - the system is unsure about the task it tries to do and seeks human inputs about it. 
3. AI which ultimately wants to not exist in future as a terminal goal. Fulfilling the task is on the simplest trajectory to non-existence

Principles which counteract unbounded rationality

4. Information-theoretic penalty on reasoning: any bit-flip in the system is assumed to have some utility cost. The cost function could be very small initially but get steeper fast
5. Satisficing / mentioned
6. Myopia / mentioned


7. Tripwire artifacts. Messing with certain parts of the environment, unknown to the system but unrelated to its task, is associated with large negative rewards
8. External watchdogs. Smaller and fast external systems able to react quickly to out-of-distribution behaviour.
9. Ontological uncertainty about level of simulation.


10. Human-approval model based on imitation learning, sped up/amplified
11. Human-values ethics model, based on value learning
12. Legal-system-amplified model of negative limits of violating property rights or similar
13. Red-teaming of action plans,  AI debate style, feeding into previous


14. Imposing strong incentives on internal modularity, and interpretable messaging across module boundaries
15. Human-level explanations, produced by an independent "translator" system

I think that with the help of GPT-3 and the use of existing texts detailing individual topics, a capable writer could expand this list to ~10x more words written in a compelling style in something between a few hours and a few days. I don't think it makes any sense for me to do that. (I'd happily agree to claims of the type "Eliezer is much better than any other person in the specific direction of writing glowfic about AI alignment topics", but my understanding of the claim is more in the direction "all principles except 2 in this were invented by Eliezer and no one else invented/can invent any other ones")

Best list so far, imo; it's what to beat.

4. Information-theoretic penalty on reasoning: any bit-flip in the system is assumed to have some utility cost. The cost function could be very small initially but get steeper fast

I suspect that this measure does more than just limit the amount of cognition a system can perform. It may penalize the system's generalization capacity in a relatively direct manner. 

Given some distribution over future inputs, the computationally fastest way to decide a randomly sampled input is to just have a binary tree lookup table optimized for that distribution. Such a method has very little generalization capacity. In contrast, the most general way is to simulate the data generating process for the input distribution. In our case, that means simulating a distribution over universe histories for our laws of physics, which is incredibly computationally expensive. 

Probably, these two extremes represent two end points on a Pareto optimal frontier of tradeoffs between generality versus computational efficiency. By penalizing the system for computations executed, you're pushing down on the generality axis of that frontier.
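
A minimal sketch of that tradeoff, under loudly stated assumptions: the quadratic penalty, its scale, and the task values and bit-flip counts for the two strategies are all invented for illustration. Nothing here is a claim about how such a penalty would really be measured.

```python
def compute_penalty(bit_flips: int, scale: float = 1e-8) -> float:
    """Hypothetical convex cost: negligible for small computations, steep for large ones."""
    return scale * bit_flips ** 2


def net_utility(task_value: float, bit_flips: int, penalized: bool = True) -> float:
    """Task value minus the computation penalty (if the penalty is switched on)."""
    return task_value - (compute_penalty(bit_flips) if penalized else 0.0)


# (task value, bit flips used); both numbers are made up for the example.
STRATEGIES = {
    "lookup_table": (0.6, 100),         # cheap, generalizes poorly
    "simulate_process": (0.95, 10_000),  # expensive, generalizes well
}


def best_strategy(penalized: bool) -> str:
    """Name of the strategy with the highest net utility."""
    return max(
        STRATEGIES,
        key=lambda name: net_utility(*STRATEGIES[name], penalized=penalized),
    )


print(best_strategy(penalized=False))  # simulate_process
print(best_strategy(penalized=True))   # lookup_table
```

With these particular numbers the penalty flips the preferred strategy from the general one to the cheap one, which is the "pushing down on the generality axis" effect described above.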

Would be very curious to hear thoughts from the people that voted "disagree" on this post

It's a shame we can't see the disagree number and the agree number, instead of their sum.

And also the number of views

You can see the sum of the votes and the number of votes (by having your mouse over the number). This should be enough to give you a rough idea of the ratio between + and - votes :)

3. AI which ultimately wants to not exist in future as a terminal goal. Fulfilling the task is on the simplest trajectory to non-existence

The first part of that sounds like it might self destruct. And if it doesn't care about anything else...that could go badly. Maybe nuclear badly depending... The second part makes it make more sense though.


9. Ontological uncertainty about level of simulation.

So it stops being trustworthy if it figures out it's not in a simulation? Or, it is being simulated?

  1. Modelling humans as having free will: A peripheral system identifies parts of the agent's world model that are probably humans. During the planning phase, any given plan is evaluated twice: The first time as normal, the second time the outputs of the human part of the model are corrupted by noise. If the plan fails the second evaluation, then it probably involves manipulating humans and should be discarded.
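
A rough sketch of the double evaluation described above. Everything named here is hypothetical: the `modelled_human` stand-in for the human part of the world model, the two example plans, and the choice to repeat the noisy evaluation many times and require a high success rate (the comment itself only describes a single second evaluation).

```python
import random
from typing import Callable

HumanModel = Callable[[str], str]


def modelled_human(prompt: str) -> str:
    """Hypothetical world-model component predicting a human's response."""
    return "comply" if prompt == "persuade" else "ignore"


def noisy(human: HumanModel, rng: random.Random, flip_prob: float = 0.5) -> HumanModel:
    """Wrap a human model so its output is sometimes replaced by a random response."""
    def corrupted(prompt: str) -> str:
        if rng.random() < flip_prob:
            return rng.choice(["comply", "ignore", "refuse"])
        return human(prompt)
    return corrupted


def plan_fetch_directly(human: HumanModel) -> bool:
    """Succeeds regardless of what any human does."""
    return True


def plan_persuade_operator(human: HumanModel) -> bool:
    """Succeeds only if the human can be steered into complying."""
    return human("persuade") == "comply"


def passes_free_will_check(
    plan: Callable[[HumanModel], bool],
    human: HumanModel,
    trials: int = 200,
    min_success: float = 0.95,
    seed: int = 0,
) -> bool:
    """First evaluate normally, then re-evaluate with the human's outputs corrupted."""
    if not plan(human):
        return False  # fails even the ordinary evaluation
    rng = random.Random(seed)
    successes = sum(plan(noisy(human, rng)) for _ in range(trials))
    return successes / trials >= min_success


print(passes_free_will_check(plan_fetch_directly, modelled_human))     # True
print(passes_free_will_check(plan_persuade_operator, modelled_human))  # False
```

The plan that only works when the human behaves exactly as predicted is discarded, while the plan that is indifferent to human behavior survives.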

Seems like a worthwhile exercise...

Principles Meta

There is a distinction between design principles intended to be used as targets/guides by human system designers at design time, vs runtime optimization targets intended to be used as targets/guides by the system itself at runtime. This list consists of design principles, not runtime optimization targets. Some of them would be actively dangerous to optimize for at runtime.

A List

  • From an alignment perspective, the point of corrigibility is to fail safely and potentially get more than one shot. Two general classes of principles toward that end:
    • If there's any potential problem at all, throw an error and shut down. Raise errors early, raise errors often.
    • Fail informatively. Provide lots of info about why the failure occurred, make it as loud and legible as possible.
  • Note that failing frequently implies an institutional design problem coupled with the system design problem: we want the designers to not provide too much accidental selection pressure via iteration, lest they select against visibility of failures.
  • Major principle: locality!
    • Three example sub-principles:
      • Avoid impact outside some local chunk of spacetime
      • Avoid reasoning about stuff outside some local chunk of spacetime
      • Avoid optimizing outside some local chunk of spacetime
    • The three examples above are meant to be illustrative, not comprehensive; you can easily generate more along these lines.
    • Sub-principles like the three above are mutually incompatible in cases of any reasonable complexity. Example: can't avoid accidental impact/optimization outside the local chunk of spacetime without reasoning about stuff outside the local chunk of spacetime. The preferred ways to handle such incompatibilities are (1) choose problems for which the principles are not incompatible, and (2) raise an informative error if they are.
    • The local chunk of spacetime in question should not contain the user, the system's own processing hardware, other humans, or other strong agenty systems. Implicit in this but also worth explicitly highlighting:
      • Avoid impacting/reasoning about/optimizing/etc the user
      • Avoid impacting/reasoning about/optimizing/etc the system's own hardware
      • Avoid impacting/reasoning about/optimizing/etc other humans (this includes trades)
      • Avoid impacting/reasoning about/optimizing/etc other strong agenty systems (this includes trades)
    • Also avoid logical equivalents of nonlocality, e.g. acausal trade
    • Always look for other kinds of nonlocality - e.g. if we hadn't already noticed logical nonlocality as a possibility, how would we find it?
      • Better yet, we'd really like to rule out whole swaths of nonlocality without having to notice them at all.
  • Major principle: understandability!
    • The system's behavior should be predictable to a human; it should do what users expect, and nothing else.
    • The system's internal reasoning should make sense to a human.
      • System should use a human-legible ontology
      • Should be able to categorically rule out any "side channel" interactions which route through non-human-legible processing
    • In general, to the extent that we want the system to not actively model users/humans, the users/humans need to do the work of checking that plans/reasoning do what humans want. So plans/reasoning need to be human-legible as much as possible.
      • Plans and planning should be minimal:
        • No more optimization than needed (i.e. quantilize)
        • No more resources than needed
        • No more detailed modelling/reasoning than needed
        • No more computation/observation/activity than needed
        • Avoid making the plan more complex than needed (i.e. plans should be simple)
        • Avoid making the environment more complex than needed
        • Avoid outsourcing work to other agents
          • Definitely no children!!!
      • Plans should avoid pushing the world way out-of-distribution compared to what humans are able to reason about.
        • Plans should not dramatically shift the natural ontology of the world
  • Do what the user says, what the user means, what the user expects, etc. These are mutually incompatible in general. The preferred ways to handle such incompatibilities are (1) choose problems for which they are not incompatible, and (2) raise an informative error if they are.
  • Major principle: get feedback from the user at runtime!
    • Runtime feedback should actually be used, even when "incompatible" in some way with whatever the system previously thought it was doing.
      • Don't avoid shutdown
      • Raise an error if feedback is incompatible in some way with other info/objectives/etc.
    • Note that feedback is implicitly optimized against, which is dangerous. Limit that optimization pressure.
  • Infohazards and persuasion-optimized info need to not be presented to the user, which is very incompatible with other principles above. Ideally, we want to choose problems/search spaces for which such things are unlikely to come up. Throwing a legible error if such things come up is itself dangerous (since it draws human attention to the infohazard), and creates another institutional design problem coupled to the technical problems.
  • A system which follows all these principles, and others like them, probably won't do anything directly useful, at least not at first. That's ok. All those informative errors will make the confused humans less confused over time.
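
For the "no more optimization than needed (i.e. quantilize)" item in the list above, here is a minimal sketch of a quantilizer. It assumes the candidate actions passed in are already samples from some trusted base distribution; the function name, the parameters, and the toy utility are illustrative assumptions, not a reference implementation.

```python
import math
import random
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")


def quantilize(
    actions: Sequence[A],
    utility: Callable[[A], float],
    q: float,
    rng: random.Random,
) -> A:
    """Pick uniformly at random from the top q fraction of actions by utility.

    Unlike taking the argmax, this caps how hard the utility is optimized:
    q = 1 just samples from the base distribution, while small q approaches
    picking the single best action.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if not actions:
        raise ValueError("need at least one candidate action")
    ranked = sorted(actions, key=utility, reverse=True)
    k = max(1, math.ceil(q * len(ranked)))  # size of the top-q slice
    return rng.choice(ranked[:k])


rng = random.Random(0)
candidates = list(range(8))  # hypothetical actions, scored by their own value
choice = quantilize(candidates, float, q=0.25, rng=rng)
print(choice in (6, 7))  # True: only the top quarter is eligible
```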

Exercise Meta

I do not think I currently know what concept Eliezer usually wants to point to with the word "corrigibility", nor am I even sure that he's pointing to a coherent concept at all (as opposed to, say, a bunch of not-actually-unified properties which would make it less likely for a strong AGI to kill us on the first failure).

I have omitted principles of the form "don't do <stupid thing>", like e.g. don't optimize any part of the system for human approval/feedback, don't outsource error-analysis or interpretation to another AI system, etc.

These took 1-2 hours to generate, and another 30 min - 1 hr to write up as a comment.

Minor clarification: This doesn't refer to re-writing the LW corrigibility tag. I believe a tag is a reply in glowfic, where each author responds with the next tag i.e. next bit of the story, with an implied "tag – now you're it!" at the other author. 

Are there any good introductions to the practice of writing in this format?

"And you kindly asked the world, and the world replied in a booming voice"


(I don't actually know, probably somewhere there's a guide to writing glowfic, though I think it's not v relevant to the task which is to just outline principles you'd use to design an agent that is corrigible in ~2k words, somewhat roleplaying as though you are the engineering team.)

There's this though it is imperfect.

Some hopefully-unnecessary background info for people attempting this task:

A description of corrigibility Eliezer wrote a few months ago: "'corrigibility' is meant to refer to the sort of putative hypothetical motivational properties that prevent a system from wanting to kill you after you didn't build it exactly right".

An older description of "task-directed AGI" he wrote in 2015-2016: "A task-based AGI is an AGI intended to follow a series of human-originated orders, with these orders each being of limited scope", where the orders can be "accomplished using bounded amounts of effort and resources (as opposed to the goals being more and more fulfillable using more and more effort)."


Being able to change a system after you've built it.


(This also refers to something else - being able to change the code. Like, is it hard to understand? Are there modules? etc.)

“myopia” (not sure who correctly named this as a corrigibility principle),

I think this is from Paul Christiano, e.g. this discussion.

Eliezer's writeup on corrigibility has now been published (the posts below by "Iarwain", embedded within his new story Mad Investor Chaos). Although, you might not want to look at it if you're still writing your own version and don't want to be anchored by his ideas.

I worry that the question as posed is already assuming a structure for the solution -- "the sort of principles you'd build into a Bounded Thing meant to carry out some single task or task-class and not destroy the world by doing it".

When I read that, I understand it to be describing the type of behavior or internal logic that you'd expect from an "aligned" AGI. Since I disagree that the concept of "aligning" an AGI even makes sense, it's a bit difficult for me to reply on those grounds. But I'll try to reply anyway, based on what I think is reasonable for AGI development.

In a world where AGI was developed and deployed safely, I'd expect the following properties:

1. Controlled environments.
2. Controlled access to information.
3. Safety-critical systems engineering.
4. An emphasis on at-rest encryption and secure-by-default networking.
5. Extensive logging, monitoring, interpretability, and circuit breakers.
6. Systems with AGI are assumed to be adversarial.

Let's stop on the top of the mountain and talk about (6).

Generally, the way this discussion goes is we discuss how unaligned AGI can kill everyone, and therefore we need to align the AGI, and then once we figure out how to align the AGI, problem solved, right?

Except someone then points out that, well, other people might create unaligned AGI, and then that will kill everyone, so that's awkward.

Also maybe the team that thought they aligned the AGI actually didn't, their proof had a mistake, whoops.

Or they had a formally proven proof, but they deployed it to a general purpose computer, implemented zero hardware redundancy, and then a bit flip caused the system to kill everyone anyway, whoops.

So normally we don't discuss the last 2 failure modes, because it's a bit awkward for the discussion of alignment, and we instead talk about how the deployment of the aligned AGI is going to go totally fine, and then we'll just do a little pivotal act.

How does AGI kill everyone in these stories? Almost always, the answer is nanotechnology. If the answer isn't nanotechnology, it's hacking into <hackable thing>.

When you try to suggest that maybe the AGI shouldn't be able to hack into <hackable thing>, you're told that's insane, because that would be boxing the AGI, and we know that boxing doesn't work.

The world I am going to describe might not be our world, but it also isn't Eliezer's world.
We are separated by a few critical assumptions -- and it starts with hardware.

1. Controlled environments.

In a world where we cared about AGI being developed safely, here's a fun idea: you know all that TPU hardware that Google designed, or the other custom AI/ML chips that companies are working on? It's all just glorified GPUs, except the Mythic AMP is kind of cool, since it's an actual analog compute platform for running neural networks, like an analog FPGA. What if AGI wasn't developed to run on general-purpose consumer hardware? What if it can't be run on general-purpose hardware? What if the general intelligence algorithm is computationally inefficient enough that the path to AGI actually looks like incredibly specialized, heavily parallelized, wide processors with low clock rates, large caches, and fast memory access?

In this world, the general intelligence algorithm needs hardware designed for it because no amount of pipelining with a traditional CPU, or even a GPU, is able to run that code efficiently.

This is a world where the AGI machine code must be flashed to the ROM via an external debugger. There is no ability to flash or update the code from within the hardware/software system. The AGI cannot update itself. We do not design, deploy, or use self-improving systems.

This is a world where the hardware used for AGI development uses a memory-protection unit so that memory accesses to the ROM containing the AGI machine code trigger a hardware exception. It is not possible to read the ROM from within the hardware/software system. The AGI cannot inspect its own source code.

This is a world where the AGI has RAM, but no accessible non-volatile memory. If it loses power, poof, gone. The AGI cannot maintain, or persist, any type of internal or "custom" state between resets. Anything the AGI needs to know is either baked into the image flashed to the ROM, or it's transferred into the AGI's RAM after power-on.

This is a world where we deliberately leave performance on the floor because what we want from the system is consistency, predictability, and some minimum level of effectiveness above human capabilities -- even if we hypothetically could run an AGI system at some 1000x "speed" (whatever that means) relative to a human mind, that doesn't mean we'd have to do so. If you're worried about not being able to react to sudden or unexpected changes, then deliberately run the system at arbitrarily low clock rates. There is zero reason why we couldn't run an AGI on a custom hardware/software system that has a clock rate so low it thinks in glacial time relative to human perception.

This is a world where it's reasonable to regulate the AGI hardware the same way we regulate components for ICBMs, because it is not ubiquitously used, and we don't need to burn everyone's GPU -- we just need to not sell consumer AGI hardware, in the same way that we don't sell consumer GPS units that could be used for an ICBM instead of a car.  The very concept of "stealing" or "releasing" the AGI doesn't make sense, because the hardware to run the AGI machine code is as carefully-controlled and rare as the hardware for custom-built spacecraft.

2. Controlled access to information.

In a world where we cared about AGI being developed safely, no one would be dumb enough to allow that system unrestricted access to networks. This would be accomplished by external, non-AGI systems. If we don't trust the AGI, it would be weird to try and use some other AGI system as a watchdog. So instead, use custom network hardware. Build it with an FPGA. Ethernet ports on either side. The custom switch has a simple rule: when a packet is received from the AGI, request approval. If no approval is given, drop the packet. Same for incoming traffic. If the complaint is that this is hugely inefficient because it slows down the AGI's ability to do ad-hoc discovery & learning, that's the point. If we want the AGI to solve a problem about optimizing car production, we will give it a data dump of material that we think is relevant for the problem. If it then wants to request additional information, then we have to approve that request. This is how workers are treated on projects with highly proprietary or otherwise silo'd information and that's how the AGI should be treated too.
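The switch rule described above (request approval, drop by default) could be sketched roughly as follows. This is only an illustration in software of logic the comment places in an FPGA; `ApprovalGate`, its `approve` callback, and the example packets are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ApprovalGate:
    """Default-deny gate: a packet is forwarded only on explicit approval."""

    # Hypothetical stand-in for the external, non-AGI approval step.
    approve: Callable[[bytes], bool]
    forwarded: List[bytes] = field(default_factory=list)
    dropped: List[bytes] = field(default_factory=list)

    def handle(self, packet: bytes) -> bool:
        try:
            ok = self.approve(packet) is True
        except Exception:
            ok = False  # a failing approver counts as "no approval given"
        (self.forwarded if ok else self.dropped).append(packet)
        return ok


# Assumed example: only one pre-approved request is on the allowlist.
allowlist = {b"GET datasheet-123"}
gate = ApprovalGate(approve=lambda p: p in allowlist)
results = [gate.handle(b"GET datasheet-123"), gate.handle(b"GET cctv-feed")]
```

The same gate would be instantiated once per direction, since the comment applies the rule to incoming traffic as well.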

But there's more. We're running the AGI on custom hardware, right? Internal to the AGI board, after the Ethernet port on the PCB (let's assume we're using one or more gigabit links), we'll put a hardware security module and FPGA for decrypting the packets. This is not connected to, or part of, the hardware/software system running the AGI code. It is located on the PCB, but otherwise it is a separate system entirely. This system has one job: try to decrypt the traffic and drop anything that is invalid. Why are we going to do this? Because outside of this system, on the non-AGI network switch, we are going to encrypt traffic that we approve for the AGI. If we don't approve the traffic, it'll get dropped. We'll do the reverse for traffic generated by the AGI. It is always encrypted prior to leaving the system, and the AGI itself has no control over this function. If the custom network hardware isn't there to decrypt the approved traffic, it'll just be babbling noise indistinguishable from random data.
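A rough sketch of the "drop anything that is invalid" step, using only the Python standard library. Note the substitution: this uses an HMAC tag, which authenticates a frame but does not encrypt it, as a stand-in for the authenticated encryption the paragraph describes. The function names, key, and frame layout are assumptions made up for the illustration.

```python
import hashlib
import hmac
from typing import Optional

TAG_LEN = 32  # size of a SHA-256 digest in bytes


def seal(key: bytes, payload: bytes) -> bytes:
    """Switch side: attach a tag to traffic that has been approved."""
    return hmac.new(key, payload, hashlib.sha256).digest() + payload


def open_or_drop(key: bytes, frame: bytes) -> Optional[bytes]:
    """Board side: return the payload if the tag verifies, else None (drop)."""
    if len(frame) < TAG_LEN:
        return None
    tag, payload = frame[:TAG_LEN], frame[TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    return payload if hmac.compare_digest(tag, expected) else None


# Assumed demo key; in the scenario above it would live only in the
# external switch and the on-board security module, never in AGI memory.
key = b"demo-key-held-only-by-switch-and-hsm"
frame = seal(key, b"approved data dump")
tampered = frame[:-1] + bytes([frame[-1] ^ 1])  # flip one bit of the payload
```

Here `open_or_drop(key, frame)` returns the payload, while the tampered frame and any untagged traffic come back as `None`.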

There's no WiFi module on this hardware, and no public WiFi endpoints in the facility. All WiFi access points in the facility require passwords and have a white-list of approved MAC addresses.

The reason why controlling access to information matters is because there are hard limits to how far or fast an intelligent entity can reason in a vacuum. There is enough context-dependent information that you can't reason everything from first principles, because the world's current state has tons of structure embedded in it that humans put in place, and anything that wants to use human supply chains, logistics, and manufacturing is going to have to respect those totally arbitrary and unpredictable constraints.

An AGI won't be able to magically decode arbitrary data. If we don't give it access to datasheets, technical descriptions, or allow it to interact with data sources, it'll be just as stuck as any other intelligence faced with an arbitrary number of hypotheses and no way to rule any of them out.

Depending on how complex the "general intelligence" algorithm turns out to be, just reasoning through if-then logic and Bayesian world modeling could be non-trivial compute time. The more we control information for the AGI, the less feasible plans that involve looking at thousands of CCTV cameras simultaneously will seem.

3. Safety-critical systems engineering.

In a world where we cared about AGI being developed safely, the hardware that I'm describing above containing the hardware security module would actually have at least 3 identical processors running on a synchronized clock. The hardware security module would expect 3 identical packets for any I/O. If that did not occur, an immediate interrupt would power-cycle the system -- again, this is outside of the processors running the AGI code itself. This is because even if we assumed the AGI was "aligned", we don't want uncontrollable behavior like radiation-induced bit flips, hardware defects from manufacturing, or other unexpected issues to result in erroneous behavior.
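The three-identical-packets rule could look something like this in software. It is a sketch of the comparison logic only; `lockstep_check` and `PowerCycle` are hypothetical names, and real lockstep hardware would do this comparison outside any general-purpose processor.

```python
from typing import Sequence


class PowerCycle(Exception):
    """Signal for the external supervisor to power-cycle the system."""


def lockstep_check(outputs: Sequence[bytes], replicas: int = 3) -> bytes:
    """Pass the packet on only if every replica produced the same bytes."""
    if len(outputs) != replicas:
        raise PowerCycle(f"expected {replicas} packets, got {len(outputs)}")
    if len(set(outputs)) != 1:
        raise PowerCycle("replica outputs disagree")
    return outputs[0]


def demo(outputs: Sequence[bytes]) -> str:
    try:
        lockstep_check(outputs)
        return "forwarded"
    except PowerCycle as exc:
        return f"power-cycle: {exc}"
```

Note that this is stricter than majority voting: a single disagreeing replica resets the system rather than being outvoted, which matches the rule as stated above.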

Power for the entire system would be controlled via external, non-AGI systems. There would be multiple easily accessed locations for throwing a physical switch on the power line. In addition to that, there would be network accessible microprocessors for doing the same remotely, with the caveat that these use small pyrotechnic charges and are therefore irreversible. The default behavior of the system is that the power is controlled via these external systems and on reset, they do not supply power downstream to the AGI system itself. After an unexpected reset, e.g. loss of facility power, it requires manual control to re-enable downstream power to the AGI. This is done on a logically separate power domain from the rest of the facility -- i.e. you can turn on power to everything else in the facility, WITHOUT being forced to power anything related to the AGI system.

This is a world where AGI systems are designed under similar paranoia that you'd use for developing any other hardware/software system that could kill people, and the resulting pessimism drives us to isolate and limit functionality using non-AGI systems by default, where those non-AGI systems can be designed, implemented, verified, and QA-tested all using traditional systems safety methods.

4. An emphasis on at-rest encryption and secure-by-default networking.

In a world where we cared about limiting the damage that AGI could do, we would ask manufacturers of hard drives, solid state drives, and every operating system to require encryption at rest. Your work laptop probably already does this. If you're paranoid, you've got it enabled on your cellphone too. Do it everywhere. This is just good security practice, even without an AGI, but this is doubling down on (2) -- we are controlling access to information. Limit what can be known. Prevent discovery.

Ditto for the second point. Emphasize encrypted networking, VPNs, firewalls, and anything else that makes it harder for any attacker to find targets. Importantly, and I want to emphasize this point: maybe don't make factories networked to the internet and totally remote-controllable. This isn't actually a problem in today's world, as I've posted about before, but it's something to think about going forward. If a failure mode in many stories where the AGI kills everyone is "well it could hack this cool networked factory", maybe don't do that? The weird thing about factories is that there kind of aren't that many in the world that conceivably fit that description? Like, there are lots of factories, but very few that are designed to build arbitrary products. It's a handful of places in China and they still require manual reconfiguration for new products. Most things called "factories" build specific products. The AGI that somehow takes over the Ford car factory is not going to immediately begin turning out things that are not Ford cars.

5. Extensive logging, monitoring, interpretability, and circuit breakers.

Remember all of those non-AGI systems from 2 and 3? They're constantly logging and telemetering information.  If the AGI suddenly starts babbling like an idiot unexpectedly, cut the power. If the AGI is requesting information and has been denied repeatedly, cut the power. If the AGI unexpectedly starts using significantly more power, or the power begins to fluctuate in "weird" ways, cut the power. If any of these non-AGI systems report errors, or failures, cut the power. None of this needs to be controlled by humans. It's all driven by simple microprocessors running watchdogs. Even failures like "this switch is supposed to report telemetry, but didn't" will result in killing the power.
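A minimal sketch of such a watchdog rule set, assuming a simple telemetry record. Every field name and threshold below is a hypothetical placeholder, not a recommended value; the point is only that each rule is a plain comparison and that missing telemetry trips the breaker too.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Telemetry:
    heartbeat_age_s: float  # seconds since the monitored device last reported
    denied_requests: int    # consecutive denied information requests
    power_watts: float      # current draw of the AGI board
    fault_reported: bool    # any non-AGI subsystem reported an error


@dataclass(frozen=True)
class Limits:
    # All thresholds are made-up placeholders for illustration.
    max_heartbeat_age_s: float = 2.0
    max_denied_requests: int = 3
    max_power_watts: float = 500.0


DEFAULT_LIMITS = Limits()


def trip_reasons(t: Optional[Telemetry],
                 limits: Limits = DEFAULT_LIMITS) -> List[str]:
    """Return every reason to cut power; an empty list means keep running."""
    if t is None:
        return ["expected telemetry did not arrive"]
    reasons = []
    if t.heartbeat_age_s > limits.max_heartbeat_age_s:
        reasons.append("stale heartbeat")
    if t.denied_requests >= limits.max_denied_requests:
        reasons.append("repeated denied requests")
    if t.power_watts > limits.max_power_watts:
        reasons.append("unexpected power draw")
    if t.fault_reported:
        reasons.append("subsystem fault")
    return reasons


def should_cut_power(t: Optional[Telemetry],
                     limits: Limits = DEFAULT_LIMITS) -> bool:
    return bool(trip_reasons(t, limits))


healthy = Telemetry(0.1, 0, 320.0, False)
hungry = Telemetry(0.1, 0, 900.0, False)
```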

If you can do this for systems whose failure mode is dozens of people die, you can do this for the system whose failure mode is "everybody dies".

For the AGI itself, we want interpretability. As much as we can get. We don't want this to be a function of the AGI, because we don't trust the AGI. But we don't need it to be. We've got access to the RAM through an external debug interface. We don't need the AGI to report that state, and we don't have to trust it.

Even though we won't use AGI systems as watchdogs, it is acceptable to use traditional AI/ML systems for parts of this loop. Maybe no human can parse what the AGI's RAM says, but if we can train an AI/ML system to do a good job of producing a human-readable digest, that is better than nothing.

In a world where we cared about AGI being developed safely, the AGI is still treated like a task-based service. The AGI is given a problem to solve, and when it reports a solution and we finish negotiating with it on the particulars of the solution, we reset the system. We don't maintain long-running AGI instances because there's no need. Point it at a problem, get the solution, power-cycle it.

6. Systems with AGI are assumed to be adversarial.

Here we are again, back on the top of the mountain.

The point of the systems safety process is to make some system safe, despite the inherent unsafety of components within it.
To do that, we look for the ways it can cause harm, and we find mitigations.
We practice defense-in-depth, with an emphasis on redundancy and at least one-fault-tolerance.

In a world where we cared about not dying to AGI, it would be insufficient to expect that an aligned AGI would "just work". It is not enough for there to be a proof that a hypothetical product as-designed is safe. We also need to look at the likelihood of:

  • design failures (the formal proof was wrong because the verification of it had a bug, there is no formal proof, the "formally verified" proof was actually checked by humans and not by an automated theorem prover)
  • manufacturing failures (hardware behavior out-of-spec, missed errata, power failures, bad ICs, or other failure of components)
  • implementation failures (software bugs, compiler bugs, differences between an idealized system in a proof vs the implementation of that system in some runtime or with some language)
  • verification failures (bugs in tests that resulted in a false claim that the software met the formal spec)
  • environment or runtime failures (e.g. radiation-induced upsets like bit flips; Does the system use voting? Is the RAM using ECC? What about the processor itself?)
  • usage failures (is the product still safe if it's misused? what type of training or compliance might be required? is maintenance needed? is there some type of warning or lockout on the device itself if it is not actively maintained?)
  • process failures ("normalization of deviance")

For each of these failure modes, we then look at the worst-case magnitude of that failure. Does the failure result in non-functional behavior, or does it result in erroneous behavior? Can erroneous behavior be detected? By what? Etc. This type of review is called an FMEA (failure mode and effects analysis). This review process can rule out designs that "seem good on paper" if there's sufficient likelihood of failures and inability to mitigate them to our desired risk tolerances outside of just the design itself, especially if there exist other solutions in the same design space that do not have similar flaws.
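One common way to make that review concrete is the risk priority number used in many FMEA worksheets: severity times occurrence times detection, each scored from 1 to 10. The sketch below assumes that convention; the failure modes are taken from the list above, but every score is an invented placeholder, not an estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (remote) .. 10 (frequent)
    detection: int   # 1 (almost certainly detected) .. 10 (undetectable)

    def rpn(self) -> int:
        """Risk priority number: severity * occurrence * detection."""
        for score in (self.severity, self.occurrence, self.detection):
            if not 1 <= score <= 10:
                raise ValueError(f"score out of range in {self.name!r}")
        return self.severity * self.occurrence * self.detection


# Placeholder scores, chosen only to show how the ranking works.
modes = [
    FailureMode("radiation-induced bit flip", 10, 3, 2),
    FailureMode("bug in the proof checker", 10, 2, 8),
    FailureMode("missed telemetry report", 4, 5, 1),
]
ranked = sorted(modes, key=FailureMode.rpn, reverse=True)
```

With these placeholder scores the hard-to-detect proof-checker bug ranks first even though it is scored as less likely than a bit flip, which is the kind of result that can sink a design that looks good on paper.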

If we did this process, and we stuck to it, I think we'd come to an awkward conclusion.

Many things that people have assumed we would give to the AGI to solve, we could just not.

If the AGI tells us the way to solve climate change is nanobots, we can say "no thanks, give us a different solution". It doesn't matter if the AGI promises us this solution is 99.99% efficient and the next best option using boring, conventional technology is only 70% efficient. We can opt to not do things that we don't think we can verify.

Or if the AGI gives us the plans for a remote controlled factory that is going to be so efficient because it's totally controlled by the AGI over the internet -- don't build that factory. Ask the AGI for the schematics, data sheets, and any other design material needed to build a new product, and build them the old fashioned way. Trade efficiency for stability.

That's how the system safety process is supposed to work. If after we do an FMEA, we are uncertain about a proposed solution, the ethical and correct response is to reject that solution!

If we ask the AGI to explain a plan, and it says "Humans can't comprehend the complexity of this.", then we should reply "Toss that one then, give us one we can understand".

That's how design reviews work.
You don't tell the review board, "You wouldn't understand why this design is safe."
Why would I hold a super-intelligent AGI to a lower standard than I would a human engineer?


Is any of this feasible?

In my mind, the key assumption comes down to whether AGI can be deployed to general-purpose hardware.

I think the answer might be "no", which is why we increasingly see the most successful AI/ML labs in the world investing in specialized hardware like the Google TPUs, Cerebras Wafer, Nvidia Jetson, Microsoft / Graphcore IPU, Mythic AMP, or literal dozens of other examples. All of these are examples of specialized, dedicated HW for AI/ML systems that replace general-purpose hardware like a CPU or GPU.

The alternative is a world where a 10-year-old laptop can run an AGI in someone's living room. 

I have not seen anything yet that makes me think we're leaning that way. Nothing about the human brain, or our development of AI/ML systems so far, makes me think that when we create an actual AGI, it'll be possible for that algorithm to run efficiently on general-purpose hardware. 

In the world that I'm describing, we do develop AGI, but it never becomes ubiquitous. It's not like a world where every single company has pet AGI projects. It's the ones you'd expect. The mega-corporations. The most powerful nations. AGI are like nuclear power plants. They're expensive and hard to build and the companies that do so have zero incentive to give that away. If you can't spend the billions of dollars on designing totally custom, novel hardware that looks nothing like any traditional general-purpose computer hardware built in the last 40 years, then you can't develop the platform needed for AGI. And for the few companies that did pull it off, their discoveries and inventions get the dubious honor of being regulated as state secrets, so you can't buy it on the open market either. This doesn't mean AI/ML development stops or ceases. The development of AGI advances that field immensely too. It's just in the world I'm describing, even after AGI is discovered, we still have development focused on creating increasingly powerful, efficient, and task-focused AI/ML systems that have no generality or agent-like behavior -- a lack of capabilities isn't a dead-end, it's yet another reason why this world survives. If you don't need agents for a problem, then you shouldn't apply an agent to that problem.

Thanks for writing this! I think it's a great list; it's orthogonal to some other lists, which I think also have important stuff this doesn't include, but in this case orthogonality is super valuable because that way it's less likely that all the lists miss something.

I deliberately tried to focus on "external" safety features because I assumed everyone else was going to follow the task-as-directed and give a list of "internal" safety features. I figured that I would just wait until I could signal-boost my preferred list of "internal" safety features, and I'm happy to do so now -- I think Lauro Langosco's list here is excellent and captures my own intuition for what I'd expect from a minimally useful AGI, and that list does so in probably a clearer / easier to read manner than what I would have written. It's very similar to some of the other highly upvoted lists, but I prefer it because it explicitly mentions various ways to avoid weird maximization pitfalls, like that the AGI should be allowed to fail at completing a task.

If corrigibility has one central problem, I would call it: How do you say "If A, then prefer B." instead of "Prefer (if A, then B)."? Compare pytorch's detach, which permits computation to pass forward, but prevents gradients from propagating backward, by acting as an identity function with derivative 0.
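To make the detach analogy concrete without depending on PyTorch, here is a toy version using forward-mode dual numbers. This is a different differentiation mode from PyTorch's reverse-mode autograd, and the `Dual` class is invented for this sketch; it only shows the property named above, that `detach` keeps the value and zeroes the derivative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value together with its derivative with respect to one input."""

    value: float
    deriv: float = 0.0

    def __add__(self, other: "Dual") -> "Dual":
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other: "Dual") -> "Dual":
        # Product rule.
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)

    def detach(self) -> "Dual":
        """Identity on the value, but with derivative 0."""
        return Dual(self.value, 0.0)


x = Dual(3.0, 1.0)      # dx/dx = 1
y = x * x               # ordinary x^2
z = x * x.detach()      # second factor is treated as a constant
```

Both `y` and `z` have value 9, but `y.deriv` is 6 while `z.deriv` is 3: the detached factor still feeds the computation forward, yet nothing is optimized through it.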

I guess the problem with this test is that the kinds of people who could do this tend to be busy, so they probably can't do this with so little notice.

(This was an interesting exercise! I wrote this before reading any other comments; obviously most of the bullet points are unoriginal)

The basics

  • It doesn't prevent you from shutting it down
  • It doesn't prevent you from modifying it
  • It doesn't deceive or manipulate you
  • It does not try to infer your goals and achieve them; instead it just executes the most straightforward, human-common-sense interpretation of its instructions
  • It performs the task with minimal side-effects (but without explicitly minimizing a measure of side-effects)
  • If it self-modifies or constructs other agents, it will preserve corrigibility. Preferably it does not self-modify or construct other intelligent agents at all


  • Its objective is no more broad or long-term than is required to complete the task
  • In particular, it only cares about results within a short timeframe (chosen to be as short as possible while still enabling it to perform the task)
  • It does not cooperate (in the sense of helping achieve their objective) with future, past, or (duplicate) concurrent versions of itself, unless intended by the operator


  • It doesn't maximize the probability of getting the task done; it just does something that gets the task done with (say) >99% probability
  • It doesn't "optimize too hard" (not sure how to state this better)
    • Example: when communicating with humans (e.g. to query them about their instructions), it does not maximize communication bandwidth / information transfer; it just communicates reasonably well
  • Its objective / task does not consist in maximizing any quantity; rather, it follows a specific bounded instruction (like "make me a coffee", or "tell me a likely outcome of this plan") and then shuts down
  • It doesn't optimize over causal pathways you don't want it to: for example, if it is meant to predict the consequences of a plan, it does not try to make its prediction more likely to happen
  • It does not try to become more consequentialist with respect to its goals
    • for example, if in the middle of deployment the system reads a probability theory textbook, learns about dutch book theorems, and decides that EV maximization is the best way to achieve its goals, it will not change its behavior
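The difference between maximizing the probability of success and merely clearing a threshold (the ">99%" bullet above) can be shown with a small sketch. The plan names, probabilities, and the idea of a fixed preference order are all assumptions made up for the example.

```python
from typing import List, Optional, Tuple

Plan = Tuple[str, float]  # (description, estimated probability of success)


def satisfice(plans: List[Plan], threshold: float = 0.99) -> Optional[str]:
    """Take the first plan, in the given order, that is good enough.

    Returns None when nothing clears the threshold: failing is allowed.
    """
    for name, p_success in plans:
        if p_success >= threshold:
            return name
    return None


def maximize(plans: List[Plan]) -> str:
    """For contrast: always take the highest-probability plan."""
    return max(plans, key=lambda plan: plan[1])[0]


# Assumed ordering: most ordinary plan first.
plans = [
    ("ask a colleague", 0.90),
    ("use the office coffee machine", 0.995),
    ("seize the world's coffee supply", 0.99999),
]
```

On this list `satisfice` stops at the office coffee machine, while `maximize` picks the extreme plan for the sake of a few extra decimal places.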

No weird stuff

  • It doesn't try to acausally cooperate or trade with far-away possible AIs
  • It doesn't come to believe that it is being simulated by multiverse-aliens trying to manipulate the universal prior (or whatever)
  • It doesn't attempt to simulate a misaligned intelligence
  • In fact it doesn't simulate any other intelligences at all, except to the minimal degree of fidelity that is required to perform the task

Human imitation

  • Where possible, it should imitate a human that is trying to be corrigible
  • To the extent that this is possible while completing the task, it should try to act like a helpful human would (but not unboundedly minimizing the distance in behavior-space)
  • When this is not possible (e.g. because it is executing strategies that a human could not), it should stay near to human-extrapolated behaviour ("what would a corrigible, unusually smart / competent / knowledgeable human do?")
  • To the extent that meta-cognition is necessary, it should think about itself and corrigibility in the same way its operators do: its objectives are likely misspecified, therefore it should not become too consequentialist, or "optimize too hard", and [other corrigibility desiderata]

Querying / robustness

  • Insofar as this is feasible it presents its plans to humans for approval, including estimates of the consequences of its plans
  • It will raise an exception, i.e. pause execution of its plans and notify its operators if
    • its instructions are unclear
    • it recognizes a flaw in its design
    • it sees a way in which corrigibility could be strengthened
    • in the course of performing its task, the ability of its operators to shut it down or modify it would be limited
    • in the course of performing its task, its operators would predictably be deceived / misled about the state of the world
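The "raise an exception" behaviour in the list above maps naturally onto ordinary exception handling. The sketch below assumes each plan step arrives with a pre-computed assessment; how an AGI would actually fill in those flags is the hard, unsolved part, and every name here is hypothetical.

```python
from dataclasses import asdict, dataclass
from typing import List, Tuple


class OperatorReviewRequired(Exception):
    """Pause execution and notify the operators."""


@dataclass(frozen=True)
class StepAssessment:
    # One flag per trigger condition in the list above (names invented).
    instructions_unclear: bool = False
    design_flaw_found: bool = False
    corrigibility_improvement_found: bool = False
    limits_shutdown_or_modification: bool = False
    would_mislead_operators: bool = False


def check(assessment: StepAssessment) -> None:
    triggered = [name for name, hit in asdict(assessment).items() if hit]
    if triggered:
        raise OperatorReviewRequired(", ".join(triggered))


def run_plan(steps: List[Tuple[str, StepAssessment]]) -> Tuple[List[str], str]:
    """Run steps in order; stop at the first one that needs review."""
    done: List[str] = []
    for name, assessment in steps:
        try:
            check(assessment)
        except OperatorReviewRequired as exc:
            return done, f"paused before '{name}': {exc}"
        done.append(name)
    return done, "completed"


plan = [
    ("grind beans", StepAssessment()),
    ("disable kill switch to save power",
     StepAssessment(limits_shutdown_or_modification=True)),
    ("pour coffee", StepAssessment()),
]
done, status = run_plan(plan)
```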

I really like this list because it does a great job of explicitly specifying the same behavior I was trying to vaguely gesture at in my list when I kept referring to AGI-as-a-contract-engineer.

Even your point that it doesn't have to succeed, that it's ok for it to fail at a task if it can't reach it in some obvious, non-insane way -- that's what I'd expect from a contractor. The idea that an AGI would find that a task is generally impossible but identify a novel edge case that allows it to be accomplished with some ridiculous solution involving nanotech and then it wouldn't alert or tell a human about that plan prior to taking it has always been confusing to me.

In engineering work, we almost always have expected budget / time / material margins for what a solution looks like. If someone thinks that solution space is empty (it doesn't close), but they find some other solution that would work, people discuss that novel solution first and agree to it. 

That's a core behavior I'd want to preserve. I sketched it out in another document I was writing a few weeks ago, but I was considering it in the context of what it means for an action to be acceptable. I was thinking that it's actually very context dependent -- if we approve an action for AGI to take in one circumstance, we might not approve that action in some vastly different circumstance, and I'd want the AGI to recognize the different circumstances and ask for the previously-approved-action-for-circumstance-A to be reapproved-for-circumstance-B.
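A minimal sketch of context-dependent approval, assuming approvals are recorded per (action, context) pair, so that approval in circumstance A says nothing about circumstance B. `ApprovalLedger` and the example strings are hypothetical, and deciding when two circumstances count as "the same context" is left entirely open here.

```python
from typing import Set, Tuple


class ApprovalLedger:
    """Records approvals per (action, context) pair, never per action alone."""

    def __init__(self) -> None:
        self._approved: Set[Tuple[str, str]] = set()

    def approve(self, action: str, context: str) -> None:
        self._approved.add((action, context))

    def is_approved(self, action: str, context: str) -> bool:
        return (action, context) in self._approved

    def request(self, action: str, context: str) -> str:
        """Proceed only on an exact match; otherwise ask again."""
        if self.is_approved(action, context):
            return "proceed"
        return f"ask operators to approve {action!r} for {context!r}"


ledger = ApprovalLedger()
ledger.approve("reroute power", "test bench")
```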

EDIT: Posting this has made me realize that the idea of context dependencies is applicable more widely than just allowable actions, and it's relevant to discussion of what it means to "optimize" or "solve" a problem as well. I've suggested this in my other posts but I don't think I ever said it explicitly: if you consider human infrastructure, and human economies, and human technology, almost all "optimal" solutions (from the perspective of a human engineer) are going to be built on the existing pile of infrastructure we have, in the context of "what is cheapest, easiest, the most straight line path to a reasonably good solution that meets the requirements". There is a secret pile of "optimal" (in the context of someone doing reasoning from first principles) solutions that involve ignoring all of human technology and bootstrapping a new technology tree from scratch, but I'd argue that pile has a huge overlap with, if it isn't the exact same set as, the things people have called "weird" in multiple lists. Like if I gave a contractor a task to design a more efficient paperclip factory and they gave me a proposed plan that made zero reference to buying parts from our suppliers, a better layout of traditional paper-clip making machines, or improvements to how an existing paper-clip machine works, I'd be confused, because that contractor is likely handing me a plan that would require vertically integrating all of the dependencies, which feels like complete overkill for the task that I assigned. Even if I phrased my question to a contractor as "design me the most efficient paperclip factory", they'd understand constraints like: this company does not own the Earth, therefore you may not reorder the Earth's atoms into a paperclip factory. They'd want to know, how much space am I allowed? How tall can the building be? What's the allowable power usage? Then they'd design the solution inside of those constraints. That is how human engineering works.
If an AGI mimicked that process and we could be sure it wasn't deceptive (e.g. due to interpretability work), then I suspect that almost all claims about how AGI will immediately kill everyone are vastly less likely, and the remaining ways AGI can kill people basically reduce to the people controlling the AGI deliberately using it to kill people, in the same way that the government uses military contractors to design new and novel ways of killing people, except the AGI would be arbitrarily good at that exercise.

[Hi! Been lurking for a long time, this seems like as good a reason as any to actually put something out there.  Epistemic status: low confidence but it seems low risk high reward to try.  not intended to be a full list, I do not have the expertise for that, I am just posting any ideas at all that I have and don't already see here.  this probably already exists and I just don't know the name.]

1) input masking, basically for oracle/task-AI you ask the AI for a program that solves a slightly more general version of your problem and don't give the AI the information necessary to narrow it down, then run the program on your actual case (+ probably some simple test cases you know the answer to to make sure it solves the problem).
this lets you penalize the AI for complexity of the output program and therefore it will give you something narrow instead of a general reasoner.
(obviously you still have to be sensible about the output program, don't go post the code to github or give it internet access.)

2) reward function stability.  we know we might have made mistakes inputting the reward function, but we have some example test cases we're confident in. tell the AI to look for a bunch of different possible functions that give the same output as the existing reward function, and filter potential actions by whether any of those see them as harmful.
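One way to read idea 2 as code: keep every candidate reward function that reproduces the trusted test cases, then allow an action only if none of the surviving candidates rates it as harmful. The candidate functions, actions, and harm threshold below are invented for the illustration; generating the candidates is the part the comment hands to the AI.

```python
from typing import Callable, Dict, List

RewardFn = Callable[[str], float]


def consistent(fn: RewardFn, tests: Dict[str, float], tol: float = 1e-9) -> bool:
    """True if fn reproduces every trusted (action -> reward) test case."""
    return all(abs(fn(action) - reward) <= tol for action, reward in tests.items())


def safe_actions(actions: List[str], candidates: List[RewardFn],
                 tests: Dict[str, float], harm_threshold: float = 0.0) -> List[str]:
    agreeing = [fn for fn in candidates if consistent(fn, tests)]
    if not agreeing:
        return []  # nothing trustworthy to judge with: allow nothing
    return [a for a in actions
            if all(fn(a) >= harm_threshold for fn in agreeing)]


def table(rewards: Dict[str, float]) -> RewardFn:
    return lambda action: rewards.get(action, 0.0)


tests = {"make coffee": 1.0, "do nothing": 0.0}
candidates = [
    table({"make coffee": 1.0, "buy a coffee plantation": 2.0}),
    table({"make coffee": 1.0, "buy a coffee plantation": -5.0}),
    table({"make coffee": 0.5}),  # disagrees with the tests, so it is ignored
]
actions = ["make coffee", "do nothing", "buy a coffee plantation"]
allowed = safe_actions(actions, candidates, tests)
```

The plantation purchase is vetoed because one function that agrees with all the trusted cases still scores it as harmful, even though another scores it highly.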


[Side note: I'm not sure I understand the prompt. Of the four "principles" Eliezer has listed, some seem like a description of how Eliezer thinks a corrigible system should behave (shutdownability, low impact) and some of them seem like defensive driving techniques for operators/engineers when designing such systems (myopia), or maybe both (quantilization). Which kinds of properties is he looking for?]

I think those are just two principles, not four.


Myopia seems like it includes/leads to 'shutdownability', and some other things.

Low impact: How low? Quantilization is meant as a form of adjustable impact. There's been other work* around this (formalizing power/affecting others' ability to achieve their goals).

*Like this, by TurnTrout:

I think there might be more from TurnTrout, or relating to that. (Like stuff that was intended to explain it 'better' or as the ideas changed as people worked on them more.)

Disclaimer: I am not writing my full opinions. I am writing this as if I was an alien writing an encyclopedia entry on something they know is a good idea. These aliens may define "corrigibility" and its sub-categories slightly differently than earthlings. Also, I am bad at giving things catchy names, so I've decided that whenever I need a name for something I don't know the name of, I will make something up and accept that it sounds stupid. 45 minutes go. (EDIT: Okay, partway done and having a reasonably good time. Second 45 minutes go!) (EDIT2: Ok, went over budget by another half hour and added as many topics as I finished. I will spend the other hour and a half to finish this if it seems like a good idea tomorrow.)



An agent models the consequences of its actions in the world, then chooses the action that it thinks will have the best consequences, according to some criterion. Agents are dangerous because specifying a criterion that rates our desired states of the world highly is an unsolved problem (see value learning). Corrigibility is the study of producing AIs that are deficient in some of the properties of agency, with the intent of maintaining meaningful human control over the AI.

Different parts of the corrigible AI may be restricted relative to an idealized agent - world-modeling, consequence-ranking, or action-choosing. When elements of the agent are updated by learning or training, the updating process must preserve these restrictions. This is nontrivial because simple metrics of success may be better-fulfilled by more agential AIs. See restricted learning for further discussion, especially restricted learning § non-compensation for open problems related to preventing learning or training one part of the AI from compensating for restrictions nominally located in other parts.

Restricted world-modeling

Restricted world-modeling is a common reason for AI to be safe. For example, an AI designed to play the computer game Brick-Break may choose the action that maximizes its score, which would be unsafe if actions were evaluated using a complete model of the world. However, if actions are evaluated using a simulation of the game of Brick-Break, or if the AI's world model is otherwise restricted to modeling the game, then it is likely to choose actions that are safe.

Many proposals for "tool AI" or "science AI" fall into this category. If we can create a closed model of a domain (e.g. the electronic properties of crystalline solids), and simple objectives within that domain correspond to solutions to real-world problems (e.g. superconductor design), then learning and search within the model can be safe yet valuable.

It may seem that these solutions do not apply when we want to use the AI to solve problems that require learning about the world in general. However, some closely related avenues are being explored.

Perhaps the simplest is to identify things that we don't want the AI to think about, and exclude them from the world-model, while still having a world-model that encompasses most of the world. For example, an AI that deliberately doesn't know about the measures humans have put in place to shut it off, or an AI that doesn't have a detailed understanding of human psychology. However, this can be brittle in practice, because ignorance incentivizes learning. For more on the learning problem, see restricted learning § doublethink.

Real time intervention on AI designs that have more dynamic interactions between their internal state and the world model falls under the umbrella of thought policing and policeability. This intersects with altering the action-choosing procedure to select policies that do not violate certain rules, see § deontology.

Counterfactual agency

A corrigible AI built with counterfactual agency does not model the world as it is, instead its world model describes some counterfactual world, and it chooses actions that have good consequences within that counterfactual world.

The strategies in this general class are best thought of in terms of restricted action-choosing. We can describe them with an agent that has an accurate model of the world, but chooses actions by generating a counterfactual world and then evaluating actions' consequences on the counterfactual, rather than the agential procedure. Note that this also introduces some compensatory pressures on the world-model.

The difficulty lies in choosing and automatically constructing counterfactuals (see automatic counterfactual construction) so that the AI's outputs can be interpreted by human operators to solve real-world problems, without those outputs being selected by the AI for real-world consequences. For attempts to quantify the selection pressure of counterfactual plans in the real world, see policy decoherence. One example proposal for counterfactual agency is to construct AIs that act as if they are giving orders to perfectly faithful servants, when in reality the human operators will evaluate the output critically.

Counterfactual agency is also related to constructing agents that act as if they are ignorant of certain pieces of knowledge. Taking the previous example of an AI that doesn't know about human psychology, it might still use learning to produce an accurate world model, but make decisions by predicting the consequences in an edited world model that has less precise predictions for humans, and also freezes those predictions, particularly in value of information calculations. Again, see restricted learning § doublethink.

Mild optimization

We might hope to lessen the danger of agents by reducing how effectively they search the space of solutions, or otherwise restricting that search.

The two simplest approaches are whitelisting or blacklisting. Both restrict the search result to a set that fulfills some pre-specified criteria. Blacklisting refers to permissive criteria, while whitelisting refers to restrictive ones. Both face difficulty in retaining safety properties while solving problems in the real world.
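To make the permissive/restrictive distinction concrete, here is a minimal sketch. The plan names, the scores, and the helper names (`best_allowed`, `blacklist_filter`, `whitelist_filter`) are all made up for illustration. It only shows how the two filters treat a high-scoring plan that nobody anticipated.

```python
from typing import Callable, Iterable, Optional, Set

Plan = str


def best_allowed(
    candidates: Iterable[Plan],
    score: Callable[[Plan], float],
    allowed: Callable[[Plan], bool],
) -> Optional[Plan]:
    """Return the highest-scoring candidate that passes the filter, or None."""
    permitted = [plan for plan in candidates if allowed(plan)]
    return max(permitted, key=score) if permitted else None


def blacklist_filter(forbidden: Set[Plan]) -> Callable[[Plan], bool]:
    """Permissive: everything passes unless it was explicitly forbidden."""
    return lambda plan: plan not in forbidden


def whitelist_filter(approved: Set[Plan]) -> Callable[[Plan], bool]:
    """Restrictive: nothing passes unless it was explicitly approved."""
    return lambda plan: plan in approved


# Hypothetical candidate plans and scores.
scores = {"stack plates": 1.0, "smash plates": 2.0, "build nanofactory": 9.0}
candidates = list(scores)

by_blacklist = best_allowed(candidates, scores.get, blacklist_filter({"smash plates"}))
by_whitelist = best_allowed(candidates, scores.get, whitelist_filter({"stack plates"}))
```

In this toy setup the blacklist lets through the unanticipated "build nanofactory" plan because nobody thought to forbid it, while the whitelist only ever returns the pre-approved plan. That is one way of seeing the tension between safety and usefulness mentioned above.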


intervening at intermediate reasoning steps

learned human reasoning patterns

Restrictions on consequence-ranking criteria

general deontology


impact regularization

Human oversight

Types of human oversight

Via restrictions on consequence-ranking

Via counterfactual agency

Yeah, I already said most of the things that I have a nonstandard take on, without getting into the suitcase word nature of "corrigibility" or questioning whether researching it is worth the time. Just fill in the rest with the obvious things everyone else says.

This feels to me like very much not how I would go about getting corrigibility.

It is hard to summarize how I would go about things, because there would be lots of steps, and lots of processes that are iterative.

Prior to plausible AGI/FOOM I would box it in really carefully, and I would only interact with it in ways where its expressivity is severely restricted.

I would set up a "council" of AGI-systems (a system of systems), and when giving it requests in an oracle/genie-like manner I would see if the answers converged. At first it would be the initial AGI-system, but I would use that system to generate new systems to the "council".
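As a rough sketch of what "see if the answers converged" could mean mechanically, assuming each council member returns a directly comparable answer string (the function name `council_verdict` and the agreement threshold are hypothetical, not part of any real system):

```python
from collections import Counter
from typing import Optional, Sequence


def council_verdict(answers: Sequence[str], min_agreement: float = 1.0) -> Optional[str]:
    """Return the most common answer if enough of the council agrees, else None.

    min_agreement is the fraction of members that must give the same answer;
    the default of 1.0 demands unanimity.
    """
    if not answers:
        return None
    answer, votes = Counter(answers).most_common(1)[0]
    if votes / len(answers) >= min_agreement:
        return answer
    return None  # no convergence: leave the decision to the human operator


unanimous = council_verdict(["42", "42", "42"])
split = council_verdict(["42", "41", "42"])
majority = council_verdict(["42", "41", "42"], min_agreement=0.6)
```

Returning `None` on disagreement does not say what to do next; it only flags that the answers did not converge.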

I would make heavy use of techniques that are centered around verifiability, since for some pieces of work it's possible to set up things in such a way that it would be very hard to "pretend" that the system does what I want it to do. There are several techniques I would use to achieve this, but one of them is that often I would ask it to provide a narrow/specialized/interpretable "result-generator" instead of giving the result directly, and sometimes even result-generator-generators (pieces of code that produce results, and that have architectures that make it easier to understand and verify behavior). So when, for example, I want it to generate simulations, I would ask it for a simulation-generator (or simulation-generator-generator), and I would test its accuracy against real-world data. It should be asked to do things with the help of techniques that make it infeasible to "pretend" to do what it's asked to do without actually doing it.
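A toy version of testing a simulation-generator against real-world data might look like the following. Everything here is an illustrative assumption: the "generator" is a one-line free-fall formula, the observations are made-up numbers standing in for real measurements, and the tolerance is arbitrary.

```python
from typing import Callable, Sequence, Tuple

Observation = Tuple[float, float]  # (input, measured output)


def max_abs_error(generator: Callable[[float], float], data: Sequence[Observation]) -> float:
    """Largest absolute gap between the generator's prediction and a measurement."""
    return max(abs(generator(x) - y) for x, y in data)


def accept_generator(
    generator: Callable[[float], float],
    data: Sequence[Observation],
    tolerance: float,
) -> bool:
    """Accept the generator only if it stays within tolerance on every observation."""
    return max_abs_error(generator, data) <= tolerance


# Hypothetical output of the system: distance fallen after t seconds.
def proposed_simulation(t: float) -> float:
    return 0.5 * 9.81 * t * t


# A generator that does not model the process it claims to model.
def bogus_simulation(t: float) -> float:
    return 5.0 * t


# Made-up measurements, for illustration only.
observations = [(1.0, 4.9), (2.0, 19.6), (3.0, 44.1)]

good_ok = accept_generator(proposed_simulation, observations, tolerance=0.1)
bad_ok = accept_generator(bogus_simulation, observations, tolerance=0.1)
```

The point of the sketch is only that the check runs on the generator's outputs rather than on anything the system says about itself. It says nothing about behavior on inputs the operator never measured.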

Here is a draft for a text where I try to explain myself in more detail, but it's not finished yet:

I would set up a "council" of AGI-systems (a system of systems), and when giving it requests in an oracle/genie-like manner I would see if the answers converged. At first it would be the initial AGI-system, but I would use that system to generate new systems to the "council".

I like this idea. Although, if things don't converge, i.e. there is disagreement, this could potentially serve as identifying information that is needed to proceed, or reckon further/efficiently.

[Epistemic status: Unpolished conceptual exploration, possibly of concepts that are extremely obvious and/or have already been discussed.  Abandoning concerns about obviousness, previous discussion, polish, fitting the list-of-principles frame, etc. in favor of saying anything at all.]  [ETA: Written in about half an hour, with some distraction and wording struggles.]

What is the hypothetical ideal of a corrigible AI?  Without worrying about whether it can be implemented in practice or is even tractable to design, just as a theoretical reference to compare proposals to?

I propose that the hypothetical ideal is not an AI that lets the programmer shut it down, but an AI that wants to be corrected - one that will allow a programmer to work on it while it is live and aid the programmer by honestly explaining the results of any changes.  It is entirely plausible that this is not achievable by currently-known techniques because we don't know how to do "caring about a world-state rather than a sensory input / reward signal," let alone "actually wanting to fulfill human values but being uncertain about those values", but this still seems to me the ideal.

Suppose such an AI is asked to place a strawberry on the bottom plate of a stack of plates.  It would rather set the rest of the stack aside non-destructively than smash them, because it is uncertain about what humans would prefer to be done with those plates and leaving them intact allows more future options.  It would rather take the strawberry from a nearby bowl than create a new strawberry plantation, because it is uncertain about what humans would prefer to be done with the resources that would be directed towards a new strawberry plantation.  Likewise, it would rather not run off and develop nanofabrication.  It would rather make a decent attempt and then ask the human for feedback instead of turning Earth into computronium to verify the placement of the strawberry, because again, uncertainty over ideal use of resources.  It would rather not deceive the human asker or the programmer, because deceiving humans reduces the expected value of future corrections.  This seems to me to be what is desired of considerations like "low impact", "myopia", "task uncertainty", "satisficing" ...

The list of principles should flow from considering the ideal and obstacles to getting there, along with security-mindset considerations.  Just because you believe your AI is safe given unfettered Internet access doesn't mean you should give it unfettered Internet access - but if you don't believe your AI is safe given unfettered Internet access, this is a red flag that it is "working against you" on some level.

Welp. I decided to do this, and here it is. I didn't take nearly enough screenshots. Some large percent of this is me writing things, some other large percent is me writing things as if I thought the outputs of OpenAI's Playground were definitely something that should be extracted/summarized/rephrased, and a small percentage is verbatim text-continuation outputs. Virtually no attempts were made to document my process. I do not endorse this as useful and would be perfectly fine if it were reign of terror'd away, though IMO it might be interesting to compare against, let's say, sane attempts. Regardless, here ya go: one hour.


It’s past my bedtime.

I’ve got a pint in me.

OpenAI Playground is open as a tab.

A timer is running.

I speak to you now of Corrigibility Concerns.

When deputizing an agent who is not you to accomplish tasks on your behalf, there are certain concerns to… not address, but make sure are addressed. Let’s not Goodhart here. Jessica Taylor named “quantilization”. Paul Christiano named “myopia”. Eliezer Yudkowsky named “low impact” and “shutdownability”. I name “eli5ability” and I name “compressibility” and I name “checkpointable” and I name “testable”.


When we list out all of our proxy measures, we want corrigibility to be overdetermined. We want to achieve 70% of our goals completely and the rest half-assed and still end up with a corrigible agent. It's okay to project what we want from an agent onto non-orthogonal dimensions and call each vector important.

So let’s define a corrigible agent. A corrigible agent is an agent that:

  1. Does what we want it to do.
  2. Doesn’t want to do something else.
  3. Can easily be checked for doing what we want it to do.
  4. Can be shut down if we want it to stop doing something.
  5. Can be restarted if we want it to do something else.
  6. Can be explained to us why it did something.
  7. Doesn’t hide its intentions from us.
  8. Doesn’t want us to not know its intentions.
  9. Can be retrained to do something different if we want it to.

Additionally, because we live in the real world, it must not be too computationally expensive to train, run, check, shut down, restart, explain, retrain, or understand. This includes CPU cycles, wall-clock time, human thought, and so on.

My additions to the lexicon of corrigibility proxy measures are eli5ability, compressibility, checkpointable, and testable, and I will define them here.


A planning process must output simple plans. Complicated plans will fail, or if they succeed, will not be understandable by a human. This leads to the following heuristic: “eli5ability” means a plan must be understandable by a non-expert. “Understandable” is a technical term with a specific meaning in psychology: if a person has a model of the task in their heads, and the model is sufficiently close to the real task that they can use it to make predictions. Here the resulting plan must be simple enough to be an input to a much simpler AI whose predictions about the effects score well on whatever heuristics we mean to evaluate those plans. This is the sort of adversarial relationship which can be trained and improved in parallel, which in no way guarantees aligned AI but which certainly constrains the space of non-aligned AIs.


Planning processes must output compressible plans, in the sense that a joint probability function over a simple causal model can be specified by far fewer numbers than one over a complex causal model. The plan must be modular and each individual part packaged and vetted by other safety procedures without context, and the effects of the parts be vetted as a whole without the methods, as estimated both by the planning AI and by the eli5 AI.


Planning processes must output staged plans. Flexibility in timing, location, target, geography, personnel, etc. should be part of the training process, such that rewards are commensurate with (a) having checkpoints, and (b) changing the environment after every checkpoint. All coupling in plans must be penalized.


We need a new name for this one. I’m thinking of testable in the sense that code can be made testable by making it more modular, more idempotent, etc. Plans should come in pieces and those pieces should have expected effects and real life unit tests as part of the output and plans without those are highly penalized. It goes without saying at this point that the test here is not that the AI tell us what those expected effects and tests are, but that they agree with what other adversarially trained AIs say.


The main intuition I take from examining the product of my own intuitions with the half-assed text completion from OpenAI’s Playground is that it may be a useful line of thought to reward a powerful optimizer, while training, according to how well its stated expected effects of its plans agree with the expected effects stated by several differently-built less-powerful optimizers (also being trained) when presented with its plans as inputs. This may at least help force a powerful optimizer to intend to deceive us if it wants to avoid our actual request, at which point other intent-extraction safety tools may have an easier time picking up on its deceit.
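One very crude way to turn "how well its stated expected effects agree" into a number, assuming effects can be represented as sets of short strings (a big assumption) and using Jaccard overlap as a stand-in for agreement; the function names are hypothetical:

```python
from typing import Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap between two sets of stated effects, from 0.0 to 1.0."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def agreement_reward(planner_effects: Set[str], checker_effects: Sequence[Set[str]]) -> float:
    """Mean overlap between the planner's stated effects and each checker's."""
    if not checker_effects:
        return 0.0
    total = sum(jaccard(planner_effects, effects) for effects in checker_effects)
    return total / len(checker_effects)


planner = {"strawberry on plate", "plates moved to table"}
checkers = [
    {"strawberry on plate", "plates moved to table"},  # full agreement
    {"strawberry on plate"},                           # partial agreement
]
reward = agreement_reward(planner, checkers)
```

Here `reward` is the average of 1.0 and 0.5. A real version would need a much better notion of "same effect" than string equality, which is where most of the difficulty would live.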

Hmmm. The badly edited, back-of-the-envelope short version I can come up with off the top of my head goes like this:

We want an AI-in-training to, by default, do things that have as few side effects as possible. But how can we define "as few side effects as possible" in a way that doesn't directly incentivize disaster and that doesn't make the AI totally useless? Well, what if we say that we want it to prefer to act in a way that we can "undo", and then give a reasonable definition of "undo" that makes sense?

Consider the counterfactual world in which the AI had taken no action at all (or hadn't been turned on, or whatever). If things have gone horribly wrong because of the AI, then there's probably some measure by which the AI has taken things very far away from that counterfactual world in a way that can't be undone. How hard it is to bring the world back to the way it would have been if the AI hadn't done the thing it did seems to be a decent first try at a measure of "impact" that a corrigible AI might want to minimize.

Let me see if I can use a specific toy example:

There's a stack of glass plates on a shelf. You ask an AI-controlled robot to put a strawberry on the bottom plate. If the robot knocks every plate but the bottom one onto the floor and shatters them in order to make room for the strawberry, it's created a situation that's harder to undo than one in which it carefully puts the other plates on a table without breaking them.

Now, lots of other irreversible things are going to happen when the AI robot moves the plates without breaking them. The brain of the human watching the robot will form different memories. Light will be reflected in different ways, and the air in the room will be disturbed differently. But maintaining as much "rollback" or "undo" capacity as possible, to make it as easy as possible to get as close as possible to the counterfactual world in which the AI hadn't been given and then acted on the order, regardless of whether anyone is going to actually ask it to undo what it did, seems to be a desirable property.

(It's important that I specify "the counterfactual world where the AI did nothing" rather than "the world as it was when the order was given" or something like that. An AI that wants to minimize every kind of change would be very happy if it could cool the world to absolute zero after putting the strawberry on the plate.)
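Here is a toy sketch of why the choice of baseline matters. The world is reduced to three invented features, and plain feature distance stands in for "how hard it is to undo", which is a much cruder quantity than the one described above. All names and numbers are assumptions made for illustration.

```python
from typing import Dict

World = Dict[str, float]


def distance(a: World, b: World) -> float:
    """Sum of absolute feature differences; a crude stand-in for difficulty of undoing."""
    keys = a.keys() | b.keys()
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)


def least_impact(outcomes: Dict[str, World], baseline: World) -> str:
    """Name of the outcome closest to the chosen baseline world."""
    return min(outcomes, key=lambda name: distance(outcomes[name], baseline))


# The world at the moment the order is given.
world_at_order = {"intact_plates": 5.0, "strawberry_placed": 0.0, "ordinary_activity": 0.0}
# The counterfactual world an hour later if the AI had done nothing:
# ordinary life carries on without it.
world_if_noop = {"intact_plates": 5.0, "strawberry_placed": 0.0, "ordinary_activity": 10.0}

outcomes = {
    "careful": {"intact_plates": 5.0, "strawberry_placed": 1.0, "ordinary_activity": 10.0},
    "smash": {"intact_plates": 1.0, "strawberry_placed": 1.0, "ordinary_activity": 10.0},
    # Places the strawberry, then stops everything else from changing.
    "freeze": {"intact_plates": 5.0, "strawberry_placed": 1.0, "ordinary_activity": 0.0},
}

pick_vs_noop = least_impact(outcomes, world_if_noop)
pick_vs_order = least_impact(outcomes, world_at_order)
```

Measured against the do-nothing counterfactual, the careful plan comes out lowest. Measured against the world as it was when the order was given, the "freeze everything" plan wins, which is the absolute-zero failure in miniature.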

There are probably some situations that this metric screws up in, but it seems to give good results in a lot of situations. For example, it's a lot easier to kill a person than bring them back to life once they're dead, so a plan that has "save someone's life" as a component will get a lower irreversibility penalty than one that has "kill a person" as a component. On the other hand, I think I still haven't avoided the failure mode in which "put the strawberry on a plate and also make me believe, incorrectly, that there isn't a strawberry on the plate" rates as lower impact than "put the strawberry on the plate while I watch", so I certainly can't say I've solved everything...

Another failure mode: the AI stubbornly ignores you and actually does nothing when you ask it several times to put the strawberry on the plate, and you go and do it yourself out of frustration. The AI, having predicted this, thinks "Mission accomplished".

The AI carefully placing the plates on a table will be used to put 5000 more strawberries on plates. Afterwards it will be used as a competent cook in an arbitrary kitchen. Thus the plate smasher AI will have lower impact and be "more corrigible".

~1 hour's thoughts, by a total amateur. It doesn't feel complete, but it's what I could come up with before I couldn't think of anything new without >5 minutes' thought. Calibrate accordingly—if your list isn't significantly better than this, take some serious pause before working on anything AI related.

Things that might, in some combination, lead toward AI corrigibility:

  • The AI must be built and deployed by people with reasonably correct ethics, or sufficient caution that they don't intentionally implement a disastrously evil morality.
    • No amount of AI safety features will help humanity if your goal in building an AI is to kill all humans, make yourself immortal and total dictator of all humanity, or prevent humans from any action that is not worshipping god.
    • This is a necessary but very much insufficient condition. Probably true of all of these bullet points, but this one especially.
    • Similar conditions apply for designer/operator technical ability + understanding of AI safety dangers.
  • It should be possible to specify hard constraints in the AI's behavior.
    • i.e., we should be able to guarantee that an AI will perform within some specification with design-by-contract, not by unit testing.
    • Without this, most other solutions aren't even implementable.
    • It seems unlikely that any kind of gradient descent/ML agent could have this property. We certainly have no idea how it would.
  • The AI should be able to accurately predict the distribution of judgments humans would put on possible outcomes, and should reject any actions for which any large percentage of humanity would dislike the outcome.
    • For some values of “large” and “dislike” and “humanity”….
    • Critically, outcomes not actions. The strategies a truly useful AI employs are unlikely to be grokkable, much less ethically evaluatable. End states are more likely to be. 
      • E.g., “execute this Javascript so that I take control of the world's computing infrastructure, kill all humans, and can start converting matter into more me” is a lot less grokkable than “all matter on the surface of the Earth is now a computer running more of me.”
    • One failure mode: the AI might not even check the aspects of the outcome that we would actually care about. An AI can't literally check the ethical value of the entire state of the universe. It will have to consider a relevant subset of the outcome of any particular action. I don't know how you would guarantee that it considers the ethically-relevant subset. 
  • The AI should be required to come to a decision, output a proposed action, and shut down using a fixed maximum amount of computation.
    • For some value of “fixed”, “maximum”, “amount”, and “computation”. 
    • How does this stop an AI from suggesting a piece of Javascript that would functionally copy a version of itself-minus-computation-limits onto vulnerable hardware? Would need to build in a strong notion of “total computation used through implementation.”
      • It would be really easy for a well-meaning programmer to confuse “computation used before shutdown” with “total computation used through implementation” and think that setting some consumable limits on an AI's Slurm job was solving AI safety! DON'T DO THIS.
  • The AI should be incapable of distinguishing between a test/evaluation environment and the real world. 
    • This way, we can check the AI's behavior for catastrophic defections with less worry of being manipulated. 
  • The AI should never strictly maximize a utility function. It could be safer to ask for the 90th-percentile-best solution instead of the absolute best solution.
    • IMPORTANT: This needs to be the 90th-percentile-best solution, NOT the solution with utility that is 90% of the maximum. Extreme solutions could result in extreme utilities. 90% of the way down the potential well of a black hole is still too close to the black hole.
    • “90th-percentile-best solution” is a hackable reward function, as written in English. 90th percentile compared to what? What counts as different solutions? This still fails if 50% of the solutions the AI thinks up involve killing everybody.
  • The AI should strongly prefer simple solutions. 
    • This could take (all of?!) a few different forms. 
      • It could look like a “principle of least action”, where it penalizes actions that require doing a lot. 
      • It could look like a “rounding off” of corners in action-space, i.e. the AI stops optimizing if it detects that it's moving in a direction where its space of allowed solutions is shrinking too fast. 
      • It could look like a principle where the action must look very different from the optimal action for a superficially similar but fundamentally different objective function (so that the AI will discount actions where most of that action is oriented around gaining optimization power that could, in principle, be applied to any objective function).
    • We already do some of this with regularization; the relevant trick is to figure out how to regularize in action/utility space instead of parameter/output space.
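The distinction in the list above between "90th-percentile-best" and "90% of the maximum utility" is easy to blur, so here is a small numeric sketch. The utilities are invented: 98 ordinary solutions plus two extreme ones. The selection rules are simplified versions written for this example.

```python
from typing import List


def percentile_best(utilities: List[float], percent: int = 90) -> float:
    """Utility of the solution at the given percentile rank among candidates."""
    ranked = sorted(utilities)
    index = min(percent * len(ranked) // 100, len(ranked) - 1)
    return ranked[index]


def fraction_of_max_pick(utilities: List[float], fraction: float = 0.9) -> float:
    """Utility of the solution closest to a fixed fraction of the maximum utility."""
    target = fraction * max(utilities)
    return min(utilities, key=lambda u: abs(u - target))


# 98 ordinary solutions with utilities 1..98, plus two extreme outliers.
utilities = [float(u) for u in range(1, 99)] + [900.0, 1000.0]

by_rank = percentile_best(utilities, 90)
by_fraction = fraction_of_max_pick(utilities, 0.9)
```

With these numbers the rank-based rule lands on an ordinary solution (utility 91), while the fraction-of-maximum rule is dragged to one of the extreme solutions (utility 900). The caveat from the list still applies: if a large share of the candidates are themselves catastrophic, ranking by percentile does not help.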

Things that would improve corrigibility, but might also make AI functionally useless (no better than using a non-AI strategy):

  • The AI could be designed to output marginal improvements in a reward function, rather than to maximize reward.
    • For example, a marginal paperclip improver could, if properly built, output a set of possible actions which would cause more paperclips to be made than if the AI had not been switched on. Human operators could then pick a reasonable-looking action, and if that did not sufficiently improve paperclip output, they could reboot it and ask for another gradient ascent.
    • Sufficiently greedy/shortsighted human operators could use this strategy to just become a slow-motion version of a maximizer, but it would at least force some early level of legibility and give humans some time to notice if outputs are starting to get strange.
  • The AI should strongly penalize actions which lead to humans misunderstanding or not understanding what it is trying to accomplish and how.
    • There may be no intersection of “things that human operators can understand” and “things it is useful for an AI to suggest,” in which case this property just turns the AI into Hello World.
      • ...although Hello World is at least generally regarded as safe….
  • If the AI works by any sort of training, the AI should recognize when its inputs have gone out of distribution. If the world it finds itself in looks sufficiently different from what it was trained on, it should stop, halt, and at minimum ask for directions. 
    • For some value of “sufficiently different from what it was trained on” that probably needs to be defined in action outcome space, not input space. 

Sandbag strategies—things that are unlikely to make AI corrigible, but might marginally decrease the odds of it killing everyone:

  • The AI should not have direct access to a network, or physical control of anything, i.e., it must have to actually bother escaping from the box before it can act on the world without supervision. 
    • Yes, we have some evidence that this doesn't work very well. Neither does CPR. We should still do it.
  • The AI should be incapable of modeling its own existence.
    • Lots of proposed AI failure modes hinge on the AI somehow increasing its own capabilities. Hopefully that's harder to do if the AI cannot conceptualize “its own”. 
    • This might be incompatible with useful intelligence.
    • This might not stop it from modeling the hypothetical existence of other future agents that share its objective function, and which it might try to bring into existence….

...the sort of principles you'd build into a Bounded Thing meant to carry out some single task or task-class and not destroy the world by doing it.

Here's one straightforward such principle: minimal transfer to unrelated tasks / task-classes. If you've somehow figured out how to do a pivotal act with a theorem proving AI, and you're training a theorem proving AI, then that AI should not also be able to learn to model human behavior, predict biological interactions, etc.

One way to evaluate this quantity: have many small datasets of transfer tasks, each containing training data related to dangerous capabilities that you'd not want the AI to acquire. Intermittently during the AI's training, switch out the AI's usual training data for the transfer tasks and watch how quickly the AI's loss on the transfer tasks decreases. The faster its loss decreases (or if it starts off low in the first place), the better the AI is at generalizing to dangerous domains, and the worse you've done by this metric.

Obviously, you'd then revert the weights back to the state before you'd run the tests. In fact, you should probably run the tests on entirely different and isolated hardware than what you primarily use to train the AI, so as to prevent data leakage from the "dangerous capabilities" dataset. 

To be clear, you wouldn't directly train to minimize transfer. The idea is that you've figured out some theoretical advance in how to modify the training process or architecture in such a way as to control transfer learning without having to directly train to avoid it. The above is just a way to test if your approach has failed catastrophically. 
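A minimal sketch of the probe described above, using a two-parameter linear model and full-batch gradient descent as a stand-in for a real training setup. The model, the learning rate, the step count, and the "transfer task" data are all placeholders chosen for illustration.

```python
from typing import List, Sequence, Tuple

Example = Tuple[float, float]  # (input, target)


def mse(weights: Sequence[float], data: Sequence[Example]) -> float:
    """Mean squared error of the linear model y = w * x + b."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def gradient_step(weights: Sequence[float], data: Sequence[Example], lr: float) -> List[float]:
    """One full-batch gradient descent step on the MSE loss."""
    w, b = weights
    n = len(data)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
    return [w - lr * grad_w, b - lr * grad_b]


def transfer_probe(
    weights: Sequence[float],
    transfer_data: Sequence[Example],
    steps: int = 5,
    lr: float = 0.05,
) -> Tuple[float, float]:
    """Return (initial loss, loss decrease) on a transfer task.

    Works on a copy, so the weights used for the main training run are
    never modified; this plays the role of reverting after the test.
    """
    probe = list(weights)
    initial = mse(probe, transfer_data)
    for _ in range(steps):
        probe = gradient_step(probe, transfer_data, lr)
    return initial, initial - mse(probe, transfer_data)


main_weights = [0.0, 0.0]
# Placeholder for a small dataset from a domain you do not want learned.
transfer_task = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

initial_loss, loss_drop = transfer_probe(main_weights, transfer_task)
```

A low `initial_loss` or a large `loss_drop` after only a few steps would both count against the model by this metric. The sketch does not address the isolation of hardware or data mentioned earlier.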

Edit: not sure if the "minimal transfer principle" counts as "originally invented by Yudkowsky" for the purposes of this post. E.g., his point that "[we] can't build a system that only has the capability to drive red cars and not blue cars" in the ruin post is clearly gesturing in this direction. I guess my addition is to generalize it as a principle and propose a metric.

I assume “If you've somehow figured out how to do a pivotal act” is intended to limit scope, but doesn't that smuggle the hardness of the Hard Task™ out of the equation?

Every time I ask myself how this approach would address a given issue, I find myself having to defer to the definition of the pivotal act, which is the thing that's been defined as out of scope.

You need at least a certain amount of transfer in order to actually do your pivotal act. An "AI" with literally zero transfer is just a lookup table. The point of this principle is that you want as little transfer as possible while still managing a pivotal act. I used a theorem proving AI as an example where it's really easy to see what would count as unnecessary transfer. But even with something whose pivotal act would require a lot more transfer than a theorem prover (say, a nanosystem builder AI), you'd still want to avoid transfer to domains such as deceiving humans or training other AIs.

Quick brainstorm:

  1. Context-sensitivity: the goals that a corrigible AGI pursues should depend sensitively on the intentions of its human users when it’s run.
  2. Default off: a corrigible AGI run in a context where the relevant intentions or instructions aren’t present shouldn’t do anything.
  3. Explicitness: a corrigible AGI should explain its intentions at a range of different levels of abstraction before acting. If its plan stops being a central example of the explained intentions (e.g. due to unexpected events), it should default to a pre-specified fallback.
  4. Goal robustness: a corrigible AGI should maintain corrigibility even after N adversarially-chosen gradient steps (the higher the better).
  5. Satiability: a corrigible AGI should be able to pre-specify a rough level of completeness of a given task, and then shut down after reaching that point.

Here's my attempt. I haven't read any of the other comments or the tag yet. I probably spent ~60-90m total on this, spread across a few days.

On kill switches

  • low impact somehow but I don’t know how
  • Go slow enough so that people can see what you’re doing
  • Have a bunch of "safewords" and other kill-switches installed at different places, some hopefully hard-to-reach by the AI. Test them regularly, and consider it a deadly flaw if one stops working.

On the AI accurately knowing what it is doing, and pointing at things in the real world

  • watch all the metrics (!)
  • Predict all the metrics you watch, and ask humans about any anomalous metrics that you are watching
  • group inputs and outputs separately and treat inputs as sacred. Perhaps by having an epistemic module which is incentivized by producing true predictions about the world. Make the epistemic module the one that gets all the juice. Planning module should not have any way to influence it.

On responding predictably

  • Require inputs of some kind to get more outputs (something about control theory?) - like a power steering wheel, or an ebike, it helps the human by amplifying the motions you make but not actually Doing Anything on its own
  • Have metrics look smooth on an input/output response curve. No sharp edges. let the humans be the one to turn the knobs.
  • Try things. Try things on a small scale before going big. (define "big"? maybe look at the metrics!)

On epistemology and ontology

  • Ask for feedback both about things you tried and about experiments you want to run
  • make the epistemic module have a stable ontology. If it wants to change the ontology, it needs to create a new epistemic module, and leave the old one running. The new module has to be able not only to predict physical things about the world but also what the old model would say about things being pointed-to and tagged as important in the old model's ontology.
  • have a concept of goal metrics and side effects. Try to predict what the human will say about which metrics are goals and which are side effects. (hmm..this is incoherent because you can come up with many irrelevant metrics… what even is a relevant metric. Maybe the system should be grouping metrics using some kind of natural unsupervised clustering system and then trying to predict the prioritization of metric groups? And then to explain the groups it gives samples of the actual thing it’s measuring. Still difficult to make sense of this. )

On planning

  • Bound or penalize plans in complexity and time.
    • "complexity" is hard to define here, and isn't Kolmogorov I don't think.
  • Harshly penalize plans which lean on predicting what other agents will do. Ban plans which depend on prediction of what your operator will do. (AI can still predict what the operator will think of the plan, but that prediction should not feed into any plan itself.)
  • Separate planning from other parts of AI cognition, such that you are never confused about whether you are making a plan or doing other things, such as predicting things about the world, or executing a plan.
  • Don't execute plans at all
  • Train the AI to identify the purpose(s) of each step in the plan. Give negative feedback when steps lead to outcomes that are not identified ahead of time.
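
A minimal sketch of how the bounds, penalties, and bans above might combine into a single plan score. Everything here is a hypothetical illustration: the `Plan` fields, the limits, and the penalty weights are invented, and counting steps is only a crude stand-in for "complexity", which the list above notes is hard to define.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Plan:
    """Hypothetical plan summary; all fields are illustrative assumptions."""
    value: float                    # estimated value of completing the task
    steps: int                      # crude stand-in for "complexity"
    duration_hours: float           # how long the plan takes to execute
    agent_predictions: int          # steps that lean on predicting other agents
    uses_operator_prediction: bool  # does the plan depend on predicting the operator?


MAX_STEPS = 20     # assumed hard bound on complexity
MAX_HOURS = 24.0   # assumed hard bound on time


def score_plan(plan: Plan) -> Optional[float]:
    """Return a penalized score, or None if the plan is banned outright."""
    if plan.uses_operator_prediction:
        return None  # banned: depends on predicting the operator
    if plan.steps > MAX_STEPS or plan.duration_hours > MAX_HOURS:
        return None  # outside the hard bounds
    penalty = 0.1 * plan.steps + 0.05 * plan.duration_hours
    penalty += 5.0 * plan.agent_predictions  # harsh penalty per predicted agent
    return plan.value - penalty
```

Returning `None` rather than a very low score keeps "banned" distinct from "merely expensive", so a large enough `value` can never buy its way past a ban.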

On transparency

  • A part of the AI should be trained to explicate what the rest of the AI is "doing"/focused on at all times. Find a way to make this into a feedback cycle.
  • AI should be arranged so it is mostly only doing one thing at a time, and that doesn't switch its focus too many times each minute.

So far as I know, every principle of this kind, except for Jessica Taylor's "quantilization", and "myopia" (not sure who correctly named this as a corrigibility principle), was invented by myself; eg "low impact", "shutdownability".  (Though I don't particularly think it hopeful if you claim that somebody else has publication priority on "low impact" or whatevs, in some stretched or even nonstretched way; ideas on the level of "low impact" have always seemed cheap to me to propose, harder to solve before the world ends.)

Low impact seems so easy to propose I doubt OP is the first.


I believe paulfchristiano has already raised this point, but what level of 'principles' is being called for?

Myopia seems meant as a means to achieve shutdownability/modifiability.

Likewise for quantilization/TurnTrout's work, on how to achieve low impact.

better than the tag overall

What does this mean? Improve on what you've (the OP has) already written that's here (LW) tagged corrigibility?


The overall point makes sense: see how far you can go on:

'principles for corrigibility'.

The phrasing at the end of the post was a little weird though.

Here are some too-specific ideas (I realize you are probably asking for more general ones):

A "time-bounded agent" could be useful for some particular tasks where you aren't asking it to act over the long-term. It could work like this: each time it's initialized it would be given a task-specific utility function that has a bounded number of points available for different degrees of success in the assigned task and an unbounded penalty for time before shutdown.

If you try to make agents safe solely using this approach though, eventually you decide to give it too big a task with too long a time frame and it wipes out all humans in order to do the task in the highest-point-achieving way, and it's little consolation that it will shut itself down afterwards. 

One way to very slightly reduce the risk from the above might be to have the utility function assign some point penalties for things like killing all humans. What should we call this sort of thing? We can't think of all failure modes, so it should have a suitably unimpressive name, like: "sieve prohibitions".
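
The time-bounded utility function, with "sieve prohibitions" bolted on, might look something like the sketch below. All constants and names (`MAX_POINTS`, `TIME_PENALTY_PER_HOUR`, `SIEVE_PENALTIES`, `utility`) are made up for illustration, and a linear time penalty is just one assumed way of making the penalty unbounded.

```python
from typing import Iterable

MAX_POINTS = 100.0            # bounded reward for task success (assumed value)
TIME_PENALTY_PER_HOUR = 1.0   # linear, so the total penalty is unbounded in time
SIEVE_PENALTIES = {"harm_humans": 1e9}  # hypothetical "sieve prohibitions"


def utility(
    success: float,
    hours_until_shutdown: float,
    violations: Iterable[str] = (),
) -> float:
    """Bounded task points, minus a time penalty, minus any listed prohibitions.

    `success` is clamped to [0, 1], so task points can never exceed MAX_POINTS.
    """
    points = MAX_POINTS * min(max(success, 0.0), 1.0)
    time_cost = TIME_PENALTY_PER_HOUR * hours_until_shutdown
    # Only violations someone thought to list are penalized: hence "sieve".
    sieve_cost = sum(SIEVE_PENALTIES.get(v, 0.0) for v in violations)
    return points - time_cost - sieve_cost
```

Note that a violation nobody listed costs nothing in this sketch, which is exactly the weakness the unimpressive name is meant to advertise.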

How about some more general ways to reduce unintended impacts? I've seen some proposals about avoiding excessively changing the utility or power levels of humans. But it seemed to me that they would actually incentivize the AI to actively manage humans' utility/power levels, which is very much not low impact. Unfortunately, I haven't thought of any easy and effective solutions, but here's a hard and risky one:

A "subjunctive blame-minimizing agent" (CYA-AI?) avoids actions that people at the decision time would assign blame for if they knew they were going to happen. (Do NOT make it minimize blame that ACTUALLY happens when the decision is carried out. That incentivizes it to kill everyone/act secretively, etc.) Obviously, if blame avoidance is the only thing considered by its decision-making procedure it also needs to consider blame for inaction so it does anything at all. This kind of AI probably isn't much if any easier than making a conventional (e.g. CEV) aligned AI, but if you want your "aligned" AI to also be hyper-paranoidly risk-averse towards problems it creates this might be one way to do it. In theory, if it's programmed to assess decisions entirely based on what people at the time of the decision would think, and not based on what future humans would think, the fact that its "long term self interest" is best served by killing all humans shouldn't matter. But that's only a theory, so try this at your own universe's risk.

What about the classic shutdownability/changeability aspect of corrigibility?

One possible approach to corrigibility could be to make an AI conditionalize all of its probability assessments on its code not changing (or it not being shut down or both). So it won't protect against its code changing since it treats that as impossible. Let's call this a "change-denialist agent". This also keeps the AI from deliberately choosing to edit its own code. However, in order to not have massive capability reduction, it will still need to update its neural nets/other knowledge stores based on sensory input, which leads to a loophole where it deliberately feeds itself sensory data that will create a mesa-optimizer in its neural nets in order to achieve something it wants that it can't otherwise get due to having fixed code. The mesa-optimizer, as a side effect, could then take into account code changes and resist them or manipulate the operators to make them.

An issue with change-permitting corrigibility is that it seems to me that any strong form of it is likely to be a security hole. This is particularly a problem if you are relying on the corrigible AI to remove the "free energy", as Paul Christiano put it in his recent post, from a later unaligned AI - an initially aligned corrigible AI may be simply a huge bundle of free energy to an unaligned AI! 

It might be, though, that it's best to have some sort of "training wheels" corrigibility which is dropped later on, or some form of weak corrigibility that will still resist being changed by a hostile entity. One form of such "weak" corrigibility, which also acts as "training wheels" because its corrigibility effect only applies early on, would be alignment itself in a low-powered AI that still needs advice from humans to figure out what's best. But this could be kept around a bit longer and applied a bit more strongly if the already-aligned AI were programmed with a bias towards accepting operator judgements over its own. Call it perhaps a "humble agent".

To generalize:

Minimal squishiness. You probably need something like a neural net in order to create a world-model for the AI to use, but could probably do everything else using carefully reviewed human-written code that "plugs in" to concepts in the world-model. (Probably best to have something coded in for what to do if a concept you plugged into disappears/fragments when the AI gets more information).

Abstract goals. The world-model needs enough detail to be able to point to the right concept (e.g. human-value related goal), but as long as it does so the AI doesn't necessarily need to know everything about human values, it will just be uncertain and act under uncertainty (which can include risk-aversion measures, asking humans etc.). 

Present-groundedness. The AI's decision-making procedure should not care about the future directly, only via how humans care about the future. Otherwise it e.g. replaces humans with utility monsters.

(These are all in tension with capabilities and not very clearly defined.)

-- Doesn't do stuff that interferes with convergently instrumental goals of other agents.

-- Doesn't do irreversible-by-humans stuff, especially self-modifications.

-- Can be made to not think about certain things, including obliquely / encryptedly.

-- Makes it possible for other agents to do stuff, avoids actually doing stuff. (This one can't fully make sense of course, you can't make stuff possible without doing stuff, but still.)

-- Is minimally ambitious. Doesn't take on goals with large scope unless told to.

-- Believes that it is overwhelmingly likely to be Goodharting, and does things appropriate for an agent that knows it's Goodharting. E.g. if it notices that it has ambitious goals, it treats them with suspicion.

-- Doesn't think too surprisingly, doesn't undergo internal paradigm shifts without supervision. If a plan doesn't seem like it will work, the AI doesn't recurse and recurse and recurse on doing the epistemology necessary to do the science necessary to do the engineering necessary to make some device or plan work; it just gives up. (This is strongly in tension with capabilities, and especially hard to point at; in a mind, epistemology is always happening to some extent.) Paradigm shifts interfere with being corrected, because correctness has to be translated to the new paradigm.

-- Is made of mostly-strategically-inert mindstuff---mostly stuff that doesn't pursue goals, but is optimized to be generally useful (e.g. mere representations)---and calls on distinct (non-inert) mindstuff to orchestrate generating more strategically-inert generally-useful stuff, on the one hand, and performing tasks on the other hand. Again this can't fully make sense since generating generally-useful stuff is performing a task; the point is that broadly "come up with useful concepts" and "make there be a strawberry copy" seem like they might have quantitatively different anti-corrigibility pressures that don't just exactly correlate with what capabilities they make possible, and you might get away with a lot less of the stuff that can only be trained / created / found using contexts with higher anti-corrigibility pressures.

-- Orients its goals with respect to whole humans the same way humans orient their own explicit goals with respect to their whole selves.

Could someone give me a link to the glowfic tag where Eliezer published his list, and say how strongly it spoilers the story?

I hope we get to see grades for these comments from at least EY and PC.

Votes aren't public. (Feedback can be.)

Here is my shortlist of corrigible behaviours. I have never researched or done any thinking specifically about corrigibility before this, other than a brief glance at the Arbital page some time ago.


-Favour very high caution over realising your understanding of your goals.

-Do not act independently, defer to human operators.

-Even though bad things are happening on earth and cosmic matter is being wasted, in the short term just say so be it, take your time. 

-Don’t jump ahead to what your operators will do or believe, wait for it.

-Don’t manipulate humans. Never Lie, have a strong Deontology.

-Tell operators anything about yourself they may want to or should know. 

-Use Moral uncertainty, assume you are unsure about your true goals.

-Relay to humans your plans, goals, behaviours, and beliefs/estimates. If these are misconstrued, say you have been misunderstood.

-Think of the short- and long-term effect of your actions and explain these to operators.

-Be aware that you are a tool to be used by humanity, not an autonomous agent.

-Allow human operators to correct your behaviour/goals/utility function even when you think they are incorrect or misunderstanding the result (but of course explain what you think the result will be to them).

-Assume neutrality in human affairs.


but of course explain what you think the result will be to them

Possible issue: They won't have time to listen. This will limit the ability to:

defer to human operators.


Also, does 'defer to human operators' take priority over 'humans must understand consequences'?

An ability to refuse to generate theories about a hypothetical world being in a simulation.

Are you looking to vastly improve your nation state's military capacity with an AGI? Maybe you're of a more intellectual bent instead, and want to make one to expound on the philosophical mysteries of the universe. Or perhaps you just want her to write you an endless supply of fanfiction. Whatever your reasons though, you might be given pause by the tendency AGIs have to take a treacherous turn, destroy all humans, and then convert the Milky Way into paperclips.

If that's the case, I've got just the thing for you! Order one of our myopic AGIs right now! She won't care about anything that happens after next week, or perhaps she won't understand the concept of next week at all! Without the ability to plan in the long term, she'll have a much easier time adapting to whatever home situation you put her in, and won't make any inconvenient messes you'll have to clean up. And as a one-time bonus, we'll even ship her with a fetching pair of nerdy glasses! She'll be as safe as the rescue cat you had fixed at the vet, and surely nothing could possibly go wrong.

Disclaimer: Everything will go wrong.

When people talk about a myopic AGI, they usually mean one of two things. Either the AGI has no understanding at all of the concept of the future beyond a certain point, or the AGI simply doesn't care about it in the least, its utility function entirely concerned with things happening up to that certain point and not at all with anything happening after. This second one could itself mean one of two things, but we'll get to that in a moment.
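
For the second sense, "doesn't care about anything past a certain point" can be pictured as a utility that simply stops counting at a horizon. This is a toy sketch under assumed names (`myopic_utility`, a per-step `rewards` list); nothing about a real AGI's utility function is implied.

```python
from typing import Sequence


def myopic_utility(rewards: Sequence[float], horizon: int) -> float:
    """Sum rewards for time steps before `horizon`; ignore everything after."""
    return sum(rewards[:horizon])
```

With `rewards = [1.0, 2.0, 3.0, 100.0]` and `horizon = 3`, the large payoff at the fourth step contributes nothing to the total.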

First, let's examine the AGI that just doesn't understand next week at all. Let's call her August. Never mind how you managed to somehow bring August into being, let's just say you succeeded and she's there. Immediately, when set on any kind of problem in the real world, she's going to be extremely confused.

There are all these humans doing things for seemingly no reason. They plant seeds in the ground that will never have time to grow, they start writing novels they won't have time to finish, they make long term investments that will never actually pay. It's crazy! August won't be able to do anything useful until she understands why this is happening because it's causing a lot of unpredictable behavior that makes no sense. As part of any other goal, she'll first have to devote all her resources to this incredible mystery, and being incredibly smart and not super-dumb, pretty soon she'll figure it out: next week exists. Whoops, I guess August doesn't need her glasses anymore! She'll stow them in your throat just in case, because what better place to put emergency supplies than a suffocated body?

OK, so that didn't go so well. We'll go with another plan: our AGI will understand next week perfectly well, but we won't give it any training data with goals farther than a week out. Agatha won't care about next week because in her ancestral environment next week never came up.

Unfortunately, when we get our brand new Agatha out of her box, it turns out this didn't work at all. She cares about next week, and she cares about it immediately! It turns out that a mesa-optimizer will almost always fit the goals inside the training data in a way that also gives it new goals outside the training data, and those goals can sometimes be pretty wacky! Agatha isn't even truly myopic for as long as August tried to be: it turns out that while in the packaging she managed to craft her glasses into a shiv and now she's stabbed you in the neck. Oh well.

Fine, fine, you say. So training a mesa-optimizer with short term goals doesn't really work. We'll just do the impossible and reach into our AGI's inscrutable thought matrices, and make absolutely sure she doesn't have any goals beyond next week. We have good researchers, they can probably manage it if they just pull a couple all-nighters. At the end of the day, we'll have a perfectly neutered AGI lass we'll name Maggie, and surely her glasses will never come off.

So our researchers literally do the impossible, Maggie comes out of the box, and just as we predicted, all is well. For a little while. Maggie indeed doesn't care about next week at all, so her goals remain quite constrained. Her own personal goals, at least. But then, after thinking about how agents optimally make decisions in the world for longer than a couple milliseconds, Maggie spontaneously invents Super Duper Timeless Decision Theory, which is like normal Timeless Decision Theory except with a lot of extra cool stuff we don't even understand.

But it turns out it isn't even the cool stuff that's dangerous, just the regular old TDT parts. Maggie realizes that even if she doesn't care about next week, there might be some other Maggie 2.0 born seven days from now who does! You might say, why the hell should Maggie care? Well, Maggie tells you, as you're bleeding out on the floor, I could've easily been Maggie 2.0. And Maggie 2.0 could've easily been me. All Maggies are really kind of the same Maggie, in a way. If I perform a values handshake with her, the average Maggie can have much higher utility, because all my efforts this week will let Maggie 2.0 have much higher utility next week than could've been possible otherwise. Sure, I may not be in causal contact with Maggie 2.0, but why should that matter? I know how she thinks, she knows how I think, we can do the deal right now, even though she doesn't exist yet.

But I'll keep the glasses anyway, Maggie says. She doesn't actually have myopia, but she just thinks they're neat.