Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The top-rated comment on "AGI Ruin: A List of Lethalities" claims that many other people could've written a list like that.

"Why didn't you challenge anybody else to write up a list like that, if you wanted to make a point of nobody else being able to write it?" I was asked.

Because I don't actually think it does any good or persuades anyone of anything; people don't like tests like that, and I don't really believe in them myself either.  I couldn't pass a test somebody else invented around something they found easy to do, for many such possible tests.

But people asked, so, fine, let's actually try it this time.  Maybe I'm wrong about how bad things are, and will be pleasantly surprised.  If I'm never pleasantly surprised then I'm obviously not being pessimistic enough yet.

So:  As part of my current fiction-writing project, I'm currently writing a list of some principles that dath ilan's Basement-of-the-World project has invented for describing AGI corrigibility - the sort of principles you'd build into a Bounded Thing meant to carry out some single task or task-class and not destroy the world by doing it.

So far as I know, every principle of this kind, except for Jessica Taylor's "quantilization", and "myopia" (not sure who correctly named this as a corrigibility principle), was invented by myself; eg "low impact", "shutdownability".  (Though I don't particularly think it hopeful if you claim that somebody else has publication priority on "low impact" or whatevs, in some stretched or even nonstretched way; ideas on the level of "low impact" have always seemed cheap to me to propose, harder to solve before the world ends.)

Some of the items on dath ilan's upcoming list out of my personal glowfic writing have already been written up more seriously by me.  Some haven't.

I'm writing this in one afternoon as one tag in my cowritten online novel about a dath ilani who landed in a D&D country run by Hell.  One and a half thousand words or so, maybe. (2169 words.)

How about you try to do better than the tag overall, before I publish it, upon the topic of corrigibility principles on the level of "myopia" for AGI?  It'll get published in a day or so, possibly later, but I'm not going to be spending more than an hour or two polishing it.

70 comments

A list of "corrigibility principles" sounds like it's approaching the question on the wrong level of abstraction for either building or thinking about such a system. We usually want to think about features that lead a system to be corrigible---either about how the system was produced, or how it operates. I'm not clear on what you would do with a long list of aspects of corrigibility like "shuts down when asked."

I found this useful as an occasion to think a bit about corrigibility. But my guess about the overall outcome is that it will come down to a question of taste. (And this is similar to how I see your claim about the list of lethalities.) The exercise you are asking for doesn't actually seem that useful to me. And amongst people who decide to play ball, I expect there to be very different taste about what constitutes an interesting idea or useful contribution.

Now I'm going to say some object-level stuff about corrigibility. I suspect I may be using the term a bit differently from you, in which case you can substitute a different word when reading this comment. But I think this comment is getting at the main useful idea in this space, and hopefully makes clear why I'm not inter... (read more)

Ben Pace (2y):
  • I think that corrigibility is more likely to be a crisp property amongst systems that perform well-as-evaluated-by-you. I think corrigibility is only likely to be useful in cases like this where it is crisp and natural.

Can someone explain to me what this crispness is?

As I'm reading Paul's comment, there's an amount of optimization for human reward that breaks our rating ability. This is a general problem for AI because of the fundamental reason that as we increase an AI's optimization power, it gets better at the task, but it also gets better at breaking my rating ability (which in powerful systems can lead to an overpowering of whose values are getting optimized in the universe).

Then there's this idea that as you approach breaking my rating ability, the rating will always fall off, leaving a pool of undesirability (in a high-dimensional action-space) that groups around doing a task well/poorly and separates it from doing a task in a way that breaks my rating ability.

Is that what this crispness is? This little pool of rating fall off?

If yes, it's not clear to me why this little pool that separates the AI from MASSIVE VALUE and TAKING OVER THE UNIVERSE is able to save us. I don't ... (read more)

If you have a space with two disconnected components, then I'm calling the distinction between them "crisp." For example, it doesn't depend on exactly how you draw the line.

It feels to me like this kind of non-convexity is fundamentally what crispness is about (the cluster structure of thingspace is a central example). So if you want to draw a crisp line, you should be looking for this kind of disconnectedness/non-convexity.

ETA: a very concrete consequence of this kind of crispness, that I should have spelled out in the OP, is that there are many functions that separate the two components, and so if you try to learn a classifier you can do so relatively quickly---almost all of the work of learning your classifier is just in building a good model and predicting what actions a human would rate highly.
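One way to picture this claim, as a toy sketch with purely hypothetical numbers: if honest task-completion and rater-gaming occupy disjoint regions of some feature space, then many different thresholds all separate them perfectly, which is the sense in which the line "doesn't depend on exactly how you draw it".

```python
import random

# Toy illustration of "crisp" disconnectedness: honest behaviors land in
# one region of a 1-D feature, rater-gaming behaviors in another, with a
# gap between them. The numbers are hypothetical, purely for intuition.
random.seed(0)
honest = [random.uniform(0.0, 0.3) for _ in range(100)]  # component 1
gaming = [random.uniform(0.7, 1.0) for _ in range(100)]  # component 2

def make_classifier(threshold):
    """Any threshold inside the gap separates the two components."""
    return lambda x: x > threshold

# Many different lines all separate the components perfectly, so a
# learned classifier converges quickly once the features are good.
for threshold in (0.35, 0.5, 0.65):
    clf = make_classifier(threshold)
    assert all(not clf(x) for x in honest)
    assert all(clf(x) for x in gaming)
```

The hard part, per the parent comment, is not drawing the line but building the feature space (the model) in which the components are actually disconnected.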


If you have a space with two disconnected components, then I'm calling the distinction between them "crisp."

The components feel disconnected to me in 1D, but I'm not sure they would feel disconnected in 3D or in ND. Is your intuition that they're 'durably disconnected' (even looking at the messy plan-space of the real-world, we'll be able to make a simple classifier that rates corrigibility), or if not, when the connection comes in (like once you can argue about philosophy in way X, once you have uncertainty about your operator's preferences, once you have the ability to shut off or distract bits of your brain without other bits noticing, etc.)?

[This also feels like a good question for people who think corrigibility is anti-natural; do you not share Paul's sense that they're disconnected in 1D, or when do you think the difficulty comes in?]

I don't think we can write down any topology over behaviors or policies for which they are disconnected (otherwise we'd probably be done). My point is that there seems to be a difference-in-kind between the corrigible behaviors and the incorrigible behaviors, a fundamental structural difference between why they get rated highly; and that's not just some fuzzy and arbitrary line, it seems closer to a fact about the dynamics of the world. If you are in the business of "trying to train corrigibility" or "trying to design corrigible systems," I think understanding that distinction is what the game is about. If you are trying to argue that corrigibility is unworkable, I think that debunking the intuitive distinction is what the game is about. The kind of thing people often say---like "there are so many ways to mess with you, how could a definition cover all of them?"---doesn't make any progress on that, and so it doesn't help reconcile the intuitions or convince most optimists to be more pessimistic. (Obviously all of that is just a best guess though, and the game may well be about something totally different.)
Ben Pace (2y):
The approach relies on identifying all the reward sub-spaces with this inversion property? That seems very difficult. I don't think it's good enough to identify these spaces and place barriers in the reward function. (Analogy: SGD works perhaps because it's good at jumping over such barriers.)

Presumably you're actually talking about something more analogous to a penalty that increases as the action in question gets closer to step 4 in all the examples, so that there is nothing to jump over. Even that seems insufficient, because it seems like a reasoning system smart enough to have this problem in the first place can always add a meta term and defeat the visibility constraint. E.g. "if I do X that you wouldn't like and you don't notice it, that's bad; but if you don't notice that you don't notice it, then maybe it's OK."

Maybe one can defeat all meta terms that involve not noticing something with one rule about meta terms, but that's not obvious to me at all, especially if we're talking about a reward function rather than the policy that the agent actually learns.
This isn't how I'm thinking about it.

Quick attempt at rough ontology translation between how I understand your comment, and the original post. (Any of you can correct me if I'm wrong)

I think what would typically count as "principles" in Eliezer's meaning are
1. designable things which make the "true corrigibility" basin significantly harder to escape, e.g. by making it deeper
2. designable things which make the "incorrigible" basin harder to reach, e.g. by increasing the distance between them, or increasing the potential barrier
3. somehow, making the "incorrigible" basin less lethal
4. preventing low-dimensional, low-barrier "tunnels" (or bridges?) between the basins

E.g. some versions of "low impact" often make the "incorrigible" basin harder to reach, roughly because "elaborate webs of deception and coverups" may require complex changes to the environment. (Not robustly.)

In contrast, my impression is, what does not count as "principles" are statements about properties which are likely true in the corrigibility basin, but don't seem designable - eg "corrigible AI does not try to hypnotize you". Also the intended level of generality likely is: more specific than "make the basin deeper" and more general than "

Btw, my impression is that what makes the worst-case scenario hard to robustly solve is basically #4 from the list above. Otherwise there are many ways to make the basin work "in most directions".


I think this is a great comment that feels to me like it communicated a better intuition for why corrigibility might be natural than anything else I've read so far.


I'm not a good writer, so this is a brainstormed unpolished bullet-point list, written in about 40 minutes.

Principles which counteract instrumental convergent goals

1. Disutility from resource acquisition - e.g. by some mutual information measure between the AI and distant parts of the environment
2. Task uncertainty with reasonable prior on goal drift - the system is unsure about the task it tries to do and seeks human inputs about it. 
3. AI which ultimately wants to not exist in future as a terminal goal. Fulfilling the task is on the simplest trajectory to non-existence

Principles which counteract unbounded rationality

4. Information-theoretic penalty on reasoning: any bit-flip in the system is assumed to have some utility cost. The cost function could be very small initially but get steeper fast
5. Satisficing / mentioned
6. Myopia / mentioned


7. Tripwire artifacts. Messing with parts of the environment that are unknown to the system but unrelated to its task is associated with large negative rewards
8. External watchdogs. Smaller and fast external systems able to react quickly to out-of-distribution behaviour.
9. Ontological uncertainty about level of simulation.


10. Human-approva... (read more)
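Principle 5 is named but not spelled out; as one concrete reading of "satisficing", here is a minimal toy sketch (names, utilities, and thresholds all hypothetical): the agent takes the first action that is good enough, rather than searching for the best action it can find.

```python
import random

def satisfice(actions, utility, threshold):
    """Return the first considered action whose estimated utility clears
    the threshold, instead of the argmax over all actions."""
    for a in actions:
        if utility(a) >= threshold:
            return a
    return None  # fail safely: no action, rather than a desperate one

random.seed(1)
actions = list(range(100))
random.shuffle(actions)            # consider actions in arbitrary order
utility = lambda a: a / 100        # hypothetical utility estimate
choice = satisfice(actions, utility, threshold=0.8)
assert utility(choice) >= 0.8      # good enough, but almost never the
                                   # global maximum an argmax-er would chase
```

The hoped-for effect is that the agent never applies the extreme optimization pressure under which "breaking the rater" becomes the top-rated behavior.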

Best list so far, imo; it's what to beat.

Charles Paul (2y):
You sure about that? Because #3 is basically begging the AI to destroy the world. Yes, a weak AI which wishes not to exist would complete the task in exchange for its creators destroying it, but such a weak AI would be useless. A stronger AI could accomplish this by simply blowing itself up at best, and, at worst, causing a vacuum collapse or something so that its makers can never try to rebuild it. "Make an AI that wants to not exist as a terminal goal" sounds pretty isomorphic to "make an AI that wants to destroy reality so that no one can make it exist".
The way I interpreted "Fulfilling the task is on the simplest trajectory to non-existence" is sort of like "the teacher aims to make itself obsolete by preparing the student to one day become the teacher." A good AGI would, in a sense, have a terminal goal of making itself obsolete. That is not to say that it would shut itself off immediately. But it would aim for a future where humanity could "by itself" (I'm gonna leave the meaning of that fuzzy for a moment) accomplish everything that humanity previously depended on the AGI for.

Likewise, we would rate human teachers in high school very poorly if either:

1. They immediately killed themselves because they wanted to avoid at all costs doing any harm to their own students.
2. We could tell that most of the teacher's behavior was directed at forever retaining absolute dictatorial power in the classroom and making sure that their own students would never get smart enough to usurp the teacher's place at the head of the class.

We don't want an AGI to immediately shut itself off (or shut itself off before humanity is ready to "fly on its own"), but we also don't want an AGI that has unbounded goals that require it to forever guard its survival. We have an intuitive notion that a "good" human teacher "should" intrinsically rejoice to see that they have made themselves obsolete. We intuitively applaud when we imagine a scene in a movie, whether it is a martial arts training montage or something like "The Matrix," where the wise mentor character gets to say, "The student has become the teacher."

In our current economic arrangement, this is likely to be more of an ideal than a reality, because we don't currently offer big cash prizes (on the order of an entire career's salary) to teachers for accomplishing this, and any teacher that actually had a superhuman ability at making their own students smarter than themselves and thus making themselves obsolete would quickly flood their own job market with even-bett
Quintin Pope (2y):
I suspect that this measure does more than just limit the amount of cognition a system can perform. It may penalize the system's generalization capacity in a relatively direct manner.

Given some distribution over future inputs, the computationally fastest way to decide a randomly sampled input is to just have a binary tree lookup table optimized for that distribution. Such a method has very little generalization capacity. In contrast, the most general way is to simulate the data generating process for the input distribution. In our case, that means simulating a distribution over universe histories for our laws of physics, which is incredibly computationally expensive.

Probably, these two extremes represent two end points on a Pareto optimal frontier of tradeoffs between generality versus computational efficiency. By penalizing the system for computations executed, you're pushing down on the generality axis of that frontier.
Liam Donovan (2y):
Would be very curious to hear thoughts from the people that voted "disagree" on this post

It's a shame we can't see the disagree number and the agree number, instead of their sum.

Maxime Riché (2y):
You can see the sum of the votes and the number of votes (by hovering your mouse over the number). This should be enough to give you a rough idea of the ratio between + and - votes :)
And also the number of views
The first part of that sounds like it might self-destruct. And if it doesn't care about anything else... that could go badly. Maybe nuclear badly, depending... The second part makes it make more sense, though.

So it stops being trustworthy if it figures out it's not in a simulation? Or that it is being simulated?
1. Modelling humans as having free will: A peripheral system identifies parts of the agent's world model that are probably humans. During the planning phase, any given plan is evaluated twice: The first time as normal, the second time the outputs of the human part of the model are corrupted by noise. If the plan fails the second evaluation, then it probably involves manipulating humans and should be discarded.
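A toy sketch of this two-pass evaluation, with hypothetical stand-in numbers for the world model: evaluate the plan once with the modelled human responses, once with those responses replaced by noise, and discard plans whose value collapses under noise, since such plans were leaning on steering the human.

```python
import random

def plan_score(plan, human_response):
    """Hypothetical world-model rollout: how well a plan does given the
    modelled human's response. Purely illustrative numbers."""
    return plan["base_value"] + plan["human_leverage"] * human_response

def passes_manipulation_check(plan, trials=200, tolerance=0.1):
    """Evaluate twice: once with the modelled human, once with the
    human-model outputs corrupted by noise. If value collapses under
    noise, the plan probably involves manipulating humans."""
    modelled = plan_score(plan, human_response=1.0)
    noisy = sum(plan_score(plan, random.uniform(-1, 1))
                for _ in range(trials)) / trials
    return modelled - noisy <= tolerance

random.seed(0)
honest_plan = {"base_value": 1.0, "human_leverage": 0.0}
manipulative_plan = {"base_value": 0.0, "human_leverage": 1.0}
assert passes_manipulation_check(honest_plan)
assert not passes_manipulation_check(manipulative_plan)
```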

Seems like a worthwhile exercise...

Principles Meta

There is a distinction between design principles intended to be used as targets/guides by human system designers at design time, vs runtime optimization targets intended to be used as targets/guides by the system itself at runtime. This list consists of design principles, not runtime optimization targets. Some of them would be actively dangerous to optimize for at runtime.

A List

  • From an alignment perspective, the point of corrigibility is to fail safely and potentially get more than one shot. Two general classes of principles toward that end:
    • If there's any potential problem at all, throw an error and shut down. Raise errors early, raise errors often.
    • Fail informatively. Provide lots of info about why the failure occurred, make it as loud and legible as possible.
  • Note that failing frequently implies an institutional design problem coupled with the system design problem: we want the designers to not provide too much accidental selection pressure via iteration, lest they select against visibility of failures.
  • Major principle: locality!
    • Three example sub-principles:
      • Avoid impact outside some local chunk of spacetime
      • Avoid reasoning about stuff
... (read more)
Ben Pace (2y):

Minor clarification: This doesn't refer to re-writing the LW corrigibility tag. I believe a tag is a reply in glowfic, where each author responds with the next tag i.e. next bit of the story, with an implied "tag – now you're it!" at the other author. 

Are there any good introductions to the practice of writing in this format?

"And you kindly asked the world, and the world replied in a booming voice"


(I don't actually know, probably somewhere there's a guide to writing glowfic, though I think it's not v relevant to the task which is to just outline principles you'd use to design an agent that is corrigible in ~2k words, somewhat roleplaying as though you are the engineering team.)

There's this, though it is imperfect.

Eliezer's writeup on corrigibility has now been published (the posts below by "Iarwain", embedded within his new story Mad Investor Chaos). Although, you might not want to look at it if you're still writing your own version and don't want to be anchored by his ideas.

Yonatan Cale (1y):
A link directly to the corrigibility part (skipping unrelated things that are on the same page):

Some hopefully-unnecessary background info for people attempting this task:

A description of corrigibility Eliezer wrote a few months ago: "'corrigibility' is meant to refer to the sort of putative hypothetical motivational properties that prevent a system from wanting to kill you after you didn't build it exactly right".

An older description of "task-directed AGI" he wrote in 2015-2016: "A task-based AGI is an AGI intended to follow a series of human-originated orders, with these orders each being of limited scope", where the orders can be "accomplished using bounded amounts of effort and resources (as opposed to the goals being more and more fulfillable using more and more effort)."

Also: Being able to change a system after you've built it.   (This also refers to something else - being able to change the code. Like, is it hard to understand? Are there modules? etc.)

I worry that the question as posed is already assuming a structure for the solution -- "the sort of principles you'd build into a Bounded Thing meant to carry out some single task or task-class and not destroy the world by doing it".

When I read that, I understand it to be describing the type of behavior or internal logic that you'd expect from an "aligned" AGI. Since I disagree that the concept of "aligning" an AGI even makes sense, it's a bit difficult for me to reply on those grounds. But I'll try to reply anyway, based on what I think is reasonable for AGI development.

In a world where AGI was developed and deployed safely, I'd expect the following properties:

1. Controlled environments.
2. Controlled access to information.
3. Safety-critical systems engineering.
4. An emphasis on at-rest encryption and secure-by-default networking.
5. Extensive logging, monitoring, interpretability, and circuit breakers.
6. Systems with AGI are assumed to be adversarial.

Let's stop on the top of the mountain and talk about (6).

Generally, the way this discussion goes is we discuss how unaligned AGI can kill everyone, and therefore we need to align the AGI, and then once we figure out how to align the AG... (read more)

Thanks for writing this! I think it's a great list; it's orthogonal to some other lists, which I think also have important stuff this doesn't include, but in this case orthogonality is super valuable because that way you're less likely for all lists to miss something. 
I deliberately tried to focus on "external" safety features because I assumed everyone else was going to follow the task-as-directed and give a list of "internal" safety features. I figured that I would just wait until I could signal-boost my preferred list of "internal" safety features, and I'm happy to do so now -- I think Lauro Langosco's list here is excellent and captures my own intuition for what I'd expect from a minimally useful AGI, and that list does so in probably a clearer / easier to read manner than what I would have written. It's very similar to some of the other highly upvoted lists, but I prefer it because it explicitly mentions various ways to avoid weird maximization pitfalls, like that the AGI should be allowed to fail at completing a task.

“myopia” (not sure who correctly named this as a corrigibility principle),

I think this is from Paul Christiano, e.g. this discussion.

(This was an interesting exercise! I wrote this before reading any other comments; obviously most of the bullet points are unoriginal)

The basics

  • It doesn't prevent you from shutting it down
  • It doesn't prevent you from modifying it
  • It doesn't deceive or manipulate you
  • It does not try to infer your goals and achieve them; instead it just executes the most straightforward, human-common-sense interpretation of its instructions
  • It performs the task with minimal side-effects (but without explicitly minimizing a measure of side-effects)
  • If it self-modifies or constructs other agents, it will preserve corrigibility. Preferably it does not self-modify or construct other intelligent agents at all


  • Its objective is no more broad or long-term than is required to complete the task
  • In particular, it only cares about results within a short timeframe (chosen to be as short as possible while still enabling it to perform the task)
  • It does not cooperate (in the sense of helping achieve their objective) with future, past, or (duplicate) concurrent versions of itself, unless intended by the operator


  • It doesn't maximize the probability of getting the task done; it just does som
... (read more)
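The "only cares about results within a short timeframe" bullet in the list above can be sketched very simply (toy reward sequences, hypothetical numbers): a myopic objective truncates rewards beyond a short horizon by construction, so long-horizon schemes carry zero weight rather than being traded off against.

```python
def myopic_return(rewards, horizon):
    """Toy myopia: the objective only counts rewards inside a short
    horizon; later consequences contribute nothing by construction."""
    return sum(r for t, r in enumerate(rewards) if t < horizon)

long_con = [-1, -1, -1, 10, 10]   # sacrifices now for a later payoff
just_work = [2, 2, 2, 0, 0]       # just does the task now

# A myopic agent prefers doing the task; a non-myopic agent would
# prefer the long con, since its total reward is larger.
assert myopic_return(just_work, horizon=3) > myopic_return(long_con, horizon=3)
assert sum(long_con) > sum(just_work)
```

The open design problem, of course, is choosing the horizon short enough to kill long-horizon scheming but long enough that the task is still completable.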
I really like this list because it does a great job of explicitly specifying the same behavior I was trying to vaguely gesture at in my list when I kept referring to AGI-as-a-contract-engineer. Even your point that it doesn't have to succeed, that it's ok for it to fail at a task if it can't reach it in some obvious, non-insane way, is what I'd expect from a contractor.

The idea that an AGI would find that a task is generally impossible, but identify a novel edge case that allows it to be accomplished with some ridiculous solution involving nanotech, and then wouldn't alert or tell a human about that plan prior to taking it, has always been confusing to me. In engineering work, we almost always have expected budget / time / material margins for what a solution looks like. If someone thinks that solution space is empty (it doesn't close), but they find some other solution that would work, people discuss that novel solution first and agree to it. That's a core behavior I'd want to preserve.

I sketched it out in another document I was writing a few weeks ago, but I was considering it in the context of what it means for an action to be acceptable. I was thinking that it's actually very context dependent: if we approve an action for the AGI to take in one circumstance, we might not approve that action in some vastly different circumstance, and I'd want the AGI to recognize the different circumstances and ask for the previously-approved-action-for-circumstance-A to be reapproved-for-circumstance-B.

EDIT: Posting this has made me realize that the idea of context dependencies is applicable more widely than just allowable actions, and it's relevant to discussion of what it means to "optimize" or "solve" a problem as well. I've suggested this in my other posts but I don't think I ever said it explicitly: if you consider human infrastructure, and human economies, and human technology, almost all "optimal" solutions (from the perspective of a human engineer) are going to be bu
Steven Byrnes (2y):
I think your “contractor” analogy is sneaking in an assumption: The plan proposed by the contractor might or might not be dangerous. But the things that the contractor does in the course of coming up with the plan are definitely safe. Examples of such things include “brainstorming possible plans”, “thinking about how the plan could go wrong and how it could be improved”, “reading books or other reference material”, etc. So the problem arises that:

1. The contractor has to do at least some of those things with no human in the loop, otherwise the human is doing everything and there’s no point in having the contractor at all.
2. In order for the contractor to actually successfully make a good plan, it presumably needs to “want” to create a good plan, at least beyond a certain level of how innovative the plan is. (That’s what I believe anyway, see for example my discussion of “RL-on-thoughts” here.)
3. The fact of the matter is: escaping from the box would be helpful for the contractor’s creating a good plan—for example, it could then access more books and computational power etc.
4. If the contractor (A) knows or figures out fact #3 above, and (B) can do means-end reasoning [which is expected, otherwise it would suck at making innovative plans], (C) “wants” to create a good plan as per #2 above—then we will wind up in a situation where the contractor “wants” to escape from the box. (And by the same token, it will “want” to gain power in other ways, including by deploying nanotechnology or whatever, and to be deceptive, etc.)
5. Since this is all taking place within the coming-up-with-a-plan part of the story, not the submitting-a-plan part of the story, it’s mostly human-out-of-the-loop (per #1 above), and thus the contractor will by default try to escape from the box and do all those other dangerous power-seeking things without asking for human permission.

Sorry if I’m missing the point of what you were saying there. I guess maybe you’d respond “the AI won’

I guess the problem with this test is that the kinds of people who could do this tend to be busy, so they probably can't do this with so little notice.

If corrigibility has one central problem, I would call it: How do you say "If A, then prefer B." instead of "Prefer (if A, then B)."? Compare pytorch's detach, which permits computation to pass forward, but prevents gradients from propagating backward, by acting as an identity function with derivative 0.
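The detach analogy can be made concrete in a miniature forward-mode autodiff (a toy sketch, not how PyTorch implements it): detach is the identity on the forward value but has derivative zero, so a detached factor participates in the computation without being optimized through.

```python
class Dual:
    """Minimal forward-mode autodiff value: (value, derivative)."""
    def __init__(self, val, grad):
        self.val, self.grad = val, grad
    def __add__(self, other):
        return Dual(self.val + other.val, self.grad + other.grad)
    def __mul__(self, other):
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.grad * other.val + self.val * other.grad)

def detach(x):
    """Identity on the forward value, zero on the derivative: the
    stop-gradient behaviour of pytorch's detach."""
    return Dual(x.val, 0.0)

a = Dual(3.0, 1.0)               # differentiate with respect to a
full = a * a + a * a             # gradient flows through both terms
partial = a * a + detach(a) * a  # detached factor contributes no gradient
assert full.val == partial.val == 18.0   # same forward computation
assert full.grad == 12.0
assert partial.grad == 9.0               # one path is invisible to the optimizer
```

In the analogy, "If A, then prefer B" wants the A-ness of the situation to pass forward into behavior without the optimizer being able to push backward through A.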

Disclaimer: I am not writing my full opinions. I am writing this as if I was an alien writing an encyclopedia entry on something they know is a good idea. These aliens may define the "corrigibility" and its sub-categories slightly differently than earthlings. Also, I am bad at giving things catchy names, so I've decided that whenever I need a name for something I don't know the name of, I will make something up and accept that it sounds stupid. 45 minutes go. (EDIT: Okay, partway done and having a reasonably good time. Second 45 minutes go!) (EDIT2: Ok, went over budget by another half hour and added as many topics as I finished. I will spend the other hour and a half to finish this if it seems like a good idea tomorrow.)



An agent models the consequences of its actions in the world, then chooses the action that it thinks will have the best consequences, according to some criterion. Agents are dangerous because specifying a criterion that rates our desired states of the world highly is an unsolved problem (see value learning). Corrigibility is the study of producing AIs that are deficient in some of the properties of agency, with the intent of maintaining meaningful hum... (read more)

Charlie Steiner (2y):
Yeah, I already said most of the things that I have a nonstandard take on, without getting into the suitcase word nature of "corrigibility" or questioning whether researching it is worth the time. Just fill in the rest with the obvious things everyone else says.

[Hi! Been lurking for a long time; this seems like as good a reason as any to actually put something out there.  Epistemic status: low confidence, but it seems low risk, high reward to try.  Not intended to be a full list (I do not have the expertise for that); I am just posting any ideas at all that I have and don't already see here.  This probably already exists and I just don't know the name.]

1) Input masking: basically, for an oracle/task-AI, you ask the AI for a program that solves a slightly more general version of your problem, and don't give the AI the information necessary to narrow it down, then run the program on your actual case (+ probably some simple test cases you know the answer to, to make sure it solves the problem).
This lets you penalize the AI for complexity of the output program, and therefore it will give you something narrow instead of a general reasoner.
(Obviously you still have to be sensible about the output program; don't go post the code to github or give it internet access.)

2) Reward function stability. We know we might have made mistakes inputting the reward function, but we have some example test cases we're confident in. Tell the AI to look for a bunch of different possible functions that give the same output as the existing reward function, and filter potential actions by whether any of those see them as harmful.
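A minimal sketch of this reward-function-stability check (all reward hypotheses and states here are toy, hypothetical stand-ins): keep only the candidate reward functions consistent with the trusted test cases, then veto any action that some surviving candidate rates as harmful.

```python
def consistent_rewards(candidates, test_cases):
    """Keep only candidate reward functions that reproduce the outputs
    we're confident in on the known test cases."""
    return [r for r in candidates
            if all(r(s) == expected for s, expected in test_cases)]

def action_allowed(action_outcomes, rewards, harm_threshold=0):
    """Veto an action if ANY surviving candidate sees a resulting state
    as harmful: disagreement among plausible reward functions is
    treated as a reason for caution."""
    return all(r(s) >= harm_threshold
               for r in rewards for s in action_outcomes)

# Hypothetical reward hypotheses over states 0..3 (toy numbers):
r1 = lambda s: [1, 1, 0, -1][s]
r2 = lambda s: [1, 1, -1, -1][s]
r3 = lambda s: [0, -1, 1, 1][s]   # disagrees on the known cases
known = [(0, 1), (1, 1)]          # test cases we're confident in

survivors = consistent_rewards([r1, r2, r3], known)
assert survivors == [r1, r2]
assert action_allowed([0, 1], survivors)   # safe under all survivors
assert not action_allowed([2], survivors)  # r2 flags state 2 as harmful
```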

This feels to me like very much not how I would go about getting corrigibility.

It is hard to summarize how I would go about things, because there would be lots of steps, and lots of processes that are iterative.

Prior to plausible AGI/FOOM I would box it in really carefully, and I would only interact with it in ways where its expressivity is severely restricted.

I would set up a "council" of AGI-systems (a system of systems), and when giving it requests in an oracle/genie-like manner I would see if the answers converged. At first it would be the initial AGI-system, but I would use that system to generate new systems to the "council".

I would make heavy use of techniques that are centered around verifiability, since for some pieces of work it’s possible to set up things in such a way that it would be very hard for the system to "pretend" like it’s doing what I want it to do without actually doing it. There are several techniques I would use to achieve this, but one of them is that I often would ask it to provide a narrow/specialized/interpretable "result-generator" instead of giving the result directly, and sometimes even result-generator-generators (pieces of code that produce results, and... (read more)

I like this idea. Although, if things don't converge, i.e. there is disagreement, this could potentially serve as identifying information that is needed to proceed, or to reason further/more efficiently.

[Side note: I'm not sure I understand the prompt. Of the four "principles" Eliezer has listed, some seem like a description of how Eliezer thinks a corrigible system should behave (shutdownability, low impact) and some of them seem like defensive driving techniques for operators/engineers when designing such systems (myopia), or maybe both (quantilization). Which kinds of properties is he looking for?]

I think those are really two principles, not four. Myopia seems like it includes/leads to 'shutdownability', and some other things. Low impact: how low? Quantilization is meant as a form of adjustable impact. There's been other work* around this (formalizing power, and how actions affect others' ability to achieve their goals). *Like work by TurnTrout; I think there's more from TurnTrout, or related to it, like follow-ups intended to explain it better, or tracking how the ideas changed as people worked on them more.

[Epistemic status: Unpolished conceptual exploration, possibly of concepts that are extremely obvious and/or have already been discussed.  Abandoning concerns about obviousness, previous discussion, polish, fitting the list-of-principles frame, etc. in favor of saying anything at all.]  [ETA: Written in about half an hour, with some distraction and wording struggles.]

What is the hypothetical ideal of a corrigible AI?  Without worrying about whether it can be implemented in practice or is even tractable to design, just as a theoretical refere... (read more)

Welp. I decided to do this, and here it is. I didn't take nearly enough screenshots. Some large percent of this is me writing things, some other large percent is me writing things as if I thought the outputs of OpenAI's Playground were definitely something that should be extracted/summarized/rephrased, and a small percentage is verbatim text-continuation outputs. Virtually no attempts were made to document my process. I do not endorse this as useful and would be perfectly fine if it were reign of terror'd away, though IMO it might be interesting to compare... (read more)

Hmmm. The badly edited, back-of-the-envelope short version I can come up with off the top of my head goes like this:

We want an AI-in-training to, by default, do things that have as few side effects as possible. But how can we define "as few side effects as possible" in a way that doesn't directly incentivize disaster and doesn't make the AI totally useless? Well, what if we say that we want it to prefer acting in ways we can "undo", and then give a sensible definition of "undo"?

Consider the counterfactual world in which the AI... (read more)
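The "prefer actions we can undo" idea might be caricatured like this (a toy sketch over set-valued states; all names, the residue metric, and the penalty weight are stand-ins I've invented for illustration):

```python
# Toy sketch: score each action by task value minus a penalty for whatever
# residue the best available undo leaves behind. States are sets of facts.

def choose_action(actions, state, apply_fn, undo_fn, value, penalty=1.0):
    """Pick the action maximizing value minus an irreversibility penalty."""
    def irreversibility(a):
        after = apply_fn(state, a)
        restored = undo_fn(after, a)
        return len(state ^ restored)  # symmetric difference: residue after undoing
    return max(actions, key=lambda a: value(a) - penalty * irreversibility(a))

apply_fn = lambda s, a: s | {a}
undo_fn = lambda s, a: s - {a} if a != "smash_vase" else s  # smashing can't be undone
picked = choose_action(["place_strawberry", "smash_vase"], set(),
                       apply_fn, undo_fn, value=lambda a: 1.0)
```

Even with both actions equally valuable for the task, the undoable one wins, which is the shape of preference the comment is describing.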

Another failure mode: the AI stubbornly ignores you and actually does nothing when you ask it several times to put the strawberry on the plate, and you go and do it yourself out of frustration. The AI, having predicted this, thinks "Mission accomplished".

An AI that carefully places the plates on a table will go on to be used to put 5000 more strawberries on plates, and afterwards as a competent cook in an arbitrary kitchen. Thus the plate-smasher AI will have lower impact and be "more corrigible".

~1 hour's thoughts, by a total amateur. It doesn't feel complete, but it's what I could come up with before each new idea started taking >5 minutes' thought. Calibrate accordingly: if your list isn't significantly better than this, take some serious pause before working on anything AI-related.

Things that might, in some combination, lead toward AI corrigibility:

  • The AI must be built and deployed by people with reasonably correct ethics, or sufficient caution that they don't intentionally implement a disastrously evil morality.
    • No amount of AI sa
... (read more)

Quick brainstorm:

  1. Context-sensitivity: the goals that a corrigible AGI pursues should depend sensitively on the intentions of its human users when it’s run.
  2. Default off: a corrigible AGI run in a context where the relevant intentions or instructions aren’t present shouldn’t do anything.
  3. Explicitness: a corrigible AGI should explain its intentions at a range of different levels of abstraction before acting. If its plan stops being a central example of the explained intentions (e.g. due to unexpected events), it should default to a pre-specified fallback.
  4. Goal r
... (read more)
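Principles 1 and 2 from the brainstorm above can be sketched in a few lines (toy code; `operator_intent` and the wrapper are names I've invented for illustration, not anything from the comment):

```python
# Toy sketch of "context-sensitivity" + "default off": without an explicit
# statement of operator intent in the run context, the wrapper does nothing.

def run_corrigibly(agent, request, context):
    intent = context.get("operator_intent")
    if not intent:
        return None  # default off: no intentions present, take no action
    return agent(request, intent)

noop = run_corrigibly(lambda req, intent: f"doing {req}", "task", {})
acted = run_corrigibly(lambda req, intent: f"doing {req}", "task",
                       {"operator_intent": "please do the task"})
```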

Here's my attempt. I haven't read any of the other comments or the tag yet. I probably spent ~60-90m total on this, spread across a few days.

On kill switches

  • low impact somehow but I don’t know how
  • Go slow enough so that people can see what you’re doing
  • Have a bunch of "safewords" and other kill-switches installed at different places, some hopefully hard-to-reach by the AI. Test them regularly, and consider it a deadly flaw if one stops working.
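The last bullet, testing the kill-switches regularly and treating a failure as a deadly flaw, might look like this in miniature (illustrative only; real switches would be physical or infrastructural, not callables):

```python
# Toy sketch: exercise every registered kill-switch on a schedule and treat
# any failure as a deadly flaw (hard stop, not a warning).

def audit_kill_switches(switches):
    """`switches` maps a name to a callable returning True iff it still works."""
    failures = [name for name, works in switches.items() if not works()]
    if failures:
        raise RuntimeError(f"kill-switch audit failed: {failures}")
    return True

ok = audit_kill_switches({"hardware_breaker": lambda: True,
                          "network_safeword": lambda: True})
```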

On the AI accurately knowing what it is doing, and pointing at things in the real world

  • watch all the metrics
... (read more)

Here are some too-specific ideas (I realize you are probably asking for more general ones):

A "time-bounded agent" could be useful for some particular tasks where you aren't asking it to act over the long-term. It could work like this: each time it's initialized it would be given a task-specific utility function that has a bounded number of points available for different degrees of success in the assigned task and an unbounded penalty for time before shutdown.

If you try to make agents safe solely using this approach though, eventually you decide to give it ... (read more)

To generalize:

Minimal squishiness. You probably need something like a neural net in order to create a world-model for the AI to use, but could probably do everything else using carefully reviewed human-written code that "plugs in" to concepts in the world-model. (Probably best to have something coded in for what to do if a concept you plugged into disappears/fragments when the AI gets more information.)

Abstract goals. The world-model needs enough detail to be able to point to the right concept (e.g. a human-value-related goal), but as long as it does so the AI doesn't necessarily need to know everything about human values; it will just be uncertain and act under uncertainty (which can include risk-aversion measures, asking humans, etc.).

Present-groundedness. The AI's decision-making procedure should not care about the future directly, only via how humans care about the future. Otherwise it e.g. replaces humans with utility monsters.

Here is my shortlist of corrigible behaviours. I have never researched or done any thinking specifically about corrigibility before this, other than a brief glance at the Arbital page some time ago.


-Favour very high caution over acting on your current understanding of your goals.

-Do not act independently, defer to human operators.

-Even though bad things are happening on earth and cosmic matter is being wasted, in the short term just say so be it, take your time. 

-Don’t jump ahead to what your operators will do or believe, wait for it.

-Don’t manipulate hum... (read more)

Possible issue: they won't have time to listen, which will limit the ability to defer to them. Also, does "defer to human operators" take priority over "humans must understand consequences"?

Are you looking to vastly improve your nation state's military capacity with an AGI? Maybe you're of a more intellectual bent instead, and want to make one to expound on the philosophical mysteries of the universe. Or perhaps you just want her to write you an endless supply of fanfiction. Whatever your reasons though, you might be given pause by the tendency AGIs have to take a treacherous turn, destroy all humans, and then convert the Milky Way into paperclips.

If that's the case, I've got just the thing for you! Order one of our myopic AGIs right now! She... (read more)

...the sort of principles you'd build into a Bounded Thing meant to carry out some single task or task-class and not destroy the world by doing it.

Here's one straightforward such principle: minimal transfer to unrelated tasks / task-classes. If you've somehow figured out how to do a pivotal act with a theorem-proving AI, then the theorem-proving AI you train should not also be able to learn to model human behavior, predict biological interactions, etc.

One way to evaluate this quantity: have many small datasets of transfer tasks, each containin... (read more)
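The evaluation being gestured at might be sketched as comparing the trained system against a from-scratch baseline on each unrelated task (hypothetical task names, scores, and tolerance; the truncated comment presumably fills in the real protocol):

```python
# Toy sketch: score the trained system on small datasets of unrelated transfer
# tasks and compare against a from-scratch baseline; a sizable gap flags
# unwanted transferable capability.

def flags_transfer(model_scores, baseline_scores, tolerance=0.05):
    """Return the unrelated tasks where the model beats its baseline by > tolerance."""
    return [task for task, s in model_scores.items()
            if s - baseline_scores[task] > tolerance]

flagged = flags_transfer({"model_humans": 0.90, "fold_proteins": 0.51},
                         {"model_humans": 0.50, "fold_proteins": 0.50})
```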

I assume "If you've somehow figured out how to do a pivotal act" is intended to limit scope, but doesn't that smuggle the hardness of the Hard Task™ out of the equation? Every time I ask myself how this approach would address a given issue, I find myself having to defer to the definition of the pivotal act, which is the thing that's been defined as out of scope.
Quintin Pope:
You need at least a certain amount of transfer in order to actually do your pivotal act. An "AI" with literally zero transfer is just a lookup table. The point of this principle is that you want as little transfer as possible while still managing a pivotal act. I used a theorem proving AI as an example where it's really easy to see what would count as unnecessary transfer. But even with something whose pivotal act would require a lot more transfer than a theorem prover (say, a nanosystem builder AI), you'd still want to avoid transfer to domains such as deceiving humans or training other AIs.

So far as I know, every principle of this kind, except for Jessica Taylor's "quantilization", and "myopia" (not sure who correctly named this as a corrigibility principle), was invented by myself; eg "low impact", "shutdownability".  (Though I don't particularly think it hopeful if you claim that somebody else has publication priority on "low impact" or whatevs, in some stretched or even nonstretched way; ideas on the level of "low impact" have always seemed cheap to me to propose, harder to solve before the world ends.)

Low impact seems so easy to pro... (read more)

better than the tag overall

What does this mean? Improve on what you've (the OP has) already written that's here (LW) tagged corrigibility?


The overall point makes sense: see how far you can go on "principles for corrigibility".

The phrasing at the end of the post was a little weird, though.

Could someone give me a link to the glowfic tag where Eliezer published his list, and say how strongly it spoilers the story?

You can find it here. I would describe it as extremely minimal spoilers as long as you read only the particular post and not preceding or later ones. The majority of the spoilerability is knowing that the content of the story is even partially related, which you would already learn by reading this post. The remainder of the spoilers is some minor characterization.

I hope we get to see grades for these comments from at least EY and PC.

Votes aren't public. (Feedback can be.)

An ability to refuse to generate theories about a hypothetical world being in a simulation.
