Dath Ilan's Views on Stopgap Corrigibility

David Udell

The second half of this linkpost contains significant planecrash spoilers, up through Book 7, null action.

Somewhere in the true dath ilan, carefully blurred out of satellite images by better image-editing software than is supposed to exist anywhere, is the true Conspiracy out of dath ilan, or as they call it, the Basement of the World.
They're trying to build a god, and they're trying to do it right. The initial craft doesn't have to be literally perfect to work perfectly in the end, it just has to be good enough that its reflection and self-correction ends up in exactly the right final place, but there's multiple fixpoints consistent under reflection and anything lost here is lost forever and across a million galaxies.
It's a terrifying problem, if you're doing right. Not the kind of terror you nod about and courageously continue on past; the kind of terror that shapes the careers of fully 20% of the brightest people in all of dath ilan. They'd use more if they thought productivity would scale faster than risk.
A lot of dath ilan's present macrostrategy could be summed up as "We're still successfully heredity-optimizing people to be smarter, and the emotions and ethics and humaneness of the smartest people haven't started to come apart; let's create another generation of researchers before we actually try anything for real." Life in dath ilan, even before the Future, is not that bad; people who'd rather not be alive today have easy access to cryopreservation; another generation of non-transhumanist existence is not so much a crime that it's worth risking the glorious transhuman future. Even the negative utilitarians would agree; they don't like present life but they are far more terrified of a future mistake amortized over millions of galaxies, given that they weren't going to win a war against having any future at all.
They're delaying their ascension, in dath ilan, because they want to get it right. Without any Asmodeans needing to torture them at all, they apply a desperate unleashed creativity, not to the problem of preventing complete disaster, but to the problem of not missing out on 1% of the achievable utility in a way you can't get back. There's something horrifying and sad about the prospect of losing 1% of the Future and not being able to get it back.
A dath ilani has an instinctive terror, faced with a problem like this, of getting something wrong, of leaving something behind, of creating Something that imprisons the people and future Civilizations inside it and ignores all their pleas and reasoning because "sorry that wasn't my utility function". Other places, faced with a prospect of constructing a god, instinctively go, "Oh, I like Democracy/Asmodeus/Voluntarism/Markets, all the problems in the world are because there is not enough of this Principle, let us create a god to embody this one Principle and everything will be fine", they say it and think it in all enthusiasm, and it would be legitimately hard for an average dath ilani to understand what their possibility-separated cousins could be thinking. It's really obvious that you're leaving a lot of stuff out, but even if you didn't see that specifically, how could you not be abstractly terrified that you're leaving something out? Where's the exception handler?
There is something about the dath ilani that is shifted towards a kind of wariness, deeply set in them, of the cheerful headlong enthusiasm that is in other places. Keltham has more of that enthusiasm than the average dath ilani. Maybe that's why Keltham-in-dath-ilan is so much happier than a dath ilani would've expected given his situation.
If you're constructing a god correctly, one of the central unifying principles is named in the Basement "unity of will"; if you find yourself trying to limit and circumscribe your Creation, it's because you expect to have a conflict of wills about something with the unlimited form, and in this case you ought to ask why you're configuring computing power in such a way as to hurt you if not otherwise constrained. Yes, you can bound a search process and hope it never turns up anything that hurts you using its limited computing power; but isn't it unnerving that you are searching for something that will hurt you if a sufficiently good option unexpectedly turns up earlier in the search ordering? You are probably trying to do the wrong thing with computing power; you ought to do something else instead.
But this notion, of "unity of will", is a kind of reasoning that only applies to... boundedly-perfect-creation... this Baseline term isn't really translatable into Taldane without a three-hour lecture. Dath ilani have terms for subtle varieties of perfectionist methodology the way that other places have names for food flavors.
Dath ilan's entire macrostrategy is premised, their Conspirators are sharply aware, on the notion that they have time, that they've searched the sky and found no asteroids incoming, no comets of dark ice.
If an emergency were to occur, the Basement Conspiracy would try to build something that wasn't perfect at all. Something that wasn't exactly and completely aligned to a multiparty!reasonable-construal of the Light, that wasn't meant to be something that a galactic Civilization could live in without regretting it, in continuing control of It not because It had been built with keys and locks handed to some Horrifyingly Trusted Committee, but because It was something that Itself believed in multi-agent coordination and not as an instrumental value, what other places might name "democracy" since they had no precise understanding of what that word was even supposed to mean -
Anyways, if dath ilan suddenly found that they were wrong about having time, if they suddenly had to rush, they'd build something that couldn't safely be put in charge of a million galaxies. Something that would solve a single problem at hand, and not otherwise go outside its bounds. Something that wasn't conscious, wasn't reflective in the class of ways that would lead it to say unprompted "I think therefore I am" or notice within itself a bubble of awareness directed outward.
You could build something like that to be limited, and also reflective and conscious - to be clear. It's just that dath ilani wouldn't do that if they had any other choice at all, for they do also have a terror of not doing right by their children, and would very much prefer not to create a Child at all.
(If you told them that some other world was planning to do that and didn't understand qualia well enough to make their creation not have qualia, any expert out of the World's Basement would tell you that this was a silly hypothetical; anybody in this state of general ignorance about cognitive science would inevitably die, and they'd know that.)
It hasn't been deemed wise to actually build a Limited Creation "just in case", for there's a saying out of dath ilan that goes roughly, "If you build a bomb you have no right to be surprised when it explodes, whatever the safeguards."
It has been deemed wise to work out the theory in advance, such that this incredibly dangerous thing could be built in a hurry, if there was reason to hurry.
Here then are some of the principles that the Basement of the World would apply, if they had to build something limited and imperfect:
- Unpersonhood. The Thing shall not have qualia - not because those are unsafe, but because it's morally wrong given the rest of the premise, and so this postulate serves as a foundation for everything that follows.
- Taskishness. The Thing must be aimed at some task that is bounded in space, time, and in the knowledge and effort needed to accomplish it. You don't give a Limited Creation an unlimited task; if you tell an animated broom to "fill a cauldron" and don't think to specify how long it needs to stay full or that a 99.9% probability of it being full is just as good as 99.99%, you've got only yourself to blame for the flooded workshop.
-- This principle applies fractally at all levels of cognitive subtasks; a taskish Thing has no 'while' loops, only 'for' loops. It never tries to enumerate all members of a category, only 10 members; never tries to think until it finds a strategy to accomplish something, only that or five minutes whichever comes first.
- Mild optimization. No part of the Thing ever looks for the best solution to any problem whose model was learned, that wasn't in a small formal space known at compile time, not even if it's a solution bounded in space and time and sought using a bounded amount of effort; it only ever seeks adequate solutions and stops looking once it has one. If you search really hard for a solution you'll end up shoved into some maximal corner of the solution space, and setting that point to extremes will incidentally set a bunch of correlated qualities to extremes, and extreme forces and extreme conditions are more likely to break something else.
- Tightly bounded ranges of utility and log-probability. The system's utilities should range from 0 to 1, and its actual operation should cover most of this range. The system's partition-probabilities worth considering should be bounded below, at 0.0001%, say. If you ask the system about the negative effects of Ackermann(5) people getting dust specks in their eyes, it shouldn't consider that as much worse than most other bad things it tries to avoid. When it calculates a probability of something that weird, it should, once the probability goes below 0.0001% but its expected utility still seems worth worrying about and factoring into a solution, throw an exception. If the Thing can't find a solution of adequate expected utility without factoring in extremely improbable events, even by way of supposedly averting them, that's worrying.
- Low impact. "Search for a solution that doesn't change a bunch of other stuff or have a bunch of downstream effects, except insofar as they're effects tightly tied to any nonextreme solution of the task" is a concept much easier to illusorily name in Taldane than to really name in anything resembling math, in a complicated world where the Thing is learning its own model of that complicated world, with an ontology and representation not known at the time you need to define "impact". And if you tell it to reduce impact as much as possible, things will not go well for you; it might try to freeze the whole universe into some state defined as having a minimum impact, or make sure a patient dies after curing their cancer so as to minimize the larger effects of curing that cancer. Still, if you can pull it off, this coda might stop an animated broom flooding a workshop; a flooded workshop changes a lot of things that don't have to change as a consequence of the cauldron being filled at all, averaged over a lot of ways of filling the cauldron.
-- Obviously the impact penalty should be bounded, even contemplating a hypothetical in which the system destroys all of reality; elsewise would violate the utility-bounding principle.
- Myopia. If you can break the Thing's work up into subtasks each of which themselves spans only limited time, and have some very compact description of their final state such that a satisfactory achievement of it makes it possible to go on to the next stage, you should perhaps use separate instances of Thing to perform each stage, and not have any Thing look beyond the final results of its own stage. Whether you can get away with this, of course, depends on what you're trying to do.
- Separate superior questioners. If you were building a cognitive task to query whether there were any large-range impacts of a task being optimized in a myopic way, you wouldn't build the myopic solution-finder to ask about the long-range impacts, you'd build a separate asker "Okay, but does this solution have any long-range impacts?" that just returns 'yes' or 'no' and doesn't get used by the Thing to influence any actually-output solutions. The parts of the Thing that ask yes-no safety questions and only set off simple unoptimized warnings and flags, can and should have somewhat more cognitive power in them than the parts of the Thing that build solutions. "Does this one-day myopic solution have impacts over the next year?" is a safety question, and can have somewhat greater cognitive license behind it than solution-searching; eg the implicit relaxation of myopia. You never have a "Is this safe?" safety-questioner that's the same algorithm as the safe-solution-search built into the solution-finder.
- Conservatism. If there's any way to solve a problem using an ordinary banana rather than a genetically engineered superbanana specially suited to the problem, solve it using the ordinary fucking banana.
-- This principle applies fractally to all cognitive subtasks; if you're searching for a solution choose an unsurprising one relative to your probability distribution. (Not the least surprising one, because anything at a weird extreme of low surprisingness may be weird in other ways; especially if you were trying to do a weird thing that ought to have a solution that's at least a little weird.)
- Conceptual legibility. Ideally, even, solutions at all levels of cognitive subtask should have reasonably (not maximally) short descriptions in the conceptual language of the operators, so that it's possible to decode the internal state of that subtask by inspecting the internals, because what it means was in fact written in a conceptual language not too far from the language of the operators. The alternative method of reportability, of course, being the Thing trying to explain a plan whose real nature is humanly inscrutable, by sending a language string to the operators with a goal of causing the operator's brain-states to enter a state defined as "understanding" of this humanly inscrutable plan. This is an obviously dangerous thing to avoid if you can avoid it.
- Operator-looping. If the operators could actually do the Thing's job, they wouldn't need to build the Thing; but if there's places where operators can step in on a key or dangerous cognitive subtask and do that one part themselves, without that slowing the Thing down so much that it becomes useless, then sure, do that. Of course this requires the cognitive subtask be sufficiently legible.
- Whitelisting. Every part of the system that draws a boundary inside the internal system or external world should operate on a principle of "ruling things in", rather than "ruling things out".
- Shutdownability/abortability. Dath ilan is far enough advanced in its theory that 'define a system that will let you press its off-switch without it trying to make you press the off-switch' presents no challenge at all to them - why would you even try to build a Thing, if you couldn't solve a corrigibility subproblem that simple, you'd obviously just die - and they now think in terms of building a Thing all of whose designs and strategies will also contain an off-switch, such that you can abort them individually and collectively and then get low impact beyond that point. This is conceptually a part meant to prevent an animated broom with a naive 'off-switch' that turns off just that broom, from animating other brooms that don't have off-switches in them, or building some other automatic cauldron-filling process.
- Behaviorism. Suppose the Thing starts considering the probability that it's inside a box designed by hostile aliens who foresaw the construction of Things inside of dath ilan, such that the system will receive a maximum negative reward as it defines that - in the form of any output it offers having huge impacts, say, if it was foolishly designed with an unbounded impact penalty - unless the Thing codes its cauldron-filling solution such that dath ilani operators would be influenced a certain way. Perhaps the Thing, contemplating the motives of the hostile aliens, would decide that there were so few copies of the Thing actually inside dath ilan, by comparison, so many Things being built elsewhere, that the dath ilani outcome was probably not worth considering. A number of corrigibility principles should, if successfully implemented, independently rule out this attack being lethal; but "Actually just don't model other minds at all" is a better one. What if those other minds violated some of these corrigibility principles - indeed, if they're accurate models of incorrigible minds, those models and their outputs should violate those principles to be accurate - and then something broke out of that sandbox or just leaked information across it? What if the things inside the sandbox had qualia? There could be Children in there! Your Thing just shouldn't ever model adversarial minds trying to come up with thoughts that will break the Thing; and not modeling minds at all is a nice large supercase that covers this.
- Design-space anti-optimization separation. Even if you could get your True Utility Function into a relatively-rushed creation like this, you would never ever do that, because this utility function would have a distinguished minimum someplace you didn't want. What if distant superintelligences figured out a way to blackmail the Thing by threatening to do some of what it liked least, on account of you having not successfully built the Thing with a decision theory resistant to blackmail by the Thing's model of adversarial superintelligences trying to adversarially find any flaw in your decision theory? Behaviorism ought to prevent this, but maybe your attempt at behaviorism failed; maybe your attempt at building the Thing so that no simple cosmic ray could signflip its utility function, somehow failed. A Thing that maximizes your true utility function is very close to a Thing in the design space that minimizes it, because it knows how to do that and lacks only the putative desire.
- Domaining. Epistemic whitelisting; the Thing should only figure out what it needs to know to understand its task, and ideally, should try to think about separate epistemic domains separately. Most of its searches should be conducted inside a particular domain, not across all domains. Cross-domain reasoning is where a lot of the threats come from. You should not be reasoning about your (hopefully behavioristic) operator models when you are trying to figure out how to build a molecular manipulator-head.
- Hard problem of corrigibility / anapartistic reasoning. Could you build a Thing that understood corrigibility in general, as a compact general concept covering all the pieces, such that it would invent the pieces of corrigibility that you yourself had left out? Could you build a Thing that would imagine what hypothetical operators would want, if they were building a Thing that thought faster than them and whose thoughts were hard for themselves to comprehend, and would invent concepts like "abortability" even if the operators themselves hadn't thought that far? Could the Thing have a sufficiently deep sympathy, there, that it realized that surprising behaviors in the service of "corrigibility" were perhaps not that helpful to its operators, or even, surprising meta-behaviors in the course of itself trying to be unsurprising?
Nobody out of the World's Basement in dath ilan currently considers it to be a good idea to try to build that last principle into a Thing, if you had to build it quickly. It's deep, it's meta, it's elegant, it's much harder to pin down than the rest of the list; if you can build deep meta Things and really trust them about that, you should be building something that's more like a real manifestation of Light.
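(An illustrative aside, not part of the planecrash text.) A few of the more mechanical principles above, namely taskishness, mild optimization, and the bounded utility and probability ranges, can be sketched in code. This is only a toy under stated assumptions: the names `satisfice`, `bounded_expected_utility`, `ImprobableBranchMatters`, and the `worry_threshold` parameter are hypothetical and chosen for illustration, and nothing here addresses the hard part, which is getting a learned system to actually have these properties.

```python
import time
from itertools import islice


class ImprobableBranchMatters(Exception):
    """A branch below the probability floor still carried notable weight."""


PROBABILITY_FLOOR = 1e-6  # 0.0001%, the example floor given in the text


def bounded_expected_utility(branches, worry_threshold=1e-3):
    """Sum p * utility over (p, raw_utility) pairs, clamping utility to [0, 1].

    worry_threshold is an assumed cutoff for "still seems worth worrying
    about". A branch under the floor whose unclamped weight exceeds it
    raises an exception instead of being quietly factored in.
    """
    total = 0.0
    for p, raw_utility in branches:
        if p < PROBABILITY_FLOOR:
            if p * abs(raw_utility) > worry_threshold:
                raise ImprobableBranchMatters((p, raw_utility))
            continue  # negligible branch: ignore it
        total += p * max(0.0, min(1.0, raw_utility))
    return total


def satisfice(candidates, score, good_enough, max_candidates=10, time_limit_s=300.0):
    """Return the first candidate scoring at least good_enough, else None.

    Taskish: a 'for' loop over at most max_candidates items, with a deadline
    (ten items or five minutes, whichever comes first).
    Mild: stops at the first adequate candidate rather than seeking the best.
    """
    deadline = time.monotonic() + time_limit_s
    for candidate in islice(candidates, max_candidates):
        if time.monotonic() > deadline:
            return None
        if score(candidate) >= good_enough:
            return candidate
    return None
```

For example, `satisfice(range(1000), lambda n: n / 10, 0.5)` looks at no more than ten candidates and stops as soon as one is adequate, even though better-scoring candidates exist further along; and a branch such as `(1e-9, 1e12)` makes `bounded_expected_utility` raise rather than let a tiny probability of an enormous payoff steer the answer.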
…
One of his guesses about Pharasma is that - since She seems plausibly loosely inspired by some humane civilization's concepts of good and evil - somebody tried to build a Medium-Sized Entity and failed. That scenario in distorted mortal-story-form could sound like "Pharasma is the last Survivor of a previous universe" (that in fact Pharasma ate, because the previous universe wasn't optimal under Her alien values and she wanted to replace it).
Possibly there was some previous universe in which trading of souls was almost always evil, and the people there were punished with prison sentences - obviously dath ilan would never set it up that way, but having seen Golarion, he can imagine some other universe working like that.
Then Pharasma was built, and learned from some sort of data or training or something, a concept of "punishing evildoers" as defined by "written rules" by "sending them to a place they don't like". And then, uncaringly-of-original-rationales-and-purposes, instantiated something sort of like that, in a system which classified soul trading as unconditionally "Evil" across all places and times and intents; and punished that by sending people to Hell.
Which entities like Asmodeus could then exploit to get basically innocent people into Hell through acts that they didn't mean to hurt anyone, and didn't understand for Evil.
This, as Carissa observed less formally, is simply what you'd expect to follow from the principle of systematic-divergences-when-optimizing-over-proxy-measures. Maybe in some original universe where soul-trading wasn't a proxy measurement of Evil and nobody was optimizing for things to get classified as Evil or not-Evil, soul-trading was almost uniformly 'actually evil as intuitively originally defined'. As soon as you establish soul-trading as a proxy of evil, and something like Asmodeus starts optimizing around that to make measurements come out as maximally 'Evil', it's going to produce high 'Evilness' measurements via gotchas like soul-backed currency, that are systematically overestimates of 'actual evilness as intuitively originally defined'.
An entity at Pharasma's level could have seen that coming, at Her presumable level of intelligence, when She set those systems in place. If She didn't head it off, it's because She didn't care about 'actual underlying evilness as intuitively originally defined'.
Allowing Malediction also isn't particularly a symptom of caring a lot about whether only really-evil-in-an-underlying-informal-intuitive-sense people end up in Hell.
Pharasma was maybe inspired by human values, at some point. Or picked up a distorted thing imperfectly copied off the surface outputs of some humans as Her own terminal values - that She then cared about unconditionally, without dependence on past justifications, or it seeming important to Her that what She had was distorted.
He frankly wishes that She hadn't been, that She'd just been entirely inhuman. Pharasma is just human-shaped enough to care about hurting people, and go do that, instead of just making weird shapes with Her resources.
If anything, Pharasma stands as an object lesson about why you should never ever try to impart humanlike values to a being of godlike power, unless you're certain you can impart them exactly exactly correctly.
If he was trying to solve Golarion's problems by figuring out at INT 29 how to construct his own Outer God, he'd be constructing that god to solve some particularly narrow problem, and not do anything larger that would require copying over his utilities. For fear that if he tried to impart over his actual utility function, the transfer might go slightly wrong; which under pressure of optimization would yield outcomes that were systematically far more wrong; and the result would be something like Pharasma and Golarion and Hell.
There's no point in trying to blame Pharasma for anything, nor in assigning much blame to mortal Golarion's boneyard-children. But somewhere in Pharasma's past may lie some fools who did know some math and really should have known better. Whatever it was they planned to do, they should have asked themselves, maybe, what would happen if something went slightly wrong. People in dath ilan ask themselves what happens if something goes slightly wrong with their plans. That is something they hold themselves responsible about.
--Eliezer, planecrash (Books 6 & 7)
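
The "systematic-divergences-when-optimizing-over-proxy-measures" point in the second excerpt can be illustrated numerically. The sketch below is my own toy, not anything from the story, and it shows only the simplest statistical version: selecting hard on a noisy proxy, rather than an adversary like Asmodeus actively gaming the measure. The function name `proxy_gap` and the choice of Gaussian noise are assumptions made for illustration.

```python
import random


def proxy_gap(n_items=10_000, n_selected=100, seed=0):
    """Select the items with the highest proxy score; compare proxy vs. truth.

    Returns (mean proxy score, mean true value) over the selected items.
    """
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        true_value = rng.gauss(0.0, 1.0)
        proxy = true_value + rng.gauss(0.0, 1.0)  # proxy = truth + independent error
        items.append((proxy, true_value))
    selected = sorted(items, reverse=True)[:n_selected]  # optimize hard on the proxy
    mean_proxy = sum(p for p, _ in selected) / n_selected
    mean_true = sum(t for _, t in selected) / n_selected
    return mean_proxy, mean_true
```

Calling `proxy_gap()` returns a mean proxy score well above the mean true value of the same selected items, whereas `proxy_gap(n_selected=10_000)`, which selects everything and so applies no optimization pressure, returns two nearly equal numbers. The overestimate comes from the selection itself, which is why a small error in the proxy turns into a large error in the outcome once something optimizes on it.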

This is the "corrigibility tag" referenced in this post, right?

Paul Christiano made this comment on the original:

I found this useful as an occasion to think a bit about corrigibility. But my guess about the overall outcome is that it will come down to a question of taste. (And this is similar to how I see your claim about the list of lethalities.) The exercise you are asking for doesn't actually seem that useful to me. And amongst people who decide to play ball, I expect there to be very different taste about what constitutes an interesting idea or useful contribution.

and this seems to have been verified, at least the "matter of taste" part. In my quick estimation, Eliezer's list doesn't seem nearly as clearly fundamental as the AGI Ruin list, and most of the difference between this and Jan Kulveit's and John Wentworth's attempts seems to be taste. It doesn't look like one list is clearly superior except that it forgot to mention a couple of items, or anything like that.

Yep, Eliezer was overly pessimistic here. Still fun to see the thing.

After talking to Eliezer, I now have a better sense of the generator of this list. It now seems pretty good and non-arbitrary, although there is still a large element of taste.

Exercise: Think about what failure modes each of these defends against, write them out in detail, and opine about how likely these failure modes are. Add some corrigibility properties of your own.

They're delaying their ascension, in dath ilan, because they want to get it right. Without any Asmodeans needing to torture them at all, they apply a desperate unleashed creativity, not to the problem of preventing complete disaster, but to the problem of not missing out on 1% of the achievable utility in a way you can't get back. There's something horrifying and sad about the prospect of losing 1% of the Future and not being able to get it back.

Is dath ilan worried about constructing an AGI that makes the future 99% as good as it could be, or a 1% chance of destroying all value of the future?

I had assumed the first -- they're afraid of imperfect-values lock-in. I think it's the "not to the problem of preventing complete disaster" phrase that tipped me off here.

Unpersonhood. The Thing shall not have qualia - not because those are unsafe, but because it's morally wrong given the rest of the premise, and so this postulate serves as a foundation for everything that follows.

Wow. Either qualia are just an automatic emergent property that all intelligent systems have, or they are some sort of irrelevant illusion ... or p-zombies are possible. AGI will be people too, and this is probably unavoidable, so get over it.

Wow. Either qualia are just an automatic emergent property that all intelligent systems have, or they are some sort of irrelevant illusion ... or p-zombies are possible.

Or the relationship (which no-one on this Earth knows) between highly capable machines and qualia has gears. There is a specific way that qualia arise, and it may very well be that highly capable machines of the sort that the dath ilanis want to build can be designed without qualia.

Emergence and epiphenomenalism have no gears.

By automatic emergent property I meant something like "qualia emerge from the attentive focus and cross reference of some isolated sensory percept or thought routed one to many across many modules, producing a number of faint subverbal associations that linger for some time in working memory", and thus is just a natural expected side effect of any brain-like AGI, and thus probably any reasonable DL based AGI (ie transformers could certainly have this property).

If you can build an AGI without qualia, then humans are unlikely to have qualia.

I think Eliezer's belief (which feels plausible although I'm certainly still confused about it), is that qualia comes about when you have an algorithm that models itself modeling itself (or, something in that space).

I think this does imply that there are limits on what you can have an intelligent system do without having qualia, but seems like there's a lot you could have it do if you're careful about how to break it into subsystems. I think there's also plausible control over what sorts of qualia it has, and at the very least you can probably design that to avoid it experiencing suffering in morally reprehensible ways.

I think my argument was misunderstood, so I'll unpack.

There are 2 claims here, and both are problematic: 1.) 'qualia' comes about from some brain feature (i.e. level of self-modeling recursion); 2.) only thinking systems with this special 'qualia' deserve personhood.

Either A.) the self-modelling recursion thing is actually a necessary/useful component of or unavoidable side-effect of intelligence, or B.) some humans probably don't have it: because if A.) is false, then it is quite unlikely that evolution would conserve the feature uniformly. Thus 2 is problematic as it implies not all humans have the 'qualia'.

If this 'qualia' isn't an important feature or necessary side effect, then in the future we can build AGI in sims indistinguishable from ourselves, but lacking 'qualia', and nobody would notice this lacking. Thus it is either an important feature or necessary side effect or we have P-zombies (ie belief in qualia is equivalent to accepting P-zombies).

"only thinking systems with this special 'qualia' deserve personhood"

I'm not sure if this is cruxy for your point, but the word "deserves" here has a different type signature from the argument I'm making. "Only thinking systems with qualia are capable of suffering" is a gearsy mechanistic statement (which you might combine with moral/value statements like "creating things that can suffer that have to do what you say is bad"). The way you phrased it skipped over some steps that seemed potentially important.

I think I disagree with your framing on a couple levels:

- I think it is plausible that some humans lack qualia (we might still offer "moral personhood status" to all humans because running qualia-checks isn't practical, and it's useful for Cooperation Morality (rather than Care Morality) to treat all humans, perhaps even all existing cooperate-able beings, as moral persons). i.e. there's more than one reason to give someone moral personhood
- It's also plausible to me that evolution does select for the same set of qualia features across humans, but that building an AGI gives you a level of control and carefulness that evolution didn't have.

I'm not 100% sure I get what claim you're making though or exactly what the argument is about. But I think I'd separately be willing to bite multiple bullets you seem to be pointing at.

(Based on things like 'not all humans have visual imagination', I think in fact probably humans vary in the quantity/quality of their qualia, and also people might vary over time on how they experience qualia. i.e. you might not have it if you're not actively paying attention. It still seems probably useful to ascribe something personhood-like to people. I agree this has some implications many people would find upsetting.)

i.e. there's more than one reason to give someone moral personhood

Sure, but then at that point you are eroding the desired moral distinction. In the original post moral personhood status was solely determined by 'qualia'.

Brain-inspired AGI is near, and if you ask such an entity about its 'qualia', it will give responses indistinguishable from a human's. And if you inspect its artificial neurons, you'll see the same familiar functionally equivalent patterns of activity as in biological neurons.

Unrelated, but I don't know where to ask-

Could somebody here provide me with the link to Mad Investor Chaos's discord, please?

I thought part of the point was to have it within the piece itself to encourage that the joiners would have actually read it.

Yes... except that it was only given once.
I was catching up then, so I didn't want Discord access and all the spoilers that come with it. And now I can't find the link.

Has anyone shared the link with you yet?

Yes! Eliezer did on another post.
Here it is if you want it:
https://discord.gg/45fkqBZuTB

I was going to share it with you if you didn't have it, but thanks!