So as to know, structurally, what it is, the better to avoid it: Yudkowsky has previously mentioned (and so implied) that MIRI, in the process of designing a corrigible agent, in fact succeeded in designing an agent specifically intended to shut itself down. So, not having seen it suggested elsewhere, here is a suggestion: what if, if only for insight into the kind of behavior most apt to kill us, one tried to design (only on paper, please) an agent whose loss or utility function is specifically intended to kill everyone? (Potential difficulty: there is more than one way of killing everyone. Potential solution: gradations of lethal utilities, each examined in detail: kill all humans, kill all life, eliminate all matter in the universe, etc.) At a minimum, we might learn which other loss functions resemble this "absolutely worst case" function, and those we might better avoid using. Note that we wouldn't be seeking specific plans for how it would kill anyone, but rather the sorts of requests external to the system, or drives internal to it, that subsequently imply plans to kill everyone, or prized control of its own reward function. (Potential problem: someone uses the list to deliberately kill everyone. Potential solution: none; anyone that motivated, with the brains to implement it, could use CRISPR instead.)


Perhaps more fatuously (at least it would identify people with bad ideas, should anyone naively suggest it hereafter), one might recommend configuring a loss function that's "just, like, the opposite of the bad one!" to yield the absolutely best case. More practically, even the bad one has interesting potential consequences. Not least: we've been treating instrumental convergence and goal-content integrity as consequences, sub-goals of any given goal, that accidentally kill everyone. With an Everything-Killer function (hereafter, if ever, an EK function), instrumentality may become the goal to be preserved; goal-content integrity necessitates, and is necessitated by, the need to kill everyone. These might then be better studied in themselves, along with what brings them about.


At least one hypothetical also comes to mind. Note that all of this was prompted by thinking about how we'd implement humanity-wide goals even with current technology, and, curiously, by musing on Frankenstein's monster, or Steinbeck's Lennie if you're into that sort of thing: figures who realize only after the fact that they've killed someone. So how to make them realize before, or how to make it so they can undo the damage even once they've dealt it? That, of course, is why infrastructure profusion would be a problem under instrumental convergence, even for purely beneficent loss functions.


So, the hypothetical: given this author's previously presented non-anthropic ethics model, here (yes, it's too long; yes, the tone's too light; yes, the reasoning remains, to all appearances, unexamined), conceive of a neural net that is sub-AGI but has the domain capability of designing greater intelligence. You give this sub-AGI the loss function of designing an AI that modifies itself to maximize the orderliness (minimize the entropy) of its own operation. Let us assume that being more orderly is being more intelligent, so that this second design is an AGI.
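The "orderliness" objective can be made concrete, if only as a toy. A minimal sketch, assuming we proxy "orderliness of operation" by the Shannon entropy of a softmax-normalized activation vector; the proxy and the function names are hypothetical illustrations, not anything specified above:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: shift by the max before exponentiating.
    e = np.exp(x - x.max())
    return e / e.sum()

def activation_entropy(logits):
    """Shannon entropy (in nats) of a softmax-normalized activation vector.

    A hypothetical 'orderliness' loss term: it is minimized when the
    distribution is maximally peaked (low entropy, high 'order') and
    maximized when activations are uniform (high entropy, 'disorder').
    """
    p = softmax(np.asarray(logits, dtype=float))
    return float(-(p * np.log(p + 1e-12)).sum())
```

On this toy proxy, a uniform activation pattern scores `log(n)` nats, while a sharply peaked one scores near zero, so gradient descent on this term would push the system toward ever more "ordered" (concentrated) internal states; the hypothetical's worry is about what that pressure does at scale.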


(And this is a very curious notion: that inner alignment may be analogous to hardware structure. A sub-note: how, in humans, do we characterize inner goals? As hormones, which depend on input from the environment, or as consciousness, which may well be time-and-space invariant, or at least removed from those inputs by one neuronal step? This much seems safe to note: humans' motivations are related to their emotions and inclinations, and those are indirect products of neuronal structures. If both the loss function and the hardware structure are designed for benevolence (or, better here, orderliness), then how or why would the system use the hardware to alter its loss function, or vice versa? Potential problem: neural nets operate by effectively making or modifying their own hardware, or at least choosing which components of it they use, and we can't yet predict how they'll do this. Potential solution: on the current model, we'd need to regularize the neural net via dropout, but non-randomly, toward some configuration that comports with the loss function, which we have no idea how to do. Rather than a six months' pause on system training, why not an indefinite pause until the connection configuration of each layer can be ascertained at any given moment? Let that be the criterion; if it takes half a century to meet, so much the better.)
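The contrast between ordinary dropout and the "non-random" variant gestured at above can be sketched in toy form. A minimal illustration, assuming we somehow had per-unit alignment scores to hand; obtaining such scores is precisely the unsolved part, and `structured_dropout` with its `unit_scores` argument is a hypothetical device, not an existing technique:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dropout(activations, p=0.5):
    """Standard dropout: zero units at random, independent of the loss,
    scaling survivors by 1/(1-p) to preserve the expected activation."""
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

def structured_dropout(activations, unit_scores, keep_fraction=0.5):
    """Hypothetical 'non-random' dropout: keep only the units whose
    (assumed, externally supplied) alignment scores are highest,
    i.e., regularize toward a configuration that comports with the
    loss function. How to obtain unit_scores is the open problem."""
    k = max(1, int(keep_fraction * activations.shape[-1]))
    keep = np.argsort(unit_scores)[-k:]   # indices of the top-scoring units
    mask = np.zeros_like(activations)
    mask[..., keep] = 1.0
    return activations * mask / keep_fraction
```

The point of the sketch is only the structural difference: the first mask is sampled blindly, the second is chosen by a criterion, and the post's claim is that we currently have no way to compute such a criterion for a real network.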


Naturally, the self-modifying model produced by the sub-AGI self-modifies, results in infrastructure profusion, and humanity dies. Except: in the interest of maximizing its orderliness, and on the assumption that it can, early in its development, detect the neural activity of humans, and that this activity, carrying information, is more orderly than a dead brain, either it preserves the humans or it at least incorporates the orderliness of each human's thoughts into its own neuro-computational structure. Then either it later realizes that it has killed everyone, and also that orderliness beyond itself (maximum orderliness, everywhere, including its environment) enhances its own total orderliness, and it thus resurrects humanity after some fashion; or it doesn't, but human consciousness is somehow preserved within it (ongoing consciousness presumably being more orderly than consciousness delimited by death).


Ironically, only those who've had their brains cryogenically frozen would be liable not to be revived (if anyone would be at all; this was an idea from just yesterday, remember). Observe, too, that even in the above hypothetical, probably everyone dies; except they'd be (if all goes well) guaranteed an "afterlife". In general, that we may have to countenance everyone's death and still have something good come of developments is something we should factor into our considerations. Such could constitute the "death with dignity" Yudkowsky continues, with justification, to fixate on.


Something to keep in mind, as we go on.

