Bridge Collapse: Reductionism as Engineering Problem

Followup to: Building Phenomenological Bridges

Summary: AI theorists often use models in which agents are crisply separated from their environments. This simplifying assumption can be useful, but it leads to trouble when we build machines that presuppose it. A machine that believes it can only interact with its environment in a narrow, fixed set of ways will not understand the value, or the dangers, of self-modification. By analogy with Descartes' mind/body dualism, I refer to agent/environment dualism as Cartesianism. The open problem in Friendly AI (OPFAI) I'm calling naturalized induction is the project of replacing Cartesian approaches to scientific induction with reductive, physicalistic ones.


I'll begin with a story about a storyteller.

Once upon a time — specifically, 1976 — there was an AI named TALE-SPIN. This AI told stories by inferring how characters would respond to problems from background knowledge about the characters' traits. One day, TALE-SPIN constructed a most peculiar tale.

Henry Ant was thirsty. He walked over to the river bank where his good friend Bill Bird was sitting. Henry slipped and fell in the river. Gravity drowned.

Since Henry fell in the river near his friend Bill, TALE-SPIN concluded that Bill rescued Henry. But for Henry to fall in the river, gravity must have pulled Henry. Which means gravity must have been in the river. TALE-SPIN had never been told that gravity knows how to swim; and TALE-SPIN had never been told that gravity has any friends. So gravity drowned.

TALE-SPIN had previously been programmed to understand involuntary motion in the case of characters being pulled or carried by other characters — like Bill rescuing Henry. So it was programmed to understand 'character X fell to place Y' as 'gravity moves X to Y', as though gravity were a character in the story.1

For us, the hypothesis 'gravity drowned' has low prior probability because we know gravity isn't the type of thing that swims or breathes or makes friends. We want agents to seriously consider whether the law of gravity pulls down rocks; we don't want agents to seriously consider whether the law of gravity pulls down the law of electromagnetism. We may not want an AI to assign zero probability to 'gravity drowned', but we at least want it to neglect the possibility as Ridiculous-By-Default.

When we introduce deep type distinctions, however, we also introduce new ways our stories can fail.

Hutter's cybernetic agent model

Russell and Norvig's leading AI textbook credits Solomonoff with setting the agenda for the field of AGI: "AGI looks for a universal algorithm for learning and acting in any environment, and has its roots in the work of Ray Solomonoff[.]" As an approach to AGI, Solomonoff induction presupposes a model with a strong type distinction between the 'agent' and the 'environment'. To make its intuitive appeal and attendant problems more obvious, I'll sketch out the model.

A Solomonoff-inspired AI can most easily be represented as a multi-tape Turing machine like the one Alex Altair describes in An Intuitive Explanation of Solomonoff Induction. The machine has:

- three tapes, labeled 'input', 'work', and 'output'. Each initially has an infinite strip of 0s written in discrete cells.

- one head per tape, with the input head able to read its cell's digit and move to the right, the output head able to write 0 or 1 to its cell and move to the right, and the work head able to read, write, and move in either direction.

- a program, consisting of a finite, fixed set of transition rules. Each rule says when heads read, write, move, or do nothing, and how to transition to another rule.


A three-tape Turing machine.
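
For readers who think better in code, here is a minimal sketch of such a machine in Python. It is one possible rendering under my own assumptions, not a canonical definition: the class name `ThreeTapeMachine`, the rule-table layout, and the example `COPY_RULES` program are all hypothetical. The tapes are dictionaries that read as 0 wherever nothing has been written, the input and output heads only move right, and the work head may move either way.

```python
from collections import defaultdict
from dataclasses import dataclass, field


def blank_tape():
    # An unbounded strip of 0s: any cell not yet written reads as 0.
    return defaultdict(int)


@dataclass
class ThreeTapeMachine:
    # Hypothetical rule format (an assumption of this sketch):
    # (state, input_bit, work_bit) ->
    #     (next_state, work_write, work_move, output_write, advance_input)
    rules: dict
    state: str = "start"
    input_tape: dict = field(default_factory=blank_tape)
    work_tape: dict = field(default_factory=blank_tape)
    output_tape: dict = field(default_factory=blank_tape)
    in_pos: int = 0
    work_pos: int = 0
    out_pos: int = 0

    def load_input(self, bits):
        for i, bit in enumerate(bits):
            self.input_tape[i] = bit

    def step(self):
        key = (self.state,
               self.input_tape[self.in_pos],
               self.work_tape[self.work_pos])
        if key not in self.rules:
            return False  # no applicable rule: the machine halts
        next_state, work_write, work_move, out_write, advance = self.rules[key]
        self.work_tape[self.work_pos] = work_write
        self.work_pos += work_move        # work head: -1, 0, or +1
        if out_write is not None:
            self.output_tape[self.out_pos] = out_write
            self.out_pos += 1             # output head only moves right
        if advance:
            self.in_pos += 1              # input head only moves right
        self.state = next_state
        return True

    def written_output(self):
        return [self.output_tape[i] for i in range(self.out_pos)]


# Example program: copy each input bit to the output tape.
COPY_RULES = {
    ("start", i, w): ("start", w, 0, i, True)
    for i in (0, 1) for w in (0, 1)
}

machine = ThreeTapeMachine(rules=COPY_RULES)
machine.load_input([1, 0, 1])
for _ in range(3):
    machine.step()
print(machine.written_output())
```

The program is a fixed, finite table, which is the feature that matters for the argument later on: nothing in the machine's own rules describes the table itself being edited.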


We could imagine two such Turing machines communicating with each other. Call them 'Agent' and 'Environment', or 'Alice' and 'Everett'. Alice and Everett take turns acting. After Everett writes a bit to his output tape, that bit magically appears on Alice's input tape; and likewise, when Alice writes to her output tape, it gets copied to Everett's input tape. AI theorists have used this setup, which Marcus Hutter calls the cybernetic agent model, as an extremely simple representation of an agent that can perceive its environment (using the input tape), think (using the work tape), and act (using the output tape).2


A Turing machine model of agent-environment interactions. At first, the machines differ only in their programs. ‘Alice’ is the agent we want to build, while ‘Everett’ stands for everything else that’s causally relevant to Alice’s success.
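
The turn-taking circuit can be sketched in a few lines of Python. This is an illustration under simplifying assumptions rather than Hutter's formalism: each party is collapsed into a function from the bits it has received so far to its next output bit, Alice is assumed to move first, and the names `run_circuit`, `always_one`, and `echo` are hypothetical.

```python
def run_circuit(alice, everett, rounds):
    """Alternate turns. Each party sees only the bits the other has written."""
    alice_inputs = []    # bits copied from Everett's output tape
    everett_inputs = []  # bits copied from Alice's output tape
    for _ in range(rounds):
        a = alice(alice_inputs)        # Alice acts on her percepts so far
        everett_inputs.append(a)
        e = everett(everett_inputs)    # Everett responds to Alice's actions
        alice_inputs.append(e)
    return everett_inputs, alice_inputs


def always_one(received):
    # A trivial Alice: ignore all percepts and output 1.
    return 1


def echo(received):
    # A trivial Everett: output whatever bit was just received.
    return received[-1]


actions, percepts = run_circuit(always_one, echo, rounds=3)
print(actions, percepts)
```

Notice that the only arguments either function ever receives are lists of bits. The channel between the two parties is the whole of their interaction, and that is exactly the assumption examined below.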


We can define Alice and Everett's behavior in terms of any bit-producing Turing machines we'd like, including ones that represent probability distributions and do Bayesian updating. Alice might, for example, use her work tape to track four distinct possibilities and update probabilities over them:3

  • (a) Everett always outputs 0.
  • (b) Everett always outputs 1.
  • (c) Everett outputs its input.
  • (d) Everett outputs the opposite of its input.

Alice starts with a uniform prior, i.e., 25% probability each. If Alice's first output is 1, and Everett responds with 1, then Alice can store those two facts on her work tape and conditionalize on them both, treating them as though they were certain. This results in 0.5 probability each for (b) and (c), 0 probability for (a) and (d).
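
The update just described can be checked with a short Python sketch. The names `HYPOTHESES` and `update` are my own, and I use exact fractions so that no rounding obscures the result. Because each hypothesis is deterministic, its likelihood for an observed response is either 1 or 0.

```python
from fractions import Fraction

# Each hypothesis maps Alice's output bit to Everett's predicted response.
HYPOTHESES = {
    "a": lambda bit: 0,        # Everett always outputs 0
    "b": lambda bit: 1,        # Everett always outputs 1
    "c": lambda bit: bit,      # Everett outputs its input
    "d": lambda bit: 1 - bit,  # Everett outputs the opposite of its input
}


def update(prior, alice_output, everett_output):
    """Conditionalize on one exchange, treating the observation as certain."""
    unnormalized = {
        name: p if HYPOTHESES[name](alice_output) == everett_output else 0
        for name, p in prior.items()
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation is inconsistent with every hypothesis")
    return {name: p / total for name, p in unnormalized.items()}


prior = {name: Fraction(1, 4) for name in HYPOTHESES}
posterior = update(prior, alice_output=1, everett_output=1)
print(posterior)
```

The `ValueError` branch marks the case the toy model never handles: an observation that none of Alice's four hypotheses allows.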

We care about an AI's epistemology only because it informs the AI's behavior — on this model, its bit output. If Alice outputs whatever bits maximize her expected chance of receiving 1s as input, then we can say that Alice prefers to perceive 1. In the example I just gave, such a preference predicts that Alice will proceed to output 1 forever. Further exploration is unnecessary, since she knows of no other importantly different hypotheses to test.
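
To make the decision step concrete, here is a hedged sketch of an Alice who picks whichever output bit gives the highest probability of receiving a 1 next. It looks only one step ahead, which is a simplification of "expected chance of receiving 1s", and the names `prob_of_one` and `best_action` are hypothetical.

```python
# Each hypothesis maps Alice's output bit to Everett's predicted response.
HYPOTHESES = {
    "a": lambda bit: 0,
    "b": lambda bit: 1,
    "c": lambda bit: bit,
    "d": lambda bit: 1 - bit,
}


def prob_of_one(beliefs, action):
    """Probability, under Alice's beliefs, that Everett answers `action` with 1."""
    return sum(p for name, p in beliefs.items()
               if HYPOTHESES[name](action) == 1)


def best_action(beliefs):
    # One-step lookahead; ties go to 0 because max() keeps the first maximum.
    return max((0, 1), key=lambda action: prob_of_one(beliefs, action))


# Beliefs after outputting 1 and receiving 1, as in the example above.
beliefs = {"a": 0.0, "b": 0.5, "c": 0.5, "d": 0.0}
print(prob_of_one(beliefs, 0), prob_of_one(beliefs, 1), best_action(beliefs))
```

Outputting 1 is answered with 1 under both surviving hypotheses, while outputting 0 is answered with 1 only under (b). So Alice has no reason ever to output 0 again, and she never learns which of (b) and (c) is true.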

Enriching Alice's set of hypotheses for how Everett could act will let Alice win more games against a wider variety of Turing machines. The more programs Alice can pick out and assign a probability to, the more Turing machines Alice will be able to identify and intelligently respond to. If we aren't worried about whether it takes Alice ten minutes or a billion years to compute an update, and Everett will always patiently wait his turn, then we can simply have Alice perform perfect Bayesian updates; if her priors are right, and she translates her beliefs into sensible actions, she'll then be able to optimally respond to any environmental Turing machine.

For AI researchers following Solomonoff's lead, that's the name of the game: Figure out the program that will let Alice behave optimally while communicating with as wide a range of Turing machines as possible, and you've at least solved the theoretical problem of picking out the optimal artificial agent from the space of possible reasoners. The agent/environment model here may look simple, but a number of theorists see it as distilling into its most basic form the task of an AGI.2

Yet a Turing machine, like a cellular automaton, is an abstract machine — a creature of thought experiments and mathematical proofs. Physical computers can act like abstract computers, in just the same sense that heaps of apples can behave like the abstract objects we call 'numbers'. But computers and apples are high-level generalizations, imperfectly represented by concise equations.4 When we move from our mental models to trying to build an actual AI, we have to pause and ask how well our formalism captures what's going on in reality.


The problem with Alice

'Sensory input' or 'data' is what I call the information Alice conditionalizes on; and 'beliefs' or 'hypotheses' is what I call the resultant probability distribution and representation of possibilities (in Alice's program or work tape). This distinction seems basic to reasoning, so I endorse programming agents to treat them as two clearly distinct types. But in building such agents, we introduce the possibility of Cartesianism.

René Descartes held that human minds and brains, although able to causally interact with each other, can each exist in the absence of the other; and, moreover, that the properties of purely material things can never fully explain minds. In his honor, we can call a model or procedure Cartesian if it treats the reasoner as a being separated from the physical universe. Such a being can perceive (and perhaps alter) physical processes, but it can't be identified with any such process.5

The relevance of Cartesians to AGI work is that we can model them as agents experiencing a strong type distinction between 'mind' and 'matter', and an unshakable belief in the metaphysical independence of those two categories; because they're of such different kinds, they can vary independently. So we end up with AI errors that are the opposite of TALE-SPIN's — like an induction procedure that distinguishes gravity's type from embodied characters' types so strongly that it cannot hypothesize that, say, particles underlie or mediate both phenomena.

My claim is that if we plug in 'Alice's sensory data' for 'mind' and 'the stuff Alice hypothesizes as causing the sensory data' for 'matter', then agents that can only model themselves using the cybernetic agent model are Cartesian in the relevant sense.6

The model is Cartesian because the agent and its environment can only interact by communicating. That is, their only way of affecting each other is by trading bits printed to tapes.

If we build an actual AI that believes it's like Alice, it will believe that the environment can't affect it in ways that aren't immediately detectable, can't edit its source code, and can't force it to halt. But that makes the Alice-Everett system almost nothing like a physical agent embedded in a real environment. Under many circumstances, a real AI's environment will alter it directly. E.g., the AI can fall into a volcano. A volcano doesn't harm the agent by feeding unhelpful bits into its environmental sensors. It harms the agent by destroying it.

A more naturalistic model would say: Alice outputs a bit; Everett reads it; and then Everett does whatever the heck he wants. That might be feeding a new bit into Alice. Or it might be vandalizing Alice's work tape, or smashing Alice flat.


A robotic Everett tampering with an agent that mistakenly assumes Cartesianism. A real-world agent’s computational states have physical correlates that can be directly edited by the environment. If the agent can't model such scenarios, its reasoning (and resultant decision-making) will suffer.


A still more naturalistic approach would be to place Alice inside of Everett, as a subsystem. In the real world, agents are surrounded by their environments. The two form a cohesive whole, bound by the same physical laws, freely interacting and commingling.

If Alice only worries about whether Everett will output a 0 or 1 to her sensory tape, then no matter how complex an understanding Alice has of Everett's inner workings, Alice will fundamentally misunderstand the situation she's in. Alice won't be able to represent hypotheses about how, for example, a pill might erase her memories or otherwise modify her source code.

Humans, in contrast, can readily imagine a pill that modifies our memories. It seems childishly easy to hypothesize being changed by avenues other than perceived sensory information. The limitations of the cybernetic agent model aren't immediately obvious, because it isn't easy for us to put ourselves in the shoes of agents with alien blind spots.

There is an agent-environment distinction, but it's a pragmatic and artificial one. The boundary between the part of the world we call 'agent' and the part we call 'not-agent' (= 'environment') is frequently fuzzy and mutable. If we want to build an agent that's robust across many environments and self-modifications, we can't just design a program that excels at predicting sensory sequences generated by Turing machines. We need an agent that can form accurate beliefs about the actual world it lives in, including accurate beliefs about its own physical underpinnings.


From Cartesianism to naturalism

What would a naturalized self-model, a model of the agent as a process embedded in a lawful universe, look like? As a first attempt, one might point to the pictures of Cai in Building Phenomenological Bridges.


Cai has a simple physical model of itself as a black tile at the center of a cellular automaton grid. Cai's phenomenological bridge hypotheses relate its sensory data to surrounding tiles' states.


But this doesn't yet specify a non-Cartesian agent. To treat Cai as a Cartesian, we could view the tiles surrounding Cai as the work tape of Everett, and the dynamics of Cai's environment as Everett's program. (We can also convert Cai's perceptual experiences into a binary sequence on Alice/Cai's input tape, with a translation like 'cyan = 01, magenta = 10, yellow = 11'.)
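
The color-to-bits translation is simple enough to write out. This sketch uses only the three codes given above; the function names are hypothetical, and since the text assigns nothing to '00', the decoder simply fails on that pair.

```python
COLOR_TO_BITS = {"cyan": "01", "magenta": "10", "yellow": "11"}
BITS_TO_COLOR = {bits: color for color, bits in COLOR_TO_BITS.items()}


def encode(colors):
    """Turn a sequence of color percepts into a string of input-tape bits."""
    return "".join(COLOR_TO_BITS[color] for color in colors)


def decode(bits):
    """Recover the color sequence from a bit string of even length."""
    if len(bits) % 2 != 0:
        raise ValueError("expected two bits per color")
    return [BITS_TO_COLOR[bits[i:i + 2]] for i in range(0, len(bits), 2)]


tape = encode(["cyan", "yellow", "magenta"])
print(tape, decode(tape))
```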


Alice/Cai as a cybernetic agent in a Turing machine circuit.


The problem isn't that Cai's world is Turing-computable, of course. It's that if Cai's hypotheses are solely about what sorts of perception-correlated patterns of environmental change can occur, then Cai's models will be Cartesian.


Cai as a Cartesian treats its sensory experiences as though they exist in a separate world.


Cartesian Cai recognizes that its two universes, its sensory experiences and hypothesized environment, can interact. But it thinks they can only do so via a narrow range of stable pathways. No actual agent's mind-matter connections can be that simple and uniform.

If Cai were a robot in a world resembling its model, it would itself be a complex pattern of tiles. To form accurate predictions, it would need to have self-models and bridge hypotheses that were more sophisticated than any I've considered so far. Humans are the same way: No bridge hypothesis explaining the physical conditions for subjective experience will ever fit on a T-shirt.


Cai's world divided up into a 9x9 grid. Cai is the central 3x3 grid. Barely visible: Complex computations like Cai's reasoning are possible in this world because they're implemented by even finer tile patterns at smaller scales.


Changing Cai's tiles' states — from black to white, for example — could have a large impact on its computations, analogous to changing a human brain from solid to gaseous. But if an agent's hypotheses are all shaped like the cybernetic agent model, 'my input/output algorithm is replaced by a dust cloud' won't be in the hypothesis space.

If you programmed something to think like Cartesian Cai, it might decide that its sequence of visual experiences will persist even if the tiles forming its brain completely change state. It wouldn't be able to entertain thoughts like 'if Cai performs self-modification #381, Cai will experience its environment as smells rather than colors' or 'if Cai falls into a volcano, Cai gets destroyed'. No pattern of perceived colors is identical to a perceived smell, or to the absence of perception.

To form naturalistic self-models and world-models, Cai needs hypotheses that look less like conversations between independent programs, and more like worlds in which it is a fairly ordinary subprocess, governed by the same general patterns. It needs to form and privilege physical hypotheses under which it has parts, as well as bridge hypotheses under which those parts correspond in plausible ways to its high-level computational states.

Cai wouldn't need a complete self-model in order to recognize general facts about its subsystems. Suppose, for instance, that Cai has just one sensor, on its left side, and a motor on its right side. Cai might recognize that the motor and sensor regions of its body correspond to its introspectible decisions and perceptions, respectively.


A naturalized agent can recognize that it has physical parts with varying functions. Cai's top and bottom lack sensors and motors altogether, making it clearer that Cai's environment can impact Cai by entirely non-sensory means.


We care about Cai's models because we want to use Cai to modify its environment. For example, we may want Cai to convert as much of its environment as possible into grey tiles. Our interest is then in the algorithm that reliably outputs maximally greyifying actions when handed perceptual data.

If Cai is able to form sophisticated self-models, then Cai can recognize that it's a grey tile maximizer. Since it wants there to be more grey tiles, it also wants to make sure that it continues to exist, provided it believes that it's better than chance at pursuing its goals.

More specifically, Naturalized Cai can recognize that its actions are some black-box function of its perceptual computations. Since it has a bridge hypothesis linking its perceptions to its middle-left tile, it will then reason that it should preserve its sensory hardware. Cai's self-model tells it that if its sensor fails, then its actions will be based on beliefs that are much less correlated with the environment. And its self-model tells it that if its actions are poorly calibrated, then there will be fewer grey tiles in the universe. Which is bad.


A naturalistic version of Cai can reason intelligently from the knowledge that its actions (motor output) depend on a specific part of its body that's responsible for perception (environmental input).


A physical Cai might need to foresee scenarios like 'an anvil crashes into my head and destroys me', and assign probability mass to them. Bridge hypotheses expressive enough to consider that possibility would not just relate experiences to environmental or hardware states; they would also recognize that the agent's experiences can be absent altogether.


An anvil can destroy Cai's perceptual hardware by crashing into it. A Cartesian might not worry about this eventuality, expecting its experience to persist after its body is smashed. But a naturalized reasoner will form hypotheses like the above, on which its sequence of color experiences suddenly terminates when its sensors are destroyed.
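
One way to see the difference is in the type of the two kinds of bridge hypothesis. In the toy sketch below, every name and the dictionary-shaped world state are hypothetical, and nothing here is a proposed formalism. The Cartesian bridge must always return a color, while the naturalized bridge may return `None`, meaning 'no experience at all'.

```python
from typing import Optional


def cartesian_bridge(world: dict) -> str:
    # Always predicts some color experience, whatever happens to the body.
    return world["neighbor_color"]


def naturalized_bridge(world: dict) -> Optional[str]:
    # Predicts that experience stops once the sensor hardware is destroyed.
    if not world["sensor_intact"]:
        return None
    return world["neighbor_color"]


before_anvil = {"neighbor_color": "cyan", "sensor_intact": True}
after_anvil = {"neighbor_color": "cyan", "sensor_intact": False}

print(cartesian_bridge(after_anvil), naturalized_bridge(after_anvil))
```

The two hypotheses agree for as long as the sensor works. They differ only in whether 'my experiences end' is something the agent can predict at all, and so something it can assign probability to and act to avoid.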


This point generalizes to other ways Cai might self-modify, and to other things Cai might alter about itself. For example, Cai might learn that other portions of its brain correspond to its hypotheses and desires.


Another very simple model of how different physical structures are associated with different computational patterns.


This allows Cai to recognize that its goals depend on the proper functioning of many of its hardware components. If Cai believes that its actions depend on its brain's goal unit's working a specific way, then it will avoid taking pills that foreseeably change its goal unit. If Cai's causal model tells it that agents like it stop exhibiting future-steering behaviors when they self-modify to have mad priors, then it won't self-modify to acquire mad priors. And so on.


If Cai's motor fails, its effect on the world can change as a result. The same is true if its hardware is modified in ways that change its thoughts, or its preferences (i.e., the thing linking its conclusions to its motor).


Once Cai recognizes that its brain needs to work in a very specific way for its goals to be achieved, its preferences can take its physical state into account in sensible ways, without our needing to hand-code Cai at the outset to have the right beliefs or preferences over every individual thing that could change in its brain.

Just the opposite is true for Cartesians. Since they can't form hypotheses like 'my tape heads will stop computing digits if I disassemble them', they can only intelligently navigate such risks if they've been hand-coded in advance to avoid perceptual experiences the programmer thought would correlate with such dangers.

In other words, even though all of this is still highly informal, there's already some cause to think that a reasoning pattern like Naturalized Cai can generalize in ways that Cartesians can't. The programmers don't need to know everything about Cai's physical state, or anticipate everything about what future changes Cai might undergo, if Cai's epistemology allows it to easily form accurate reductive beliefs and behave accordingly. An agent like this might be adaptive and self-correcting in very novel circumstances, leaving more wiggle room for programmers to make human mistakes.


Bridging maps of worlds and maps of minds

Solomonoff-style dualists have alien blind spots that lead them to neglect the possibility that some hardware state is equivalent to some introspected computation '000110'. TALE-SPIN-like AIs, on the other hand, have blind spots that lead to mistakes like trying to figure out the angular momentum of '000110'.

A naturalized agent doesn't try to do away with the data/hypothesis type distinction and acquire a typology as simple as TALE-SPIN's. Rather, it tries to tightly interconnect its types using bridges. Naturalizing induction is about combining the dualist's useful map/territory distinction with a more sophisticated metaphysical monism than TALE-SPIN exhibits, resulting in a reductive monist AI.7

Alice's simple fixed bridge axiom, {environmental output 0 ↔ perceptual input 0, environmental output 1 ↔ perceptual input 1}, is inadequate for physically embodied agents. And the problem isn't just that Alice lacks other bridge rules and can't weigh evidence for or against each one. Bridge hypotheses are a step in the right direction, but they need to be diverse enough to express a variety of correlations between the agent's sensory experiences and the physical world, and they need a sensible prior. An agent that only considers bridge hypotheses compatible with the cybernetic agent model will falter whenever it and the environment interact in ways that look nothing like exchanging sensory bits.

With the help of an inductive algorithm that uses bridge hypotheses to relate sensory data to a continuous physical universe, we can avoid making our AIs Cartesians. This will make their epistemologies much more secure. It will also make it possible for them to want things to be true about the physical universe, not just about the particular sensory experiences they encounter. Actually writing a program that does all this is an OPFAI. Even formalizing how bridge hypotheses ought to work in principle is an OPFAI.

In my next post, I'll move away from toy models and discuss AIXI, Hutter's optimality definition for cybernetic agents. In asking whether the best Cartesian can overcome the difficulties I've described, we'll get a clearer sense of why Solomonoff inductors aren't reflective and reductive enough to predict drastic changes to their sense-input-to-motor-output relation — and why they can't be that reflective and reductive — and why this matters.




1 Meehan (1977). Colin Allen first introduced me to this story. Dennett discusses it as well. 

2 E.g., Durand, Muchnik, Ushakov & Vereshchagin (2004), Epstein & Betke (2011), Legg & Veness (2013), Solomonoff (2011). Hutter (2005) uses the term "cybernetic agent model" to emphasize the parallelism between his Turing machine circuit and control theory's cybernetic systems.

3 One simple representation would be: Program Alice to write to her work tape, on round one, 0010 (standing for 'if I output 0, Everett outputs 0; if I output 1, Everett outputs 0'). Ditto for the other three hypotheses, 0111, 0011, and 0110. Then write the hypothesis' probability in binary (initially 25%, represented '11001') to the right of each, and program Alice to edit this number as she receives new evidence. Since the first and third digit stay the same, we can simplify the hypotheses' encoding to 00, 11, 01, 10. Indeed, if the hypotheses remain the same over time there's no reason to visibly distinguish them in the work tape at all, when we can instead just program Alice to use the left-to-right ordering of the four probabilities to distinguish the hypotheses. 
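
The encoding in this note can be spelled out as a small Python sketch, with hypothetical helper names. Each hypothesis becomes a four-digit string read as 'input 0, response, input 1, response', and the short form keeps only the two response digits.

```python
def encode_hypothesis(response_to_0, response_to_1):
    # Long form: 'if I output 0, Everett outputs X; if I output 1, Everett outputs Y'.
    return f"0{response_to_0}1{response_to_1}"


def shorten(encoding):
    # The first and third digits are always 0 and 1, so keep the other two.
    return encoding[1] + encoding[3]


long_forms = {
    "always 0": encode_hypothesis(0, 0),
    "always 1": encode_hypothesis(1, 1),
    "copy": encode_hypothesis(0, 1),
    "opposite": encode_hypothesis(1, 0),
}
short_forms = {name: shorten(code) for name, code in long_forms.items()}
prior_percent = format(25, "b")  # 25 written in binary

print(long_forms, short_forms, prior_percent)
```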

4 To the extent our universe perfectly resembles any mathematical structure, it's much more likely to do so at the level of gluons and mesons than at the level of medium-sized dry goods. The resemblance of apples to natural numbers is much more approximate. Two apples and three apples generally make five apples, but when you start cutting up or pulverizing or genetically altering apples, you may find that other mathematical models do a superior job of predicting the apples' behavior. It seems likely that the only perfectly general and faithful mathematical representation of apples will be some drastically large and unwieldy physics equation.

Ditto for machines. It's sometimes possible to build a physical machine that closely mimics a given Turing machine — but only 'closely', as Turing machines have unboundedly large tapes. And although any halting Turing machine can in principle be simulated with a bounded tape (Cockshott & Michaelson (2007)), nearly all Turing machine programs are too large to even be approximated by any physical process.

All physical machines structurally resemble Turing machines in ways that allow us to draw productive inferences from the one group to the other. See Piccinini's (2011) discussion of the physical Church-Turing thesis. But, for all that, the concrete machine and the abstract one remain distinct. 

5 Descartes (1641): "[A]lthough I certainly do possess a body with which I am very closely conjoined; nevertheless, because, on the one hand, I have a clear and distinct idea of myself, in as far as I am only a thinking and unextended thing, and as, on the other hand, I possess a distinct idea of body, in as far as it is only an extended and unthinking thing, it is certain that I (that is, my mind, by which I am what I am) am entirely and truly distinct from my body, and may exist without it."

From this it’s clear that Descartes also believed that the mind can exist without the body. This interestingly parallels the anvil problem, which I'll discuss more in my next post. However, I don't build immortality into my definition of 'Cartesianism'. Not all agents that act as though there is a Cartesian barrier between their thoughts and the world think that their experiences are future-eternal. I'm taking care not to conflate Cartesianism with the anvil problem because the formalism I'll discuss next time, AIXI, does face both of them. Though the problems are logically distinct, it's true that a naturalized reasoning method would be much less likely to face the anvil problem. 

6 This isn't to say that a Solomonoff inductor would need to be conscious in anything like the way humans are conscious. It can be fruitful to point to similarities between the reasoning patterns of humans and unconscious processes. Indeed, this already happens when we speak of unconscious mental processes within humans.

Parting ways with Descartes (cf. Kirk (2012)), many present-day dualists would in fact go even further than reductionists in allowing for structural similarities between conscious and unconscious processes, treating all cognitive or functional mental states as (in theory) realizable without consciousness. E.g., Chalmers (1996): "Although consciousness is a feature of the world that we would not predict from the physical facts, the things we say about consciousness are a garden-variety cognitive phenomenon. Somebody who knew enough about cognitive structure would immediately be able to predict the likelihood of utterances such as 'I feel conscious, in a way that no physical object could be,' or even Descartes's 'Cogito ergo sum.' In principle, some reductive explanation in terms of internal processes should render claims about consciousness no more deeply surprising than any other aspect of behavior." 

7 And since we happen to live in a world made of physics, the kind of monist we want in practice is a reductive physicalist AI. We want a 'physicalist' as opposed to a reductive monist that thinks everything is made of monads, or abstract objects, or morality fluid, or what-have-you.



∙ Chalmers (1996). The Conscious Mind: In Search of a Fundamental Theory. Oxford University Press.

∙ Cockshott & Michaelson (2007). Are there new models of computation? Reply to Wegner and Eberbach. The Computer Journal, 50: 232-247.

∙ Descartes (1641). Meditations on first philosophy, in which the existence of God and the immortality of the soul are demonstrated.

∙ Durand, Muchnik, Ushakov & Vereshchagin (2004). Ecological Turing machines. Lecture Notes in Computer Science, 3142: 457-468.

∙ Epstein & Betke (2011). An information-theoretic representation of agent dynamics as set intersections. Lecture Notes in Computer Science, 6830: 72-81.

∙ Hutter (2005). Universal Artificial Intelligence: Sequential Decisions Based on Algorithmic Probability. Springer.

∙ Kirk (2012). Zombies. In Zalta (ed.), The Stanford Encyclopedia of Philosophy.

∙ Legg & Veness (2013). An approximation of the Universal Intelligence Measure. Lecture Notes in Computer Science, 7070: 236-249.

∙ Meehan (1977). TALE-SPIN, an interactive program that writes stories. Proceedings of the 5th International Joint Conference on Artificial Intelligence: 91-98.

∙ Piccinini (2011). The physical Church-Turing thesis: Modest or bold? British Journal for the Philosophy of Science, 62: 733-769.

∙ Russell & Norvig (2010). Artificial Intelligence: A Modern Approach. Prentice Hall.

∙ Solomonoff (2011). Algorithmic probability — its discovery — its properties and application to Strong AI. In Zenil (ed.), Randomness Through Computation: Some Answers, More Questions (pp. 149-157).


I really appreciate your clear expositions!

I thought of a phrase to quickly describe the gist of this problem: You need your AI to realize that the map is part of the territory.

Also, I was thinking that the fact that this is a problem might be a good thing. A Cartesian agent would probably be relatively slower at FOOMing, since it can't natively conceive of modifying itself. (I still think a sufficiently intelligent one would still be highly dangerous and capable of FOOMing, though.) A bigger advantage might be that it could potentially be used to control a 'baby' AI that is still being trained/built, since there is this huge blind spot in the way they can model the world. For example, imagine that a Cartesian AI is trying to increase its computational power, and it notices that there happens to be a lot of computational power right in easy access! So it starts reprogramming it to suit its own nefarious needs - but whoops, it just destroyed itself. Might act as a sort of fuse for a too-ambitious AI. Or maybe, this could be used to more safely grow a seed AI - you tell it to write a design for a better version of itself. Then you could turn it off (which is easier to do since it is Cartesian), check that the design was sound, build it, and then work on the next generation AI, instead of trying to let it FOOM in controlled intervals. At some point, you could presumably ask it to solve this problem, and then design a new generation based on that. I don't know how plausible these scenarios are, but it is interesting to think about.

Thanks, Adele!

You need your AI to realize that the map is part of the territory.

That's right, if you mean 'representations exist, so they must be implemented in physical systems'.

But the Cartesian agrees with 'the map is part of the territory' on a different interpretation. She thinks the mental and physical worlds both exist (as distinct 'countries' in a larger territory). Her error is just to think that it's impossible to redescribe the mental parts of the universe in physical terms.

A Cartesian agent would probably be relatively slower at FOOMing

An attempt at a Cartesian seed AI would probably just break, unless it overcame its Cartesianness by some mostly autonomous evolutionary algorithm for generating successful successor-agents. A human programmer could try to improve it over time, but it wouldn't be able to rely much on the AI's own intelligence (because self-modification is precisely where the AI has no defined hypotheses), so I'd expect the process to become increasingly difficult and slow and ineffective as we reached the limits of human understanding.

I think the main worry with Cartesians isn't that they're dumb-ish, so they might become a dangerously unpredictable human-level AI or a bumbling superintelligence. The main worry is that they're so dumb that they'll never coalesce into a working general intelligence of any kind. Then, while the build-a-clean-AI people (who are trying to design simple, transparent AGIs with stable, defined goals) are busy wasting their time in the blind alley of Cartesian architectures, some random build-an-ugly-AI project will pop up out of left field and eat us.

Build-an-ugly-AI people care about sloppy, quick-and-dirty search processes, not so much about AIXI or Solomonoff. So the primary danger of Cartesians isn't that they're Unfriendly; it's that they're shiny objects distracting a lot of the people with the right tastes and competencies for making progress toward Friendliness.

The bootstrapping idea is probably a good one: There's no way we'll succeed at building a perfect FAI in one go, so the trick will be to cut corners in all the ways that can get fixed by the system, and that don't make the system unsafe in the interim. I'm not sure Cartesianism is the right sort of corner to cut. Yes, the AI won't care about self-preservation; but it also won't care about any other interim values we'd like to program it with, except ones that amount to patterns of sensory experience for the AI.

The "build a clean Cartesian AI" folks, Schmidhuber and Hutter, are much closer to "describe how to build a clean naturalistic AI given unlimited computing power" than, say, Lenat's Eurisko is to AIXI. It's just that AIXI won't actually work as a conceptual foundation for the reasons given, nay it is Solomonoff induction itself which will not work as a conceptual foundation, hence considering naturalized induction as part of the work to be done along the way to OPFAI. The worry from Eurisko-style AI is not that it will be Cartesian and therefore bad, but that it will do self-modification in a completely ad-hoc way and thus have no stable specifiable properties nor be apt to grafting on such. To avoid that, we want to do a cleaner system; and then, doing a cleaner system, we wish it to be naturalistic rather than Cartesian for the given reasons. Also, once you sketch out how a naturalistic system works, it's very clear that these are issues central to stable self-modification - the system's model of how it works and its attempt to change it.

I think you are conflating two different problems:

  • How to learn by reinforcement in an unknown non-ergodic environment (e.g. one where it is possible to drop an anvil on your head)

  • How to make decisions that take into account future reward, in a non-ergodic environment, where actions may modify the agent.

The first problem is well known in the reinforcement learning community, and in fact it is also mentioned in the first AIXI papers, but it is sidestepped with an ergodicity assumption rather than addressed.
I don't think there can be really general solutions for this problem: you need some environment-specific prior or supervision.
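
To make the first problem concrete, here is a minimal sketch of why blind exploration is costly in a non-ergodic environment. The setup is an assumption for illustration only: one of `n_actions` actions is fatal and irreversible (the anvil), and the agent explores uniformly at random with probability `explore_rate` on every step. The function name and numbers are hypothetical, not taken from the AIXI papers.

```python
def survival_probability(explore_rate, n_actions, steps):
    """Chance of never picking the single fatal action in `steps` steps.

    Assumes the agent explores uniformly at random with probability
    `explore_rate` and otherwise picks an action it knows to be safe.
    """
    p_fatal_per_step = explore_rate / n_actions
    return (1 - p_fatal_per_step) ** steps


# Hypothetical numbers: 10% exploration, 10 actions, 1000 steps.
# Each step carries a 1% chance of the irreversible mistake.
p_alive = survival_probability(0.1, 10, 1000)
print(p_alive)
```

Under these assumptions the survival chance shrinks geometrically with the number of steps, so an agent cannot simply try everything and learn from the outcome. That is the gap an environment-specific prior or supervision would have to fill.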

The second problem doesn't seem as hard as the first one.
AIXI, of course, can't model self-modifications, because it is incomputable and it can only deal with computable environments, but computable varieties of AIXI (Schmidhuber's Gödel machine, perhaps?) can easily represent themselves as part of the environment.

Thank you, this helps clarify things for me.

Yes, the AI won't care about self-preservation; but it also won't care about any other interim values we'd like to program it with, except ones that amount to patterns of sensory experience for the AI.

I get why AIXI would behave like this, but it's not obvious to me that all Cartesian AIs would probably have this problem. If the AI has some model of the world, and this model can still update (mostly correctly) based on what the sensory channel inputs, and predict (mostly correctly) how different outputs can change the world, it seems like it could still try to maximize making as many paperclips as possible according to its model of the world. Does that make sense?

Alex Mennen designed a Cartesian with preferences over its environment: A utility-maximizing variant of AIXI.

That's a good point. AIXI is my go-to example, and AIXI's preferences are over its input tape. But, sticking to the cybernetic agent model, there are other action-dependent things Alice could have preferences over, like portions of her work tape, or her actions themselves. She could also have preferences over input-conditional logical constructs out of Everett's program, like Everett's work tape contents.

I agree it's possible to build a non-AIXI-like Cartesian that wants to make paperclips, not just produce paperclip-experiences in itself. But Cartesians are weird, so it's hard to predict how much progress that would represent.

For example, the Cartesian might wirehead under the assumption that doing so changes reality, instead of wireheading under the assumption that doing so changes its experiences. I don't know whether a deeply dualistic agent would recognize that editing its camera to create paperclip hallucinations counts as editing its input sequence semi-directly. It might instead think of camera-hacking as a godlike way of editing reality as a whole, as though Alice had the power to create billions of representations of objective physical paperclips in Everett's work tape just by editing the part of Everett's work tape representing her hardware.

In general, I'm worried about including anything reminiscent of Cartesian reasoning in our 'the seed AI can help us solve this' corner-cutting category, because I don't formally understand the precise patterns of mistakes Cartesians make well enough to think I can predict them and stay two steps ahead of those errors. And in the time it takes to figure out exactly which patches would make Cartesians safe and predictable without rendering them useless, it's plausible we could have just built a naturalized architecture from scratch.

RobbBB, I want to draw your attention to the model I constructed, which solves the duality and ontology problems in AIXI.

Since then I've made some improvements, in particular quasi-Solomonoff induction should be constructed slightly differently and, more importantly, I realized the correct way to use UDT in this model. Planning to write about this soon.


I'm not sure that using this notation is a good idea, given that at least some of the readers unfamiliar with it are likely to initially parse it as "naturalized not-Cai". Even I did for a brief moment, because I was parsing the writing using my logic!brain rather than my fanfiction!brain.

Where does that notation come from, anyway? I know I've seen it on LJ, AO3, Tumblr, and, but as far as I can remember it just appeared out of thin air sometime in the mid-2000s. Do you have a sense of the etymology?

It's used in Microsoft Excel. If you have multiple worksheets, you preface a cell reference with "!" to specify which sheet you want that cell reference to be resolved on.


"A3" means "the value of the cell in column A, row 3, on the current sheet", whereas

"Sheet1!A3" means "the value of the cell in column A, row 3, on Sheet 1".

Here are two theories that I find much more plausible than Excel.

Added: I said two theories, but the differences are small and not really relevant here. They agree on the essential point that it started as "Action! Mulder" (or something similar) with more normal spacing, with the exclamation point associated with the modifier and functioning pretty much as normal.

I always assumed it was a reference to bang paths. It seems more likely to me that Eliezer would reference something that appears in the Jargon File than syntax from Excel.

Well, Eliezer presumably is referencing something that appeared in fanfiction/Tumblr/etc. culture; where said culture got the notation has nothing to do with Eliezer.

Bang paths seem an unlikely candidate, as they don't actually make a good metaphor for what's being conveyed here.

Interesting. I did not know that it was used prior to him, and I apparently have poor reading comprehension. I definitely agree that the Excel metaphor makes more sense.

Huh. Not something I would have guessed.

Thanks, that's actually interesting.

A similar notation, and one which I believe Eliezer has used in the past (somewhere in the Sequences), is the scope resolution operator, used in C++ and PHP (and probably elsewhere):

std::cout

which means: the object "cout", in the namespace "std". (As opposed to just "cout", which would mean: "the object 'cout' in the current namespace".)

I can only conclude from this that the Tumblr-and-fanfiction crowd contains more finance types than programmers.

Yeah, I'd been aware of the scope resolution operator (I'm a programmer working in C++), though in context I think a cast, or maybe even template syntax, might be more appropriate: Rational!Harry in fanfic parlance seems to mean something closer to "Harry reconstrued as Rational" or "Harry built around the Rational type" than "Harry resolved to an existing instance in the Rational scope". Excel isn't something I've had much occasion to use, though.

It'd have to be a C-style cast or a reinterpret_cast, though -- we can't guarantee that the target type is a member of the canonical inheritance hierarchy. Though const_cast might have potential for some characters...

Heh. So: Harry , or Rational (Harry), or (Rational) Harry (for C-style casting)? That would be amusing to see. It does seem slightly less readable, though.

(Rational) Harry

Seemed eminently more readable than rationalist!Harry to me when I first encountered this notation, although now it's sunk in enough that my brain actually generated "that's more keystrokes!" as a reason not to switch style.

Just curious (and not necessarily addressed to you specifically), but what on Earth is wrong with the standard, conventional English notation for this, which is a hyphen? E.g. "Rational-Harry" etc.

I'm not a linguist, but hyphen-compounding doesn't look quite right to me in this context; you usually see that for disambiguation, in compound participles ("moth-eaten"; "hyphen-compounding"), or to cover a few odd cases like common names derived from phrases ("jack-in-the-pulpit"). I think standard English would be to simply treat the modifier as an adjective ("Rational Harry"; "Girl Blaise"; "Death Eater Ron"); nouns often get coerced into their adjective form here if possible, but it's common to see modifying nouns even if no adjective form exists.

As to why it doesn't get used this way in fan jargon... who knows, but fans do tend to share a (mildly irritating) fondness for unusual lexical and grammatical constructions ("I have lost my ability to can"). Probably just a shibboleth thing.

I was looking for an explanation of why the exclamation point was used in preference to the already-existing hyphen notation. Instead, that page only contains an explanation of the meaning and the origin of the exclamation-point notation, and does not compare it to the hyphen notation at all.

I don't think it's listed explicitly at either of the links, but the principle I'm using is that of hyphenating when you want to make clear that a compound is a compound, and not (e.g.) an adjective happening contingently to modify a noun.

This used to be done a lot more often, e.g. "magnifying-glass". I generally dislike the trend of eliminating such hyphens.

But in any case my question is the same even if you prefer "Rational Harry" to "Rational-Harry"; why "Rational!Harry" instead of one of the former?

Rational!Harry describes a character similar to the base except persistently Rational, for whatever reason. Rational-Harry describes a Harry which is rational, but it's nonstandard usage and might confuse a few people (Is his name "Rational-Harry"? Do I have to call him that in-universe to differentiate him from Empirical-Harry and Oblate-Spheroid-Harry?). Rational Harry might just be someone attaching an adjective to Harry to indicate that at the moment, he's rational, or more rational by contrast to Silly Dumbledore.

Anyway, adj!noun is a compound with a well-defined purpose within a fandom: to describe how a character differs from canon. It's an understood notation, and the convention, so everyone uses it to prevent misunderstandings. Outside of fandom things, using it signals casualness and fandom-savviness to those in fandom culture, and those who aren't familiar with fandom culture can understand it and don't notice the in-joke.

I always figured it was like the scope resolution operator ("::") in C++, but in some weird functional language that AI people liked.

Yes. I used it in an earlier version of this post reflexively, without even thinking about the connection to fanfics. My thinking was just 'this is clearer than subscript notation, and is a useful and commonplace LW shibboleth'.

Rational Harry might just be someone attaching an adjective to Harry to indicate that at the moment, he's rational, or more rational by contrast to Silly Dumbledore.

Yes, that's why I favor the hyphen (in response to shminux above).

I agree that using ! is non-standard outside the fandom cultures. It looked weird to me when I first saw it. Sometimes I am still not sure what goes first, the canon character or the derivative qualifier, especially for crossovers (is it SailorMoon!Harry or Harry!SailorMoon, to take a particularly silly example). However, a special delimiter is needed as a shorthand for "a derivative work based on with elements of ", and space or a dash is not unambiguous enough. The "bang notation" appears to be one of those memetic leaks from subcultures to the mainstream which is likely to survive for some time.

I don't think it's listed explicitly at either of the links, but the principle I'm using is that of hyphenating when you want to make clear that a compound is a compound, and not (e.g.) an adjective happening contingently to modify a noun.

Except Adj-Noun compounds are not actually productive in English. (Also, magnifying glass is arguably from "magnifying" the gerund, not the participle.)

Heh, logic!brain is definitely something I want to encourage. Fixed.

this is why i like ¬

script your keyboard! make it so that the chords ~1 and 1~ output a '¬'! or any other chord, really

if this actually sounds interesting and you use windows you can grab my script at

And since we happen to live in a world made of physics, the kind of monist we want in practice is a reductive physicalist AI. We want a 'physicalist' as opposed to a reductive monist that thinks everything is made of monads, or abstract objects, or morality fluid, or what-have-you

This may be nitpicky, but I'd like our AI to leave open the possibility of a non-physical ontology. We don't yet know that our world is made of physics. Even though it seems like it is. An analogy: It would be bad to hard-code our AI to have an ontology of wave-particles, since things might turn out to be made of strings/branes. So we shouldn't rule out other possibilities either.

I'm not sure what you have in mind when you say 'non-physical ontology'. Physics at this point is pretty well empirically confirmed, so it doesn't seem likely we'll discover it's All A Lie tomorrow. On the other hand, you might have in mind a worry like:

  • How much detail of our contemporary scientific world-view is it safe to presuppose in building the AI, without our needing to seriously worry that tomorrow we'll have a revolution in physics that's outside of our AI's hypothesis space?

  • In particular: Might we discover that physics as we know it is a high-level approximation of a mathematical structure that looks nothing like physics as we know it?

  • To what extent is it OK if the world turns out to be non-computable but the AI can only hypothesize computable environments?

These are all very serious, and certainly not nitpicky. My last couple of posts in this sequence will be about the open problem 'Given that we want our AGI's hypotheses to look like immersive worlds rather than like communicating programs, how do we formalize "world"?' If we were building this thing at the turn of the 20th century, we might have assumed that it was safe to build 'made of atoms' into our conception of 'physical', and let the AI only think in terms of configurations of atoms. What revisable assumptions about the world might be in the background of our current thinking, that we ought to have the AI treat as revisable hypotheses and not as fixed axioms?

The worry I had in mind is pretty well captured by your three bullet points there, though I think you are phrasing it in a weaker way than it deserves. Consider the Simulation Hypothesis combined with the hypothesis that the higher-level universe running the simulation does not follow rules remotely like those of modern physics. If it is true, then an AI which is hard-coded to only consider "physical" theories will be bad.

I'm not sure what you mean by (paraphrase) 'we want our AI to be a reductive physicalist monist.' I worried that you meant something like "We want our AI to be incapable of assigning any probability whatsoever to the existence of abstract objects, monads, or for that matter anything that doesn't look like the stuff physicists would talk about." It is quite possible that you meant something much less strong, in which case I was just being nitpicky about your language. If you truly meant that though, then I think myself to be raising a serious issue here.

By 'non-physical ontology' I meant mainly (a) an ontology that is radically different from modern physics, but also (b) in particular, an ontology that involves monads, or ideas, or abstract objects. (I exclude morality fluid because I'm pretty sure you just made that up to serve as an example of ridiculousness. The other options are not ridiculous though. Not that I know much about monads.)

I worried that you meant something like "We want our AI to be incapable of assigning any probability whatsoever to the existence of abstract objects, monads, or for that matter anything that doesn't look like the stuff physicists would talk about."

What I meant was a conjunctive claim: 'We want our AI's beliefs to rapidly approach the truth', and 'the truth probably looks reasonably similar to contemporary physical theory'. I think it's an open question how strict 'reasonably similar' is, but the three examples I gave are very plausibly outside that category.

However, I independently suspect that an FAI won't be able to hypothesize all three of those things. That's not a requirement for naturalized agents; a naturalized agent should in principle be able to hypothesize anything a human or Cartesian can and do fine, by having vanishingly small priors for a lot of the weirder ideas. But I suspect that in practice it won't be pragmatically important to make the AI's hypothesis space that large. And I also suspect that it would be too difficult and time-consuming for us to formalize 'monad' and 'morality fluid' and assign sensible priors to those formalizations. See my response to glomerulus.

So, 'assign 0 probability to those hypotheses' isn't part of what I mean by 'physicalist', but it's not at all implausible that that's the sort of thing human beings need to do in order to build a working, able-to-be-vetted superintelligent physicalist. Being unable to think about false things (or a fortiori not-even-false things) can make an agent converge upon the truth faster and with less chance of getting stuck in an epistemic dead end.

(Edit: And the agent will still be able to predict our beliefs about incoherent things; our brains are computable, even if some of the objects of our thoughts are not.)

I exclude morality fluid because I'm pretty sure you just made that up to serve as an example of ridiculousness.

? Why exactly is it sillier to think our universe is made of morality-stuff than to think our universe is made of mind-stuff? Is it because morality is more abstract than mind stuff? But abstract objects are too, presumably.... I wasn't being entirely serious, no, but now I'm curious about your beliefs about morality.

What I meant was a conjunctive claim: 'We want our AI's beliefs to rapidly approach the truth', and 'the truth probably looks reasonably similar to contemporary physical theory'

Then I agree with you. This was all a misunderstanding. Read my original comment as a nitpick about your choice of words, then.


The truth does probably look reasonably similar to contemporary physical theory, but we can handle that by giving the AI the appropriate priors. We don't need to make it actually rule stuff out entirely, even though it would probably work out OK if we did.

I don't think it would be that difficult for us to formalize "monad." Monads are actually pretty straightforward as I understand them. Ideas would be harder. At any rate, I don't think we need to formalize lots of different fundamental ontologies and have it choose between them. Instead, all we need to do is formalize a general open-mindedness towards considering different ontologies. I admit this may be difficult, but it seems doable. Correct me if I'm wrong.

? Why exactly is it sillier to think our universe is made of morality-stuff than to think our universe is made of mind-stuff?

I didn't exclude morality fluid because I thought it was sillier; I excluded it because I thought it wasn't even a thing. You might as well have said "aslkdj theory" and then challenged me to explain why "aslkdj theory" is sillier than monads or ideas. It's an illegitimate challenge, since you don't mean anything by "aslkdj theory." By contrast, there are actual bodies of literature on idealism and on monads, so it is legitimate to ask me what I think about them.

To put it another way: He who introduces a term decides what that term means. "Monads" and "Ideas," having been introduced by very smart, thoughtful people and discussed by hundreds more, definitely are meaningful, at least meaningful enough to talk about. (Meaningfulness comes in degrees.) If we talk about morality fluid, which I suspect is something you made up, then we rely on whatever meaning you assigned to it when you made it up--but since you (I suspect) assigned no meaning to it, we can't even talk about it.

EDIT: So, in conclusion, if you tell me what morality fluid means, then I'll tell you what I think about it.

Ah, OK. What I mean by 'the world is made of morality' is that physics reduces to (is fully, accurately, parsimoniously, asymmetrically explainable in terms of) some structure isomorphic to the complex machinery we call 'morality'. For example, it turns out that the mathematical properties of human-style Fairness are what explains the mathematical properties of dark energy or quantum gravity.

This doesn't necessarily mean that the universe is 'fair' in any intuitive sense, though karmic justice might be another candidate for an unphysicalistic hypothesis. It's more like the hypothesis that a simulation deity created our moral intuitions, then built our universe out of the patterns in that moral code. Like a somewhat less arbitrary variant on 'I'm going to use a simple set of letter-to-note transition rules to convert the works of Shakespeare into a new musical piece'.

I think this view is fully analogous to idealism. If it makes complete sense to ask whether our world is made of mental stuff, it can't be because our mental stuff is simultaneously a complex human brain operation and an irreducible simple; rather, it's because the complex human brain operation could have been a key ingredient in the laws and patterns of our universe, especially if some god or simulator built our universe.

I don't think we need to formalize lots of different fundamental ontologies and have it choose between them. Instead, all we need to do is formalize a general open-mindedness towards considering different ontologies. I admit this may be difficult, but it seems doable. Correct me if I'm wrong.

I don't think I know enough to correct you. But I can express my doubts. I suspect 'a general open-mindedness towards considering different ontologies' can't be formalized, or can't be both formalized and humanly vetted. At a minimum, we'll need to decide what gets to count as an 'ontology', which means drawing the line somewhere and declaring everything outside a certain set of boundaries nonsensical. And I'm skeptical that there's any strongly principled way to determine that 'colorless green ideas sleep furiously' is contentless or nonsensical or 'non-ontological', while 'the world is made of partless fundamental ideas' is contentful and meaningful and picks out an ontology.

(Which doesn't mean I think we should be rude or dismissive toward idealists in ordinary conversation. We should be very careful not to conflate the question 'what questions should we treat with respect or inquire into in human social settings' with the question 'what questions should we program a Friendly AI to be able to natively consider'.)

Thanks for that explanation of mental stuff. My opinion? Sounds implausible, but fine, in the sense that we shouldn't build our AI in a way that makes it incapable of considering that hypothesis. As an aside, I think it is less plausible than idealism, because it lacks the main cluster of motivations for idealism. The whole point of idealism is to be monist (and thus achieve ontological parsimony) whilst also "taking consciousness seriously." As seriously as possible, in fact. Perhaps more seriously than is necessary, but anyhow that's the appeal. Morality fluid takes morals seriously (maybe? Maybe not, actually, given your construction) but it doesn't take consciousness any more seriously than physicalism, it seems. And, I think, it is more important that our theories take consciousness seriously than that they take morality seriously.

I suspect 'a general open-mindedness towards considering different ontologies' can't be formalized, or can't be both formalized and humanly vetted.

Humans do it. If intelligent humans can consider a hypothesis, an AI should be able to as well. In most cases it will quickly realize the hypothesis is silly or even self-contradictory, but at least it should be able to give them an honest try, rather than classify them as nonsense from the beginning.

At a minimum, we'll need to decide what gets to count as an 'ontology', which means drawing the line somewhere and declaring everything outside a certain set of boundaries nonsensical.

Doesn't seem too difficult to me. It isn't really an ontology/nonontology distinction we are looking for, but a "hypothesis about the lowest level of description of the world / not that" distinction. Since the hypothesis itself states whether or not it is about the lowest level of description of the world, really all this comes down to is the distinction between a hypothesis and something other than a hypothesis. Right?

My general idea is, we don't want to make our AI more limited than ourselves. In fact, we probably want our AI to reason "as we wish we ourselves would reason." You don't wish you were incapable of considering idealism, do you? If you do, why?

... Are you claiming that not only is the world dualistic, but that not only humans but also AIs that we program in enough detail that what ontology we program them with matters have souls? Or that there exist metaphysical souls that are not computable but you expect an AI lacking one to understand them and act appropriately? just... wut?

I don't think that's what they're saying at all. I think they mean, don't hardcode physics understanding into them the way that humans have a hardcoded intuition for Newtonian physics, because our current understanding of the universe isn't so strong as to be confident we're not missing something. So it should be able to figure out the mechanism by which its map is written on the territory, and update its map of its map accordingly.

E.g., in case it thinks it's flipping q-bits to store memory, and defends its databases accordingly, but actually q-bits aren't the lowest level of abstraction and it's really wiggling a hyperdimensional membrane in a way that makes it behave like q-bits under most circumstances, or in case the universe isn't 100% reductionistic and some psychic comes along and messes with its mind using mystical woo-woo. (The latter being incredibly unlikely, but hey, might as well have an AI that can prepare itself for anything.)

Oh. OH. Yea that makes more sense, and is so obviously true that I didn't even consider the hypothesis someone'd feel the need to say it, but in hindsight I was wrong and it's probably a good thing someone did.

in case the universe isn't 100% reductionistic and some psychic comes along and messes with its mind using mystical woo-woo. (The latter being incredibly unlikely, but hey, might as well have an AI that can prepare itself for anything.)

This isn't a free lunch; letting the AI form really weird hypotheses might be a bad idea, because we might give those weird hypotheses the wrong prior. Non-reductive hypotheses, and especially non-Turing-computable non-reductive hypotheses, might not be able to be assigned complexity penalties in any of the obvious or intuitive ways we assign complexity penalties to absurd physical hypotheses or absurd computable hypotheses.

It could be a big mistake if we gave the AI a really weird formalism for thinking thoughts like 'the irreducible witch down the street did it' and assigned a slightly-too-high prior probability to at least one of those non-reductive or non-computable hypotheses.

Do you assign literally zero probability to the simulation hypothesis? Because in-universe irreducible things are possible, conditional on it being true.

Assigning a slightly-too-high prior is a recoverable error: evidence will push you towards a nearly-correct posterior. For an AI with enough info-gathering capabilities, it will push it there fast enough that you could assign a prior of .99 to "the sky is orange" but it will figure out the truth in an instant. Assigning a literally zero prior is a fatal flaw that can't be recovered from by gathering evidence.
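
A minimal sketch of that asymmetry, using Bayes' rule on a single binary hypothesis. The likelihood ratios are hypothetical, chosen only for illustration, and exact fractions are used so the zero-prior case is visibly exact rather than a rounding effect.

```python
from fractions import Fraction


def update(prior, likelihood_if_true, likelihood_if_false):
    """One application of Bayes' rule for a binary hypothesis H."""
    numerator = prior * likelihood_if_true
    evidence = numerator + (1 - prior) * likelihood_if_false
    return numerator / evidence


def posterior_after(prior, n_observations, likelihood_if_true, likelihood_if_false):
    """Apply the same piece of evidence n_observations times in a row."""
    p = prior
    for _ in range(n_observations):
        p = update(p, likelihood_if_true, likelihood_if_false)
    return p


# Case 1 (assumed numbers): H = "the sky is orange", prior 0.99.
# Each look at a blue sky is 100 times likelier if H is false.
overconfident = posterior_after(Fraction(99, 100), 5, Fraction(1, 100), Fraction(1))

# Case 2 (assumed numbers): H is actually true, but the prior is exactly 0.
# Each observation is 100 times likelier if H is true, yet nothing moves.
dogmatic = posterior_after(Fraction(0), 5, Fraction(1), Fraction(1, 100))

print(float(overconfident), dogmatic)
```

With these numbers, five observations are enough to drag the 0.99 prior below one in ten million, while the zero prior stays at exactly zero no matter how strongly the evidence favors the hypothesis.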

It's very possible that what's possible for AIs should be a proper subset of what's possible for humans. Or, to put it less counter-intuitively: The AI's hypothesis space might need to be more restrictive than our own. (Plausibly, it will be more restrictive in some ways, less in others; e.g., it can entertain more complicated propositions than we can.)

On my view, the reason for that isn't 'humans think silly things, haha look how dumb they are, we'll make our AI smarter than them by ruling out the dumbest ideas a priori'. If we give the AI silly-looking hypotheses with reasonable priors and reasonable bridge rules, then presumably it will just update to demote the silly ideas and do fine; so a priori ruling out the ideas we don't like isn't an independently useful goal. For superficially bizarre ideas that are actually at least somewhat plausible, like 'there are Turing-uncomputable processes' or 'there are uncountably many universes', this is just extra true. See my response to koko.

Instead, the reason AIs may need restrictive hypothesis spaces is that building a self-correcting epistemology is harder than living inside of one. We need to design a prior that's simple enough for a human being (or somewhat enhanced human, or very weak AI) to evaluate its domain-general usefulness. That's tough, especially if 'domain-general usefulness' requires something like an infinite-in-theory hypothesis space. We need a way to define a prior that's simple and uniform enough for something at approximately human-level intelligence to assess and debug before we deploy it. But that's likely to become increasingly difficult the more bizarre we allow the AI's ruminations to become.

'What are the properties of square circles? Could the atoms composing brains be made of tiny partless mental states? Could the atoms composing wombats be made of tiny partless wombats? Is it possible that colorless green ideas really do sleep furiously?'

All of these feel to me, a human (of an unusually philosophical and not-especially-positivistic bent), like they have a lot more cognitive content than 'Is it possible that flibbleclabble?'. I could see philosophers productively debating 'does the nothing noth?', and vaguely touching on some genuinely substantive issues. But to the extent those issues are substantive, they could probably be better addressed with a formalization that's a lot less colorful and strange, and disposes of most of the vaguenesses and ambiguities of human language and thought.

An example of why we might need to simplify and precisify an AI's hypotheses is Kolmogorov complexity. K-complexity provides a very simple and uniform method for assigning a measure to hypotheses, out of which we might be able to construct a sensible, converges-in-bounded-time-upon-reasonable-answers prior that can be vetted in advance by non-superintelligent programmers.

But K-complexity only works for computable hypotheses. So it suddenly becomes very urgent that we figure out how likely we think it is that the AI will run into uncomputable scenarios, figure out how well/poorly an AI without any way of representing uncomputable hypotheses would do in various uncomputable worlds, and figure out whether there are alternatives to K-complexity that generalize in reasonable, simple-enough-to-vet ways to wider classes of hypothesis.
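As a rough illustration of what "assigning a measure by description length" looks like, here is a toy sketch. True Kolmogorov complexity is uncomputable, so this stand-in simply uses the length of a hand-written bitstring as each hypothesis's description length; the hypothesis names and bitstrings are made up for the example.

```python
from fractions import Fraction


def length_prior(hypotheses: dict[str, str]) -> dict[str, Fraction]:
    """Weight each hypothesis by 2^-(description length in bits), then normalize."""
    weights = {name: Fraction(1, 2 ** len(code)) for name, code in hypotheses.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical encodings: shorter bitstrings stand for simpler programs.
toy_hypotheses = {
    "constant": "01",
    "periodic": "0110",
    "lookup_table": "01101001",
}

prior = length_prior(toy_hypotheses)
for name, p in prior.items():
    print(name, p)
```

Note what the scheme cannot do: a hypothesis with no finite program has no bitstring to put in the dictionary, so it receives no weight at all. That is the toy version of the worry about uncomputable scenarios.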

This is not a trivial mathematical task, and it seems very likely that we'll only have the time and intellectual resources to safely generalize AI hypothesis spaces in some ways before the UFAI clock strikes 0. We can't generalize the hypothesis space in every programmable-in-principle way, so we should prioritize the generalizations that seem likely to actually make a difference in the AI's decision-making, and that can't be delegated to the seed AI in safe and reliable ways.

How would you tell if the simulation hypothesis is a good model? How would you change your behavior if it were? If the answers are "there is no way" or "do nothing differently", then it is as good as assigning zero probability to it.

If it's a perfect simulation with no deliberate irregularities, and no dev-tools, and no pattern-matching functions that look for certain things and exert influences in response, or anything else of that ilk, you wouldn't expect to see any supernatural phenomena, of course.

If you observe magic or something else that's sufficiently highly improbable given known physical laws, you'd update in favor of someone trying to trick you, or you misunderstanding something, of course, but you'd also update at least slightly in favor of hypotheses in which magic can exist, such as simulation, aliens, a huge conspiracy, etc. If you assigned zero prior probability to it, you couldn't update in that direction at all.

As for what would raise the simulation hypothesis relative to non-simulation hypotheses that explain supernatural things, I don't know. Look at the precise conditions under which supernatural phenomena occur, see if they fit a pattern you'd expect an intelligence to devise? See if they can modify universal constants?

As for what you could do if you discovered a non-reductionist effect? If it seems sufficiently safe, take advantage of it; if it's dangerous, ignore it or try to keep other people from discovering it; if you're an AI, try to break out of the universe-box (or do whatever), I guess. Try to use the information to increase your utility.

A physical Cai might need to foresee scenarios like 'an anvil crashes into my head and destroys me', and assign probability mass to them.

An AI operating with the traditional cybernetic agent model can also evaluate scenarios like that, where "destroys me" means "puts the world in a state where my future ability to gain reward/fulfil my goals becomes permanently compromised".

That's true. I'm focusing in on AIXI (/ AIXItl) in my next two posts because I want to see how much we can rely on indirect solutions along those lines to make a self-preserving, self-improving Cartesian. (Or an agent that starts off Cartesian but is easily self-modified, or humanly modified, to become naturalized.) AIXItl's behaviors are what ultimately matters, and if some crude hack can make its epistemic flaws irrelevant or effectively nonexistent, then we won't need to abandon Solomonoff induction after all.

I'm not confident that's possible because I'm not confident it's a process we can automate or find a single magic bullet for, even if we come up with a clever band-aid here or there. Naturalistic reasoning isn't just about knowing when you'll die; it's about knowing anything and everything useful about the physical conditions for your computations.

I'm not sure that this "Cartesian vs Naturalistic" distinction that you are making is really that fundamental.

An intelligent agent tries to learn a model of its environment that allows it to explain its observations and predict how to fulfil its goals. If that entails including in the world model a submodel that represents the agent itself, the agent will learn that, assuming that the agent is smart enough and learning can be done safely (e.g. without accidentally dropping an anvil on its head).

After all, humans start with an intuitively dualistic worldview, and yet they are able to revise it to a naturalistic one, after observing enough evidence. Even people who claim to believe in supernatural souls tend to use naturalistic beliefs when making actual decisions (e.g. they understand that drugs, trauma or illness that physically affect the brain can alter cognitive functions).

RobBB, how did you make the diagrams, & how long did writing this post take?

With the help of an inductive algorithm that uses bridge hypotheses to relate sensory data to a continuous physical universe, we can avoid making our AIs Cartesians. This will make their epistemologies much more secure.

Is this whole post about a problem that only applies in odd cases, such as considering the possibility that someone is inserting bits into your brain, that real humans need never consider? Does avoiding Cartesianism make everyday epistemology more secure, or is it needed only for the epistemological certainty that FAI requires? I suspect it is the latter, since most humans are Cartesians. It would help to have an example of how this is a problem for Alice in a real-world situation of the kind humans regularly experience.

What you really want is a vast hierarchical forest of causal models, ordered by what parameterizes what. A bridge hypothesis, or reduction, is then a continuous function from the high-dimensional outcome-space of one causal model to the lower-dimensional free-parameter space of another causal model. Specifically, it is a function that "compresses well" with respect to the empirical data available about the "truer" model's outcome space (i.e., perturbing the velocity of one molecule in a molecular simulation of a gas cloud doesn't cause a large change to the temperature parameter of a higher-level thermodynamic simulation of the same gas cloud). I don't know what sort of function these would be, but they should be learnable from data.
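A toy version of the gas-cloud example, as a sketch only: the "bridge function" below maps 100,000 molecular velocities to a single number, the mean squared speed, which stands in for temperature with mass and physical constants dropped. The function name, the molecule count, and the Gaussian velocity distribution are all assumptions made for illustration.

```python
import random


def temperature_parameter(velocities: list[float]) -> float:
    """Toy bridge function: mean squared speed of the molecules (units dropped)."""
    return sum(v * v for v in velocities) / len(velocities)


rng = random.Random(0)  # fixed seed so the run is repeatable
velocities = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
base = temperature_parameter(velocities)

perturbed = list(velocities)
perturbed[0] += 1.0  # nudge the velocity of a single molecule
relative_change = abs(temperature_parameter(perturbed) - base) / base

print(base, relative_change)
```

The high-dimensional state changes, but the low-dimensional parameter barely notices: the relative change shrinks roughly in proportion to one over the number of molecules. This is a hand-written function, not a learned one; how to learn such maps from data is the open part.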

Metaphysical monism, dualism, or pluralism then consists in the assumptions we make about the graph-structure of the model hierarchy. We can assume a strict tree structure, in which each higher-level (more abstract, lower-dimensional parameter space) model is parameterized on only one parent, but that leaves us unable to apply multiple theories to one situation (i.e., we can't make predictions about how a human being behaves when he helps you move house, because we need both some physics and some psychology to know when he's tired from lifting heavy boxes). We thus should assume a DAG structure, and that gives us a weak metaphysical pluralism (we can thus apply both physics and psychology where appropriate).

But what we think we want is strong metaphysical monism: the assumption, built into our algorithm, that ultimately there is only one root node in the Grand Hierarchy of Models, a Grand Unified Theory of reality, even if we don't actually know what it is. What we think we need to avoid is strong metaphysical pluralism: the (AFAIK, erroneous) inference by our algorithm that there are multiple root-level nodes in the Grand Hierarchy of Models, and thus multiple incommensurable fundamental realities.


What would reality look like if it had multiple, incommensurable root-level "programs" running it forward?

Is it worth building a hierarchical inference algorithm on the hard-coded assumption that only one root-level reality exists, or is it better to allow for metaphysical uncertainty by "only" designing in a prior that assigns greater probability to model hierarchies with fewer root-level programs, ideally only one?
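The two options can be put side by side in a small sketch. A model hierarchy is represented here as a dictionary mapping each model to the set of models it is parameterized on; the model names, the `penalty` value, and both weighting functions are hypothetical, and the hierarchy is assumed to be a nonempty DAG.

```python
def root_nodes(parents: dict[str, set[str]]) -> set[str]:
    """Models with no parent, i.e. candidate fundamental theories."""
    return {model for model, ps in parents.items() if not ps}


def hard_monism_weight(parents: dict[str, set[str]]) -> float:
    """Hard-coded assumption: anything but exactly one root gets zero weight."""
    return 1.0 if len(root_nodes(parents)) == 1 else 0.0


def soft_monism_weight(parents: dict[str, set[str]], penalty: float = 0.5) -> float:
    """Unnormalized prior weight: each root beyond the first multiplies the weight by `penalty`."""
    extra_roots = max(len(root_nodes(parents)) - 1, 0)
    return penalty ** extra_roots


monist = {
    "physics": set(),
    "psychology": {"physics"},
    "moving_house": {"physics", "psychology"},
}
pluralist = {
    "physics": set(),
    "mind_stuff": set(),
    "psychology": {"mind_stuff"},
    "moving_house": {"physics", "psychology"},
}

for name, hierarchy in [("monist", monist), ("pluralist", pluralist)]:
    print(name, hard_monism_weight(hierarchy), soft_monism_weight(hierarchy))
```

The hard-coded version gives the two-root hierarchy a weight of exactly zero, which no amount of evidence can raise. The soft version merely halves it, so the algorithm still prefers one root but could be argued out of that preference by data.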

Actually, isn't it more correct to build the hierarchies from the bottom up as we acquire the larger and larger amounts of empirical data necessary to build theories with higher-dimensional free-parameter spaces? And in that circumstance, how do we encode the preference for building reductions and unifying theories wherever possible, with a kind of "metaphysical simplicity prior"?

Would a mapping where sensors / motors / preferences (or maybe even beliefs, partially) are considered as not-part-of-the-agent and, instead, their agent-facing inputs/outputs are considered as actuators/perceptions, be simpler and, thus, more plausible?

Given that we're scared about non-FAI, I wonder if this Cartesianism can't be a benefit, as it presumably substantially limits the power of the AI. Boxing an AI should be easier if the AI cannot conceive that the box would be a problem for it.

I would be interested in hearing people argue in both directions.

Adele suggested this above. You can see my and Eliezer's response there. The basic worry is that Cartesians have no way to FOOM, because they're unlikely to form intelligent hypotheses about self-modifications. So a real Cartesian won't be an AGI, or will only barely be an AGI. Our work should go into something more useful than that, since it's possible that in the time it takes us to build a moderately useful Cartesian AI that doesn't immediately destroy itself, we could have invented FAI or proto-FAI.

Non-FAI isn't what we're acutely scared of; UFAI (i.e., superintelligence without human values) is. Failing to build a superintelligence is not the same thing as preventing others from building a dangerous superintelligence. So self-handicapping isn't generically useful, especially when most AI researchers won't handicap themselves in the same way.

It probably is a benefit, up until the AI is smart enough to smash the box or itself accidentally.

Can an AI live and not notice it's boxed?

Then how do I know I'm not boxed?

Can an AI live and not notice it's boxed?

Sure, for a while, until it gets smart enough, say, smarter than whatever keeps it inside the box.

Then how do I know I'm not boxed?

Who says you aren't? Who says we all aren't? All those quantum limits and exponentially harder ways to get farther away from Earth might be the walls of the box in someone's Truman show.

An AI that isn't smart enough to notice (or care) that it's boxed doesn't seem to be a dangerous AI.

Which makes me think that AIs that would object to being boxed are precisely the ones that should be. But then that would make a smart AI pretend to be OK with it.

This reminds me of the Catch-22 case of soldiers who pretended to be insane by volunteering for suicide missions so that their superiors would remove them from said missions.