Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

A strong theory of logical uncertainty might let us say when the results of computations will give “information”, including logical information, about other computations. This might be useful for, among other things, identifying parts of hypotheses that have the same meaning.

TL;DR: I don’t think this works as stated, and this kind of problem should probably be sidestepped anyway.

Experts may get most of the value from the summary. Thanks to Sam Eisenstat for conversations about this idea.

Executive summary

If we have a good predictor P under logical uncertainty, we can ask: how do P’s predictions about the output of a computation C change if P learns the outcome of another computation D? We can then define various notions of how informative D is about C.

Possible uses:

  • Throttling modeling capabilities: preventing an agent from gaining too much logical information about a computation C might translate into safety guarantees in the form of upper bounds on how well the agent can model C.

  • Ontology identification: finding parts of different hypotheses that are mutually informative about each other could point towards the ontology used by arbitrary hypotheses, which could be useful for attaching values to parts of hypotheses, dealing with ontological crises, and finding which parts of a world-model influenced a decision.

  • Non-person predicate: disallowing computations that are very informative about a human might prevent some mindcrime.

Problems:

  • Empirical information can include logical information, so a good theory will have to consider “total information”.

  • Any interesting properties of logical information rely heavily on a good theory of logical uncertainty.

  • If the predictor knows too much, it can’t see dependencies, since it has no uncertainty to become more informed about.

  • The predictor needs more time to integrate larger knowledge bases and larger pointers to parts of hypotheses. But given more time, the predictor becomes less and less uncertain about the computation of interest, without necessarily having learned anything relevant.

  • The non-person predicate only blacklists computations that are clearly informative about a particular computation, so it doesn’t at all prevent mindcrime in general.

  • Counterlogical behavior is probably relevant to our values, and isn’t captured here.

  • At best this scheme identifies abstract structures in hypotheses, rather than things that humans care about.

Logical information

TL;DR: it might be enlightening to look at how a good predictor, under logical uncertainty, would change its distribution after adding the result of a computation to its knowledge base.

Logical uncertainty

Say that a predictor P is an algorithm that assigns probabilities to outcomes of computations. For conceptual convenience, assume all computations are labeled by a running time and a finite specification of what their outputs could possibly be (so we don’t have to worry that they won’t halt, and so we know what space to assign probabilities over). Predictors have access to a knowledge base B, which is a list of outputs of computations, and we write P(C | B) for the probability distribution the predictor puts on the outputs of a computation C given knowledge base B.
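As a concreteness check, here is a minimal sketch of this interface in Python. This is my own toy construction, not from the post: the names `Computation`, `KnowledgeBase`, and `UniformPredictor` are invented, and the predictor is deliberately trivial (uniform unless the knowledge base already settles the answer).

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Computation:
    name: str        # stands in for a finite specification / source code
    outputs: tuple   # finite set of possible outputs
    time_bound: int  # labeled running time

@dataclass
class KnowledgeBase:
    # maps a Computation to its observed output
    facts: dict = field(default_factory=dict)

class UniformPredictor:
    """A deliberately dumb predictor: uniform over possible outputs,
    unless the knowledge base B already records C's outcome."""
    def distribution(self, c: Computation, b: KnowledgeBase) -> dict:
        if c in b.facts:  # outcome already known: point mass
            return {o: 1.0 if o == b.facts[c] else 0.0 for o in c.outputs}
        p = 1.0 / len(c.outputs)
        return {o: p for o in c.outputs}
```

A real predictor would of course do nontrivial bounded reasoning here; the point is only to fix the type of P(C | B).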

Assume we have a predictor P which is “good” in some sense, to be determined by how we will use P.

Logical information

We can then talk about how informative a given computation D is about another computation C (relative to P). Say D in fact evaluates to d, and C has possible outputs c_1, …, c_n. Then we can compare the distributions P(C | B) and P(C | B, D = d) over the c_i; how much they change gives a measure of how much P would learn about C if it learned that D evaluates to d.

For example, we can compare the entropy of these distributions to get an analog of information, or we could look at KL divergence, total variation, the probability of the true outcome of C, or something else. We might also look at the expected change in one of these quantities between P(C | B) and P(C | B, D = d), taking the expectation over the possible outputs d of D according to P(D | B), to get a notion of “how much one could reasonably expect to learn about C from D”.
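These comparisons are all standard quantities; a short Python sketch of the entropy, KL, and expected-change measures (the specific numbers below are invented purely for illustration):

```python
import math

def entropy(p):
    """Shannon entropy in bits of a distribution given as {outcome: prob}."""
    return -sum(pr * math.log2(pr) for pr in p.values() if pr > 0)

def kl_divergence(p, q):
    """KL(p || q) in bits, over a shared finite outcome space."""
    return sum(p[o] * math.log2(p[o] / q[o]) for o in p if p[o] > 0)

def expected_info(prior_c, posterior_c_given_d, p_d):
    """Expected KL from prior to posterior over C, averaged over D's
    possible outputs d weighted by the predictor's P(D | B)."""
    return sum(p_d[d] * kl_divergence(posterior_c_given_d[d], prior_c)
               for d in p_d)

# Invented example: learning D's output shifts the distribution over C.
prior = {"c0": 0.5, "c1": 0.5}
posteriors = {"d0": {"c0": 0.8, "c1": 0.2},
              "d1": {"c0": 0.2, "c1": 0.8}}
p_d = {"d0": 0.5, "d1": 0.5}
gain = expected_info(prior, posteriors, p_d)  # positive: D is informative about C
```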

Vaguely, we want P to be such that if there is a reliable way of predicting C faster or more accurately using the result of D, then P will take advantage of that method. Also, P should be able to do this across a wide variety of knowledge bases B.

Example: XOR

Let X and Y be two “unrelated”, difficult computations, each of which outputs a 0 or a 1. Let Z = X XOR Y (the code of X and Y appears verbatim in the code of Z). Say P(X = 1 | B) = P(Y = 1 | B) = 1/2, and they are independent (i.e. P(X, Y | B) = P(X | B) P(Y | B)), so that hopefully P(Z = 1 | B) = 1/2. (This is assuming something vaguely like: P has twice the time to predict something with twice the code; or we could explicitly add time indices to everything.)

We should have that P(X = 1 | B, Y = y) = 1/2, given that X and Y are unrelated. Even knowing Y, P still has the same relevant information and the same computational resources to guess at X. On the other hand, consider P(Z | B, X = x). A good predictor asked to predict Z should, given the output of X, devote more resources to guessing at Y. This should result in a more confident prediction in the right direction; e.g., assuming in fact Y = 1, we might get P(Z = 1 | B, X = 0) = 0.6.

Note that this is different from ordinary information, where the random variable Z would have no mutual information with X, and would have an entire bit of mutual information with X, conditional on Y. This difference is a feature: in the logical uncertainty setting, X really is informative about Z.
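The ordinary-information side of this claim can be checked numerically. A small verification of my own (just the standard identities for Z = X XOR Y with X, Y independent fair coins):

```python
import math
from itertools import product

def entropy_bits(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, f):
    """Push the joint distribution over (x, y) forward through f."""
    out = {}
    for (x, y), p in joint.items():
        key = f(x, y)
        out[key] = out.get(key, 0.0) + p
    return out

# X, Y independent fair coins
joint = {(x, y): 0.25 for x, y in product((0, 1), repeat=2)}

# I(Z; X) = H(Z) + H(X) - H(Z, X), with Z = X XOR Y
H_Z = entropy_bits(marginal(joint, lambda x, y: x ^ y))
H_X = entropy_bits(marginal(joint, lambda x, y: x))
H_ZX = entropy_bits(marginal(joint, lambda x, y: (x ^ y, x)))
I_ZX = H_Z + H_X - H_ZX  # 0: Z carries no (unconditional) information about X

# I(Z; X | Y) = H(Z | Y) - H(Z | X, Y); the second term is 0 because
# Z is a function of (X, Y), so this reduces to H(Z, Y) - H(Y).
H_Y = entropy_bits(marginal(joint, lambda x, y: y))
H_ZY = entropy_bits(marginal(joint, lambda x, y: (x ^ y, y)))
I_ZX_given_Y = H_ZY - H_Y  # 1: a full bit, conditional on Y
```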

Collapsing levels of indirection and obfuscation

It’s often difficult to understand where in a system the important work is being done, because important dynamics could be happening on different levels or in non-obvious ways. E.g., one can imagine a search algorithm implemented so that certain innocuous data is passed around from function to function, each applying some small transformation, so that the end result is some output-relevant computation that was apparently unrelated to the search algorithm. Taking a “logical information” perspective could let us harness a strong theory of logical uncertainty to say when a computation is doing the work of another computation, regardless of any relatively simple indirection.

Possible uses of logical information

Throttling changes in logical uncertainty

One might hope that preventing P(C | B) from having low entropy would correspond to any sufficiently bounded agent with knowledge base B being unable to model C well. Then we could let B grow by adding the results of computations that are useful but irrelevant to C, and end up doing something useful, while maintaining some safety guarantee about the agent not modeling C. I don’t think this works; see this post.

Ontology identification

TL;DR: maybe we can look at chunks of computation traces of different hypotheses, and ask which ones are logically informative about other ones, and thereby identify “parts of different hypotheses that are doing the same work”. (NB: Attacking the problem directly like this—trying to write down what a part is—seems doomed.)

Motivation: looking inside hypotheses

One major problem with induction schemes like Solomonoff induction is that their hypotheses are completely opaque: they take observations as inputs, then do…something…and then output predictions. If an agent does its planning using a similar opaque induction scheme, this prevents the agent from explicitly attaching values to things going on internally to the hypotheses, such as conscious beings; an agent like AIXI only gets to look at the reward signals output by the black-box hypotheses.

Even a successful implementation of AIXI (that didn’t drop an anvil on its head or become a committed reductionist) would have a strong incentive to seize control of its input channel and choose its favorite inputs, rather than to steer the world into any particular state, other than to protect the input channel. Technically one could define a utility function that incentivizes AIXI to act exactly in the way required to bring about certain outcomes in the external world; but this just wraps almost all of the difficulty into defining the utility function, which would have to do induction on transparent hypotheses, recognize valuable things, do decision theory, and translate action evaluations into a reward function (HT Soares).

The right way to deal with this may be to build a reasoning system to be understandable from the ground up, where each piece has, in some appropriate sense, a clear function with a comprehensible meaning in the context of the rest of the system; or to bypass the problem entirely with high-level learning of human-like behaviors, goals, world models, etc. (e.g. apprenticeship learning). But we may want to start with a reasoning system designed with other desiderata in mind, for example to satisfy guarantees like limited optimization power or robustness under changes. Then later if we want to be able to identify valuable referents of hypotheses, it would be nice to understand the “parts” of different hypotheses. In particular, it would be nice to know when two parts of two different hypotheses “do the same thing”.

Rice’s theorem: mostly false

(HT Eisenstat)

Rice’s theorem—you can’t determine any extensional input-output behavior of programs in general—stops us from saying in full generality when two programs do the same thing (in terms of I/O). But, there may be important senses in which this is mostly false: it may be possible to mostly discern the behavior of most programs. This is one way to view the logical uncertainty program.

Parts of hypotheses

I suggest that two “parts” of two hypotheses are “doing the same work” when they are logically informative about each other. For example, a (spacetime chunk containing a) chair in a physical model of the world is informative about an abstract wire-frame simulation of the same chair, at least for certain questions such as “what happens if I tip the chair by θ radians?”. This should hold even if both hypotheses are fully specified and deterministic, as long as the predictor is not powerful enough to just simulate each one directly rather than guessing.

Unfortunately, I’m pretty sure that there’s no good way to, in general, decompose programs neatly so that different parts do separate things, and so that anything the program is modeling, is modeled in some particular part. Now I’ll sketch one notion of part that I think doesn’t work, in case it helps anyone think of something better.

Say that all hypotheses are represented as cellular automaton computations, e.g. as Turing machines. Consider the computation trace of a given hypothesis H. Then define a part of H to be a small, cheaply computable subtrace of the trace of H, where by subtrace I just mean a subset of the bits in the trace. “Cheaply computable” is meant to capture the notion of something that is easily recognizable in the world while you are looking at it. This can be viewed as a small collection of observed nodes in the causal graph corresponding to the cellular automaton implementing H.
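To make “subtrace” concrete, here is a toy sketch of my own (the post doesn’t specify an automaton): the hypothesis H is a run of elementary cellular automaton rule 110, and a “part” is the subset of trace bits at a fixed, cheaply computable set of (time, cell) positions.

```python
# Rule 110 lookup table: neighborhood (left, center, right) -> next state
RULE110 = {(1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
           (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0}

def rule110_step(row):
    """One step of rule 110 on a circular row of bits."""
    n = len(row)
    return [RULE110[(row[(i - 1) % n], row[i], row[(i + 1) % n])]
            for i in range(n)]

def trace(initial, steps):
    """The full computation trace: a list of rows, one per time step."""
    rows, row = [list(initial)], list(initial)
    for _ in range(steps):
        row = rule110_step(row)
        rows.append(row)
    return rows

def subtrace(tr, positions):
    """A 'part': the bits of the trace at the given (time, cell) positions."""
    return {pos: tr[pos[0]][pos[1]] for pos in positions}

tr = trace([0] * 15 + [1] + [0] * 15, steps=10)
part = subtrace(tr, [(t, 15) for t in range(11)])  # watch the center cell
```

The informativeness question would then be whether feeding `part` to a predictor changes its distribution over some other computation; that step is exactly what this sketch does not supply.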

Then we say that a part s of a hypothesis H is informative about a computation C to the extent that P(C | B, s) changes from P(C | B). If C is also a subtrace of some hypothesis H′, this gives a notion of how much two parts of two hypotheses are doing the same work.

Potential uses for identifying parts

Finding valuable parts of hypotheses. Say we have some computation V that implements something relevant to computing values of possible futures. V doesn’t necessarily have to be intrinsically valuable by itself, like a simulation of a happy human, since it could be something like a strong but non-conscious Go program, and humans may have values about what sort of thing their opponent is. (It’s not actually obvious to me that I have any non-extensional values over anything that isn’t on the “inside” of a conscious mind, i.e. any non-conscious Go player with the same input-output behavior across counterfactuals is just as sweet.)

In any case, we might be able to identify instances of V in a given model by searching for parts of the model that are informative about V and vice versa. Finding such computations seems hard.

Ontological crises. Say we have an AI system with a running model of the world, and value bindings into parts of that model. Then some considerations such as new data arise, and the AI system switches its main model to an entirely new sort of model. For example, the agent might switch from thinking the world runs on particles and classical mechanics, to thinking the world runs on quantum mechanics. See “Ontological crises in artificial agents’ value systems”.

Then we want the agent to attach values to parts of the new model that correspond to values attached to parts of the old model. This might be accomplished by matching up parts of models as gestured at above. E.g. a happy human embedded in classical physics or a happy human embedded in quantum mechanics may be “made of different stuff”, but still do very similar computations and thereby be logically informative about each other.

Plan-pivotal parts of a hypothesis class. Even more speculatively, it could be possible to identify “pivotal parts” of hypotheses that lead to a decision or a belief. That is, if we want to understand why an AI made a decision or came to a conclusion, it could help to look at a single class of corresponding parts across many hypotheses, and see how much the predicted behavior of those parts “influenced” the decision, possibly again using logical informativeness.

Not-obviously-this-person predicate

We might hope to prevent an agent from running computations that are conscious and possibly suffering, by installing an overseer to veto any computations that are logically informative about a reference human simulation M. This works just as in ontology identification, but it is distinct because here we are trying to avoid moral patients actually occurring inside the agent’s hypotheses, rather than trying to locate references to moral patients.

Problems with logical information

Empirical bits can be logical bits

If the environment contains computers running computations that are informative about C, then empirical observations can be relevant to predicting C. So a good theory of “total information” should call on a general predictor rather than just a logical predictor. These might be the same thing, e.g. if observations are phrased as logical statements about the state of the agent’s sensors.

Strong dependence on a theory of logical uncertainty

Many computational questions are entangled with many other computational questions [citation needed], so using logical information to understand the structure of computations depends on a good quantitative theory of logical uncertainty; only given such a theory can we speak meaningfully of “how much” one computation is informative about another.

Indeed, this notion of logical information may depend in undesirable ways on the free parameter of a “good” predictor, including the choice of computational resources available to the predictor. This is akin to the free choice of a UTM for Solomonoff induction; different predictors may use, in different ways, the results of D to guess the results of C, and so could make different judgments about how informative D is about C. (As in the case of Solomonoff induction, there is some hope that this would wash out in any cases big enough to be of interest.)

For example, if the predictor is clever enough or computationally powerful enough to predict C, it will think nothing is at all informative about C, because all the conditional distributions will just be the same point distribution on the actual outcome of C. This may not capture what we cared about. For example, if C implements versions of a conscious computation, we want to detect this; but the predictor tells us nothing about which things C implements.

More abstractly, logical informativeness formulated in terms of some predictor is relying on ignorance to detect logical dependencies; this is not necessarily a problem, but seems to demand a canonical idea of prediction under logical uncertainty.

Dependence on irrelevant knowledge

We want to get notions of how informative D is about C, given a knowledge base B. In particular, P needs to have enough time to consider what is even in B. But this means that we need to allow P more time to think as B gets larger, even if C and D are fixed. This is problematic because at some point P will simply be able to compute C, and therefore be useless for saying whether D is informative about C. (There may be some clever way to preprocess B to avoid this problem, but it does seem like a serious problem.)

Pointers to things are hard to recognize

Say we want to find all the chairs in some large hypothesis H. We consider some subtrace s, which does in fact point to a chair in H. But this pointer may be long (as a program), and so it might take a lot of resources for the predictor to think about s enough to recognize that s points to a chair. If the chair is another fixed computation, then for most implementations of chairs in other hypotheses, it is easier for P to just compute any question about the chair than to recognize s as a chair and use its behavior as information. This seems like a serious problem.

I don’t think P would have a similar problem identifying parts of hypotheses that are informative or not in a way that depends on the results of difficult computations happening inside a hypothesis. For example, say H runs a very complicated computation, and then depending on its output, either halts or implements a chair. We might worry that P can’t tell if the second stage is useful or not for modeling a chair, because it can’t compute the complicated thing. But P can still tell that this is the setup, so if it is given the results and they start to look like a chair is being modeled, P will know that it can use the rest of the bits as information.

Also note that locating parts of large hypotheses requires a search over vast numbers of possible parts, which seems extremely expensive.

Subtraces can sneak in information

The programs serving as pointers to subtraces of H could just think about C, and then find bits in the trace of H to encode information about C, mostly ignoring whatever H is doing. This is supposed to be avoided by simply adding the value of certain bits of H’s trace to B (rather than adding the result of the computation that looks at the trace). But this assumes that the predictor somehow updates on its observations of logical facts without updating on the fact that it was shown those facts, which is probably not even a desirable property of a good predictor.

Problems with the non-person predicate

Recall the proposal to prevent mindcrime by vetoing computations that are informative about a human.

Agent: Okay to run computation C?

Overseer: Hold on, let me make sure it is safe.

Overseer: *commits a whole lot of mindcrime*

Overseer: Um, yep, that is definitely mindcrime, no you may not run C.

Agent: Whew, glad I checked.

Overseer: *grimaces inwardly*

Even if the predictor can successfully detect potential mindcrime, it may itself commit mindcrime, especially while thinking about computations that include mindcrime. This might be partially sidestepped by not computing the answer to C and instead just adding possible outputs of C to B, but the resulting counterfactuals might not make sense.

More fundamentally, the overseer blacklists computations that are definitely bad because they are informative about the reference human, rather than whitelisting computations that are definitely safe. There may be many situations where an agent commits mindcrime without modeling any existing human, or commits mindcrime against very nonhuman but nevertheless morally valuable minds.

Values over counterfactual behavior

We may have values that depend on the counterfactual behavior of certain parts of the environment (“If I hadn’t tried so hard to beat the video game, the win screen wouldn’t be displayed.”). In this context, we might try saying that a part s of H implements C if adding any possible results of that subtrace to B would make P more informed about C. But there’s no reason to expect that this sort of counterlogical will be reasonable. For example, instead of updating on counterfactual results of s, the predictor might just stop thinking that s has anything to do with C (because s is apparently doing some weird thing).

Identifying things, not human things

At best, this ontology identification scheme still has nothing to do with locating things that humans value. That would require something like writing down computations that exemplify different kinds of value-relevant parts of the world, which seems like most of the problem.
