By the way, here's my account of the motivation for this problem:
Let's say you start with an AI that is superhuman at engineering. You want to ask it to do a simple task (like make you burritos) without risking vast unforeseen consequences. So you let it passively scan a bunch of human-made burritos, and ask it to make you a burrito. There are a couple of interesting failure modes:
The space of acceptable burritos, as a subset of configurations of atoms, is a really narrow and twisty target. If you take the set of configurations that are closer to training burrito 1 than any other training burrito is to burrito 1, the vast majority of those configurations would be toxic to humans, and some of them contain self-replicating nanobots, etc. Of course, there are ways of representing concepts such that the essential aspects of acceptable burritos (like being made out of a specific set of organic molecules) are more likely to be found. This is the problem of identifying the correct measure b in the first place.
Having the AI create nanotech is pretty risky, and for this task we'd prefer if it stuck to more boring engineering like agricultural and culinary robots. But "don't make any nanotech" is not a natural command, since how do you specify "nanotech" without examples, and since there are plenty of creative nanotech-like things that wouldn't even occur to us to rule out. So we want to either give it parameters for what it can do (this gives us the feasible set f, which is unlikely to exactly contain any of our examples), or somehow set things up so that the boring engineering tasks are the optimal way to satisfy the problem. (This is also why "exactly clone one of the example burritos" is not a great solution, since this obviously requires nanotech.)
What do you think about making burritos that have low information content given good burrito recipes (or sampling from a distribution f that has low information relative to the learned distribution b)? Fancy burritos seem a lot simpler to describe than nanobot-containing burritos, given the machinery to describe mediocre burritos.
Thanks for commenting. I think measuring individual burritos by low information content (i.e. high probability) relative to some recipes doesn't work. Specifically, this is because the information content is dominated by thermodynamic complexity in the burrito. Under almost any distribution, burritos with low information content will be small, cold, crystalline ones.
Measuring the Kolmogorov complexity of f relative to b seems more promising. This might work if we could estimate how much relative information any distribution containing nano-UFAIs must have, and only allow distributions less (relatively) complex than that threshold.
More concretely, we could represent b as a probabilistic program and allow f to be a low-complexity modification of b (measuring complexity using some mutation distribution on probabilistic programs). I think something like this is worth exploring.
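To make this a bit more concrete, here is a minimal toy sketch (my own illustration, not part of the comment above): b is represented as a tiny "program" (a dict of ingredient weights), f is produced by a short sequence of edits to b, and the relative complexity of f given b is the description length of those edits under a fixed mutation distribution. The edit types, costs, and threshold idea are all illustrative assumptions.

```python
import math

# Toy sketch: the learned burrito distribution b is a tiny "probabilistic
# program" (a dict of ingredient weights), and f is obtained by applying a
# short sequence of edits to b.  The relative complexity of f given b is the
# number of bits needed to describe those edits under a fixed mutation
# distribution.  All of this is illustrative, not a real proposal.

b_program = {"tortilla": 0.4, "beans": 0.3, "rice": 0.2, "salsa": 0.1}

# Mutation distribution over single edits: pick an edit type, then a key.
EDIT_TYPES = {"scale_up": 0.5, "remove": 0.5}

def edit_cost_bits(edit_type, program):
    """Bits to describe one edit: choose the edit type, then choose a key."""
    return -math.log2(EDIT_TYPES[edit_type]) + math.log2(len(program))

def apply_edit(program, edit_type, key):
    new = dict(program)
    if edit_type == "remove":
        new.pop(key)
    elif edit_type == "scale_up":
        new[key] *= 2.0
    total = sum(new.values())
    return {k: v / total for k, v in new.items()}  # renormalize

# f: a modification of b that drops salsa and emphasizes beans.
edits = [("remove", "salsa"), ("scale_up", "beans")]
f_program, bits = b_program, 0.0
for edit_type, key in edits:
    bits += edit_cost_bits(edit_type, f_program)
    f_program = apply_edit(f_program, edit_type, key)

print(f_program)
print(f"relative complexity of f given b: {bits:.2f} bits")
# The safety rule in this toy setting: only allow f if `bits` stays below a
# threshold chosen to be smaller than the bits needed to specify a
# nano-UFAI-containing distribution.
```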
The quantilizer idea seems excellent.
One note of caution - although it is neat to be able to bound loss, one can still question how meaningful 'loss' is here. Looking instead at utility, there is no bound on how much worse a quantilizer can perform relative to a random actor (nor how much it would contribute to our regret).
The real reason to use a quantilizer is that we expect that edge cases might have regret on the order of the cosmic endowment, but that most quantilized actions will not. For this, the thing that actually matters, there are no guarantees.
The max part makes it different from something like the Kullback–Leibler divergence, but that might be a good feature - a definition based on a maximum rather than an expectation seems harder to hack.
If we are worried about butterfly effects and the like, it might be useful to do something like this: let p be the probability distribution of future states given that a burrito is not made, and q the same distribution given that the burrito is made. If p and q are very different (as measured by KL divergence or the approach here), then either a) the burrito is dangerous, or b) the AI can unravel butterfly effects. If p and q are very different for many different burritos it could make, then we have a butterfly effect problem. If they are only different for some burritos, then we have identified the high-impact burritos.
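As a rough illustration of this impact measure (my own sketch, not the commenter's): estimate p and q from Monte Carlo rollouts of a world model and compare them with KL divergence. The simulate_future function, the coarse state labels, and the smoothing constant below are assumptions made purely for the example.

```python
import math
import random
from collections import Counter

# Sketch: p is the distribution of coarse future states if no burrito is made,
# q is the same distribution if a particular burrito is made.  Both are
# estimated from rollouts of a stand-in world model `simulate_future`.
# A large KL(q || p) flags the burrito as high impact (or flags that the
# model can resolve butterfly effects).

def simulate_future(burrito_made, rng):
    # Stand-in world model: returns a coarse label for the future state.
    states = ["status_quo", "minor_change"]
    if burrito_made:
        states = states + ["burrito_eaten"]
    return rng.choice(states)

def empirical_dist(samples):
    counts = Counter(samples)
    n = len(samples)
    return {s: c / n for s, c in counts.items()}

def kl_divergence(q, p, eps=1e-9):
    # KL(q || p) over the union of supports, with smoothing to avoid log(0).
    support = set(q) | set(p)
    return sum(q.get(s, eps) * math.log(q.get(s, eps) / p.get(s, eps))
               for s in support)

rng = random.Random(0)
p = empirical_dist([simulate_future(False, rng) for _ in range(10_000)])
q = empirical_dist([simulate_future(True, rng) for _ in range(10_000)])
print(f"KL(q || p) = {kl_divergence(q, p):.3f} nats")
# If this value is large for almost every candidate burrito, we have a
# butterfly-effect problem; if it is large only for a few, those are the
# high-impact burritos.
```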
I'm feeling this could be related to my ideas here: http://lesswrong.com/r/discussion/lw/ly9/crude_measures/
I'll think about it more...
Summary: it is sometimes desirable to create systems that avoid edge cases. This post presents a partial solution to this problem, which works by comparing the system's behavior to a "natural" distribution.
Introduction
Learning a concept from both positive and negative examples is a well-studied problem in machine learning. However, if we learn a concept from training data (say, using human judgments as the gold standard), then the learned concept might not generalize well; it could differ from human judgments on any number of edge cases. There is a sense in which these edge cases are "outside the domain" of the training data; that is, the training data seems to naturally define some domain where we can make judgments confidently, with edge cases lying outside this domain.
Therefore, we will be ignoring 2-class classification and defining a form of "classification" that uses only positive examples. We could use this "classifier" to positively identify regions that do not contain edge cases.
As a running example, consider a system that is supposed to create burritos using a nanofactory. It will have as training data some training burrito configuration files, which specify the exact arrangement of atoms in each training burrito. The task will be to create burritos that are similar to the training burritos, without necessarily being identical to them.
Suppose this system has access to a hypercomputer. Then it could use a variant of Solomonoff induction to learn a generative model over burrito configuration files, b. With some caveats, this may learn the distribution of training burritos.
We could instruct the system to sample a burrito configuration file from this distribution, and then print it out using the nanofactory. If the induced distribution reflects the distribution of training burritos, then this should not be significantly different from creating a new training burrito in the usual fashion.
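Since Solomonoff induction is uncomputable, any concrete illustration needs a stand-in. The sketch below fits a toy generative model b (a bigram model over ingredient tokens) to a few made-up training burrito configuration files and samples a new configuration from it; the token alphabet and the training data are purely illustrative assumptions.

```python
import random
from collections import Counter, defaultdict

# Stand-in for the hypercomputer's induced model: a bigram model b over
# "burrito configuration files", treated here as strings of ingredient tokens.
# The training data and alphabet are made up for illustration.

training_burritos = [
    "tortilla beans rice salsa END",
    "tortilla beans cheese salsa END",
    "tortilla rice cheese guacamole END",
]

# Fit bigram counts: counts of (current token -> next token).
transitions = defaultdict(Counter)
for config in training_burritos:
    tokens = ["START"] + config.split()
    for cur, nxt in zip(tokens, tokens[1:]):
        transitions[cur][nxt] += 1

def sample_burrito(rng):
    """Sample one configuration file from the learned model b."""
    tokens, cur = [], "START"
    while cur != "END":
        nexts = transitions[cur]
        cur = rng.choices(list(nexts), weights=nexts.values())[0]
        tokens.append(cur)
    return " ".join(tokens[:-1])  # drop the END marker

rng = random.Random(0)
print(sample_burrito(rng))  # prints a configuration resembling the training ones
```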
What if the nanofactory can't print all possible burrito configurations (or it can, but we only want to print high-quality burritos)? Suppose we have some feasible set f of printable burrito configurations. We may be tempted to sample from the distribution b|f, which assigns probabilities proportional to b for burrito configurations in f and probability 0 to other burrito configurations. However, this may be dangerous! For example, if only burritos containing nano-UFAIs are in the feasible set, then the nanofactory will be guaranteed to print such a burrito: since Solomonoff induction never assigns probability 0 to any burrito configuration, b|f is well-defined (and concentrated on dangerous burritos) even for such a feasible set.
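A minimal sketch of sampling from b|f by rejection, with a toy discrete b (the burrito types and probabilities are made up), showing the failure mode just described:

```python
import random

# Rejection sampling from b|f: draw from b until the sample lands in the
# feasible set f.  The toy distribution below is illustrative.

b = {"ordinary_burrito": 0.98999, "fancy_burrito": 0.01, "nano_ufai_burrito": 0.00001}

def sample_b(rng):
    return rng.choices(list(b), weights=b.values())[0]

def sample_b_given_f(feasible, rng, max_tries=1_000_000):
    """Sample from b conditioned on the feasible set f."""
    for _ in range(max_tries):
        burrito = sample_b(rng)
        if burrito in feasible:
            return burrito
    raise RuntimeError("feasible set has negligible probability under b")

rng = random.Random(0)
print(sample_b_given_f({"ordinary_burrito", "fancy_burrito"}, rng))
# If the feasible set were {"nano_ufai_burrito"}, this sampler would still
# return a nano-UFAI burrito with certainty (given enough tries), because b
# never assigns exact probability 0 to any configuration.
```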
Bits of unnaturalness
Can we quantify how dangerous the b|f distribution is? Here is a simple measure that might work. Define f's bits of unnaturalness to be $-\log P_b(B \in f)$, where $B$ is a random burrito configuration sampled from b. The logic is that if $P_b(B \in f)$ is not too low, then we didn't have a super-low chance of getting a feasible burrito through natural processes. A one-in-a-million burrito is unlikely to be dangerous (except through butterfly effects, discussed later).
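A hedged sketch of estimating this quantity by Monte Carlo, reusing the same kind of toy discrete b as above; the distribution and feasible set are illustrative assumptions.

```python
import math
import random

# Estimate f's bits of unnaturalness, -log2 P_b(B in f), by sampling from b.
# The toy distribution and feasible set are made up for illustration.

b = {"ordinary_burrito": 0.98999, "fancy_burrito": 0.01, "nano_ufai_burrito": 0.00001}
feasible = {"fancy_burrito"}

def bits_of_unnaturalness(b, feasible, n_samples=200_000, rng=None):
    rng = rng or random.Random(0)
    samples = rng.choices(list(b), weights=list(b.values()), k=n_samples)
    p_feasible = sum(s in feasible for s in samples) / n_samples
    return -math.log2(p_feasible) if p_feasible > 0 else float("inf")

print(f"{bits_of_unnaturalness(b, feasible):.2f} bits")
# Roughly -log2(0.01), about 6.6 bits, here; a one-in-a-million feasible set
# would come out at about 20 bits.
```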
We can generalize this to burrito distributions d other than those of the form b|f. It is potentially dangerous for any given burrito configuration to be much more probable under d than b, so we can define d's bits of unnaturalness as $\log \max_{B \in \operatorname{supp}(d)} \frac{d(B)}{b(B)}$. It is easy to see that this agrees with our original unnaturalness measure when $d = b|f$.
How dangerous is it to sample from a distribution with k bits of unnaturalness? Intuitively, if k is low, we're sampling from a distribution that "fits under" the natural distribution, and so the result is unlikely to be unsafe. More formally, suppose we have a cost function c defined on burrito configurations, which returns a non-negative number. If d has k bits of unnaturalness, then $d(B) \le 2^k b(B)$ for every configuration $B$, so
$$E_d[c(B)] = \sum_B d(B)\,c(B) \le \sum_B 2^k b(B)\,c(B) = 2^k E_b[c(B)],$$
which means that if k is low, then sampling from d is not much more costly than sampling from b. This is a nice guarantee to have.
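A quick numerical check of this bound on toy discrete distributions (the outcomes, probabilities, and cost function below are made up for illustration):

```python
import math

# Check E_d[c] <= 2^k * E_b[c] where k = log2 max_B d(B)/b(B).

b = {"ordinary": 0.9, "fancy": 0.09, "weird": 0.01}          # "natural" distribution
d = {"ordinary": 0.5, "fancy": 0.5}                           # unnatural distribution
c = {"ordinary": 0.0, "fancy": 1.0, "weird": 100.0}           # non-negative cost function

k = math.log2(max(d[x] / b[x] for x in d))   # bits of unnaturalness of d
cost_d = sum(d[x] * c[x] for x in d)         # E_d[c(B)]
cost_b = sum(b[x] * c[x] for x in b)         # E_b[c(B)]

print(f"k = {k:.2f} bits")
print(f"E_d[c] = {cost_d:.3f} <= 2^k * E_b[c] = {2 ** k * cost_b:.3f}")
```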
Problems
In practice, many unnatural burrito distributions are likely to be harmless: every new method of making burritos will produce burritos that were unlikely to be produced by previous methods. The system could very well end up paralyzed due to having to make burritos that look almost exactly like natural ones. There isn't a clear solution to this problem: we'd need some way of deciding which deviations from the natural distribution are safe and which aren't.
Due to butterfly effects, even creating a burrito through natural processes potentially has a very high expected cost. It's not implausible that creating a single burrito could negatively alter the course of history. In practice, there's also an equally high expected benefit (due to positive butterfly effects). But it only takes one bit of unnaturalness to wipe out the half of burritos that have positive butterfly effects. We might be able to solve this problem if we had some way of placing limits on the extent to which bounded computations (such as sampling from d) can take advantage of butterfly effects. Additionally, false thermodynamic miracles could be used to interfere with butterfly effects.
There is a bigger issue when we sample multiple burritos from d. A large number of burritos from d could give us enough information to mostly determine d, which could be dangerous because d itself could encode dangerous messages. This is part of a general pattern: even if the distribution for a single burrito is only slightly unnatural, the distribution over sequences of burritos sampled from it may be very unnatural.
It might be possible to solve this problem by limiting how unnatural the process used to produce d itself is. We could define a distribution over burrito distributions, and then define a naturalness measure for other distributions over burrito distributions relative to this one. If d is sampled from a not-too-unnatural distribution over burrito distributions, then the sequence of burritos obtained by sampling d and then sampling burritos from d would not be too unnatural relative to the sequence you would expect from sampling a burrito distribution from the original distribution over burrito distributions and then sampling burritos from that.
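A toy sketch of this two-level process (the Dirichlet-style prior and the burrito types are my own illustrative assumptions): sample a burrito distribution d from a distribution over burrito distributions, then sample a sequence of burritos from d, and judge sequences against this two-level process rather than against b alone.

```python
import random

# Two-level sampling: first sample a burrito distribution d from a prior over
# distributions, then sample a sequence of burritos from d.  The prior here is
# a crude Dirichlet(1, 1, 1) over mixtures of three burrito types.

BURRITO_TYPES = ["bean", "chicken", "veggie"]

def sample_burrito_distribution(rng):
    """Sample d from the prior (normalized exponentials give Dirichlet(1,...,1))."""
    weights = [rng.expovariate(1.0) for _ in BURRITO_TYPES]
    total = sum(weights)
    return dict(zip(BURRITO_TYPES, (w / total for w in weights)))

def sample_sequence(d, n, rng):
    """Sample a sequence of n burritos from a fixed d."""
    return rng.choices(list(d), weights=d.values(), k=n)

rng = random.Random(0)
d = sample_burrito_distribution(rng)
print(d)
print(sample_sequence(d, 5, rng))
# The naturalness of an entire burrito sequence would then be measured relative
# to this generative process, not relative to repeated independent draws from b.
```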
There might be ways to use an unnaturalness measure like this pervasively throughout an AI system to get global security guarantees (Eliezer mentioned something similar in a discussion about satisficers), but I haven't thought about this thoroughly.
Alternatives and variations