# 13

Solomonoff Induction is a sort of mathematically ideal specification of machine learning. It works by trying every possible computer program and testing how likely they are to have produced the data. Then it weights them by their probability.

Obviously Solomonoff Induction is impossible to do in the real world. But it forms the basis of AIXI and other theoretical work in AI. It's a counterargument to the no free lunch theorem; that we don't care about the space of all possible datasets, but ones which are generated by some algorithm. It's even been proposed as a basis for a universal intelligence test.

Many people believe that trying to approximate Solomonoff Induction is the way forward in AI. And any machine learning algorithm that actually works, to some extent, must be an approximation of Solomonoff Induction.

But how do we go about trying to approximate true Solomonoff Induction? It's basically an impossible task. Even if you make restrictions to remove all the obvious problems like infinite loops/non-halting behavior. The space of possibilities is just too huge to reasonably search through. And it's discrete - you can't just flip a few bits in a program and find another similar program.

We can simplify the problem a great deal by searching through logic circuits. Some people disagree about whether logic circuits should be classified as Turing complete, but it's not really important. We still get the best property of Solomonoff Inducion; that it allows most interesting problems to be modelled much more naturally. In the worst case you have some overhead to specify the memory cells you need to emulate a Turing machine.

Logic circuits have some nicer properties compared to arbitrary computer programs, but they still are discrete and hard to do inference on. To fix this we can easily make continuous versions of logic circuits. Go back to analog. It's capable of doing all the same functions, but also working with real valued states instead of binary.

Instead of flipping between discrete states, we can slightly increase connections between circuits, and it will only slightly change the behavior. This is very nice, because we have algorithms like MCMC that can efficiently approximate true bayesian inference on continuous parameters.

And we are no longer restricted to boolean gates, we can use any function that takes real numbers. Like a function that takes a sum of all of it's inputs, or one that squishes a real number between 0 and 1.

We can also look at how much changing the input of a circuit slightly, changes the output. Then we can go to all the circuits that connect to it in the previous time step. And we can see how much changing each of their input changes their output, and therefore the output of the first logic gate.

And we can go to those gates' inputs, and so on, chaining it all the way through the whole circuit. Finding out how much a slight change to each connection will change the final output. This is called the gradient, and we can then do gradient descent. Basically change each parameter slightly in the direction that increases the output the way we want.

This is a very efficient optimization algorithm. With it we can rapidly find circuits that fit functions we want. Like predicting the price of a stock given the past history, or recognizing a number in an image, or something like that.

But this isn't quite Solomonoff Induction. Since we are finding the best single model, instead of testing the space of all possible models. This is important because essentially each model is like a hypothesis. There can be multiple hypotheses which also fit the data yet predict different things.

There are many tricks we can do to approximate this. For example, if you randomly turn off each gate with 50% probability and then optimize the whole circuit to deal with this. For some reason this somewhat approximates the results of true bayesian inference. You can also fit a distribution over each parameter, instead of a single value, and approximate bayesian inference that way.

Although I never said it, everything I've mentioned about continuous circuits is equivalent to Artificial Neural Networks. I've shown how they can be derived from first principles. My goal was to show that ANNs do approximate true Solomonoff Induction. I've found the Bayes-Structure.

It's worth mentioning that Solomonoff Induction has some problems. It's still an ideal way to do inference on data, it just has problems with self-reference. An AI based on SI might do bad things like believe in an afterlife, or replace it's reward signal with an artificial one (e.g. drugs.) It might not fully comprehend that it's just a computer, and exists inside the world that it is observing.

Interestingly, humans also have these problem to some degree.

Reposted from my blog here.

# 13

New Comment

This is a subject that I'm fairly well-read on. Look up Eray Ozkural's work on this.

You have correctly identified that the problem with Solomonoff induction (at least as it's usually formulated) is that representations of programs in terms of binary strings (the Turing machine formalism) isn't very ideal for approximating Solomonoff induction. Circuits and/or neural networks are somewhat better, but still extremely limited because as you try to do more and more complicated computations, your neural networks become exponentially more complicated in structure.

Ideally, you want a representation system that is adaptive - it can use solutions found in previous problem-solving attempts, apply abstraction, and work in a 'higher' representation space. That is, it develops its own 'language' for representing programs at a high level where the search space is correspondingly smaller. This is the basis of the OOPS system (optimally ordered problem solver).

But we can still do better. Instead of using the Turing representation, using stochastic context-sensitive grammars allows a far more compact and easily navigable search space. This is the basis of Ozkural's work, such as http://link.springer.com/chapter/10.1007%2F978-3-319-09274-4_12#page-1

Appreciate your linking to this guy's work...so, err, is any of it not pay-walled?

Most of it should be publicly available in some form with a little bit of searching; here's the free version of that article: http://agi-conf.org/2014/wp-content/uploads/2014/08/ozkural-application-agi14.pdf

Sincere thanks.

Upvoted - I had considered writing a similar post but never got around to it because my version of conveying this idea translated into something far too long.

ANNs are just a universal model family (perhaps the simplest canonical model for analog/algebraic circuits) for which the inference/learning algorithm is completely orthogonal. You can use anything from genetic algorithms to gradient descent to full Bayesian to learn the weights or program from data. Of course, for performance reasons, one should never go full bayesian.

I think you missed the most important key performance principle behind deep nets - deep factoring. The real world has structure in the form of complex symmetries which can be exploited by deep factored circuits which reuse subcomputations. AIXI is beyond stupid because it learns a huge space of 2^N models where each model performs 2^N ops in complete isolation. In a deep factored model you focus the search on model subspaces where almost all of the subcomputations between submodels can be shared, resulting in exponential speedup. This is the true power of deep models, which all large successful ANNs exploit.

This is an interesting way to view neural networks. After seeing your post, I happened to come across Jürgen Schmidhuber's AMA, where he gave a similar but less detailed explanation about how "RNN-based CM system of 1990 may be viewed as a limited, downscaled, sub-optimal, but at least computationally feasible approximation of AIXI".

People here seem to really like Solomonoff induction, but I don't think it's all that relevant to learning in practice due to computational complexity.

Solomonoff induction is not computable. Trying to "approximate" it, by coming up with hypotheses similar to it, is probably also not computable.

If you replace Solomonoff induction with induction over programs that halt quickly or induction over circuits, it becomes computable, but it is still NP-hard. Again, approximating this is probably also NP-hard, depending on your definition of approximation.

Next, if you replace boolean circuits with neural nets, it is still hard to find the best neural net to fit the data. MCMC and gradient descent only find local optima. I mean, the fact that neural nets didn't give us strong AI back in the 70s demonstrates that they are not doing anything close to Solomonoff induction.

It's not even clear that a learning program must approximate Bayesian inference. There are things like PAC learning that don't do that at all.

MCMC and gradient descent only find local optima.

Not strictly true. Simply by overparameterizing the ANN, you can emulate parallel gradient descent (global optimization). Stochastic techniques - such as dropout - and tricks such as momentum can also help tunnel through local optima. In addition, the typical optimization hurdles tend to be saddlepoints rather than local optima.

Modern SGD mechanisms are powerful global optimizers.

I mean, the fact that neural nets didn't give us strong AI back in the 70s demonstrates that they are not doing anything close to Solomonoff induction.

It demonstrates nothing of the sort, but such a demonstration would be pointless anyway. Solomonoff induction is completely worthless - intractable - so you absolutely don't want to do that anyway.

Modern SGD mechanisms are powerful global optimizers.

They are heuristic optimizers that have no guarantees of finding a global optimum. It's strange to call them "powerful global optimizers".

Solomonoff induction is completely worthless - intractable - so you absolutely don't want to do that anyway.

I believe that was my point.

They [modern SGD systems] are heuristic optimizers that have no guarantees of finding a global optimum. It's strange to call them "powerful global optimizers".

Not at all strange. As I pointed out in the post above - modern SGD functions as a global optimizer due to clever tricks such as over-parameterization, dropout, and momentum. These optimizers are also suprisingly powerful as measured in terms of actual solution quality vs computational cost - indeed they dominate most of the landscape. It is completely disingenous to call the dominate solution anything other than powerful.

In practice strong guarantees are not important. What actually matters is finding a solution that is good enough relative to computational costs, where 'good enough' is determined according to some more complex business/economic utility metric, where solution quality follows diminishing returns. For sufficiently complex problems, finding the actual global optimum is rarely ideal - it is far too far to the right on the asymptotic diminishing returns curve.

What actually matters is finding a solution that is good enough relative to computational costs

That is true, but calling methods to find such solutions "global optimizers" is still a misnaming. Adding some noise/randomness to your local optimization does not a global optimizer make.

Of course, stochastic local-ish optimizers can be very powerful and very useful. They are just not global.

Disagree, but this may be now on the level of terminology.

By powerful global optimizer, I mean an optimizer which can find global optimums roughly using similar time/space to other techniques, and more importantly can find good enough solutions that are very close to the true global optimum but using much less time/space.

Adding some noise/randomness to your local optimization does not a global optimizer make.

Actually it does. As I pointed out above, adding the right kind of noise/randomness allows SGD to find the global optimum (given enough chances/time). This is just trivially true - if you randomly search long enough and retain the best solution found so far, you are guaranteed to eventually find the optimal solution.

Thinking about it a bit, the "global" vs "local" property is not really a property of an optimizer, as it is a combined property of the optimizer and the search space (aka the error landscape). For an entirely arbitrary search space, only the exhaustive search and equivalents will guarantee you the global optimum. For simple search spaces (e.g. quadratic optimization) easy "local" things like hill-climbing will guarantee you the global optimum. If your solution is bounded, you always know how close to the theoretical best you are and can adjust the search accordingly (e.g. terminate early), etc.

Yes. In some cases simple (convex) search spaces even permit exact non-iterative analytic solutions which are faster than any iterative approach like SGD (although curiously, they are typically not dramatically faster - as they still need to analyze each example once, and the number of full epochs/iterations in SGD type approaches appears to be sublinear. )

The more advanced methods - natural gradient, hessian free, etc - expend extra computation to analyze the local search space and then use this analysis to make an improved nonlocal jump. This always has extra cost though, so these techniques just end up tending to match SGD in terms of net optimization speed.

Looking at it another way, when a researcher uses something like a convex optimization technique to solve a problem more quickly - what actually happened is the combined human+machine system used a global optimizer to solve the problem, with more of the algorithm running in the human's brain than on the machine. ;)

It's not even clear that a learning program must approximate Bayesian inference. There are things like PAC learning that don't do that at all.

Can you expand on the relationship between the two? As far as I can see, Bayesian learning starts from a prior distribution, and PAC learning (for PAC-learnable concepts) works independently of the distribution. That is, they're solving average-case vs. worst-case or adversarial problems, which is a subject Eliezer has discussed here. Further discussion, mostly not by Eliezer, here. Those discussions don't mention PAC though.

PAC-Bayesian hybrids appear to also be a thing, but all I know about that is the Google search I just did.

I suppose that in some sense, the Bayesian-vs-PAC comparison is the learning analogue of the average-case vs. worst-case comparison (where Bayesian=average case, PAC=worst case). It's not a perfect analogy, but in both cases a big question is what if you don't know the distribution/prior?

In the learning version, I think the problem is even more severe - I don't know if an analogue of de-randomization works anymore, and I don't think the "P=BPP" conjecture has an analogue.

The learning version has some additional complications, in that Bayesian inference is computationally intractable in many cases.

what if you don't know the distribution/prior?

I suppose the Bayesian answer to that is that a probability distribution is a description of one's knowledge, and that in principle, every state of knowledge, including total ignorance, can be represented as a prior distribution. In practice, one may not know how to do that. Fundamentalist Bayesians say that that is a weakness in our knowledge, while everyone else, from weak Bayesians to Sunday Bayesians, crypto-frequentists, and ardent frequentists, say it's a weakness of Bayesian reasoning. Not being a statistician, I don't need to take a view, although I incline against arguments deducing impossibility from ignorance.

The learning version

What are you contrasting with learning? I don't see a demarcation of principle between learning and any other sort of statistical prediction.

I suppose the Bayesian answer to that is that a probability distribution is a description of one's knowledge, and that in principle, every state of knowledge, including total ignorance, can be represented as a prior distribution. In practice, one may not know how to do that. Fundamentalist Bayesians say that that is a weakness in our knowledge, while everyone else, from weak Bayesians to Sunday Bayesians, crypto-frequentists, and ardent frequentists, say it's a weakness of Bayesian reasoning. Not being a statistician, I don't need to take a view, although I incline against arguments deducing impossibility from ignorance.

I don't have any strong disagreements there. But consider: if we can learn well even without assuming any distribution or prior, isn't that worth exploring? The fact that there is an alternative to Bayesianism - one that we can prove works (in some well-defined settings), and isn't just naive frequentism - is pretty fascinating, isn't it?

What are you contrasting with learning?

I'm contrasting randomized vs. deterministic algorithms, which Eliezer discussed in your linked article, with Bayesian vs. PAC learning models. The randomized vs. deterministic question shouldn't really be considered learning, unless you want to call things like primality testing "learning".

It's true that it's only possible to find local optima, but that's true with any algorithm. Unless P=NP, perfect mathematically optimal AI is probably impossible. But in practice, SGD is extremely good at optimizing NNs, and the local optima issue isn't a huge problem.

As to why we can have decent machine learning and not AGI, I don't know. Approximating SI isn't sufficient for one, you need to act on the models you find. But even so, I'm more trying to say it's is a promising research direction, and better than other stuff. I'm not saying it's a solved problem. Or that there are no further advances to be made.

It's not even clear that a learning program must approximate Bayesian inference.

Everything approximates Bayesian inference, it's just a matter of how ideal the approximation is. If you have enough data, the maximum likelihood approaches bayesian inference. And that's how NNs are typically trained. But it's nice to have other hypotheses and not just assume that the most likely hypothesis is 100% correct.

But in practice, SGD is extremely good at optimizing NNs, and the local optima issue isn't a huge problem.

That's not even true. In practice, it's the best we've got, but it's still terrible in most interesting settings (or else you could solve NP-hard problems in practice, which you can't).

As to why we can have decent machine learning and not AGI, I don't know.

It's because the neural net algorithms are not even close to finding the optimal neural net in complex situations.

Approximating SI isn't sufficient for one, you need to act on the models you find.

That's trivial to do. It's not the problem here.

Everything approximates Bayesian inference, it's just a matter of how ideal the approximation is.

This might be true in some sense, but not in a meaningful one. PAC learning, for instance, is fundamentally non-Bayesian. Saying that PAC learning approximates Bayesian inference is the same as saying that Bayesian inference approximates PAC learning. It's not a very meaningful statement.

People on LW tend to be hard-core Bayesians who have never even heard of PAC learning, which is an entire branch of learning theory. I find it rather strange.

That's not even true. In practice, it's the best we've got, but it's still terrible in most interesting settings (or else you could solve NP-hard problems in practice, which you can't).

SGD seems to be sufficient to train neural networks on real world datasets. I would honestly argue that it's better than whatever algorithms humans use. In that generally it learns much faster and far better on many problems.

That's not sufficient for general intelligence. But I think the missing piece is not some super powerful optimization algorithm. A better optimizer is not going to change the field much or give us AGI.

And I also don't think the magic of the human brain is in really good optimization. Most scientists don't even believe that the brain can do anything as advanced as backpropagation, which requires accurately estimating gradients and propagating them backwards through long time steps. Most biologically plausible algorithms I've seen are super restricted or primitive.

It's because the neural net algorithms are not even close to finding the optimal neural net in complex situations.

I disagree. You can find the optimal NN and it still might not be very good. For example, imagine feeding all the pixels of an image into a big NN. No matter how good the optimization, it will do way worse than one which exploits the structure of images. Like convolutional NNs, which have massive regularity and repeat the same pattern many times across the image (an edge detector on one part of an image is the same at another part.)

That's trivial to do. It's not the problem here.

It's really not. Typical reinforcement learning is much more primitive than AIXI. AIXI, as best I understand it, actually simulates every hypothesis forward and picks the series of actions that lead to the best expected reward.

Even if you create a bunch of really good models, simulating them forward thousands of steps for every possible series of actions is impossible.

I disagree. You can find the optimal NN and it still might not be very good. For example, imagine feeding all the pixels of an image into a big NN. No matter how good the optimization, it will do way worse than one which exploits the structure of images. Like convolutional NNs, which have massive regularity and repeat the same pattern many times across the image (an edge detector on one part of an image is the same at another part.)

If you can find the optimal NN, that basically lets you solve circuit minimization, an NP-hard task. This will allow you to find the best computationally-tractable hypothesis for any problem, which is similar to Solomonoff induction for practical purposes. It will certainly be a huge improvement over current NN approaches, and it may indeed lead to AGI. Unfortunately, it's probably impossible.

It's really not. Typical reinforcement learning is much more primitive than AIXI. AIXI, as best I understand it, actually simulates every hypothesis forward and picks the series of actions that lead to the best expected reward.

I was only trying to say that if you're finding the best NN, then simulating them is easy. I agree that this is not the full AIXI. I guess I misunderstood you - I thought you were trying to say that the reason NN doesn't give us AGI is because they are hard to simulate.

PAC learning, for instance, is fundamentally non-Bayesian. Saying that PAC learning approximates Bayesian inference is the same as saying that Bayesian inference approximates PAC learning. It's not a very meaningful statement.

I looked into PAC learning a bit when Scott Aaronson talked about it on his blog, and came to the following conclusion. 'Instead of saying “PAC-learning and Bayesianism are two different useful formalisms for reasoning about learning and prediction” I think we can keep just Bayesianism and reinterpret PAC-learning results as Bayesian-learning results which say that in some special circumstances, it doesn’t matter exactly what prior one uses. In those circumstances, Bayesianism will work regardless.'

Of course that was 7 years ago and I probably barely scratched the surface of the PAC learning literature even then. Are there any PAC learning results which can't be reinterpreted this way?

PAC-learning has no concept of prior or even of likelihood, and it allows you to learn regardless. If by "Bayesianism" you mean "learning", then sure, PAC-learning is a type of Bayesianism. But I don't see why it's useful to view it that way (Bayes's rule is never used, for example).

It's true that it's only possible to find local optima, but that's true with any algorithm.

Whaaaat? Exhaustive search is an algorithm, it will find you the global optimum anywhere. For many structures of the search space it's not hard to find the global optimum with appropriate algorithms.

Everything approximates Bayesian inference, it's just a matter of how ideal the approximation is. If you have enough data, the maximum liklihood approaches bayesian inference.

Huh?

In that sense stochastic gradient descent will also find the global optimum, since the randomness will eventually push it to every point possible. It will just take the eternity of the universe, but so will exhaustive search.

It's also trivial to modify any local algorithm to be global, by occasionally moving around randomly. This is also effective in practice, at finding better local optima.

Everything approximates Bayesian inference, it's just a matter of how ideal the approximation is. If you have enough data, the maximum likelihood approaches bayesian inference.

Huh?

There is a view that everything that works must be an approximation of the ideal Bayesian method. This is argued by Yudkowsky in Beautiful Probability and Searching for Bayes-Structure.

I used Maximum likelihood as an example, which is where you take the most probable hypothesis (parameters in a statistical model.) Instead of weighing many hypotheses the bayesian way. If you have enough data, the most probable hypothesis should converge to the correct one.

There is a view that everything that works must be an approximation of the ideal Bayesian method.

You can reformulate many problems in the Bayesian framework. This does not mean that everything is an approximation of Bayesianism -- just like the ability to translate a novel into French does not mean that each novel is an approximation of a French roman.

It's deeper than that. Bayesian probability theory is a mathematical law. Anything method that works must be computing an approximation of it. Just like Newtonian mechanics is a very close approximation of relativity. But they are not equivalent.

Bayesian probability theory is a mathematical law.

That is not true. The Bayes equation is mathematically correct. A theory is much wider -- for example, Bayesians interpret probability as a degree of belief -- is that also a mathematical law? You need a prior to start -- what does the "mathematical law" say about priors?

Tell me, did Eliezer even address PAC learning in his writing? If not, I would say that he's being over-confident and ignorant in stating that Bayesian probability is all there is and everything else is a mere approximation.

PAC-learning is definitely something we don't talk about enough around here, but I don't see what the conflict is with it being an approximation of Bayesian updating.

Here's how I see it: You're updating (approximately) over a limited space of hypotheses that might not contain the true hypothesis, and then this idea that the best model in your space can still be approximately correct is expressible both on Bayesian and on frequentiist grounds (the approximate update over models being equivalent to an approximate update over predictions when you expect the universe to be modelable, and also the best model having a good frequency of success over the long run if the real universe is drawn from a sufficiently nice distribution).

But I'm definitely a n00b at this stuff, so if you have other ideas (and reading recommendations) I'd be happy to hear them.

Here's how I see it: You're updating (approximately) over a limited space of hypotheses that might not contain the true hypothesis, and then this idea that the best model in your space can still be approximately correct is expressible both on Bayesian and on frequentiist grounds (the approximate update over models being equivalent to an approximate update over predictions when you expect the universe to be modelable, and also the best model having a good frequency of success over the long run if the real universe is drawn from a sufficiently nice distribution).

The "update" doesn't use Bayes's rule; there's no prior; there's no concept of belief. Why should we still consider it Bayesian? I mean, if you consider any learning to be an approximation of Bayesian updating, then sure, PAC-learning qualifies. But that begs the question, doesn't it?

You can't do an exhaustive search on an infinite set.

You can't do an exhaustive search on an infinite set.

I haven't seen any infinite sets in reality.

The set of possible Turing Machines is infinite. Whether you consider that to satisfy your personal definition of "seen" or "in reality" isn't really relevant.

Can you unpack "approximation of Solomonoff induction"? Approximation in what sense?

I walk through each step in the post. You can approximate Turing complete algorithms with logic gates. And NNs are basically just continuous logic gates and can compute the same functions. And there are algorithms like MCMC which can approximate bayesian inference.

Estimator asked in what sense SI is approximated, not, given a sense, how SI is approximated in that sense. Can you give a metric for which the value is close to SI's value?

I don't really understand the question. Or what "sense" means in this context.

Are you asking how close the approximation is the ideal? I would say that depends on the amount of computing power available, and that it approaches it in the limit. But it gives reasonable answers on realistic computers, whereas SI does not.

There is also some loss based on the information required to build a finite turing machine of the right size. As opposed to the infinite number of other structures you can build with logic gates. E.g. a machine that is exactly like a finite turing machine, but the 15th memory cell is corrupted, etc.

I don't think this problem is unsolvable though. There are for example Neural Turing Machines which give it access to an infinite differentiable memory.

An example of a sense would be to define some quantification of how good an algorithm is, and then show that a particular algorithm has a large value for that quantity, compared to SI. In order to rigorously state that X approaches Y "in the limit", you have to have some index n, and some metric M, such that |M(Xn)-M(Yn)| -> 0. Otherwise, you're simply making a subjective statement that you find X to be "good". So, for instance, if you can show that the loss in utility in using your algorithm rather than SI goes to zero as the size of the dataset goes to infinity, that would be an objective sense in which your algorithm approximates SI.

It should be obvious that SGD over an appropriately general model - with appropriate random inits and continuous restarts with keep the max score solution found - will eventually converge on the global optimum, and will do so in expected time similar or better to any naive brute force search such as SI.

In particular SGD is good at exploiting any local smoothness in solution space.

any machine learning algorithm that actually works, to some extent, must be an approximation of Solomonoff Induction

the conclusion that ANNs do approximate Solomonoff induction follows immediately. Moreover, any ML algorithm should have the Bayes-Structure nature then, shouldn't it?

Of course, but what's interesting is how it is an approximation (and how close it is.) I don't think decision trees are a very good approximation, for example. You can compile any computer program to a neural network (and people have actually made such compilers), but it would take an exponentially large decision tree to do the same thing. And many ML algorithms like linear regression or naive bayes, don't even have the universal approximator property.

I also think that some AI researchers have written off neural networks because they seem too mathematically inelegant. Just some heuristics thrown together without any solid theory or principle.

The purpose of my writing is to show that they are elegant. Even further, that if you tried to come up with the ideal approximation of SI from first principles, you would just end up with NNs.

The purpose of my writing is to show that they are elegant. Even further, that if you tried to come up with the ideal approximation of SI from first principles, you would just end up with NNs.

Indeed. Although SGD is probably not the optimal approximation of Bayesian inference - for example it doesn't handle track/handle uncertainty at all, but that is a hot current area of research.

I only barely mentioned it in my post, but there are ways of approximating bayesian inference like MCMC. And in fact there are methods which can take advantage of stochastic gradient information, which should make them roughly as efficient as SGD.

There is also a recent paper by Deepmind, Weight Uncertainty in Neural Networks.