I searched the posts but didn't find a great deal of relevant information. Has anyone taken a serious crack at it, preferably someone who would like to share their thoughts? Is the material worthwhile? Are there any dubious portions or any sections one might want to avoid reading (either due to bad ideas or for time saving reasons)? I'm considering investing a chunk of time into investigating Legg's work so any feedback would be much appreciated, and it seems likely that there might be others who would like some perspective on it as well.

You need to understand Solomonoff's and Hutter's ideas first to see where Legg is coming from. One of the best introductions to these topics available online is Legg's "Solomonoff Induction", though Li and Vitanyi's book is more thorough if you can get it. Legg's paper about prediction is very nice. I haven't studied his other papers but they're probably nice too. He comes across as a smart and cautious researcher who doesn't make technical mistakes. His thesis seems to be a compilation of his previous papers, so maybe you're better off just reading them.

The thesis is quite readable and I found it valuable to sink deeply into the paradigm, rather than have things spread out over a bunch of papers.

The most worthless part of the thesis, IIRC*, was his discussion and collecting of definitions of intelligence; it doesn't help persuade anyone of the intelligence=sequence-prediction claim, and just takes up space.

* It's been a while; I've forgotten whether the thesis actually covers this or whether I'm thinking of another paper.

I've got Li and Vitanyi's book and am currently working through the Algorithmic Probability Theory sequence they suggest. I am also working through Legg's Solomonoff Induction paper.

I actually commented on your thread from February earlier today mentioning this paper, which seems to deal with the issues related to semi-measures in detail (something that you were indicating was very important) and it seems to do so in the context of the quote from Eliezer.

In particular, from the abstract:

Universal semimeasures work by modelling the sequence as generated by an unknown program running on a universal computer. Although these predictors are uncomputable, and so cannot be implemented in practice, they serve to describe an ideal: an existence proof for systems that predict better than humans.
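
For anyone who wants something concrete to poke at, here is a minimal sketch of the "mixture over programs" idea in the abstract. It is not Solomonoff induction: the real thing mixes over all programs for a universal machine and is uncomputable. As a stand-in, this assumes a tiny hypothesis class (every binary pattern up to `max_len` bits, repeated forever) and a made-up Occam prior of `2**(-2 * len(pattern))`. The function names are hypothetical.

```python
from itertools import product


def hypotheses(max_len=6):
    """Yield (pattern, prior_weight) for every binary pattern up to max_len bits.

    Each pattern stands in for the "program" that repeats it forever.
    The weight 2**(-2*n) is an invented Occam prior: shorter patterns get more mass.
    """
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits), 2.0 ** (-2 * n)


def predict_next(observed, max_len=6):
    """Return P(next bit == '1' | observed) under the toy mixture."""
    mass = {"0": 0.0, "1": 0.0}
    for pattern, weight in hypotheses(max_len):
        # Unroll the pattern far enough to cover the observation plus one more bit.
        reps = len(observed) // len(pattern) + 2
        output = pattern * reps
        # Hypotheses are deterministic, so conditioning just discards the
        # ones whose output disagrees with what was observed.
        if output.startswith(observed):
            mass[output[len(observed)]] += weight
    total = mass["0"] + mass["1"]
    if total == 0:
        return 0.5  # nothing in this tiny class fits the data
    return mass["1"] / total


if __name__ == "__main__":
    print(predict_next("1111"))
    print(predict_next("010101"))
```

After `"1111"` nearly all surviving prior mass sits on the short pattern `"1"`, so the mixture strongly predicts another `1`. After `"010101"` every surviving pattern predicts `0`.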

Yes, I already rederived most of these results and even made a tiny little bit of progress on the fringe :-) But it turned out to be tangential to the problem I'm trying to solve.

Note that most mainstream AI researchers are deeply skeptical of the AIXI/universal intelligence approach.

Observe the following parallel:

Decades ago the STRIPS formalism for automatic planning was very popular. The STRIPS formalism was extremely general: a wide variety of planning problems could be expressed in its action- and state-representation language. Furthermore, the STRIPS framework came with a tantalizing theoretical observation: if you could obtain good heuristic functions that provided an accurate lower-bound estimate of the distance to the goal, then the A* algorithm could be used to find an optimal plan. So the problem of achieving general intelligence was "just" a problem of finding good heuristic functions for graph search.

Nowadays the AIXI formalism is gaining popularity. The formalism is extremely general: nearly any problem can be formulated in its terms. It comes with a tantalizing theoretical observation: if the Kolmogorov complexity of the observation stream can be found, then the action sequence executed by the agent is provably optimal. So the problem of achieving general intelligence is "just" a problem of estimating the Kolmogorov complexity of an observation sequence.
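
To make the STRIPS half of the parallel concrete, here is a rough sketch of A* with a lower-bound (admissible) heuristic. It is a toy, not a STRIPS planner: the 4x4 grid, the wall positions, and the names `a_star`, `grid_neighbors`, and `manhattan` are all assumptions made for illustration. The point it shows is that the search itself is short and generic, and all the problem-specific knowledge lives in the heuristic.

```python
import heapq


def a_star(start, goal, neighbors, heuristic):
    """Return (cost, path) for a cheapest path, or None if the goal is unreachable.

    neighbors(state) yields (next_state, step_cost) pairs.
    heuristic(state) must never overestimate the remaining cost.
    """
    frontier = [(heuristic(start), 0, start, [start])]
    best_cost = {start: 0}
    while frontier:
        _, cost, state, path = heapq.heappop(frontier)
        if state == goal:
            return cost, path
        if cost > best_cost.get(state, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for nxt, step in neighbors(state):
            new_cost = cost + step
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(
                    frontier,
                    (new_cost + heuristic(nxt), new_cost, nxt, path + [nxt]),
                )
    return None


# Hypothetical toy problem: a 4x4 grid with a wall the agent must walk around.
SIZE = 4
WALLS = {(1, 0), (1, 1), (1, 2)}
GOAL = (3, 0)


def grid_neighbors(pos):
    x, y = pos
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nxt = (x + dx, y + dy)
        if 0 <= nxt[0] < SIZE and 0 <= nxt[1] < SIZE and nxt not in WALLS:
            yield nxt, 1


def manhattan(pos):
    # Ignores walls, so it can only underestimate the true distance.
    return abs(pos[0] - GOAL[0]) + abs(pos[1] - GOAL[1])


result = a_star((0, 0), GOAL, grid_neighbors, manhattan)

if __name__ == "__main__":
    print(result)
```

Swapping `manhattan` for a heuristic that always returns 0 still gives an optimal path (it degenerates to uniform-cost search); a better lower bound only reduces how much of the graph gets expanded.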

STRIPS vs AIXI:

Yes, these ideas are similar - though AIXI is quite a bit better - because it understands the role of Occam's razor.

Note that most mainstream AI researchers are deeply skeptical of the AIXI/universal intelligence approach.

It does seem there is some skepticism about AIXI - possibly deserved. However, "Solomonoff Induction" - which it is heavily based on - is a highly fundamental principle. There's a fair amount of skepticism and resistance concerning that as well. However, as far as I can see, this is entirely misguided. Often it seems due to entrenched existing ideas.

On the other hand, many others are embracing the ideas. They are certainly important to understand. Here is Shane Legg on the topic of the significance of this kind of material:

Another theme that I picked up was how central Hutter’s AIXI and my work on the universal intelligence measure has become: Marcus and I were being cited in presentations so often that by the last day many of the speakers were simply using our first names. As usual there were plenty of people who disagree with our approach, however it was clear that our work has become a major landmark in the area.

So, do these concerns make you think that it may be a bad idea to spend much time studying the AIXI approach? Could you perhaps suggest something I might be better off reading?

ETA: What do you think about these two papers both titled A Monte Carlo Approximation to AIXI?

I don't think it would be a good idea to focus obsessively on AIXI. If you want to study AI, study a broad set of topics. To find reading material, do a set of Google searches on obvious keywords and rank the results by citation count.

Regarding the papers, I have a very strong idiosyncratic belief about the path to AI: to succeed, researchers must study the world and not just algorithms. In my view, all results should be expressible in the form: because the real world data exhibits empirical structure X, algorithm Y succeeds in describing/predicting it. The algorithms presented by the papers you linked to probably work only because they happen to take advantage of some structure present in the toy problems they are tested on.

In my view, all results should be expressible in the form: because the real world data exhibits empirical structure X, algorithm Y succeeds in describing/predicting it.

Even if I know the exact probability distribution over images, there is an algorithmic problem (namely, how to do the inference), so your view is definitely at least a little too extreme.

In fact, this algorithmic difficulty is an issue that many researchers are currently grappling with, so in practice you really shouldn't expect all results to be making a novel statement about how the world works. Applying this standard to current research would stall progress in the directions I (and I think most serious AI researchers) currently believe are most important to actually reaching AI, especially human-comprehensible AI which might possibly be friendly.

Maybe we are wrong, but the argument you gave, and your implications about how NFL should be applied, are not really relevant to that question.

Even if I know the exact probability distribution over images, there is an algorithmic problem (namely, how to do the inference), so your view is definitely at least a little too extreme.

I don't dispute that the algorithmic problem is interesting and important. I only claim that the empirical question is equally important.

Applying this standard to current research would stall progress in the directions I (and I think most serious AI researchers) currently believe are most important to actually reaching AI

What you're really saying is that you think a certain direction of research will be fruitful. That's fine. I disagree, but I doubt we can resolve the debate. Let's compare notes again in 2031.

In my view, all results should be expressible in the form: because the real world data exhibits empirical structure X, algorithm Y succeeds in describing/predicting it.

I think you are restating the No Free Lunch theorem, but that isn't a rare belief, is it?

Sure, many people are aware of the NFL theorem, but they don't take it seriously. If you don't believe me, read almost any computer vision paper. Vision researchers study algorithms, not images.

Sure, many people are aware of the NFL theorem, but they don't take it seriously.

Legg's thesis says:

Some, such as Edmonds (2006), argue that universal definitions of intelligence are impossible due to Wolpert’s so called “No Free Lunch” theorem (Wolpert and Macready, 1997). However this theorem, or any of the standard variants on it, cannot be applied to universal intelligence for the simple reason that we have not taken a uniform distribution over the space of environments. Instead we have used a highly non-uniform distribution based on Occam’s razor.

The No Free Lunch theorems seem obviously irrelevant to me. I have never understood why they get cited so much.
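
For reference, the measure the quote is talking about can be written out as follows. This is the Legg-Hutter definition in their usual notation, where $\pi$ is the agent, $E$ is the class of computable environments, $K(\mu)$ is the Kolmogorov complexity of environment $\mu$, and $V^{\pi}_{\mu}$ is the expected total reward of $\pi$ in $\mu$. The second line is only my illustration of what a uniform weighting over a finite class would look like; it does not appear in the thesis.

```latex
% Universal intelligence: environments weighted by an Occam prior
\Upsilon(\pi) \;:=\; \sum_{\mu \in E} 2^{-K(\mu)} \, V^{\pi}_{\mu}

% Illustration only: the uniform weighting assumed by NFL-style arguments
\Upsilon_{\text{uniform}}(\pi) \;:=\; \frac{1}{|E|} \sum_{\mu \in E} V^{\pi}_{\mu}
```

The weight $2^{-K(\mu)}$ is exactly the "highly non-uniform distribution" in the quote: simple environments dominate the sum, so an agent that does well on them scores well overall.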

Don't any vision researchers use Bayes? If so, they'd have to be researching the formulation of priors for the true state of the scene, since the likelihood is almost trivial.

I'm not really in the field, but I am vaguely familiar with the literature and this isn't how it works (though you might get that impression from reading LW).

A vision algorithm might face the following problem: reality picks an underlying physical scene and an image from some joint distribution. The algorithm looks at the image and must infer something about the scene. In this case, you need to integrate over a huge space to calculate likelihoods, which is generally completely intractable and so requires some algorithmic insight. For example, if you want to estimate the probability that there is an apple on the table, you need to integrate over the astronomically many possible scenes in which there is an apple on the table.
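
A toy version of that integral, in case it helps. Everything here is invented for illustration: three binary scene variables, independent priors, and a single "red blob" image feature with made-up likelihoods. The function names are hypothetical. With three variables there are 8 scenes to sum over; with n variables there are 2**n, which is where the intractability comes from.

```python
from itertools import product

# Hypothetical independent priors over what is in the scene.
PRIOR = {"apple": 0.2, "cup": 0.5, "lamp_on": 0.7}


def scene_prior(scene):
    """P(scene) under the independence assumption."""
    p = 1.0
    for name, present in scene.items():
        p *= PRIOR[name] if present else 1.0 - PRIOR[name]
    return p


def likelihood_red_blob(scene):
    """P(image shows a red blob | scene); the numbers are invented."""
    if scene["apple"] and scene["lamp_on"]:
        return 0.9
    if scene["apple"]:
        return 0.3
    return 0.05


def posterior_apple_given_red_blob():
    """P(apple | red blob), computed by brute-force summation over all scenes."""
    names = list(PRIOR)
    joint_apple = 0.0
    evidence = 0.0
    for values in product([False, True], repeat=len(names)):
        scene = dict(zip(names, values))
        joint = scene_prior(scene) * likelihood_red_blob(scene)
        evidence += joint
        if scene["apple"]:
            joint_apple += joint
    return joint_apple / evidence


if __name__ == "__main__":
    print(posterior_apple_given_red_blob())
```

Brute force works here only because the scene space is tiny. Real scenes have continuous geometry, lighting, and texture, so the sum becomes an integral that needs some algorithmic insight to approximate.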

I don't know if this contradicts you, but this is a problem that biological brain/eye systems have to solve ("inverse optics"), and Steven Pinker has an excellent discussion of it from a Bayesian perspective in his book How the Mind Works. He mentions that the brain does heavily rely on priors that match our environment, which significantly narrows down the possible scenes that could "explain" a given retinal image pair. (You get optical illusions when a scene violates these assumptions.)

There are two parts to the problem: one is designing a model that describes the world well, and the other is using that model to infer things about the world from data. I agree that Bayesian is the correct adjective to apply to this process, but not necessarily that modeling the world is the most interesting part.

I think this paper, entitled "Region competition: Unifying snakes, region growing, and Bayes/MDL for multiband image segmentation" is indicative of the overall mindset. Even though the title explicitly mentions Bayes and MDL, the paper doesn't report any compression results - only segmentation results. Bayes/MDL are viewed as tricks to be used to achieve some other purpose, not as the fundamental principle justifying the research.

In my view, all results should be expressible in the form: because the real world data exhibits empirical structure X, algorithm Y succeeds in describing/predicting it.

AIXI does exactly that - it is based on Occam's razor.

Furthermore, it seems to me if AGI is the goal, researchers would need to study those features of the world that caused natural general intelligence (instantiated in humans) to arise. So thank goodness they're not doing that.

I would say researchers need to study the features of the world that make general intelligence possible. What computational structure does the world have that allows us to understand it?

What computational structure does the world have that allows us to understand it?

Hutter says:

Ockham's razor principle has been proven to be invaluable for understanding our world. Indeed, it not only seems a necessary but also sufficient founding principle of science. Until other necessary or sufficient principles are found, it is prudent to accept Ockham's razor as the foundation of inductive reasoning. So far, all attempts to discredit the universal role of Ockham's razor have failed.

I don't think we can resolve this debate, but let me try to clarify the differences in our positions (perhaps confusing to nonspecialists, since we both advocate compression).

Hutter/Legg/Tyler/etc (algorithmic approach) : Compression is the best measure of understanding. Therefore, to achieve general intelligence, we should search for general purpose compressors. It is not interesting to build specialized compressors. To achieve compression in spite of the NFL theorem, one must exploit empirical structure in the data, but the only empirical fact we require is that the world is computable. Because the compressors are general purpose, to demonstrate success it is sufficient to show that they work well on simple benchmark problems. There is no need to study the structure of specific datasets. To achieve good text compression, one simply finds a general purpose compressor and applies it to text. The problem is entirely a problem of mathematics and algorithm design.

Burfoot (empirical approach) : Compression is the best measure of understanding. However, general purpose compressors are far out of reach at this stage. Instead, one should develop specialized compressors that target specific data types (text, images, speech, music, etc). To achieve good compression in spite of the NFL, one must study the empirical structure of the respective data sets, and build that knowledge into the compressors. To compress text well, one should study grammar, parsing, word morphology, and related topics in linguistics. To demonstrate success, it is sufficient to show that a new compressor achieves a better compression rate on a standard benchmark. We should expect a good compressor to fail when applied to a data type for which it was not designed. Progress is achieved by obtaining a series of increasingly strong compression results (K-complexity upper bounds) on standard databases, while also adding new databases of greater scope and size.

Again, I don't think this debate can be resolved, but I think it's important to clarify the various positions.
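
A quick sketch of what "compression results as K-complexity upper bounds" means in practice. Kolmogorov complexity itself is uncomputable, but any lossless compressor gives an upper bound on it, up to an additive constant for the size of the decompressor. This uses Python's standard `zlib` as the general purpose compressor; the sample text and the helper name `compressed_bits` are assumptions for illustration.

```python
import os
import zlib


def compressed_bits(data: bytes) -> int:
    """Size in bits of the zlib-compressed data (compression level 9)."""
    return 8 * len(zlib.compress(data, 9))


# Highly structured data: the same sentence repeated many times.
text = b"the cat sat on the mat. " * 200
# Structureless data of the same length.
noise = os.urandom(len(text))

if __name__ == "__main__":
    print("raw bits:       ", 8 * len(text))
    print("text upper bound:", compressed_bits(text))
    print("noise upper bound:", compressed_bits(noise))
```

The repeated text compresses to a small fraction of its raw size, while the random bytes do not compress at all. A compressor with built-in knowledge of English would presumably beat `zlib` on real text, which is the kind of tighter bound the empirical approach is after.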

Thanks for the attempt at a position summary!

General purpose systems have their attractions. The human brain has done well out of the generality that it has.

However, I do see many virtues in narrower systems. Indeed, if you want to perform some specific task, a narrow expert system focussed on the problem domain will probably do a somewhat better job than a general purpose system. So, I would not say:

It is not interesting to build specialized compressors.

Rather, each specialized compressor encodes a little bit of a more general intelligence.

This is also a bit of a misrepresentation:

but the only empirical fact we require is that the world is computable

Occam's razor is the critical thing, really. That is an "empirical fact" - and without it we are pretty lost.

We do want general-purpose systems. If we have those, they can build whatever narrow systems we might need.

There are two visions of the path towards machine intelligence - one is of broadening narrow systems, and the other is of general forecasting systems increasing in power: the "forecasting first" scenario. Both seem likely to be important. I tend to promote the second approach partly for technical reasons, but partly because it currently gets so little air time and attention.

I read the latter one, and from a brief glance I think the first one is essentially the same paper. It uses some tricks to get a relatively efficient algorithm in the special case where all the agent has to do is recognize some simple patterns in the environment, somewhat simpler than regular expressions. It would never be able to learn that, in general, the distance the ball bounces up is 50% the distance that it fell, but if the falling distance were quantized and the bouncing distance were quantized and there was a maximum height the ball could fall from, it could eventually learn all possible combinations.

They also gave up on the idea of experimentation not being a special case for AIXI and instead use other heuristics to decide how much to experiment and how often to take the best known action for a short-term reward.

I believe they're doing the best they can, and for all I know it might be state of the art, but it isn't general intelligence.

The material is fundamental. It doesn't seem to mention The Wirehead Problem, though.

Download it from here: http://www.vetta.org/publications/

SIAI get a pretty favourable writeup. He did get their grant, though. That's one of the expenditures I approve of.

Looking at Schmidhuber and Hutter should help as well. Some links in the area.

Ideally, start off with Solomonoff induction. My links on that are at the bottom of this page.

I appreciate the links, but I've already reviewed or am in the process of reviewing most of them. I was looking for something dealing more specifically with the thesis itself, as I'm not sure whether I should read it, or if perhaps I should only read certain parts of it to save time.

Also (more than a bit off topic), I've enjoyed a number of your youtube videos in the past. A friend of mine has developed a fascination with keyboards and was quite pleased when he saw your video displaying your keyboard array.

The thesis is probably throwing yourself in at the deep end - not necessarily the best way to learn - perhaps. It depends a lot on what you have already studied so far, though.

You might be correct on that. As of now I suppose I should focus on mastering the basics. I've nearly finished Legg's write up of Solomonoff Induction, but since it seems like there is a good bit of controversy over the AIXI approach I suppose I'll go ahead and get a few more of the details of algorithmic probability theory under my belt and move on to something more obviously useful for a bit; like the details of machine learning and vision and maybe the ideas for category theoretic ontologies.

Again, I would point at Solomonoff Induction as being the really key idea, with AIXI being icing and complication to some extent.

The whole area seems under-explored to me, with plenty of potential low-hanging fruit. Which is pretty strange, considering how important the whole area is. Maybe people are put off by all the dense maths.

On the other hand, general systems tend to suffer from jack-of-all-trades syndrome. So: you may have to explore a little and then decide where your talents are best used.

On the other hand, general systems tend to suffer from jack-of-all-trades syndrome. So: you may have to explore a little and then decide where your talents are best used.

This seems to be the biggest issue for me; my tendency is to pick up as much as I can easily digest (relatively so, I do read portions of relevant texts and articles and often work out a few problems when the material calls for it) and move on. I do generally tend to return to certain material to delve deeper after it becomes clear to me that it would be useful to do so.

Most of my knowledge base right now is in mathematical logic (some recursive function theory, theory of computation, computational complexity, a smattering of category theory), some of the more discrete areas of mathematics (mainly algebraic structures, computational algebraic geometry) and analytic philosophy (philosophical logic, philosophy of mathematics, philosophy of science).

Over the past several (4-5) months I've been working off-and-on through the material suggested on LessWrong: the sequences, decision theory, Bayesian inference, evolutionary psychology, cognitive psychology, cognitive science etc. I've only gotten serious about tackling these and other areas related to FAI over the past couple of months (and very serious over the past few weeks).

Still, nothing seems to pop out at me as 'best suited for me'.

OK. Keep going - or take a break. Good luck!

Good luck!

Thanks!

I'll keep it up for as long as possible. I tend to become quite obsessed with more ordinary subjects that catch my attention, so this should be the case all the more with FAI as I do take Unfriendly AGI seriously as an existential threat and an FAI seriously as a major, major benefit to humanity.

Though I am not an AI researcher, it seems pretty obvious that knowledge of AIXI is the most important part of the mathematical background for work in Friendly AI.

Moreover, it seems quite useful to know how to wield a technical explanation of how an agent can come to know about its environment even if one is not interested in AI. In other words, epistemology seems too important to leave to non-mathematical methods.

Though I am not an AI researcher, it seems pretty obvious that knowledge of AIXI is the most important part of the mathematical background for work in Friendly AI.

I don't see it. Your intuition (telling you that it's obvious) is probably wrong, even if the claim is in some sense correct (in a non-obvious way).

(The title of "the most important" is ambiguous enough to open a possibility of arguing definitions.)

In other words, epistemology seems too important to leave to non-mathematical methods.

It doesn't follow that a particular piece of mathematics is the way to go.

Hi, Vladimir!

In other words, epistemology seems too important to leave to non-mathematical methods.

It doesn't follow that a particular piece of mathematics is the way to go.

Is there another non-trivial mathematical account of how an agent can come to have accurate knowledge of its environment that is general enough to deserve the name 'epistemology'?

This is a bad argument, since the best available option isn't necessarily a good option.

This is what I was thinking: investing too much time and energy in AIXI simply because it seems to be the most 'obvious' option currently available could blind you to other avenues of approach.

I think you should know the central construction; it's simple enough (half of Hutter's "gentle introduction" would suffice). But at least read some good textbooks (such as AIMA) that give you an overview of the field before charting an exploration of the primary literature (not sure if you mentioned before what your current background is).

I own a copy of AIMA, though I admittedly haven't read it from cover to cover. I did an independent study learning/coding some basic AI stuff about a year ago, the professor introduced me to AIMA.

not sure if you mentioned before what's your current background

It's a bit difficult to summarize. I sort of did so here, but I didn't include a lot of detail.

I suppose I could try to hit a few specifics; I was jumping around The Handbook of Brain Theory and Neural Networks for a bit, I picked up the overviews and read a few of the articles, but haven't really come back to it yet; I've read a good number of articles from the MIT Encyclopedia of Cognitive Science; I've read a (small) portion of "Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems" (I ended up ultimately delving too far into molecular biology and organic chem so I abandoned it for the time being, though I would like to look at Comp Neurosci again, maybe using From Neuron to Brain instead, which seems more approachable); I read a bit of "Dynamical Systems in Neuroscience: The Geometry of Excitability and Bursting" partly to get a sense of just how much current computational models of neurons might diverge from actual neuronal behavior but mostly to get an idea of some alternatives.

As I mentioned in my response to timtyler, I tend to cycle through my readings quite a bit. I like to pick up a small cluster of ideas and let them sink in and move on to something else, coming back to the material later if it still seems relevant to my interests. Once it's popped up a few times I make a more concerted effort to learn it. In any event, my main goal over the past few months was to try to get a better overview of a large amount of material relevant to FAI.

Is there another non-trivial mathematical account of how an agent can come to have accurate knowledge of its environment

Pretty much: Solomonoff Induction. That does most of the work in AIXI. OK, it won't design experiments for you, but there are various approaches to doing that...

When I use the word 'AIXI' above, I mean to include Solomonoff induction. I would have thought that was obvious.

One has to learn Solomonoff induction to learn AIXI.

AIXI is more than just Solomonoff induction. It is Solomonoff induction plus some other stuff. I'm a teensy bit concerned that you are giving AIXI credit for Solomonoff induction's moves.

AIXI is more than just Solomonoff induction. It is Solomonoff induction plus some other stuff.

Right. The other stuff is an account of the most fundamental and elementary kind of reinforcement learning. In my conversations (during meetups to which everyone is invited) with one of the Research Fellows at SIAI, reinforcement learning has come up more than Solomonoff induction.

But yeah, the OP should learn Solomonoff induction first, then decide whether to learn AIXI. That would have happened naturally if he'd started reading Legg's thesis, unless the OP has some weird habit of always finishing PhD theses that he has started.

Since we've gone back and forth twice, and no one's upvoted my contributions, this will probably be my last comment in this thread.

it seems pretty obvious that knowledge of AIXI is the most important part of the mathematical background for work in Friendly AI.

Possibly. That was my initial reaction as well, but now I'm not so sure.

epistemology seems too important to leave to non-mathematical methods.

Agreed.