I've been feeling burned on Overcoming Bias lately, meaning that I take too long to write my posts, which decreases the amount of recovery time, making me feel more burned, etc.
So I'm taking at most a one-week break. I'll post small units of rationality quotes each day, so as to not quite abandon you. I may even post some actual writing, if I feel spontaneous, but definitely not for the next two days; I have to enforce this break upon myself.
When I get back, my schedule calls for me to finish up the Anthropomorphism sequence, and then talk about Marcus Hutter's AIXI, which I think is the last brain-malfunction-causing subject I need to discuss. My posts should then hopefully go back to being shorter and easier.
Hey, at least I got through over a solid year of posts without taking a vacation.
Breaks are good for perspective. Have a margarita. Those are good for a single tiled smiley. I look forward to the talk on AIXI.
more a man than most!
Wait... Eliezer isn't a god?
To echo razib, I'm pretty impressed.
You know, I tried a couple times to read the "Gentle Introduction to AIXI" but didn't ever manage to get very far. I finally found a paragraph summary of the idea when someone linked to a discussion you and Ben had on the agi list. (AIXI searches through all possible algorithms up to a given length and tries each one to find the best for a given task. I'm sure that's a bit off.) It strikes me that no friendliness issues are addressed, but what's more, all of the hard parts of creating a practical AI are left unaddressed. AIXI doesn't give us any insight into what the algorithms behind human cognition are, nor into algorithms that might be more universal. In fact, there's hardly any algorithm there (which is why the theory was amenable to formal analysis). AIXI is probably interesting in some limited aspects, especially regarding theorem proving (I would guess) but I had much higher expectations given the abstract of the paper. And I guess that's sort of the point--a lot of people misunderstand what exactly AIXI means.
Anyway, I look forward to your next posts.
See ya later! I continue to enjoy reading your posts when ever you make them but I am sure a person needs to rest occasionally. Yep.
If some of you want to brush up on AIXI before Eli gets into that, I might suggest checking out my thesis which is now online:
SIAI has a curiously mixed attitude towards AIXI. On the SIAI website Hutter's AIXI book and related AIXI article are among the core readings list, then among the SIAI research agenda there are two AIXI related items based on research I've done. Recently, I was awarded an SIAI academic prize worth $10,000 for, you guessed it, my research into AIXI and related topics. And yet, Eli regularly describes AIXI as a "brain malfunction", or worse!
SIAI has a ... mixed attitude towards AIXI
SIAI has a ... mixed attitude towards AIXI
That's the right attitude to have in general: Encourage good work, but be prepared to criticize it as well.
He calls AIXI "brain-malfunction-causing." I don't think he says that AIXI is a malfunction itself!
Shane: "Brain-malfunction-causing," not "brain malfunction."
Eliezer: Best of luck.
Joshua and Nick:
Eli described AIXI itself as "awfully stupid" in a post here two months ago.
Seems obvious to me that AIXI is describing a fully general learner, which is not the same as a FAI by any stretch. In particular, it's missing all of the optimizations you might gain by narrowing the scope, and it's completely unfriendly. It's a pure utility maximizer, which means it's a step down from a smiley-face maximizer in terms of safety - it has no humane values.
An AIXI solving a mathematical game would optimize. An AIXI operating in the real world would waste an awful lot of time learning basic physics, and then wirehead - if you were lucky.
http://www.vetta.org/documents/Machine_Super_Intelligence.pdf is still pretty long and technical :-(
Shane, I believe he meant "AIXI would act stupid," not "AIXI is a stupid idea."
I rather objected to this bit:
Kolmogorov complexity has many powerful theoretical properties and is a central ingredient in the theory of universal artificial intelligence. Its most important property is that the complexity it assigns to strings and sequences does not depend too much on the choice of the universal Turing machine U.
...on page 32. Kolmogorov complexity and Solomonoff induction are highly language-specific. Almost all the discussions of Solomonoff induction I have seen gloss over this fact - with a comment along the lines of "oh, emulating another machine just takes up some fixed number of bits for the emulator" - so it doesn't make much difference.
IMHO, that is not acceptable. It does make a difference. IMO, if you choose a crappy, serial model of computation as a foundation for a metric, that choice needs more actively defending.
Shane, I meant that AIXI is the last difficult topic.
AIXI itself is a deranged god, but that's a separate story. I'm very fond of Hutter's work, I just don't think it means what Hutter seems to think it means. AIXI draws the line of demarcation between problems you can solve using known math and infinite computing power, and problems that are essentially structural in nature. I regard this as an important line of demarcation!
It's also the first AGI specification drawn in sufficient detail that you can really nail down what goes wrong - most AGI wannabes will just say, "Oh, my AI wouldn't do that" because it's all magical anyway.
...and, thanks for the well-wishes!
Thanks. That clears things up. Enjoy your break. Maybe you should not post quite so much? You really do seem to be writing rather a lot these days. By the time I get to replying to some of your comments you've already written another 5 posts!
Answering this question starts to feel a bit like living in the movie Groundhog Day. :-)
Usually the reference machine is taken to have a low state x symbol complexity, so you can't hide much in it. In other words the reference machine has to be in some sense simple.
Now look at the Kolmogorov complexity function. As you mention, if somebody else uses a different reference machine their measured Kolmogorov complexity will be different, where the maximal difference is bounded by some constant. How big is this bound? Pretty small. Many types of simple Turing machines have been shown to be able to simulate each other with a few hundred bits of input. It's also trivial for a serial machine to simulate a parallel machine. Remember that Kolmogorov complexity is a measure of information content, in the sense of shortest description. It's not a measure of how much computation was performed. There are other measures for that which are more complex...
Finally, look at the Solomonoff bound (bottom of page 38). There you see the Kolmogorov complexity of the true model of the environment. If you're using a different simple reference machine, this bound might go up by a few hundred bits. Is this a big deal? Well, yes: if you are using only a few bytes of input data, and that's all. But then Bayesian inference in general will have problems as your prior will strongly affect your posterior. The reference machine problem is the same issue, but in different language. However, what if you have more data, say a few kB or more, maybe much much more? In this case taking a few bytes longer to converge isn't really a bit deal. Especially considering that this is convergence for any unknown computable hypothesis. Say you want to predict the stock market, or results from particle physics experiments, or sentences in a book. Are a few bytes of extra data for the convergence bound going to make much difference? Not really. The Solomonoff predictor is still going to kick some serious butt.
Of course it's not computable, has to be approximated in practice... etc. etc. So why bother with all this? I see it as a kind of "mathematical philosophy". You take ideas about induction, learning, computation etc. and really nail them down hard in formal mathematics and then study what you've got. I think this gives you some insights into the nature of learning, intelligence etc. Of course, this is a rather subjective point. My own AGI project that I'm developing with some of my research buddies isn't directly based on Solomonoff Induction and AIXI, but we do draw on some related works (such as the universal intelligence measure suitably computably interpreted) and I do sometimes use AIXI as a kind of mental framework to think about some kinds of AGI design issues.
Eliezer, you have been pretty prolific. I've thoroughly enjoyed digesting your writing lo these eight years, but this blogging thing seems to have worked out especially well. Enjoy your well-deserved regrouping time.
To get the ball rolling on rationality quotes, here's a rationality quote from the father of the Greek enlightenment, Xenophanes:
But as for certain truth, no man has known it,
Nor will he know it; neither of the gods
Nor yet of all the things of which I speak.
And even if by chance he were to utter
The perfect truth, he would himself not know it;
For all is but a woven web of guesses
Enjoy your break.
Re: As you mention, if somebody else uses a different reference machine their measured Kolmogorov complexity will be different, where the maximal difference is bounded by some constant. How big is this bound? Pretty small.
No, the bound is infinite. I.e. there is no bound. You have to confine your attention to a subset of possible machines before you can do any bounding. What is the rationale for considering some machines and not others? One possibile criterion is what best approximates a small hardware implementation - i.e. whether the virtual machine can be small physically. However, a physical machine has experimental access to the constants of nature, which might take up considerable space in a Turing machine - so it does not seem reasonable to claim that a simple Turing machine is a good approximation to a small physical machine. If you reject the real for the virtual, what are the grounds for doing that?
Also, if it comes down to 100 bits, that is not exactly a small probabilty difference. It's a factor of 2^100. That might well make a big difference when comparing small theories - such as the ones often used by science.
These may not be show-stopping issues - but it seems odd to me that Solomonoff induction people seem to think they can gloss over them, perhaps to arrive all the faster at their bizarre conclusion that Solomonoff induction is the one true answer to the problem of the priors.
What is the rationale for considering some machines and not others?
Because we want to measure the information content of the string, not some crazy complex reference machine. That's why a tiny reference machine is used. In terms of inductive inference, when you say that the bound is infinitely large, what you're saying is that you don't believe in Occam's razor. In which case the whole Bayesian system can get weird. For example, if you have an arbitrarily strong prior belief that most of the world is full of purple chickens from Andromeda galaxy, well, Bayes' rule is not going to help you much. What you want is an uninformative prior distribution, or equivalently over computable distributions, a very simple reference machine.
Thanks to the rapid convergence of the posterior from a universal prior, that 2^100 factor is small for any moderate amount of data. Just look at the bound equation.
These things are not glossed over. Read the mathematical literature on the subject, it's all there.
That's why a tiny reference machine is used.
I think that Tim is pointing out that there is no available mathematical measure for the 'tinyness' of this machine which is not circular. You seem to be saying that the machine looks simple to most people and that all other machines which people class as simple could be simulated on this machine within a few hundred bits. This has two problems. Firstly, it is not provable that all other machines which we class as similarly simple will be simulated within a few hundred bits as it is an empirical question which other machines people find simple. I'll grant that we can be reasonably confident though. The second problem is that we haven't defined any canonical mathematical measure of simplicity, just a measure of 'simplicity relative to the empirical facts about what humans find simple'. Perhaps we could use physics instead of humans and look at physically small Turing machines and then have 'simplicity relative to the empirical facts about what can be done in small volumes'. These are no doubt interesting, but are still concepts of relative simplicity, not a canonical absolute simplicity/complexity. No such measure has been discovered, and perhaps there can be no such measure. We can contrast this with complexity of infinite strings, where there is a convergence between all base machines and thus an absolute measure of simplicity. The problem is that we are now looking for something to deal with finite strings, not infinite ones.
Do take it easy. Still the p-best, most p-thought-provoking blog writing anywhere on the intertubes.
Why not the standard approach of using Shannon's state x symbol complexity for Turing machines? If a reference machine has a very low state x symbol complexity then it is trivial to implement in our universe: we just need a few symbols, a few states, and a few transformation rules.
Why not the standard approach of using Shannon's state x symbol complexity for Turing machines?
Why choose a Turing machine? They are clearly not a canonical mathematical entity, just a historical artifact. Their level of power is a canonical mathematical entity, but there are many Turing-equivalent models of computation. This just gets us simplicity relative to Turing machines where what we wanted was simplicity simpliciter (i.e. absolute simplicity). If someone came to you with a seemingly bizarre Turing-complete model, where the shortest program for successor was 3^^^3 bits and all the short programs were reserved for things that look crazy to us, how can you show him that Turing machines are the more appropriate model? Of course, it is obvious to us that Turing machines are in some sense a better judge of the notion of simplicity that we want, but can we show this mathematically? If we can, it hasn't yet been done. Turing machines might look simple to Turing machines (and human brains) but this new model might look simple to itself (e.g. has a 1 bit universal machine etc.).
It looks like we have to settle for a non-canonical concept of simplicity relative to human intuitions or simplicity relative to physics or the like. I think this is a deep point, which is particularly puzzling and not sufficiently acknowledged in Kolmolgorov complexity circles. It feels like there must be some important mathematically canonical measure of complexity of finite strings, just like there is for infinite strings, but all we have are 'complexity-relative-to-X' and perhaps this is all there is.
Um, shouldn't someone be acknowledging, at this point, that what we're talking about here is exactly precisely "the problem of justifying induction without using induction" or "how to justify Occam's Razor without using Occam's Razor"?
Whether you switch to something else like lambda calculus or a trivial CA doesn't really matter. These all boil down to models with a few states and transitions and as such have simple physical realisations. When you have only a few states and transitions there isn't much space to move about. This is the bedrock. It isn't absolutely unique, sure, but the space is tight enough to have little impact on Solomonoff induction.
3^^^3 is a super gigantic monster number, and all these mind boggeling many shorter programs outputting things that are complex on a minimal state Turing machine (or lambda calculus, or minimal CA, or minimal...), where are you going to put all this? You can't squeeze it into something that is as ultra trivial as the Wolfram/Smith UTM that has just 2 states and 3 symbols.
Quiet from the gallery! You're on a break remember. :-)
Yeah, it basically does come down to that. You don't get something from nothing. An ultra tiny universal machine seems to be the most something from the closest to nothing we can achieve.
It is certainly similar to those problems, but slightly different. For example, justifying Occam's Razor requires a bit more than we need here. In our case, we are just looking for a canonical complexity measure for finite strings. For Occam's Razor we also need to show that we have reason to prefer theories expressible by simpler strings to those specified by more complex strings. As an example, we already have such a canonical complexity measure for infinite strings. It is not perfect, as you might want some complexity measure defined with o-machines instead, or with finite state automata or whatever. These would give different complexity measures, but at least the Turing machine level one marks out a basic type of complexity, rather than an infinite set of complexity measures, as for finite strings.
Where are you going to put all this?
Is that a physical question? If so, it is just complexity-relative-to-physics or to be more precise: complexity-relative-to-physical-size. If you mean that it seems more complex: of course it does to us, but we have no canonical way of showing that, as opposed to measures based on its rival machines from the same computability class (which is just begging the question). As I said, there might be a canonical complexity measure for finite strings, but if so, this hasn't been proven yet. I don't know what the upshot of all this is, and in many practical cases I'm sure the stuff I'm saying can be safely put aside, but it is worth being aware of it, particularly when trying to get appropriately general and theoretical results (such as the AIXI stuff). If we talk about AIXI(M) where AIXI is a function of machine M, and call my pathological machine P, then AIXI(TM) looks pretty much the same as AIXI(LambdaCalculus) and AIXI(Java) and every other sensible language we use, but it looks completely different to AIXI(P) until we start looking at strings of order 3^^^3. Whether that is a problem depends upon the questions being asked.
In our case, we are just looking for a canonical complexity measure for finite strings.
I see what you're saying, of course; but isn't that like saying that you only want a canonical measure of the length of a message, not its improbability?
I've written up some of my thoughts on the issue, and put them at: http://alife.co.uk/essays/the_one_true_razor/
Yes, in some sense the idea of Turing computation is a kind of physical principle in that no well defined process we know of is not Turing computable (for other readers: this includes chaotic systems and quantum systems as the wave function is computable... with great difficulty in some cases).
Actually, if you built P and it really was very trivial, then I could get my simple Turing machine to compute a quantum level simulation of your P implementation with far less than 3^^^3 bits of extra information. Thus, if your bound really only kicks in at 3^^^3 bits, then (within currently accepted quantum physics) no trivial physical implementation of P can be possible.
Anyway, as you can't specify P in just a few simple states and symbols, I do not consider it to be an acceptable reference machine (for strict theory purposes at least).
I'm enjoying reading your articles already a long time.
From my perspective, if you write less, its much better that to write too much. If you post one article per day - I can't read it anyway (I mean can't read & really digest it). What count's is quality, not quantity.
I just want to say, that I think you articles are very good, but if you post less of them, they will be even better.
I want to echo others here and thank you for the great article, and wish you a good break. Get a nice omega 3/folic acid/vitamin D/zinc cocktail and recharge that brain :)