Message Length


You lay it out very nicely. But I'd quibble that as long as your nth-order Markov chain isn't exceptionally small and fully deterministic, there might be room for more explanation. Maybe there's no explanation and the data is genuinely random, but what if it's a binary encoded Russian poem? When you've exhausted all self-contained short theories, that doesn't mean the work of science is done. You also need to exhaust all analogies with everything in the world whose complexity is already "paid for", and then look at that in turn, and so on.

The first thing I did, before even reading the article, was check that it wasn't ASCII or UTF-8 (or at least, if it was, it wasn't bit-aligned). That's definitely on the short list of things technical folks are going to instinctively check, along with maybe common "magic bytes" at the start of the maybe-a-file.

> Since the parameters in your implementation are 32-bit floats, you assign a complexity cost of 32 ⋅ 2^n bits to n-th order Markov chains, and look at the sum of fit (log loss) and complexity.

Something about this feels wrong. The precision of your floats shouldn't be what determines the complexity of your Markov chain; the expressivity of an nth-order Markov chain will almost always be worse than that of a (n+1)th-order Markov chain, *even if the latter has access to higher precision floats than the former*. Also, in the extreme case where you're working with real numbers, you'd end up with the absurd conclusion that every Markov chain has infinite complexity, which is obviously nonsensical.

This does raise the question of *how* to assign complexity to Markov chains; it's clearly going to be linear in the number of parameters (and hence exponential in the order of the chain), which means the general form k ⋅ 2^n seems correct... but the value you choose for the coefficient k seems underdetermined.

You're right! This is something that the literature has details on, that I chose to simplify/gloss-over in pursuit of my goal of "write Actual Executable Code implementing minimum-description-length model selection and then tell a cute story about it."

Chapter 5 of Peter D. Grünwald's *The Minimum Description Length Principle* gives the complexity term (assuming that the parameter space is compact) as L(k) + L(d) + k ⋅ d, where k is the number of parameters, d is the bits of precision per parameter, and L is the codelength function for a universal prior on the integers (the theoretical ideal that Levenshtein coding and the Elias ω coding approach). Basically, in order to specify your parameters, you need to first say how many of them there are and to what precision, and for that, you need a prefix-free code for the integers, which is itself kind of intricate: you can't give every integer the same codelength, so you end up doing something like "first give the order of magnitude (in, um, unary), and *then* pick out a specific integer in that range", but recursively (so that you first give the order of magnitude of the order of magnitude if it gets too big, *&c*.).

(Oh, and Grünwald warns that this still isn't optimal and that the question of *the best* encoding is left to §15.3, but I'm not there yet—it's a big textbook, okay?)

I was originally imagining implementing the integer coding, but I decided it was too complicated for the cute story I wanted to tell about model selection, particularly since I wanted to provide Actual Executable Code with minimal boilerplate and minimal dependencies. (Rust just gives me `f32` and `f64`, not variable-length floats.) I think this kind of handwaving should be admissible in the cute-blog-story genre as long as the author addresses objections in the comment section.
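(For concreteness: the non-recursive base case of that "order of magnitude in unary, then the specific integer" scheme is Elias *gamma* coding. A quick illustrative sketch, my code rather than anything from the textbook:)

```rust
// Elias gamma code: to encode n >= 1, write floor(log2 n) zeros (the
// "order of magnitude", in unary), then n itself in binary; the binary
// representation's leading 1 bit doubles as the unary terminator.
fn elias_gamma(n: u32) -> String {
    assert!(n >= 1, "gamma coding is defined for positive integers");
    let binary = format!("{:b}", n);
    let zeros = "0".repeat(binary.len() - 1);
    format!("{}{}", zeros, binary)
}

fn main() {
    for n in [1, 2, 5, 9] {
        println!("{} -> {}", n, elias_gamma(n));
    }
}
```

(The ω code then applies the same idea recursively to the length field, which is what makes it approach the universal-prior ideal.)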

I did a strange experiment once: I implemented a Markov predictor like this and used it to predict the next bit of a sequence. Then I extended the sequence with the negation of that bit and increased the length of the Markov chain. The result was a very random, unpredictable-looking sequence, but of course it was completely deterministic. I wonder if this sequence has a name?

I'll probably write up a proper review later. But right now I just wanted to say: the other day, someone sent me a big blob of code over IM. I looked at it, my eyes kinda glazed over, I was confused about why they sent it. I moved on with my day.

Only later did I realize the code was an SVG file that a) my computer program was supposed to automatically read and interpret, b) *I* totally could have noticed and interpreted and realized I should open it in photoshop.

And then I thought "man, I feel like I'm a character in some Zack Davis fiction, and I totally failed at my job."

This sounds like a classic example of the bias-variance tradeoff: adding parameters to your model means it can more accurately fit your data (lower bias), but is more sensitive to fluctuations in that data (higher variance). Total error when making predictions on new data is minimized when the bias and variance errors are balanced.

Another example: given n + 1 data points, you can always draw a polynomial of degree n that fits them perfectly. But the interpolated output values may vary wildly with slight perturbations to your measured data, which is unlikely to represent a real trend. Often a simple linear regression is the most appropriate model.
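(To make the "fits them perfectly" point concrete, here is a small Lagrange-interpolation sketch; the code and the toy data are illustrative, not from the comment. The unique interpolating polynomial reproduces every data point exactly, yet one slightly perturbed measurement pulls it away from the obvious linear trend as soon as you leave the data.)

```rust
// Lagrange interpolation: the unique degree-(n-1) polynomial through n
// points, evaluated at x. "Perfect fit" on the data is guaranteed by
// construction, which is exactly why it says nothing about generalization.
fn lagrange(points: &[(f64, f64)], x: f64) -> f64 {
    points
        .iter()
        .enumerate()
        .map(|(i, &(xi, yi))| {
            let basis: f64 = points
                .iter()
                .enumerate()
                .filter(|&(j, _)| j != i)
                .map(|(_, &(xj, _))| (x - xj) / (xi - xj))
                .product();
            yi * basis
        })
        .sum()
}

fn main() {
    // Nearly-linear data (y = x) with one slightly perturbed measurement.
    let data = [(0.0, 0.0), (1.0, 1.1), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)];
    // The degree-4 interpolant reproduces every data point exactly...
    for &(x, y) in &data {
        assert!((lagrange(&data, x) - y).abs() < 1e-9);
    }
    // ...but at x = 5 it gives 4.5 where the y = x trend would give 5.
    println!("interpolant at x = 5: {}", lagrange(&data, 5.0));
}
```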

Of course, you also need to store the concept of a Markov chain in the abstract as part of your models. (But that is constant, and should be fairly small in a good encoding.) On the other hand, 32-bit floats are excessive in precision, and a substantial proportion of floats aren't in the range [0, 1] at all. You could probably cut the precision down to 16 bits, maybe less. Of course, you also need a few bits for the order of the Markov chain, and the precision used.

Promoted to curated: This really is a quite good and straightforward explanation of what I think of as one of the really core ideas in theoretical rationality, covering surprisingly much ground in a way that feels approachable. I remember how impactful reading "A technical explanation of a technical explanation" was for my thinking about rationality, and this feels like a much less daunting introduction to some of the same ideas.

Thanks! But ... I'm not seeing the Curated "★" icon (*e.g.*, in the Posts list on my profile)? Am I missing something, or is there an extra button you still have to click?

(When we curate something, admins get an email version of the post immediately, then everyone else who's subscribed to curated gets the email after a 20 minute delay. Sometimes we notice a formatting problem with the email, un-curate during the 20 minute window, then re-curate it; that's what happened in this case.)

This covers a really impressive range of material -- well done! I just wanted to point out that if someone followed all of this and wanted more, Shannon's 1948 paper is surprisingly readable even today and is probably a nice companion:

http://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf

I liked the previous title - Msg Len - better, but I agree that it was maybe a bit too much ;-)

Really interesting post. To me, approaching information with mathematics seems like a black box - and in this post, it feels like magic.

I'm a little confused by the concept of cost: I understand that it takes more data to represent more complex systems, which grows exponentially faster than the amount of bits. But doesn't the more complex model still strictly fit the data better?—is it just trying to go for a different goal than *accuracy*? I feel like I'm missing the entire point of the end.

I am not sure whether my take on this is correct, so I'd be thankful if someone corrects me if I am wrong:

I think that if the goal was only 'predicting' this bit-sequence after knowing the sequence itself, one could just state probability 1 for the known sequence.

In the OP instead, we regard the bit-sequence as stemming from some sequence-generator, of which only this part of the output is known. Here, we only have limited data such that singling out a highly complex model out of model-space has to be weighed against the models' fit to the bit-sequence.

Recommended reading for understanding more of what happens with "regularization" in optimization and search algorithms.

Interesting, too, that the post comes the same week as a discussion on the Solomonoff prior.

> And *that's* how you can explain mysteriously frequent consecutive runs and alternations. If the last *two* bits being `01` (respectively `10`) makes it more likely for the next bit to be `0` (respectively `1`), *and* the last two bits being `00` (respectively `11`) makes it more likely for the next bit to be `0` (respectively `1`), then you would be more likely to see both long `0000`... or `1111`... consecutive runs and `01010`... alternations.

Alternatively, it could be a first-order chain that checks the digit located 2 bits before it and makes it more likely to match. This does, however, remove the possibility of assigning different likelihoods to `00` and `01` being followed by `0` (or `11` and `10` being followed by `1`).

Edit: or a coin biased to match the digit 2 entries before, removing the possibility of different likelihoods for chains of 1 and chains of 0 (or alternating chains beginning/ending on 1 or 0).

Realistically?—Python and Rust are the only two languages I'm Really Good at, and I'm more Excited about Rust because I write enough Python at my dayjob? In my defense, the `.windows` iterator on slices was really handy here and is shamefully missing from Python's `itertools`.

Toolz has a sliding window.

Hy's `partition` can also do a sliding window like Clojure's can, by setting the `step` argument.

Rust is a fascinating new language to learn, but not designed for scientific computing. For that Julia is the new kid on the block, and quite easy to pick up if you know Python.

Someone is broadcasting a stream of bits. You don't know why. A 500-bit-long sample looks like this:

The thought occurs to you to do Science to it—to ponder if there's some way you could better predict what bits are going to come next. At first you think you can't—it's just a bunch of random bits. You can't predict it, because that's what *random* means.

Or does it? True, if the sequence represented flips of a fair coin—every flip independently landing either `0` or `1` with exactly equal probability—then there would be no way you could predict what would come next: any continuation you could posit would be exactly as probable as any other.

But if the sequence represented flips of a *biased* coin—if, say, `1` came up 0.55 of the time instead of exactly 0.5—then it would be possible to predict better or worse. Your best bet for the next bit in isolation would always be `1`, and you would more strongly anticipate sequences with slightly more `1`s than `0`s.

You count 265 `1`s in the sample of 500 bits. Given the hypothesis that the bits were generated by a fair coin, the number of `1`s (or without loss of generality, `0`s) would be given by the binomial distribution (500 choose k)(0.5)^k (0.5)^(500−k), which has a standard deviation of √(500 ⋅ 0.5²) = √125 ≈ 11.18, so your observation of 265 − 250 = 15 excess `1`s is about 15/11.18 ≈ 1.34 standard deviations from the mean—well within the realm of plausibility of happening by chance, although you're at least slightly *suspicious* that the coin behind these bits might not be quite fair.

... that is, if it's even a coin. You love talking in terms of shiny, if hypothetical, "coins" rather than stodgy old "independent and identically distributed binary-valued random variables", but looking at the sample again, you begin to *further* doubt whether the bits are independent of each other. You've heard that humans are biased to overestimate the frequency of alternations (`101010`...) and underestimate the frequency of consecutive runs (`00000`... or `11111`...) in "truly" (uniformly) random data, but the 500-bit sample contains a run of 13 `0`s (starting at position 243) *and* a run of 19 `1`s (starting at position 319). You're not immediately sure how to calculate the probability of that, but your gut says that should be very unlikely given the biased-coin model, even after taking into account that human guts aren't very good at estimating these things.

Maybe not everything in the universe is a coin. What if the bits were being generated by a Markov chain—if the probability of the next bit depended on the value of the one just before? If a `0` made the *next* bit more likely to be a `0`, and the same for `1`, that would make the `00000`... and `11111`... runs less improbable.

Except ... the sample *also* has a run of 17 alternations (starting at position 153). On the "fair coin" model, shouldn't that itself be 2^(17−13) = 16 times as suspicious as the run of 13 `0`s and 2^(17−19) = 1/4 as suspicious as the run of 19 `1`s which led you to hypothesize a Markov chain?—or rather, 8 and 1/8 times as suspicious, respectively, because there are two ways for an alternation to occur (`0101010`... or `1010101`...).

A Markov chain in which a `0` or `1` makes another of the same more likely, makes alternations *less* likely: the Markov chain hypothesis can only make the consecutive runs look less surprising at the expense of making the run of alternations look *more* surprising.

So maybe it's all just a coincidence: the broadcast is random—whatever that means—and you're just apophenically pattern-matching on noise. Unless ...
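(The back-of-the-envelope numbers above, the ≈1.34 standard deviations and the powers-of-two surprise factors, can be checked mechanically. A sketch, assuming the normal approximation to the binomial:)

```rust
// Standard deviations away from the mean for `count` ones in `n` flips
// of a coin with bias `p` (normal approximation to Binomial(n, p)).
fn z_score(count: f64, n: f64, p: f64) -> f64 {
    (count - n * p) / (n * p * (1.0 - p)).sqrt()
}

fn main() {
    // 265 ones in 500 bits: sd = sqrt(500 * 0.5^2) ~ 11.18, so ~1.34
    // standard deviations: plausible by chance, but slightly suspicious.
    println!("z = {:.2}", z_score(265.0, 500.0, 0.5));

    // Under the fair coin, each extra bit in a run halves its probability,
    // so a 17-bit pattern is 2^(17-13) = 16 times as surprising as a
    // 13-bit one (before the factor of two for an alternation's two phases).
    assert_eq!(2.0_f64.powi(17 - 13), 16.0);
}
```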

Could it be that some things in the universe are *neither* coins *nor* Markov chains? You don't *know* who is broadcasting these bits or why; you called it "random" because you didn't see any obvious pattern, but now that you think about it, it would be pretty weird for someone to just be broadcasting random bits. Probably the broadcast is something like a movie or a stock ticker; if a close-up sample of the individual bits looks "random", that's only because you don't know the codec.

Trying to guess a video codec is *obviously* impossible. Does that kill all hope of being able to better predict future bits? *Maybe* not. Even if you don't know what the broadcast is really for, there might be some nontrivial *local* structure to it, where bits are statistically related to the bits nearby, like how a dumb encoding of a video might have consecutive runs of the same bit-pattern where a large portion of a frame is the same color, like the sky.

Local structure, where bits are statistically related to the bits nearby ... kind of like a Markov chain, except in a Markov chain the probability of the next state only depends on the *one* immediately before, which is a pretty narrow notion of "nearby." To broaden that, you could imagine the bits are being generated by a *higher-order* Markov chain, where the probability of the next bit depends on the previous *n* bits for some specific value of *n*.

And *that's* how you can explain mysteriously frequent consecutive runs and alternations. If the last *two* bits being `01` (respectively `10`) makes it more likely for the next bit to be `0` (respectively `1`), *and* the last two bits being `00` (respectively `11`) makes it more likely for the next bit to be `0` (respectively `1`), then you would be more likely to see both long `0000`... or `1111`... consecutive runs and `01010`... alternations.

A biased coin is just an *n*-th-order Markov chain where *n* = 0. An *n*-th-order Markov chain where *n* > 1 is just a first-order Markov chain where each "state" is a tuple of bits, rather than a single bit. Everything in the universe is a Markov chain!—with respect to the models you've considered so far.
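(That collapse of an *n*-th-order chain into a first-order one can be made concrete with a bit-packing trick. A sketch of mine, not from the post: the "state" is the last *n* bits packed into an integer, and pushing the next bit is a shift-and-mask.)

```rust
// First-order view of an n-th-order chain: the state is the last n bits
// packed into an integer; appending a bit shifts it in and masks to n bits.
fn next_state(state: usize, bit: usize, n: u32) -> usize {
    ((state << 1) | bit) & ((1 << n) - 1)
}

fn main() {
    // Second-order (n = 2): state `01` followed by a 1 becomes `11`...
    let mut state = 0b01;
    state = next_state(state, 1, 2);
    assert_eq!(state, 0b11);
    // ...followed by a 0 becomes `10`. From any state, only two successor
    // states are ever reachable, which is what makes it a chain over tuples.
    state = next_state(state, 0, 2);
    assert_eq!(state, 0b10);
}
```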

"The bits are being generated by a Markov chain of some order" is *a* theory, but a pretty broad one. To make it concrete enough to test, you need to posit some specific order *n*, and, given *n*, specific parameters for the next-bit-given-previous-*n* probabilities.

The *n* = 0 coin has one parameter: the bias of the coin, the probability of the next bit being `0`. (Or without loss of generality `1`; we just need one parameter *p* to specify the probability of one of the two possibilities, and then the probability of the other will be 1 − *p*.)

The *n* = 1 ordinary Markov chain has two parameters: the probability of the next bit being (without loss of generality) `0` *given* that the last bit was a `0`, and the probability of the next bit being (without loss of ...) `0` *given* that the last bit was a `1`.

The *n* = 2 second-order Markov chain has four parameters: the probability of the next bit being (without loss ...) `0` *given* that the last two bits were `00`, the probability of the next bit being (without ...) `0` *given* that the last two bits were `01`, the probability of the next bit being—*Enough!* You get it! The *n*-th order Markov chain has 2^*n* parameters!

Okay, but then how do you guess the parameters?

For the *n* = 0 coin, your *best guess* at the frequency-of-`0` parameter is going to just be the frequency of `0`s you've observed. Your best guess could easily be wrong, and probably is: just because you observed 235/500 = 0.47 `0`s, doesn't mean the parameter *is* 0.47: it's probably somewhat lower or higher, and your sample got more or fewer `0`s than average just by chance. But positing that the observed frequency is the actual parameter is the maximum likelihood estimate—the single value that *most* makes the data "look normal".

For *n* ≥ 1, it's the same idea: your best guess for the frequency-of-`0`-after-`0` parameter is just the frequency of `0` being the next bit, among all the places where `0` was the last bit, and so on.

You can write a program that takes the data and a degree *n*, and computes the maximum-likelihood estimate for the *n*-th order Markov chain that might have produced that data. Just slide an (*n*+1)-bit "window" over the data, and keep a tally of the frequencies of the "plus-one" last bit, for each of the 2^*n* possible *n*-bit patterns. In the Rust programming language, that looks like the following. (Where the representation of our final theory is output as a map (`HashMap`) from *n*+1-bit-patterns to frequencies/parameter-values (stored as a thirty-two-bit floating-point number, `f32`).)

Now that you have the best theory for each particular *n*, you can compare how well each of them predicts the data! For example, according to the *n* = 0 coin model with maximum-likelihood parameter *p* = 0.47, the probability of your 500-bit sample is about ... 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007517883433770135.

Uh. The tiny probability makes sense: there's a *lot* of randomness in 500 flips of a biased coin. Even if you know the bias, the probability of any *particular* 500-flip sequence is going to be tiny. But a number that tiny is kind of unwieldy to work with. You'd almost rather just count the zeros and ignore the specific digits afterwards.

But counting the zeros is just taking the logarithm—well, the negative logarithm in the case of zeros after the decimal point. Better make the log base-two—it's *thematic*. Call this measurement the *log loss*. *Now* you can compare different theories to see which order of Markov chain is *the best* theory to "fit" your 500-bit sample ... right?

There's a problem. Higher choices of *n* monotonically achieve a better "fit". You got the idea of higher-order Markov chains because the idea of a biased coin didn't seem adequate to explain the consecutive and alternating runs you saw, but you somehow have trouble believing that the bitstream was generated by a *fifteenth*-order Markov chain with a completely separate probability for the next bit for *each* of the 2^15 = 32,768 prefixes `000000000000000`, `000000000000001`, `000000000000010`, *&c.* Having had the "higher-order Markov chain" idea, are you now obligated to set *n* as large as possible? What would that even *mean*?

In retrospect, the problem should have been obvious from the start. Using your sample data to choose maximum-likelihood parameters, and then using the model with those parameters to "predict" the *same* data puts you in the position of the vaunted "sharpshooter" who paints a target around a clump of bullet holes *after* firing wildly at the broad side of a barn.

Higher values of *n* are like a ... thinner paintbrush?—or a squigglier, more "gerrymandered" painting of a target. Higher-order Markov chains are *strictly* more expressive than lower-order ones: the zeroth-order coin is just a first-order Markov chain where the next-bit-after-`0` and next-bit-after-`1` parameters just happen to be the same; the first-order Markov chain is just a second-order chain where the next-bit-after-`00` and next-bit-after-`10` parameters happen to be the same, as well as the next-bit-after-`01` and—*enough!* You get it!

The broadcast is ongoing; you're not limited to the particular 500-bit sample you've been playing with. If the worry were *just* that the higher-order models will (somehow, you intuit) fail to predict future data, you could use different samples for estimating parameters and validating the resulting models, but you think you're suffering from some more fundamental confusion—one that's probably not limited to Markov chains in particular.

Your working concept of what it means for a theory to "fit" the data, is for it to maximize the probability with which the theory predicts the data. This is an objective, *quantitative* measurement. (Okay, the log loss is taking the negative logarithm of that to avoid so many zeros after the decimal point, but minimizing the log loss and maximizing the probability are both expressing the same preference on theories.) *How* do you know (and your gut says that you *know*) that the higher-order models will do badly on future data, if your objective criterion of model-goodness says they're better? The log loss always "wants" you to choose ever-more-complex models. You asked: what would that even *mean*? But maybe it doesn't have to be a rhetorical question: what *would* that even mean?

Well ... in the limit, you could choose a theory that assigns Probability One to the observed data. The "too many zeros"/"avoid working with really tiny numbers" justification for taking the negative log doesn't really apply here, but for consistency with your earlier results, you dutifully note that the logarithm of 1 is 0 ...

But maybe "too many zeros" isn't the only good motivation for taking the logarithm?

*Intelligence is prediction is compression.* The log loss of a model against the data can be interpreted as the expected number of bits you would need to describe the data, given the optimal code implied by your model.

In order to communicate a reduction in your uncertainty, you need to send a signal—something you can choose to vary in response to the reality of the data. A signal you can vary to take two possible states, can distinguish between two sets among which you've divided the remaining possibilities; writing down a bit means halving your uncertainty.
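(That interpretation can be sketched in a few lines, with a hypothetical `log_loss` helper of mine: summing −log₂ of the probability the model assigned to each observed bit gives the number of bits an optimal code based on the model would spend describing the data.)

```rust
// Log loss as description length: a bit the model called with probability
// p costs -log2(p) bits under the model's optimal code; certainty is free.
fn log_loss(probabilities_assigned: &[f64]) -> f64 {
    probabilities_assigned.iter().map(|p| -p.log2()).sum()
}

fn main() {
    // A fair coin pays exactly 1 bit per flip: 500 flips cost 500 bits.
    assert_eq!(log_loss(&vec![0.5; 500]), 500.0);
    // A theory that predicted everything with Probability One pays nothing.
    assert_eq!(log_loss(&[1.0, 1.0, 1.0]), 0.0);
}
```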

On this interpretation, what the logarithm of Probability One being zero *means* is that if your theory predicted the exact outcome with certainty, then once you stated the theory, you wouldn't have to say anything more in order to describe the data—you would just *know*, with zero further bits.

Once you stated the theory. A theory implies an optimally efficient coding by which further bits can whittle down the space of possibilities to the data that actually happened. More complicated or unlikely data requires more bits just to *specify*—to single out that one outcome amongst the vastness of equally remote alternatives. But *the same thing goes for theories.*

Given a particular precision to which parameters are specified, there are exponentially more Markov chains of higher degrees, which can continue to drive down the log loss—but not faster than their *own* probability decreases. You need exponentially more data just to learn the parameters of a higher-order model. If you don't *have* that much data—enough to pin down the 2^*n* parameters that single out this *particular* higher-order Markov chain amongst the vastness of equally remote alternatives—then your maximum-likelihood best guess is not going to be very good on future data, for the same reason you can't expect to correctly guess that a biased coin has a probability of landing Heads of exactly 0.23 if you've only seen it flipped twice.

If you *do* have enough data to learn a more complex model, but the data was actually generated by a simpler model, then the parameters of the complex model will approximately take the settings that produce the same behavior as the simpler model—like a second-order Markov chain for which the bit-following-`01` parameter happens to take the same value as the bit-following-`11` parameter. And if you're deciding what theory to prefer based on both fit and complexity, the more complex model won't be able to "pay" for its increased complexity with its own predictions.

Now that you know what's going on, you can modify your code to penalize more complex models. Since the parameters in your implementation are 32-bit floats, you assign a complexity cost of 32 ⋅ 2^*n* bits to *n*-th order Markov chains, and look at the sum of fit (log loss) and complexity. Trying out your code again on a larger sample of 10,000 bits from the broadcast——reveals a clear preference for the third-order theory (that for which the fit-plus-complexity score is the lowest), allowing you to enjoy the huge 450-plus–bit leap in compression/prediction from *n* := 2 to 3 and *logically stop there*, the steepness of the ascent into the madness of arbitrary complexity successfully dissuading you from chasing after diminishing returns (which you suspect are only hallucinatory). That's the power packed by parsimony—the sublime simplicity of *Science*.

(Full source code.)
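(The selection rule can be sketched as follows. The log-loss numbers here are made up for illustration, not the post's actual 10,000-bit results; `best_order` is my name for the scoring step.)

```rust
// MDL-flavored model selection: score each order n by complexity
// (32 bits per parameter, 2^n parameters) plus fit (log loss), and keep
// the order whose total description length is smallest.
fn best_order(log_losses: &[f64]) -> usize {
    log_losses
        .iter()
        .enumerate()
        .map(|(n, &fit)| (n, fit + 32.0 * 2f64.powi(n as i32)))
        .min_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(n, _)| n)
        .unwrap()
}

fn main() {
    // Illustrative fits by order: a big leap from n = 2 to n = 3, then
    // diminishing returns that can't pay for 512, 1024, ... more bits
    // of parameters.
    let log_losses = [9000.0, 8100.0, 7500.0, 7000.0, 6990.0, 6985.0];
    assert_eq!(best_order(&log_losses), 3);
}
```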