Message Length

[-]cousin_it5y*300

You lay it out very nicely. But I'd quibble that as long as your nth-order Markov chain isn't exceptionally small and fully deterministic, there might be room for more explanation. Maybe there's no explanation and the data is genuinely random, but what if it's a binary encoded Russian poem? When you've exhausted all self-contained short theories, that doesn't mean the work of science is done. You also need to exhaust all analogies with everything in the world whose complexity is already "paid for", and then look at that in turn, and so on.

[-]Zack_M_Davis5y120

Yeah, but I couldn't get that program to run on my desktop.

[-]Ms. Haze5y41

I agree. I definitely would have run through common encodings before going to Markov Chains.

[-]FireStormOOO5y50

First thing I did before even reading the article is see that it wasn't ASCII or UTF-8 (or at least if it was it wasn't bit-aligned). Definitely on the short list of things technical folks are going to instinctively check, along with maybe common "magic bytes" at the start of the maybe-a-file.

[-]dxu5y212

Since the parameters in your implementation are 32-bit floats, you assign a complexity cost of 32 ⋅ 2^n bits to n-th order Markov chains, and look at the sum of fit (log loss) and complexity.

Something about this feels wrong. The precision of your floats shouldn't be what determines the complexity of your Markov chain; the expressivity of an nth-order Markov chain will almost always be worse than that of a (n+1)th-order Markov chain, even if the latter has access to higher precision floats than the former. Also, in the extreme case where you're working with real numbers, you'd end up with the absurd conclusion that every Markov chain has infinite complexity, which is obviously nonsensical.

This does raise the question of how to assign complexity to Markov chains; it's clearly going to be linear in the number of parameters (and hence exponential in the order of the chain), which means the general form k ⋅ 2^n seems correct... but the value you choose for the coefficient k seems underdetermined.

[-]Zack_M_Davis5y110

You're right! This is something that the literature has details on, that I chose to simplify/gloss-over in pursuit of my goal of "write Actual Executable Code implementing minimum-description-length model selection and then tell a cute story about it."

Chapter 5 of Peter D. Grünwald's The Minimum Description Length Principle gives the complexity term (assuming that the parameter space is compact) as , where $k$ is the number of parameters, $d$ is the bits of percision per parameter, and $L_{N}$ is the codelength function for a universal prior on the integers (the theoretical ideal that Levenshtein coding and the Elias ω coding approach). Basically, in order to specify your parameters, you need to first say how many of them there are and to what precision, and for that, you need a prefix-free code for the integers, which is itself kind of intricate: you can't give every integer the same codelength, so you end up doing something like "first give the order of magnitude (in, um, unary), and then pick out a specific integer in that range", but recursively (so that you first give the order of magnitude of the order of magnitude if it gets too big, &c.).

(Oh, and Grünwald warns that this still isn't optimal and that question of the best encoding is left to §15.3, but I'm not there yet—it's a big textbook, okay?)

I was originally imagining implementing the integer coding, but I decided it was too complicated for the cute story I wanted to tell about model selection, particularly since I wanted to provide Actual Executable Code with minimal boilerplate and minimal dependencies. (Rust just gives me f32 and f64, not variable-length floats.) I think this kind of handwaving should be admissible in the cute-blog-story genre as long as the author addresses objections in the comment section.

[-]rainy5y160

I did a strange experiment once, I implemented a markov predictor like this and used it to predict the next bit of a sequence. Then I extended the sequence with the negation of that bit an increased the length of the markov chain. The result was a very random unpredictable looking sequence, but of course it was completely deterministic. I wonder if this sequence has a name?

[-]Zack_M_Davis5y60

Your childhood dislike of asymmetry led you to invent the Thue–Morse sequence?!

[-]Raemon4y120

I'll probably write up a proper review later. But right now I just wanted to say: the other day, someone sent me a big blob of code over IM. I looked at it, my eyes kinda glazed over, I was confused about why they sent it. I moved on with my day.

Only later did I realize the code was an SVG file that a) my computer program was supposed to automatically read and interpret, b) I totally could have noticed and interpreted and realized I should open it in photoshop.

And then I thought "man, I feel like I'm a character in some Zack Davis fiction, and I totally failed at my job."

[-]derikk4y110

This sounds like a classic example of the bias-variance tradeoff: adding parameters to your model means it can more accurately fit your data (lower bias), but is more sensitive to fluctuations in that data (higher variance). Total error when making predictions on new data is minimized when the bias and variance errors are balanced.

Another example: given data points, you can always draw a polynomial of degree $n - 1$ that fits them perfectly. But the interpolated output values may vary wildly with slight perturbations to your measured data, which is unlikely to represent a real trend. Often a simple linear regression is the most appropriate model.

[-]Donald Hobson5y100

Of course, you also need to store the concept of a Markov chain in the abstract as part of your models. (But that is constant, and should be fairly small in a good encoding. ) On the other hand, 32 bit floats are excessive in precision. And a substantial proportion of floats aren't in the range [0,1] at all. You could probably cut the precision down to 16 bits, maybe less. Of course, you also need a few bits for the order of the markov chain, and the precision used.

[-]habryka5y100

Promoted to curated: This really is a quite good and straightforward explanation of what I think of as one of the really core ideas in theoretical rationality, covering surprisingly much ground in a way that feels approachable. I remember how impactful reading "A technical explanation of a technical explanation" was for my thinking about rationality, and this feels like a much less daunting introduction to some of the same ideas.

[-]Zack_M_Davis5y40

Thanks! But ... I'm not seeing the Curated "★" icon (e.g., in the Posts list on my profile)? Am I missing something, or is there an extra button you still have to click?

[-]jimrandomh5y90

(When we curate something, admins get an email version of the post immediately, then everyone else who's subscribed to curated gets the email after a 20 minute delay. Sometimes we notice a formatting problem with the email, un-curate during the 20 minute window, then re-curate it; that's what happened in this case.)

[-]dynomight5y80

This covers a really impressive range of material -- well done! I just wanted to point out that if someone followed all of this and wanted more, Shannon's 1948 paper is surprisingly readable even today and is probably a nice companion:

http://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf

[-]Gunnar_Zarncke5y60

I liked the previous title - Msg Len - better, but I agree that it was maybe a bit too much ;-)

[-]BeanSprugget5y50

Really interesting post. To me, approaching information with mathematics seems like a black box - and in this post, it feels like magic.

I'm a little confused by the concept of cost: I understand that it takes more data to represent more complex systems, which grows exponentially faster than than the amount of bits. But doesn't the more complex model still strictly fit the data better? - is it just trying to go for a different goal than accuracy? I feel like I'm missing the entire point of the end.

[-]Mart_Korz5y10

I am not sure whether my take on this is correct, so I'd be thankful if someone corrects me if I am wrong:

I think that if the goal was only 'predicting' this bit-sequence after knowing the sequence itself, one could just state probability 1 for the known sequence.

In the OP instead, we regard the bit-sequence as stemming from some sequence-generator, of which only this part of the output is known. Here, we only have limited data such that singling out a highly complex model out of model-space has to be weighed against the models' fit to the bit-sequence.

[-]ic5y30

A recommended reading for understanding more of what happens with "regularization" in optimization and search algorithms.

Interesting also the post comes the same week as a discussion on the Solomonoff prior.

[-]moody4y*20

And that's how you can explain mysteriously frequent consecutive runs and alternations. If the last two bits being 01 (respectively 10) makes it more likely for the next bit to be 0 (respectively 1), and the last two bits being 00 (respectively 11) makes it more likely for the next bit to be 0 (respectively 1), then you would be more likely to see both long 0000... or 1111... consecutive runs and 01010... alternations.

Alternatively, it could be a first order chain that checks the digit located 2 bits before it and makes it more likely to match. This does however remove the possibility of assigning different likelihoods to 00 and 01 being followed by 0 (or 11 and 10 being followed by 1).

Edit: or a coin biased to match the digit 2 entries before, removing the possibility of different likelihoods for chains of 1 and chains of 0 (or alternating chains beginning/ending on 1 or 0).

[-]Mathisco5y20

May I ask why you choose Rust to write math and algorithms? I would have chosen Julia :p

[-]Zack_M_Davis5y60

Realistically?—Python and Rust are the only two languages I'm Really Good at, and I'm more Excited about Rust because I write enough Python at my dayjob? In my defense, the .windows iterator on slices was really handy here and is shamefully missing from Python's itertools.

[-]gilch5y60

Toolz has a sliding window.

Hy's partition can also do a sliding window like Clojure's can, by setting the step argument.

[-]Mathisco5y10

Julia's IterTools has the partition with step argument as well

[-]Mathisco5y10

Rust is a fascinating new language to learn, but not designed for scientific computing. For that Julia is the new kid on the block, and quite easy to pick up if you know Python.

01100110110101011011111100001001110000100011010001101011011010000001010000001010 10100111101000101111010100100101010010101010101000010100110101010011111111010101 01010101011111110101011010101101111101010110110101010100000001101111100000111010 11100000000000001111101010110101010101001010101101010101100111001100001100110101 11111111111111111100011001011010011010101010101100000010101011101101010010110011 11111010111101110100010101010111001111010001101101010101101011000101100000101010 10011001101010101111...

fn maximum_likelihood_estimate(data: &[Bit], degree: usize) -> HashMap<(Vec<Bit>, Bit), f32> { let mut theory = HashMap::with_capacity(2usize.pow(degree as u32)); // Cartesian product—e.g., if degree 2, [00, 01, 10, 11] let patterns = bit_product(degree); for pattern in patterns { let mut zero_continuations = 0; let mut one_continuations = 0; for window in data.windows(degree + 1) { let (prefix, tail) = window.split_at(degree); let next = tail[0]; if prefix == pattern { match next { ZERO => { zero_continuations += 1; } ONE => { one_continuations += 1; } } } } let continuations = zero_continuations + one_continuations; theory.insert( (pattern.clone(), ZERO), zero_continuations as f32 / continuations as f32, ); theory.insert( (pattern.clone(), ONE), one_continuations as f32 / continuations as f32, ); } theory }

fn log_loss(theory: &HashMap<(Vec<Bit>, Bit), f32>, data: &[Bit]) -> f32 { let mut total = 0.; let degree = log2(theory.keys().count()) - 1; for window in data.windows(degree + 1) { let (prefix, tail) = window.split_at(degree); let next = tail[0]; total += -theory .get(&(prefix.to_vec(), next)) .expect("theory should have param value for prefix-and-continuation") .log2(); } total }

for hypothesized_degree in 0..15 { let theory = maximum_likelihood_estimate(&data, hypothesized_degree); println!( "{}th-order theory: fit = {}", hypothesized_degree, log_loss(&theory, &data) ); }

0th-order theory: fit = 498.69882 1th-order theory: fit = 483.86075 2th-order theory: fit = 459.01752 3th-order theory: fit = 438.90198 4th-order theory: fit = 435.9401 5th-order theory: fit = 425.77222 6th-order theory: fit = 404.2693 7th-order theory: fit = 344.68494 8th-order theory: fit = 270.51175 9th-order theory: fit = 199.88765 10th-order theory: fit = 147.10117 11th-order theory: fit = 107.72962 12th-order theory: fit = 79.99724 13th-order theory: fit = 57.16126 14th-order theory: fit = 33.409912

for hypothesized_degree in 0..10 { let theory = maximum_likelihood_estimate(&data, hypothesized_degree); let fit = log_loss(&theory, &data); let complexity = 2f32.powi(hypothesized_degree as i32) * 32.; println!( "{}th-order theory: fit = {}, complexity = {}, total = {}", hypothesized_degree, fit, complexity, fit + complexity ); }

0th-order theory: fit = 9970.838, complexity = 32, total = 10002.838 1th-order theory: fit = 9677.269, complexity = 64, total = 9741.269 2th-order theory: fit = 9111.029, complexity = 128, total = 9239.029 3th-order theory: fit = 8646.953, complexity = 256, total = 8902.953 4th-order theory: fit = 8638.786, complexity = 512, total = 9150.786 5th-order theory: fit = 8627.224, complexity = 1024, total = 9651.224 6th-order theory: fit = 8610.54, complexity = 2048, total = 10658.54 7th-order theory: fit = 8562.568, complexity = 4096, total = 12658.568 8th-order theory: fit = 8470.953, complexity = 8192, total = 16662.953 9th-order theory: fit = 8262.546, complexity = 16384, total = 24646.547

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

134

134

134