This is the second post in a series that explores a circle of ideas around entropy and information processing in complex systems. We hope it will form an accessible introduction to  James Crutchfield's Computational Mechanics framework. 

In the previous post we saw that

  • Data is, in general, a mixture of randomly generated components and deterministic hidden structure
  • Kolmogorov Complexity and Shannon Entropy are powerful measures of structure. However, we also encountered some limitations: Kolmogorov complexity overestimates deterministic structure while the Shannon Entropy underestimates deterministic structure.
  • We need a framework that allows us to find and characterize both structured and random parts of our data.

In this post we will show that we can characterize both structure and randomness in a system by looking at how entropy (irreducible uncertainty) changes over increasing scales. This will lead to ways in which we can understand transformer function in language tasks.

Structure is found in how entropy scales

We want something conceptually in between the randomness of Shannon Entropy and the determinism of Kolmogorov complexity. The way to bridge the gap is to realize that systems with deterministic structure at large scales look random from the perspective of entropy at short scales, but look less and less random as we increase the scale. In other words, if we can't see far or remember a lot, then things seem random; as we see farther and remember more, structure is revealed. What's more, the way randomness turns into structure at longer scales tells us about many computational properties of the system. Let's see how this works.

Imagine we come across some mystery data:

...00110100100100110110110100110100100100110110110100100110110100...

Let's assume that we don't know how it was generated.

Notice that the frequencies of 0s and 1s are equal in our string.

So maybe it's just a random string generated by a fair coin? After all, if we compare to the distribution of 0s and 1s of a string generated by a fair coin we get the same results. 

That's a good initial guess, but let's increase our scale and look at the length-2 blocks as well. By length-2 blocks we mean the distribution over 00, 01, 10, and 11 in our data.
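If you want to compute these block distributions yourself, here is a minimal sketch in Python (the helper name block_distribution is ours, and we only have the finite snippet of the mystery string shown above to feed it):

```python
from collections import Counter

def block_distribution(data: str, L: int) -> dict:
    """Empirical frequencies of the length-L blocks in a string of symbols."""
    blocks = [data[i:i + L] for i in range(len(data) - L + 1)]
    counts = Counter(blocks)
    total = sum(counts.values())
    return {block: count / total for block, count in sorted(counts.items())}

snippet = "00110100100100110110110100110100100100110110110100100110110100"
print(block_distribution(snippet, 1))  # roughly half 0s and half 1s
print(block_distribution(snippet, 2))  # 01 and 10 occur more often than 00 and 11
```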

Notice that the 01 block occurs more often than the 00 block, so the statistics of our data are inconsistent with those of a fair coin at this larger scale; our data looks less random at this longer scale than it did at the length-1 block scale. Maybe it's some kind of biased coin? Let's look at how our mystery data compares to that of a 75/25 biased coin at block length 2.

Notice that in the length-2 blocks the biased coin has an asymmetry: the 00 and 11 blocks are not equal in frequency. This will be true of a string generated from flips of any biased coin, but it is not true of our mystery string. Looking at the distribution over block length 3:

In our mystery string, the blocks 111 and 000 never occur! That's impossible for any data generated by coin flips, no matter how biased.
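To see the contrast concretely, we can simulate a 75/25 coin and reuse the block_distribution helper and mystery snippet from the sketch above (the exact frequencies will wobble a bit from sample to sample):

```python
import random

random.seed(0)
biased = "".join("1" if random.random() < 0.75 else "0" for _ in range(100_000))

print(block_distribution(biased, 2))            # 11 far more frequent than 00
print(block_distribution(biased, 3))            # all 8 blocks occur, including 000 and 111
print("000" in block_distribution(snippet, 3))  # False for the mystery data
print("111" in block_distribution(snippet, 3))  # also False
```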

If the data is completely random, then at any block length, all possible blocks will occur. If the data is a completely structured sequence with a certain period, then some blocks will never occur! Technically speaking, when considering the distribution of length-$L$ blocks, random data will have nonzero frequency over all $2^L$ possible blocks, whereas for a data string with period $p$, if we look at the distribution of blocks of length $L \geq p$ there will be exactly $p$ blocks that occur with nonzero frequency.

The main point here is that deterministic sequences are associated with a constant number of occurring blocks as you scale up block length, while random sequences are associated with all $2^L$ possible blocks occurring.
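Here is a quick numerical check of this point, a sketch in which simulated fair coin flips stand in for completely random data (the helper name num_occurring_blocks is ours):

```python
import random

def num_occurring_blocks(data: str, L: int) -> int:
    """Count the distinct length-L blocks that actually appear in the data."""
    return len({data[i:i + L] for i in range(len(data) - L + 1)})

random.seed(0)
coin = "".join(random.choice("01") for _ in range(100_000))  # fair coin flips
periodic = "0110111" * 20_000                                # period-7 repetition

for L in range(1, 9):
    print(L, 2**L, num_occurring_blocks(coin, L), num_occurring_blocks(periodic, L))
# For the coin the count tracks 2^L (given enough data); for the periodic
# string it saturates at the period, 7.
```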

Recall from the previous post how the mystery data was generated: you output 0, then 1, then you flip a fair coin, and then repeat. Our data was a mix of a 'random' component and deterministic hidden structure, just as data in general is.[3] Let's look at the block distributions of the mystery data up to block length 9.

As block length increases, the number of occurring blocks does not stay constant, nor does every possible block occur! The data is a mix of both random components and deterministic structure.
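Assuming the generating process really is "output 0, output 1, flip a fair coin, repeat" as described above, a sketch like the following reproduces this intermediate behavior (the function name mystery_process is ours):

```python
import random

def mystery_process(n_cycles: int, seed: int = 0) -> str:
    """Output 0, then 1, then a fair coin flip, and repeat."""
    rng = random.Random(seed)
    return "".join("01" + rng.choice("01") for _ in range(n_cycles))

data = mystery_process(100_000)
for L in range(1, 10):
    occurring = len({data[i:i + L] for i in range(len(data) - L + 1)})
    print(L, occurring, 2**L)
# The number of occurring blocks keeps growing with L (unlike a purely
# periodic string), but far more slowly than the 2^L of a coin:
# a mix of structure and randomness.
```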

There is also fractal-like behavior. If you focus on the columns of the histograms you will notice a pattern that repeats (at a different scale) every time you increase the block length by 3. This has something to do with renormalization and scaling laws and fractals, but for now let's just leave it as a teaser.

Entropy Rates, Block Entropy Diagrams and Memory

Recall our claim that if we can't see far or remember a lot, then things seem random, and as we see farther and remember more, structure is revealed. What's more, the way randomness turns into structure at longer scales tells us about many computational properties of the system. Here is the mystery string again.

...00110100100100110110110100110100100100110110110100100110110100...

Let's look at how the entropy of these distributions scales as the block length increases.

Block Entropy Diagrams

Recall the basic intuition about entropy: entropy is a canonical measure of 'negative information'; it measures 'irreducible uncertainty'. We want to look at how this measure varies as we consider longer and longer block lengths (and therefore longer time-scales!)

Define the fundamental notion of block entropy:

$$H(L) = -\sum_{x^L \in \mathcal{X}^L} p(x^L) \log_2 p(x^L)$$

Here $\mathcal{X}^L$ is the set of all length-$L$ blocks $x^L$ and $p(x^L)$ denotes the frequency of that block sequence in the data. The block entropy measures the irreducible uncertainty when predicting block sequences of length $L$ in the data set. For length $L = 1$ we recover the naive Shannon entropy defined above. We can plot the block entropy as block length increases.
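In code, a block entropy estimate from data might look like the following sketch, which simply plugs the empirical block frequencies into the formula above (the data lengths and seed are arbitrary choices):

```python
import math
import random
from collections import Counter

def block_entropy(data: str, L: int) -> float:
    """H(L): entropy in bits of the empirical distribution over length-L blocks."""
    counts = Counter(data[i:i + L] for i in range(len(data) - L + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

random.seed(0)
coin = "".join(random.choice("01") for _ in range(50_000))
repeated_01 = "01" * 25_000
for L in range(1, 7):
    print(L, round(block_entropy(coin, L), 2), round(block_entropy(repeated_01, L), 2))
# The coin's block entropy grows by about 1 bit per extra symbol,
# while the repeated 01 string stays flat at 1 bit.
```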

The block entropy diagrams for different data: 01 and 0110111 are length-2 and length-7 sequences that are repeated deterministically.

This is called the block entropy diagram.

Again, we see that the data generated from coin flips gives ever-increasing straight lines, while the two deterministically repeated strings flatten out once the block length reaches the length of the repeated sequence.
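A diagram like this can be reproduced with a few lines of matplotlib, reusing the block_entropy and mystery_process helpers from the sketches above (the labels, lengths, and seed are our choices):

```python
import random
import matplotlib.pyplot as plt

random.seed(0)
datasets = {
    "fair coin": "".join(random.choice("01") for _ in range(100_000)),
    "repeated 01": "01" * 50_000,
    "repeated 0110111": "0110111" * 15_000,
    "mystery process": mystery_process(50_000),
}

Ls = range(1, 11)
for name, data in datasets.items():
    plt.plot(Ls, [block_entropy(data, L) for L in Ls], marker="o", label=name)
plt.xlabel("block length L")
plt.ylabel("block entropy H(L) in bits")
plt.legend()
plt.show()
```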

Next define the block entropy rate at scale $L$ to be

$$h_\mu(L) = H(L) - H(L-1)$$

What does this expression measure? It is often a fruitful tactic to examine the asymptotics of a mathematical function in order to understand it. So let us examine its large-$L$ asymptotic limit.

Entropy rate

The entropy rate $h_\mu = \lim_{L \to \infty} h_\mu(L)$ can be thought of as

  • Irreducible randomness: the randomness that persists even after accounting for all correlations over arbitrarily large blocks of tokens. 
  • The randomness that cannot be “explained away”.
  • Expected uncertainty per symbol. 
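For the mystery process we can watch this limit being approached numerically, again reusing the block_entropy and mystery_process helpers from the earlier sketches:

```python
# Reuses block_entropy and mystery_process from the earlier sketches.
data = mystery_process(200_000)
H = {L: block_entropy(data, L) for L in range(0, 13)}  # H(0) = 0 by convention
for L in range(1, 13):
    print(L, round(H[L] - H[L - 1], 3))
# The differences start near 1 bit and settle down to roughly 1/3 bit per
# symbol, which makes sense: the process makes one fair coin flip for every
# three symbols it emits.
```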

What does the entropy rate look like for real text data? Researchers have tried to estimate the entropy rate of English text. We hope you now have some inkling of the subtlety of this question. For instance: which texts do we look at? In reality we cannot take the large-$L$ limit, so we have to pick some cutoff, et cetera. From a recent paper on the topic:

Information Theory's father, C. E. Shannon showed that the entropy of printed English should be bounded between 0.6 and 1.3 bits/letter over 100-letter long sequences of English text. He used for his calculations a human prediction approach and ignored punctuation, lowercase, and uppercase. Cover and King, using gambling estimates, found a value between 1.25 and 1.35 bits per character. Teahan and Cleary reported a value for the entropy of English of 1.46 bits per character using data compression. Kontoyiannis reported a value of 1.77 bits per character using a method based on string matching, resembling that of universal source coding, based on a one million-character sample from a same author. 

Block Entropy Rate and Intrinsic Memory

Let's get back to the block entropy rate. Recall we defined the block entropy rate at scale $L$ to be

$$h_\mu(L) = H(L) - H(L-1)$$

We can think of $h_\mu(L)$ as the average uncertainty of the next symbol, given that the previous $L-1$ symbols have been observed. Said differently: if you had a memory of size $L-1$, your estimate of the entropy (inherent uncertainty) of the process would be $h_\mu(L)$.

The block entropy rate at length $L$ can be thought of as the upper limit on how good your predictions can ideally be with a memory of length $L-1$. Think of these as estimates of the true entropy rate (taking $L \to \infty$), which is the actual irreducible entropy in the generating process.
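As a sanity check on this interpretation, we can estimate the next-symbol uncertainty directly from contexts of a given memory length. This sketch assumes mystery_process from the earlier sketch is in scope; up to finite-sample boundary effects it agrees with $H(L) - H(L-1)$ when the memory is set to $L-1$:

```python
import math
from collections import Counter, defaultdict

def next_symbol_entropy(data: str, memory: int) -> float:
    """Average uncertainty (bits) of the next symbol given the previous `memory` symbols."""
    contexts = defaultdict(Counter)
    for i in range(memory, len(data)):
        contexts[data[i - memory:i]][data[i]] += 1
    total = len(data) - memory
    h = 0.0
    for counts in contexts.values():
        n = sum(counts.values())
        for c in counts.values():
            h -= (c / total) * math.log2(c / n)
    return h

data = mystery_process(200_000)  # from the earlier sketch
for memory in range(0, 6):
    print(memory, round(next_symbol_entropy(data, memory), 3))
# About 1 bit with no memory, dropping to about 1/3 bit once the memory
# is long enough to pin down the phase of the 0-1-coin cycle.
```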

  1. ^

    Does the block entropy rate at scale $L$ equal the physical memory in the system in some sense? This is fraught. It is better to say that the block entropy rate measures the 'apparent physical memory'. Some of the physical memory may be 'encrypted'.


Comments

I like this post for the way it illustrates how the probability distribution over blocks of strings changes as you increase block length.

Otherwise, I think the representation of other ideas and how they relate to it is not very accurate, and might mislead readers about the consensus among academics.

As an example, strings where the frequency of substrings converges to a uniform distribution are called "normal". The idea that this could be the definition of a random string was a big debate through the first half of the 20th century, as people tried to put probability theory on solid foundations. But you can have a fixed, deterministic program that generates normal strings! And so people generally rejected this idea as the definition of randomness. Algorithmic information theory uses the definition of Martin-Löf randomness, which is that an (infinite) string is random if it can't be compressed by any program (with a bunch of subtleties and distinctions in there).

Yes, that's right. For instance, the decimal expansion of pi is thought to be normal (but iirc this is unproven). The ideas in this post can all be found in academic texts.

Do let us know about any specific text in the post that you think is misleading or false! Thanks 😊

I'm writing this comment without checking the internet, but I don't quite see the analogy to renormalization. In renormalization you might, for example, have a theory about some scale that can be interpreted as making predictions about the next scale, and these theories at different scales all have the same form but maybe different parameters, so you can solve the whole stack at once.

This post doesn't really have theories. I suppose you could loosely interpret frequentism as a theory here, and say that the future will continue with the same distribution as the past, but this theory is terrible for renormalization, because it means that as you change scales you expect pure randomness to dominate. I'd categorize this post more as about phenomenology - maybe "phenomenological scaling" or "phenomenological coarse-graining."

I like these posts! Could you share any pointers or thoughts on how this framework extends to situations where you're not passively observing a stream of information, but your actions affect the incoming bits? 

Relatedly, is there a way to capture the idea of performing maximally informative experiments to extract as much structure from the (anticipated) bit stream as possible?