Introduction to abstract entropy

[-]So8res3y*363

This piece reads to me like the output of someone who worked hard to comprehend a topic in full, without accepting the rest of society's cloudy bullshit / papered-over-confusion / historical baggage in place of answers. And in a particularly thorny case, no less. And with significant effort made to articulate the comprehension clearly and intuitively to others.

For instance: saying "if we're going to call all of these disparate concepts 'entropy', then let's call the length of the name of a microstate 'entropy' also; this will tie the whole conceptual framework together and make what follows more intuitive" is a bold move, and looks like the product of swallowing the whole topic and then digesting it down into something organized and comprehensible. It strikes me as a unit of conceptual labor.

Respect.

I'm excited to see where this goes.

8Adam Scherlis3y

I almost agree, but I really do stand by my claim that Alex has nicely identified the correct abstract thing and then named the wrong part of it entropy. [EDIT: I now think the abstract thing I describe below -- statistical entropy -- is not the full thing Alex is going for. A more precise claim is: Alex is describing some general thing, and calling part of it "entropy". When I map that thing onto domains like statmech or information theory, his "entropy" doesn't map onto the thing called "entropy" in those domains, even though the things called "entropy" in those domains do map onto each other. This might be because he wants it to map onto "algorithmic entropy" in the K-complexity setting, but I think this doesn't justify the mismatch.] The abstract thing [EDIT: "statistical entropy"] is shaped something like: there are many things (call 'em microstates). Each thing has a "weight", p. (Let's not call it "probability" because that has too much baggage.) We care a lot about the negative log of p. However, in none of the manifestations of this abstract concept is that called "entropy". We also care about the average of -log(p) over every possible microstate, weighted by p. That's called "entropy" in every manifestation of this pattern (if the word is used at all), never "average entropy". I don't see why it helps intuition to give these things the same name, and especially not why you would want to replace the various specific "entropy"s with an abstract "average entropy".

8So8res3y

I'm also unsure whether I would have made Alex's naming choice. (I think he suggested that this naming fits with something he wants to do with K complexity, but I haven't understood that yet, and will wait and see before weighing in myself.) Also, to state the obvious, noticing that the concept wants a short name (if we are to tie a bunch of other things together and organize them properly) feels to me like a unit of conceptual progress regardless of whether I personally like the proposed pun. On a completely different note, one of my personal spicy takes is that when we're working in this domain, we should be working in log base 1/2 (or 1/e or suchlike, namely with a base 0 < b < 1). Which is very natural, because we're counting the number of halvings (of probability / in statespace) that it takes to single out a (cluster of) state(s). This convention dispels a bunch of annoying negative signs. (I also humbly propose the notation ə, pronounced "schwa", for 1/e.) ((In my personal notation I use lug, pronounced /ləg/, for log base ə, and lug2 and etc., but I'm not yet confident that this is a good convention.))
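To make the "counting halvings" reading concrete, here is a minimal Python sketch. The function names `lug2` and `entropy_bits` are just illustrative labels borrowed from the notation above, not an established convention; the only library call is the standard `math.log(x, base)`.

```python
import math

def lug2(p: float) -> float:
    """Log base 1/2 of p: how many halvings take you from 1 down to p."""
    return math.log(p, 0.5)

def entropy_bits(ps) -> float:
    """Expected number of halvings under ps; note there is no leading minus sign."""
    return sum(p * lug2(p) for p in ps if p > 0)

if __name__ == "__main__":
    # A state of probability 1/8 takes three halvings to single out
    # (up to floating-point rounding).
    print(lug2(1 / 8))
    # Weighted average number of halvings for a small distribution.
    print(entropy_bits([0.5, 0.25, 0.25]))
```

The same quantity in the usual convention would be written `-math.log2(p)`; the base-1/2 form simply absorbs the sign into the base.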

4Alex_Altair3y

I didn't mean something I'm doing, I meant that the field of K-complexity just straight-forwardly uses the word "entropy" to refer to it. Let me see if I can dig up some references.

3Adam Scherlis3y

K-complexity is apparently sometimes called "algorithmic entropy" (but not just "entropy", I don't think?) Wiktionary quotes Niels Henrik Gregersen: I think this might be the crux! Note the weird type mismatch: "the statistical entropy of an ensemble [...] the ensemble average of the algorithmic entropy of its members". So my story would be something like the following: 1. Many fields (thermodynamics, statistical mechanics, information theory, probability) use "entropy" to mean something equivalent to "the expectation of -log(p) for a distribution p". Let's call this "statistical entropy", but in practice people call it "entropy". 2. Algorithmic information theorists have an interestingly related but distinct concept, which they sometimes call "algorithmic entropy". Whoops, hang on a sec. Did you want your "abstract entropy" to encompass both of these? If so, I didn't realize that until now! That changes a lot, and I apologize sincerely if waiting for the K-complexity stuff would've dissipated a lot of the confusion. Things I think contributed to my confusion: (1) Your introduction only directly mentions / links to domain-specific types of entropy that are firmly under (type 1) "statistical entropy" (2) This intro post doesn't yet touch on (type 2) algorithmic entropy, and is instead a mix of type-1 and your abstract thing where description length and probability distribution are decoupled. (3) I suspect you were misled by the unpedagogical phrase "entropy of a macrostate" from statmech, and didn't realize that (as used in that field) the distribution involved is determined by the macrostate in a prescribed way (or is the macrostate). I would add a big fat disclaimer that this series is NOT just limited to type-1 entropy, and (unless you disagree with my taxonomy here) emphasize heavily that you're including type-2 entropy.

2Alex_Altair3y

Indeed I definitely do. There are a bunch of places where I think I flagged relevant things, and I'm curious if these seem like enough to you;

* The whole post is called "abstract entropy", which should tell you that it's at least a little different from any "standard" form of entropy
* The third example, "It helps us understand strategies for (and limits on) file compression", is implicitly about K-complexity
* This whole paragraph: "Many people reading this will have some previous facts about entropy stored in their minds, and this can sometimes be disorienting when it's not yet clear how those facts are consistent with what I'm describing. You're welcome to skip ahead to the relevant parts and see if they're re-orienting; otherwise, if you can get through the whole explanation, I hope that it will eventually be addressed!"
* Me being clear that I'm not a domain expert
* Footnote [4], which talks about Turing machines and links to my post on Solomonoff induction
* Me going on and on about binary strings and how we're associating these with individual state -- I dunno, to me this just screams K-complexity to anyone who's heard of it
* "I just defined entropy as a property of specific states, but in many contexts you don't care at all about specific states..."
* ... "I'll talk about this in a future post; I think that "order" is synonymous with Kolmogorov complexity." ...

I struggled with writing the intro section of this post because it felt like there were half a dozen disclaimer-type things that I wanted to get out of the way first. But each one is only relevant to a subset of people, and eventually I need to get to the content. I'm not even expecting most readers to be holding any such type-1/type-2 distinction in their mind to start, so I'd have to go out of my way to explain it before giving the disclaimer. All that aside, I am very open to the idea that we should be calling the single-state thing something different. The "minimum average" form is

1Adam Scherlis3y

I initially interpreted "abstract entropy" as meaning statistical entropy as opposed to thermodynamic or stat-mech or information-theoretic entropy. I think very few people encounter the phrase "algorithmic entropy" enough for it to be salient to them, so most confusion about entropy in different domains is about statistical entropy in physics and info theory. (Maybe this is different for LW readers!) This was reinforced by the introduction because I took the mentions of file compression and assigning binary strings to states to be about (Shannon-style) coding theory, which uses statistical entropy heavily to talk about these same things and is a much bigger part of most CS textbooks/courses. (It uses phrases like "length of a codeword", "expected length of a code [under some distribution]", etc. and then has lots of theorems about statistical entropy being related to expected length of an optimal code.) After getting that pattern going, I had enough momentum to see "Solomonoff", think "sure, it's a probability distribution, presumably he's going to do something statistical-entropy-like with it", and completely missed the statements that you were going to be interpreting K complexity itself as a kind of entropy. I also missed the statement about random variables not being necessary. I suspect this would also happen to many other people who have encountered stat mech and/or information theory, and maybe even K complexity but not the phrase "algorithmic entropy", but I could be wrong. A disclaimer is probably not actually necessary, though, on reflection; I care a lot more about the "minimum average" qualifiers both being included in statistical-entropy contexts. I don't know exactly how to unify this with "algorithmic entropy" but I'll wait and see what you do :)

4MikkW3y

I like the schwa and lug proposals. Trying to anticipate problems, I do suspect newcomers will see 'lug', and find themselves confused, if it has never been explained to them. It even seems possible they may not connect it to logarithms sans explanation.

3Adam Scherlis3y

Agreed! Yeah, shorthand for this seems handy. I like these a lot, especially schwa, although I'm a little worried about ambiguous handwriting. My contest entry is nl (for "negative logarithm" or "ln but flipped").

5Alex_Altair3y

Omfg, I love hearing your spicy takes. (I think I remember you advocating hard tabs, and trinary logic.) XD XD guys I literally can't

3Alex_Altair3y

This aside raises concerns for me, like it makes me worry that maybe we're more deeply not on the same page. It seems to me like the weighting is just straight-forward probability, and that it's important to call it that.

1Adam Scherlis3y

I think I was overzealous with this aside and regret it. I worry that the word "probability" has connotations that are too strong or are misleading for some use cases of abstract entropy. But this is definitely probability in the mathematical sense, yes. Maybe I wish mathematical "probability" had a name with weaker connotations.

3Alex_Altair3y

Extremely pleased with this reception! I indeed feel pretty seen by it.

[-]Alex_Altair2y160Review for 2022 Review

[This is a self-review because I see that no one has left a review to move it into the next phase. So8res's comment would also make a great review.]

I'm pretty proud of this post for the level of craftsmanship I was able to put into it. I think it embodies multiple rationalist virtues. It's a kind of "timeless" content, and is a central example of the kind of content people want to see on LW that isn't stuff about AI.

It would also look great printed in a book. :)

2habryka2y

For this part of the review you only need positive or negative votes, not reviews!

[-]Adam Scherlis3y*1312

I haven't read all of this yet. I like it so far. One nitpick: I would really try to avoid referring to individual microstates as having "entropy" assigned to them. I would call $-\log(p_i)$ (or things playing a similar role) something else like "surprisal" or "information", and reserve entropy (rather than "average entropy") for things that look like $E_i[\mathrm{surprisal}(i)]$ or $\sum_i p_i \, \mathrm{surprisal}(i)$.

Of course, for macrostates/distributions with uniform probability, this works out to be equal to $\mathrm{surprisal}(i)$ for every state in the macrostate, but I think the conceptual distinction is important.

(I'm as guilty as anyone of calling simple microstates "low-entropy", but I think it's healthier to reserve that for macrostates or distributions.)
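A small sketch of the distinction, in Python, in case it helps: `surprisal` is a function of one state's probability, while `entropy` is a function of the whole distribution. The two example distributions are made up for illustration.

```python
import math

def surprisal(p_i: float) -> float:
    """-log2 of a single state's probability (a per-state quantity)."""
    return -math.log2(p_i)

def entropy(p) -> float:
    """Expectation of surprisal under p (a per-distribution quantity)."""
    return sum(p_i * surprisal(p_i) for p_i in p if p_i > 0)

uniform = [1 / 8] * 8                  # illustrative uniform macrostate
skewed = [0.5, 0.25, 0.125, 0.125]     # illustrative non-uniform distribution

if __name__ == "__main__":
    # Uniform case: every state's surprisal coincides with the entropy.
    print([surprisal(x) for x in uniform], entropy(uniform))
    # Non-uniform case: surprisals differ from state to state,
    # and the entropy is their p-weighted average.
    print([surprisal(x) for x in skewed], entropy(skewed))
```

In the uniform case the two numbers agree for every state, which is presumably why the shorthand "low-entropy microstate" is so tempting.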

4Alex_Altair3y

That's a reasonable stance, but one of the main messages of the sequence is that we can start with the concept of individual states having entropy assigned to them, and derive everything else from there! This is especially relevant to the idea of using Kolmogorov complexity as entropy. Calling it "surprisal" or "information" has an information-theoretic connotation to it that I think doesn't apply in all contexts.

4Adam Scherlis3y

I'm fine with choosing some other name, but I think all of the different "entropies" (in stat mech, information theory, etc) refer to weighted averages over a set of states, whose probability-or-whatever adds up to 1. To me that suggests that this should also be true of the abstract version. So I stand by the claim that the negative logarithm of probability-or-whatever should have some different name, so that people don't get confused by the ([other thing], entropy) → (entropy, average entropy) terminology switch. I think "average entropy" is also (slightly) misleading because it suggests that the -log(p)'s of individual states are independent of the choice of which microstates are in your macrostate, which I think is maybe the root problem I have with footnote 17. (See new comment in that subthread)

2Alex_Altair3y

Part of what confuses me about your objection is that it seems like averages of things can usually be treated the same as the individual things. E.g. an average number of apples is a number of apples, and average height is a height ("Bob is taller than Alice" is treated the same as "men are taller than women"). The sky is blue, by which we mean that the average photon frequency is in the range defined as blue; we also just say "a blue photon". A possible counter-example I can think of is temperature. Temperature is the average [something like] kinetic energy of the molecules, and we don't tend to think of it as kinetic energy. It seems to be somehow transmuted in nature through its averaging. But entropy doesn't feel like this to me. I feel comfortable saying "the entropy of a binomial distribution", and throughout the sequence I'm clear about the "average entropy" thing just to remind the reader where it comes from.

3Adam Scherlis3y

I think it's different because entropy is an expectation of a thing which depends on the probability distribution that you're using to weight things. Like, other things are maybe... A is the number of apples, sum of p×A is the expected number of apples under distribution p, sum of q×A is the expected number of apples under distribution q. But entropy is... -log(p) is a thing, and sum of p × -log(p) is the entropy. And the sum of q × -log(p) is... not entropy! (It's "cross-entropy")
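Here is the same point as a quick Python sketch. The distributions `p` and `q` and the apple counts are invented for illustration; the only thing it is meant to show is that the "values" being averaged are fixed in the apples case but depend on the weighting distribution in the entropy case.

```python
import math

def expectation(weights, values):
    """Sum of weight × value over all outcomes."""
    return sum(w * v for w, v in zip(weights, values))

def neg_log(dist):
    """Per-outcome -log2 of a distribution."""
    return [-math.log2(x) for x in dist]

p = [0.5, 0.25, 0.25]   # illustrative distribution
q = [0.25, 0.25, 0.5]   # a different illustrative distribution
apples = [1, 2, 3]      # A: a quantity that does not depend on p or q

expected_apples_p = expectation(p, apples)    # expected apples under p
expected_apples_q = expectation(q, apples)    # expected apples under q

entropy_p = expectation(p, neg_log(p))        # sum of p × -log(p): entropy of p
cross_q_p = expectation(q, neg_log(p))        # sum of q × -log(p): cross-entropy, not entropy
entropy_q = expectation(q, neg_log(q))        # entropy of q needs -log(q), not -log(p)

if __name__ == "__main__":
    print(expected_apples_p, expected_apples_q)
    print(entropy_p, cross_q_p, entropy_q)
```

Swapping the weights from `p` to `q` leaves `apples` untouched but forces `neg_log` to be recomputed if the result is still to be an entropy.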

5Alex_Altair3y

That makes sense. In my post I'm saying that entropy is whatever binary string assignment you want, which does not depend on the probability distribution you're using to weight things. And then if you want the minimum average string length, it becomes in terms of the probability distribution.

3Adam Scherlis3y

Ah, I missed this on a first skim and only got it recently, so some of my comments are probably missing this context in important ways. Sorry, that's on me.

2Alex_Altair3y

One thing I'm not very confident about is how working scientists use the concept of "macrostate". If I had good resources for that I might change some of how the sequence is written, because I don't want to create any confusion for people who use this sequence to learn and then go on to work in a related field. (...That said, it's not like people aren't already confused. I kind of expect most working scientists to be confused about entropy outside their exact domain's use.)

1Adam Scherlis3y

I think it might be a bit of a mess, tbh. In probability theory, you have outcomes (individual possibilities), events (sets of possibilities), and distributions (assignments of probabilities to all possible outcomes). "microstate": outcome. "macrostate": sorta ambiguous between event and distribution. "entropy of an outcome": not a thing working scientists or mathematicians say, ever, as far as I know. "entropy of an event": not a thing either. "entropy of a distribution": that's a thing! "entropy of a macrostate": people say this, so they must mean a distribution when they are saying this phrase. I think you're within your rights to use "macrostate" in any reasonable way that you like. My beef is entirely about the type signature of "entropy" with regard to distributions and events/outcomes.

2Alex_Altair3y

Here's another thing that might be adding to our confusion. It just so happens that in the particular system that is this universe, all states with the same total energy are equally likely. That's not true for most systems (which don't even have a concept of energy), and so it doesn't seem like a part of abstract entropy to me. So e.g. macrostates don't necessarily contain microstates of equal probability (which I think you've implied a couple times).

1Adam Scherlis3y

Honestly, I'm confused about this now. I thought I recalled that "macrostate" was only used for the "microcanonical ensemble" (fancy phrase for a uniform-over-all-microstates-with-same-(E,N,V) probability distribution), but in fact it's a little ambiguous. Wikipedia says which implies microcanonical ensemble (the others are parametrized by things other than (E, N, V) triples), but then later it talks about both the canonical and microcanonical ensembles. I think a lot of our confusion comes from the way physicists equivocate between macrostates as a set of microstates (with the probability distribution unspecified) and as a probability distribution. Wiki's "definition" is ambiguous: a particular (E, N, V) triple specifies both a set of microstates (with those values) and a distribution (uniform over that set). In contrast, the canonical ensemble is a probability distribution defined by a triple (T,N,V), with each microstate having probability proportional to exp(- E / kT) if it has particle number N and volume V, otherwise probability zero. I'm not sure what "a macrostate specified by (T,N,V)" should mean here: either the set of microstates with (N, V) (and any E), or the non-uniform distribution I just described. (By the way: note that when T is being used here, it doesn't mean the average energy, kinetic or otherwise. kT isn't the actual energy of anything, it's just the slope of the exponential decay of probability with respect to energy. A consequence of this definition is that the expected kinetic energy in some contexts is proportional to temperature, but this expectation is for a probability distribution over many microstates that may have more or less kinetic energy than that. Another consequence is that for large systems, the average kinetic energy of particles in the actual true microstate is very likely to be very close to (some multiple of) kT, but this is because of the law of large numbers and is not true for small systems. Note that there's two diffe

2Alex_Altair3y

I'm not quite sure what the cruxes of our disagreement are yet. So I'm going to write up some more of how I'm thinking about things, which I think might be relevant. When we decide to model a system and assign its states entropy, there's a question of what set of states we're including. Often, we're modelling part of the real universe. The real universe is in only one state at any given time. But we're ignorant of a bunch of parts of it (and we're also ignorant about exactly what states it will evolve into over time). So to do some analysis, we decide on some stuff we do know about its state, and then we decide to include all states compatible with that information. But this is all just epistemic. There's no one true set that encompasses all possible states; there's just states that we're considering possible. And then there's the concept of a macrostate. Maybe we use the word macrostate to refer to the set of all states that we've decided are possible. But then maybe we decide to make an observation about the system, one that will reduce the number of possible states consistent with all our observations. Before we make the observation, I think it's reasonable to say that for every possible outcome of the observation, there's a macrostate consistent with that outcome. The probability that we will find the system to be in that macrostate is the sum of the probability of its microstates. Thus the macrostate has p<1 before the observation, and p=1 after the observation. This feels pretty normal to me. We can do this for any property that we can observe, and that's why I defined a macrostate as, "collections of microstates ... connotively characterized by a generalized property of the state". I also don't see why it couldn't be a set containing a single state; a set of one thing is still a set. Whether that one thing has probability 1 or not depends on what you're deciding to do with your uncertainty model.

1Adam Scherlis3y

I think the crux of our disagreement [edit: one of our disagreements] is whether the macrostate we're discussing can be chosen independently of the "uncertainty model" at all. When physicists talk about "the entropy of a macrostate", they always mean something of the form:

* There are a bunch of p's that add up to 1. We want the sum of p × (-log p) over all p's. [EXPECTATION of -log p aka ENTROPY of the distribution]

They never mean something of the form:

* There are a bunch of p's that add up to 1. We want the sum of p × (-log p) over just some of the p's. [???]

Or:

* There are a bunch of p's that add up to 1. We want the sum of p × (-log p) over just some of the p's, divided by the sum of p over the same p's. [CONDITIONAL EXPECTATION of -log p given some event]

Or:

* There are a bunch of p's that add up to 1. We want the sum of (-log p) over just some of the p's, divided by the number of p's we included. [ARITHMETIC MEAN of -log p over some event]

This also applies to information theorists talking about Shannon entropy. I think that's the basic crux here. This is perhaps confusing because "macrostate" is often claimed to have something to do with a subset of the microstates. So you might be forgiven for thinking "entropy of a macrostate" in statmech means:

* For some arbitrary distribution p, consider a separately-chosen "macrostate" A (a set of outcomes). Compute the sum of p × (-log p) over every p whose corresponding outcome is in A, maybe divided by the total probability of A or something.

But in fact this is not what is meant! Instead, "entropy of a macrostate" means the following:

* For some "macrostate", whatever the hell that means, we construct a probability distribution p. Maybe that's the macrostate itself, maybe it's a distribution corresponding to the macrostate, usage varies. But the macrostate determines the distribution, either way. Compute the sum of p × (-log p) over every p.

EDIT: all of this applies even more to n
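For concreteness, the four candidate quantities listed in this comment can be computed side by side. This is only a sketch in Python: the distribution `p` and the event `A` are made-up examples, and the function names are my own labels for the bracketed descriptions.

```python
import math

def entropy(p):
    """Sum of p × (-log p) over ALL outcomes: expectation of -log p."""
    return sum(pi * -math.log2(pi) for pi in p if pi > 0)

def partial_sum(p, event):
    """Sum of p × (-log p) over just the outcomes in `event`."""
    return sum(p[i] * -math.log2(p[i]) for i in event)

def conditional_expectation(p, event):
    """Partial sum divided by the total probability of `event`."""
    return partial_sum(p, event) / sum(p[i] for i in event)

def arithmetic_mean(p, event):
    """Unweighted mean of -log p over the outcomes in `event`."""
    return sum(-math.log2(p[i]) for i in event) / len(event)

p = [0.5, 0.25, 0.125, 0.125]   # illustrative distribution over four outcomes
A = [1, 2]                      # illustrative event: indices of two outcomes

if __name__ == "__main__":
    print(entropy(p))
    print(partial_sum(p, A))
    print(conditional_expectation(p, A))
    print(arithmetic_mean(p, A))
```

For this example all four numbers differ, so which one "entropy of a macrostate" refers to is not a cosmetic choice.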

2Alex_Altair3y

I think it's also important for my definition of optimization (coming later), because individual microstates do deserve to be assigned a specific level of optimization.

3Alex_Altair3y

Just mulling over other names, I think "description length" is the one I like best so far. Then "entropy" would be defined as minimum average description length.

4Adam Scherlis3y

I like "description length". One wrinkle is that entropy isn't quite minimum average description length -- in general it's a lower bound on average description length. If you have a probability distribution that's (2/3, 1/3) over two things, but you assign fixed binary strings to each of the two, then you can't do better than 1 bit of average description length, but the entropy of the distribution is 0.92 bits. Or if your distribution is roughly (.1135, .1135, .7729) over three things, then you can't do better than 1.23 bits, but the entropy is 1 bit. You can only hit the entropy exactly when the probabilities are all powers of 2. (You can fix this a bit in the channel-coding context, where you're encoding sequences of things and don't have to assign fixed descriptions to individual things. In particular, you can assign descriptions to blocks of N things, which lets you get arbitrarily close as N -> infinity.)
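The numbers above can be checked with a short Python sketch. It assumes "fixed binary strings" means a prefix-free code, for which Huffman's algorithm gives optimal codeword lengths; the helper names are illustrative.

```python
import heapq
import math

def entropy_bits(ps):
    """Shannon entropy of ps in bits."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def huffman_lengths(ps):
    """Codeword lengths of an optimal prefix-free binary code (Huffman)."""
    # Heap entries: (subtree probability, unique tie-breaker, symbol indices in subtree)
    heap = [(p, i, [i]) for i, p in enumerate(ps)]
    heapq.heapify(heap)
    lengths = [0] * len(ps)
    tie = len(ps)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol inside them.
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths

def average_length(ps, lengths):
    """Expected codeword length under ps."""
    return sum(p * n for p, n in zip(ps, lengths))

if __name__ == "__main__":
    for ps in ([2 / 3, 1 / 3], [0.1135, 0.1135, 0.7729], [0.5, 0.25, 0.25]):
        lengths = huffman_lengths(ps)
        print(ps, lengths, average_length(ps, lengths), entropy_bits(ps))
```

The third distribution is an extra case I added where every probability is a power of 1/2, so the average length and the entropy coincide; in the first two there is a gap, as described above.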

2So8res3y

I think you can bring the two notions into harmony by allowing multiple codes per state (with the entropy/description-length of a state being the lug/nl of the fraction of the codespace that codes for that state). For instance, you can think of a prefix-free code as a particularly well-behaved many-to-one assignment of infinite bitstrings to states, with (e.g.) the prefix-free code "0" corresponding to every infinite bitstring that starts with 0 (which is half of all infinite bitstrings, under the uniform measure). If we consider all many-to-one assignments of infinite bitstrings to states (rather than just the special case of prefix-free codes) then there'll always be an encoding that matches the entropy, without needing to say stuff like "well our description-length can get closer to the theoretical lower-bound as we imagine sending more and more blocks of independent data and taking the average per-block length". (If you want to keep the codespace finite, we can also see the entropy as the limit of how well we can do as we allow the codespace to increase in size.) (I suspect that I can also often (always?) match the entropy if you let me design custom codespaces, where I can say stuff like "first we have a bit, and then depending on whether it's 0 or 1, we follow it up by either a trit or a quadit".) (epistemic status: running off of a cache that doesn't apply cleanly, but it smells right \shrug)
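A rough Python sketch of the "fraction of the codespace" idea, under the assumption that the codewords for a state are prefix-free, so a codeword of length n covers a fraction 2^-n of all infinite bitstrings. The function name and the specific codeword lengths are hypothetical illustrations.

```python
import math

def lug2_sum_exp(lengths):
    """-log2 of the total codespace fraction covered by codewords of these lengths."""
    return -math.log2(sum(2.0 ** -n for n in lengths))

if __name__ == "__main__":
    # One codeword "0": half of all infinite bitstrings, i.e. 1 bit.
    print(lug2_sum_exp([1]))
    # Two codewords of length 2 (say "10" and "11") together also cover half.
    print(lug2_sum_exp([2, 2]))
    # Codewords of lengths 2, 4, 6, ... (say "01", "0001", "000001", ...)
    # cover 1/4 + 1/16 + ... which approaches 1/3, so the effective length
    # approaches log2(3), a value no single codeword length can hit.
    print(lug2_sum_exp(range(2, 60, 2)), math.log2(3))
    # Many long codewords can add up to a short effective length:
    # 2**20 codewords of length 30 behave like one codeword of length 10.
    print(-math.log2((2 ** 20) * 2.0 ** -30))
```

The third case is the interesting one: a state with probability 1/3 gets an effective description length equal to its -log2 probability without any block coding.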

3Adam Scherlis3y

Sure, from one perspective what's going on here is that we're being given a distribution p and asked to come up with a distribution q such that CrossEntropy(p, q) = E_p[-log q] is as small as possible. And then a bit of calculus shows that q=p is optimal, with a minimal value of Entropy(p) = CrossEntropy(p, p). If we're happy to call -log q "description length" right off the bat, we can let q be a distribution over the set of infinite bit strings, or the set of finite simple graphs, or over any (infinite) set we like. But some settings are special, such as "q has to be the coin-flip distribution over a prefix-free code", because in those settings our quantity -log q is forced to equal the length of something in the normal sense of "length of something". So the gap I'm interested in closing is between things that have actual lengths and things that are exactly equal to entropy, and the block encoding thing is the simplest way I know to do that. I think using the coin-flip distribution over infinite strings is nice because it hits entropy exactly and has a clear relationship with the prefix-free-code case, but the block code motivates "length" better in isolation.
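The "q=p is optimal" claim can be sanity-checked numerically instead of with calculus. This Python sketch does a crude grid search over two-outcome distributions; the grid resolution and the example `p` are arbitrary choices for illustration.

```python
import math

def cross_entropy(p, q):
    """E_p[-log2 q]: expected description length when lengths come from q but outcomes from p."""
    return sum(pi * -math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [2 / 3, 1 / 3]

# Candidate q's on a grid of step 0.01 (endpoints excluded to avoid log(0)).
candidates = [[k / 100, 1 - k / 100] for k in range(1, 100)]
best_q = min(candidates, key=lambda q: cross_entropy(p, q))

entropy_p = cross_entropy(p, p)                 # CrossEntropy(p, p) = Entropy(p)
coin_flip = cross_entropy(p, [0.5, 0.5])        # q implied by the 1-bit codewords "0" and "1"

if __name__ == "__main__":
    print(best_q)          # the grid point nearest to p
    print(entropy_p)       # about 0.92 bits
    print(coin_flip)       # the actual code length, 1 bit
```

The gap between `coin_flip` and `entropy_p` is the one discussed above: the coin-flip distribution over a real prefix-free code has honest lengths, but it is not equal to p unless p is dyadic.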

2So8res3y

What's your take on using "description length" for the length of a single description of a state, and "entropy" for the log-sum-exp of the description-lengths of all names for the state? (Or, well, ləg-sum-əxp, if you wanna avoid a buncha negations.) I like it in part because the ləg-sum-əxp of all description-lengths seems to me like a better concept than K-complexity anyway. (They'll often be similar, b/c ləg-sum-əxp is kinda softminish and the gap between description-lengths is often long, but when they differ it's the ləg-sum-əxp'd thing that you usually want.) For example, Solomonoff induction does not have the highest share of probability-mass on the lowest K-complexity hypothesis among those consistent with the data. It has the highest share of probability-mass on the hypothesis with lowest ləg-sum-əxp of all description-lengths among those consistent with the data. This can matter sometimes. For instance, in physics we can't always fix the gauge. Which means that any particular full description of physics needs to choose a full-fledged gauge, which spends an enormous amount of description-length. But this doesn't count against physics, b/c for every possible gauge we could describe, there's (same-length) ways of filling out the rest of the program such that it gives the right predictions. In the version of Solomonoff induction where hypotheses are deterministic programs, physics does not correspond to a short program, it corresponds to an enormous number of long programs. With the number so enormous that the ləg-sum-əxp of all those big lengths is small. More generally, this is related to the way that symmetry makes things simpler. If your code has a symmetry in it, that doesn't make your program any shorter, but it does make the function/hypothesis your program represents simpler, not in terms of K-complexity but in terms of "entropy" (b/c, if S is the symmetry group, then there's |S|-many programs of the ~same length that represent it, which decreases

3Adam Scherlis3y

I still don't like that, because this whole subthread is kind of orthogonal to my concerns about the word "entropy". This subthread is mostly about resolving the differences between a code (assignment of one or more codewords to one or more states) and a probability distribution. I think we've made progress on that and your latest comment is useful on that front. But my concerns about "entropy" are of the form: "I notice that there's a whole field of coding theory where 'entropy' means a particular function of a probability distribution, rather than a function of an individual state. This is also consistent with how physicists and other kinds of computer scientists use the word, except for the phrase 'algorithmic entropy'. I think we should not break compatibility with this usage." Ignoring the differences between distributions and codes, I'd be fine with assigning "entropy" to various things shaped like either "sum of p(state) lug(p(state)) across all states" for a distribution or "sum of (2^-len(state)) len(state) across all states" for a code. I am also fine with assigning it to "sum of p(state) len(state)" for a (distribution, code) pair that are matched in an appropriate sense -- the distribution is the coinflip distribution for the code, or the code is optimal for the distribution, or something else roughly equivalent. Elsewhere Alex and I have been referring to this as a pair of qualifiers "average" (i.e. it's a sum over all states weighted by p or 2^-len) and "minimal" (i.e. the two factors in the sum are for matching or identical codes/distributions). "Average" distinguishes entropy from the things information theorists call "length" or "self-information" or "surprisal" or just "[negative] log-prob", and "minimal" distinguishes entropy from "expected length [of an arbitrary code]" or "cross-entropy".

2So8res3y

Cool thanks. I'm hearing you as saying "I want to reserve 'entropy' for the case where we're weighting the length-like thingies by probability-like thingies", which seems reasonable to me. I'm not sure I follow the part about matched (distribution, code) pairs. To check my understanding: for a sufficiently forgiving notion of "matching", this is basically going to yield the cross-entropy, right? Where, IIUC, we've lifted the code to a distribution in some natural way (essentially using a uniform distribution, though there might be a translation step like translating prefix-free codes to sets of infinite bitstrings), and then once we have two distributions we take the cross-entropy. (One of my hypotheses for what you're saying is "when the distribution and the code are both clear from context, we can shorten 'cross-entropy' to 'entropy'". Which, ftr, seems reasonable to me.) My own proclivities would say: if I specify only a state and a code, then the state lifts to a distribution by Kronecker's delta and the code lifts to a distribution uniformly, and I arbitrarily declare these to 'match', and so when we speak of the (cross-)entropy of a state given a code we mean the length of the code(s) for that particular state (combined by ləg-sum-əxp if there's multiple). This seems like the natural way to 'match' a state and a code, to my eye. But I acknowledge that what counts as 'matching' is a matter of intuition and convention, and that others' may differ from mine. At this point, though, the outcome I'm most invested in is emerging with a short name for "the ləg-sum-əxp of the lengths of all the descriptions". I'm fine with naming it some variation on 'complexity', though. (Kolmogorov kindly left a K in K-complexity, so there's ample room to pick another letter if we have to.) (Though to be very explicit about my personal preferences, I'd use "entropy". It seems to me that once we've conceded that we can talk about the entropy of a (distribution, code) pair then we

3Adam Scherlis3y

I want something much more demanding -- I want the distribution and code to be "the same" (related by p = 2^-len), or something "as close as possible" to that. I was leaving a little bit of wiggle room to possibly include "a code matches a distribution if it is the optimal code of its type for compression under that source distribution", but this is only supposed to allow rounding errors; it seems sort of okay to say that the expected length of (0, 10, 11) under the distribution (0.4, 0.3, 0.3) is some (not quite standard) sort of entropy for that distribution, but not okay to say that the expected length of (0, 10, 11) under (0., 0., 1.) is an entropy. But I'm on the fence about even giving that much wiggle room. That's the only reason I exclude single states. I agree that the length of a state is a kind of cross-entropy, because you can choose a delta distribution, but I draw a firm line between cross-entropy and entropy. (Obviously there's a special case, where a code that has a single empty codeword for a single state matches a delta distribution. But not if the codeword isn't the empty string.)
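To put numbers on the two examples above, here is a minimal sketch. The helper names are made up for illustration, and lengths are measured in bits. It compares the expected length of the code (0, 10, 11) with the Shannon entropy of each distribution, and shows the distribution the code matches in the strict p = 2^-len sense.

```python
import math

def expected_length(probs, codewords):
    """Average codeword length in bits, weighted by probs."""
    return sum(p * len(c) for p, c in zip(probs, codewords))

def shannon_entropy(probs):
    """Shannon entropy in bits, using the convention 0 * log(0) = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def implied_distribution(codewords):
    """The distribution strictly matched to a code: p = 2^-len."""
    return [2.0 ** -len(c) for c in codewords]

code = ["0", "10", "11"]
near_match = [0.4, 0.3, 0.3]
delta = [0.0, 0.0, 1.0]

print(implied_distribution(code))         # [0.5, 0.25, 0.25]
print(expected_length(near_match, code))  # about 1.6 bits
print(shannon_entropy(near_match))        # about 1.571 bits
print(expected_length(delta, code))       # 2.0 bits
print(shannon_entropy(delta))             # 0 bits
```

For (0.4, 0.3, 0.3) the gap between expected length and entropy is a few hundredths of a bit, which is the "rounding error" case. For the delta distribution the gap is the full two bits.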

3Adam Scherlis3y

I wonder if it would be reasonable to use "xentropy" for the broad sense of "entropy" in OP, with the understanding that xentropy is always a two-argument function. "The length of a codeword is the xentropy between [the delta distribution located at] the state and [the coinflip distribution implied by] the code"
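A quick sketch of that two-argument reading, assuming a hypothetical helper `xentropy` and a toy three-state code with one codeword per state (both are illustrative choices, not anything defined in the post):

```python
import math

def xentropy(p, q):
    """Cross-entropy of q under p in bits: sum over x of p(x) * (-log2 q(x))."""
    return sum(px * -math.log2(q[x]) for x, px in p.items() if px > 0)

# Toy code with one codeword per state (an illustrative assumption).
code = {"a": "0", "b": "10", "c": "11"}

# Coinflip distribution implied by the code: q(x) = 2^-len(codeword).
q = {state: 2.0 ** -len(word) for state, word in code.items()}

# Delta distribution located at state "b".
delta_b = {"a": 0.0, "b": 1.0, "c": 0.0}

print(xentropy(delta_b, q))  # 2.0, the length of the codeword "10"
print(xentropy(q, q))        # 1.5, the ordinary entropy of q
```

Passing the same distribution in both slots recovers the one-argument entropy, which is the sense in which the two-argument function is the broader one.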

2So8res3y

Cool cool. I can personally see the appeal of reserving 'entropy' for the case where the distribution and the (natural lifting of) the code (to a distribution) are identical, i.e. your proposal without the wiggle-room. I don't yet personally see a boundary between the wiggle-room you're considering and the full-on "we can say 'entropy' as a shorthand for 'cross-entropy' when the second distribution is clear from context" proposal. In particular, I currently suspect that there's enough wiggle-room in "optimal code of its type for compression under the source distribution" to drive a truck through. Like, if we start out with a uniform code C and a state s, why not say that the "type of codes" for the source distribution δ(s) is the powerset of {c ∈ C | c codes for s}? In which case the "optimal code for compression" is the set of all such c, and the 'entropy' is the Nate!complexity? I'm not yet sure whether our different aesthetics here are due to: 1. me failing to see a natural boundary that you're pointing to 2. you not having yet seen how slippery the slope is 3. you having a higher tolerance for saying "humans sometimes just wanna put fences halfway down the slippery slope, dude". Insofar as you think I'm making the mistake of (1), I'm interested to hear arguments. My argument above is ofc tuned to case (2), and it's plausible to me that it pushes you off the fence towards "no wiggle room". Another place we might aesthetically differ is that I'm much happier blurring the line between entropy and cross-entropy. One handwavy argument for blurring the line (which has the epistemic status: regurgitating from a related cache that doesn't cleanly apply) is that if the statespace is uncountably infinite then we need a measure in order to talk about entropy (and make everything work out nicely under change-of-variables). And so in the general case, entropy is already a two-place predicate function involving a distribution and some sort of measure. (...Although my ca

9Adam Scherlis3y

Yup, I think I am happy to abandon the wiggle room at this point, for this reason. I think my preferred approach to this is that the density p(x) is not really the fundamental object, and should be thought of as dP/dmu(x), with the measure in the denominator. We multiply by dmu(x) in the integral for entropy in order to remove this dependence on mu that we accidentally introduced. EDIT: this is flagrantly wrong because log(p) depends on the measure also. You're right that this is really a function of the distribution and the measure; I'm not sure offhand if it's crossentropy, either, but I'm going to think about this more. (This is an embarrassing mistake because I already knew differential entropy was cursed with dependence on a measure -- quantum mechanics famously provides the measure on phase-space that classical statistical mechanics took as axiomatic.) For what it's worth, I've heard the take "entropy and differential entropy are different sorts of things" several times; I might be coming around to that, now that I see another slippery slope on the horizon.

4Alex_Altair3y

I'd be interested in a citation of what you're referring to here!

5Adam Scherlis3y

The state-space (for particles) in statmech is the space of possible positions and momenta for all particles. The measure that's used is uniform over each coordinate of position and momentum, for each particle. This is pretty obvious and natural, but not forced on us, and: 1. You get different, incorrect predictions about thermodynamics (!) if you use a different measure. 2. The level of coarse graining is unknown, so every quantity of entropy has an extra "+ log(# microstates per unit measure)" which is an unknown additive constant. (I think this is separate from the relationship between bits and J/K, which is a multiplicative constant for entropy -- k_B -- and doesn't rely on QM afaik.) On the other hand, Liouville's theorem gives some pretty strong justification for using this measure, alleviating (1) somewhat: https://en.wikipedia.org/wiki/Liouville%27s_theorem_(Hamiltonian) In quantum mechanics, you have discrete energy eigenstates (...in a bound system, there are technicalities here...) and you can define a microstate to be an energy eigenstate, which lets you just count things and not worry about measure. This solves both problems: 1. Counting microstates and taking the classical limit gives the "dx dp" (aka "dq dp") measure, ruling out any other measure. 2. It tells you how big your microstates are in phase space (the answer is related to Planck's constant, which you'll note has units of position * momentum). This section mostly talks about the question of coarse-graining, but you can see that "dx dp" is sort of put in by hand in the classical version: https://en.wikipedia.org/wiki/Entropy_(statistical_thermodynamics)#Counting_of_microstates I wish I had a better citation but I'm not sure I do. In general it seems like (2) is talked about more in the literature, even though I think (1) is more interesting. This could be because Liouville's theorem provides enough justification for most people's tastes. Finally, knowing "how big your mic

3So8res3y

:D (strong upvote for shifting position in realtime, even if by small degrees and towards the opposite side of the fence from me :-p) (I'm not actually familiar enough w/ statmech to know what measure we use on phase-space, and I'm interested in links that explain what it is, and how it falls out of QM, if you have any handy :D) I don't currently have much sympathy for "entropy and differential entropy are different" view, b/c I occasionally have use for non-uniform measures even in the finite case. Like, maybe I'm working with distributions over 10 states, and I have certain constraints I'm trying to hit, and subject to those constraints I wanna maximize entropy wrt the Binomial distribution as the measure. And you might be like "Well, stop doing that, and start working with distributions on 2^10 states, and convert all your constraints to refer to the bitcount of the new state (instead of the value of the old state). Now you can meet your constraints while maximizing entropy wrt the uniform measure like a normal person." To which I'd reply (somewhat trollishly) "that rejoinder doesn't work in the limit, so it must get weaker and weaker as I work with more and more states, and at numbers as large as 10 this rejoinder is so weak as to not compel me." From my perspective, the obvious rejoinder to "entropy is already two-place" is "insofar as entropy is two-place, cross-entropy is three-place!". Which, ftr, I might find compelling. It depends whether differential cross-entropy needs all three parameters (P, Q, μ), or whether we can combine two of them (like by using (P, μP/Q) or something). Or, at least, that's what my intuition says off-the-cuff; I'm running on guesswork and a stale cache here :-p.
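One way to see how the two framings in that exchange line up, sketched under my own assumptions: a small n = 4 bits, an arbitrary distribution P over bitcounts, and the unnormalized counting measure m(k) = C(n, k). None of these come from the comment. Spreading each P(k) uniformly over the bitstrings with k ones gives a distribution on 2^n states whose ordinary Shannon entropy equals the entropy of P taken with respect to m.

```python
import math
from itertools import product

n = 4  # number of bits; bitcounts run over 0..n (illustrative choice)
P = [0.1, 0.2, 0.4, 0.2, 0.1]  # arbitrary distribution over bitcounts

# Entropy of P relative to the counting measure m(k) = C(n, k).
entropy_wrt_measure = sum(
    -pk * math.log2(pk / math.comb(n, k)) for k, pk in enumerate(P) if pk > 0
)

# Lift P to bitstrings: each string with k ones gets P(k) / C(n, k).
lifted = [
    P[sum(bits)] / math.comb(n, sum(bits))
    for bits in product((0, 1), repeat=n)
]

# Ordinary Shannon entropy of the lifted distribution (uniform measure).
entropy_lifted = sum(-p * math.log2(p) for p in lifted if p > 0)

print(entropy_wrt_measure, entropy_lifted)  # the two values agree up to rounding
```

Normalizing m to the Binomial(n, 1/2) probabilities would subtract the constant n bits from the first quantity, so it would not change which P maximizes it.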

[-]Adam Scherlis3y100

From my perspective, the obvious rejoinder to "entropy is already two-place" is "insofar as entropy is two-place, cross-entropy is three-place!".

I think this is roughly where I'm at now.

After thinking a bit and peeking at Wikipedia, the situation seems to be:

The differential entropy of a probability density p is usually defined as

$-\int p(x) \log p(x) \, dx$

This is unfortunate, because it isn't invariant under coordinate transformations on x. A more principled (e.g. invariant) thing to write down, courtesy of Jaynes, is

$-\int p(x) \log \left( \frac{p(x)}{m(x)} \right) dx$

where $m = d μ / d x$ is a density function for some measure $μ$ . We can also write this as

$-\int \log \left( \frac{dP}{dμ} \right) dP$ (Jaynes' continuous entropy of P with respect to $μ$ )

in terms of a probability measure P with $p = d P / d x$ , which is a bit more clearly invariant.
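A numerical sketch of the invariance claim, under assumptions I picked for illustration: the density p(x) = 2x on [0, 1], the change of variables y = 2x, and a simple midpoint-rule integrator. The plain differential entropy shifts by log 2 under the rescaling, while the form with m stays put once m is transformed along with p.

```python
import math

def neg_int_p_log_ratio(p, m, lo, hi, n=100_000):
    """Midpoint-rule estimate of -∫ p(x) log(p(x) / m(x)) dx over [lo, hi]."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        px = p(x)
        if px > 0:
            total -= px * math.log(px / m(x)) * dx
    return total

# Density p(x) = 2x on [0, 1], and the same distribution after y = 2x,
# which has density q(y) = y / 2 on [0, 2].
p = lambda x: 2 * x
q = lambda y: y / 2

# Plain differential entropy: m = 1 in whichever coordinate we happen to use.
one = lambda _: 1.0
h_x = neg_int_p_log_ratio(p, one, 0.0, 1.0)
h_y = neg_int_p_log_ratio(q, one, 0.0, 2.0)

# Form with m: the measure density transforms too (m = 1 in x becomes 1/2 in y).
m_y = lambda _: 0.5
j_x = neg_int_p_log_ratio(p, one, 0.0, 1.0)
j_y = neg_int_p_log_ratio(q, m_y, 0.0, 2.0)

print(h_x, h_y)  # differ by about log(2)
print(j_x, j_y)  # agree up to numerical error
```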

Now we can define a cross-entropy-like thing as

$-\int \log \left( \frac{dQ}{dμ} \right) dP$ (continuous cross-entropy of Q under P with respect to $μ$ )

...and a small surprise is coming up. Jumping back to the discrete case, the KL divergence or "relative entropy" is

$D(P \| Q) = \sum_{x} P(x) \log \left( \frac{P(x)}{Q(x)} \right) = \mathrm{CrossEntropy}(P, Q) - \mathrm{Entropy}(P)$
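The discrete identity is easy to check numerically. This is a minimal sketch with two arbitrary example distributions, using base-2 logs (any base works as long as it is used consistently):

```python
import math

def entropy(p):
    """Shannon entropy of p in bits."""
    return sum(-px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    """Expected surprisal of q under p, in bits."""
    return sum(-px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    """D(P || Q) in bits."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

P = [0.5, 0.25, 0.25]
Q = [0.25, 0.25, 0.5]

print(kl_divergence(P, Q))                # 0.25
print(cross_entropy(P, Q) - entropy(P))   # 0.25
```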

What happens when we try to write something analogous with our new continuous entropy ... (read more)

3Adam Scherlis3y

I wanted to let that comment be about the interesting question of how we unify these various things. But on the ongoing topic of "why not call all this entropy, if it's all clearly part of the same pattern?": When the definition of some F(x) refers to x twice, it's often useful to replace one of them with y and call that G(x, y). But it's usually not good for communication to choose a name for G(x, y) that (almost) everyone else uses exclusively for F(x), especially if you aren't going to mention both x and y every time you use it, and doubly especially if G is already popular enough to have lots of names of its own (you might hate those names, but get your own[1]). e.g.: x*y is not "the square of x and y" much less "the square of x [and y is implied from context]", and the dot product v.w is not "the norm-squared of v and w" etc. [1] might I suggest "xentropy"?

5So8res3y

:D (strong upvote for delicious technical content) (also fyi, the markdown syntax for footnotes is like blah blah blah[^1] and then somewhere else in the file, on a newline, [^1]: content of the footnote) This updates me a fair bit towards the view that we should keep entropy and cross-entropy separate. The remaining crux for me is whether the info theory folks are already using "entropy" to mean "the minimum expected surprise you can achieve by choosing a code from this here space of preferred codes" (as per your "wiggle room" above), in which case my inclination is to instead cackle madly while racing my truck through their wiggle-room, b/c that's clearly a cross-entropy. At a glance through Wikipedia, it doesn't look like they're doing that, though, so I retreat. But, hmm, I'm not sure I retreat all the way to having separate words. I'm persuaded that "the square of x" should absolutely not mean "x * y [where y is implied from context]", and I no longer think that "entropy" should mean the cross-entropy with Q implied from context. (Thanks for the distillation of why it's mad!) But, like, if geometers instead invented a two-place operation rect, with a general form rect(x, y) := x * y, and declared that rect(x) was shorthand for rect(x, x), then I would not protest; this seems like a reasonable and unambiguous way to reduce the number of names floating around.[1] And this seems to me like exactly what the information theorists (by contrast to the physicists) are doing with H (by contrast to S)! Like, the notations H(P) and H(P,Q) are just begging for us to pronounce H the same way in each case; we're not tempted to pronounce the former as "eych" and the latter as "cross-eych". And no ambiguity arises, so long as we use the rule that if Q is left out then we mean P. Thus, I propose the following aggressive nomenclature: * the "P-entropy of Q (wrt μ)" aka $H_{μ}(P, Q)$ is the general form * the "entropy of P (wrt μ)" aka $H_{μ}(P)$ is a shorthand for "the P-ent

3Adam Scherlis3y

I agree with all the claims in this comment and I rather like your naming suggestions! Especially the "P-entropy of Q = Q-complexity of P" trick which seems to handle many use cases nicely. (So the word "entropy" wasn't really my crux? Maybe not!)

5So8res3y

Convergence! 🎉🎉. Thanks for the discussion; I think we wound up somewhere cooler than I would have gotten to on my own. Now we just need to convince Alex :-p. Alex, are you still reading? I'm curious for your takes on our glorious proposal.

4Alex_Altair3y

Heh, I'm still skimming enough to catch this, but definitely not evaluating arguments. I'm definitely still open to both changing my mind about the best use of terms and also updating the terminology in the sequence (although I suspect that will be quite a non-trivial amount of modified prose). And I think it's best if I don't actually think about it until after I publish another post. I'd also be much more inclined to think harder about this discussion if there were more than two people involved. My main goal here has always been "clearly explain the existing content so that people can understand it", which is very different from "propose a unification of the whole field" (it's just that "unification" is my native method of understanding).

3So8res3y

Cool cool. A summary of the claims that feel most important to me (for your convenience, and b/c I'm having fun): * K-complexity / "algorithmic entropy" is a bad concept that doesn't cleanly relate to physics!entropy or info!entropy. * In particular, the K-complexity of a state s is just the length of the shortest code for s, and this is bad because when s has multiple codes it should count as "simpler". (A state with five 3-bit codes is simpler than a state with one 2-bit code.) (Which is why symmetry makes a concept simpler despite not making its code shorter.) * If we correct our notion of "complexity" to take multiple codes into account, then we find that complexity of a state s (with respect to a coding scheme C) is just the info!cross-entropy H(s,C). Yay! Separately, some gripes: * the algorithmic information theory concept is knuckleheaded, and only approximates info!entropy if you squint really hard, and I'm annoyed about it * I suspect that a bunch of the annoying theorems in algorithmic information theory are annoying precisely because of all the squinting you have to do to pretend that K-complexity was a good idea And some pedagogical notes: * I'm all for descriptive accounts of who uses "entropy" for what, but it's kinda a subtle situation because: * info!entropy is a very general concept, * physics!entropy is an interesting special case of that concept (in the case where the state is a particular breed of physical macrostate), * algo!entropy is a derpy mistake that's sniffing glue in the corner, * algo!entropy is sitting right next to a heretofore unnamed concept that is another interesting special case of info!(cross-)entropy (in the case where the code is universal). (oh and there's a bonus subtlety that if you port algo!entropy to a situation where the coding schema has at most one code per state--which is emphatically not the case in algorithmic information theory--then in that limited case it agrees with info!cross-entropy.
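The "five 3-bit codes versus one 2-bit code" comparison can be sketched as follows. The function names are hypothetical, and the corrected quantity is taken to be -log2 of the summed weights 2^-len, which is how I read "take multiple codes into account":

```python
import math

def k_complexity(code_lengths):
    """Length in bits of the shortest code for the state."""
    return min(code_lengths)

def multi_code_complexity(code_lengths):
    """-log2 of the total weight 2^-len summed over every code for the state."""
    return -math.log2(sum(2.0 ** -length for length in code_lengths))

five_three_bit_codes = [3, 3, 3, 3, 3]
one_two_bit_code = [2]

print(k_complexity(five_three_bit_codes), k_complexity(one_two_bit_code))  # 3 2
print(multi_code_complexity(five_three_bit_codes))  # about 0.678 bits
print(multi_code_complexity(one_two_bit_code))      # 2.0 bits
```

Shortest-code length ranks the second state as simpler. Summing the weights of all codes reverses the ranking, because five codes of weight 1/8 add up to 5/8, which is more than 1/4.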

3Adam Scherlis3y

Endorsed.

3habryka3y

Note: I added LaTeX to your comment to make it easier to read. Hopefully you don't mind. Pretty sure I translated it correctly. Feel free to revert of course.

3Adam Scherlis3y

Footnote 17 sounds confusing and probably wrong to me, but I haven't thought it through. Macrostates should have some constraint that makes their total probability 1; you can't have a macrostate containing a single very unlikely microstate. (Edit: "wrong" seems a bit harsh on reflection but I dislike the vagueness about "cheating" a lot. The single-improbable-thing macrostate should just not typecheck instead of somehow being against the spirit of things.)

3Adam Scherlis3y

I would maybe say that your "average entropy" (what I'd call entropy) is always the average over every state, every single time, and (uniform) macrostates are just a handy conceptual shorthand for saying "I want all of these states to have equal p (equal -log p) and all of these to have zero p (infinite -log p)" without getting bogged down in why 0 log 0 is 0. A state is "in" a macrostate if it's one of the states with nonzero p for that macrostate, but the sum is always over everything.

[-]Ruby3y80

Curated. Came here to curate this and I think So8res's comment serves as good a curation notice as I might write myself:

This piece reads to me like the output of someone who worked hard to comprehend a topic in full, without accepting the rest of society's cloudy bullshit / papered-over-confusion / historical baggage in place of answers. And in a particularly thorny case, no less. And with significant effort made to articulate the comprehension clearly and intuitively to others.
For instance: saying "if we're going to call all of these disparate concepts 'e

... (read more)

4Alex_Altair3y

I'm psyched to have a podcast version! The narrator did a great job. I was wondering how they were going to handle several aspects of the post, and I liked how they did all of them.

[-]mukashi3y30

Thank you very much for writing this.

I have struggled with entropy since I first studied it in high school. I was taught that entropy measures "disorder" or the "chaos" in a system. I was totally disconcerted by the units (Joules/Kelvin), because I couldn't grasp it intuitively, the same way I could reason, for instance, about speed. I searched for a long time for an answer. Everything became much clearer (I remember this as a revelation) once I read "A farewell to Entropy", by Arieh Ben-Naim. According to the author, the only reason entropy is... (read more)

3Alex_Altair3y

The historical baggage is something that tripped me up, too. In an upcoming post I have a section about classical thermodynamic entropy, including an explanation of the weird units!

1mukashi3y

That's great, I subscribed and am looking forward to it!

[-]Yoav Ravid2y20

You're not even holding a specific state, but are instead designing something to deal with a "type" of state, like writing a compression algorithm for astronomical images.

This reminds me of "K-complexity is silly; use cross-entropy instead", though I'm not quite sure how/whether they're actually related.

2Alex_Altair2y

Yeah, So8res wrote that post after reading this one and having a lot of discussion in the comments. That said, my memory was that people eventually convinced him that the title idea in his post was wrong.

2Yoav Ravid2y

I think they later agreed he was actually right, and the initial criticism was based on a misunderstanding.

[-]Vitor3y21

While I think this post overall gives good intuition for the subject, it also creates some needless confusion.

Your concept of "abstract entropy" is just Shannon entropy applied to uniform distributions. Introducing Shannon entropy directly, while slightly harder, gives you a bunch of the ideas in this post more or less "for free":

Macrostates are just events and microstates are atomic outcomes (as defined in probability theory). Any rules how the two relate to each other follow directly from the foundations of probability.
The fact that E[-log x] is th

... (read more)

[-]Lakin3y*12

Registering that I think "[entropy] can be used to explain the arrow of time" is bunk (I've linked 4:50, but see particularly the animation at 5:49): entropy works to explain the arrow of time only if we assume a low-entropy initial state (eg the big bang) in the past
Edit: Oh hm maybe the description isn't "time moves forward" but "time moves away from the big bang"

[-]Lakin3y10

I REALLY liked this. A few years ago I scoured the internet trying to find a good conceptual explanation of entropy, and I couldn't find any. This is by far the best that I've seen. I'm glad you made it!

[-]awenonian3y10

Is it important that negentropy be the result of subtracting from the maximum entropy? It seemed a sensible choice, up until it introduces infinities, and made every state's negentropy infinite. (And also that, if you subtract from 0, then two identical states should have the same negentropy, even in different systems. Unsure if that's useful, or harmful).

Though perhaps that's important for the noting that reducing an infinite system to a finite macrostate is an infinite reduction? I'm not sure if I understand how (or perhaps when?) that's more useful than... (read more)

[-]Caerulean3y10

Reading this piece about entropy as an abstract concept reminded me of a beautiful story touching on the topic: Exhalation by Ted Chiang (wikipedia page with spoilers).

[-]andrew sauer3y10

Do you think it would be easier to solve the diamond maximizer problem (as described on Arbital) if you were trying to maximize entropy instead of diamonds, due to entropy being more abstract and less dependent on specific ontology?

2Alex_Altair3y

Hm, it seems pretty dependent on ontology to me – that's pretty much what the set of all states is, an ontology for how the world could be.

[-]cubefox3y*10

This is a very thought-provoking post. As far as I understand, it is an attempt to find a unified theory of entropy.

I admit I am still somewhat confused about this topic. This is partly because of my insufficient grasp of the material in this post, but, I think, also partly because you haven't yet gone quite far enough with your unification.

One point is the thinking in terms of "states". A macrostate is said to be a set of microstates. As far as I understand, the important thing here is that all microstates are presumed to be mutually exclusive, such tha... (read more)

1Adam Scherlis3y

I think macrostates are really a restricted kind of probability distribution, not a kind of proposition. But they're the kind of distribution p_A that's uniform over a particular disjunction A of microstates (and zero elsewhere), and I think people often equivocate between the disjunction A and the distribution p_A. [EDIT: "macrostate" is a confusing term, my goal here is really to distinguish between A and p_A, whatever you want to call them] In general, though, I think macrostates aren't fundamental and you should just think about distributions if you're comfortable with them. I think microstates should indeed be considered completely-specified possible worlds, from this perspective. "Average entropy" / "entropy of a macrostate" in OP ("entropy" in standard usage) is a function from probability distributions to reals. Shannon came up with an elegant set of axioms for this function, which I don't remember offhand, but which uniquely pins down the expectation of -log(p(microstate)) as the entropy function (up to a constant factor). "Entropy of a microstate" in OP ("surprisal" in information theory, no standard name otherwise AFAIK) is a function from probability distributions to random variables, which is equal to -log(p(microstate)). So I guess I'm not sure propositions play that much of a role in the usual definition of entropy. On the other hand, if we do extend entropy to arbitrary propositions A, it probably does make sense to define it as the conditional expectation S(A) = E[-log p | A], as you did. Then "average entropy"/"entropy" of a macrostate p_A is S(True) under the distribution p_A, and "entropy"/"surprisal" of a microstate B (in the macrostate p_A) is S(B) under the distribution p_A. By a slight coincidence, S(True) = S(A) under p_A, but S(True) is the thing that generalizes to give entropy of an arbitrary distribution. I've never seen an exploration of what happens if you apply this S() to anything other than individual microstates or True,

1cubefox3y

Could you clarify this part? I think I don't understand your notation here.

2Adam Scherlis3y

I think I was a little confused about your comment and leapt to one possible definition of S() which doesn't satisfy all the desiderata you had. Also, I don't like my definition anymore, anyway. Disclaimer: This is probably not a good enough definition to be worth spending much time worrying about. First things first: I think this is indeed how we should think of "microstates". (I don't want to use the word "macrostate" at all, at this point.) I was thinking of something like: given a probability distribution p and a proposition A, define "S(A) under p" = $\frac{\sum_{x \in A} p(x) (-\log p(x))}{\sum_{x \in A} p(x)}$ where the sums are over all microstates x in A. Note that the denominator is equal to p(A). I also wrote this as S(A) = expectation of (-log p(x)) conditional on A, or $S(A) = E[(-\log p) \mid A]$, but I think "log p" was not clearly "log p(x) for a microstate x" in my previous comment. I also defined a notation p_A to represent the probability distribution that assigns probability 1/|A| to each x in A and 0 to each x not in A. I used T to mean a tautology (in this context: the full set of microstates). Then I pointed out a couple consequences: * Typically, when people talk about the "entropy of a macrostate A", they mean something equal to log|A|. Conceptually, this is based on the calculation $\sum_{x \in A} \frac{1}{|A|} \left( -\log \frac{1}{|A|} \right)$, which is the same as either "S(A) under p_A" (in my goofy notation) or "S(T) under p_A", but I was claiming that you should think of it as the latter. * The (Shannon/Gibbs) entropy of p, for a distribution p, is equal to "S(T) under p" in this notation. * Finally, for a microstate x in any distribution p, we get that "S({x}) under p" is equal to -log p(x). All of this satisfied my goals of including the most prominent concepts in Alex's post: * log |A| for a macrostate A * Shannon/Gibbs entropy of a distribution p * -log p(x) for a microstate x And a couple other goals: * Generalizing the Shannon/Gibbs entropy, which is $S(p) = E_{x \sim p}[-\log p(x)]$, in a natural way to
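The three consequences listed above can be checked on a toy example. This sketch makes assumptions of its own: six microstates labelled 0 to 5, base-2 logs, and an arbitrary distribution p. The function `S` follows the conditional-expectation definition given in the comment.

```python
import math

def S(A, p):
    """Expectation of -log2 p(x) conditional on A, for a dict p of microstate probabilities."""
    p_of_A = sum(p[x] for x in A)
    return sum(p[x] * -math.log2(p[x]) for x in A if p[x] > 0) / p_of_A

def uniform_on(A, all_states):
    """The distribution p_A: probability 1/|A| on each x in A, zero elsewhere."""
    return {x: (1 / len(A) if x in A else 0.0) for x in all_states}

T = set(range(6))   # the tautology: every microstate
A = {0, 1, 2, 3}    # a "macrostate" as a set of microstates
p_A = uniform_on(A, T)

print(S(T, p_A), S(A, p_A), math.log2(len(A)))  # 2.0 2.0 2.0

p = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125, 4: 0.0, 5: 0.0}
print(S(T, p))                     # 1.75, the Shannon entropy of p
print(S({1}, p), -math.log2(p[1])) # 2.0 2.0
```

Note that `S(A, p)` is undefined (division by zero) when p assigns A zero probability, which matches the conditional expectation being undefined there.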

1cubefox3y

Okay, I understand. The problem with fundamental microstates is that they only really make sense if they are possible worlds, and possible worlds bring their own problems. One is: we can gesture at them, but we can't grasp them. They are too big, they each describe a whole world. We can grasp the proposition that snow is white, but not the equivalent disjunction of all the possible worlds where snow is white. So we can't use them for anything psychological like subjective Bayesianism. But maybe that's not your goal anyway. A more general problem is that there are infinitely many possible worlds. There are even infinitely many where snow is white. This means it is unclear how we should define a uniform probability distribution over them. Naively, if $1/\infty$ is 0, their probabilities do not sum to 1, and if it is larger than 0, they sum to infinity. Either option would violate the probability axioms. Warning: long and possibly unhelpful tangent ahead. Wittgenstein's solution for this and other problems (in the Tractatus) was to ignore possible worlds and instead regard "atomic propositions" as basic. Each proposition is assumed to be equivalent to a finite logical combination of such atomic propositions, where logical combination means propositional logic (i.e. with connectives like not, and, or, but without quantifiers). Then the a priori probability of a proposition is defined as the rows in its truth table where the proposition is true divided by the total number of rows. For example, for a and b atomic, the proposition a∨b has probability 3/4, while a∧b has probability 1/4: The disjunction has three out of four possible truth-makers - (true, true), (true, false), (false, true), while the conjunction has only one - (true, true). This definition in terms of the ratio of true rows in the "atomicized" truth-table is equivalent to the assumption that all atomic propositions have probability 1/2 and that they are all probabilistically independent. Wittgenstein did not d
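The truth-table definition is simple enough to compute directly. Here is a small sketch in which the function name and the encoding of propositions as Python functions are my own illustrative choices:

```python
from fractions import Fraction
from itertools import product

def a_priori_probability(proposition, num_atoms):
    """Fraction of truth-table rows (over num_atoms atomic propositions) where the proposition is true."""
    rows = list(product((True, False), repeat=num_atoms))
    true_rows = sum(1 for row in rows if proposition(*row))
    return Fraction(true_rows, len(rows))

print(a_priori_probability(lambda a, b: a or b, 2))   # 3/4
print(a_priori_probability(lambda a, b: a and b, 2))  # 1/4
print(a_priori_probability(lambda a: a, 1))           # 1/2
```

Counting rows this way gives each atomic proposition probability 1/2, as the last line shows.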

[-]Algon3y10

Whilst I think that what you've written so far is covered in single texts in the literature, I don't think I've seen something that is as approachable as this before. Well done!

[-]Jon Garcia3y10

Excellent introduction. Your examples were all very intuitive.

For those who are reading, one way to get an intuition for the difference between binary strings and bits is to look at data compression. To begin with, it's easy to create a code like ASCII, where every character is represented by a binary string of length 8 (usually referred to as 8 "bits" or one byte), allowing up to $2^{8} = 256$ unique characters. This type of code will allow you to represent a text document in English that's 1024 characters in length with exactly 1 kB of information.
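To make the gap between "8 binary digits per character" and actual information content concrete, here is a rough sketch. It only looks at single-character frequencies in an arbitrary sample string, so it is an illustration of the idea rather than a real compressor, and the sample text is my own.

```python
import math
from collections import Counter

def fixed_length_bits(text, bits_per_char=8):
    """Size of the text under a fixed-length code such as 8-bit ASCII."""
    return len(text) * bits_per_char

def empirical_entropy_bits_per_char(text):
    """Shannon entropy of the character frequencies observed in the text."""
    counts = Counter(text)
    n = len(text)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

sample = "the quick brown fox jumps over the lazy dog " * 4  # arbitrary sample text

print(fixed_length_bits("x" * 1024) / 8)        # 1024.0 bytes for 1024 characters
print(empirical_entropy_bits_per_char(sample))  # well under 8 bits per character
```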

Exc... (read more)

1Jon Garcia3y

Also, just a couple minor errors: 1. In your "The first 31 binary strings in lexical order" figure, you're missing a white square at the top of the fourth 3-bit string. 2. "diving by W" should be "dividing by W". I know spell check would miss that one. I didn't notice any other errors. Again, great article.

4Alex_Altair3y

Nice catches. I love that somebody double-checked all the binary strings. :)

^{^}

The quickest gloss is that optimization is a decrease in entropy. So it's a pretty tight connection! But those six words are hiding innumerable subtleties.

^{^}

Something like "the entropy of a closed system tends to increase over time"; there are many formulations.

^{^}

Some contexts will use "nats" or "dits" or whatever. This comes from using logarithms with different bases, and is just a change of units, like meters versus feet.

^{^}

I've justified the use of binary before. There's a lot of interesting detail to go into about what changes when you use three symbols or more, but all of the heavy-lifting conclusions are the same. Turing machines that use three symbols can compute exactly the set of things that Turing machines with two symbols can; the length of a number $n$ is $O(\log n)$ whether it's represented in binary or trinary; et cetera.

^{^}

Binary strings are usually written out with 0s and 1s, and I'll do that in the text. But I personally always visualize them as strings of little white and black squares, which is what I'll use in the illustrations. This is probably because I first learned about them in the context of Turing machines with tapes.

^{^}

Note that the entropy of a state is the length of its label, and not the label itself; the specific layout of 0s and 1s just serves to distinguish that label from other labels of the same length.

^{^}

Justification for merely taking the log comes from the derivation at the end of this section, though you may have been able to intuit it already!

^{^}

Again, this is assuming you're using a binary string labeling scheme that uses all the smaller strings before using bigger strings. You could always decide to label every state with binary strings of length 100.

^{^}

The typical minimal notation for describing Rubik's Cube algorithms has one letter for each of the six faces (F, B, L, R, U, D), and then an apostrophe for denoting counter-clockwise (and a number of other symbols for more compact representations). This means that six of the one-move states have a label of length one, and six others have length two. This all comes out in the big-O wash, and the label lengths will end up differing by a constant factor, because e.g. $\log_{6}(x) = \frac{\log_{2}(x)}{\log_{2}(6)} \approx 0.387 \cdot \log_{2}(x)$.

^{^}

I'll have bits of math throughout this sequence. This is a pretty math-heavy concept, but I still don't think that most of the actual equations are essential for gaining a useful understanding of entropy (though it is essential to understand how logarithms work). So if you feel disinclined to follow the derivations, I'd still encourage you to continue reading the prose.

None of the derivations in this sequence are here for the purpose of rigorously proving anything; rather, I've tried to include them where the structure of the equations actually helped me understand the concepts more clearly.

^{^}

Wikipedia informs me that this is not technically the optimal strategy for winning the game, because if you are behind and your opponent plays optimally, then you're better off guessing specific people and hoping to get lucky.

^{^}

Or as close to half as you can get.

^{^}

Again, this is only exact when $W$ is a power of 2, but in any case, the binary-string figure is strictly less than the yes/no-questions figure, which is the discrepancy we want to resolve here.
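
A rough numerical comparison of the two strategies, under assumptions that are mine rather than the post's: labels are handed out in the order 0, 1, 00, 01, 10, 11, 000, ... (skipping the empty string), each yes/no question halves the remaining candidates, and the figure being compared is the average label length. The names `label_length` and `questions_needed` are hypothetical.

```python
def label_length(i):
    """Length of the i-th label (i = 1, 2, 3, ...) in the order 0, 1, 00, 01, ..."""
    # There are 2**k labels of length k, so the label of index i has
    # length floor(log2(i + 1)), computed here with integer arithmetic.
    return (i + 1).bit_length() - 1

def questions_needed(w):
    """Worst-case number of halving questions to single out one of w states."""
    # Equals ceil(log2(w)) for w >= 1, without floating point.
    return (w - 1).bit_length()

for w in (8, 100, 1024):
    average = sum(label_length(i) for i in range(1, w + 1)) / w
    print(w, questions_needed(w), round(average, 2))
```

For the sizes tried here, the average label comes out shorter than the number of questions, which is the gap this footnote refers to.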

^{^}

For this reason I've been careful not to use the phrase "bit string", instead sticking with "binary string". For our purposes, a binary string is a bit string if each of its symbols could have been the flipped value (for some relevant definition of "could").

^{^}

Examples of finite prefix codes: {0, 1}, {0, 10, 11}, {00, 01, 10, 110, 1110, 1111}

Example of an infinite prefix code: {0, 10, 110, 1110, 11110, ... }
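
A short sketch for checking these examples, using the standard definition that a prefix code is a set of strings in which no codeword is a prefix of another. The helper name `is_prefix_code` is an assumption of this illustration, and only a finite slice of the infinite code can be checked.

```python
def is_prefix_code(codewords):
    """True if no codeword is a prefix of a different codeword."""
    words = sorted(set(codewords))
    # In sorted order, a word that is a prefix of some other word is
    # immediately followed by a word it prefixes, so checking
    # neighbouring pairs is enough.
    return all(not b.startswith(a) for a, b in zip(words, words[1:]))

finite_examples = [
    ["0", "1"],
    ["0", "10", "11"],
    ["00", "01", "10", "110", "1110", "1111"],
]
for code in finite_examples:
    print(code, is_prefix_code(code))

# First 20 codewords of the infinite code 0, 10, 110, 1110, ...
infinite_slice = ["1" * k + "0" for k in range(20)]
print(is_prefix_code(infinite_slice))

# A non-example: "0" is a prefix of "01".
print(is_prefix_code(["0", "01"]))
```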

^{^}

Note that for many systems, most states have maximum or near-maximum entropy, such that the negentropy is virtually the same as the average entropy minus the state's entropy; this would also mean that most states have virtually zero negentropy.

^{^}

You could argue that the maximum entropy macrostate is just the macrostate that contains only the highest entropy state(s). I think the spirit of macrostates is that you don't consider individual states, and thus it would be "cheating" to pick out specific states to form your macrostate. In the spirit of $S = \log(W)$, the way to maximize $S$ is to maximize $W$, that is, to include all states in $W$.

