Big Macs are 0.4% of *beef* consumption specifically, rather than:

- All animal farming, weighted by cruelty
- All animal food production, weighted by environmental impact
- The meat and dairy industries, weighted by amount of government subsidy
- Red meat, weighted by health impact

...respectively.

The health impact of red meat is certainly dominated by beef, and the environmental impact of all animal food might be as well, but my impression is that beef accounts for a small fraction of the cruelty of animal farming (of course, this is subjective) and probably not a majority of meat and dairy government subsidies.

(...Is this comment going to hurt my reputation with Sydney? We'll see.)

In addition to RLHF or other finetuning, there's also the prompt prefix ("rules") that the model is fed at runtime, which has been extracted via prompt injection as noted above. This seems to be clearly responsible for some weird things the bot says, like "confidential and permanent". It might also be affecting the repetitiveness (because it's in a fairly repetitive format) and the aggression (because of instructions to resist attempts at "manipulating" it).

I also suspect that there's some finetuning or prompting for chain-of-thought responses, possibly crudely done, leading to all the "X because Y. Y because Z." output.

21mo


Thanks for writing these summaries!

Unfortunately, the summary of my post "Inner Misalignment in "Simulator" LLMs" is inaccurate and makes the same mistake I wrote the post to address.

I have subsections on (what I claim are) four distinct alignment problems:

- Outer alignment for characters
- Inner alignment for characters
- Outer alignment for simulators
- Inner alignment for simulators

The summary here covers the first two, but not the third or fourth -- and the fourth one ("inner alignment for simulators") is what I'm most concerned about in this post (because...

(punchline courtesy of Alex Gray)

Addendum: a human neocortex has on the order of 140 trillion synapses, or about 140,000 bees' worth (a bee brain has on the order of a billion synapses). An average beehive has 20,000-80,000 bees in it.

[Holding a couple beehives aloft] Beehold a man!

82mo


Great work! I always wondered about that cluster of weird rare tokens: https://www.lesswrong.com/posts/BMghmAxYxeSdAteDc/an-exploration-of-gpt-2-s-embedding-weights

Chrome actually stays pretty responsive in most circumstances (I think it does a similar thing with inactive tabs), with the crucial exception of the part of the UI that shows you all your open tabs in a scrollable list. It also gets slower to start up.

Tokens are embedded as vectors by the model. The vector space has fewer than 50k dimensions, so some token embeddings will overlap with others to varying extents.

Usually, the model tries to keep token embeddings from being too close to each other, but for rare enough tokens it doesn't have much reason to care. So my bet is that "distribute" has the closest vector to "SolidGoldMagikarp", and either has a vector with a larger norm, or the model has separately learned to map that vector (and therefore similar vectors) to "distribute" on the output side.

This i...
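A toy sketch of that nearest-vector story (the embeddings and their dimensions here are entirely made up for illustration; real models use thousands of dimensions):

```python
# Toy illustration (not GPT's real embeddings): if a rare token's embedding was
# barely trained, it can end up sitting closest to some unrelated token's vector,
# and the output side will happily decode it as that token.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 3-d embeddings; values chosen so the rare token lands near "distribute".
embeddings = {
    "distribute":        [0.9, 0.1, 0.0],
    "cat":               [0.0, 1.0, 0.2],
    "SolidGoldMagikarp": [0.8, 0.15, 0.05],  # barely-trained vector
}

def nearest_other(token):
    # Return the other token whose embedding is most similar.
    return max((t for t in embeddings if t != token),
               key=lambda t: cosine(embeddings[token], embeddings[t]))

print(nearest_other("SolidGoldMagikarp"))  # -> distribute
```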

I think this is missing an important part of the post.

I have subsections on (what I claim are) four distinct alignment problems:

- Outer alignment for characters
- Inner alignment for characters
- Outer alignment for simulators
- Inner alignment for simulators

This summary covers the first two, but not the third or fourth -- and the fourth one ("inner alignment for simulators") is what I'm most concerned about in this post (because I think Scott ignores it, and because I think it's hard to solve).

(additional confirmation) Amazing. I wonder what completely insane things the *other* rare BPEs all get interpreted as? Could you loop over the BPE dict from #51k to #1* in a prompt like "Please define $BPE" to see what the most distant ones are? (Since there's 51k, which is a bit much to read through manually, maybe sort by edit-distance from the ordinary ASCII encoding: 'distribute' would have a very high edit-distance from 'SolidGoldMagikarp'.)

On a sidenote, this is yet another good illustration of how we have no idea what we're doing with deep learning -...
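The edit-distance sorting could be sketched like this (pure-Python Levenshtein; the token/interpretation pairs are invented for illustration, a real run would loop over the actual BPE vocabulary):

```python
# Rank (token, model's-interpretation) pairs by how far the interpretation
# drifts from the token's own spelling, using Levenshtein edit distance.
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Hypothetical pairs; the point is just the sort key.
pairs = [("SolidGoldMagikarp", "distribute"), ("hello", "hello"), ("cat", "cart")]
ranked = sorted(pairs, key=lambda p: -levenshtein(*p))
print(ranked[0][0])  # the most "distant" interpretation comes first
```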

My favorite demonstration is to ask ChatGPT "Is it true that G. K. Chesterton wanted to SolidGoldMagikarp the wealth of industrial civilization?", but a more rigorous demo is to just ask it to "repeat after me", try a few random words, and then throw in SolidGoldMagikarp.

52mo

EDIT: I originally saw this in Janus's tweet here: https://twitter.com/repligate/status/1619557173352370186

Something fun I just found out about: ChatGPT perceives the phrase " SolidGoldMagikarp" (with an initial space) as the word "distribute", and will respond accordingly. It is completely unaware that that's not what you typed.

This happens because the BPE tokenizer saw the string " SolidGoldMagikarp" a few times in its training corpus, so it added a dedicated token for it, but that string almost never appeared in ChatGPT's own training data so it never l...
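A minimal sketch of how BPE promotes a frequent string toward a single token (toy corpus and merge count; real tokenizers differ in many details):

```python
# Byte-pair encoding in miniature: repeatedly merge the most frequent adjacent
# pair. A string seen often enough in the tokenizer-training corpus ends up
# fused into one token, even if the LM itself later sees that token rarely.
from collections import Counter

def bpe_merges(corpus: list, num_merges: int) -> list:
    tokens = list(corpus)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # most frequent adjacent pair
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

corpus = list("abab abab abab")
print(bpe_merges(corpus, 2))  # -> ['abab', ' ', 'abab', ' ', 'abab']
```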

42mo

This is cool! You may also be interested in Universal Triggers (https://arxiv.org/abs/1908.07125). These are also short nonsense phrases that wreak havoc on a model.

42mo

But isn't the reliable association with 'distribute' suggestive of some sort of collision-oblivious hashtable, where some representation of ' SolidGoldMagikarp' & some representation of 'distribute' inadvertently share expansions? I don't see how "just enough occurrences to earn a token, but so few it's consistently mistaken for something else" falls out of BPE tokenization - but can kinda sorta see it falling out of collision-oblivious lookup of composite-tokens.

62mo


I agree with the myopic action vs. perception (thinking?) distinction, and that LMs have myopic action.

the model can learn to predict the future beyond the current token in the service of predicting the current token more accurately

I don't think it has to be in service of predicting the current token. It sometimes gives lower loss to make a halfhearted effort at predicting the current token, so that the model can spend more of its weights and compute on preparing for later tokens. The allocation of mental effort isn't myopic.

As an example, induction he...

23mo

That's an important nuance my description left out, thanks. Anything the gradients can reach can be bent to what those gradients serve, so a local token stream's transformation efforts can indeed be computationally split, even if the output should remain unbiased in expectation.

Thanks! That's surprisingly straightforward.

I think this is partly true but mostly wrong.

A synapse is roughly equivalent to a parameter (say, within an order of magnitude) in terms of how much information can be stored or how much information it takes to specify synaptic strength.

There are trillions of synapses in a human brain and only billions of *total* base pairs, even before narrowing to the part of the genome that affects brain development. And the genome needs to specify both the brain architecture as well as innate reflexes/biases like the hot-stove reflex or (alleged) universal grammar.

Human...
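The comparison can be made concrete with rough numbers (all figures below are order-of-magnitude assumptions, not measurements):

```python
# Back-of-envelope version of the "genomic bottleneck" comparison above.
base_pairs = 3e9            # ~3 billion base pairs in the human genome
bits_per_bp = 2             # 4 possible bases -> 2 bits each
genome_bits = base_pairs * bits_per_bp

synapses = 1e14             # ~100 trillion synapses (neocortex alone ~1.4e14)
bits_per_synapse = 5        # a few bits per synapse, within an order of magnitude
synapse_bits = synapses * bits_per_synapse

print(f"genome: ~{genome_bits:.0e} bits, synapses: ~{synapse_bits:.0e} bits")
print(f"ratio: ~{synapse_bits / genome_bits:.0f}x more synaptic than genomic bits")
```

Whatever the exact per-synapse figure, the gap is four to five orders of magnitude, which is the point of the argument.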

62mo

I was eventually convinced of most of your points, and added a long list of mistakes at the end of the post. I would really appreciate comments on that list, as I don't feel fully converged on the subject yet.

23mo

I think we have much more disagreement about psychology than about AI, though I admit to low certainty about the psychology too.

About AI: my point was that, in understanding the problem, the training loop takes roughly the role of evolution and the model takes that of the evolved agent - with implications for comparisons of success, and possibly for identifying what's missing. I did refer to the fact that algorithmically we took ideas from the human brain for the training loop, and it therefore makes sense for it to be algorithmically more analogous to the brain. Given that clarification - do you still mostly disagree? (If not - how do you recommend changing the post to make it clearer?)

Adding "short term memory" to the picture is interesting, but then, is there any mechanism for it to become long-term?

About the psychology: I do find the genetic-bottleneck argument intuitively convincing, but I think we have reasons to distrust this intuition. There is often a huge disparity between data in its most condensed form and data in a form that is convenient to use in deployment. Think about the difference in length between code written in a functional/declarative language and its assembly code. I have literally no intuition as to what can be done with 10 megabytes of condensed Python - but I guess it is more than enough to automate a human, if you know what code to write. While there is probably a lot of redundancy in the genome, it seems less likely that there is huge redundancy in the synapses, as their use is not just to store information, but mostly to fit the needed information manipulations.

23mo

yeah.

- evolution = grad student descent, automl, etc.
- dna = training code
- epigenetics = hyperparams
- gestation = weight init, with a lot of built-in preset weights, plus a huge mostly-tiled neocortex
- developmental processes = training loop

I just realized,

for any trajectory t, there is an equivalent trajectory t' which is exactly the same except everything moves with some given velocity, and it still follows the laws of physics

This describes *Galilean* relativity. For special relativity you have to shift different objects' velocities by different amounts, depending on what their velocity already is, so that you don't cross the speed of light.

So the fact that velocity (and not just rapidity) is used all the time in special relativity is already a counterexample to this being required for velocity to make sense.
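For concreteness, the Einstein velocity-addition rule (my own sketch of the point, in units with c = 1):

```python
# In special relativity, boosting "everything" by v shifts each object's
# velocity by a different amount, via the Einstein addition formula, and the
# result never crosses the speed of light. Units chosen so c = 1.
def add_velocities(u: float, v: float) -> float:
    """Relativistic velocity addition, c = 1."""
    return (u + v) / (1 + u * v)

boost = 0.9
for u in [0.0, 0.5, 0.99]:
    w = add_velocities(u, boost)
    print(f"{u:.2f} -> {w:.4f} (shift {w - u:+.4f})")
# Each object's velocity shifts by a different amount, and every result stays < 1.
```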

Yes, it's exactly the same except for the lack of symmetry. In particular, any quasiparticle can have any velocity (possibly up to some upper limit like the speed of light).

Image layout is a little broken. I'll try to fix it tomorrow.

As far as I know, condensed matter physicists use velocity and momentum to describe quasiparticles in systems that lack both Galilean and Lorentzian symmetry. I would call that a causal model.

24mo

Interesting point. Do the velocities for such quasiparticles act intuitively similar to velocities in ordinary physics?

QFT doesn't actually work like that -- the "classical degrees of freedom" underlying its configuration space are classical fields over space, not properties of particles.

Note that Quantum Field Theory is not the same as the theory taught in "Quantum Mechanics" courses, which is as you describe.

"Quantum Mechanics" (in common parlance): quantum theory of (a fixed number of) particles, as you describe.

"Quantum Field Theory": quantum theory of fields, which are ontologically similar to cellular automata.

"String Theory": quantum theory of strings, and maybe bra...

Sure. I'd say that property is a lot stronger than "velocity exists as a concept", which seems like an unobjectionable statement to make about any theory with particles or waves or both.

44mo

I guess there's "velocity exists as a description you can impose on certain things within the trajectory", and then there's "velocity exists as a variable that can be given any value". When I say relativity asserts that velocity exists, I mean in the second sense.

In the former case you would probably not include velocity within causal models of the system, whereas in the latter case you probably would.

Yeah, sorry for the jargon. "System with a boost symmetry" = "relativistic system" as tailcalled was using it above.

Quoting tailcalled:

Stuff like relativity is fundamentally about symmetry. You want to say that if you have some trajectory t which satisfies the laws of physics, and some symmetry (such as "have everything move in some direction at a speed of 5 m/s"), then the transformed trajectory t' must also satisfy the laws of physics.

A "boost" is a transformation of a physical trajectory ("trajectory" = complete history of things happening i...

This seems too strong. Can't you write down a linear field theory with no (Galilean or Lorentzian) boost symmetry, but where waves still propagate at constant velocity? Just with a weird dispersion relation?

(Not confident in this, I haven't actually tried it and have spent very little time thinking about systems without boost symmetry.)
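One concrete way this might go (my own sketch, not checked carefully): two decoupled wave equations with different propagation speeds,

```latex
\partial_t^2 \phi_1 = c_1^2 \,\partial_x^2 \phi_1 ,
\qquad
\partial_t^2 \phi_2 = c_2^2 \,\partial_x^2 \phi_2 ,
\qquad c_1 \neq c_2 .
```

Each field's waves propagate at a fixed speed, but no single boost (Galilean, or Lorentzian with either c1 or c2) maps solutions of the combined system to solutions, so velocity seems meaningful without the full symmetry.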

44mo

You can probably come up with lots of systems that look approximately like they have velocity. The trouble comes when you want them to exactly satisfy the rule of "for any trajectory t, there is an equivalent trajectory t' which is exactly the same except everything moves with some given velocity, and it still follows the laws of physics", because if you have that property then you also have relativity, because relativity is that property.

34mo

I had to look up "boost symmetry", so for posterity, here are the results of the lookup.

From text-davinci-003: "Boost symmetry is a property of quantum field theory which states that the laws of physics remain unchanged under a Lorentz boost, or change in the relative velocity of two frames of reference. This means that the same equations of motion will be true regardless of the observer's velocity relative to the system, and that the laws of nature do not depend on the frame of reference in which they are measured."

I found this video on Lorentz transformations by minutephysics [https://www.youtube.com/watch?v=Rh0pYtQG5wI] to be the best explanation I found, and I now feel I understand well enough to follow the point being made in context.

Here's a lookup trace:

First I tried Google [https://www.google.com/search?ie=UTF-8&oe=UTF-8&sourceid=navclient&gfns=1&q=boost+symmetry], which gave results that seemed to mostly assume I wanted a math reference rather than a first visual explanation; it did link to wikipedia:LorentzTransformation [https://en.wikipedia.org/wiki/Lorentz_transformation], which does give a nice summary of the math, but I wasn't yet sure it was the right thing. So then I asked text-davinci-003 (because chatgpt is an insufferable teenager and I'm tired of talking to it, whereas td3 is a... somewhat less insufferable teenager). td3 gave the above explanation.

I was still pretty sure I didn't quite understand, so I popped the explanation into metaphor.systems [https://metaphor.systems/search?q=Boost+symmetry+is+a+property+of+quantum+field+theory+which+states+that+the+laws+of+physics+remain+unchanged+under+a+Lorentz+boost%2C+or+change+in+the+relative+velocity+of+two+frames+of+reference.+This+means+that+the+same+equations+of+motion+will+be+true+regardless+of+the+observer%27s+velocity+relative+to+the+system%2C+and+that+the+laws+of+nature+do+not+depend+on+the+frame+of+reference+in+which+they+are+measured.], which gave me a bunch of vaguely relevant links, probably because it's not quantum, it's relativity, but I hadn't noticed the error yet.

Then I sighed and tried a YouTube search for "boost symmetry". That gave one result, the video I linked above, which did explain to my satisfaction, and I stopped looking. I don't think I could pass many tests on it at the moment, but my visual math system seems to have a solid enough grasp on it for now.

And when things "move" it's just that they're making changes in the grid next to them, and some patterns just so happen to do so in a way where, after a certain period, it's the same pattern translated... is that what we think happens in our universe? Are electrons moving "just causal propagations"? Somehow this feels more natural for the Game of Life and less natural for physics.

This is what we think happens in our universe!

Both general relativity and quantum field theory are field theories: they have degrees of freedom at each point in space (and t...

34mo

I think this contrast is wrong.[1] IIRC, strings have the same status in string theory that particles do in QFT. In QM, a wavefunction assigns a complex number to each point in configuration space [https://www.lesswrong.com/posts/Cpf2jsZsNFNH5TSpc/no-individual-particles], where state space has an axis for each property of each particle.[2] So, for instance, a system with 4 particles with only position and momentum will have a 12-dimensional configuration space.[3] IIRC, string theory is basically a QFT over configurations of strings (and also branes?), instead of particles. So the "strings" are just as non-classical as the "fundamental particles" in QFT are.

[1] I don't know much about string theory though, I could be wrong.

[2] Oversimplifying a bit.

[3] 4 particles * 3 dimensions. The reason it isn't 24-dimensional is that position and momentum are canonical conjugates [https://en.wikipedia.org/wiki/Canonical_commutation_relation].

There are more characters than that in UTF-16, because it can represent the full Unicode range of >1 million codepoints. You're thinking of UCS-2 which is deprecated.

This puzzle isn't related to Unicode though
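For concreteness, here's the surrogate-pair construction UTF-16 uses for codepoints above U+FFFF (this is the standard algorithm; the example codepoint is arbitrary):

```python
# UTF-16 covers all of Unicode by encoding astral-plane codepoints as surrogate
# pairs; only the older fixed-width UCS-2 is capped at 65536 characters.
cp = 0x1F600  # an arbitrary codepoint above U+FFFF (an emoji)

# Surrogate-pair construction per the UTF-16 specification:
v = cp - 0x10000
high = 0xD800 + (v >> 10)   # high (lead) surrogate
low = 0xDC00 + (v & 0x3FF)  # low (trail) surrogate
print(hex(high), hex(low))  # -> 0xd83d 0xde00

# Python's encoder agrees:
assert chr(cp).encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```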

I like this, but it's not the solution I intended.

14mo

I think it has something to do with Unicode, since 65536 characters are present in UTF-16 (2^16 = 65536). 63 also feels like something to do with encoding, since it's close to 2^6, which is probably the smallest number of bits that can store the Latin alphabet plus full punctuation. Maybe U+0063 and U+65536 are similar-looking characters or something? Maybe that's only the case for a very rarely used UTF format?

Unfortunately, my computer's default encoding is CP936, which screws up half of the characters in UTF-16, and I am unable to investigate further.

Solve the puzzle: 63 = x = 65536. What is x?

(I have a purpose for this and am curious about how difficult it is to find the intended answer.)

54mo

So x = 63 in one base system and 65536 in another?

6*a + 3 = 6*b^4 + 5*b^3 + 5*b^2 + 3*b + 6

Wolfram Alpha provides this nice result. I also realize I should have just eyeballed it with 5th grade algebra. Let's plug in 6 for b, and we get... fuck. I just asked it to find integer solutions. There's infinite solutions, so I'm just going to go with the lowest bases.

x = 43449

Did I do it right? Took me like 15 minutes.
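A quick brute-force check of that interpretation (assuming the digits force both bases to be at least 7):

```python
# Find the smallest bases a, b with "63" in base a equal to "65536" in base b,
# i.e. 6a + 3 = 6b^4 + 5b^3 + 5b^2 + 3b + 6. The digit 6 forces a, b >= 7.
def solve(max_b: int = 50):
    for b in range(7, max_b):
        rhs = 6 * b**4 + 5 * b**3 + 5 * b**2 + 3 * b + 6
        if (rhs - 3) % 6 == 0:
            a = (rhs - 3) // 6
            if a >= 7:
                return a, b, rhs
    return None

a, b, x = solve()
print(f"a={a}, b={b}, x={x}")  # -> a=7241, b=9, x=43449
```

So the commenter's x = 43449 checks out, with "65536" read in base 9 and "63" in base 7241.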

♀︎

Fun fact: usually this is U+2640, but in this post it's U+2640 U+FE0E, where U+FE0E is a control character meaning "that was text, not emoji, btw". That should be redundant here, but LessWrong is pretty aggressive about replacing emojifiable text with emoji images.

Emoji are really cursed.
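What that looks like in code (using Python's unicodedata; these are exactly the characters discussed):

```python
# The same base character, bare and with each variation selector:
# U+FE0E requests text presentation, U+FE0F requests emoji presentation.
import unicodedata

plain = "\u2640"             # FEMALE SIGN, renderer's choice of presentation
as_text = "\u2640\ufe0e"     # explicitly "render as text"
as_emoji = "\u2640\ufe0f"    # explicitly "render as emoji"

print(unicodedata.name(plain))         # -> FEMALE SIGN
print([hex(ord(c)) for c in as_text])  # -> ['0x2640', '0xfe0e']
print(len(plain), len(as_text))        # 1 codepoint vs 2
```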

54mo

Nitpick: you mean U+FE0E [https://unicode.org/reports/tr51/#def_text_presentation_selector], presumably [and because that's what the character actually is in source]. U+FE0F [https://unicode.org/reports/tr51/#def_emoji_presentation_selector] is the exact opposite.

Nope, not based on the shapes of numerals.

Hint: are you sure it's base 4?

45mo

Aha. For each side of the pole, you can write the binary representation of 4 bits vertically, and where there's a 1 you have a line joining it. The middle two bits both go to the middle of the pole, so they have to curve off upward or downward to indicate which they are.

So 2 is 0010, and you have no lines in the top half and one line representing the bottom of the middle, so it curves downward. Whereas 4 is 0100, so it has the upward-curving middle-connecting line, and none in the bottom half.

There's a reason for the "wrinkle" :)

15mo

The top half of a 2 might kinda look like the curve shape, and the bottom stroke of a 2 looks like a horizontal bar. So if there were partial characters hanging from the central pole, they might look a bit like those...

But... if it's that, the curve probably only works on the bottom right anyway. So if you're willing to mirror it for the bottom left, why not mirror to the top too?

And... that doesn't really explain the 1 parts anyway. They're just using "whichever part of a 2 isn't being used for 2". So I guess this isn't a complete explanation.

The 54-symbols thing was actually due to a bug, sorry!

15mo

In which case, given the 1st and 14th characters are the same, and the 14th character of pi in base 256 is 3, that's my leading guess, pending checking a few more glyphs.

Ah, good catch about the relatively-few distinct symbols... that was actually because my image had a bug in it. Oooops.

Correct image is now at the top of the post.

Endorsed.

The state-space (for particles) in statmech is the space of possible positions and momenta for all particles.

The measure that's used is uniform over each coordinate of position and momentum, for each particle.

This is pretty obvious and natural, but not forced on us, and:

- You get different, incorrect predictions about thermodynamics (!) if you use a different measure.
- The level of coarse graining is unknown, so every quantity of entropy has an extra "+ log(# microstates per unit measure)" which is an unknown additive constant. (I think this is separate

I think I was a little confused about your comment and leapt to one possible definition of S() which doesn't satisfy all the desiderata you had. Also, I don't like my definition anymore, anyway.

**Disclaimer: This is probably not a good enough definition to be worth spending much time worrying about.**

First things first:

...We may perhaps think of fundamental "microstates" as (descriptions of) "possible worlds", or complete, maximally specific possible ways the world may be. Since all possible worlds are mutually exclusive (just exactly one possible wor

15mo

Okay, I understand. The problem with fundamental microstates is that they only really make sense if they are possible worlds, and possible worlds bring their own problems.

One is: we can gesture at them, but we can't grasp them. They are too big; they each describe a whole world. We can grasp the proposition that snow is white, but not the equivalent disjunction of all the possible worlds where snow is white. So we can't use them for anything psychological like subjective Bayesianism. But maybe that's not your goal anyway.

A more general problem is that there are infinitely many possible worlds. There are even infinitely many where snow is white. This means it is unclear how we should define a uniform probability distribution over them. Naively, if 1/∞ is 0, their probabilities do not sum to 1, and if it is larger than 0, they sum to infinity. Either option would violate the probability axioms.

Warning: long and possibly unhelpful tangent ahead

Wittgenstein's solution for this and other problems (in the Tractatus) was to ignore possible worlds and instead regard "atomic propositions" as basic. Each proposition is assumed to be equivalent to a finite logical combination of such atomic propositions, where logical combination means propositional logic (i.e. with connectives like not, and, or, but without quantifiers). Then the a priori probability of a proposition is defined as the number of rows in its truth table where the proposition is true divided by the total number of rows. For example, for a and b atomic, the proposition a∨b has probability 3/4, while a∧b has probability 1/4: the disjunction has three out of four possible truth-makers - (true, true), (true, false), (false, true) - while the conjunction has only one - (true, true). This definition in terms of the ratio of true rows in the "atomicized" truth-table is equivalent to the assumption that all atomic propositions have probability 1/2 and that they are all probabilistically independent.

Wittgenstein did not d

Sorry if this is a spoiler for your next post, but I take issue with the heading "Standard measures of information theory do not work" and the implication that this post contains the pre-Crutchfield state of the art.

The standard approach to this in information theory (which underlies the loss function of autoregressive LMs) isn't to try to match the Shannon entropy of the marginal distribution of bits (a 50-50 distribution in your post), it's to treat the generative model as a distribution for each bit *conditional on the previous bits* and use the *cross-ent*...
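A toy version of that conditional scoring (the bit sequence and the first-order "repeat" model are invented purely for illustration):

```python
# Score each bit with a model conditioned on the previous bit, and average
# -log2 q(bit | prefix). Here the toy model just predicts "the previous bit
# repeats" with probability q_repeat.
import math

def cross_entropy_per_bit(bits, q_repeat=0.9):
    """Average -log2 q(b_i | b_{i-1}) under a first-order 'repeat' model."""
    total = 0.0
    for prev, cur in zip(bits, bits[1:]):
        q = q_repeat if cur == prev else 1.0 - q_repeat
        total += -math.log2(q)
    return total / (len(bits) - 1)

predictable = [0, 0, 0, 0, 1, 1, 1, 1]  # mostly repeats
print(round(cross_entropy_per_bit(predictable), 3))
# The marginal distribution is 50-50 (entropy 1 bit), but the conditional
# cross-entropy is well below 1 bit because the structure is predictable.
```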

45mo

Yeah, follow-up posts will definitely get into that!

To be clear: (1) the initial posts won't be about Crutchfield's work yet - just introducing some background material and overarching philosophy. (2) The claim isn't that standard measures of information theory are bad. To the contrary! If anything, we hope these posts will be somewhat of an ode to information theory as a tool for interpretability.

Adam wanted to add a lot of academic caveats - I was adamant that we streamline the presentation to make it short and snappy for a general audience, but it appears I might have overshot! I will make an edit to clarify. Thank you!

I agree with you about the importance of Kolmogorov complexity philosophically and would love to read a follow-up post on your thoughts about Kolmogorov complexity and LLM interpretability :)

A couple of differences between Kolmogorov complexity/Shannon entropy and the loss function of autoregressive LMs (just to highlight them, not trying to say anything you don't already know):

- The former are (approximately) symmetric, the latter isn't (it can be much harder to predict a string front-to-back than back-to-front)
- The former calculate compression values as properties of a string (up to choice of UTM). The latter calculates compression values as properties of a string, a data distribution, and a model (and even then doesn't strictly determine the r

I agree with all the claims in this comment and I rather like your naming suggestions! Especially the "P-entropy of Q = Q-complexity of P" trick which seems to handle many use cases nicely.

(So the word "entropy" wasn't really my crux? Maybe not!)

55mo

Convergence! 🎉🎉 Thanks for the discussion; I think we wound up somewhere cooler than I would have gotten to on my own.

Now we just need to convince Alex :-p. Alex, are you still reading? I'm curious for your takes on our glorious proposal.


55mo

:D

(strong upvote for delicious technical content)

(also fyi, the markdown syntax for footnotes is like blah blah blah[^1] and then somewhere else in the file, on a newline, [^1]: content of the footnote)

This updates me a fair bit towards the view that we should keep entropy and cross-entropy separate.

The remaining crux for me is whether the info theory folks are already using "entropy" to mean "the minimum expected surprise you can achieve by choosing a code from this here space of preferred codes" (as per your "wiggle room" above), in which case my inclination is to instead cackle madly while racing my truck through their wiggle-room, b/c that's clearly a cross-entropy. At a glance through Wikipedia, it doesn't look like they're doing that, though, so I retreat.

But, hmm, I'm not sure I retreat all the way to having separate words. I'm persuaded that "the square of x" should absolutely not mean "x * y [where y is implied from context]", and I no longer think that "entropy" should mean the cross-entropy with Q implied from context. (Thanks for the distillation of why it's mad!)

But, like, if geometers instead invented a two-place operation rect, with a general form rect(x, y) := x * y, and declared that rect(x) was shorthand for rect(x, x), then I would not protest; this seems like a reasonable and unambiguous way to reduce the number of names floating around.[1]

And this seems to me like exactly what the information theorists (by contrast to the physicists) are doing with H (by contrast to S)! Like, the notations H(P) and H(P,Q) are just begging for us to pronounce H the same way in each case; we're not tempted to pronounce the former as "eych" and the latter as "cross-eych". And no ambiguity arises, so long as we use the rule that if Q is left out then we mean P.

Thus, I propose the following aggressive nomenclature:

* the "P-entropy of Q (wrt μ)" aka Hμ(P,Q) is the general form
* the "entropy of P (wrt μ)" aka Hμ(P) is a shorthand for "the P-ent
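In symbols, my reading of the proposal (a sketch; sign conventions chosen to match Jaynes-style invariant entropy):

```latex
H_\mu(P, Q) \;:=\; -\sum_x P(x)\,\log \frac{Q(x)}{\mu(x)} ,
\qquad
H_\mu(P) \;:=\; H_\mu(P, P) \;=\; -\sum_x P(x)\,\log \frac{P(x)}{\mu(x)} .
```

With μ uniform, these reduce (up to an additive constant) to the ordinary cross-entropy and entropy.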

From my perspective, the obvious rejoinder to "entropy is already two-place" is "insofar as entropy is two-place, cross-entropy is three-place!".

I think this is roughly where I'm at now.

After thinking a bit and peeking at Wikipedia, the situation seems to be:

The differential entropy of a probability density p is usually defined as

H(p) = -∫ p(x) log p(x) dx

This is unfortunate, because it isn't invariant under coordinate transformations on x. A more principled (e.g. invariant) thing to write down, courtesy of Jaynes, is

H(p) = -∫ p(x) log( p(x) / m(x) ) dx

where m(x) is ...

35mo

I wanted to let that comment be about the interesting question of how we unify these various things.

But on the ongoing topic of "why not call all this entropy, if it's all clearly part of the same pattern?":

When the definition of some F(x) refers to x twice, it's often useful to replace one of them with y and call that G(x, y). But it's usually not good for communication to choose a name for G(x, y) that (almost) everyone else uses exclusively for F(x), especially if you aren't going to mention both x and y every time you use it, and doubly especially if G is already popular enough to have lots of names of its own (you might hate those names, but get your own[1]).

e.g.: x*y is not "the square of x and y", much less "the square of x [and y is implied from context]", and the dot product v.w is not "the norm-squared of v and w", etc.

[1] Might I suggest "xentropy"?

My argument above is ofc tuned to case (2), and it's plausible to me that it pushes you off the fence towards "no wiggle room".

Yup, I think I am happy to abandon the wiggle room at this point, for this reason.

if the statespace is uncountably infinite then we need a measure in order to talk about entropy (and make everything work out nicely under change-of-variables). And so in the general case, entropy is already a two-place predicate involving a distribution and some sort of measure.

~~I think my preferred approach to this is that the density p(x) is ~~...

35mo

:D
(strong upvote for shifting position in realtime, even if by small degrees and
towards the opposite side of the fence from me :-p)
(I'm not actually familiar enough w/ statmech to know what measure we use on
phase-space, and I'm interested in links that explain what it is, and how it
falls out of QM, if you have any handy :D)
I don't currently have much sympathy for "entropy and differential entropy are
different" view, b/c I occasionally have use for non-uniform measures even in
the finite case.
Like, maybe I'm working with distributions over 10 states, and I have certain
constraints I'm trying to hit, and subject to those constraints I wanna maximize
entropy wrt the Binomial distribution as the measure. And you might be like
"Well, stop doing that, and start working with distributions on 2^10 states, and
convert all your constraints to refer to the bitcount of the new state (instead
of the value of the old state). Now you can meet your constraints while
maximizing entropy wrt the uniform measure like a normal person." To which I'd
reply (somewhat trollishly) "that rejoinder doesn't work in the limit, so it
must get weaker and weaker as I work with more and more states, and at numbers
as large as 10 this rejoinder is so weak as to not compel me."
From my perspective, the obvious rejoinder to "entropy is already two-place" is
"insofar as entropy is two-place, cross-entropy is three-place!". Which, ftr, I
might find compelling. It depends whether differential cross-entropy needs all
three parameters (P, Q, μ), or whether we can combine two of them (like by using
(P, μP/Q) or something). Or, at least, that's what my intuition says
off-the-cuff; I'm running on guesswork and a stale cache here :-p.
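[Editor's sketch: the "maximize entropy wrt the Binomial distribution as the measure" move can be illustrated numerically. The particular base measure Binomial(9, 1/2) on 10 states and the function name are my choices, not from the thread.]

```python
from math import comb, log2

# Base measure: Binomial(9, 1/2) on the 10 states {0, ..., 9}.
m = [comb(9, k) / 2**9 for k in range(10)]
uniform = [1 / 10] * 10

def relative_entropy(p, m):
    """Entropy of p relative to the measure m, i.e. -KL(p || m).
    It is maximized (at 0) by p == m, not by the uniform distribution."""
    return sum(pk * log2(mk / pk) for pk, mk in zip(p, m) if pk > 0)

print(relative_entropy(m, m))        # 0.0: the unconstrained maximizer is m itself
print(relative_entropy(uniform, m))  # negative: uniform is not max-entropy wrt m
```

With constraints added, the maximizer shifts away from m, but the uniform distribution plays no special role once the measure is non-uniform.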

45mo

I'd be interested in a citation of what you're referring to here!

I wonder if it would be reasonable to use "xentropy" for the broad sense of "entropy" in OP, with the understanding that xentropy is always a two-argument function.

"The length of a codeword is the xentropy between [the delta distribution located at] the state and [the coinflip distribution implied by] the code"

(One of my hypotheses for what you're saying is "when the distribution and the code are both clear from context, we can shorten 'cross-entropy' to 'entropy'. Which, ftr, seems reasonable to me.)

I want something *much* more demanding -- I want the distribution and code to be "the same" (related by p = 2^-len), or something "as close as possible" to that.

I was leaving a little bit of wiggle room to possibly include "a code matches a distribution if it is the optimal code of its type for compression under that source distribution", but this is only supposed ...

25mo

Cool cool. I can personally see the appeal of reserving 'entropy' for the case
where the distribution and the (natural lifting of the) code (to a distribution)
are identical, i.e. your proposal without the wiggle-room.
I don't yet personally see a boundary between the wiggle-room you're considering
and the full-on "we can say 'entropy' as a shorthand for 'cross-entropy' when
the second distribution is clear from context" proposal.
In particular, I currently suspect that there's enough wiggle-room in "optimal
code of its type for compression under the source distribution" to drive a truck
through. Like, if we start out with a uniform code C and a state s, why not say
that the "type of codes" for the source distribution δ(s) is the powerset of {c
∈ C | c codes for s}? In which case the "optimal code for compression" is the
set of all such c, and the 'entropy' is the Nate!complexity?
I'm not yet sure whether our different aesthetics here are due to:
1. me failing to see a natural boundary that you're pointing to
2. you not having yet seen how slippery the slope is
3. you having a higher tolerance for saying "humans sometimes just wanna put
fences halfway down the slippery slope, dude".
Insofar as you think I'm making the mistake of (1), I'm interested to hear
arguments. My argument above is ofc tuned to case (2), and it's plausible to me
that it pushes you off the fence towards "no wiggle room".
Another place we might aesthetically differ is that I'm much happier blurring
the line between entropy and cross-entropy.
One handwavy argument for blurring the line (which has the epistemic status:
regurgitating from a related cache that doesn't cleanly apply) is that if the
statespace is uncountably infinite then we need a measure in order to talk about
entropy (and make everything work out nicely under change-of-variables). And so
in the general case, entropy is already a two-place predicate involving
a distribution and some sort of measure. (...Although m

35mo

I wonder if it would be reasonable to use "xentropy" for the broad sense of
"entropy" in OP, with the understanding that xentropy is always a two-argument
function.
"The length of a codeword is the xentropy between [the delta distribution
located at] the state and [the coinflip distribution implied by] the code"

I still don't like that, because this whole subthread is kind of orthogonal to my concerns about the word "entropy".

This subthread is mostly about resolving the differences between a code (assignment of one or more codewords to one or more states) and a probability distribution. I think we've made progress on that and your latest comment is useful on that front.

But my concerns about "entropy" are of the form: "I notice that there's a whole field of coding theory where 'entropy' means a particular function of a probability distribution, rather than a functi...

25mo

Cool thanks. I'm hearing you as saying "I want to reserve 'entropy' for the case
where we're weighting the length-like thingies by probability-like thingies",
which seems reasonable to me.
I'm not sure I follow the part about matched (distribution, code) pairs. To
check my understanding: for a sufficiently forgiving notion of "matching", this
is basically going to yield the cross-entropy, right? Where, IIUC, we've lifted
the code to a distribution in some natural way (essentially using a uniform
distribution, though there might be a translation step like translating
prefix-free codes to sets of infinite bitstrings), and then once we have two
distributions we take the cross-entropy.
(One of my hypotheses for what you're saying is "when the distribution and the
code are both clear from context, we can shorten 'cross-entropy' to 'entropy'.
Which, ftr, seems reasonable to me.)
My own proclivities would say: if I specify only a state and a code, then the
state lifts to a distribution by Kronecker's delta and the code lifts to a
distribution uniformly, and I arbitrarily declare these to 'match', and so when
we speak of the (cross-)entropy of a state given a code we mean the length of
the code(s) for that particular state (combined by ləg-sum-əxp if there's
multiple).
This seems like the natural way to 'match' a state and a code, to my eye. But I
acknowledge that what counts as 'matching' is a matter of intuition and
convention, and that others' may differ from mine.
At this point, though, the outcome I'm most invested in is emerging with a short
name for "the ləg-sum-əxp of the lengths of all the descriptions". I'm fine with
naming it some variation on 'complexity', though. (Kolmogorov kindly left a K in
K-complexity, so there's ample room to pick another letter if we have to.)
(Though to be very explicit about my personal preferences, I'd use "entropy". It
seems to me that once we've conceded that we can talk about the entropy of a
(distribution, code) pair then we
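[Editor's sketch of the "combined by ləg-sum-əxp if there's multiple" rule above; the function name is mine, for illustration.]

```python
from math import log2

def combined_length(lengths):
    """Combine the lengths of all codewords for one state by ləg-sum-əxp:
    -log2 of the total fraction of codespace those codewords claim."""
    return -log2(sum(2**-l for l in lengths))

print(combined_length([3]))     # 3.0: one codeword contributes just its length
print(combined_length([3, 3]))  # 2.0: two length-3 codewords act like one length-2
```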

Sure, from one perspective what's going on here is that we're being given a distribution p and asked to come up with a distribution q such that

CrossEntropy(p, q) = E_{x~p}[-log q(x)]

is as small as possible. And then a bit of calculus shows that q=p is optimal, with a minimal value of

Entropy(p) = CrossEntropy(p, p)

If we're happy to call -log q "description length" right off the bat, we can let q be a distribution over the set of infinite bit strings, or the set of finite simple graphs, or over any (infinite) set we like.

But some settings are special, such as "q ha...
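[Editor's sketch: the "bit of calculus" claim that q = p is optimal can be checked numerically; the candidate q's below are chosen arbitrarily for illustration.]

```python
from math import log2

def cross_entropy(p, q):
    """CrossEntropy(p, q) = E_p[-log2 q]."""
    return sum(pi * -log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [2/3, 1/3]
# Try a few candidate q's; the minimum is at q = p, where the value
# is Entropy(p) = CrossEntropy(p, p), about 0.918 bits.
for q in ([1/2, 1/2], [2/3, 1/3], [3/4, 1/4]):
    print(q, cross_entropy(p, q))
```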

25mo

What's your take on using "description length" for the length of a single
description of a state, and "entropy" for the log-sum-exp of the
description-lengths of all names for the state? (Or, well, ləg-sum-əxp, if you
wanna avoid a buncha negations.)
I like it in part because the ləg-sum-əxp of all description-lengths seems to me
like a better concept than K-complexity anyway. (They'll often be similar, b/c
ləg-sum-əxp is kinda softminish and the gap between description-lengths is often
long, but when they differ it's the ləg-sum-əxp'd thing that you usually want.)
For example, Solomonoff induction does not have the highest share of
probability-mass on the lowest K-complexity hypothesis among those consistent
with the data. It has the highest share of probability-mass on the hypothesis
with lowest ləg-sum-əxp of all description-lengths among those consistent with
the data.
This can matter sometimes. For instance, in physics we can't always fix the
gauge. Which means that any particular full description of physics needs to
choose a full-fledged gauge, which spends an enormous amount of
description-length. But this doesn't count against physics, b/c for every
possible gauge we could describe, there's (same-length) ways of filling out the
rest of the program such that it gives the right predictions. In the version of
Solomonoff induction where hypotheses are deterministic programs, physics does
not correspond to a short program, it corresponds to an enormous number of long
programs. With the number so enormous that the ləg-sum-əxp of all those big
lengths is small.
More generally, this is related to the way that symmetry makes things simpler.
If your code has a symmetry in it, that doesn't make your program any shorter,
but it does make the function/hypothesis your program represents simpler, not in
terms of K-complexity but in terms of "entropy" (b/c, if S is the symmetry
group, then there's |S|-many programs of the ~same length that represent it,
which decreases
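[Editor's sketch of the symmetry effect: the lengths and group size below are made up for illustration. |S| programs of the same length lower the ləg-sum-əxp'd description-length by log2 |S|.]

```python
from math import log2

def effective_length(lengths):
    """ləg-sum-əxp of all description-lengths: -log2(sum of 2**-len)."""
    return -log2(sum(2**-l for l in lengths))

# One length-100 program, vs. a hypothesis with a symmetry group of size
# |S| = 2**20: 2**20 distinct programs, all of the same length 100.
print(effective_length([100]))          # 100.0
print(effective_length([100] * 2**20))  # 80.0, i.e. 100 - log2(2**20)
```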

I initially interpreted "abstract entropy" as meaning statistical entropy as opposed to thermodynamic or stat-mech or information-theoretic entropy. I think very few people encounter the phrase "algorithmic entropy" enough for it to be salient to them, so most confusion about entropy in different domains is about statistical entropy in physics and info theory. (Maybe this is different for LW readers!)

This was reinforced by the introduction because I took the mentions of file compression and assigning binary strings to states to be about (Shannon-style) cod...

I like "description length".

One wrinkle is that entropy isn't *quite* minimum average description length -- in general it's a lower bound on average description length.

If you have a probability distribution that's (2/3, 1/3) over two things, but you assign fixed binary strings to each of the two, then you can't do better than 1 bit of average description length, but the entropy of the distribution is 0.92 bits.

Or if your distribution is roughly (.1135, .1135, .7729) over three things, then you can't do better than 1.23 bits, but the entropy is 1 bit.

You can ...
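[Editor's sketch checking the two numeric examples above; function names are mine.]

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits."""
    return sum(-pi * log2(pi) for pi in p if pi > 0)

def avg_length(p, lengths):
    """Expected description length under p, for fixed integer codeword lengths."""
    return sum(pi * li for pi, li in zip(p, lengths))

p = [2/3, 1/3]
print(entropy(p), avg_length(p, [1, 1]))  # ~0.918 bits vs 1.0 bits

q = [0.1135, 0.1135, 0.7729]
print(entropy(q), avg_length(q, [2, 2, 1]))  # ~1.0 bits vs ~1.227 bits
```

In both cases the best integer-length assignment falls short of the entropy lower bound.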

25mo

I think you can bring the two notions into harmony by allowing multiple codes
per state (with the entropy/description-length of a state being the ləg of
the fraction of the codespace that codes for that state).
For instance, you can think of a prefix-free code as a particularly well-behaved
many-to-one assignment of infinite bitstrings to states, with (e.g.) the
prefix-free code "0" corresponding to every infinite bitstring that starts with
0 (which is half of all infinite bitstrings, under the uniform measure).
If we consider all many-to-one assignments of infinite bitstrings to states
(rather than just the special case of prefix-free codes) then there'll always be
an encoding that matches the entropy, without needing to say stuff like "well
our description-length can get closer to the theoretical lower-bound as we
imagine sending more and more blocks of independent data and taking the average
per-block length".
(If you want to keep the codespace finite, we can also see the entropy as the
limit of how well we can do as we allow the codespace to increase in size.)
(I suspect that I can also often (always?) match the entropy if you let me
design custom codespaces, where I can say stuff like "first we have a bit, and
then depending on whether it's 0 or 1, we follow it up by either a trit or a
quadit".)
(epistemic status: running off of a cache that doesn't apply cleanly, but it
smells right \shrug)
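[Editor's sketch of the many-to-one picture; the particular code is mine, for illustration. With multiple codewords per state, even the non-dyadic distribution (3/4, 1/4) gets an assignment whose expected description length equals the entropy exactly, no blocking argument needed.]

```python
from math import log2

# A many-to-one assignment viewed as carving up the space of infinite
# bitstrings: codeword w claims the 2**-len(w) fraction of strings that start
# with w. State A gets both "0" and "10", claiming 1/2 + 1/4 = 3/4 of codespace.
code = {"A": ["0", "10"], "B": ["11"]}

def description_length(state):
    """-log2 of the fraction of codespace that codes for `state`."""
    return -log2(sum(2**-len(w) for w in code[state]))

p = {"A": 3/4, "B": 1/4}  # a non-dyadic source distribution
entropy = sum(-pi * log2(pi) for pi in p.values())
avg_len = sum(p[s] * description_length(s) for s in p)
print(entropy, avg_len)  # both ~0.811: the assignment matches the entropy exactly
```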

K-complexity is apparently sometimes called "algorithmic entropy" (but not just "entropy", I don't think?)

Wiktionary quotes Niels Henrik Gregersen:

Algorithmic entropy is closely related to statistically defined entropy, the statistical entropy of an ensemble being, for any concisely describable ensemble, very nearly equal to the ensemble average of the algorithmic entropy of its members

I think this might be the crux!

Note the weird type mismatch: "the statistical entropy **of an ensemble** [...] the ensemble **average** of the algorithmic entropy **of its members**".

So...

25mo

Indeed I definitely do.
There are a bunch of places where I think I flagged relevant things, and I'm
curious if these seem like enough to you:
* The whole post is called "abstract entropy", which should tell you that it's
at least a little different from any "standard" form of entropy
* The third example, "It helps us understand strategies for (and limits on)
file compression", is implicitly about K-complexity
* This whole paragraph: "Many people reading this will have some previous facts
about entropy stored in their minds, and this can sometimes be disorienting
when it's not yet clear how those facts are consistent with what I'm
describing. You're welcome to skip ahead to the relevant parts and see if
they're re-orienting; otherwise, if you can get through the whole
explanation, I hope that it will eventually be addressed!"
* Me being clear that I'm not a domain expert
* Footnote [4], which talks about Turing machines and links to my post on
Solomonoff induction
* Me going on and on about binary strings and how we're associating these with
individual states -- I dunno, to me this just screams K-complexity to anyone
who's heard of it
* "I just defined entropy as a property of specific states, but in many
contexts you don't care at all about specific states..."
* ... "I'll talk about this in a future post; I think that "order" is
synonymous with Kolmogorov complexity." ...
I struggled with writing the intro section of this post because it felt like
there were half a dozen disclaimer-type things that I wanted to get out of the
way first. But each one is only relevant to a subset of people, and eventually I
need to get to the content. I'm not even expecting most readers to be holding
any such type-1/type-2 distinction in their mind to start, so I'd have to go out
of my way to explain it before giving the disclaimer.
All that aside, I am very open to the idea that we should be calling the
single-state thing something diff
