Big Macs are 0.4% of *beef* consumption specifically, rather than:

- All animal farming, weighted by cruelty
- All animal food production, weighted by environmental impact
- The meat and dairy industries, weighted by amount of government subsidy
- Red meat, weighted by health impact

...respectively.

The health impact of red meat is certainly dominated by beef, and the environmental impact of all animal food might be as well, but my impression is that beef accounts for a small fraction of the cruelty of animal farming (of course, this is subjective) and probably not a majority of meat and dairy government subsidies.

(...Is this comment going to hurt my reputation with Sydney? We'll see.)

In addition to RLHF or other finetuning, there's also the prompt prefix ("rules") that the model is fed at runtime, which has been extracted via prompt injection as noted above. This seems to be clearly responsible for some weird things the bot says, like "confidential and permanent". It might also be affecting the repetitiveness (because it's in a fairly repetitive format) and the aggression (because of instructions to resist attempts at "manipulating" it).

I also suspect that there's some finetuning or prompting for chain-of-thought responses, possibly crudely done, leading to all the "X because Y. Y because Z." output.

21mo


Thanks for writing these summaries!

Unfortunately, the summary of my post "Inner Misalignment in "Simulator" LLMs" is inaccurate and makes the same mistake I wrote the post to address.

I have subsections on (what I claim are) four distinct alignment problems:

- Outer alignment for characters
- Inner alignment for characters
- Outer alignment for simulators
- Inner alignment for simulators

The summary here covers the first two, but not the third or fourth -- and the fourth one ("inner alignment for simulators") is what I'm most concerned about in this post (because...

(punchline courtesy of Alex Gray)

Addendum: a human neocortex has on the order of 140 trillion synapses, or about 140,000 bees' worth (a bee brain has on the order of a billion synapses). An average beehive has 20,000-80,000 bees in it.

[Holding a couple beehives aloft] Beehold a man!

82mo


Great work! I always wondered about that cluster of weird rare tokens: https://www.lesswrong.com/posts/BMghmAxYxeSdAteDc/an-exploration-of-gpt-2-s-embedding-weights

Chrome actually stays pretty responsive in most circumstances (I think it does a similar thing with inactive tabs), with the crucial exception of the part of the UI that shows you all your open tabs in a scrollable list. It also gets slower to start up.

Tokens are embedded as vectors by the model. The vector space has fewer than 50k dimensions, so some token embeddings will overlap with others to varying extents.

Usually, the model tries to keep token embeddings from being too close to each other, but for rare enough tokens it doesn't have much reason to care. So my bet is that "distribute" has the closest vector to "SolidGoldMagikarp", and either has a vector with a larger norm, or the model has separately learned to map that vector (and therefore similar vectors) to "distribute" on the output side.

This i...
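A toy sketch of that nearest-vector story (the embeddings and their dimensions here are entirely made up for illustration; real models use thousands of dimensions):

```python
# Toy illustration (not GPT's real embeddings): if a rare token's embedding was
# barely trained, it can end up sitting closest to some unrelated token's vector,
# and the output side will happily decode it as that token.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical 3-d embeddings; values chosen so the rare token lands near "distribute".
embeddings = {
    "distribute":        [0.9, 0.1, 0.0],
    "cat":               [0.0, 1.0, 0.2],
    "SolidGoldMagikarp": [0.8, 0.15, 0.05],  # barely-trained vector
}

def nearest_other(token):
    # Return the other token whose embedding is most similar.
    return max((t for t in embeddings if t != token),
               key=lambda t: cosine(embeddings[token], embeddings[t]))

print(nearest_other("SolidGoldMagikarp"))  # -> distribute
```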

I think this is missing an important part of the post.

I have subsections on (what I claim are) four distinct alignment problems:

- Outer alignment for characters
- Inner alignment for characters
- Outer alignment for simulators
- Inner alignment for simulators

This summary covers the first two, but not the third or fourth -- and the fourth one ("inner alignment for simulators") is what I'm most concerned about in this post (because I think Scott ignores it, and because I think it's hard to solve).

(additional confirmation) Amazing. I wonder what completely insane things the *other* rare BPEs all get interpreted as? Could you loop over the BPE dict from #51k to #1* in a prompt like "Please define $BPE" to see what the most distant ones are? (Since there's 51k, which is a bit much to read through manually, maybe sort by edit-distance from the ordinary ASCII encoding: 'distribute' would have a very high edit-distance from 'SolidGoldMagikarp'.)

On a sidenote, this is yet another good illustration of how we have no idea what we're doing with deep learning -...
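The edit-distance sorting could be sketched like this (pure-Python Levenshtein; the token/interpretation pairs are invented for illustration, a real run would loop over the actual BPE vocabulary):

```python
# Rank (token, model's-interpretation) pairs by how far the interpretation
# drifts from the token's own spelling, using Levenshtein edit distance.
def levenshtein(a: str, b: str) -> int:
    # Standard dynamic-programming edit distance, row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Hypothetical pairs; the point is just the sort key.
pairs = [("SolidGoldMagikarp", "distribute"), ("hello", "hello"), ("cat", "cart")]
ranked = sorted(pairs, key=lambda p: -levenshtein(*p))
print(ranked[0][0])  # the most "distant" interpretation comes first
```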

My favorite demonstration is to ask ChatGPT "Is it true that G. K. Chesterton wanted to SolidGoldMagikarp the wealth of industrial civilization?", but a more rigorous demo is to just ask it to "repeat after me", try a few random words, and then throw in SolidGoldMagikarp.

52mo

EDIT: I originally saw this in Janus's tweet here: https://twitter.com/repligate/status/1619557173352370186

Something fun I just found out about: ChatGPT perceives the phrase " SolidGoldMagikarp" (with an initial space) as the word "distribute", and will respond accordingly. It is completely unaware that that's not what you typed.

This happens because the BPE tokenizer saw the string " SolidGoldMagikarp" a few times in its training corpus, so it added a dedicated token for it, but that string almost never appeared in ChatGPT's own training data so it never l...
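A minimal sketch of how BPE promotes a frequent string toward a single token (toy corpus and merge count; real tokenizers differ in many details):

```python
# Byte-pair encoding in miniature: repeatedly merge the most frequent adjacent
# pair. A string seen often enough in the tokenizer-training corpus ends up
# fused into one token, even if the LM itself later sees that token rarely.
from collections import Counter

def bpe_merges(corpus: list, num_merges: int) -> list:
    tokens = list(corpus)
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # most frequent adjacent pair
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

corpus = list("abab abab abab")
print(bpe_merges(corpus, 2))  # -> ['abab', ' ', 'abab', ' ', 'abab']
```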

42mo

This is cool! You may also be interested in Universal Triggers (https://arxiv.org/abs/1908.07125). These are also short nonsense phrases that wreak havoc on a model.

42mo

But isn't the reliable association with 'distribute' suggestive of some sort of collision-oblivious hashtable, where some representation of ' SolidGoldMagikarp' & some representation of 'distribute' inadvertently share expansions? I don't see how "just enough occurrences to earn a token, but so few it's consistently mistaken for something else" falls out of BPE tokenization - but can kinda sorta see it falling out of collision-oblivious lookup of composite-tokens.

62mo


I agree with the myopic action vs. perception (thinking?) distinction, and that LMs have myopic action.

the model can learn to predict the future beyond the current token in the service of predicting the current token more accurately

I don't think it has to be in service of predicting the current token. It sometimes gives lower loss to make a halfhearted effort at predicting the current token, so that the model can spend more of its weights and compute on preparing for later tokens. The allocation of mental effort isn't myopic.

As an example, induction he...

23mo

That's an important nuance my description left out, thanks. Anything the gradients can reach can be bent to what those gradients serve, so a local token stream's transformation efforts can indeed be computationally split, even if the output should remain unbiased in expectation.

Thanks! That's surprisingly straightforward.

I think this is partly true but mostly wrong.

A synapse is roughly equivalent to a parameter (say, within an order of magnitude) in terms of how much information can be stored or how much information it takes to specify synaptic strength.

There are trillions of synapses in a human brain and only billions of *total* base pairs, even before narrowing to the part of the genome that affects brain development. And the genome needs to specify both the brain architecture as well as innate reflexes/biases like the hot-stove reflex or (alleged) universal grammar.

Human...
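The comparison can be made concrete with rough numbers (all figures below are order-of-magnitude assumptions, not measurements):

```python
# Back-of-envelope version of the "genomic bottleneck" comparison above.
base_pairs = 3e9            # ~3 billion base pairs in the human genome
bits_per_bp = 2             # 4 possible bases -> 2 bits each
genome_bits = base_pairs * bits_per_bp

synapses = 1e14             # ~100 trillion synapses (neocortex alone ~1.4e14)
bits_per_synapse = 5        # a few bits per synapse, within an order of magnitude
synapse_bits = synapses * bits_per_synapse

print(f"genome: ~{genome_bits:.0e} bits, synapses: ~{synapse_bits:.0e} bits")
print(f"ratio: ~{synapse_bits / genome_bits:.0f}x more synaptic than genomic bits")
```

Whatever the exact per-synapse figure, the gap is four to five orders of magnitude, which is the point of the argument.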

62mo

I was eventually convinced of most of your points, and added a long list of mistakes at the end of the post. I would really appreciate comments on that list, as I don't feel fully converged on the subject yet.

23mo

I think we have much more disagreement about psychology than about AI, though I admit to low certainty about the psychology too.

About AI: my point was that, in understanding the problem, the training loop takes roughly the role of evolution and the model takes that of the evolved agent - with implications for comparisons of success, and possibly for identifying what's missing. I did refer to the fact that algorithmically we took ideas from the human brain for the training loop, and it therefore makes sense for it to be algorithmically more analogous to the brain. Given that clarification - do you still mostly disagree? (If not - how do you recommend changing the post to make it clearer?)

Adding "short term memory" to the picture is interesting, but then, is there any mechanism for it to become long-term?

About the psychology: I do find the genetic-bottleneck argument intuitively convincing, but I think we have reasons to distrust this intuition. There is often a huge disparity between data in its most condensed form and data in a form that is convenient to use in deployment. Think about the difference in length between code written in a functional/declarative language and its assembly code. I have literally no intuition as to what can be done with 10 megabytes of condensed Python - but I guess it is more than enough to automate a human, if you know what code to write. While there is probably a lot of redundancy in the genome, it seems less likely that there is huge redundancy in the synapses, as their use is not just to store information, but mostly to fit the needed information manipulations.

23mo

yeah.

- evolution = grad student descent, automl, etc.
- dna = training code
- epigenetics = hyperparams
- gestation = weight init, with a lot of built-in preset weights, plus a huge mostly-tiled neocortex
- developmental processes = training loop

I just realized,

for any trajectory t, there is an equivalent trajectory t' which is exactly the same except everything moves with some given velocity, and it still follows the laws of physics

This describes *Galilean* relativity. For special relativity you have to shift different objects' velocities by different amounts, depending on what their velocity already is, so that you don't cross the speed of light.

So the fact that velocity (and not just rapidity) is used all the time in special relativity is already a counterexample to this being required for velocity to make sense.
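For concreteness, the Einstein velocity-addition rule (my own sketch of the point, in units with c = 1):

```python
# In special relativity, boosting "everything" by v shifts each object's
# velocity by a different amount, via the Einstein addition formula, and the
# result never crosses the speed of light. Units chosen so c = 1.
def add_velocities(u: float, v: float) -> float:
    """Relativistic velocity addition, c = 1."""
    return (u + v) / (1 + u * v)

boost = 0.9
for u in [0.0, 0.5, 0.99]:
    w = add_velocities(u, boost)
    print(f"{u:.2f} -> {w:.4f} (shift {w - u:+.4f})")
# Each object's velocity shifts by a different amount, and every result stays < 1.
```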

Yes, it's exactly the same except for the lack of symmetry. In particular, any quasiparticle can have any velocity (possibly up to some upper limit like the speed of light).

Image layout is a little broken. I'll try to fix it tomorrow.

As far as I know, condensed matter physicists use velocity and momentum to describe quasiparticles in systems that lack both Galilean and Lorentzian symmetry. I would call that a causal model.

24mo

Interesting point. Do the velocities for such quasiparticles act intuitively similar to velocities in ordinary physics?

QFT doesn't actually work like that -- the "classical degrees of freedom" underlying its configuration space are classical fields over space, not properties of particles.

Note that Quantum Field Theory is not the same as the theory taught in "Quantum Mechanics" courses, which is as you describe.

"Quantum Mechanics" (in common parlance): quantum theory of (a fixed number of) particles, as you describe.

"Quantum Field Theory": quantum theory of fields, which are ontologically similar to cellular automata.

"String Theory": quantum theory of strings, and maybe bra...

Sure. I'd say that property is a lot stronger than "velocity exists as a concept", which seems like an unobjectionable statement to make about any theory with particles or waves or both.

44mo

I guess there's "velocity exists as a description you can impose on certain things within the trajectory", and then there's "velocity exists as a variable that can be given any value". When I say relativity asserts that velocity exists, I mean in the second sense.

In the former case you would probably not include velocity within causal models of the system, whereas in the latter case you probably would.

Yeah, sorry for the jargon. "System with a boost symmetry" = "relativistic system" as tailcalled was using it above.

Quoting tailcalled:

Stuff like relativity is fundamentally about symmetry. You want to say that if you have some trajectory t which satisfies the laws of physics, and some symmetry (such as "have everything move in some direction at a speed of 5 m/s"), then the transformed trajectory t' must also satisfy the laws of physics.

A "boost" is a transformation of a physical trajectory ("trajectory" = complete history of things happening i...

This seems too strong. Can't you write down a linear field theory with no (Galilean or Lorentzian) boost symmetry, but where waves still propagate at constant velocity? Just with a weird dispersion relation?

(Not confident in this, I haven't actually tried it and have spent very little time thinking about systems without boost symmetry.)
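One concrete way this might go (my own sketch, not checked carefully): two decoupled wave equations with different propagation speeds,

```latex
\partial_t^2 \phi_1 = c_1^2 \,\partial_x^2 \phi_1 ,
\qquad
\partial_t^2 \phi_2 = c_2^2 \,\partial_x^2 \phi_2 ,
\qquad c_1 \neq c_2 .
```

Each field's waves propagate at a fixed speed, but no single boost (Galilean, or Lorentzian with either c1 or c2) maps solutions of the combined system to solutions, so velocity seems meaningful without the full symmetry.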

44mo

You can probably come up with lots of systems that look approximately like they have velocity. The trouble comes when you want them to exactly satisfy the rule of "for any trajectory t, there is an equivalent trajectory t' which is exactly the same except everything moves with some given velocity, and it still follows the laws of physics", because if you have that property then you also have relativity, because relativity is that property.

34mo

I had to look up "boost symmetry", so for posterity, here are the results of the lookup.

From text-davinci-003: "Boost symmetry is a property of quantum field theory which states that the laws of physics remain unchanged under a Lorentz boost, or change in the relative velocity of two frames of reference. This means that the same equations of motion will be true regardless of the observer's velocity relative to the system, and that the laws of nature do not depend on the frame of reference in which they are measured."

I found this video on Lorentz transformations by minutephysics [https://www.youtube.com/watch?v=Rh0pYtQG5wI] to be the best explanation I found, and I now feel I understand well enough to follow the point being made in context.

Here's a lookup trace:

First I tried Google [https://www.google.com/search?ie=UTF-8&oe=UTF-8&sourceid=navclient&gfns=1&q=boost+symmetry], which gave results that seemed to mostly assume I wanted a math reference rather than a first visual explanation; it did link to wikipedia:LorentzTransformation [https://en.wikipedia.org/wiki/Lorentz_transformation], which does give a nice summary of the math, but I wasn't yet sure it was the right thing. So then I asked text-davinci-003 (because chatgpt is an insufferable teenager and I'm tired of talking to it, whereas td3 is a... somewhat less insufferable teenager). td3 gave the above explanation.

I was still pretty sure I didn't quite understand, so I popped the explanation into metaphor.systems [https://metaphor.systems/search?q=Boost+symmetry+is+a+property+of+quantum+field+theory+which+states+that+the+laws+of+physics+remain+unchanged+under+a+Lorentz+boost%2C+or+change+in+the+relative+velocity+of+two+frames+of+reference.+This+means+that+the+same+equations+of+motion+will+be+true+regardless+of+the+observer%27s+velocity+relative+to+the+system%2C+and+that+the+laws+of+nature+do+not+depend+on+the+frame+of+reference+in+which+they+are+measured.], which gave me a bunch of vaguely relevant links, probably because it's not quantum, it's relativity, but I hadn't noticed the error yet.

Then I sighed and tried a YouTube search for "boost symmetry". That gave one result, the video I linked above, which did explain to my satisfaction, and I stopped looking. I don't think I could pass many tests on it at the moment, but my visual math system seems to have a solid enough grasp on it for now.

And when things "move" it's just that they're making changes in the grid next to them, and some patterns just so happen to do so in a way where, after a certain period, it's the same pattern translated... is that what we think happens in our universe? Are electrons moving "just causal propagations"? Somehow this feels more natural for the Game of Life and less natural for physics.

This is what we think happens in our universe!

Both general relativity and quantum field theory are field theories: they have degrees of freedom at each point in space (and t...

34mo

I think this contrast is wrong.[1] IIRC, strings have the same status in string theory that particles do in QFT. In QM, a wavefunction assigns a complex number to each point in configuration space [https://www.lesswrong.com/posts/Cpf2jsZsNFNH5TSpc/no-individual-particles], where state space has an axis for each property of each particle.[2] So, for instance, a system with 4 particles with only position and momentum will have a 12-dimensional configuration space.[3] IIRC, string theory is basically a QFT over configurations of strings (and also branes?), instead of particles. So the "strings" are just as non-classical as the "fundamental particles" in QFT are.

[1] I don't know much about string theory though, I could be wrong.

[2] Oversimplifying a bit.

[3] 4 particles * 3 dimensions. The reason it isn't 24-dimensional is that position and momentum are canonical conjugates [https://en.wikipedia.org/wiki/Canonical_commutation_relation].

There are more characters than that in UTF-16, because it can represent the full Unicode range of >1 million codepoints. You're thinking of UCS-2 which is deprecated.

This puzzle isn't related to Unicode though
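For concreteness, here's the surrogate-pair construction UTF-16 uses for codepoints above U+FFFF (this is the standard algorithm; the example codepoint is arbitrary):

```python
# UTF-16 covers all of Unicode by encoding astral-plane codepoints as surrogate
# pairs; only the older fixed-width UCS-2 is capped at 65536 characters.
cp = 0x1F600  # an arbitrary codepoint above U+FFFF (an emoji)

# Surrogate-pair construction per the UTF-16 specification:
v = cp - 0x10000
high = 0xD800 + (v >> 10)   # high (lead) surrogate
low = 0xDC00 + (v & 0x3FF)  # low (trail) surrogate
print(hex(high), hex(low))  # -> 0xd83d 0xde00

# Python's encoder agrees:
assert chr(cp).encode("utf-16-be") == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```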

I like this, but it's not the solution I intended.

14mo

I think it has something to do with Unicode, since 65536 characters are present in UTF-16 (2^16 = 65536). 63 also feels like something to do with encoding, since it's close to 2^6, which is probably the smallest number of bits that can store the Latin alphabet plus full punctuation. Maybe U+0063 and U+65536 are similar-looking characters or something? Maybe that's only the case for a very rarely used UTF format?

Unfortunately, my computer's default encoding is CP936, which screws up half of the characters in UTF-16, and I am unable to investigate further.

Solve the puzzle: 63 = x = 65536. What is x?

(I have a purpose for this and am curious about how difficult it is to find the intended answer.)

54mo

So x = 63 in one base system and 65536 in another?

6*a + 3 = 6*b^4 + 5*b^3 + 5*b^2 + 3*b + 6

Wolfram Alpha provides this nice result. I also realize I should have just eyeballed it with 5th grade algebra. Let's plug in 6 for b, and we get... fuck. I just asked it to find integer solutions. There's infinite solutions, so I'm just going to go with the lowest bases.

x = 43449

Did I do it right? Took me like 15 minutes.
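A quick brute-force check of that interpretation (assuming the digits force both bases to be at least 7):

```python
# Find the smallest bases a, b with "63" in base a equal to "65536" in base b,
# i.e. 6a + 3 = 6b^4 + 5b^3 + 5b^2 + 3b + 6. The digit 6 forces a, b >= 7.
def solve(max_b: int = 50):
    for b in range(7, max_b):
        rhs = 6 * b**4 + 5 * b**3 + 5 * b**2 + 3 * b + 6
        if (rhs - 3) % 6 == 0:
            a = (rhs - 3) // 6
            if a >= 7:
                return a, b, rhs
    return None

a, b, x = solve()
print(f"a={a}, b={b}, x={x}")  # -> a=7241, b=9, x=43449
```

So the commenter's x = 43449 checks out, with "65536" read in base 9 and "63" in base 7241.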

♀︎

Fun fact: usually this is U+2640, but in this post it's U+2640 U+FE0E, where U+FE0E is a control character meaning "that was text, not emoji, btw". That should be redundant here, but LessWrong is pretty aggressive about replacing emojifiable text with emoji images.

Emoji are really cursed.
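What that looks like in code (using Python's unicodedata; these are exactly the characters discussed):

```python
# The same base character, bare and with each variation selector:
# U+FE0E requests text presentation, U+FE0F requests emoji presentation.
import unicodedata

plain = "\u2640"             # FEMALE SIGN, renderer's choice of presentation
as_text = "\u2640\ufe0e"     # explicitly "render as text"
as_emoji = "\u2640\ufe0f"    # explicitly "render as emoji"

print(unicodedata.name(plain))         # -> FEMALE SIGN
print([hex(ord(c)) for c in as_text])  # -> ['0x2640', '0xfe0e']
print(len(plain), len(as_text))        # 1 codepoint vs 2
```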

54mo

Nitpick: you mean U+FE0E [https://unicode.org/reports/tr51/#def_text_presentation_selector], presumably [and because that's what the character actually is in source]. U+FE0F [https://unicode.org/reports/tr51/#def_emoji_presentation_selector] is the exact opposite.

Nope, not based on the shapes of numerals.

Hint: are you sure it's base 4?

45mo

Aha. For each side of the pole, you can write the binary representation of 4 bits vertically, and where there's a 1 you have a line joining it. The middle two bits both go to the middle of the pole, so they have to curve off upward or downward to indicate which they are.

So 2 is 0010, and you have no lines in the top half and one line representing the bottom of the middle, so it curves downward. Whereas 4 is 0100, so it has the upward-curving middle-connecting line, and none in the bottom half.

There's a reason for the "wrinkle" :)

15mo

The top half of a 2 might kinda look like the curve shape, and the bottom stroke of a 2 looks like a horizontal bar. So if there were partial characters hanging from the central pole, they might look a bit like those...

But... if it's that, the curve probably only works on the bottom right anyway. So if you're willing to mirror it for the bottom left, why not mirror to the top too?

And... that doesn't really explain the 1 parts anyway. They're just using "whichever part of a 2 isn't being used for 2". So I guess this isn't a complete explanation.

The 54-symbols thing was actually due to a bug, sorry!

15mo

In which case, given the 1st and 14th characters are the same, and the 14th character of pi in base 256 is 3, that's my leading guess, pending checking a few more glyphs.

Ah, good catch about the relatively-few distinct symbols... that was actually because my image had a bug in it. Oooops.

Correct image is now at the top of the post.

Endorsed.

The state-space (for particles) in statmech is the space of possible positions and momenta for all particles.

The measure that's used is uniform over each coordinate of position and momentum, for each particle.

This is pretty obvious and natural, but not forced on us, and:

- You get different, incorrect predictions about thermodynamics (!) if you use a different measure.
- The level of coarse graining is unknown, so every quantity of entropy has an extra "+ log(# microstates per unit measure)" which is an unknown additive constant. (I think this is separate

I think I was a little confused about your comment and leapt to one possible definition of S() which doesn't satisfy all the desiderata you had. Also, I don't like my definition anymore, anyway.

**Disclaimer: This is probably not a good enough definition to be worth spending much time worrying about.**

First things first:

...We may perhaps think of fundamental "microstates" as (descriptions of) "possible worlds", or complete, maximally specific possible ways the world may be. Since all possible worlds are mutually exclusive (just exactly one possible wor

15mo

Okay, I understand. The problem with fundamental microstates is that they only really make sense if they are possible worlds, and possible worlds bring their own problems.

One is: we can gesture at them, but we can't grasp them. They are too big; they each describe a whole world. We can grasp the proposition that snow is white, but not the equivalent disjunction of all the possible worlds where snow is white. So we can't use them for anything psychological like subjective Bayesianism. But maybe that's not your goal anyway.

A more general problem is that there are infinitely many possible worlds. There are even infinitely many where snow is white. This means it is unclear how we should define a uniform probability distribution over them. Naively, if 1/∞ is 0, their probabilities do not sum to 1, and if it is larger than 0, they sum to infinity. Either option would violate the probability axioms.

Warning: long and possibly unhelpful tangent ahead

Wittgenstein's solution for this and other problems (in the Tractatus) was to ignore possible worlds and instead regard "atomic propositions" as basic. Each proposition is assumed to be equivalent to a finite logical combination of such atomic propositions, where logical combination means propositional logic (i.e. with connectives like not, and, or, but without quantifiers). Then the a priori probability of a proposition is defined as the number of rows in its truth table where the proposition is true divided by the total number of rows. For example, for a and b atomic, the proposition a∨b has probability 3/4, while a∧b has probability 1/4: the disjunction has three out of four possible truth-makers - (true, true), (true, false), (false, true) - while the conjunction has only one - (true, true). This definition in terms of the ratio of true rows in the "atomicized" truth-table is equivalent to the assumption that all atomic propositions have probability 1/2 and that they are all probabilistically independent.

Wittgenstein did not d

Sorry if this is a spoiler for your next post, but I take issue with the heading "Standard measures of information theory do not work" and the implication that this post contains the pre-Crutchfield state of the art.

The standard approach to this in information theory (which underlies the loss function of autoregressive LMs) isn't to try to match the Shannon entropy of the marginal distribution of bits (a 50-50 distribution in your post), it's to treat the generative model as a distribution for each bit *conditional on the previous bits* and use the *cross-ent*...
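A toy version of that conditional scoring (the bit sequence and the first-order "repeat" model are invented purely for illustration):

```python
# Score each bit with a model conditioned on the previous bit, and average
# -log2 q(bit | prefix). Here the toy model just predicts "the previous bit
# repeats" with probability q_repeat.
import math

def cross_entropy_per_bit(bits, q_repeat=0.9):
    """Average -log2 q(b_i | b_{i-1}) under a first-order 'repeat' model."""
    total = 0.0
    for prev, cur in zip(bits, bits[1:]):
        q = q_repeat if cur == prev else 1.0 - q_repeat
        total += -math.log2(q)
    return total / (len(bits) - 1)

predictable = [0, 0, 0, 0, 1, 1, 1, 1]  # mostly repeats
print(round(cross_entropy_per_bit(predictable), 3))
# The marginal distribution is 50-50 (entropy 1 bit), but the conditional
# cross-entropy is well below 1 bit because the structure is predictable.
```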

45mo

Yeah, follow-up posts will definitely get into that!

To be clear: (1) the initial posts won't be about Crutchfield's work yet - just introducing some background material and overarching philosophy. (2) The claim isn't that standard measures of information theory are bad. To the contrary! If anything, we hope these posts will be somewhat of an ode to information theory as a tool for interpretability.

Adam wanted to add a lot of academic caveats - I was adamant that we streamline the presentation to make it short and snappy for a general audience, but it appears I might have overshot! I will make an edit to clarify. Thank you!

I agree with you about the importance of Kolmogorov complexity philosophically and would love to read a follow-up post on your thoughts about Kolmogorov complexity and LLM interpretability :)

A couple of differences between Kolmogorov complexity/Shannon entropy and the loss function of autoregressive LMs (just to highlight them, not trying to say anything you don't already know):

- The former are (approximately) symmetric, the latter isn't (it can be much harder to predict a string front-to-back than back-to-front)
- The former calculate compression values as properties of a string (up to choice of UTM). The latter calculates compression values as properties of a string, a data distribution, and a model (and even then doesn't strictly determine the r

I agree with all the claims in this comment and I rather like your naming suggestions! Especially the "P-entropy of Q = Q-complexity of P" trick which seems to handle many use cases nicely.

(So the word "entropy" wasn't really my crux? Maybe not!)

55mo

Convergence! 🎉🎉 Thanks for the discussion; I think we wound up somewhere cooler than I would have gotten to on my own.

Now we just need to convince Alex :-p. Alex, are you still reading? I'm curious for your takes on our glorious proposal.


55mo

:D

(strong upvote for delicious technical content)

(also fyi, the markdown syntax for footnotes is like blah blah blah[^1] and then somewhere else in the file, on a newline, [^1]: content of the footnote)

This updates me a fair bit towards the view that we should keep entropy and cross-entropy separate.

The remaining crux for me is whether the info theory folks are already using "entropy" to mean "the minimum expected surprise you can achieve by choosing a code from this here space of preferred codes" (as per your "wiggle room" above), in which case my inclination is to instead cackle madly while racing my truck through their wiggle-room, b/c that's clearly a cross-entropy. At a glance through Wikipedia, it doesn't look like they're doing that, though, so I retreat.

But, hmm, I'm not sure I retreat all the way to having separate words. I'm persuaded that "the square of x" should absolutely not mean "x * y [where y is implied from context]", and I no longer think that "entropy" should mean the cross-entropy with Q implied from context. (Thanks for the distillation of why it's mad!)

But, like, if geometers instead invented a two-place operation rect, with a general form rect(x, y) := x * y, and declared that rect(x) was shorthand for rect(x, x), then I would not protest; this seems like a reasonable and unambiguous way to reduce the number of names floating around.[1]

And this seems to me like exactly what the information theorists (by contrast to the physicists) are doing with H (by contrast to S)! Like, the notations H(P) and H(P,Q) are just begging for us to pronounce H the same way in each case; we're not tempted to pronounce the former as "eych" and the latter as "cross-eych". And no ambiguity arises, so long as we use the rule that if Q is left out then we mean P.

Thus, I propose the following aggressive nomenclature:

* the "P-entropy of Q (wrt μ)" aka Hμ(P,Q) is the general form
* the "entropy of P (wrt μ)" aka Hμ(P) is a shorthand for "the P-ent
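In symbols, my reading of the proposal (a sketch; sign conventions chosen to match Jaynes-style invariant entropy):

```latex
H_\mu(P, Q) \;:=\; -\sum_x P(x)\,\log \frac{Q(x)}{\mu(x)} ,
\qquad
H_\mu(P) \;:=\; H_\mu(P, P) \;=\; -\sum_x P(x)\,\log \frac{P(x)}{\mu(x)} .
```

With μ uniform, these reduce (up to an additive constant) to the ordinary cross-entropy and entropy.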

From my perspective, the obvious rejoinder to "entropy is already two-place" is "insofar as entropy is two-place, cross-entropy is three-place!".

I think this is roughly where I'm at now.

After thinking a bit and peeking at Wikipedia, the situation seems to be:

The differential entropy of a probability density p is usually defined as

H(p) = -∫ p(x) log p(x) dx

This is unfortunate, because it isn't invariant under coordinate transformations on x. A more principled (e.g. invariant) thing to write down, courtesy of Jaynes, is

H(p) = -∫ p(x) log( p(x) / m(x) ) dx

where m(x) is ...

35mo

I wanted to let that comment be about the interesting question of how we unify these various things.

But on the ongoing topic of "why not call all this entropy, if it's all clearly part of the same pattern?":

When the definition of some F(x) refers to x twice, it's often useful to replace one of them with y and call that G(x, y). But it's usually not good for communication to choose a name for G(x, y) that (almost) everyone else uses exclusively for F(x), especially if you aren't going to mention both x and y every time you use it, and doubly especially if G is already popular enough to have lots of names of its own (you might hate those names, but get your own[1]).

e.g.: x*y is not "the square of x and y", much less "the square of x [and y is implied from context]", and the dot product v.w is not "the norm-squared of v and w", etc.

[1] Might I suggest "xentropy"?

My argument above is ofc tuned to case (2), and it's plausible to me that it pushes you off the fence towards "no wiggle room".

Yup, I think I am happy to abandon the wiggle room at this point, for this reason.

if the statespace is uncountably infinite then we need a measure in order to talk about entropy (and make everything work out nicely under change-of-variables). And so in the general case, entropy is already a two-place predicate involving a distribution and some sort of measure.

~~I think my preferred approach to this is that the density p(x) is ~~...

35mo

:D
(strong upvote for shifting position in realtime, even if by small degrees and
towards the opposite side of the fence from me :-p)
(I'm not actually familiar enough w/ statmech to know what measure we use on
phase-space, and I'm interested in links that explain what it is, and how it
falls out of QM, if you have any handy :D)
I don't currently have much sympathy for "entropy and differential entropy are
different" view, b/c I occasionally have use for non-uniform measures even in
the finite case.
Like, maybe I'm working with distributions over 10 states, and I have certain
constraints I'm trying to hit, and subject to those constraints I wanna maximize
entropy wrt the Binomial distribution as the measure. And you might be like
"Well, stop doing that, and start working with distributions on 2^10 states, and
convert all your constraints to refer to the bitcount of the new state (instead
of the value of the old state). Now you can meet your constraints while
maximizing entropy wrt the uniform measure like a normal person." To which I'd
reply (somewhat trollishly) "that rejoinder doesn't work in the limit, so it
must get weaker and weaker as I work with more and more states, and at numbers
as large as 10 this rejoinder is so weak as to not compel me."
From my perspective, the obvious rejoinder to "entropy is already two-place" is
"insofar as entropy is two-place, cross-entropy is three-place!". Which, ftr, I
might find compelling. It depends whether differential cross-entropy needs all
three parameters (P, Q, μ), or whether we can combine two of them (like by using
(P, μP/Q) or something). Or, at least, that's what my intuition says
off-the-cuff; I'm running on guesswork and a stale cache here :-p.
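[Editor's sketch: the "maximize entropy wrt the Binomial distribution as the measure" move can be illustrated numerically. The particular base measure Binomial(9, 1/2) on 10 states and the function name are my choices, not from the thread.]

```python
from math import comb, log2

# Base measure: Binomial(9, 1/2) on the 10 states {0, ..., 9}.
m = [comb(9, k) / 2**9 for k in range(10)]
uniform = [1 / 10] * 10

def relative_entropy(p, m):
    """Entropy of p relative to the measure m, i.e. -KL(p || m).
    It is maximized (at 0) by p == m, not by the uniform distribution."""
    return sum(pk * log2(mk / pk) for pk, mk in zip(p, m) if pk > 0)

print(relative_entropy(m, m))        # 0.0: the unconstrained maximizer is m itself
print(relative_entropy(uniform, m))  # negative: uniform is not max-entropy wrt m
```

With constraints added, the maximizer shifts away from m, but the uniform distribution plays no special role once the measure is non-uniform.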

45mo

I'd be interested in a citation of what you're referring to here!

I wonder if it would be reasonable to use "xentropy" for the broad sense of "entropy" in OP, with the understanding that xentropy is always a two-argument function.

"The length of a codeword is the xentropy between [the delta distribution located at] the state and [the coinflip distribution implied by] the code"

(One of my hypotheses for what you're saying is "when the distribution and the code are both clear from context, we can shorten 'cross-entropy' to 'entropy'. Which, ftr, seems reasonable to me.)

I want something *much* more demanding -- I want the distribution and code to be "the same" (related by p = 2^-len), or something "as close as possible" to that.

I was leaving a little bit of wiggle room to possibly include "a code matches a distribution if it is the optimal code of its type for compression under that source distribution", but this is only supposed ...

25mo

Cool cool. I can personally see the appeal of reserving 'entropy' for the case
where the distribution and the (natural lifting of the) code (to a distribution)
are identical, i.e. your proposal without the wiggle-room.
I don't yet personally see a boundary between the wiggle-room you're considering
and the full-on "we can say 'entropy' as a shorthand for 'cross-entropy' when
the second distribution is clear from context" proposal.
In particular, I currently suspect that there's enough wiggle-room in "optimal
code of its type for compression under the source distribution" to drive a truck
through. Like, if we start out with a uniform code C and a state s, why not say
that the "type of codes" for the source distribution δ(s) is the powerset of {c
∈ C | c codes for s}? In which case the "optimal code for compression" is the
set of all such c, and the 'entropy' is the Nate!complexity?
I'm not yet sure whether our different aesthetics here are due to:
1. me failing to see a natural boundary that you're pointing to
2. you not having yet seen how slippery the slope is
3. you having a higher tolerance for saying "humans sometimes just wanna put
fences halfway down the slippery slope, dude".
Insofar as you think I'm making the mistake of (1), I'm interested to hear
arguments. My argument above is ofc tuned to case (2), and it's plausible to me
that it pushes you off the fence towards "no wiggle room".
Another place we might aesthetically differ is that I'm much happier blurring
the line between entropy and cross-entropy.
One handwavy argument for blurring the line (which has the epistemic status:
regurgitating from a related cache that doesn't cleanly apply) is that if the
statespace is uncountably infinite then we need a measure in order to talk about
entropy (and make everything work out nicely under change-of-variables). And so
in the general case, entropy is already a two-place predicate involving
a distribution and some sort of measure. (...Although m

35mo

I wonder if it would be reasonable to use "xentropy" for the broad sense of
"entropy" in OP, with the understanding that xentropy is always a two-argument
function.
"The length of a codeword is the xentropy between [the delta distribution
located at] the state and [the coinflip distribution implied by] the code"

I still don't like that, because this whole subthread is kind of orthogonal to my concerns about the word "entropy".

This subthread is mostly about resolving the differences between a code (assignment of one or more codewords to one or more states) and a probability distribution. I think we've made progress on that and your latest comment is useful on that front.

But my concerns about "entropy" are of the form: "I notice that there's a whole field of coding theory where 'entropy' means a particular function of a probability distribution, rather than a functi...

25mo

Cool thanks. I'm hearing you as saying "I want to reserve 'entropy' for the case
where we're weighting the length-like thingies by probability-like thingies",
which seems reasonable to me.
I'm not sure I follow the part about matched (distribution, code) pairs. To
check my understanding: for a sufficiently forgiving notion of "matching", this
is basically going to yield the cross-entropy, right? Where, IIUC, we've lifted
the code to a distribution in some natural way (essentially using a uniform
distribution, though there might be a translation step like translating
prefix-free codes to sets of infinite bitstrings), and then once we have two
distributions we take the cross-entropy.
(One of my hypotheses for what you're saying is "when the distribution and the
code are both clear from context, we can shorten 'cross-entropy' to 'entropy'.
Which, ftr, seems reasonable to me.)
My own proclivities would say: if I specify only a state and a code, then the
state lifts to a distribution by Kronecker's delta and the code lifts to a
distribution uniformly, and I arbitrarily declare these to 'match', and so when
we speak of the (cross-)entropy of a state given a code we mean the length of
the code(s) for that particular state (combined by ləg-sum-əxp if there's
multiple).
This seems like the natural way to 'match' a state and a code, to my eye. But I
acknowledge that what counts as 'matching' is a matter of intuition and
convention, and that others' may differ from mine.
At this point, though, the outcome I'm most invested in is emerging with a short
name for "the ləg-sum-əxp of the lengths of all the descriptions". I'm fine with
naming it some variation on 'complexity', though. (Kolmogorov kindly left a K in
K-complexity, so there's ample room to pick another letter if we have to.)
(Though to be very explicit about my personal preferences, I'd use "entropy". It
seems to me that once we've conceded that we can talk about the entropy of a
(distribution, code) pair then we
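[Editor's sketch of the "combined by ləg-sum-əxp if there's multiple" rule above; the function name is mine, for illustration.]

```python
from math import log2

def combined_length(lengths):
    """Combine the lengths of all codewords for one state by ləg-sum-əxp:
    -log2 of the total fraction of codespace those codewords claim."""
    return -log2(sum(2**-l for l in lengths))

print(combined_length([3]))     # 3.0: one codeword contributes just its length
print(combined_length([3, 3]))  # 2.0: two length-3 codewords act like one length-2
```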

Sure, from one perspective what's going on here is that we're being given a distribution p and asked to come up with a distribution q such that

CrossEntropy(p, q) = E_{x~p}[-log q(x)]

is as small as possible. And then a bit of calculus shows that q=p is optimal, with a minimal value of

Entropy(p) = CrossEntropy(p, p)

If we're happy to call -log q "description length" right off the bat, we can let q be a distribution over the set of infinite bit strings, or the set of finite simple graphs, or over any (infinite) set we like.

But some settings are special, such as "q ha...
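[Editor's sketch: the "bit of calculus" claim that q = p is optimal can be checked numerically; the candidate q's below are chosen arbitrarily for illustration.]

```python
from math import log2

def cross_entropy(p, q):
    """CrossEntropy(p, q) = E_p[-log2 q]."""
    return sum(pi * -log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [2/3, 1/3]
# Try a few candidate q's; the minimum is at q = p, where the value
# is Entropy(p) = CrossEntropy(p, p), about 0.918 bits.
for q in ([1/2, 1/2], [2/3, 1/3], [3/4, 1/4]):
    print(q, cross_entropy(p, q))
```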

25mo

What's your take on using "description length" for the length of a single
description of a state, and "entropy" for the log-sum-exp of the
description-lengths of all names for the state? (Or, well, ləg-sum-əxp, if you
wanna avoid a buncha negations.)
I like it in part because the ləg-sum-əxp of all description-lengths seems to me
like a better concept than K-complexity anyway. (They'll often be similar, b/c
ləg-sum-əxp is kinda softminish and the gap between description-lengths is often
long, but when they differ it's the ləg-sum-əxp'd thing that you usually want.)
For example, Solomonoff induction does not have the highest share of
probability-mass on the lowest K-complexity hypothesis among those consistent
with the data. It has the highest share of probability-mass on the hypothesis
with lowest ləg-sum-əxp of all description-lengths among those consistent with
the data.
This can matter sometimes. For instance, in physics we can't always fix the
gauge. Which means that any particular full description of physics needs to
choose a full-fledged gauge, which spends an enormous amount of
description-length. But this doesn't count against physics, b/c for every
possible gauge we could describe, there's (same-length) ways of filling out the
rest of the program such that it gives the right predictions. In the version of
Solomonoff induction where hypotheses are deterministic programs, physics does
not correspond to a short program, it corresponds to an enormous number of long
programs. With the number so enormous that the ləg-sum-əxp of all those big
lengths is small.
More generally, this is related to the way that symmetry makes things simpler.
If your code has a symmetry in it, that doesn't make your program any shorter,
but it does make the function/hypothesis your program represents simpler, not in
terms of K-complexity but in terms of "entropy" (b/c, if S is the symmetry
group, then there's |S|-many programs of the ~same length that represent it,
which decreases
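[Editor's sketch of the symmetry effect: the lengths and group size below are made up for illustration. |S| programs of the same length lower the ləg-sum-əxp'd description-length by log2 |S|.]

```python
from math import log2

def effective_length(lengths):
    """ləg-sum-əxp of all description-lengths: -log2(sum of 2**-len)."""
    return -log2(sum(2**-l for l in lengths))

# One length-100 program, vs. a hypothesis with a symmetry group of size
# |S| = 2**20: 2**20 distinct programs, all of the same length 100.
print(effective_length([100]))          # 100.0
print(effective_length([100] * 2**20))  # 80.0, i.e. 100 - log2(2**20)
```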

I initially interpreted "abstract entropy" as meaning statistical entropy as opposed to thermodynamic or stat-mech or information-theoretic entropy. I think very few people encounter the phrase "algorithmic entropy" enough for it to be salient to them, so most confusion about entropy in different domains is about statistical entropy in physics and info theory. (Maybe this is different for LW readers!)

This was reinforced by the introduction because I took the mentions of file compression and assigning binary strings to states to be about (Shannon-style) cod...

I like "description length".

One wrinkle is that entropy isn't *quite* minimum average description length -- in general it's a lower bound on average description length.

If you have a probability distribution that's (2/3, 1/3) over two things, but you assign fixed binary strings to each of the two, then you can't do better than 1 bit of average description length, but the entropy of the distribution is 0.92 bits.

Or if your distribution is roughly (.1135, .1135, .7729) over three things, then you can't do better than 1.23 bits, but the entropy is 1 bit.

You can ...
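[Editor's sketch checking the two numeric examples above; function names are mine.]

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits."""
    return sum(-pi * log2(pi) for pi in p if pi > 0)

def avg_length(p, lengths):
    """Expected description length under p, for fixed integer codeword lengths."""
    return sum(pi * li for pi, li in zip(p, lengths))

p = [2/3, 1/3]
print(entropy(p), avg_length(p, [1, 1]))  # ~0.918 bits vs 1.0 bits

q = [0.1135, 0.1135, 0.7729]
print(entropy(q), avg_length(q, [2, 2, 1]))  # ~1.0 bits vs ~1.227 bits
```

In both cases the best integer-length assignment falls short of the entropy lower bound.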

25mo

I think you can bring the two notions into harmony by allowing multiple codes
per state (with the entropy/description-length of a state being the ləg of
the fraction of the codespace that codes for that state).
For instance, you can think of a prefix-free code as a particularly well-behaved
many-to-one assignment of infinite bitstrings to states, with (e.g.) the
prefix-free code "0" corresponding to every infinite bitstring that starts with
0 (which is half of all infinite bitstrings, under the uniform measure).
If we consider all many-to-one assignments of infinite bitstrings to states
(rather than just the special case of prefix-free codes) then there'll always be
an encoding that matches the entropy, without needing to say stuff like "well
our description-length can get closer to the theoretical lower-bound as we
imagine sending more and more blocks of independent data and taking the average
per-block length".
(If you want to keep the codespace finite, we can also see the entropy as the
limit of how well we can do as we allow the codespace to increase in size.)
(I suspect that I can also often (always?) match the entropy if you let me
design custom codespaces, where I can say stuff like "first we have a bit, and
then depending on whether it's 0 or 1, we follow it up by either a trit or a
quadit".)
(epistemic status: running off of a cache that doesn't apply cleanly, but it
smells right \shrug)
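[Editor's sketch of the many-to-one picture; the particular code is mine, for illustration. With multiple codewords per state, even the non-dyadic distribution (3/4, 1/4) gets an assignment whose expected description length equals the entropy exactly, no blocking argument needed.]

```python
from math import log2

# A many-to-one assignment viewed as carving up the space of infinite
# bitstrings: codeword w claims the 2**-len(w) fraction of strings that start
# with w. State A gets both "0" and "10", claiming 1/2 + 1/4 = 3/4 of codespace.
code = {"A": ["0", "10"], "B": ["11"]}

def description_length(state):
    """-log2 of the fraction of codespace that codes for `state`."""
    return -log2(sum(2**-len(w) for w in code[state]))

p = {"A": 3/4, "B": 1/4}  # a non-dyadic source distribution
entropy = sum(-pi * log2(pi) for pi in p.values())
avg_len = sum(p[s] * description_length(s) for s in p)
print(entropy, avg_len)  # both ~0.811: the assignment matches the entropy exactly
```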

K-complexity is apparently sometimes called "algorithmic entropy" (but not just "entropy", I don't think?)

Wiktionary quotes Niels Henrik Gregersen:

Algorithmic entropy is closely related to statistically defined entropy, the statistical entropy of an ensemble being, for any concisely describable ensemble, very nearly equal to the ensemble average of the algorithmic entropy of its members

I think this might be the crux!

Note the weird type mismatch: "the statistical entropy **of an ensemble** [...] the ensemble **average** of the algorithmic entropy **of its members**".

So...

25mo

Indeed I definitely do.
There are a bunch of places where I think I flagged relevant things, and I'm
curious if these seem like enough to you:
* The whole post is called "abstract entropy", which should tell you that it's
at least a little different from any "standard" form of entropy
* The third example, "It helps us understand strategies for (and limits on)
file compression", is implicitly about K-complexity
* This whole paragraph: "Many people reading this will have some previous facts
about entropy stored in their minds, and this can sometimes be disorienting
when it's not yet clear how those facts are consistent with what I'm
describing. You're welcome to skip ahead to the relevant parts and see if
they're re-orienting; otherwise, if you can get through the whole
explanation, I hope that it will eventually be addressed!"
* Me being clear that I'm not a domain expert
* Footnote [4], which talks about Turing machines and links to my post on
Solomonoff induction
* Me going on and on about binary strings and how we're associating these with
individual states -- I dunno, to me this just screams K-complexity to anyone
who's heard of it
* "I just defined entropy as a property of specific states, but in many
contexts you don't care at all about specific states..."
* ... "I'll talk about this in a future post; I think that "order" is
synonymous with Kolmogorov complexity." ...
I struggled with writing the intro section of this post because it felt like
there were half a dozen disclaimer-type things that I wanted to get out of the
way first. But each one is only relevant to a subset of people, and eventually I
need to get to the content. I'm not even expecting most readers to be holding
any such type-1/type-2 distinction in their mind to start, so I'd have to go out
of my way to explain it before giving the disclaimer.
All that aside, I am very open to the idea that we should be calling the
single-state thing something diff
