This is mostly correct, though I think there are phase changes making some more natural than others.
Haha, I did initially start with trying to be more explanatory, but that ended after a few sentences. Where I think this could immediately improve a lot of models is in replacing the VAEs everyone is using in diffusion models with information bottleneck autoencoders. In short: VAEs are viruses. In long: VAEs got popular because they work decently well, but they are not theoretically correct. Their paper gestures at a theoretical justification, but it settles for less than is optimal. They do work better than vanilla autoencoders, because they "splat out" encodings, which lets you interpolate between datapoints smoothly, and this is why everyone uses them today. If you ask most people using them, they will tell you it's "industry standard" and "the right way to do things, because it is industry standard." An information bottleneck autoencoder also ends up "splatting out" encodings, but has the correct theoretical backing. My expectation is that you will automatically get things like finer details and better instruction following ("the table is on the apple"), because bottleneck encoders have more pressure to conserve encoding bits for such details.
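To make the comparison concrete, here is a minimal sketch (my own illustration, not anyone's production code) of what an information-bottleneck-style autoencoder loss can look like: reconstruction plus $\beta$ times a variational upper bound on $I(X;Z)$, where the bound uses a *learned* marginal $r(z)$ instead of the fixed $\mathcal{N}(0, I)$ prior a VAE uses. The class and function names, layer sizes, and the Gaussian form of $r(z)$ are all assumptions for the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IBAutoencoder(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, 2 * z_dim))  # outputs mean and log-var of q(z|x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))
        # learned diagonal-Gaussian marginal r(z); a richer r(z) tightens the I(X;Z) bound
        self.r_mu = nn.Parameter(torch.zeros(z_dim))
        self.r_logvar = nn.Parameter(torch.zeros(z_dim))

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterized sample from q(z|x)
        return self.dec(z), mu, logvar

def ib_loss(model, x, beta):
    x_hat, mu, logvar = model(x)
    recon = F.mse_loss(x_hat, x)
    # E_x[ KL(q(z|x) || r(z)) ] upper-bounds I(X; Z); if r(z) were frozen to N(0, I),
    # this term would be exactly the (beta-)VAE penalty the text is contrasting with.
    r_var = model.r_logvar.exp()
    kl = 0.5 * ((logvar.exp() + (mu - model.r_mu) ** 2) / r_var
                + model.r_logvar - logvar - 1).sum(-1).mean()
    return recon + beta * kl
```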
There are probably a few other places this would be useful—for example, in LLM autoregression, you should try to minimize the mutual information between the embeddings and the previous tokens—but I have yet to do any experiments in other places. This is because estimating the mutual information is hard and makes training more fragile.
In terms of philosophy itself, well, I don't particularly care for the subject. Philosophers too often assign muddy meanings to words and wonder why they're confused ten propositions in. My goal when interacting with such sophistry is usually to define the words and figure out what that entails. I think philosophers just do not have the mathematical training to put into words what they mean, and even with that training it's hard to do and will often be wrong. For example, I do not think the information bottleneck is a proper definition of "ontology" but is closer to "describing an ontology". It does not say why something is the way it is, but it helps you figure out what it is. It's a way to find natural ontologies, but it does not say anything about how they came to be.
A couple of terms that seem relevant:
Holonomy—a closed loop which changes things that move around it. I think consciousness is generally built out of a bunch of holonomies which take sense data, change the information in the sense data while remaining unchanged themselves, and shuttle it off for more processing. In a sense, genes and memes are holonomies operating at a bigger scale, reproducing across groups of humans rather than merely groups of neurons.
Russell's vicious circle principle—self-reference invariably leads to logical contradictions. This means an awareness of self-referential consciousness (or phenomenological experience) cannot be a perfect awareness; you must be compressing some of the 'self' when you refer to yourself.
The tricky part when parsing John's post is understanding what he means by "insensitive functions." He doesn't define it anywhere, and I think it's because he was pointing at an idea but didn't yet have a good definition for it. However, the example he gives—conservation of energy—occurs because the laws of physics are insensitive to some kind of symmetry, in this particular case time-translation. I've been thinking a lot about the relationship between symmetries, physics, and information theory this past year or two, and you can see some of my progress here and here. To me, it felt kind of natural to jump to "insensitive functions" being a sort of stochastic symmetry in the data.
I haven't fleshed out exactly what that means. For exact symmetries, we can break up the data into a symmetry-invariant piece and the symmetry factor,
$$X \;\cong\; (X/G,\ g), \qquad g \in G.$$
However, it feels like in real data there is not such a clean separation. It's closer to something like this: we could write $X$ in "big-endian" form, so that we get finer and finer details about $X$ as we read off more bits. My guess is there is an "elbow" in the importance of the bits, similar to how Chebyshev series have an elbow in coefficient magnitude that chebfun identifies and chops off for quicker calculations:
(Source: Chopping a Chebyshev Series)
In fact, as I'm writing this, I just realized that your autoencoder model could just be a discrete cosine (Chebyshev series) transform. It won't be the best autoencoder to exist, but it is what JPEG uses. Anyway, I think the "arm"—or the bits to the left of the elbow—seems to form a natural ontology. The bits to the left of the elbow seem to be doing something to help describe $X$, but not the bits to the right.
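To make the JPEG point concrete, here is a minimal sketch (my own illustration, not anything from the original exchange) of a truncated DCT acting as a crude autoencoder: keep the coefficients to the left of a chop point, zero the rest, and invert. SciPy is assumed, and the test signal and chop point `k` are arbitrary.

```python
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256)
x = np.sin(2 * np.pi * 3 * t) + 0.3 * np.sin(2 * np.pi * 11 * t) \
    + 0.01 * rng.standard_normal(256)

c = dct(x, norm="ortho")        # "encode": coefficients ordered coarse -> fine
k = 32                          # chop point; in practice you would pick it at the elbow
c_chopped = c.copy()
c_chopped[k:] = 0.0             # throw away the bits to the right of the elbow
x_hat = idct(c_chopped, norm="ortho")   # "decode"

print("fraction of energy kept:", np.sum(c[:k] ** 2) / np.sum(c ** 2))
print("reconstruction RMSE:", np.sqrt(np.mean((x - x_hat) ** 2)))
```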
How does this relate to symmetries? Well, an exact symmetry is cleanly separable, which means its bits could be appended after every other bit—it's far to the right of the elbow. Chopping at the elbow does satisfy our idea of "ontology" in the exact-symmetry case. Then all we need to do is create a model that chops off those uninteresting bits. The parameter $\beta$ in the information bottleneck pretty much specifies a chopping point. The first term, $I(Z;Y)$ (with $Z$ the encoding of the data $X$ and $Y$ the property being predicted), says to keep important bits, while the second term, $\beta\, I(Z;X)$, says to cut out unimportant bits, and $\beta$ specifies at what point bits become too unimportant to leave in. You can slowly increase $\beta$ until things start catastrophically failing (e.g., validation loss blows up), at which point you've probably identified the elbow.
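A sketch of that sweep (my own; `train_and_eval` is a hypothetical helper that trains the bottleneck model at a given $\beta$ and returns its validation loss, and the failure threshold is an arbitrary choice):

```python
import numpy as np

def find_elbow(betas, train_and_eval, blowup_factor=1.5):
    """Increase beta until validation loss blows up; return the last beta that was safe."""
    baseline = train_and_eval(betas[0])
    last_safe = betas[0]
    for beta in betas[1:]:
        val_loss = train_and_eval(beta)
        if val_loss > blowup_factor * baseline:  # "catastrophic failure" criterion (assumption)
            break
        last_safe = beta
    return last_safe

# e.g. find_elbow(np.geomspace(1e-4, 10.0, num=20), train_and_eval=my_trainer)
```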
I am sure you are already aware of this, but the conserved quantities we see come from symmetries in the function space (Noether's theorem). The question I think you should be asking is, how do we extend this to random variables in information theory?
I am not sure of The Answer™, but I have an answer and I believe it is The Answer™: with the information bottleneck. Suppose we have a map $f : X \to Y$ from some bits in the world to some properties we care about. In my head, I'm using the example of $X =$ MNIST images and $Y =$ digit labels. If there is a symmetry in $X$ that $f$ is invariant to, this means we should lose no predictive information by transforming a sample $x$ according to that symmetry,
$$f(g \cdot x) = f(x) \qquad \text{for every symmetry transformation } g.$$
If we already know the symmetry, we can force our predictor to be invariant to it. This is most commonly seen in chemistry models, where people choose graph neural networks that are invariant to vertex and edge relabeling. However, usually it is hard to specify the exact symmetries (how do you code up that a "seven" is symmetric under stretching out its leg a little?), and the symmetries may also not be exact (a "seven" can bleed into a "one" by shortening its arm). The solution is to first run the data through an autoencoder model that automatically finds these exact symmetries, and the inexact ones up to whatever bit precision you care about.
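For the "already know the symmetry" case, here is a minimal sketch (my example, not from the original discussion) of baking the symmetry directly into the predictor: sum-pooling over per-node features makes the output invariant to vertex relabeling, which is the same idea the chemistry graph networks rely on.

```python
import torch
import torch.nn as nn

class PermutationInvariantPredictor(nn.Module):
    def __init__(self, node_dim=8, h_dim=64, out_dim=1):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(node_dim, h_dim), nn.ReLU())  # per-node features
        self.rho = nn.Linear(h_dim, out_dim)                             # applied after pooling

    def forward(self, nodes):                       # nodes: (num_nodes, node_dim)
        return self.rho(self.phi(nodes).sum(dim=0))  # sum-pool => node order cannot matter

nodes = torch.randn(5, 8)
perm = torch.randperm(5)
model = PermutationInvariantPredictor()
# relabeling the vertices leaves the prediction unchanged
assert torch.allclose(model(nodes), model(nodes[perm]), atol=1e-5)
```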
If we replace the exact symmetry quotient by the learned autoencoder encoding $Z = E(X)$, then requiring that the encoding lose no predictive information,
$$I(Z; Y) = I(X; Y),$$
means we want to maximize the mutual information between $Y$ (the property we care about, like "sevenness") and $Z$ (since $I(Z;Y) \le I(X;Y)$ always, by data processing). Also, since in the exact case $X$ decomposes as $(Z, g)$,
$$H(Z) = H(X) - H(g \mid Z),$$
meaning the more symmetries $E$ captures, the smaller its entropy should be. Since $X$ probably has some fixed entropy, this is equivalent to minimizing the mutual information between $Z$ and $X$. Together, we have a tradeoff between maximizing $I(Z;Y)$ and minimizing $I(Z;X)$, which is just the information bottleneck:
$$\max_{E}\; I(Z; Y) - \beta\, I(Z; X).$$
The larger the $\beta$, the more "stochastic symmetries" are eliminated, which means $Z$ gets closer to the essence of "sevenness" or whatever properties are in $Y$, but further from saying anything else about $X$. The fun thing you can do is make $X$ and $Y$ the same entity, and now you are getting the essence of $X$ unsupervised (e.g., with MNIST, though it does have reconstruction loss too).
Finally, a little evidence that seems to align with autoencoders being the solution comes from adversarial robustness. For a decade or so, it was believed that generalization and adversarial robustness are counter to one another. This seems a little ridiculous to me now, but then again I was not old enough to be aware of the problem before the myth was dispelled (this myth has been dispelled, right? People today know that generalization and robustness are essentially the same problem, right?). Anyway, the way everyone was training for adversarial robustness is that they took the training images, perturbed them as little as possible to get the model to mispredict (adversarial backprop), and then trained on these new images. This ended up making the generalization error worse ("Robustness May Be at Odds with Accuracy"). It turns out that if you first autoencode the images, or use a GAN to keep the perturbations on-manifold, then it generalizes better ("Disentangling Adversarial Robustness and Generalization"). Almost like it captured the ontology better.
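As a sketch of the difference (my own, assuming a pretrained classifier `clf` and autoencoder `ae`, both hypothetical here, and using a one-step FGSM perturbation as a stand-in for the minimal-perturbation attack): generate the usual adversarial example, then pass it back through the autoencoder so the training example stays approximately on-manifold.

```python
import torch
import torch.nn.functional as F

def on_manifold_adversarial(clf, ae, x, y, eps=0.03):
    """FGSM perturbation projected back onto the data manifold via an autoencoder."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(clf(x_adv), y)
    loss.backward()
    x_adv = (x_adv + eps * x_adv.grad.sign()).clamp(0, 1)  # plain off-manifold attack
    with torch.no_grad():
        return ae(x_adv)  # project back onto the (approximate) data manifold
```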
I would generalize my argument to: charisma is an adversarial game. If you are more charismatic, it does not mean you actually know what you are talking about, or are actually more skilled, but because you make people feel good about themselves, they will still choose you over someone who could actually help them out.
Part of this is an intelligence issue: if everyone had much more INT than CHA, people would easily notice and dismiss charismatic influences. In real life, though, high-INT and low-CHA groups will get invaded (arbitraged?) by higher-CHA individuals. On forums, the dynamic plays out where people with more charisma get more upvotes, even if they're actively making the discussion worse. So, we see dynamics like this:
There are about a dozen top-level comments similar to cata's, but despite getting 10x the upvotes of Ben's, they do not actually provide anything useful to the discussion. Basically, all of these comments say something like, "because you did not get the outcome I expected, you must have done the procedure wrong." They do not justify why their expected outcome is the right one, and if they had a modicum of respect for John's intelligence they would not believe he had done the procedure wrong. The only thing arguably useful about these comments is building a consensus around being fake-polite while saying, "John is stupid and wrong." If they just said that, with none of the charisma, their comment would probably be only as visible as it actually deserves to be.
> I think, when someone feels negatively toward a post, that choosing to translate that feeling as "I think this conclusion requires a more delicate analysis" reflects more epistemic humility and willingness to cooperate than does translating it as "your analysis sucks".
I think the epistemically humble thing to do is say, "this seems wrong because <...> though I notice I'm not super confident." Or, if you don't know why it seems wrong, just say, "this seems to go against my intuitions. There's probably a reason I (and most people) have these intuitions, so a priori I want to say you're wrong, but I can only vaguely point out that something seems wrong."
To be epistemically humble myself: your comment seems generally correct when you're talking as individuals, and I think the issue only really comes about when you have lots of people interacting, so that it's too costly for them all to analyze every comment they read/upvote/reply to. Also, although I implied intention when I said things like, "I think your version of 'having manners' is social deception to get people to like you and hate the person you're replying to," it seems more a function of the system you're interacting in (the system ends up promoting such things) than of the individuals (except on Reddit; there people are just karma farming). It isn't really fair to say this is intentional.
> As for the part where it makes you look good, the other person can look equally good simply by being equally polite.
This is just unfortunately not true. They cannot always respond politely while remaining honest. For example, a Mormon might find it impolite if, when they ask why you don't want to join their church, you say, "it's a cult." If you instead say, "it's just not my cup of tea—or more like, I enjoy my cups of tea too much," it would definitely be more polite, but they'll also always wonder why people are willing to give up on eternal happiness for a little caffeine. You might think someone else will surely come along and be a little more rude and honest and help them overcome their confusion, but I think most Mormons below the age of sixteen have invited a friend or two to church and also have no idea most people consider their church a cult.
Predictions:
The best meal I found was AGORTV. In general, ORV + optional P have good synergy, and must be included, and it's also good to sample a few desserts from AGT. These were the top twenty meals:
```
AGORTV
GOPRTV
GOPRV
GORTV
ABOPRV
AGORV
AOPRTV
AGOPRTV
GORV
AOPRV
ADGORTV
ABOPRTV
GMOPTV
GPRTV
ABPRTV
AGOPRV
AHORTV
BGOPTV
DGOPRTV
ADOPRTV
```
Method:
I trained 100 small models with a bit of weight regularization and averaged their predictions together. I think a better approach would be to train on diffs between meals, if anyone wants to try that out.
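A rough reconstruction of the method as described (my own sketch, not the actual code; the 0/1 ingredient encoding, network size, and regularization strength are all assumptions):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_ensemble(X, y, n_models=100):
    """X: (n_meals, n_ingredients) 0/1 matrix of letters present, y: observed meal scores."""
    models = []
    for seed in range(n_models):
        m = MLPRegressor(hidden_layer_sizes=(16,), alpha=1e-2,  # small net + weight decay
                         max_iter=2000, random_state=seed)
        models.append(m.fit(X, y))
    return models

def predict_ensemble(models, X):
    # average the 100 models' predictions, as in the method description
    return np.mean([m.predict(X) for m in models], axis=0)
```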
Okay, but why? I think being faux-polite is social deception because the purpose it serves usually isn't to take a more cooperative approach with the person you're arguing with, but to look nice to other people less invested in the argument who are reading through the comments. I've seen instances where people are genuinely trying to be nice, and I would agree that that is "having manners". I've just seen much more (esp. on LessWrong) of people sneering while pretending not to sneer, and when they do that to me it's pretty obvious what they're doing and I'm upset at the deception, but when they do it to others I notice it takes me longer to catch, and I'm sure the agree/upvote balance has been skewed by that.
I think a great example of this is many of the comments that reply to some of John Wentworth's more controversial opinions, like "My Empathy is Rarely Kind".
I think your version of "having manners" is social deception to get people to like you and hate the person you're replying to.
I don't have any more thoughts on this at present, and I probably won't think too much on it in the future, as it isn't super interesting to me.