I think your main counterpoint to what I said is that people are doing an optimization process where they look at the data while simultaneously searching for a better theory. In fact, you cannot even disentangle their brains from the reality that created and runs them, so even the best attempt at theory-first, observation-second is doomed to fail.
I think the second, stronger sentence is mostly wrong. You do not need a universe closely similar to ours to get reasoning similar to ours; you just need one that can produce similar reasoning and has an incentive to. That incentive can be as little as, "I wonder what physics looks like in 3+1 dimensions?", just like our physicists wonder what it looks like with more or fewer dimensions, with different fundamental constants, with different laws of motion, with positive spacetime curvature, and so on. Or, we can just shove a bunch of data from our universe into theirs and reward them for figuring it out (i.e. training LLMs).
As for the first, weaker sentence, yes, this is true. Pretty much everyone has tight feedback loops, probably because the search space is too large to first categorize its entirety and then match against the single branch you end up observing. I think the role of observation here is closer to moving attention to certain areas of the search space, rather than moving the search tree forward (see Richard Ngo's shortform on chess). The thing is, this process is unnecessary for simple things. You probably learned to solve TicTacToe by playing a bunch of games, but you could have just solved it. I think the concept of trees is relatively simple, though of course if you want a refined concept like their protein composition or DNA sequence, yeah, that space is too big and you probably have to just go out and observe it.
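As a concrete illustration of "just solving it" (a minimal sketch, not a claim about how anyone actually learns): plain minimax over the full TicTacToe game tree, with no play and no data, establishes that perfect play is a draw.

```python
# Plain minimax over the full TicTacToe game tree: no play, no learning,
# just exhaustive reasoning about every reachable position.
from functools import lru_cache

LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def value(board, player):
    """Game value under perfect play: +1 if X wins, 0 draw, -1 if O wins."""
    w = winner(board)
    if w:
        return 1 if w == "X" else -1
    moves = [i for i, cell in enumerate(board) if cell == " "]
    if not moves:
        return 0
    nxt = "O" if player == "X" else "X"
    children = [value(board[:i] + player + board[i+1:], nxt) for i in moves]
    return max(children) if player == "X" else min(children)

print(value(" " * 9, "X"))  # 0: perfect play from the empty board is a draw
```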
I don't really understand your point about unsupervised learning. With unsupervised learning, you can just run a bunch of data through your model until it learns something. That's the observation -> theory pipeline and it's astoundingly inefficient and bad at generalization. Humans could do the same with 100x fewer examples, which is the gap models need to clear to solve ARC-AGI. Humans are probably doing something closer to theory -> observation.
Presumably prepend or append "is wrong."
I think you're missing something here. Science, and more specifically physics, is built on first theorizing or philosophizing, coming up with a lot of potential worlds a priori, and only looking to see which one you probably fall in after the philosophizing is done. How do you know a tree exists? Well, I bet a good philosopher from a different universe could come up with the concept of trees only knowing we live in 3+1 dimensions:
Looking isn't needed to know trees really exist, just to know that tree over there really exists. That involves a bit of circular reasoning, basically because you cannot prove a system consistent from within the system. The best you can do is check that the current statement isn't inconsistent, such as when someone says, "how do you know that tree over there really exists," while pointing at empty air, and you say, "it doesn't."
I think my issue with empiricism is that it does not generalize, at all. A better way to compress sense data is to first come up with several theories, then use a few bits to point out which theory is correct and how it is being applied.
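To put the "few bits" claim in minimum-description-length terms (my framing, not necessarily yours): with a library of $K$ candidate theories, a two-part code for sense data $x$ costs roughly

$$ L(x) \approx \log_2 K + L(x \mid T_k), $$

where the first term points at the chosen theory $T_k$ (plus a few more bits to say how it is applied), and the second is the residual data under that theory. This beats encoding the raw observations whenever some $T_k$ captures most of their structure.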
The mental frame I've found myself getting into when interacting with conflict practitioners is treating them not as defecting against me, but against everyone. Even if I agree with someone's position, if they are playing adversarial games I will give a pretty scathing reply or employ the dark arts even harder to make them look stupid. Why? Well, in adversarial games, there is nothing to gain from honesty, so the only reason they are using words at all is to manipulate people. The only way to have an honest discussion on the topic is to get them to stop playing adversarial games around it. If I guess they're just too stupid to realize what they're doing (as in, they sound like they're just repeating memes instead of carefully picking their words), I think it's a better idea to break them out of the adversarial game by forcing them to actually think (by being very precise in pointing out the issue with their rhetoric). But if they know what they're doing? Better to take them out of the game.
My guess is that most people in America's political climate are conflict practitioners. I do not know if this was the case 50 years ago, but I suspect it was not, and that it is an issue inherent to democracy: infecting the populace with adversarial memes is a great way to get elected, so over time (and especially with the internet) the populace becomes more and more disease-ridden. If everyone were smarter, they would be more immune to banal manipulation, but I think the memes would also just be more adversarially selected.
Why do some people who lack the ability to write this well assume that no one else has that ability, and thus that it must have been written by AI?
I don't have any more thoughts on this at present, and I probably won't think too much on it in the future, as it isn't super interesting to me.
This is mostly correct, though I think there are phase changes making some more natural than others.
Haha, I did initially start out trying to be more explanatory, but that ended after a few sentences. Where I think this could immediately improve a lot of models is by replacing the VAEs everyone is using in diffusion models with information bottleneck autoencoders. In short: VAEs are viruses. In long: VAEs got popular because they work decently well, but they are not theoretically correct. Their paper gestures at a theoretical justification, but it settles for less than is optimal. They do work better than vanilla autoencoders, because they "splat out" encodings, which lets you interpolate between datapoints smoothly, and this is why everyone uses them today. If you ask most people using them, they will tell you it's "industry standard" and "the right way to do things, because it is industry standard." An information bottleneck autoencoder also ends up "splatting out" encodings, but has the correct theoretical backing. My expectation is that you will automatically get things like finer details and better instruction following ("the table is on the apple"), because bottleneck encoders have more pressure to conserve encoding bits for such details.
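To make that concrete, here is a minimal sketch of an autoencoder trained against an explicit rate penalty on I(X; Z). The particular estimator (a variational upper bound via a learned Gaussian marginal), the layer sizes, and beta are all assumptions for illustration, not a recipe.

```python
# Sketch of an autoencoder with an explicit rate penalty on I(X; Z), using the
# variational upper bound I(X;Z) <= E_x[ KL(q(z|x) || m(z)) ] with a learned
# diagonal-Gaussian marginal m(z). Estimator choice and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=32, h=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h), nn.ReLU(), nn.Linear(h, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, h), nn.ReLU(), nn.Linear(h, x_dim))
        # Learned marginal m(z): trainable mean and log-variance per dimension.
        self.m_mu = nn.Parameter(torch.zeros(z_dim))
        self.m_logvar = nn.Parameter(torch.zeros(z_dim))

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterized sample
        return self.dec(z), mu, logvar

    def rate(self, mu, logvar):
        # KL( N(mu, sigma^2) || N(m_mu, m_sigma^2) ), an upper bound on I(X;Z).
        var, m_var = logvar.exp(), self.m_logvar.exp()
        kl = 0.5 * (self.m_logvar - logvar + (var + (mu - self.m_mu) ** 2) / m_var - 1)
        return kl.sum(-1).mean()

def loss(model, x, beta=0.1):
    x_hat, mu, logvar = model(x)
    return F.mse_loss(x_hat, x) + beta * model.rate(mu, logvar)
```

Note that with a fixed standard-normal marginal and beta = 1, the rate term reduces to the usual VAE KL term; the difference in this sketch is that the rate is treated as a tunable bound on the mutual information rather than taken as given by the ELBO.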
There are probably a few other places this would be useful—for example, in LLM autoregression, you should try to minimize the mutual information between the embeddings and the previous tokens—but I have yet to do any experiments in other places. This is because estimating the mutual information is hard and makes training more fragile.
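On the "estimating mutual information is hard" point, for concreteness, here is one common neural estimator (InfoNCE) as a sketch; the estimator choice is mine. It only gives a lower bound that saturates at log(batch size), and for actually minimizing mutual information you would want an upper bound instead, which is harder still.

```python
# InfoNCE lower bound: I(X; Y) >= log(N) - cross_entropy(scores, diagonal).
# The bound saturates at log(N), one concrete reason MI estimation is hard.
import math
import torch
import torch.nn.functional as F

def infonce_mi_lower_bound(x_emb: torch.Tensor, y_emb: torch.Tensor, temperature: float = 0.1):
    """x_emb, y_emb: (N, d) paired embeddings sampled from the joint distribution."""
    x = F.normalize(x_emb, dim=-1)
    y = F.normalize(y_emb, dim=-1)
    scores = x @ y.t() / temperature               # (N, N) similarity matrix
    labels = torch.arange(x.size(0), device=x.device)
    loss = F.cross_entropy(scores, labels)         # -E[log p(correct pairing)]
    return math.log(x.size(0)) - loss
```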
In terms of philosophy proper, well, I don't particularly care for the subject. Philosophers too often assign muddy meanings to words and wonder why they're confused ten propositions in. My goal when interacting with such sophistry is usually to define the words and figure out what that entails. I think philosophers just do not have the mathematical training to put into words what they mean, and even with that training it's hard to do and will often be wrong. For example, I do not think the information bottleneck is a proper definition of "ontology"; it is closer to "describing an ontology". It does not say why something is the way it is, but it helps you figure out what it is. It's a way to find natural ontologies, but it does not say anything about how they came to be.
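For reference, the objective I mean is the standard information bottleneck (Tishby et al.'s formulation), assuming that is also what's in question here:

$$ \min_{p(z \mid x)} \; I(X; Z) - \beta \, I(Z; Y), $$

i.e. compress $X$ into a representation $Z$ while keeping whatever is relevant to $Y$. Read as an ontology-finder: $Z$ is the ontology you end up with, and the objective characterizes it without saying anything about why $Y$ turned out to be the relevant variable.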
A couple of terms that seem relevant:
Holonomy—the transformation something picks up from being carried around a closed loop, while the loop itself stays fixed. I think consciousness is generally built out of a bunch of holonomies which take sense data, change the information in the sense data while remaining unchanged themselves, and shuttle it off for more processing. In a sense, genes and memes are holonomies operating at a bigger scale, reproducing across groups of humans rather than merely groups of neurons.
Russell's vicious circle principle—nothing may be defined in terms of a totality that contains it; unrestricted self-reference invites contradiction. This means an awareness of self-referential consciousness (or phenomenological experience) cannot be a perfect awareness; you must be compressing some of the 'self' when you refer to yourself.
Fewer rows might not give interpretable/rules-based solutions an advantage. I tried training on only the first 100 or 20 rows, and I got
`CDEFMW (15.66)` and `EMOPSV (15.34)` as the predicted best meals. Admittedly `CDEFMW` shows up in the first 100 rows scoring 18 points, but not `EMOPSV`. Maybe a human with 20 rows could do better by coming up with a lot of hypothetical rules, but it seems tough to beat the black box.
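For reference, the kind of black-box fit I mean looks roughly like the sketch below. Everything specific in it is an assumption for illustration: the file name, the one-hot ingredient columns, the 6-ingredient meal size, and the choice of model.

```python
# Rough sketch of the black-box approach on a small row budget.
# All specifics (file name, column layout, meal size, model) are assumptions.
from itertools import combinations

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

df = pd.read_csv("meals.csv")                      # hypothetical file/layout
ingredients = [c for c in df.columns if c != "score"]

train = df.head(100)                               # only the first 100 rows
model = GradientBoostingRegressor().fit(train[ingredients], train["score"])

# Score every 6-ingredient combination and keep the top few.
candidates = pd.DataFrame(
    [{ing: int(ing in combo) for ing in ingredients}
     for combo in combinations(ingredients, 6)]
)
candidates["pred"] = model.predict(candidates[ingredients])
print(candidates.nlargest(5, "pred"))
```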