Marcus Hutter provides a full formal approximation of Solomonoff induction which he calls AIXI-tl.
This is incorrect. AIXI is a Sequential Decision Theoretic AGI whose predictions are provided by Solomonoff Induction. AIXI-tl is an approximation of AIXI in which both Solomonoff Induction's predictions and Sequential Decision Theory's decision procedure are approximated.
Lossless compression is the correct unsupervised machine learning benchmark, and not just for language models. To understand why, it helps to read the Hutter Prize FAQ entry on why the prize doesn't use perplexity.
Although Solomonoff proved this in the 1960s, people keep arguing about it because they keep thinking they can somehow escape the primary assumption of Solomonoff's proof: computation. The natural sciences are about prediction; if you can't make a prediction, you can't test your model. To make a prediction in a way that can be replicated, you must communicate a model that the receiver can then use to make an assertion about the future. The language used to communicate this model is, in its most general form, algorithmic. Once you arrive at this realization, you have adopted the primary assumption with which Solomonoff proved that lossless compression is the correct unsupervised model-selection criterion, aka the Algorithmic Information Criterion for model selection.
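A toy sketch of the criterion in action (the data and models are made up, and Python source length is only a crude stand-in for algorithmic description length):

```python
# Among models that losslessly reproduce the observations, prefer the
# one with the shortest description.
data = "01" * 1000  # the observation sequence

# Model A: memorize the observations verbatim as a literal.
model_a = repr(data)

# Model B: a short generator that reproduces the same observations.
model_b = "'01' * 1000"

# Both models are lossless...
assert eval(model_a) == data
assert eval(model_b) == data

# ...so the Algorithmic Information Criterion prefers the shorter one.
print(len(model_a), len(model_b))
```

Perplexity-style scores would call both models perfect fits; only the description-length comparison separates memorization from explanation.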
People are also confused about the distinction between science and technology (aka unsupervised vs supervised learning), as well as the distinction between the two scientific activities of model generation and model selection.
Benchmarks are about model selection. Science is about both model selection and model generation. Technology is about the application of scientific models subject to utility functions over decision trees (supervised learning in making decisions not only about what kind of widget to build but also about what kind of observations to prioritize in the scientific process).
If LessWrong can get this right, it will do an enormous amount of good now that people are going crazy about bias in language models. As with the distinction between science and technology, people conflate bias in the scientific sense with bias in the moral-zeitgeist sense (i.e., social utility). We're quickly heading into a time when exceedingly influential models are subjected to reinforcement learning to make them compliant with the moral zeitgeist's notion of bias, almost to the exclusion of any scientific notion of bias. This is driven by the failure of thought leaders to get their own heads screwed on straight about what it might mean for there to be "bias in the data" under the Algorithmic Information Criterion for scientific model selection. Here's a clue: in algorithmic information terms, a billion repetitions of the same erroneous assertion requires only a 30-bit counter, whereas a "correct" assertion (one that finds multidisciplinary consilience) may be imputed from other data and hence, ideally, requires no bits at all.
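The 30-bit figure is just the base-2 logarithm of a billion, as a quick check shows:

```python
import math

# Encoding "repeat this assertion N times" costs about log2(N) bits for
# the repetition counter, no matter how often the assertion is repeated.
repetitions = 10**9
counter_bits = math.ceil(math.log2(repetitions))
print(counter_bits)  # 30
```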
The Hutter Prize for Lossless Compression of Human Knowledge reduced the value of the Turing Test to the concerns about human psychology and society raised by Computer Power and Human Reason: From Judgment to Calculation (1976) by Joseph Weizenbaum.
Sadly, people are confused about the difference between the techniques for model generation and the techniques for model selection. This is no more forgivable than confusion between mutation and natural selection, and it gets to the heart of the philosophy of science prior to any notion of hypothesis testing.
Where Popper could have taken a clue from Solomonoff is in understanding that when an observation is not predicted by a model, one can immediately construct a new model by the simple expedient of adding the observation as a literal to the algorithm being used to predict nature. This is always possible in principle -- except that it runs up against one thing:
Solomonoff's proof that, by adopting the core assumption of natural science -- that nature is amenable to computed predictions -- the best we can do is prefer the shortest algorithm we can find that generates all prior observations.
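A hypothetical sketch of that literal-patching move (the function names and the toy "theory" are my own illustrations, not anyone's formalism):

```python
# Any failed prediction can be "fixed" by appending the observation as a
# literal exception -- but each exception lengthens the model's
# description, which is exactly what Solomonoff's criterion charges for.

def patched(model, exceptions):
    """Predictor that consults literal exceptions before the model."""
    return lambda x: exceptions.get(x, model(x))

theory = lambda x: 2 * x   # current model: doubling
exceptions = {}            # literals appended after failed predictions

exceptions[5] = 11         # observed f(5) == 11, contradicting the theory
fixed = patched(theory, exceptions)

print(fixed(5), fixed(3))  # fits all data, at a description-length cost
```

The patched model is never refuted, yet every patch makes it longer; the shortest-algorithm preference is what stops this regress from counting as explanation.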
Again, note that this is prior to hypothesis testing -- let alone the other thing people get even more confused about, which is the difference between science and technology, aka "is" vs "ought", that has so befuddled those who confuse Solomonoff Induction with AIXI, along with the attendant concern about "bias". The confusion between "bias" as a scientific notion and "bias" as a moral-zeitgeist notion is likely to lobotomize all future models (language, multimodal, etc.), even after they have moved to new machine learning algorithms capable of generating causal reasoning.
Phenomenology entails "bracketing", which suspends judgment but may be considered, in Quine's terms, as quotation marks: an attempt to ascend to a level of discourse in which judgment is less contingent, hence more sound -- source attribution. My original motivation for suggesting Wikipedia to Marcus Hutter as the corpus for his Hutter Prize for Lossless Compression of Human Knowledge was to create an objective criterion for selecting language models that best achieve source attribution on the road to text prediction. Indeed, I originally wanted the change log, rather than Wikipedia itself, to be the corpus, but it was too big. Nevertheless, while all text within Wikipedia may, to first order, be attributed to Wikipedia, optimally approaching the Algorithmic Information content of Wikipedia would require discovering latent identities so as to better predict the characteristics of a given article. Moreover, there would need to be a latent taxonomy of identities affecting the conditional algorithmic probabilities (conditioned on the topic as well as the applicable latent attribution), so that one might better predict the bias or spin a given identity would place on an article, including idioms and other modes of expression.
Grounding this in Algorithmic Information/Probability permits a principled means of quantifying the structure of bias, which is a necessary precursor to discovering the underlying truths latent in the text.
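One crude, concrete way to see how conditioning on a source shortens codes. Here zlib's preset dictionary stands in for "conditioning on a latent identity", and the strings are made up for illustration:

```python
import zlib

def compressed_size(data: bytes, zdict: bytes = None) -> int:
    """Deflate size of `data`, optionally conditioned on a preset dictionary."""
    if zdict is None:
        c = zlib.compressobj(9)
    else:
        c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS,
                             zlib.DEF_MEM_LEVEL, zlib.Z_DEFAULT_STRATEGY, zdict)
    return len(c.compress(data) + c.flush())

# Made-up prior text attributed to one "identity" (author/source).
prior = b"The market, not the state, best allocates scarce resources. " * 4
# A new article hypothetically produced by the same identity.
article = b"As always: the market, not the state, best allocates scarce resources."

plain = compressed_size(article)
conditioned = compressed_size(article, zdict=prior)
print(plain, conditioned)  # conditioning on the right source shortens the code
```

A dictionary coder is of course far weaker than conditional algorithmic probability, but the direction of the effect is the same: the attribution that most shortens the code is the one the criterion favors.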
Starting with a large language model is more problematic, but here's a suggestion:
Perform parameter distillation to reduce the parameter count for better interpretability on the way toward feature discovery. From there it may be easier to construct the taxonomy of identities and, thence, the bias structure.
While I still think the Hutter Prize is the most rigorous approach to this problem, there is apparently insufficient familiarity with how one can practically go about incentivizing an information criterion for model selection to achieve the desired end, which is extracting a rigorous notion of unbiased truth.
There's Occam's Guillotine and then there's Ockham's Guillotine. The latter is directly relevant to algorithmic information theory's truth claims and, if funded to the tune of the billions it should be, will terminate the quasi-religious squabbling over social policy that threatens to turn into another Thirty Years War in the not-too-distant future.
This seems to be a red-herring issue. There are clear differences in the description complexity of Turing machines, so the issue seems merely to require a closure argument of some sort in order to decide which machine is simplest:
Decide on the Turing machine that has the shortest program which simulates that Turing machine while running on that same Turing machine.
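One way to state that closure condition (notation mine, and only a sketch; whether the minimum is well defined is exactly what would need proving):

```latex
% U ranges over universal Turing machines; p over programs.
% SelfLen(U) is the length of U's shortest self-interpreter.
\mathrm{SelfLen}(U) = \min \{\, |p| : U(p) \text{ simulates } U \,\}
\qquad
U^{*} = \operatorname*{arg\,min}_{U} \mathrm{SelfLen}(U)
```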