If an AI model is trained on all human text conversations, or all scientific papers ever written in the test language, using GPT right now, why wouldn't it immediately be more likely to pass the Turing test than any human? As a GPT runs, if it's temperature is set to 0, it will always pick the most likely token that a human would have emitted on average in the given context, to the limits of it's capacity for compression at chinchilla scaling.
What bothers me about this 'log likelihood' metric is that a GPT is going to appear more humanlike on it. Remember, the "judge" has read lots of scientific papers and talked to lots of humans. If their algorithm is "how likely is a human to have emitted the next token", GPTs (even early ones) should always win. Every time. This is because an actual human participant in a test like this isn't an amalgamation of all speakers of the language, or all peers of the judge*, but has "personality traits" and a "unique method of speaking", this is why stylometry is possible. If I am reading it right, the algorithm you propose will deterministically fail the human almost every run.
Right now some of the main ways that you can detect a GPT's text is from tells that are from OAI's RLHF efforts. These cause the model to use certain phrases* more often than a real human, and excessive use of bullet point summaries. There are also currently forbidden words and phrases that act as tells. Another way is to ask a logic question that current architectures seem to have difficulty with, Richard Ngo has some.
The other element that bothers me about this is how do we know if a scientific paper is new/useful? By the words chosen by the AI model? Of course not. The presentation layer (words used in a paper) offer little value.
Most (all?) useful papers involve tool use. Whether it's a CS or math paper using computers or hundreds of pages of arguments using proven building blocks, or a biology or physics paper involving laboratory equipment and robotic manipulation, effective tool use is what is required for AI to contribute to science.
If an AI model were effective at using tools, but were unable to write a plausible scientific paper, only summarize it's findings in easy to understand language with a link to the raw data and procedures in a reproducible format, it would probably be a better scientist than most human scientists.
What am I missing here. Why is this a plausible way to project the date for AGI?
* Footnote : it seems like actually passing the Turing test is mostly a matter of training a model specifically intended to game this test. Freshly train a model mostly on text that the fake biography emitted by humans the model is trying to mimic, RLHF it to sound more human instead of answering questions, and other things like a model to mimic human typing.
it would need that of compute. Assuming that GPT-4 cost 10 million USD to train, this hypothetical AI would cost USD, or 200 years of global GDP2023.
This implies that the first AGI will not be a scaled-up GPT -- autoregressive transformer generatively pretrained on a lightly filtered text dataset. It has to include something else, perhaps multimodal data, high-quality data, better architecture, etc. Even if we were to attempt to merely scale it up, turning earth into a GPT-factory,[6] with even 50% of global GDP devoted,[7] and with 2% growth rate forever, it would still take 110 years,[8] arriving at year 2133. Whole brain emulation would likely take less time
I don't think this is the correct response to thinking you need 9 OOMs more compute. 9 OOMs is a ton, and also, we've been getting an OOM every few years, excluding OOMs that come from $ investment. I think if you believe we would need 9 OOMs more than GPT-4, this is a substantial update away from expecting LLMs to scale to AGI in the next say 8 years, but it's not a strong update against 10-40 year timelines.
I think it's easy to look at large number like 9 OOMs and say things like "but that requires substantially more energy than all of humanity produces each year — no way!" (a thing I have said before in a similar context). But this thinking ignores the dropping cost of compute and strong trends going on now. Building AGI with 2024 hardware might be pretty hard, but we won't be stuck with 2024 hardware for long.
Abstract
This essay explains the Direct Approach proposed by [@barnettScalingTransformativeAutoregressive2023].[1] I encourage you to play with the Direct Approach Interactive Model to explore an interactive simulation using the approach.
From the POV of the judge, a Turing test is a sequential test for two statistical hypotheses -- "is human" and "is machine". Under reasonable assumptions, halving the (reducible part of) log-perplexity loss of the language model would double the time it can survive in a Turing test.
We can think of the peer-review of scientific papers as a Turing test, and say that AGI has arrived when we have AI scientists that can pass the papers peer-review. This allows us to calculate the log-perplexity loss of the first AGI. If we assume it is just a scaled-up GPT, then assuming the Chinchilla scaling law, it would cost about 200 years of global GDP. This makes it virtually certain that the first AGI will not be a scaled-up GPT.
Turing test as statistical hypothesis test
Turing test
In the Turing test, there are three players: one judge and two players. The first player is a human, and the second is a machine. The judge asks each player text questions and receives text answers. The judge must decide who is the human.
We consider a simplified Turing test. In this test, the judge does not ask, and simply receives one stream of text X1:∞. The judge must decide whether the stream is produced by the human or the machine, and do so quickly.
Cast in the language of statistical hypothesis testing, we have two hypotheses:
The judge would read from the stream X1:∞,
o-n-e- -t-o-k-e-n
at a time, and at each token, decide whether to take another one, or announce its judgment: H0 or H1.As the organizers of the Turing test, we would start the test by flipping a fair coin to decide whether to use the human or the machine. Therefore, Pr(H0)=Pr(H1), and by Bayes, the posterior log-probability ratio is
lnPr(X1:n|H0)Pr(X1:n|H1)=lnPr(H0|X1:n)Pr(H1|X1:n)
This allows us to use the sequential probability ratio test (SPRT). The judge would decide on two decision boundaries, and calculate lnPr(X1:n|H0)Pr(X1:n|H1) at each token. It would stop and announce the decision as soon as the quantity exceeds one of the boundaries.
For example, suppose the judge wants to decide when the odds ratio is 10 to 1, then it would set the decision boundaries to be [−ln10,+ln10]. If lnPr(X1:n|H0)Pr(X1:n|H1) goes above +ln10 when n=60, then the judge would announce "H0" at that point.
The ln10 is a good rule of thumb, which we will use for the remainder of the essay.
Sequential hypothesis testing
Consider the following simple equation:
1nEX∼Pr(⋅|H0)[lnPr(X1:n|H0)Pr(X1:n|H1)]1nDKL(Pr(⋅|H0)∥Pr(⋅|H1))=1nEX∼Pr(⋅|H0)[ln1Pr(X1:n|H1)]negative log-likelihood loss per token−1nEX∼Pr(⋅|H0)[1lnPr(X1:n|H0)]entropy rate of the human itself{#eq-sprt}
The first term is the KL-divergence per token between the machine and the human. Roughly speaking, it is how different they are, per token emitted. It is an information-theoretic quantity.
The second term is negative log-likelihood loss per token. This is what language models are trained to minimize. We write it as L.
The third term is the entropy rate of the human. It is how random the human is. We write it as L∞, because it is the theoretical minimal loss that the language model can reach.
If the machine is a perfect replica of the human, then the second term is zero, and the first term equals the third term.
Assuming that the human is an ergodic speakers of English,[2] we can sample an infinite stream X1:∞ from the human, then call up the Shannon--McMillan--Breiman theorem and find that
1nlnPr(X1:n|H0)Pr(X1:n|H1)→L−L∞
On the other hand, if the machine is also an ergodic speaker of English, then we can sample an infinite stream X1:∞ from the machine, then call up the SMB theorem and find that
1nlnPr(X1:n|H1)Pr(X1:n|H0)→L′−L′∞
where unfortunately, we have the odd L′ and L′∞, defined by
L′:=limn1nEX∼Pr(⋅|H1)[ln1Pr(X1:n|H0)],L′∞:=limn1nEX∼Pr(⋅|H1)[ln1Pr(X1:n|H1)]
We can interpret them as the loss of the human at imitating the machine, and the entropy rate of the machine itself. When the machine is close enough to the human, we can take the approximation L′≈L,L′∞≈L∞.
Now, define the log-ratio at step n to be rn:=Pr(X1:n|H0)Pr(X1:n|H1). During a Turing test, the judge calculates
r0=1r1=r0+Pr(X1:1|H0)Pr(X1:1|H1)r2=r1+Pr(X1:2|X1:1,H0)Pr(X1:2|X1:1,H1)⋯
So, imagine that such a perfect judge is going through a Turing test, upon receiving "my cat is technically", and we are listening on its thoughts:
"If it were a human, then it would start with 'my' with probability 0.01. If it were a machine, then 0.05. Therefore, the odds ratio is 2 to 1."
"If it were a human, then it would follow 'my' with 'cat' with probability 0.01. If it were a machine, then 0.033. Therefore, the odds ratio is 3 to 1."
"If it were a human, then it would follow 'is' with 'my cat' with probability... I do not know. However, I do know that the odds ratio is 2 to 1. Now that the total odds ratio is 12 to 1, I can decide: H0."
We see that the judge does not have to know the probabilities Pr(X1:n|H0) and Pr(X1:n|H1), only their ratio. This might be a minor point, but this idea of likelihood ratio is quite important. It is like "I don't know how often you say 'cat' but I know that you say it twice as often than I do!".
Let T∗ be the time it takes for the judge to decide.
T∗≈ln10L−L∞
Intuitively, each token on average moves the log-probability-ratio away from 0 by another (L−L∞). Decision is triggered when it finally moves out of the interval [−ln10,+ln10].
We are not able to simply look at a few tokens, draw a straight line, and call it a day, because the trajectory of log-probability-ratio is much closer to a random walk with drift. Subjectively, if you were a judge and watching the log-probability-ratio moving, you'd see ups and downs, keeping you in suspense, until it finally crosses the decision boundaries.
Slowdown factor
To perform the SPRT as described, the judge must know intimately the difference between a human and a machine. Can the judge do that? Can anyone know, with certainty, that I would start my speech with "Forty cats ..." with a probability that is exactly 32.42 times that of GPT-3?
As a crude approximation, we can model real-world judges as slowed-down version of the perfect judge. We can imagine that at each step, instead of updating the log-ratio by
lnrn+1←lnrn+lnPr(Xn+1|H0,X1:n)Pr(Xn+1|H1,X1:n)
we update it by
lnrn+1←lnrn+1slnPr(Xn+1|H0,X1:n)Pr(Xn+1|H1,X1:n)
where s>1 is the slowdown factor. This implies that if it takes ∼T tokens for the perfect judge to reach a likelihood ratio of r, it would take ∼sT tokens for a human judge.
Measuring the slowdown factor
The slowdown factor s is unknown.
The original paper [@barnettScalingTransformativeAutoregressive2023] contains no estimate of s. They did propose to measure it experimentally by running the Turing test with a human judge and two language models. One model H0 "perfectly imitates humans" by simply sampling a random text segment from a corpus, and the other model H1 is a trained language model, finetuned to imitate the same corpus. They claimed that for any piece of text X1:n, they can calculate the log-ratio lnPr(X1:n|H0)Pr(X1:n|H1), but I found it difficult: Suppose X1:n= technically fork, which is unlikely but possible, yet the phrase never appears in the corpus, what should be Pr(X1:n|H0)? We can use one of the many smoothing tricks [@jurafskySpeechLanguageProcessing2023, chapter 3], but this gets complicated.
What I think would work well is if both H0and H1 are language models, perhaps even the same model with different sampling temperatures, then the human judge only has to distinguish the two models.
Perhaps we can make this into a gambling game. The human subject would be presented with two long outputs from two hidden Markov models. Then the subject becomes the judge of a Turing test: "Are you seeing the output from machine 0 or machine 1?". At each step, the subject can either pay a few cents of fake money to see another character, or stop and make a bet with the entire bankroll: "I bet 70% of my bankroll on machine 0 and the rest on machine 1!". Both bets have payoff odds of 2:1. I believe that if the cost of seeing another character is just right, the subject would be nudged to make a decision at exactly 10:1 posterior odds ratio on the two hypotheses "machine 0" and "machine 1".
There was one large-scale attempt at the Turing test in early 2023, in a game called "Human or Not?" [@jannaiHumanNotGamified2023]. Human participants took 2-minute conversations, and at the end, had to decide whether they were talking to a human or a bot.[3]
I counted that during a typical message, each side sends [20,40] English words in total, or [30,50] tokens. In [60%,70%] of trials, the human participant judged correctly. This suggests that the log-ratio achieved after [30,50] tokens is around the range of [±ln6/4,±ln7/3]. In other words, the average log-ratio per token is
[ln6/4,ln7/3][30,50]∈[0.01,0.03]nat/token
They used several different AI, ranging between Jurassic-2, GPT-4, and Cohere. None of them have published their training compute or loss curves. The only good estimate is for GPT-4, which has training cost C=2×1025FLOP.
Assuming that Chinchilla scaling holds, average log-ratio per token that an ideal judge should achieve is L−L∞=1070C0.154=0.14nat/token. Therefore,
s∈[5,14]
I did not expect the estimate to be nearly symmetric around 10.
Entropy of natural languages
In @eq-sprt, we argued that L∞ should be interpreted as the entropy rate of the source, usually human-generated English. Unfortunately, unlike that of coin flips or Markov chains, the entropy rate of English cannot be calculated, only estimated. Fortunately, it can be estimated in several ways, and we can check their agreement.
Since tokenizers are temporary, but English is permanent, we convert all units to bit/character for easy comparison.
Chinchilla scaling
In the Chinchilla scaling law paper, the authors trained many language models with various sizes from a single architecture family, and fitted a statistical law to the data, giving L∞=1.69nat/token (without error bars, unfortunately) [@hoffmannTrainingComputeOptimalLarge2022, page 25].
To find the effective bit/character for the Chinchilla scaling law, we need to convert nat to bit, and token to character. The first is easy: 1bit=ln(2)nat. The second can be estimated by running a tokenizer over a large natural English corpus. I have estimated this by running the GPT-2 tokenizer on the
WikiText-2
corpus, and found that on average,1token=4.5character=0.85word
Thus, L∞≈1.694.5×ln2=0.54bit/character.
Guessing game
The earliest attempt to measure the entropy rate of English is by Shannon himself [@shannonPredictionEntropyPrinted1951]: [0.6,1.3]bit/character. He obtained the estimate by presenting human subjects n−1 characters from a text, and ask them to guess the next character repeatedly, until they got it right. In this case, the optimal strategy is to construct the n-gram table, and pick the argmax character for the given (n−1)-gram, then the arg-next-max, and so on.
Let N be the total number of characters allowed -- Shannon's experiment used N=27, with 26 lowercase letters and one white space. Let pk be the frequency that the subject makes exactly k guesses -- including the correct guess, so that ∑Nk=1pk=1. By convention, pN+1:=0. Shannon derived both an upper and a lower bound for the entropy per character:
N∑k=1k(pk−pk+1)lnk≤H≤−N∑k=1pklnpk
The upper bound is proved by Shannon's source coding theorem. Taking a human subject, copy it, then they can be used as an encoder-decoder pair.[4] The lower bound is not only tricky to prove, but also wrong in general. It is only correct when the human subject is the optimal N-gram predictor.[5] Because of this, I do not recommend using this lower bound, but will quote it anyway.
Over the years, others devised other methods to estimate this entropy. For example, [@coverConvergentGamblingEstimate1978] used a gambling game estimation, in the style of the Kelly criterion. Subjects were required to divide their entire bankroll into 27 differently-sized bets over 27 possibilities (26 letters and 1 whitespace). The right bet pays back 27-fold, and the other bets are lost. Let Sn be the size of bankroll after n rounds of betting, then
H≤ln27−limsupn1nlnSn
They found that H≤1.3bit/character.
The guesser does not have to be a human. It can very well be a language model. [@brownEstimateUpperBound1992] made a simple trigram model over the Brown corpus (600 million words), and found that it gives H≤1.75bit/character. [@behrjrEstimatingComparingEntropy2002] used a model that combines multiple n-gram models, giving H≤1.46bit/character.
Lossless compression
Another way to estimate is by lossless compression of a large corpus, since the entropy rate is the lower bound on compression rate. In more detail, if you have a source of information emitting symbols, and its symbol stream has an entropy rate of xbit/symbol, then it takes at least ∼xl bits to encode a long segment with l symbols. Furthermore, this lower bound is approachable using the entropy encoding.
The Hutter prize is a competition for compressing a 109-byte corpus from the English Wikipedia (
enwik9
). For the size of the finished product, both the algorithm and the compressed data must be counted. In particular, if a neural network is used, then the size of the neural network weights must be counted as well.The
enwik9
dataset is inXML
format, and thus contains a lot of non-English content like<timestamp>2005-12-27T18:46:47Z</timestamp>
. It has 109 bytes. It is tricky to decide how to clean it up to remove all theXML
formatting. As a simple estimate, we counted its words and characters directly with Linux commandwc
without any preprocessing, which gives us13,147,025 words=129,347,857 characters=1,000,000,000 bytes
Therefore, the entropy rate is
8×108/13,147,025compression ratio=6.15compression ratiobit/character{#eq-hutter-prize-entropy-rate}
The standard zip algorithm can compress it down to about 300 Mb in size, a compression ratio of ∼3×. Over the years, the progress has been slow but somewhat steady. The current winning entry (Saurabh Kumar, 2023) has a compression ratio of 8.76×. If we extrapolate the prize-winning entries over the years, it seems that the best possible compression ratio is ∼10×.
Similar to the Hutter prize, the Large Text Compression Benchmark also asks for compressing the
enwik9
dataset. However, there is no limit to the algorithm runtime or size, so the compression ratio for this benchmark is always higher. Currently (2024-01-19), the maximal compression rate reached is 9.35× withnncp v3.2
, which uses a small Transformer model.[@grassbergerDataCompressionEntropy2002] used a substitutional compression algorithm with increasingly large codebooks. When the codebook had 6000 codes, the algorithm gave h≤1.82bit/character. By extrapolating the {codebook size}-{entropy rate} curve to an infinitely large codebook, they estimated that English has entropy rate 0.7±0.2bit/character.
Summary
nncp v3.2
, 2023)Notably, the above table has mostly upper bounds, and only one dubious lower bound (by Shannon) from 1951. Perhaps lower bounds can be established by using randomness extractors on a large corpus, and checking that the output from the extractor passes pseudorandomness tests.
Forecasting AGI
According to the Chinchilla scaling law [@hoffmannTrainingComputeOptimalLarge2022], if we have a fixed amount of computing budget C, by choosing the model and dataset size correctly, the minimal reducible loss achievable is
L−L∞=1070(C/FLOP)0.154nat/token{#eq-chinchilla-scaling}
Assuming a slowdown factor s, that the judge decides when the odds ratio is r:1, and the Chinchilla scaling law, we have a direct method to predict how long a language model can survive in a Turing test, according to the cost of training compute C:
T∗∼slnr1070(C/FLOP)0.154token
This gives, as a rule of thumb, 100× compute means 2× length of survival in a Turing test.
For example, assuming a slowdown factor of s=10, and that the judge decides when the odds ratio is 10:1, for a language model to survive for 1000 tokens, it needs
L−L∞≤10×ln10/1000=0.023nat/token
If GPT-4 costs 2×1025FLOP in compute, and 1word≈1.2token, then
T∗≈170 tokens≈150 words
meaning it has a good chance of passing the Turing test if limited to only 150 words. For context, the Attention is All You Need paper has an abstract that's 200 tokens long.
A typical scientific paper is about 4000 words long, which is 27× that of 150 words, so it would need 271/0.153=(2×109)× that of compute. Assuming that GPT-4 cost 10 million USD to train, this hypothetical AI would cost 2×1016 USD, or 200 years of global GDP2023.
This implies that the first AGI will not be a scaled-up GPT -- autoregressive transformer generatively pretrained on a lightly filtered text dataset. It has to include something else, perhaps multimodal data, high-quality data, better architecture, etc. Even if we were to attempt to merely scale it up, turning earth into a GPT-factory,[6] with even 50% of global GDP devoted,[7] and with 2% growth rate forever, it would still take 110 years,[8] arriving at year 2133. Whole brain emulation would likely take less time.[9]
Appendix: Ergodic theory
Since we used ergodic theory during the essay, we should quickly explain what it is about. This section is foundational, but the full complexity is not necessary.
Measure-theoretic POV
I know, you know too, nobody really likes measure theory any more than pianists like practicing scales hundreds of times. Still, it is at the right level of abstraction for many theories, including probability.
We omit all mentions of "almost-everywhere", "except on a set of measure zero", and similar annoying phrases. As long as you never make a union of uncountable many subsets, you will not be hurt by this omission.
A probability space is a measurable space with a measure of 1. We write it as (Ω,B,Pr), where B is the sigma-algebra of measurable sets, and Pr is the probability measure. We also write μ for the measure.[10]
We consider a single measurable function T:Ω→Ω, and call it the shift map.
We demand that T must preserve measure. That is, ∀S∈B, we have Pr(T−1(S))=Pr(S).
A subset is measurable iff it is an element of B. A measurable set is also called an event.
A subset S∈B is T-invariant iff T−1(S)=S almost everywhere.[11] Let I be the set of all T-invariant subsets:
I:={S∈B:T−1(S)=S}
Now, obviously any set of measure zero or one are T-invariant. We say that those are trivially T-invariant. We say that T is ergodic iff I has only such trivial subsets. In other words, T is ergodic iff it cannot be factored into two nontrivial chunks:
S,S′ partitions Ω,such that T−1(S)=S,T−1(S′)=S′,Pr(S)>0,Pr(S′)>0
We usually ask T to also be ergodic, though sometimes we don't need that.
Ergodic maps have many very good properties. We will use the following one. For the theorem, you can picture it as the real space Rn with the gaussian probability distribution, but in fact, it applies for just about everything we would care about, such as the space of English texts, queuing jobs, random walks, etc.[12]
Theorem: If the state space is a topological space with a countable basis, and any nonempty open set has positive measure, then almost any X∈Ω has a dense orbit.
Proof: Let U be a nonempty open set.
Ω−∪i≥0T−iU is T-invariant, and since it excludes U, it does not have the full measure. Since T is ergodic, the set actually has zero measure.
Now, ∪(Ω−∪T−iU) is a union of countably many zero-measure sets, so it still has zero measure. By expanding the definition, this is the set of all points with non-dense orbit.
Finally, there is a common theme in ergodic theory. There are rigorous versions of it, but instead of going for rigor, the spirit is more important:
Theorem (ergodic decomposition): Any interesting map is a partition/sum/integral of ergodic maps.
For example, the shear map on the unit square [0,1]2 defined by
(x,y)↦(x,x+ymod1)
can be thought of as an integral over rotations: For each x∈[0,1], we have Tx:y↦x+ymod1. For almost all x∈[0,1], we have Tx an irrational rotation, thus ergodic.
Sequence POV
We must interpret the language of measure theory, which is dead like chalk dust, back into the language of sequence predictions, which is alive like reinforced concrete.
Each point in the state space X∈Ω is a text: a stream of tokens infinite both forwards and backwards. The state space Ω is the all possible texts (Xn)n. We assume that all tokens come from the same finite-size alphabet, for example, the 128 ASCII symbols.
The shift map on the state space T:Ω→Ω is defined by moving the origin to the right by one:
T(…,X−1,X0,X1,…):=(…,X0,X1,X2,…)
The shift map is measure-preserving, meaning that the process is stationary: We could have started reading at any point, and we would still expect to see the same kind of probability distribution. It would not be like "Sorry, the word 'cat' appears with zero probability when n≥1000.". It would be like "No matter where we start reading, we should expect to the first three tokens to be 'cat' with probability 10−4.".
Repeatedly applying the shift map T is just reading through the stream, one token at a time:
...Lorem ipsum ...↦...orem ipsum d...↦...rem ipsum do...↦⋯
A periodic point of T is a text that repeats itself like a broken record. For example, X:=... and and and ... satisfies T4X=X.
A T-invariant set S⊂Ω is a set of texts, such that if we take any text X from S, and jump either forwards or backwards for an arbitrary amount, we get another set in S. In other words, S is a set of token streams where there is no origin: you can start reading from any token.
A probability distribution over Ω describes the probability of observing various kinds of text streams.
If we can partition Ω into two subsets P,Q, with probabilities ϵ>0,1−ϵ>0, then it means that any text from P is different from any text from Q, after any shift. It is as if there are two languages, and each text can be exclusively written in one language only.
We wish to consider only texts created by some imaginary "universal English speaker". In particular, we do not want it to get stuck in one sub-language of English, then never escape from it. That is, we assume the universal speaker is ergodic.
Now imagine that we randomly sample two pieces of text generated by the universal speaker, and we shift the first text around to match it against the second. By @thm-ergodic-dense-orbit, the orbit of the first text is dense in the space of all possible English texts spoken by the universal speaker. We can gamify this situation thus:
Prover: "I take one piece of text x, then another piece x′.".
Challenger: "I challenge you to find a stretch of text from x that matches the −1000:1000 stretch in x′.".
Prover asks a team of immortal monkeys to do the task. A million years later: "At 49134819.".
Challenger verifies that T49134819(x)−1000:1000=(x′)−1000:1000.
Shannon--McMillan--Breiman
If someone has created an infinite sequence of coin flips X−∞:+∞, then revealed it to us one by one, then each reveal would give us 1bit=ln2nat. The long-term average obtained per reveal is still ln2nat, a rather boring situation.
How do we measure the entropy of an English speaker? It speaks token by token, and we have to measure the average information we obtain per token. The problem is that there are two senses of "average". It could be the time-average: we listen to the speaker speak for a very long time, and calculate the entropy in the speech. It could be the ensemble-average: we listen to the speaker speak for a very long time, then do it again, then again, etc, then average together the time-averages.
If the speaker is ergodic, then the speaker essentially has just one speech, and any two samples of its speech are just translations of each other. Consequently, it is intuitively clear that with probability 1, the time-average of the entropy of one speech equals the ensemble-average of the entropy of all speeches. Intuitively, with probability 1,
1nlnPr(X1:n)→E[1nlnPr(X1:n)]
For non-ergodic speakers. We simply decompose the speaker into an ensemble of ergodic speakers, then apply the SMB theorem to each one. It is like the strong law of large numbers. Intuitively, with probability 1,
1nlnPr(X1:n|X is type i)→E[1nlnPr(X1:n)|X is type i]
This is the Shannon--McMillan--Breiman theorem.
In textbooks and Wikipedia, the SMB theorem is stated rigorously, but you have already understood the idea of SMB, and the rigorous versions are simply paraphrases of the idea.
The thing is released in a scattered way, typical for an internet-native publication. There is the report [@barnettScalingTransformativeAutoregressive2023], in the form of a paper -- clearly meant to be cited, despite being hard to read. There is the website [@barnettDirectApproach2023], in the form of a blog post -- clearly meant to be read, despite not being upper-class enough to be cited in journal papers. Finally there is the interactive model which looks like an optional add-on to the blog post. ↩︎
In short, an ergodic speaker is someone who has only one speech. If you hear it speak once for a very long time, then hear it speak again for a very long time, then you can take the first and shift it around, so that it looks like the second over a very long sub-segment. Ergodic speakers allow you to take the average over a single very long speech, and be assured that it is close to the average over all possible speeches.
In long, see the appendix on ergodic theory. ↩︎
There was no mention of whether the bots had to decide the same question. ↩︎
It still works even if the humans are pseudorandom. We just have to whisper the same RNG seed into both humans' ears, and then they would behave in the same pseudorandom way. ↩︎
The simplest counterexample: Suppose the source is binary, and satisfies Xn+1=Xn+1mod2, so it has zero entropy. Nevertheless, the human intentionally guesses wrong the first time. Therefore, we have p2=1, and we have violated the lower bound by 2ln2>0.
This source can be made ergodic by adding an ϵ amount of coin-flip noise: Xn+1=Xn+1mod2 with probability 1−ϵ. This would still give us 2ln2+O(ϵ)>O(ϵlnϵ). ↩︎
Consider this anecdote from Edward Teller:
↩︎Only in a life-or-death situation does 50% of GDP get devoted to one purpose. For example, that is about the level of GDP devoted to war production during WWII in the major combatant countries. The USA spent 4 trillion USD2011 over 6 years out of an annual GDP of 1.3 trillion USD2011. ↩︎
Solve for x in 200=∑xk=00.5×1.02k. ↩︎
<iframe src="https://www.metaculus.com/questions/question_embed/2813/?theme=dark" style="height:430px; width:100%; max-width:550px"></iframe> ↩︎
Pronounced "mu" -- it is a pun because both "mu" and "measure" starts with "m". ↩︎
That is, except on a subset of measure zero: Pr(T−1(S)−S)=0 and Pr(S−T−1(S))=0. This is the last time we will measure this. ↩︎
Except pathological examples constructed by logicians who have nothing better to do than to care about the continuum hypothesis, large cardinals, and the arithmetic hierarchy. Those who desire the rigor-mortis of logic, let them have it. ↩︎