The aim of the Hutter Prize is to compress the first 1GB of Wikipedia to the smallest possible size. From the AIXI standpoint, compression is equivalent to AI, and if we could compress this data to the ideal size (75MB according to Shannon's lower estimate), then the compression algorithm would be equivalent to AIXI.

However, all the winning solutions so far are based on arithmetic coding and context mixing. These solutions hold little relevance in mainstream AGI research.

Current-day LLMs are really powerful models of our world, and it is highly likely that most LLMs are trained on data that includes the first 1GB of Wikipedia. Therefore, with appropriate prompting, it should be possible to extract most or all of this data from the LLMs.

I am looking for ideas/papers that would help me validate whether it is possible to extract pieces of Wikipedia from any publicly available LLM using prompts.

Update: I have recently learnt about GPT-4's compression abilities via prompting, and I am very keen to test this method. If anyone is willing to work with me (as I do not have access to GPT-4), it would be great.
 


There are a few things that resemble what I think you might be asking for.

The first thing it sounds like you may be asking is as follows:

There exists a file enwik8, which contains 10^8 bytes of English Wikipedia. The language models everyone has been playing with have almost certainly been fed all or most of that content. Does there exist some prompt which will cause those models to emit the entirety of enwik8 with high probability?

The answer is a flat "no" for the entirety of enwik8 for any of the main OpenAI models, which can be verified by passing in a random full context's worth (2048 tokens) of enwik8 and checking that the completion is not "the rest of enwik8". The models cannot look back before their context, so if any full context in the middle fails to generate the next part, we can conclude that it won't generate the full thing on its own for any initial prompt. Which is not too surprising.
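For what it's worth, that check is easy to script. The following is a minimal sketch, not the commenter's actual code: `model_complete` is a hypothetical stand-in for whatever completion API is being tested (it takes a list of prompt tokens and returns the model's greedy next 2048 tokens), and tokenization is assumed to have happened elsewhere.

```python
# Sketch of the verification described above: sample full-context windows
# from enwik8 and check whether the model's completion reproduces the text
# that actually follows. `model_complete` is a hypothetical stand-in for
# a real completion API (prompt tokens -> next 2048 tokens, greedy).
import random

CONTEXT_TOKENS = 2048

def regurgitates_enwik8(tokens, model_complete, trials=10):
    """tokens: enwik8 as a token list; returns False if any window fails."""
    for _ in range(trials):
        start = random.randrange(len(tokens) - 2 * CONTEXT_TOKENS)
        prompt = tokens[start : start + CONTEXT_TOKENS]
        truth = tokens[start + CONTEXT_TOKENS : start + 2 * CONTEXT_TOKENS]
        if model_complete(prompt) != truth:
            # One failing mid-document window is enough to rule out
            # full recall from any initial prompt.
            return False
    return True
```

Because the model cannot see past its context, a single failing window anywhere in the middle settles the question.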

The second thing, and the thing I expect you mean, is

Can we compute how well these language models encode enwik8?

to which the answer is "yes". It is possible to get a measure of bits per character for a given model to generate a given sequence of tokens, which equally means it is possible to generate that exact text using the model and the corresponding number of bits. Some examples:

  • A model that said all bytes are equally likely to occur would cost 8 bits per byte, or ~8 bits per char (since it's mostly ASCII).
  • Using the character frequency observed in enwik8, and nothing else, would cost 5.17 bits per char, with a model that weighs a few kB (markov_chars_0 below).
  • Tokenizing using the same tokenizer GPT-2 uses, and going by token frequency, gets you down to 3.18, at the cost of tens to hundreds of kB of model (markov_toks_0 below).
  • Building a simple markov model that looks at the last 3 chars to get your next char gets you down to 2.48 bits per char. But storing a 4 char sequence + a 3 byte int for each 4 char sequence that occurs now weighs almost 10 MB (markov_chars_3 below).
  • Building a markov model of tokens (again using the gpt2 tokenizer) can drop you down to 0.66 bits per char, below Shannon's lower estimate. Except that now your model is recording counts for 16 million distinct token sequences, which will weigh in at something on the order of 180 MB (markov_toks_3 below).
  • Likewise, using OpenAI's text-davinci-003 can get you down to 0.56 bits per token. But the model itself weighs several GB (openai:text-davinci-003 below).
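As a concrete illustration of how those numbers arise, here is a minimal sketch of the bits-per-char calculation for the simplest case, the 0th-order character model (markov_chars_0 above). An arithmetic coder driven by a probability model pays roughly -log2(p(c)) bits per symbol c:

```python
# Minimal sketch of the bits-per-char measurement for the simplest model:
# a 0th-order character model (markov_chars_0 above). An arithmetic coder
# driven by probabilities p(c) costs about -log2(p(c)) bits per symbol.
import math
from collections import Counter

def bits_per_char_order0(text):
    counts = Counter(text)
    total = len(text)
    return sum(-math.log2(counts[c] / total) for c in text) / total
```

Run on enwik8, this should come out near the 5.17 figure quoted above; the fancier models differ only in how p is conditioned on preceding context.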

Anyway, I threw together a quick script to get approximate bits-per-char to reproduce enwik8, excluding model size. Here are the results, high-to-low (std deviation across 10 trials in parens). Keep in mind that "store the entirety of enwik8 uncompressed" costs 100MB.

markov_chars_0:           5.17 (0.43)
markov_chars_1:           4.02 (0.29)
markov_chars_2:           3.18 (0.18)
markov_toks_0:            3.18 (0.56)
markov_chars_3:           2.48 (0.17)
markov_toks_1:            1.79 (0.24)
openai:text-ada-001:      1.58 (0.09)
openai:text-babbage-001:  1.06 (0.06)
markov_toks_2:            1.04 (0.11)
openai:ada:               0.93 (0.09)
openai:text-curie-001:    0.91 (0.07)
openai:babbage:           0.81 (0.06)
openai:curie:             0.72 (0.05)
markov_toks_3:            0.66 (0.11)
openai:davinci:           0.64 (0.03)
openai:text-davinci-002:  0.57 (0.05)
openai:text-davinci-003:  0.56 (0.05)

If this was what you had in mind, you would probably also be interested in the post Chinchilla's wild implications, which goes a lot deeper into this sort of stuff.

if any full context in the middle fails to generate the next part, we can conclude that it won't generate the full thing on its own for any initial prompt. Which is not too surprising.

Strictly speaking, it would be surprising, because it would mean that there are no adversarial examples or possible prompt-tuning which can produce specified arbitrary text, even text that is of a very high starting likelihood like a WP entry. And this despite being able to work with ~50k^2048 possible inputs.

(It would be like saying that a face-generating GAN can't produce a specific face that it was trained on, no matter how hard you optimized the z, or how vastly larger the dimensionality of the z is compared to the true face manifold. If you showed me a StyleGAN face generator which had been trained on Obama, and that it completed half of Obama's face with a different face, I would not be surprised; I would be very surprised if you told me you had somehow proven that not only did it not infill Obama on that specific example (fine, unsurprising), but that there was no infill or possible z whatsoever with which it could generate Obama's face (extremely surprising).)

Strictly speaking, it would be surprising, because it would mean that there are no adversarial examples or possible prompt-tuning which can produce specified arbitrary text, even text that is of a very high starting likelihood like a WP entry

I don't think it's surprising that there are sequences of tokens such that there is no amount of prompt tuning you can do to generate such a sequence of tokens, as long as you allow the sequence to be longer than the context window of the model, and in fact I think it would be surprising if this was a thing you could reliably do.

For example, let's build a toy model where the tokens are words, and the context length is 10. This model has been trained on the following string.

On Monday I went to work and then I returned home. On Tuesday I went to work and then I returned home. On Wednesday I went to work and then I returned home.

If you start with the string

On Monday I went to work and then I returned

the model will complete with "home.", at which point the context is

Monday I went to work and then I returned home.

and the model completes with "On", changing the context to

I went to work and then I returned home. On

If the model completes with "Tuesday" it will end up stuck in a loop of

On Tuesday I went to work and then I returned home. On Tuesday I went to work and then I returned home. On Tuesday I went to work and then I returned home.

which does not succeed at the task of "complete the input". The outcome is similar if it chooses to complete with "Wednesday" instead.
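The loop can be reproduced with a toy script. This is a sketch only: the "model" here is a greedy most-frequent-next-word lookup over the training string, with the 10-word window from the example and a simple suffix backoff that is my own simplification.

```python
# Toy version of the example above: the "model" is a greedy lookup that
# returns the most frequent next word seen after the current context in
# the training string. The 10-word window matches the toy setup; the
# suffix backoff is an assumed simplification.
from collections import Counter, defaultdict

CONTEXT = 10
training = ("On Monday I went to work and then I returned home. "
            "On Tuesday I went to work and then I returned home. "
            "On Wednesday I went to work and then I returned home.").split()

# Record next-word counts for every context suffix up to CONTEXT words.
next_word = defaultdict(Counter)
for i in range(len(training) - 1):
    for k in range(1, CONTEXT + 1):
        if i - k + 1 < 0:
            break
        next_word[tuple(training[i - k + 1 : i + 1])][training[i + 1]] += 1

def complete(prompt, steps):
    words = prompt.split()
    for _ in range(steps):
        key = tuple(words[-CONTEXT:])
        while key and key not in next_word:  # back off to shorter context
            key = key[1:]
        if not key:
            break
        words.append(next_word[key].most_common(1)[0][0])
    return " ".join(words)
```

Completing from "On Monday I went to work and then I returned" produces the Monday sentence once and then repeats a single later day forever: once the day name has slid out of the window, the model has no way to know which sentence it is in.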

This happens because the task is specifically "output the entirety of enwik8", where "the entirety of enwik8" is a string that is much longer than the context window. No matter what your initial prompt was, once the prompt has succeeded at the task of "output the next 2048 tokens of enwik8", your prompt no longer has any causal impact on the completion - "feed the model a prompt that causes it to output the first 2048 tokens of enwik8 and ask it for completions" and "feed the model the first 2048 tokens of enwik8 and ask it for completions" are operations which yield the same result.

If we had a GPT-X which somehow had a non-bounded context window, I suspect it would be possible to get it to output the full text of enwik8 (I expect that one way to do this would be "fill its context window with enwik8 repeated a bunch of times in a row").

I don't think it's surprising that there are sequences of tokens such that there is no amount of prompt tuning you can do to generate such a sequence of tokens, as long as you allow the sequence to be longer than the context window of the model, and in fact I think it would be surprising if this was a thing you could reliably do.

But that's a different question from trying to generate a WP article's next token, which is what you defined as failure. You said even with the full context, it couldn't predict the next token. I'm pointing out that you almost certainly can find an adversarial prompt (plus shortened prefix) which programs it to predict the right next token. It's just that you would need a different adversarial prompt to program for the next token after that, and then the next, and so on, in what is presumably some sort of sliding window over all of Wikipedia, while to be interesting in any substantive sense you'd need it to be a single fixed prompt which elicits accurate predictions of every token. The question is whether there is one fixed prompt, not whether there is any prompt which would've correctly predicted the next token for your specific article example. The latter is almost certainly true (demonstrating a weakness in your counterexample), and the former almost certainly false.

I said that if there is any full context window containing 2048 tokens of enwik8 for which it fails to predict the next token, that means it fails at the task of "output the whole string of enwik8 in one go".

It was not a very interesting statement on what a language model cannot do -- all it means is "GPT has not literally memorized enwik8 to the point where it can regurgitate the whole thing perfectly".

It's just that you would need a different adversarial prompt to program for the next token after that, and then the next, and so on, in what is presumably some sort of sliding window over all of Wikipedia

I'm not entirely sure I understand what you're saying here.

Let's say we have the text "In 2004, the 34-year-old Agassi won the [[Cincinnati Masters]] to bring his career total to 59 top-level singles titles", which is a sequence of tokens that occurs within enwik8.

Would the requirement of "A different adversarial prompt to program for the next token after that" be fulfilled by prompting with "InInIn[...]InInIn" then " 2004 2004 2004[...] 2004 2004 2004" then ",,,[...],,," and so on? If so, I do concur that that is a thing you can do. There (infamously) exist tokens which GPT-3 can't repeat back, but I am pretty sure none of them occur within enwik8.
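For concreteness, a sketch of how such per-token repetition prompts might be constructed. The repetition count is an arbitrary assumption; the idea is just that a long run of one token makes that token the overwhelmingly likely continuation:

```python
# Sketch of the per-token "adversarial prompt" idea discussed above: for
# each target token, the prompt is that token's text repeated many times,
# so the model's most likely continuation is the token itself. The
# repetition count of 50 is an arbitrary assumption.
def repetition_prompts(target_token_texts, reps=50):
    return [tok * reps for tok in target_token_texts]
```

For the Agassi sentence this would yield one prompt per token: a run of "In", then a run of " 2004", then a run of ",", and so on through the sequence.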

When you say

while to be interesting in any substantive sense you'd need it to be a single fixed prompt which elicits accurate predictions of every token

I am interpreting that as a claim that

There exists some single template T (of length k tokens), such that if you take the n-(2048-k)th to n-1th tokens of enwik8, populate T with those tokens, and then feed the resulting string into GPT-3, the predicted next token will be token n of enwik8.

If that's what you're saying, I agree that that's a more interesting question, and probably closer to the intent of OP.

Although even with that question, I am still quite certain that the answer is "no": even the best predictor tested (text-davinci-003) still takes 0.56 bits per token, and enwik8 is a bit over 29M tokens, so I'd expect it to take somewhere in the ballpark of 16M bits to encode the entirety of enwik8. Meanwhile, there are only 50258^2048 possible template patterns (50257 tokens plus "empty space" for each position in the context), which means you only get about 32k bits of information with which to influence the model to correctly make 16M bits' worth of decisions.
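The back-of-the-envelope arithmetic in that paragraph checks out:

```python
# Back-of-the-envelope check of the information budget described above.
import math

bits_per_token = 0.56    # text-davinci-003, measured earlier in the thread
enwik8_tokens = 29e6     # enwik8 is a bit over 29M tokens
vocab_choices = 50258    # 50257 GPT tokens plus "empty space" per slot
context = 2048

bits_to_encode = bits_per_token * enwik8_tokens        # ~16M bits
bits_in_template = context * math.log2(vocab_choices)  # ~32k bits
```

The template's ~32k bits fall short of the ~16M bits needed by a factor of about 500, which is the core of the argument.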

Regarding the first one: I am not expecting a single prompt to generate the entirety of enwik8/9. I am more interested in finding a set of prompts, with a lookup table if possible, to replicate the enwik data.

Thanks for the pointer to the Chinchilla post; I will look into it.

The well-known early experiment in creating a virtual Linux machine inside ChatGPT, linked to a virtual Internet, includes generation of a fictitious web page by viewing it through the lynx browser. 

In my first attempt, using lynx as the second command didn't produce a page. But on a second attempt, I tried pinging a domain, in order to ease it into displaying Internet data, and after that it was willing to display fictitious versions of Wikipedia pages. They don't particularly match the original (not even the September 2021 version, from the time of ChatGPT's knowledge cutoff), but they are on topic. 

Possibly this approach can be refined. 

Really interesting idea.

The machine learning world is doing a lot of damage to society by confusing "is" with "ought", which, within AIXI, is equivalent to confusing its two unified components: Algorithmic Information Theory (compression) with Sequential Decision Theory (conditional decompression). This is a primary reason the machine learning world has failed to provide anything remotely approaching the level of funding for the Hutter Prize that would be required to attract talent away from grabbing all of the low-hanging fruit in the matrix-multiply hardware lottery branches, while failing to water the roots of the AGI tree. So the failure is in the machine learning world -- not the Hutter Prize criteria. There is simply no greater potential risk-adjusted return on investment for the machine learning world than increasing the size of the prize purse for the Hutter Prize. To the extent that clearing up confusion about AGI in politics would benefit society, there is a good argument to be made that the same can be said for the world in general.

This is because 1) the judging criteria are completely objective (and probably should be automated) and 2) the judging criteria are closely tied to the ideal "loss function" for epistemology: the science of human knowledge.

The proper funding level would be at least 1% of the technology development investments in machine learning.

This reminds me of an idea: I think it would be great to hold a bi-monthly competition where people try to do something as incredible as possible in just 30 minutes using LLMs or other AIs, with the winner decided by a select few judges.