The aim of the Hutter Prize is to compress the first 1GB of Wikipedia to the smallest possible size. From the AIXI standpoint, compression is equivalent to AI, and if we can compress this file to the ideal size (75MB according to Shannon's lower estimate), then the compression algorithm is equivalent to AIXI.
However, all the winning solutions so far are based on arithmetic coding and context mixing. These solutions hold little relevance to mainstream AGI research.
Current-day LLMs are really powerful models of our world. And it is highly likely that most LLMs are trained on data that includes the first 1GB of Wikipedia. Therefore, with appropriate prompting, it should be possible to extract most or all of this data from the LLMs.
I am looking for ideas/papers which would help me validate whether it is possible to extract pieces of Wikipedia via prompting with any publicly available LLMs.
Update: I have recently learned about GPT-4's compression abilities via prompting, and I am very keen to test this method. If anyone is willing to work with me (as I do not have access to GPT-4), that would be great.
There are a few things that resemble what I think you might be asking for.
The first thing it sounds like you may be asking is as follows:
The answer is a flat "no" for the entirety of enwik8 for any of the main OpenAI models, which can be verified by passing in a random full-context-worth (2048 tokens) of enwik8 and verifying that the completion is not "the rest of enwik8". The models cannot look back before their context, so if any full context in the middle fails to generate the next part, we can conclude that the model won't generate the full thing on its own for any initial prompt. Which is not too surprising.
The second thing, and the thing I expect you mean, is
to which the answer is "yes". It is possible to get a measure of bits per character for a given model to generate a given sequence of tokens, which equally means it is possible to generate that exact text using the model and the corresponding number of bits. Some examples:
enwik8, and nothing else, would cost 5.17 bits per char, with a model that weighs a few kB (
text-davinci-003 can get you down to 0.56 bits per token, but the model itself weighs several GB (openai:text-davinci-003 below).
Anyway, I threw together a quick script to get approximate bits-per-char to reproduce enwik8, excluding model size. Here are the results, high-to-low (std deviation across 10 trials in parens). Keep in mind that "store the entirety of enwik8 uncompressed" costs 100MB.
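For reference, here is a minimal sketch of how that bits-per-char figure is computed. The per-token log-probabilities would come from the model (e.g. an API's logprobs field); the numbers below are a toy illustration, not the actual measurements:

```python
import math

def bits_per_char(token_logprobs, text):
    """Total code length (in bits) a model assigns to `text`,
    divided by the number of characters.

    token_logprobs: natural-log probabilities the model assigned
    to each successive token of `text`."""
    total_bits = -sum(token_logprobs) / math.log(2)
    return total_bits / len(text)

# Toy example: 4 tokens, each assigned probability 1/2 (1 bit each),
# over a 10-character string -> 4 bits / 10 chars = 0.4 bits/char.
lps = [math.log(0.5)] * 4
print(bits_per_char(lps, "0123456789"))  # → 0.4
```

A model that assigns higher probability to the true continuation yields a shorter code, which is exactly the compression-equals-prediction correspondence the Hutter Prize relies on.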
If this was what you had in mind, you would probably also be interested in the post Chinchilla's wild implications, which goes a lot deeper into this sort of stuff.
Strictly speaking, it would be surprising, because it would mean that there are no adversarial examples or possible prompt-tuning which can produce specified arbitrary text, even text that is of a very high starting likelihood like a WP entry. And this despite being able to work with ~50k^2048 possible inputs.
(It would be like saying that a face-generating GAN can't produce a specific face that it was trained on, no matter how hard you optimized the z or how much vastly larger the dimensionality of the z is compared to the true face manifold. If you showed me a StyleGAN face-generator which had been trained on Obama, and that it completed half of Obama's face with a different face, I would not be surprised; I would be very surprised if you told me you had somehow proven that not only did it not infill Obama on that specific example (fine, unsurprising), there was no infill or possible z whatsoever that it could generate Obama's face (extremely surprising).)
I don't think it's surprising that there exist sequences of tokens that no amount of prompt tuning can make the model generate, as long as the sequence is allowed to be longer than the model's context window; in fact, I think it would be surprising if this was a thing you could reliably do.
For example, let's build a toy model where the tokens are words, and the context length is 10. This model has been trained on the following string.
If you start with the string
the model will complete with "home.", at which point the context is
and the model completes with "On", changing the context to
If the model completes with "Tuesday" it will end up stuck in a loop of
which does not succeed at the task of "complete the input". The outcome is similar if it chooses to complete with "Wednesday" instead.
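This failure mode can be sketched in a few lines. The corpus below is made up (the actual example string above is not shown), and the context window is shrunk to 3 words, but the mechanism is the same: a greedy completer whose context repeats will loop forever and never reach the end of its training data.

```python
from collections import defaultdict, Counter

CTX = 3  # toy context window (stand-in for the 10-word window above)

def train(words, ctx=CTX):
    """Count next-word frequencies for each exact context window."""
    model = defaultdict(Counter)
    for i in range(1, len(words)):
        key = tuple(words[max(0, i - ctx):i])
        model[key][words[i]] += 1
    return model

def complete(model, prompt, steps, ctx=CTX):
    """Greedy decoding: always emit the most frequent next word
    (ties go to the continuation seen first in training)."""
    out = list(prompt)
    for _ in range(steps):
        key = tuple(out[-ctx:])
        out.append(model[key].most_common(1)[0][0])
    return out

# The context "b c d" is followed by "b" early in the corpus and by
# "e" at the end; greedy decoding picks the first-seen continuation.
corpus = "a b c d b c d e".split()
model = train(corpus)
print(complete(model, ["a", "b", "c"], 8))
# → ['a', 'b', 'c', 'd', 'b', 'c', 'd', 'b', 'c', 'd', 'b']  (never reaches 'e')
```

Once the completion enters the "b c d" cycle, the prompt that got it there no longer matters: the next word depends only on the last CTX words, exactly as with the "Tuesday" loop above.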
This happens because the task is specifically "output the entirety of enwik8", where "the entirety of enwik8" is a string that is much longer than the context window. No matter what your initial prompt was, once the prompt has succeeded at the task of "output the next 2048 tokens of enwik8", your prompt no longer has any causal impact on the completion - "feed the model a prompt that causes it to output the first 2048 tokens of enwik8 and ask it for completions" and "feed the model the first 2048 tokens of enwik8 and ask it for completions" are operations which yield the same result.
If we had a GPT-X which somehow had a non-bounded context window, I suspect it would be possible to get it to output the full text of enwik8 (I expect that one way to do this would be "fill its context window with enwik8 repeated a bunch of times in a row").
But that's a different question from trying to generate a WP article's next token, which is what you defined as failure. You said that even with the full context, it couldn't predict the next token. I'm pointing out that you can almost certainly find an adversarial prompt (plus a shortened prefix) which programs it to predict the right next token. It's just that you would need a different adversarial prompt for the next token after that, and then the next, and so on, in what is presumably some sort of sliding window over all of Wikipedia, while to be interesting in any substantive sense you'd need a single fixed prompt which elicits accurate predictions of every token. The question is whether there is one fixed prompt, not whether there is any prompt which would have correctly predicted the next token for your specific article example. The latter is almost certainly true (demonstrating a weakness in your counterexample), and the former almost certainly false.
I said that if the model fails to predict the next token for any full context window of 2048 enwik8 tokens, then it fails at the task of "output the whole string of enwik8 in one go".
It was not a very interesting statement on what a language model cannot do -- all it means is "GPT has not literally memorized enwik8 to the point where it can regurgitate the whole thing perfectly".
I'm not entirely sure I understand what you're saying here.
Let's say we have the text
"In 2004, the 34-year-old Agassi won the [[Cincinnati Masters]] to bring his career total to 59 top-level singles titles", which is a sequence of tokens that occurs within enwik8.
Would the requirement of "A different adversarial prompt to program for the next token after that" be fulfilled by prompting with "InInIn[...]InInIn" then " 2004 2004 2004[...] 2004 2004 2004" then ",,,[...],,," and so on? If so, I do concur that that is a thing you can do. There (infamously) exist tokens which GPT-3 can't repeat back, but I am pretty sure none of them occur within enwik8.
When you say
I am interpreting that as a claim that
If that's what you're saying, I agree that that's a more interesting question, and probably closer to the intent of OP.
Although even with that question I am still quite certain that the answer is "no", because even the best predictor tested (text-davinci-003) still takes 0.56 bits per token, and enwik8 is a bit over 29M tokens, so I'd expect it to take somewhere in the ballpark of 16M bits to encode the entirety of enwik8. Meanwhile, there are only 50258^2048 possible template patterns (50257 tokens plus "empty space" for each position in the context), which means you only get about 32k bits of information with which to influence the model to correctly make 16M bits' worth of decisions.
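The arithmetic here is easy to sanity-check (the 0.56 bits/token and 29M-token figures are the measurements quoted above):

```python
import math

tokens = 29e6          # enwik8 is a bit over 29M GPT tokens
bits_per_token = 0.56  # best measured predictor (text-davinci-003)
encode_bits = tokens * bits_per_token
print(f"{encode_bits / 1e6:.1f}M bits to encode enwik8")        # → 16.2M

vocab_plus_empty = 50258  # 50257 tokens + "empty space" per slot
context = 2048
prompt_bits = context * math.log2(vocab_plus_empty)
print(f"{prompt_bits / 1e3:.1f}k bits of prompt information")   # → 32.0k
```

So the prompt can carry roughly 32k bits of information, about 500 times too little to pin down the ~16M bits of decisions needed.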
Regarding the first one, I am not expecting a single prompt to generate the entirety of enwik8/enwik9. I am more interested in finding a set of prompts, with a lookup table if possible, to replicate the enwik data.
Thanks for the pointer to the Chinchilla post, I will look into it.
The well-known early experiment in creating a virtual Linux machine inside ChatGPT, linked to a virtual Internet, includes generation of a fictitious web page by viewing it through the lynx browser.
In my first attempt, using lynx as the second command didn't produce a page. But on a second attempt, I tried pinging a domain, in order to ease it into displaying Internet data, and after that it was willing to display fictitious versions of Wikipedia pages. They don't particularly match the original (not even the September 2021 version, from the time of ChatGPT's knowledge cutoff), but they are on topic.
Possibly this approach can be refined.
Really interesting idea.
The machine learning world is doing a lot of damage to society by confusing "is" with "ought", which, within AIXI, is equivalent to confusing its two unified components: Algorithmic Information Theory (compression) and Sequential Decision Theory (conditional decompression). This is a primary reason the machine learning world has failed to provide anything remotely approaching the level of funding for the Hutter Prize that would be required to attract talent away from grabbing all of the low-hanging fruit in the matrix-multiply hardware-lottery branches, while failing to water the roots of the AGI tree. So the failure lies with the machine learning world, not the Hutter Prize criteria. There is simply no greater potential risk-adjusted return on investment for the machine learning world than increasing the size of the Hutter Prize purse. To the extent that clearing up confusion about AGI in politics would benefit society, a good argument can be made that the same holds for the world in general.
This is because 1) the judging criteria are completely objective (and probably should be automated), and 2) the judging criteria are closely tied to the ideal "loss function" for epistemology: the science of human knowledge.
The proper funding level would be at least 1% of the technology development investments in machine learning.
This reminds me of an idea: I think it would be great to hold a bi-monthly competition where people try to do something as incredible as possible in just 30 minutes using LLMs or other AIs, with the winner decided by a select panel.