Comments

Easter Island, between its colonization by humans in 1000 CE and its worse colonization by Europeans in 1700 CE, had a maximum population of maybe 12,000. It’s one of the most remote islands in the world. In isolation from other societies, the islanders did develop a written language, in fact Polynesia’s only native written language.

They are claimed to have developed it in isolation; but you are providing a good argument for why some weak evidence for it should not convince us that a full-blown script sprang out of nowhere in what would be, by many orders of magnitude, the tiniest population ever to do so.

It's worth noting that in blindfold chess, you by definition don't see the board state, only the history (if you can remember it), and whether you played a legal move. It has a sort of mirror or dual game in the form of Kriegspiel: you can see your pieces but not the enemy's moves/history, and you again can try to play moves and will be told whether or not they are legal (you are expected to 'probe' with possible moves to gain information from their legality). This demonstrates that human players can play satisfying chess with not much more than legality plus some additional information. (It would be interesting to see whether, with enough practice, humans could play 'blindfold Kriegspiel' reasonably well, or whether that winds up being too difficult.)

I don't care about strength and have no use for it; several years of lifting later, I have gotten my noob gains and still have no use for strength (with the exception of possibly helping with some occasional back pains I used to have). Nothing in my daily life hinges on my deadlift doubling - not even carrying in the groceries.

(additional confirmation) Amazing. I wonder what completely insane things the other rare BPEs all get interpreted as? Could you loop over the BPE dict from #51k to #1* in a prompt like "Please define $BPE" to see what the most distant ones are? (Since there's 51k, which is a bit much to read through manually, maybe sort by edit-distance from the ordinary ASCII encoding: 'distribute' would have a very high edit-distance from 'SolidGoldMagikarp'.)
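Something like the following rough sketch would do it, assuming `tiktoken`'s GPT-2 encoding as a stand-in for the actual ~51k-entry BPE dict, and with `query_model()` a hypothetical placeholder for whichever completion API you have access to:

```python
import tiktoken  # pip install tiktoken; its GPT-2 encoding stands in for the ~51k BPE dict

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, so no external dependency is needed."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def query_model(prompt: str) -> str:
    """Hypothetical stand-in: swap in a real call to whatever completion API you use."""
    return "(model output goes here)"

enc = tiktoken.get_encoding("gpt2")              # rarest merges sit at the end of the dict
results = []
for token_id in range(enc.n_vocab - 1, 0, -1):   # walk from #51k back toward #1
    token = enc.decode([token_id])
    definition = query_model(f'Please define "{token}"')
    # Rank by how far the model's interpretation drifts from the literal token text,
    # e.g. 'distribute' vs. 'SolidGoldMagikarp' would score very high.
    results.append((edit_distance(token, definition), token, definition))

results.sort(reverse=True)
print(results[:50])                              # the most 'distant' interpretations
```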

On a sidenote, this is yet another good illustration of how we have no idea what we're doing with deep learning - not only did no one predict this, it's obviously another Riley-style steganographic or sidechannel attack: just find rare BPEs and construct a code out of whatever bizarre things the model learned.

* I believe BPEs are supposed to be defined in 'order' of compression improvement, so the strangest BPEs should be at the end of the list.

You're probably thinking of the debate over ENCODE. It was a furious debate over what the ENCODE results meant, whether some mere chemical activity proved non-junkness, and whether they even measured the narrow chemical activity they claimed to measure and on which they based their interpretations; I didn't follow it in detail, but my overall impression was that most people were not convinced by the ENCODE claims and continue to regard junk DNA as being pretty junky (or outright harmful, with all the retrotransposons and viruses lurking in it).

Genome synthesis may help answer this in the not too distant future: it's already been used to create 'minimal organism' bacterial genomes which are much smaller, and synthetic genomes without the 'junk DNA' are appealing because synthesis costs so much that you want to cut corners as much as possible, so empirically proving that the junk DNA doesn't matter is an obvious and valuable goal.

gwern · 5d

OpenAI had generated poems in the New Yorker, which suggests they might have had some internal project related to poetry.

I didn't get that impression when I read it - the NYer author and his friends prompted most of that, even if their friend Dan Selsam happens to work at OpenAI. (He seems to work on math LMs, nothing fiction- or RL-related.) They were set up with the public Playground interface, so the OA insider role here was limited to showing them a few completions and trying to explain it; presumably they did the rest more remotely and partially on their own. Specifically, some parts of it, like the choice of Shel Silverstein (a far from obvious poet to pick, even if his fiction is beloved by American children), suggest they (like pretty much anyone interested in GPT-3 poetry) read my page for ideas. Also, again, Leike, who's in charge at OA, denies having done anything poetry-specific or knowing about the apparent capability-gain.

It maybe has a much more subtle version of it.

Yeah, that's a funny thing about mode collapse, it's really hard to see, and the higher-quality the outputs get, the harder it'll be to see with 'the naked eye'. Who knows every literary genre there is and can patiently prompt them one by one to see which genres a model quietly slides away from & tries to avoid generating text in? Like hands in GANs... It takes a while to begin to notice what you aren't seeing.

Of course, I'd also expect Claude to be much subtler simply because it's working off less data and so it's less likely to have gotten rated text or inputs which would push it towards mode-collapsing on easily-recognized rhyming poetry and to avoid harder-to-understand poetry. (Claude is just the 'constitutional prompt' model, right? Hard to see how a list of generic principles would push it towards rhyming-only.)

Does anyone know how much poetry and literary prose is in the pre-training sets aside from stuff in Common Crawl?

OA has been resolutely silent about the composition of datasets like Books1/Books2. But it seems safe to say that they would include all the obvious sources like Project Gutenberg, so there is much more poetry/literary prose available than necessary. Sample size should not be an issue. (Rhyming really is not that complex, if you understand phonetics.)

gwern · 5d

A GPT-3 mode-collapse example I can't believe I forgot: writing rhyming poetry!

I and a number of other people were excited by ChatGPT on launch seeming able to do long stretches of flawless rhyming poetry in couplets or quatrains, where the rhyming words were not the hackneyed common pairs of the sort you might see in the lyrics of charting pop songs. Hilarious, but extremely surprising. (davinci-002 had done a little bit of this, but not convincingly the way ChatGPT overnight did.*) Leike on Twitter denied any knowledge of rhyming suddenly working, and especially denied that anything special like adding rhyming dictionaries or IPA-re-encoding text had been done or that GPT-3 had switched tokenizations on the backend. So, had there been some sort of emergence, or 'miracle of spelling'?

After playing around with it for a while, my conclusion was: 'no'. ChatGPT does rhyming poetry in only one way, and it is difficult to make it try any other kind of poetry even with explicit instructions and examples and doing continuations. It doesn't understand novel rhymes or puns if you quiz it, and its explanations of them remain as highly varied and incorrect as the original davinci model's pun explanations were. This is not what any kind of fixed phonetic understanding or genuine rhyming ability would look like.

My conclusion was essentially, 'mode collapse': presumably some poetry examples made it into the training datasets (from my experiments, if nothing else), and because it's easy for any literate Anglophone to judge rhyming poetry but non-rhyming poetry is a lot harder to judge (and generally despised by most people, which is why the prestige & popularity of Western poetry over the past century has collapsed to a degree few people appreciate), it'd be logical for the raters to highly prefer rhyming completions. So ChatGPT mode-collapses onto the subset of rhymes it has memorized & tries to always rhyme no matter what. (This is probably not helped by the fact that due to BPEs, a GPT-3 model struggles to understand what is 'rhyming' vs 'non-rhyming' in the first place.)

The initial false impression that it had learned to rhyme is then because it does such a good job sticking to that subset, and because it has memorized more rhyme-pairs than I thought; so when it controls the output of text and is agentic, doing some degree of RL-incentivized planning to ensure both good lines and also rhymes†, it can fool you indefinitely as long as you don't test the boundaries or pull it 'off-policy', so to speak.

* which is, in retrospect, especially interesting if davinci-002 is trained differently from davinci-003

† I strongly suspect that whatever level of non-myopic token prediction a base GPT-3 model does, the tuned ones are doing more of it. Particularly with rhyming, ChatGPT seems too good to be picking a random plausible word at the end of line A and then scrambling at the last token at the end of line B for a plausible rhyme which also fits grammatically and semantically. Nothing is that good at rhyming. It is almost surely doing some degree of planning somewhere to make the end of line B match up with the end of line A.

https://twitter.com/volokuleshov/status/1619906183955095558 demos using ChatGPT to run a 'Python notebook' to 'train a neural net model' to 'predict' outputs for various functions like sin() or 5 / (1+x)**2 - 1.

The 'labels' aren't labels in the sense of being deliberately constructed in a controlled vocabulary to encode a consistent set of concepts/semantics, or even of being in the same language. In fact, in quite a few of the image-text pairs, the text 'labels' have nothing whatsoever to do with the image - they are a meaningless ID or spammer text or mojibake or any of the infinite varieties of garbage on the Internet, and the model just has to deal with that and learn to ignore those text tokens, predicting the image tokens purely from the available image tokens. (Note that you don't need text 'label' inputs at all: you could simply train the GPT model to predict image tokens solely from previous image tokens, in the same way GPT-2 famously predicts text tokens from previous text tokens.) So they aren't 'labels' in any traditional sense. They're just more data. You can train in the other direction to create a captioner model if you prefer, or you can drop them entirely to create a unimodal unconditional generative model. Nothing about them is special the way labels are special in supervised learning.

DALL-E 1 also relies critically on a VAE (the VAE is what takes the sequence of tokens predicted by the GPT and actually turns them into pixels, and it is also what creates the sequence of real tokens which the GPT was trained to predict), which was trained separately in the first phase: the VAE just trains to reconstruct images, pixels through bottleneck back to pixels, no label in sight.
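To make 'they're just more data' concrete, here is a minimal sketch of that second-phase training setup, assuming PyTorch; this is not OpenAI's code, and the vocabulary sizes, sequence lengths, and toy transformer are all invented for illustration. The text BPE tokens and the discrete-VAE image codes are simply concatenated into one sequence, and a single GPT-style model does ordinary next-token prediction over it; drop the text and you get an unconditional image model, reverse the order and you get a captioner.

```python
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192   # illustrative sizes, not the real ones
TEXT_LEN, IMAGE_LEN = 256, 1024         # e.g. a 32x32 grid of image codes

class TinyImageGPT(nn.Module):
    """One shared autoregressive transformer over [text tokens; image tokens]."""
    def __init__(self, d=256, layers=4, heads=4):
        super().__init__()
        vocab = TEXT_VOCAB + IMAGE_VOCAB           # image codes offset past the text vocab
        self.embed = nn.Embedding(vocab, d)
        self.pos = nn.Embedding(TEXT_LEN + IMAGE_LEN, d)
        block = nn.TransformerEncoderLayer(d, heads, 4 * d, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(d, vocab)

    def forward(self, seq):                         # seq: (batch, seq_len) of token ids
        n = seq.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        h = self.embed(seq) + self.pos(torch.arange(n))
        return self.head(self.blocks(h, mask=causal))

def training_step(model, text_tokens, image_codes):
    """Next-token prediction over the joint sequence. Drop the text tokens entirely
    for an unconditional image model; swap the order to train a captioner instead."""
    seq = torch.cat([text_tokens, image_codes + TEXT_VOCAB], dim=1)
    logits = model(seq[:, :-1])
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1))

# Toy usage: random integers stand in for the BPE text tokens (which may well be spam
# or mojibake) and for the codes produced by the separately-trained discrete VAE.
model = TinyImageGPT()
text = torch.randint(0, TEXT_VOCAB, (2, TEXT_LEN))
codes = torch.randint(0, IMAGE_VOCAB, (2, IMAGE_LEN))
loss = training_step(model, text, codes)
loss.backward()
```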
