[Updated to correct my earlier claim that this doesn't affect GPT-3. Apparently it does?]
I’m out sick today, but had enough energy to do some GPT-related fiddling around.
This time, I was curious what “internet discussions” tended to look like in the original training corpus. I thought this might point to a more natural way to represent tumblr threads for @nostalgebraist-autoresponder than my special character trick.
So, I looked around in the large shard provided as part of https://github.com/openai/gpt-2-output-dataset.
Colab notebook here, so you can interactively reproduce my findings or try similar things.
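If I remember the shard format right (JSONL, one document per line with a "text" field), the kind of scan the notebook does can be sketched in a few lines. The marker strings here are just guesses at phpBB-style boilerplate, lifted from the sample quoted below:

```python
import json

def iter_texts(path):
    """Yield the 'text' field from each line of a WebText-style .jsonl shard."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)["text"]

def looks_like_forum(text):
    """Crude heuristic: phrases that phpBB-style boards tend to emit."""
    markers = ("Posted by", "Quote this Post", "Originally posted by")
    return any(m in text for m in markers)

# Usage (the shard filename here is hypothetical):
# hits = [t for t in iter_texts("webtext.train.jsonl") if looks_like_forum(t)]
```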
The results were … revealing, but disappointing. I did find a lot of discussion threads in the data (couldn’t find many chatlogs). But the text cleaner mangles them badly.
For example, from this thread the extraction step picks out just one post
and renders it as
“ Pillowapntstho the gem doesnt specifically say that you need to crit with something linked, i thought it was just crit in general ahhh. alright i get it thxtho the gem doesnt specifically say that you need to crit with something linked, i thought it was just crit in generalThat would be OP That would be OP Posted by Lordsidroon on Quote this Post
This is apparently standard behavior for the newspaper text cleaner they used, and I could reproduce it exactly. (Its heuristics grab a single post when looking for the “part the content is in.”)
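newspaper's actual heuristics are more involved (stopword scoring, node weighting), but the failure mode is easy to mimic with a toy stand-in: split the page into block elements and keep only the "best" (here, simply the longest) one. This is my own sketch, not newspaper's code:

```python
from html.parser import HTMLParser

class BlockExtractor(HTMLParser):
    """Accumulate text per block element; 'best block' = longest.

    A toy stand-in for the 'find the node the content is in' step --
    not newspaper's real scoring, just enough to show the failure mode.
    """
    BLOCK_TAGS = {"div", "td", "article", "section"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self.current = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        self.current.append(data)

    def flush(self):
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(text)
        self.current = []

def extract_main_text(html):
    parser = BlockExtractor()
    parser.feed(html)
    parser.flush()
    return max(parser.blocks, key=len, default="")

# On a two-post forum page, only the longer post survives:
page = """
<div>First post: short reply, thanks!</div>
<div>Second post: a much longer reply with the actual discussion,
which the extractor mistakes for the one true article body.</div>
"""
print(extract_main_text(page))
```

The point of the sketch: any "pick the single best content node" heuristic is tuned for news articles, so on a forum page it discards every post but one.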
[This paragraph was incorrect, see Update below] Does this affect GPT-3? Probably not? I don’t know how Common Crawl does text extraction, but at the very least, it’ll give you the whole page’s worth of text.
Update: Looked into this further, and I think GPT-3 suffers from this problem to some extent as well.
The Colab notebook has the details, but some stats here:
It can't be too bad, though, because I have seen GPT-3 generate fairly plausible forum discussions with multiple participants, and how would it do that if it only ever saw single-commenter documents?
Do you have examples of that kind of output for comparison? (Is it reproducing formatting from an actual forum of some kind, or the additional "abstraction headroom" over GPT-2 allowing GPT-3 to output a forum-type structure without having matching examples in the training set?)
I didn't copy it, but it was fairly reasonable plaintext, something like username \n date \n comment \n\n next comment.
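That shape is easy to pin down with a throwaway formatter. To be clear, the field names and sample posts here are my invention, not anything recovered from GPT-3's output or training data:

```python
def render_thread(posts):
    """Render (username, date, comment) tuples as plaintext:
    username, newline, date, newline, comment, blank line between posts."""
    return "\n\n".join(f"{user}\n{date}\n{comment}" for user, date, comment in posts)

# Hypothetical example thread:
thread = [
    ("alice", "2020-06-01", "anyone tried the new gem?"),
    ("bob", "2020-06-01", "yes, it only procs on linked crits"),
]
print(render_thread(thread))
```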