Which parts of the existing internet are already likely to be in (GPT-5/other soon-to-be-trained LLMs)'s training corpus?

by AnnaSalamon

29th Mar 2023

In particular:

If a book is in easily accessible parts of the web, but only as a PDF of scanned page images rather than as text, is it likely to be in the training corpus of (GPT-5/similar) in roughly the same way (with roughly the same effects) as if it were on easily accessible webpages?
How about if a book is on libgen as a PDF (illegally), but not on the easily accessible parts of the web?
How about if the ideas are in a video (e.g. a podcast): is (GPT-5/similar) likely to process the text content of a video in roughly the same way if there's only a video, compared to if there's a video and a transcript?

Tags: Language Models (LLMs), AI

Parent Question: Are there specific books that it might slightly help alignment to have on the internet? (AnnaSalamon)

Douglas_Knight

I assume you know this, but to be clear, OpenAI has already used pirated books. GPT-3 was trained on "books2", which appears to be all the text on libgen (and pretty much all the books on libgen have been through OCR). It was weighted the same as Common Crawl, and lower than Gutenberg or Reddit links. This seems to answer your second question: they will likely treat PDFs on libgen the same as PDFs on the open web. If you're asking whether they will train the model on the pixels in these PDFs, which might make up for losses in OCR, I have no idea.
