Automated plagiarism detection software is common. But cases like the recent incident with Harvard president Claudine Gay have shown that egregious cases of plagiarism are still being uncovered. Why would this be the case? Is it really so hard to run plagiarism checks on every paper on Sci-Hub? Has anyone tried?

I am curious since I am currently upskilling for technical alignment research, and this seems like an interesting project to pursue.

2 Answers

Shankar Sivarajan

Dec 12, 2023

90

"The total size of Sci-Hub database is about 100 TB."

i.e. $1000-$2000 in drive space, or about $20/day to store on Backblaze if you don't anticipate needing it for more than a couple of months.
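
For concreteness, here is the arithmetic behind those figures (a rough sketch; the per-TB prices are ballpark assumptions, not vendor quotes):

```python
# Back-of-envelope storage costs for a ~100 TB corpus.
# Assumed prices (late-2023 ballpark): ~$15/TB for bulk hard drives,
# ~$6/TB/month for Backblaze B2 object storage.
CORPUS_TB = 100

drive_cost = CORPUS_TB * 15      # ~$1500 to own the drives outright
b2_monthly = CORPUS_TB * 6       # ~$600/month rented on B2
b2_daily = b2_monthly / 30       # ~$20/day

print(f"buy drives: ~${drive_cost}")
print(f"rent on B2: ~${b2_monthly}/month (~${b2_daily:.0f}/day)")
```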

1Shankar Sivarajan5mo
You're correct that simply storing the entire database is feasible. But as I understand it, the database is large enough that training a model on it is too expensive for most hobbyists to do just for kicks.
2faul_sname5mo
Depends on how big a model you're trying to train, and how you're trying to train it. I was imagining something along the lines of: download the full 100TB torrent, which includes 88M articles, and extract the text of each article ("extract text from a given PDF" isn't super reliable, but it should be largely doable), which should leave you somewhere in the ballpark of 4TB of uncompressed plain text. With a BPE tokenizer, that works out to ~1T tokens. If you're trying to do the Chinchilla-optimality thing, I fully agree that there's no way you're going to be able to do that with the compute budget available to mere mortals. But if you're trying to do "generate embeddings for every paragraph of every paper, do similarity searches, and then on matches calculate edit distance to see if it was literally copy-pasted," I think that'd be entirely doable on a hobbyist budget. I personally think it'd be a great learning project.
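
For scale: Chinchilla-style training on ~1T tokens would pair them with a ~50B-parameter model, i.e. roughly 6·N·D = 6 × 5e10 × 1e12 ≈ 3e23 FLOPs, on the order of ten thousand A100-days, so "mere mortals" indeed need not apply. The embed-search-verify route is far cheaper. Here is a minimal sketch of that loop on two local PDFs (the model choice, thresholds, and file names are illustrative assumptions, and paragraph splitting on real PDF extractions would need more care than this):

```python
# Sketch: embed every paragraph, similarity-search, then confirm
# near-verbatim copying with a cheap edit-distance-style ratio.
from difflib import SequenceMatcher

import faiss                                   # pip install faiss-cpu
import numpy as np
from pypdf import PdfReader                    # pip install pypdf
from sentence_transformers import SentenceTransformer

def pdf_paragraphs(path: str) -> list[str]:
    """Extract rough 'paragraphs' from a PDF (extraction is lossy)."""
    text = "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    return [p.strip() for p in text.split("\n\n") if len(p.strip()) > 200]

model = SentenceTransformer("all-MiniLM-L6-v2")        # small, hobbyist-friendly

# Index the corpus: one embedding per paragraph; normalized vectors
# make inner product equal to cosine similarity.
corpus = pdf_paragraphs("indexed_paper.pdf")           # stand-in for 88M papers
corpus_emb = np.asarray(
    model.encode(corpus, normalize_embeddings=True), dtype="float32")
index = faiss.IndexFlatIP(corpus_emb.shape[1])
index.add(corpus_emb)

# Query with paragraphs from the paper under suspicion.
query = pdf_paragraphs("suspect_paper.pdf")
query_emb = np.asarray(
    model.encode(query, normalize_embeddings=True), dtype="float32")
scores, ids = index.search(query_emb, 3)               # top-3 neighbors each

for qi, (row_scores, row_ids) in enumerate(zip(scores, ids)):
    for s, ci in zip(row_scores, row_ids):
        if ci < 0 or s < 0.85:                         # assumed similarity cutoff
            continue
        # Edit-distance-style check for literal copy-paste.
        ratio = SequenceMatcher(None, query[qi], corpus[ci]).ratio()
        if ratio > 0.8:                                # assumed "copied" threshold
            print(f"query para {qi} ~ corpus para {ci}: "
                  f"cos={s:.2f}, match={ratio:.2f}")
```

At the full 88M-paper scale a flat index won't fit in memory, so you'd presumably shard it or switch to an approximate index (e.g. FAISS IVF or HNSW), but the shape of the pipeline stays the same.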

Dagon

Dec 13, 2023

61

I think there are two reasons it's not more common to retroactively analyze papers and publications for copied or closely paraphrased segments.

First, it's not actually easy to automate. Current solutions are rife with false positives and still require human judgment to reach final conclusions.
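
To illustrate the false-positive problem: stock academic phrasing alone produces heavy word-n-gram overlap between sentences making completely unrelated claims, which is why purely lexical matchers need a human in the loop. A toy sketch (the sentences and numbers are invented for illustration):

```python
# Naive 5-gram Jaccard similarity flags boilerplate, not plagiarism.
def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

a = ("The results of this study suggest that further research is needed "
     "to confirm these findings in larger populations.")
b = ("Our analysis of soil samples indicates that further research is needed "
     "to confirm these findings in larger populations.")

ga, gb = ngrams(a), ngrams(b)
jaccard = len(ga & gb) / len(ga | gb)
print(f"5-gram Jaccard: {jaccard:.2f}")  # ~0.40 despite unrelated claims
```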

Second, and perhaps more importantly, nobody really cares outside of graded work, where the institution bases your credentials on your doing original work (and usually not even that, just a semi-original presentation of others' work).

It would probably be a minor scandal if any significant papers were discovered to be based on uncredited, un-footnoted work, but unless it were egregious (in which case it would probably already have been noticed), it's just not that big a deal.

Distinguishing between a properly cited paraphrase and passing off someone's work as your own without sufficient attribution is not trivial even for people. There's a lot of grey area in how closely you can mimic the original before it becomes problematic. (This is largely what I've seen Rufo trying to hang the Harvard administrator with: paraphrases that kept a lot of the original wording but were nonetheless clearly cited, which at least to me seems like bad practice but not actually plagiarism in the sense it's generally meant.) And it comes dow... (read more)

4 comments

Typo: plagerism -> plagiarism (4x, incl. in the title)

[-]Buck5mo30

I was just thinking about this. I think this would be a good learning experience!

Is plagiarism considered bad everywhere in the world, or is it an American foible? I vaguely recall reading, years ago, that in China it was not considered bad per se, and that this occasionally caused Chinese students problems with American academic institutions. However, I did not check the sources at the time, nor did I quantify the effect; I was a naive newspaper reader.

Standards have been going up over time, so grad students are unironically held to higher standards than university professors. I know of professors who have run English-language papers through Google Translate and published them in Chinese-language journals.