Automated plagiarism detection software is common. But cases like the recent incident with Harvard president Claudine Gay have shown that egregious cases of plagiarism are still being uncovered. Why would this be the case? Is it really so hard to run plagiarism checks on every paper on Sci-Hub? Has anyone tried?

I am curious since I am currently upskilling for technical alignment research, and this seems like an interesting project to pursue.

2 Answers

Shankar Sivarajan

Dec 12, 2023

90

"The total size of Sci-Hub database is about 100 TB."

i.e. $1000-$2000 in drive space, or about $20/day to store on Backblaze if you don't anticipate needing it for more than a couple of months.
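
For concreteness, here is the arithmetic behind those figures (a rough sketch; the per-TB prices are ballpark assumptions, not vendor quotes):

```python
# Back-of-envelope storage costs for a ~100 TB corpus.
# Assumed prices (late-2023 ballpark): ~$15/TB for bulk hard drives,
# ~$6/TB/month for Backblaze B2 object storage.
CORPUS_TB = 100

drive_cost = CORPUS_TB * 15      # ~$1500 to own the drives outright
b2_monthly = CORPUS_TB * 6       # ~$600/month rented on B2
b2_daily = b2_monthly / 30       # ~$20/day

print(f"buy drives: ~${drive_cost}")
print(f"rent on B2: ~${b2_monthly}/month (~${b2_daily:.0f}/day)")
```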

1Shankar Sivarajan5mo
You're correct that simply storing the entire database is feasible. But as I understand it, the database is large enough that training a model on it is too expensive for most hobbyists to do just for kicks.
2faul_sname5mo
Depends on how big a model you're trying to train, and how you're trying to train it. I was imagining something along the lines of: download the full 100TB torrent, which includes 88M articles, and extract the text of each article ("extract text from a given PDF" isn't super reliable, but it should be largely doable), which should leave you somewhere in the ballpark of 4TB of uncompressed plain text. With a BPE tokenizer, that works out to ~1T tokens. If you're trying to do the Chinchilla-optimality thing, I fully agree that there's no way you're going to be able to do that with the compute budget available to mere mortals. But if you're trying to do "generate embeddings for every paragraph of every paper, do similarity searches, and then on matches calculate edit distance to see if it was literally copy-pasted," I think that'd be entirely doable on a hobbyist budget. I personally think it'd be a great learning project.
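
For scale: Chinchilla-style training on ~1T tokens would pair them with a ~50B-parameter model, i.e. roughly 6·N·D = 6 × 5e10 × 1e12 ≈ 3e23 FLOPs, on the order of ten thousand A100-days, so "mere mortals" indeed need not apply. The embed-search-verify route is far cheaper. Here is a minimal sketch of that loop on two local PDFs (the model choice, thresholds, and file names are illustrative assumptions, and paragraph splitting on real PDF extractions would need more care than this):

```python
# Sketch: embed every paragraph, similarity-search, then confirm
# near-verbatim copying with a cheap edit-distance-style ratio.
from difflib import SequenceMatcher

import faiss                                   # pip install faiss-cpu
import numpy as np
from pypdf import PdfReader                    # pip install pypdf
from sentence_transformers import SentenceTransformer

def pdf_paragraphs(path: str) -> list[str]:
    """Extract rough 'paragraphs' from a PDF (extraction is lossy)."""
    text = "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    return [p.strip() for p in text.split("\n\n") if len(p.strip()) > 200]

model = SentenceTransformer("all-MiniLM-L6-v2")        # small, hobbyist-friendly

# Index the corpus: one embedding per paragraph; normalized vectors
# make inner product equal to cosine similarity.
corpus = pdf_paragraphs("indexed_paper.pdf")           # stand-in for 88M papers
corpus_emb = np.asarray(
    model.encode(corpus, normalize_embeddings=True), dtype="float32")
index = faiss.IndexFlatIP(corpus_emb.shape[1])
index.add(corpus_emb)

# Query with paragraphs from the paper under suspicion.
query = pdf_paragraphs("suspect_paper.pdf")
query_emb = np.asarray(
    model.encode(query, normalize_embeddings=True), dtype="float32")
scores, ids = index.search(query_emb, 3)               # top-3 neighbors each

for qi, (row_scores, row_ids) in enumerate(zip(scores, ids)):
    for s, ci in zip(row_scores, row_ids):
        if ci < 0 or s < 0.85:                         # assumed similarity cutoff
            continue
        # Edit-distance-style check for literal copy-paste.
        ratio = SequenceMatcher(None, query[qi], corpus[ci]).ratio()
        if ratio > 0.8:                                # assumed "copied" threshold
            print(f"query para {qi} ~ corpus para {ci}: "
                  f"cos={s:.2f}, match={ratio:.2f}")
```

At the full 88M-paper scale a flat index won't fit in memory, so you'd presumably shard it or switch to an approximate index (e.g. FAISS IVF or HNSW), but the shape of the pipeline stays the same.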

Dagon

Dec 13, 2023

61

I think there are two reasons it's not more common to retroactively analyze papers and publications for copied or closely paraphrased segments.

First, it's not actually easy to automate. Current solutions are rife with false positives and still require human judgment to reach final conclusions.
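
To illustrate the false-positive problem: stock academic phrasing alone produces heavy word-n-gram overlap between sentences making completely unrelated claims, which is why purely lexical matchers need a human in the loop. A toy sketch (the sentences and numbers are invented for illustration):

```python
# Naive 5-gram Jaccard similarity flags boilerplate, not plagiarism.
def ngrams(text: str, n: int = 5) -> set[tuple[str, ...]]:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

a = ("The results of this study suggest that further research is needed "
     "to confirm these findings in larger populations.")
b = ("Our analysis of soil samples indicates that further research is needed "
     "to confirm these findings in larger populations.")

ga, gb = ngrams(a), ngrams(b)
jaccard = len(ga & gb) / len(ga | gb)
print(f"5-gram Jaccard: {jaccard:.2f}")  # ~0.40 despite unrelated claims
```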

Second, and perhaps more importantly, nobody really cares outside of graded work, where the institution bases your credentials on your doing original work (and usually not even that, just a semi-original presentation of others' work).

It would probably be a minor scandal if any significant papers were discovered to be based on uncredited, un-footnoted work, but unless it were egregious (in which case it would probably already have been noticed), it's just not that big a deal.

Distinguishing between a properly cited paraphrase and passing off someone's work as your own without sufficient attribution is not trivial even for people. There's a lot of grey area in how closely you can mimic the original before it becomes problematic. (This is largely what I've seen Rufo trying to hang the Harvard administrator with: paraphrases that kept a lot of the original wording but were nonetheless clearly cited, which at least to me seems like bad practice but not actually plagiarism in the sense it's generally meant.) And it comes dow... (read more)

4 comments

Typo: plagerism -> plagiarism (4x, incl. in the title)

[-]Buck5mo30

I was just thinking about this. I think this would be a good learning experience!

Is plagiarism considered bad everywhere in the world, or is it an American foible? I vaguely recall reading, years ago, that in China it was not considered bad per se, and that this occasionally caused Chinese students problems with American academic institutions. However, I did not check the sources at the time, nor did I quantify the effect; I was a naive newspaper reader.

Standards have been going up over time, so grad students are unironically held to higher standards than university professors. I know of professors who have run English-language papers through Google Translate and published them in Chinese-language journals.