I wouldn't call the dataset comprehensive exactly, but it's plausibly representative - it's Internet Archive snapshots of "pornhub.com" from 2008-2023. You can see the script here: https://github.com/dhealy05/semen_and_semantics/blob/main/data_retrieval/fetch_snapshots.py. I wrote an "HTMLParser" base class with per-year subclasses (e.g. "Parser2010") to put the data into a common format across years. The data and embeddings are in the repo if you want to use them without running a script.
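To give a rough idea of the base-class-plus-subclass pattern, here is a minimal sketch. Only the names "HTMLParser" and "Parser2010" come from the description above; the `VideoRecord` type, the `extract_titles` method, and the markup the regex matches are all hypothetical, and the real code in the repo may look quite different.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class VideoRecord:
    """Hypothetical common format shared by every year's parser."""
    year: int
    title: str


class HTMLParser(ABC):
    """Base class: subclasses only describe how to find titles in that year's markup."""
    year: int

    @abstractmethod
    def extract_titles(self, html: str) -> List[str]:
        ...

    def parse(self, html: str) -> List[VideoRecord]:
        # Normalize whatever the subclass found into the common record type.
        return [VideoRecord(self.year, t.strip()) for t in self.extract_titles(html)]


class Parser2010(HTMLParser):
    year = 2010
    # Assumed markup for illustration only, not the real 2010 page structure.
    _pattern = re.compile(r'<a class="title"[^>]*>(.*?)</a>', re.S)

    def extract_titles(self, html: str) -> List[str]:
        return self._pattern.findall(html)


sample = '<li><a class="title" href="/v/1"> Example title </a></li>'
records = Parser2010().parse(sample)
```

Supporting another year's layout then only means adding another small subclass, while everything downstream keeps reading the same record type.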
Interesting ideas - I truncated the readme for LessWrong, but my "Future Work" section is
"Analyze trends by "minutes watched" by weighting for views, view X length; this is more likely a heuristic for content production than actual viewing time"
So while I'm not sure about what those results would look like, I agree there's an angle there.
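As a rough sketch of what that weighting might look like: instead of counting titles, each title contributes views × length, and a theme's share is its weighted total over the year's weighted total. The field names (`views`, `length_min`, `matches`) and the toy records are made up for illustration and are not the repo's schema.

```python
from collections import defaultdict


def minutes_share_by_year(videos):
    """Share of views × length per year attributable to titles flagged as matching a theme."""
    total = defaultdict(float)
    matched = defaultdict(float)
    for v in videos:
        weight = v["views"] * v["length_min"]  # proxy for "minutes watched"
        total[v["year"]] += weight
        if v["matches"]:
            matched[v["year"]] += weight
    return {y: matched[y] / total[y] for y in sorted(total) if total[y] > 0}


# Toy records, purely illustrative.
videos = [
    {"year": 2015, "views": 100, "length_min": 10, "matches": True},
    {"year": 2015, "views": 300, "length_min": 10, "matches": False},
]
shares = minutes_share_by_year(videos)
```

Here one of two titles matches (50% by count), but it only carries 25% of the views × length weight, which is the kind of gap this analysis would surface.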
It's hard to pin down the reasons for the big clusters with complete certainty, but this is a basically plausible story. Other macro factors I have mulled include FOSTA/SESTA - I find the timing interesting, given that it was one of the only major pieces of porn-centric legislation in the last 10 years and it took place right around the time of the big jump - and Nick Kristof's 2020 investigation, which clearly shows up in the data but did not dislodge the main trend.
So we actually see a decline in semantic relationship to "child" starting around 2020, almost certainly as a result of Nicholas Kristof's "The Children of Pornhub" NYT investigation that year. Pornhub removed many titles and instituted stricter rules. Note that "child" is not a keyword, it's a semantic proxy, so it's capturing some of the "youth" or "teen" trend.
Hi Shankar, I will concede it is editorializing: these are my conclusions based on the data. As to whether or not it is bizarre, I will repost my response to Tao Lin:
""Incest is not a subcategory of sexual violence" is something of a loaded statement. Many "stepsister" videos highlight a certain kind of appearance and context: young, with a backpack, possibly braces, possibly in a setting in which they are still under the authority of a supervising adult ("mom and dad" etc). The implication, left unsaid, is that they are under the age of consent, which qualifies as statutory rape in America. IMO it's sufficient justification to include it in the same category."
You're right that there are other, concurrent trends in tone and quality to measure. "Lighthearted whimsy", which I might call more along the lines of "surreal", is not necessarily in contradiction with violence.
Hi Gram, I'd be interested in your theory if you'd like to offer it! Generally my feeling is that the realm of porn analysis lacks good data, and thanks to the relatively new surplus of text embeddings we now have more.
Point A: Yes.
Point B: The data here demonstrates that porn has gotten more extreme in a quantifiable way. I would hesitate to ascribe a high degree of "agency" or "intentionality" to the trends. It seems to me you are reasoning backwards: porn has gotten more extreme, so that's what people wanted. In the aggregate you are correct, but my point is that market participants drive the market, and most porn viewers are not in fact market participants (or are only peripherally so).
You're correctly describing the underlying experience for a certain cohort of porn-viewing individuals. The mechanism through which it takes place at scale is what I'm interested in: high-spending, high-engagement consumers go through the process you describe, and production companies, which rely on their spending, tailor their content accordingly. Assuming "we" and "our" here is wrong IMO - I don't think this is a universal principle of porn viewers, it's just that those viewers shift the market in their direction.
Hi jbash, I dove a little deeper on my title accuracy system here: https://github.com/dhealy05/semen_and_semantics/blob/main/analysis_results/title_accuracy_logs/title_accuracy_readme.md but didn't account for it when I transferred the readme to the LessWrong format.
The rating system is a human in the loop: me. So it was my judgement call as to what the title accuracy entailed. My goal was to provide tools so that other interested parties would be able to make their own assessment, and that they could check my logs to verify. The logs are all included in that folder.
For example, to rate the videos surfaced in https://github.com/dhealy05/semen_and_semantics/blob/main/analysis_results/title_accuracy_logs/incest_title_accuracy.json I visited each URL and rated it on the 1-5 scale. "Cum in panties step sister" did not seem to involve a step sister, so I gave it a 1: SEO effect. "Kinky Family - Home alone with slutty stepsis" does indeed seem to involve a stepsis-oriented plot, so I gave it a 5: no SEO effect.
Hi Richard, yes, certain keywords are banned. What I'm measuring is semantic similarity. For example, a video titled "rape" will be banned, but a video suggesting rape may not be. By using text embeddings, we're finding the titles most similar to the concept of rape. To find trends over time, we're counting how many of those titles are found per year, weighted by the total number of titles in a year.
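A minimal sketch of that counting step, using toy 2-D vectors in place of real text embeddings. The similarity threshold is an assumption for illustration (the actual analysis may pick the nearest titles differently), and I'm reading "weighted by the total number of titles" as dividing each year's count by that year's total.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def yearly_share(titles, concept_vec, threshold):
    """titles: (year, embedding) pairs. Returns the fraction of each year's
    titles whose similarity to the concept is at least `threshold`."""
    totals, hits = Counter(), Counter()
    for year, vec in titles:
        totals[year] += 1
        if cosine(vec, concept_vec) >= threshold:
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Toy vectors only; real embeddings have hundreds or thousands of dimensions.
concept = [1.0, 0.0]
titles = [
    (2010, [1.0, 0.0]), (2010, [0.0, 1.0]), (2010, [0.0, 1.0]), (2010, [0.1, 1.0]),
    (2020, [1.0, 0.1]), (2020, [0.9, 0.2]), (2020, [0.0, 1.0]), (2020, [0.0, 1.0]),
]
trend = yearly_share(titles, concept, threshold=0.8)
```

Because the count is divided by each year's total, a year with more snapshots doesn't automatically look like a year with more of the concept.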
With respect to certain keywords, we see a decline in trends starting after 2020, likely because of Nicholas Kristof's NYT piece "The Children of Pornhub", which led to both stricter keyword standards and a mass removal of videos. "Drunk" and "coma" capture "incapacitation" as a euphemism, which was used as a way to get around explicit keyword policing.
The fact that we do see declines in some areas and we have a known cause leads me to believe the data is reliable - it's not all showing a line straight up.