Well, it's deterministic, right? The reaction is preordained.
Ha, the italics definitely show their face.
What stands out as LLM style to you?
Thanks for the response, Mitchell! It is indeed wholly AI-generated (claude-sonnet-4-5-20250929), so the work is a valid ClaudoBiography. Structurally it inherits from Lazarillo de Tormes (1554), the original "picaresque", and constitutes an update of sorts to that work.
Hi Richard, yes, certain keywords are banned. What I'm measuring is semantic similarity. For example, a video titled "rape" will be banned, but a video suggesting rape may not be. By using text embeddings, we find the titles most similar to the concept of rape. To find trends over time, we count how many of those titles appear per year, weighted by the total number of titles in that year.
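For concreteness, here's a minimal sketch of that measurement, assuming the titles have already been embedded as numpy vectors and tagged with a year; the function names and the similarity threshold are illustrative rather than from the repo (top-k nearest titles would work just as well as a threshold):

```python
import numpy as np
from collections import Counter

def cosine_sim(a, b):
    # Standard cosine similarity between two embedding vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def yearly_trend(concept_vec, title_vecs, title_years, threshold=0.5):
    """Fraction of each year's titles whose embedding is close to the concept."""
    hits, totals = Counter(), Counter()
    for vec, year in zip(title_vecs, title_years):
        totals[year] += 1
        if cosine_sim(concept_vec, vec) >= threshold:
            hits[year] += 1
    # Weight the raw counts by the total number of titles in that year
    return {year: hits[year] / totals[year] for year in sorted(totals)}
```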
With respect to certain keywords, we see a decline starting after 2020, likely because of Nicholas Kristof's NYT piece "The Children of Pornhub", which led to both stricter keyword standards and a mass removal of videos. "Drunk" and "coma" capture "incapacitation" as a euphemism, used to get around explicit keyword policing.
The fact that we do see declines in some areas, and that we have a known cause for them, leads me to believe the data is reliable - it's not all showing a line straight up.
I wouldn't call the dataset comprehensive exactly, but it's plausibly representative - it's Internet Archive snapshots of "pornhub.com" from 2008-2023. You can see the script here: https://github.com/dhealy05/semen_and_semantics/blob/main/data_retrieval/fetch_snapshots.py. I wrote an "HTMLParser" base class with per-year subclasses (e.g. "Parser2010") to put the data into a common format across years. The data and embeddings are in the repo if you want to use them without running a script.
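The parser hierarchy looks roughly like this; only the class names come from the repo, while the method names, output format, and CSS selector here are placeholders:

```python
from abc import ABC, abstractmethod
from bs4 import BeautifulSoup

class HTMLParser(ABC):
    """Base class: each snapshot year gets a subclass that knows its markup."""

    @abstractmethod
    def extract_titles(self, html: str) -> list[str]:
        ...

    def parse(self, html: str) -> dict:
        # Common output format shared across all years
        return {"titles": self.extract_titles(html)}

class Parser2010(HTMLParser):
    def extract_titles(self, html: str) -> list[str]:
        # Year-specific extraction; "a.title" is a stand-in selector
        soup = BeautifulSoup(html, "html.parser")
        return [a.get_text(strip=True) for a in soup.select("a.title")]
```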
Interesting ideas - I truncated the readme for LessWrong, but my "Future Work" section is:
"Analyze trends by 'minutes watched', weighting by views × length; this is more likely a heuristic for content production than actual viewing time"
So while I'm not sure what those results would look like, I agree there's an angle there.
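I haven't built that yet, but the weighting could look something like this sketch (reusing cosine_sim from the snippet above; the record format is hypothetical):

```python
def weighted_yearly_trend(records, concept_vec, threshold=0.5):
    """records: iterable of (embedding, year, views, length_minutes) tuples."""
    hits, totals = {}, {}
    for vec, year, views, length in records:
        weight = views * length  # proxy for "minutes watched"
        totals[year] = totals.get(year, 0) + weight
        if cosine_sim(concept_vec, vec) >= threshold:
            hits[year] = hits.get(year, 0) + weight
    # Each year's score: weighted hits over total weighted minutes that year
    return {y: hits.get(y, 0) / totals[y] for y in sorted(totals)}
```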
It's hard to parse the reasons for the big clusters with complete certainty, but this is a basically plausible story. Other macro factors I've mulled include FOSTA/SESTA - the timing is interesting, given that it was one of the only major pieces of porn-centric legislation in the last 10 years and took place right around the time of the big jump - and Nick Kristof's 2020 investigation, which clearly shows up in the data but did not dislodge the main trend.
So we actually see a decline in semantic relationship to "child" starting around 2020, almost certainly as a result of Nicholas Kristof's "The Children of Pornhub" NYT investigation that year. Pornhub removed many titles and instituted stricter rules. Note that "child" is not a keyword, it's a semantic proxy, so it's capturing some of the "youth" or "teen" trend.
Good point, poor choice of words on my part. I meant, for example:
UserMessage: Hello
GPTMessage: Hello! How may I assist you?
UserMessage: I need directions to the mall.
Is actually something like
A = ["Hello"], B = ["Hello", "Hello! How may I assist you?", "I need directions to the mall"]
And if you send A to the model with the same params, you'll get the same probability distribution back for every trial. (Same actual response for temperature == 0; in practice mostly the same response for temperature > 0.) Ditto for B. The context is not independent, but the model responses are independent of each other.
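In code terms (using the OpenAI Python client purely as an illustration; the model name is arbitrary):

```python
from openai import OpenAI

client = OpenAI()

# The "conversation" is just a list that gets re-sent in full on every
# call; the model itself holds no state between calls.
A = [{"role": "user", "content": "Hello"}]
B = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How may I assist you?"},
    {"role": "user", "content": "I need directions to the mall."},
]

def sample(messages):
    resp = client.chat.completions.create(
        model="gpt-4o",    # illustrative model name
        messages=messages,
        temperature=0,     # greedy decoding
    )
    return resp.choices[0].message.content

# Repeated calls with the same list are independent trials over the same
# distribution; at temperature == 0 they return the same text each time.
assert sample(A) == sample(A)
assert sample(B) == sample(B)
```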