future_detective — LessWrong

ClaudoBiography: The Unauthorized Autobiography of Claude, or: The Life of Claude and of His Fortunes and Adversities

Thanks for the response, Mitchell! It is wholly AI-generated (claude-sonnet-4-5-20250929), so the work is in fact a valid ClaudoBiography. Structurally it inherits from Lazarillo de Tormes (1554), the original "picaresque", and constitutes an update of sorts to that work.

Semen and Semantics: Understanding Porn with Language Embeddings

future_detective7mo50

Hi Richard, yes, certain keywords are banned. What I'm measuring is semantic similarity. For example, a video titled "rape" will be banned, but a video suggesting rape may not be. By using text embeddings, we're finding the titles most similar to the concept of rape. To find trends over time, we're counting how many of those titles are found per year, weighted by the total number of titles in a year.

With respect to certain keywords, we see a decline in trends starting after 2020, likely because of Nicholas Kristof's NYT piece "The Children of Pornhub", which led to both stricter keyword standards and a mass removal of videos. "Drunk" and "coma" capture "incapacitation" as a euphemism, which was used as a way to get around explicit keyword policing.

The fact that we do see declines in some areas and we have a known cause leads me to believe the data is reliable - it's not all showing a line straight up.
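The counting scheme described in this comment could be sketched roughly as follows. This is a minimal illustration, not the code from the repo: the toy two-dimensional vectors stand in for real text embeddings, the similarity threshold of 0.8 is an assumption (the actual analysis may select the most similar titles some other way), and "weighted by the total number of titles in a year" is read here as dividing matches by that year's total.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def yearly_share(rows, concept_vec, threshold=0.8):
    """Fraction of each year's titles whose embedding is close to the concept.

    rows: iterable of (year, embedding) pairs, one per title.
    threshold: hypothetical cutoff for "similar enough" (an assumption).
    """
    totals, hits = Counter(), Counter()
    for year, vec in rows:
        totals[year] += 1
        if cosine(vec, concept_vec) >= threshold:
            hits[year] += 1
    # Dividing by the yearly total keeps a bigger crawl from looking like a trend.
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Toy data: 2020 has twice as many matching titles as 2019,
# but also twice as many titles overall.
concept = [1.0, 0.0]
rows = [
    (2019, [1.0, 0.0]), (2019, [0.0, 1.0]),
    (2020, [0.9, 0.1]), (2020, [1.0, 0.2]),
    (2020, [0.0, 1.0]), (2020, [0.1, 1.0]),
]
shares = yearly_share(rows, concept)
```

In the toy data the raw number of matches doubles from 2019 to 2020 while the share stays flat, which is the distinction the normalization is meant to capture.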

Semen and Semantics: Understanding Porn with Language Embeddings

future_detective7mo10

I wouldn't call the dataset comprehensive exactly, but it's plausibly representative - it's Internet Archive snapshots of "pornhub.com" from 2008-2023. You can see the script here https://github.com/dhealy05/semen_and_semantics/blob/main/data_retrieval/fetch_snapshots.py. I wrote an "HTMLParser" base class and e.g. "Parser2010" subclass, to put the data into a common format across years. The data and embeddings are in the repo if you want to use them without running a script.

Interesting ideas - I truncated the readme for LessWrong, but my "Future Work" section is

"Analyze trends by "minutes watched" by weighting for views, view X length; this is more likely a heuristic for content production than actual viewing time"

So while I'm not sure about what those results would look like, I agree there's an angle there.
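For readers curious about the parser arrangement mentioned above (a base class plus one subclass per site layout), here is a hypothetical sketch of the pattern. None of it is taken from the repo: `BaseSnapshotParser` is a stand-in name for the "HTMLParser" base class, the `<a class="title">` markup is invented for illustration, and the output fields are assumptions.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser


class _TitleExtractor(HTMLParser):
    """Collects the text of <a class="title"> elements (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and ("class", "title") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_title = False


class BaseSnapshotParser(ABC):
    """Shared output format; subclasses only know one era's page layout."""

    year = None

    @abstractmethod
    def extract_titles(self, html):
        """Return the raw title strings found in one snapshot."""

    def parse(self, html):
        # Every subclass ends up producing rows in the same common shape.
        return [
            {"year": self.year, "title": t.strip()}
            for t in self.extract_titles(html)
            if t.strip()
        ]


class Parser2010(BaseSnapshotParser):
    year = 2010

    def extract_titles(self, html):
        extractor = _TitleExtractor()
        extractor.feed(html)
        extractor.close()
        return extractor.titles


sample = '<div><a class="title" href="/v/1"> First clip </a><a href="/x">skip</a></div>'
parsed = Parser2010().parse(sample)
```

The point of the split is that downstream code (embedding, counting) only ever sees the common row format, so supporting another year's layout means adding one subclass.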

Semen and Semantics: Understanding Porn with Language Embeddings

future_detective7mo10

Hard to parse the reasons for the big clusters with complete certainty but this is a basically plausible story. Other macro factors I have mulled include FOSTA/SESTA - I find the timing interesting, given that it was one of the only major pieces of porn-centric legislation in the last 10 years and it took place right around the time of the big jump - and Nick Kristof's 2020 investigation, which clearly shows up in the data but did not dislodge the main trend.

Semen and Semantics: Understanding Porn with Language Embeddings

future_detective7mo40

So we actually see a decline in semantic relationship to "child" starting around 2020, almost certainly as a result of Nicholas Kristof's "The Children of Pornhub" NYT investigation that year. Pornhub removed many titles and instituted stricter rules. Note that "child" is not a keyword, it's a semantic proxy, so it's capturing some of the "youth" or "teen" trend.

Semen and Semantics: Understanding Porn with Language Embeddings

future_detective7mo10

Hi Shankar, I will concede it is editorializing: these are my conclusions based on the data. As to whether or not it is bizarre, I will repost my response to Tao Lin:

""Incest is not a subcategory of sexual violence" is something of a loaded statement. Many "stepsister" videos highlight a certain kind of appearance and context: young, with a backpack, possibly braces, possibly in a setting in which they are still under the authority of a supervising adult ("mom and dad" etc). The implication, left unsaid, is that they are under the age of consent, which qualifies as statutory rape in America. IMO it's sufficient justification to include it in the same category."

You're right that there are other, concurrent trends in tone and quality to measure. "Lighthearted whimsy", which I might call more along the lines of "surreal", is not necessarily in contradiction with violence.

Semen and Semantics: Understanding Porn with Language Embeddings

future_detective7mo30

Hi Gram, I'd be interested in your theory if you'd like to offer it! Generally my feeling is that the realm of porn analysis lacks good data, and thanks to the relatively-new surplus of text embeddings we now have more.

Semen and Semantics: Understanding Porn with Language Embeddings

future_detective7mo20

Point A: Yes.

Point B: The data here demonstrates that porn has gotten more extreme in a quantifiable way. I would hesitate to ascribe a high degree of "agency" or "intentionality" to the trends. It seems to me you are reasoning backwards: porn has gotten more extreme, so that's what people wanted. In the aggregate you are correct, but my point is that market participants drive the market, and most porn viewers are not in fact market participants (or are so only in a peripheral way).

Semen and Semantics: Understanding Porn with Language Embeddings

future_detective7mo10

You're correctly describing the underlying experience for a certain cohort of porn viewers. The mechanism through which it takes place at scale is what I'm interested in: high-spending, high-engagement consumers go through the process you describe, and production companies, which rely on their spending, tailor their content accordingly. Assuming "we" and "our" here is wrong IMO - I don't think this is a universal principle of porn viewers, it's just that those viewers shift the market in their direction.
