future_detective
Semen and Semantics: Understanding Porn with Language Embeddings
future_detective5mo50

Hi Richard, yes, certain keywords are banned. What I'm measuring is semantic similarity rather than keyword matches. For example, a video titled "rape" will be banned, but a video whose title merely suggests rape may not be. Using text embeddings, we find the titles most similar to the concept of rape. To find trends over time, we count how many of those titles appear each year, normalized by the total number of titles in that year.
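
To make the counting step concrete, here is a minimal sketch of a per-year share of titles that sit close to a concept embedding. The function names, the toy 2-d vectors, and the fixed cosine threshold of 0.8 are all assumptions for illustration; the actual analysis may select "most similar" titles differently (for example by top-k rather than a threshold).

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def yearly_share(titles, concept_vec, threshold=0.8):
    """titles: iterable of (year, embedding) pairs.

    Returns {year: fraction of that year's titles whose similarity to
    concept_vec is at least `threshold`}. The threshold is a hypothetical
    choice, not a value taken from the original analysis.
    """
    totals, hits = Counter(), Counter()
    for year, vec in titles:
        totals[year] += 1
        if cosine(vec, concept_vec) >= threshold:
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}

# Toy data: real embeddings would have hundreds of dimensions.
concept = [1.0, 0.0]
titles = [
    (2019, [0.9, 0.1]),  # close to the concept
    (2019, [0.0, 1.0]),  # orthogonal to the concept
    (2020, [1.0, 0.0]),
    (2020, [0.8, 0.2]),
]
shares = yearly_share(titles, concept)
```

Dividing by the yearly total is what keeps a year with more archived titles from looking like a trend on its own.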

For certain keywords, we see a decline starting after 2020, likely because of Nicholas Kristof's NYT piece "The Children of Pornhub", which led to both stricter keyword standards and a mass removal of videos. "Drunk" and "coma" serve as euphemisms for "incapacitation", and were used as a way to get around explicit keyword policing.

The fact that we do see declines in some areas, and that they have a known cause, leads me to believe the data is reliable: not every trend is a line going straight up.

Semen and Semantics: Understanding Porn with Language Embeddings
future_detective5mo10

I wouldn't call the dataset comprehensive exactly, but it's plausibly representative: it's Internet Archive snapshots of "pornhub.com" from 2008 to 2023. You can see the script here: https://github.com/dhealy05/semen_and_semantics/blob/main/data_retrieval/fetch_snapshots.py. I wrote an "HTMLParser" base class and per-year subclasses (e.g. "Parser2010") to put the data into a common format across years. The data and embeddings are in the repo if you want to use them without running a script.
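
For readers who want the shape of that parser design without opening the repo, here is a rough sketch of the pattern. It is not the repo's code: the base class is renamed `SnapshotParser` here to avoid clashing with Python's standard `html.parser.HTMLParser`, and the record fields, `Parser2015`, and both regex patterns are invented for illustration.

```python
import re
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List

@dataclass
class VideoRecord:
    """Common output format shared by every year's parser (hypothetical fields)."""
    year: int
    title: str

class SnapshotParser(ABC):
    """Base class: subclasses only know how to find titles in one layout."""
    year: int

    @abstractmethod
    def extract_titles(self, html: str) -> List[str]:
        ...

    def parse(self, html: str) -> List[VideoRecord]:
        # Normalization lives in one place, so every year yields the same format.
        return [VideoRecord(self.year, t.strip()) for t in self.extract_titles(html)]

class Parser2010(SnapshotParser):
    year = 2010
    # Assumed markup for illustration only.
    _pattern = re.compile(r'<a class="title"[^>]*>(.*?)</a>', re.S)

    def extract_titles(self, html: str) -> List[str]:
        return self._pattern.findall(html)

class Parser2015(SnapshotParser):
    year = 2015
    # A different assumed layout, handled by a different subclass.
    _pattern = re.compile(r'<span class="video-title">(.*?)</span>', re.S)

    def extract_titles(self, html: str) -> List[str]:
        return self._pattern.findall(html)

records_2010 = Parser2010().parse('<a class="title" href="/v/1"> Example title </a>')
records_2015 = Parser2015().parse('<span class="video-title">Another example</span>')
```

The point of the split is that a site redesign only requires a new subclass; everything downstream keeps consuming the same record type.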

Interesting ideas - I truncated the readme for LessWrong, but my "Future Work" section includes this item:

"Analyze trends by "minutes watched" by weighting for views, view X length; this is more likely a heuristic for content production than actual viewing time"

So while I'm not sure about what those results would look like, I agree there's an angle there.
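
As a rough illustration of what that weighting could look like, here is a sketch that replaces the plain title count with a views × length weight per title. The row fields (`year`, `views`, `minutes`, `matches`) are hypothetical names, and the numbers are toy values, not results.

```python
from collections import defaultdict

def weighted_share(rows):
    """rows: iterable of dicts with keys year, views, minutes, matches.

    Each title contributes views * minutes instead of 1, so the result is
    the share of (proxy) minutes watched that falls on matching titles.
    """
    totals = defaultdict(float)
    hits = defaultdict(float)
    for row in rows:
        weight = row["views"] * row["minutes"]
        totals[row["year"]] += weight
        if row["matches"]:
            hits[row["year"]] += weight
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y] > 0}

# Toy rows: one matching title, one non-matching title with 3x the views.
rows = [
    {"year": 2020, "views": 100, "minutes": 10, "matches": True},
    {"year": 2020, "views": 300, "minutes": 10, "matches": False},
]
minutes_share = weighted_share(rows)
```

By plain count the matching title is half of 2020 here, but by weighted minutes it is a quarter, which is the kind of gap this analysis would be looking for.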

Semen and Semantics: Understanding Porn with Language Embeddings
future_detective5mo10

It's hard to parse the reasons for the big clusters with complete certainty, but this is a basically plausible story. Other macro factors I have mulled include FOSTA/SESTA and Nick Kristof's 2020 investigation. I find the timing of FOSTA/SESTA interesting, given that it was one of the only major pieces of porn-centric legislation in the last 10 years and it took place right around the time of the big jump. The Kristof investigation clearly shows up in the data but did not dislodge the main trend.

Semen and Semantics: Understanding Porn with Language Embeddings
future_detective5mo40

So we actually see a decline in semantic relationship to "child" starting around 2020, almost certainly as a result of Nicholas Kristof's "The Children of Pornhub" NYT investigation that year. Pornhub removed many titles and instituted stricter rules. Note that "child" is not a keyword but a semantic proxy, so it's capturing some of the "youth" or "teen" trend.

Semen and Semantics: Understanding Porn with Language Embeddings
future_detective5mo10

Hi Shankar, I will concede it is editorializing: these are my conclusions based on the data. As to whether or not it is bizarre, I will repost my response to Tao Lin:

""Incest is not a subcategory of sexual violence" is something of a loaded statement. Many "stepsister" videos highlight a certain kind of appearance and context: young, with a backpack, possibly braces, possibly in a setting in which they are still under the authority of a supervising adult ("mom and dad" etc). The implication, left unsaid, is that they are under the age of consent, which qualifies as statutory rape in America. IMO it's sufficient justification to include it in the same category."

You're right that there are other, concurrent trends in tone and quality to measure. "Lighthearted whimsy", which I might call more along the lines of "surreal", is not necessarily in contradiction with violence.

Semen and Semantics: Understanding Porn with Language Embeddings
future_detective5mo30

Hi Gram, I'd be interested in your theory if you'd like to offer it! Generally my feeling is that the realm of porn analysis lacks good data, and thanks to the relatively new abundance of text embeddings we now have more.

Semen and Semantics: Understanding Porn with Language Embeddings
future_detective5mo20

Point A: Yes.

Point B: The data here demonstrates that porn has gotten more extreme in a quantifiable way. I would hesitate to ascribe a high degree of "agency" or "intentionality" to the trends. It seems to me you are reasoning backwards: porn has gotten more extreme, so that must be what people wanted. In the aggregate you are correct, but my point is that market participants drive the market, and most porn viewers are not in fact market participants (or are so only in a peripheral way).

Semen and Semantics: Understanding Porn with Language Embeddings
future_detective5mo10

You're correctly describing the underlying experience for a certain cohort of porn viewers. The mechanism through which it takes place at scale is what I'm interested in: high-spending, high-engagement consumers go through the process you describe, and production companies, which rely on their spending, tailor their content accordingly. Assuming "we" and "our" here is wrong IMO - I don't think this is a universal principle of porn viewers, it's just that those viewers shift the market in their direction.

Semen and Semantics: Understanding Porn with Language Embeddings
future_detective5mo60

Hi jbash, I go a little deeper on my title accuracy system here: https://github.com/dhealy05/semen_and_semantics/blob/main/analysis_results/title_accuracy_logs/title_accuracy_readme.md but I didn't account for it when I transferred the readme to the LessWrong format.

The rating system is a human in the loop: me. So it was my judgement call as to what the title accuracy entailed. My goal was to provide tools so that other interested parties could make their own assessment and check my logs to verify. The logs are all included in that folder.

For example, to rate the videos surfaced in https://github.com/dhealy05/semen_and_semantics/blob/main/analysis_results/title_accuracy_logs/incest_title_accuracy.json I visited each URL and rated it on the 1-5 scale. "Cum in panties step sister" did not seem to involve a step sister, so I gave it a 1: SEO effect. "Kinky Family - Home alone with slutty stepsis" does indeed seem to involve a stepsis-oriented plot, so I gave it a 5: no SEO effect.
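
If you want to aggregate ratings like these yourself, a small summary helper might look like the sketch below. It takes a plain list of 1-5 ratings; how to pull that list out of the JSON logs is left out, since I'm not assuming their schema here, and the function name is hypothetical. The two sample values are the 1 and the 5 from the examples above.

```python
from collections import Counter
from statistics import mean

def summarize_ratings(ratings):
    """Summarize manual title-accuracy ratings on the 1-5 scale.

    1 means the title looks like pure SEO; 5 means the title matches the video.
    """
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating out of range: {r}")
    return {
        "n": len(ratings),
        "mean": mean(ratings),
        "counts": dict(sorted(Counter(ratings).items())),
    }

# The two example ratings described above: one rated 1, one rated 5.
summary = summarize_ratings([1, 5])
```

A mean near 1 for a category would suggest its titles are mostly SEO, while a mean near 5 would suggest the titles describe the content.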
