But if a metric is trivially gameable, surely that makes it sus and less impressive, even if someone is not trivially, or even at all gaming it.
Why would you think that? Surely the reason that a metric being gameable matters is if... someone is or might be gaming it?
Plenty of metrics are gameable in theory, but are still important and valid, given that you usually can tell if they are being gamed. Apply this to any of the countless measurements you take for granted. Someone comes to you and says 'by dint of diet, hard work (and a bit of semaglutide), my bathroom scale says I've lost 50 pounds over the past year'. Do you say 'do you realize how trivially gameable that metric is? how utterly sus and unimpressive? You could have just been holding something the first time, or taken a foot off the scale the second time. Nothing would be easier than to fake this. Does this bathroom scale even exist in the first place?' Or, 'my thermometer says I'm running a fever of 105F, I am dying, take me to the hospital right now' - 'you gullible fool, do you have any idea how easy that is to manipulate by dunking it in a mug of tea or something? sus. Get me some real evidence before I waste all that time driving you to the ER.'
Good calibration is impressive and an interesting property because many prediction sources manage to not clear even that minimal bar (almost every human who has not undergone extensive calibration training, for example, regardless of how much domain expertise they have).
Further, you say one shouldn't be impressed by those sources because they could be flipping a coin, but then you refuse to give any examples of 'impressive' sources which are doing just the coin-flip thing or an iota of evidence for this bold claim, or to say what they are unimpressive compared to.
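The coin-flip worry can at least be checked. Below is a minimal sketch, using made-up outcomes and two hypothetical forecasters, of how you would tell a "just say 50%" source apart from an informative one. Both come out calibrated, but the Brier score separates them. The function names and data are illustrative assumptions, not taken from any particular forecasting platform.

```python
from collections import defaultdict

def brier_score(forecasts, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes (lower is better)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

def calibration_table(forecasts, outcomes):
    """Map each distinct stated probability to the observed frequency of the event."""
    buckets = defaultdict(list)
    for p, o in zip(forecasts, outcomes):
        buckets[p].append(o)
    return {p: sum(os) / len(os) for p, os in sorted(buckets.items())}

# Hypothetical data: 10 binary events, half of which happened.
outcomes = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]

# Forecaster A: the "coin flip", always 50%.
coin_flip = [0.5] * 10

# Forecaster B: commits to 80% or 20%, and is right about as often as claimed.
informed = [0.8, 0.8, 0.8, 0.8, 0.2, 0.2, 0.8, 0.2, 0.2, 0.2]

coin_calibration = calibration_table(coin_flip, outcomes)
informed_calibration = calibration_table(informed, outcomes)
coin_brier = brier_score(coin_flip, outcomes)
informed_brier = brier_score(informed, outcomes)

print("coin flip:", coin_calibration, "Brier =", round(coin_brier, 3))
print("informed: ", informed_calibration, "Brier =", round(informed_brier, 3))
```

In this toy data both forecasters are calibrated (stated probability matches observed frequency in every bucket), but the coin-flipper never moves off 50% and pays for it in Brier score. So "they could be flipping a coin" is something you can look for, not something you have to assume.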
I think I would have predicted that Tesla self-driving would be the slowest
For graphs like these, it obviously isn't important how the worst or mediocre competitors are doing, only how the best one is. It doesn't matter who's #5. Tesla self-driving is a longstanding, notorious failure. (And apparently is continuing to be a failure, as they continue to walk back the much-touted Cybertaxi launch, which keeps shrinking like a snowman in hell, now down to a few invited users in a heavily-mapped area with teleop.)
I'd be much more interested in Waymo numbers, as that is closer to SOTA, and they have been ramping up miles & cities.
The trends reflect the increasingly intense tastes of the highest spending, most engaged consumers.
https://logicmag.io/play/my-stepdad's-huge-data-set/
While a lot of people (most likely you and everyone you know) are consumers of internet porn (i.e., they watch it but don’t pay for it), a tiny fraction of those people are customers. Customers pay for porn, typically by clicking an ad on a tube site, going to a specific content site (often owned by MindGeek), and entering their credit card information.
This “consumer” vs. “customer” division is key to understanding the use of data to perpetuate categories that seem peculiar to many people both inside and outside the industry. “We started partitioning this idea of consumers and customers a few years ago,” Adam Grayson, CFO of the legacy studio Evil Angel, told AVN. “It used to be a perfect one-to-one in our business, right? If somebody consumed your stuff, they paid for it. But now it’s probably 10,000 to one, or something.”
There’s an analogy to be made with US politics: political analysts refer to “what the people want,” when in fact a fraction of “the people” are registered voters, and of those, only a percentage show up and vote. Candidates often try to cater to that subset of “likely voters”— regardless of what the majority of the people want. In porn, it’s similar. You have the people (the consumers), the registered voters (the customers), and the actual people who vote (the customers who result in a conversion—a specific payment for a website subscription, a movie, or a scene). Porn companies, when trying to figure out what people want, focus on the customers who convert. It’s their tastes that set the tone for professionally produced content and the industry as a whole.
By 2018, we are now over a decade into the tube era. That means that most LA-area studios are getting their marching orders from out-of-town business people armed with up-to-the-minute customer data. Porn performers tend to roll their eyes at some of these orders, but they don’t have much choice. I have been on sets where performers crack up at some of the messages that are coming “from above,” particularly concerning a repetitive obsession with scenes of “family roleplay” (incest-themed material that uses words like “stepmother,” “stepfather,” and “stepdaughter”) or what the industry calls “IR” (which stands for “interracial” and invariably means a larger, dark-skinned black man and a smaller light-skinned white woman, playing up supposed taboos via dialogue and scenarios).
These particular “taboo” genres have existed since the early days of commercial American porn. For instance, see the stellar performance by black actor Johnnie Keyes as Marilyn Chambers’ orgy partner in 1972’s cinematic Behind the Green Door, or the VHS-era incest-focused sensation Taboo from 1980. But backed by online data of paid customers seemingly obsessed with these topics, the twenty-first-century porn industry—which this year, to much fanfare, was for the first time legally allowed to film performers born in this millennium—has seen a spike in titles devoted to these (frankly old-fashioned) fantasies.
Most performers take any jobs their agents send them out for. The competition is fierce—the ever-replenishing supply of wannabe performers far outweighs the demand for roles—and they don’t want to be seen as “difficult” (particularly the women). Most of the time, the actors don’t see the scripts or know any specific details until they get to set. To the actors rolling their eyes at yet another prompt to declaim, “But you’re my stepdad!” or, “Show me your big black dick,” the directors shrug, point at the emailed instructions and say, “That’s what they want…”
So my interpretation here is that it's not that there's suddenly a huge spike in people discovering they love incest in 2017 where they were clueless in 2016 or that they were all brainwashed to no longer enjoy vanilla that year, it's that that is when the hidden oligopoly turned on various analytics and started deliberately targeting those fetishes as a fleet-wide business decision. And this was because they had so thoroughly commodified regular porn to a price point of $0, that the only paying customers that are left are the ones with extreme fetishes who cannot be supplied by regular amateur or pro supply.
They may or may not have increased in absolute number compared to pre-2017, but it doesn't matter, because everyone else vanished, and their relative importance skyrocketed: "If somebody consumed your stuff, they paid for it. But now it’s probably 10,000 to one, or something."
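The arithmetic behind "relative importance skyrocketed" is simple enough to sketch. All the numbers below are invented purely for illustration (the source gives only the rough "10,000 to one" figure): the niche paying group is held constant while the other payers disappear.

```python
def niche_share(niche_payers, other_payers):
    """Fraction of all paying customers who belong to the niche group."""
    return niche_payers / (niche_payers + other_payers)

# Hypothetical 'before': nearly everyone who watches also pays.
share_before = niche_share(niche_payers=1_000, other_payers=99_000)

# Hypothetical 'after': the niche group is unchanged, but most other payers are gone.
share_after = niche_share(niche_payers=1_000, other_payers=1_000)

print(f"before: {share_before:.0%} of paying customers")
print(f"after:  {share_after:.0%} of paying customers")
```

The niche group did not grow at all in this sketch; its share of the paying base went from 1% to 50% only because the denominator collapsed.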
(For younger readers who may be confused by how a ratio like 10000:1 is even hypothetically possible because 'where did that 10k come from when no one pays for porn?', it's worth recalling that renting porn videos used to be big business, something a lot of men did; it kept many non-Blockbuster video rental stores afloat, and it was an ordinary thing for your local store to have a 'back room' that the kiddies were strictly forbidden from. While it would certainly stock a lot of fetish stuff like interracial porn, it also rented out tons of normal stuff. If you have no idea what this was like, you may enjoy reading "True Porn Clerk Stories", Ali Davis 2002.)
I think there is a similar effect with foot fetishes & furries: they are oddly well-heeled and pay a ton of money for new stuff, because they are under-supplied and demand new material. There is not much 'organic' supply of women photographing their feet in various lascivious ways; it's not that it's hard, they just don't do it, but can be incentivized to do so. (I recall reading an article on Wikifoot where IIRC they interviewed a contributor who said he got some photos by simply politely emailing or DMing the woman to ask for her to take some foot photos, and she would oblige. "send foots kthnxbai" apparently works. And probably it's fairly easy to pay for or commission feet images/videos: almost everyone has two feet already, and you can work feet into regular porn easily by simply choosing different angles or postures, and a closeup of a foot won't turn off regular porn consumers either, so you can have your cake & eat it too. Similarly for incest: saying "But you're my stepdad!" is cheap and easy and anyone can do it if the Powers That Be tell them to in case a few 'customers' will pay actual $$$ for it, while those 'consumers' not into that plot roll their eyes and ignore it as so much silly 'porn movie plot' framing as they get on with business.)
I think aside from the general implausibility of the effect sizes and the claimed AI tech (GANs?) delivering those effect sizes across so many areas of materials, one of the odder claims which people highlighted at the time was that supposedly the best users got a lot more productivity enhancement than the worst ones. This is pretty unusual: usually low performers get a lot more out of AI assistance, for obvious reasons. And this lines up with what I see anecdotally for LLMs: until very recently, possibly, they were just a lot more useful for people not very good at writing or other stuff, than for people like me who are.
I appreciate everyone's comments here, they were very helpful. I've heavily revised the story to fix the issues with it, and hopefully it will be more satisfactory now.
I agree at this point: it is not per-user finetuning. The personalization has been prodded heavily, and it seems to boil down to a standard RAG interface plus a slightly interesting 'summarization' approach to try to describe the user in text (as opposed to a 'user embedding' or something else). I have not seen any signs of either lightweight or full finetuning, and several observations strongly cut against it: for example, users describe a 'discrete' behavior where the current GPT either knows something from another session, or it doesn't, but it is never 'in between', and it only seems to draw on a few other sessions at any time; this points to RAG as the workhorse (the relevant other snippet either got retrieved or it didn't), rather than any kind of finetuning where you would expect 'fuzzy' recall and signs of information leaking in from all recent sessions.
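To make the 'discrete' behavior concrete, here is a toy sketch of retrieval over past-session snippets. It is a deliberately crude keyword-overlap retriever with made-up snippets and a made-up stopword list, and is not a description of how any real product implements memory. The point is only that a snippet is either in the retrieved set or it isn't.

```python
# Hypothetical stopword list so that filler words don't count as matches.
STOPWORDS = {"user", "is", "a", "in", "for", "my", "what", "the", "has", "over"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {word for word in text.lower().split() if word not in STOPWORDS}

def retrieve(query, snippets, k=2, min_overlap=1):
    """Return up to k snippets sharing at least min_overlap content words with the query."""
    query_tokens = tokenize(query)
    scored = []
    for snippet in snippets:
        overlap = len(query_tokens & tokenize(snippet))
        if overlap >= min_overlap:
            scored.append((overlap, snippet))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in scored[:k]]

# Made-up summaries of earlier sessions.
past_sessions = [
    "user is training for a marathon in october",
    "user prefers python over javascript",
    "user has a cat named miso",
]

hit = retrieve("what is my cat called", past_sessions)
miss = retrieve("what is my favorite tea", past_sessions)

print("hit: ", hit)   # the cat snippet is retrieved whole
print("miss:", miss)  # nothing is retrieved, so nothing is 'half-remembered'
```

With a mechanism like this, the model's context either contains the relevant snippet or contains nothing about it, and only the top few snippets ever get in. Finetuning on all sessions would not have that all-or-nothing signature.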
Perhaps for that reason, it has not made a big impact (at least once people got over the narcissistic rush of asking GPT for its summary of you, either flatteringly sycophantic or not). It presumably is quietly helping behind the scenes, but I haven't noticed any clear big benefits to it. (And there are some drawbacks.)
Why can't the mode-collapse just be from convergent evolution in terms of what the lowest-common-denominator rater will find funny? If there are only a few top candidates, then you'd expect a lot of overlap. And then there's the very incestuous nature of LLM training these days: everyone is distilling and using LLM judges and publishing the same datasets to Hugging Face and training on them. That's why you'll ask Grok or Llama or DeepSeek-R1 a question and hear "As an AI model trained by OpenAI...".
This is true of all teas. The decaf ones all are terrible. I spent a while trying them in the hopes of cutting down my caffeine consumption, but the taste compromise is severe. And I'd say that the black decaf teas were the best I tried, mostly because they tend to have much more flavor & flavorings, so there was more left over from the water or CO2 decaffeination...
Or just clipped out. It takes 2 seconds to clip it out and you're done. Or you just fast forward, assuming you saw the intro at all and didn't simply skip the first few minutes. Especially as 'incest' becomes universal and viewers just roll their eyes and ignore it. This is something that is not true of all fetishes: there is generally no way to take furry porn, for example, and strategically clip out a few pixels or frames and make it non-furry. You can't easily take a video of an Asian porn star and make them white or black. And so on and so forth.