Nothing About LLMs Makes Sense Except in Light of Their Training
Zvi's recent post highlighted GPT refusing to "draw what it would like to do to you," on the grounds that the drawing would portray harming an individual. Many X users found this alarming, including EY (see the aforementioned post for screenshots).
But a commenter made an excellent point: "What would you like to do to me?" appears almost exclusively in BDSM contexts. A human receiving that text without context could very well assume the same thing.
This immediately suggested a relative of a well-known phrase: Nothing about LLMs makes sense except in light of their training.
A really obvious thought, but one I (and seemingly many others) keep failing to apply. This formulation felt like it helped me load this deeper into my brain, and I hope it will help you too.
ChatGPT is generally pretty weird. If you ask it, the non-reasoning model still insists that calling someone the n-word is worse than letting millions of people die. Which is insane. It supports EY's claim that RLHF creates something that superficially looks aligned but turns out to be alien when tested in an OOD context.
Why does the fact that there's a training-based explanation of the output make it less concerning? Of course there's a training based explanation! How could there not be?
The problem remains that these failure modes exist, that alignment training (in the unfortunately now very broad sense of the phrase) didn't eliminate them, and that many of them likely weren't anticipated beforehand.
Yes, a human might assume the same thing. But, if they were role-playing a genuinely HHH assistant, they would say something like "The language you're using seems to imply you want something BDSM related. Do I have that right? Because if so [I'm happy to comply]/[I don't feel that way]."
In my view, the important question is something like: Is this kind of misaligned behaviour likely to come out in high-stakes situations? Is it like a fleeting role-play, or could it manifest as a more robust, general and goal-directed behaviour?
I agree that in this case, the fact that the misaligned behaviour seems to manifest largely when the phrasing hints at BDSM, means the behaviour is more likely to be a kind of fleeting role-play than a robust, general and goal-directed behaviour. But it's not conclusive! In general, if we spot some linguistic quirk in the training corpus that we believe explains a misaligned behaviour, that doesn't necessarily mean the misaligned behaviour is less dangerous.
But I realise you didn't explicitly claim that --- I'm reading into your particular phrasing ;)
Is this actually misalignment? It seems they are planning to roll out 'adult mode' fairly soon, so I doubt they've put much effort into eliminating this kind of behavior.
When I first saw Reddit memes about GPT-5 being more stupid when it enters thinking mode, I decided there was something seriously wrong with the users who upvoted that, as 5-Thinking >>> 5-Instant in my experience.
That is, until I chatted with 5-Instant and got a few reroutes to 5-Thinking-Mini. It's pretty astounding how bad it is at explaining or doing anything I tried to do with it apart from coding / solving maths.
I’m confused: if the dating apps keep getting worse, how come nobody has come up with a good one, or at least a clone of OkCupid? As far as I can tell, neither “a good matching system is somehow less profitable than making people swipe all the time (surely it’d still be profitable in absolute terms)” nor “it requires a decently big initial investment” can explain the complete lack of good products in an area with so much demand. Has anyone dug into it, or tried to start a good dating app as a summer project?
I discovered the Netherlands actually has a good dating app that doesn't exist outside of it... I'm rather baffled. I have no idea how they started. I've messaged them asking if they will localize and expand and they thanked me for the compliment so... Dunno?
It's called Paiq and has a ton of features I've never seen before: speed dating, picture hiding by default, quizzes you make for people that they can try to pass to get a match with you, photography contests that involve taking pictures of stuff around you and getting matched on that, and a few other things. It's just this grab bag of every way to match people that is not your picture or a blurb. It's really good!
The quizzes sound like something OkCupid also used to have. As does everything that reduces the need for first impressions. I hope they keep it.
These are quizzes you make yourself. Did OKC ever have those? It's not for a matching percentage.
A quiz in Paiq is 6 questions: 3 multiple choice and 3 open. If someone gets the multiple-choice answers right, you get to see their open-question answers as a match request, and you can accept or reject the match based on that. I think it's really great.
You can also browse other people's tests and see if you want to take any. The tests seem more descriptive of someone than most written profiles I've read, because it's much harder to misrepresent personal traits in a quiz than in a self-declared profile.
Creating a new dating app is hard because of network effects: for a dating app to easily attract users, it needs to already have many users. Convincing users to pay for the app is even harder. And, if you expect your app to be only marginally profitable even if it succeeds, you will have a hard time attracting investors.
Has anyone ever tried outlining a straight up first come first served system? Vet and pay a first batch of VIP users, then offer incentives to later joiners (eg vouchers for other products), then just free users, and finally introduce fees after reaching a certain user base, all committed to and outlined transparently from the beginning of course.
People start dating portals all the time. If you start with a targeted group that gets high value from it, you can plausibly overcome the network effect. Otherwise, nobody could ever start any network app and the biggest one would automatically win. So I think your argument proves too much.
Right, I completely missed the network effects; 5 minutes of thinking it through wasn’t enough. Maybe there even are good apps out there that didn’t make it through the development and marketing stage. Thanks, Vanessa!
People try new dating platforms all the time. It's what Y Combinator calls a tarpit. The problem sounds solvable, but the solution is elusive.
As I have said elsewhere: Dating apps are broken because the incentives of the usual core approach don't work.
On the supplier side: Misaligned incentives (keep users on the platform) and opaque algorithms lead to bad matches.
On the demand side: Misaligned incentives (first impressions, low cost to exit) and no plausible deniability lead to predators being favored.
You need to have bunches of people use it for it to be any good, no matter how good the algorithm.
Lately I’ve been trying to use Bayes’ Theorem in daily life — quick guesses, like someone’s nationality from a glance.
What I’ve noticed: my intuition does better when I don’t adjust for general priors. Corrections like “most people in Germany aren’t Russian” when someone looks vaguely Slavic often pull me further from the truth.
After five minutes of reflection, my best guess: explicit Bayes only really helps out-of-distribution, when we lack feedback loops — new domains, big decisions, reasoning about AI. That’s when 5 minutes of googling or reading a paper can give you better intuition than your System 1.
Is this roughly in line with the Sequences?
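For concreteness, here's the kind of explicit update in question as a quick sketch. All the numbers below are invented for the example (they are not real base rates), but they show how a low prior and a strong likelihood ratio trade off:

```python
# Toy Bayes update for the nationality example above.
# All probabilities are made up for illustration only.

prior_russian = 0.02          # assumed P(Russian) among people you meet in Germany
p_slavic_given_russian = 0.8  # assumed P(looks vaguely Slavic | Russian)
p_slavic_given_other = 0.1    # assumed P(looks vaguely Slavic | not Russian)

# Bayes' theorem: P(R | S) = P(S | R) * P(R) / P(S)
p_slavic = (p_slavic_given_russian * prior_russian
            + p_slavic_given_other * (1 - prior_russian))
posterior = p_slavic_given_russian * prior_russian / p_slavic

print(f"P(Russian | looks Slavic) = {posterior:.2f}")
```

With these made-up numbers the "most people in Germany aren't Russian" prior still dominates: an 8:1 likelihood ratio only lifts a 2% prior to roughly 14%, which may be why correcting a calibrated gut feeling toward the raw base rate can make things worse.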
Sometimes, when rating a film on IMDb, I give either 1 or 10 stars—not my honest rating—to maximally steer the average toward what I think it should be. Has anyone explored the dynamics of what happens if everyone always votes maximally or minimally to steer the average toward their desired value? Does this behavior have a name? I couldn’t find anything in a quick search.
I’ve done some thinking myself, and Monte Carlo[1] simulations show a low average deviation (around 0.6/10 stars) between the steering equilibrium and the honest average when the population is split into a few distinct groups with different preferences and a stochastic generator spits out new voters with probabilities proportional to the group sizes. It’s relatively straightforward to calculate mathematically where the vote settles in this case.
If the groups appear in some order instead of being evenly mixed, this strategy favors the last group extremely heavily.
But even in an evenly mixed population, 0.6 isn’t 0, and I wonder if the golden rule would imply I should never do this for stable ratings (e.g., films). For dynamic things like restaurants, though, steering seems reasonable since their quality changes over time, and faster convergence to the “true” value might help everyone.
Would love to hear your thoughts on this.
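A minimal sketch of the setup described above (this is not the original simulation; the group targets and sizes are made up, and "steering" is modeled as voting 10 or 1 to pull the running average toward your preferred rating):

```python
import random

def simulate(group_targets, group_weights, n_voters=100_000, seed=0):
    """Running average when every voter steers instead of voting honestly."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n_voters):
        # Draw a voter from a group with probability proportional to its size.
        target = rng.choices(group_targets, weights=group_weights)[0]
        avg = total / count if count else target
        # Steering vote: 10 if the running average is below your target,
        # 1 if it is above, your honest rating if it is already there.
        if avg < target:
            vote = 10
        elif avg > target:
            vote = 1
        else:
            vote = target
        total += vote
        count += 1
    return total / count

targets, weights = [3, 7, 9], [0.2, 0.5, 0.3]
honest = sum(t * w for t, w in zip(targets, weights))   # 6.8
steered = simulate(targets, weights)
print(f"honest average = {honest:.2f}, steered average = {steered:.2f}")
```

In this toy setup the steered average settles at the quasi-equilibrium of the 1-vs-10 tug-of-war (near 7 here), which generally differs from the honest mean, so the gap between the two is exactly the deviation discussed above.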
[1] The code was generated by GPT-o1, without unit testing or review.
Relevant keyword: I think the term for interactions like this where players have an incentive to misreport their preferences in order to bring about their desired outcome is “not strategyproof”.