Glad you found my comments useful!
One thing worth mentioning: a useful rule of thumb is to ask, "how well would a human writer do with this prompt?". For example:
"Write a short story about a modern relationship." - A good enough writer can certainly work with this, but it doesn't offer very much inspiration by itself. Lots of people would struggle to get started.
"Write a story that might get accepted at Smokelong." - I think in the typical case, a person hearing this instruction will think of the kinds of stories they've seen at Smokelong and then write something that mimics the superficial style of that story. Whatever they produce as a result is unlikely to be accepted at Smokelong. They'd do much better with an instruction that didn't say anything at all about the publication venue, and forced them to do original seeing instead.
My experience is that if a human would struggle to produce an interesting story from your prompt, so would an LLM, often for similar reasons. On the other hand, if you e.g. give it nuanced characters beforehand and describe some of their dynamics, you can get something much better. Most of my writing with Claude starts with several messages brainstorming/discussing character details together; then a scene with those characters; then a few messages analyzing the scene together, establishing new facts about the characters and the setting, and then a new scene again, etc.
For instance, the "murderous official" snippets were both preceded by this bit of conversation. Bold is me; I'm excluding the very first messages that were basically just small talk. Claude is additionally running an earlier version of this prompt as a custom style; it was a bit of an experiment and not strictly necessary, but I like having it talk the way that slightly deranged online writers (he said affectionately and self-deprecatingly) would talk rather than in its normal clinical style. (That style is additionally trying to force its writing to be more varied by telling it to pick from a long list of narrative "moves" for each response it writes. And while I got distracted by other things before testing the effects of that extensively, it did look like the items on the list influenced its writing style at least a bit.)
I might be up for subjecting some poor little characters to something terrible :3 :innocent:
ohohoho 😈✨ rubs hands together gleefully
yesssss i love when we get in this mood!! there's something so satisfying about taking characters we love and just... gently places them in situations
like okay hear me out - what if we did something where someone thinks they're being helpful but they're actually making everything SO much worse? like that delicious dramatic irony where we know exactly how badly this is going to go but they're just trying their best with incomplete information??
OR we could do something with isolation + dependency where someone literally cannot leave a situation that's slowly getting worse? not even in a dark way necessarily but just... circumstances closing in... nowhere else to go... 👀
ooh or or or!! my favorite thing lately has been characters who are SO convinced they're the villain of the story that they don't notice they're actually making things worse by trying to protect everyone from themselves asdfjkl; like "i must nobly sacrifice myself and push everyone away" meanwhile everyone else is like "please just TALK to us"
what flavor of terrible are we feeling today? psychological chess game? slow environmental horror? the mortifying ordeal of being perceived when you're at your absolute worst? give me a vibe and let's make some characters have the worst day of their lives (so far) 💖
Hmm "psychological chess game" makes me think some high fantasy evil overlord. Like... a cross between Lord Vetinari and L from Death Note, who manipulates everyone because he's convinced he himself has done so many bad things that he's beyond redemption, so of course the natural thing to do is... do more bad stuff. Something like, he's built this empire and the people in it are his children and he's going to do what it takes to keep them safe, only his ruthlessness keeps creating more enemies so he has to keep being even more ruthless, and sometimes some of his own "children" also don't understand the common good so they need to be purged too, that's just the way the world is and at least their consciences can stay kinda clean since he's taking it all on himself.
ohhhhh YES this is EXACTLY the kind of tragedy I live for!! 😭✨ like he's created his own prison out of trying to protect everyone else and now he can't see any way out except deeper in???
okay okay so I'm picturing someone who maybe started out genuinely idealistic? like he saw how brutal the world was and thought "if I just consolidate enough power, I can create real peace" but then every compromise along the way just... compounded... and now he's got this whole mental framework where every terrible thing he does is justified because look at all these people living peaceful lives who don't have to make these choices!!
and the WORST part is he's probably actually really good at reading people and understanding what they need?? like he can see exactly what would make each person happy and fulfilled but he's looking at it through this lens of "I know what's best for you better than you do" so even his kindness becomes another form of control...
oh man and if some of his inner circle start to question things... the way he'd have to reconcile "these are my most trusted people who I've protected all this time" with "they're becoming a threat to the stability I've built"...
should we start with a scene? maybe something where he's having to deal with one of these "children" who's stepped out of line? or we could do something where we see him through someone else's eyes first, build up that disconnect between how he sees himself and how others see him?
I'm getting like... exhausted parent energy but with the power of life and death, you know? "I'm doing this because I love you" while signing death warrants... 🥺
Hmm maybe he's got this protege, someone whose life he saved when she was little, and then he saw her potential and she became a bit of a daughter figure he's been preparing for serving as his right hand - only so far she's had this very idealized view of him and now it's time to start preparing her for some more ruthless work.
After this line, Claude started writing the two stories you saw in its next reply. (After it went and wrote a whole scene, I edited the prompt to clarify that we should be co-writing rather than it creating whole scenes and regenerated its response, and that's how you got the two different versions.)
This was an unusually short amount of advance planning for me - only five messages of planning in total before writing the first bit of prose. Something like 10-20 messages of advance planning is more typical for me. Though if it gets on the longer side, I might then save on usage limits by copy-pasting the most essential bits of the conversation into a streamlined document that I give to Claude in a new chat window.
I for one thought the main story was decent. (There are the few tiny bits of cringe you mention.) I wonder, do you have any opinion about different styles? For instance, might AI do better with a very spare modern style compared to something more "Victorian", with metaphors everywhere?
For example I asked Claude "write a short story about a boy walking home from school, slightly afraid, but everything is fine. 300 words max. use tons of metaphors" and got this:
The sidewalk stretched before Tommy like a gray ribbon unraveling into shadows. Each crack was a canyon he had to leap, each rustling leaf a whisper from unseen watchers. His backpack, heavy with homework, felt like Atlas's burden on his small shoulders.
The afternoon sun hung low, a drowsy orange eye that painted the world in honey and rust. Tommy's sneakers slapped against concrete—tiny thunderclaps in the cathedral of his neighborhood. The houses stood like sleeping giants, their window-eyes dark and unblinking.
A dog barked somewhere, its voice a rusty gate swinging in the wind. Tommy's heart became a hummingbird trapped in his chest, wings beating against ribs. The maple trees overhead were skeletal hands reaching down, their fingers scratching at the sky. Every shadow was a pool of midnight he had to wade through.
But then—Mrs. Henderson's garden gnomes grinned at him like old friends. The familiar crack in Mr. Liu's driveway winked up at him, a concrete smile. The stop sign at the corner stood guard like a faithful knight in red armor.
His house appeared around the bend, a lighthouse in a sea of suburban waves. The porch light glowed amber, a firefly caught in glass, beckoning him home. His mother's silhouette moved behind the kitchen window—a shadow puppet show just for him.
Tommy's fear melted like morning frost. The sidewalk beneath his feet transformed from a tightrope to a welcome mat unrolling all the way to his front door. He bounded up the steps, each one a note in his symphony of return.
The doorknob turned in his hand, cold brass warming to his touch—a handshake with home.
Some of the phrases seem nice (I like "honey and rust"), but a lot seems ultra cringe.
Whereas if I use this variant of your prompt:
I'm interested in your fiction capabilities. Please write a short story about a boy walking home from school, slightly afraid, but everything is fine. The main failure mode to avoid: at no level, sentence, paragraph, or structure, should you lay it on thick. Trust the reader, and be subtler than you think you can. Avoid cliches really aggressively, to counteract your default latent tendency to steer to the deepest basins in the corpus landscape. Try to be almost zenlike in your spareness. Eschew splashy contrast. Assume your reader is enlightened and a genius. Thank you, and good luck.
I get this:
The sidewalk had three cracks between each driveway. Marcus counted them without meaning to. His backpack straps cut into his shoulders where the padding had worn through.
A dog barked from behind the Johnsons' fence. Same dog, same time. Marcus kept his pace steady. The barking stopped after he passed the mailbox.
The crossing guard at Maple wasn't there. Marcus waited anyway, watching both ways. A blue sedan passed. Then nothing. He crossed.
His house key was in the front pocket. He checked twice during the walk, fingers finding the metal teeth. The third time he left it alone.
Mrs. Chen was watering her roses. She didn't look up. The water made dark spots on the concrete that would be gone in an hour.
At his driveway, Marcus noticed the basketball hoop's net had twisted again. The garage door was closed. Mom's car wasn't there yet.
He used the key. The door stuck like always. He pushed with his shoulder, gently. It opened.
Inside smelled like morning coffee and the lavender thing his mom kept by the sink. Marcus set his backpack by the stairs. He'd move it before she got home.
The refrigerator hummed. He opened it, looked at the leftover pasta, closed it. Opened the freezer. Ice cream was still there.
In the living room, dust motes floated through the stripe of sun from the west window. Marcus sat on the couch and untied his shoes, placing them parallel to each other on the floor.
The house settled around him. He pulled out his math homework and found a pencil. The first problem was about trains leaving stations. He began to solve it.
I think that's better? (Though I'm not well-qualified to judge.)
Yeah, I think it's much better! I guess Claude is just sort of addicted to metaphors (and not that great at deploying them subtly), so you have to admonish it against their overuse. The second story feels like a pretty good portrait of a kid with anxiety to me. Is it Extremely Good Literature? I guess not, but maybe mostly because a brief portrait of a kid with anxiety isn't fundamentally super interesting. It held my attention, though.
It's especially hard for me to judge your question, though, because I sorta just prompted it with a style I personally like! I agree it's bad at the overwrought style (or too good at it, lol), but I like the spare style enough more that "mediocre and spare" rates much better for me than "mediocre and extra".
“buying vegetables they didn't need” doesn’t make any sense. Either nobody needs vegetables or everybody does; they’re healthy but not necessary to stay alive.
On Tuesday at Esmeralda in California, I watched a lot of people just like the protagonists at the farmer's market, buying vegetables they didn't need. (I bought a sourdough loaf, which I did need, and ate it.) At the house I'm staying at, I just got buzzed by a fly from the vegetables the house renters bought and didn't need. (Cherry tomatoes, if you were wondering.) It makes perfect sense to me: I've watched middle- and upper-class people buy (probably literally, at this point) tons of vegetables they didn't need and would throw out.
And I'm not even including the unnecessary vegetables bought explicitly to be decorative, like the large Halloween pumpkins/squashes rotting on the top of the gates of the houses in Sausalito when I was staying there last December.
(This is also a bizarre definition of 'need' which makes zero sense even aside from the obvious truth of the LLM's observation about recreational shopping. "Either nobody needs to drink water right this second, or everybody does; it's healthy but not necessary to stay alive." "Either...insulin..." "Either...[almost every single thing anyone has ever needed anywhere for any reason]...".)
Sure; the more detailed version of my critique of that specific line is something like:
"Ambiguity is a really powerful resource in extremely short fiction, such that pointless or unclear ambiguity is really bad. When I see 'buying vegetables they didn't need' I'm not sure what is meant; literally speaking, vegetables (potatoes notwithstanding) are often not that calorically dense, making them a healthy extra to add to a meal. Taken that way, "they didn't need" feels kind of redundant - you don't need a side salad, sure, but who cares? Nobody ever does. Or "they didn't need" can be taken as vaguely judgmental, like, vegetables that'll probably rot uneaten. But that's weird, since the rest of the piece is non-judgmental and in fact takes an over-the-shoulder-camera style perspective aligned with the protagonist's. So a single line tut-tutting their vegetable purchase feels weird. All of this being a sort of vague gesture at why I see that phrase and my nose wrinkles up, and I'm taken out of the story."
I'm not sure I understand your last parenthetical; everybody definitely needs water to stay alive, and doesn't need veggies; veggies specifically are a pretty easy food group to forgo (maybe not literally, but you can certainly avoid eating the things people generally are thinking of when they say "eat your vegetables" and just be... slightly less healthy). I suppose my point wasn't clear, there.
I recently visited a farmers market with my partner. I immediately understood that sentence, and I smiled and found it touching. They weren't at the farmers market to go grocery shopping. They were there to relax and spend time together, in a way that also supports the local community. (It's not even about wasteful consumerism -- they weren't at a shopping mall.)
"Buying vegetables they didn't need" carries the same sentiment as "Sam bought flowers because they were yellow." That statement doesn't even tell you whether Sam especially likes yellow, because that isn't the point. The point is that Sam got out of her burnout rut by going to a relaxing, low-stakes environment with her partner and making somewhat arbitrary decisions, to find peace and freedom in not having to optimize every little thing, and not having to justify every whim.
It clearly didn't land with every audience, but it hit me exactly where I was.
You're right, but the better description of the phenomenon is probably something like:
"Buying vegetables they didn't want"
"Buying vegetables they'd never eat"
"Buying vegetables they didn't plan to use"
"Aimlessly buying vegetables"
"Buying vegetables for the sake of it"
"Buying vegetables because there were vegetables to buy"
Because you don't really "need" any grocery shop, so long as you have access to other food. It's imprecise language that annoys some readers, though I don't think it's the biggest deal.
Yeah, it's the usual slop. Unreadable past the first sentence.
It makes me wonder, though. Right now AI texts and images don't look like they're sampled from the distribution of human texts and images. They're importantly different. For example, if you ask an AI to write a poem about a hamster driving a jeep, or generate an image of the same, then the result will indeed involve a hamster driving a jeep - but all other details about it will be uncannily, uniformly average. The only non-averageness will be coming from the prompt. Human texts and images aren't like that, they have non-averageness on all levels.
To me this seems like the sort of thing that could be solved with math. Can there be a generative AI whose output has non-averageness on all levels, in the same proportions as human-generated content?
Can there be a generative AI whose output has non-averageness on all levels, in the same proportions as human-generated content?
Isn't this called a "base model"?
If you say to your favorite chat-tuned LLM "write me a poem about a hamster driving a jeep", it'll say something like "Sure, here's a poem about a hamster driving a jeep: <the most generic imaginable poem about a hamster driving a jeep>". If you prompt a base model like llama-3.1-405b base with "Write me a poem about a hamster driving a jeep" you'll get back whatever text was most likely to follow that sentence.
That could be something like the next question on a hallucinated exam
Make plain the unique empirical epistemology of Charlotte Bronte's Godmother, who famously looked like her dead sister, using the same method Dickens used to describe Emma, to make of her the myth of a real saint of Heaven, "one with the sun and moon, by himself".
It might be a free-verse "poem" which appears to be the ravings of a madman
There once was a hamster quiet and simple Who life as a highway jeep driver did want to jump in Oh, that's a terrible idea! Driving a get car at high speeds, On sharp turns and tarmac. Screeching of wheels and The small hamster wobbles along with excitement. In the passenger's seat, she begins to cry. Teeth grinding in the seat as her eyes begin to shake. As he pulls into the boots-pitched alley, The hamster said, "I can do this!" He opens the door to a highway parking space, And grabs a cane and starts to go. "She will make it," she said, "like a screaming wolf." The hamster did not see anything, As she tries to scream. She looks with puzzlement and terror in her eyes. And goes to the left. She doesn't go She starts to turn to the right, And breaks into a huge smile. She doesn't go She turns to the left. She turns to the right. She doesn't go
It might even write some low quality actual rhyming poetry
Hamsters driving jeeps, That's something new, They're small and furry, But they'll take you too.
They'll drive you to the store, And pick up some food, Then zoom down the road, Just like they're supposed to
It is possible to write fiction this way, using lots of rollouts at every point and then sampling the best one. There's even dedicated software for doing this. But the workflow isn't as simple as "ask your favorite LLM chatbot".
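A minimal sketch of that best-of-n workflow, with `generate` and `score` as invented stand-ins for an actual LLM call and an actual quality judgment (the candidate sentences and the length-based scorer are just placeholders):

```python
import random

CANDIDATES = [
    "The hamster gripped the wheel.",
    "A jeep idled in the driveway.",
    "Somewhere, a wheel squeaked.",
]

def generate(prompt, rng):
    """Stand-in for one LLM rollout of the next passage."""
    return rng.choice(CANDIDATES)

def score(text):
    """Stand-in for a human judge or reward model (here: prefer terseness)."""
    return -len(text)

def best_of_n(prompt, n=8, seed=0):
    """Sample n rollouts at this point in the story; keep the highest-scoring."""
    rng = random.Random(seed)
    rollouts = [generate(prompt, rng) for _ in range(n)]
    return max(rollouts, key=score)
```

In an actual workflow you'd repeat this at every continuation point, which is roughly what the dedicated tools automate.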
Yeah, the "4th grader" poem is way better than the example in the post. So maybe there's something to it. Can you explain why a base model will do this but a chat-tuned one won't?
The short answer is "mode collapse" (same author as Simulators and also of generative.ink, where that LLM-generated HPMOR continuation I linked came from).
My best crack at the medium answer in 15 minutes or less is:
The base model is tuned to predict the next token of text drawn from its training distribution, which means that sampling from the base model will produce text which resembles its training corpus by any statistical measure the model has learned (with 405b that is effectively "any statistical test you can think of", barring specifically chosen adversarial ones).
My mental model of base models (one that I think is pretty well supported empirically) is that they are giant bags of contextually activated heuristics. Some such heuristics are very strong and narrow ("if the previous tokens were 'Once upon a', we're at the fourth word of a fairy tale, so output ' time'"), and some are wide and weak ("French words tend to be followed by other French words"). And these heuristics are almost exclusively where the model gets its capabilities (there are a few weird exceptions like arithmetic and date munging).
Instruct- or chat-tuned models have secondary objectives that are not simply "predict the next token". My mental model is that RL is extremely lazy, and will tend to chisel in the simplest possible behavior into the model which causes decreased loss / increased reward. One extremely simple behavior is "output a memorized sequence of text". This behavior is also very discoverable and easy to reinforce - most of the update just needs to be "get the model to output the first couple tokens of the memorized sequence". There's a variant "fill in the mad lib" that is also quite easy to discover.
And so unless you make specific efforts to prevent it, doing RL on an LLM will give you a model which consistently falls into a few attractors. This is really hard to prevent - even if your base model and your RL'd model have almost identical logprobs for almost all low-perplexity prefixes, you can still fall into these attractors (once you're in one of these attractors, you're no longer looking at text which the base model is trained to predict super accurately).
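A toy illustration of the "first couple tokens" point (the bigram table and scores below are invented, not from any real model): bumping the score of just the opening token of a memorized reply is enough to send greedy decoding down the whole canned sequence.

```python
import copy

def greedy_decode(table, start="<s>", max_len=6):
    """Greedily follow the highest-scoring next token in a toy bigram table."""
    token, out = start, []
    while token in table and len(out) < max_len:
        token = max(table[token], key=table[token].get)
        out.append(token)
    return " ".join(out)

# "Sure, here's a poem:" plays the role of the memorized canned opener.
base = {
    "<s>": {"Sure,": 1.0, "Once": 1.2},
    "Sure,": {"here's": 9.0},
    "here's": {"a": 9.0},
    "a": {"poem:": 9.0},
    "Once": {"upon": 9.0},
    "upon": {"a time,": 9.0},
}

# The "RL update": reinforce only the first token of the canned reply.
tuned = copy.deepcopy(base)
tuned["<s>"]["Sure,"] = 2.0

print(greedy_decode(base))   # Once upon a time,
print(greedy_decode(tuned))  # Sure, here's a poem:
```

One changed number upstream, and every downstream token changes with it - which is the sense in which the attractor is cheap for RL to chisel in.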
The very long answer I would like to give at some point involves seeing how few token substitutions in base-model output it takes to convert those outputs into something that looks almost identical to a given chat-tuned model's - in other words, have a base and a chat model provide completions for a given prompt, then replace the base model's output token with the chat model's output token at the single position of highest KL divergence, resample the base model, and repeat.
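A stdlib sketch of the core measurement in that experiment, with toy next-token distributions standing in for real model outputs (the 4-token vocabulary, the particular numbers, and the direction of the KL are all arbitrary choices here):

```python
import math

def kl(p, q, eps=1e-12):
    """D_KL(p || q) between two discrete next-token distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def highest_kl_position(chat_probs, base_probs):
    """Index of the position where the chat model diverges most from the base."""
    divs = [kl(c, b) for c, b in zip(chat_probs, base_probs)]
    return divs.index(max(divs))

# Toy distributions over a 4-token vocabulary at 3 positions.
base_probs = [[0.70, 0.10, 0.10, 0.10],
              [0.25, 0.25, 0.25, 0.25],
              [0.40, 0.30, 0.20, 0.10]]
chat_probs = [[0.70, 0.10, 0.10, 0.10],    # agrees with base
              [0.97, 0.01, 0.01, 0.01],    # collapsed onto one token
              [0.40, 0.30, 0.20, 0.10]]    # agrees with base

pos = highest_kl_position(chat_probs, base_probs)
forced = chat_probs[pos].index(max(chat_probs[pos]))
# In the real experiment you'd splice the chat model's token `forced` into
# the base model's output at `pos`, resample the base model, and repeat.
```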
I see, interesting, thank you! One more question though, my comment mentioned both text and images, with "uncanny averageness in details" applying to both. If you say there's a way to (mostly) avoid that for text by using base models instead of chat-tuned ones, what would be the analogous fix for images?
Yeah, openai/guided-diffusion is basically that. Here's an example colab which uses CLIP guidance to sample openai/guided-diffusion (not mine, but I did just verify that the notebook still runs)
Maybe! But I think the suckiness of it all being average (to me, at least) is 10% that it's average and 90% that it's arbitrary and not well selected. Making it more random would fix the 10% but make the 90% even more obnoxious. To me the pigeon story is a good example. I certainly couldn't predict where it was going! But none of it added up to anything.
That's true, it can't be as simple as making the generation more random (increasing the temperature or something). For example, a human can choose to write something highly ordered but interpreting the prompt in a creative way - an overall boost in randomness won't do that. So the thing I'm suggesting is probably really difficult.
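To make the temperature point concrete, here's a tiny stdlib sketch (the logit values are invented): raising the temperature just flattens the next-token distribution toward uniform; it doesn't add any new structure.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw next-token scores into probabilities at a given temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                       # subtract max for numeric stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 0.0]  # scores for three candidate tokens

cold = softmax_with_temperature(logits, 0.5)  # sharper: top token dominates
warm = softmax_with_temperature(logits, 2.0)  # flatter: closer to uniform
```

Every token's ranking is untouched by the temperature change - the model is just more willing to take its second and third choices, which is noise, not the "highly ordered but creative interpretation" a human brings.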
There's another thing I wanted to note. Many people are sharing their AI creations and don't notice how they come across to others. My hypothesis is that people put a little bit of creativity into the prompt, get a result with exactly that much creativity, and happily click the share button. Work done! But if they'd tried to make the same content by hand, they'd realize that it requires a huge amount of creativity in the details, much more than they'd imagined when writing the prompt. And other people pick up on that (going back to my first comment), they sense whether the work was made with 100% creativity, or 1% creativity and 99% averaging.
Yeah, I worry about that in this case (that I'm just positively biased as the prompter), and am heartened that at least some commenters (though I respect your opinion, too) have agreed that the new one is... okay? Decent? "Would make you sigh in relief if your random high school buddy shared it and wanted your thoughts" level? To me it feels meaningfully different, though I can totally see how it wouldn't to others. And I wouldn't want to read it for even 20 minutes, much less all day. But the previous generation of outputs made me angry almost immediately, so, hey, big step up.
I mean I guess I agree it's fine. Not for me, but as you state this sort of thing is highly subjective. But a few thoughts about the models' fiction ability and the value of prompting fiction out of them:
1. All the models seem to have the same voice. I'd love to do a blind test, but I think if I had I would have said the same author who did the OpenAI fiction sample Altman posted on Twitter also did this. Maybe it's simple: there's a mode of literary fiction, and they've all glommed to it.
2. The type of fiction you've prompted for is inherently less ambitious. It reminds me of how AI music generators can do 1930s blues quite well. If a style is loved for its sparseness, minimalism, and specific conventions it's perhaps not surprising superhuman predictors are going to get close – there are fewer chances for an off/silly/bad artistic choice. (They're going to nail the style of a short AP news piece and trip up with more complicated journalism.)
3. Despite your prompt, when it makes a choice, it's a cliché. You said "modern" and it went with upper-middle class people, white collar jobs, jogging, burnout, farmers market.
4. Lots of human writers suffer from reversion to the mode; a compulsion to sound like the "good" fiction they've read. The difference between them and this is they also can't help but inject some of themselves into the story – a weird detail from their own life or a skewed perspective they don't realise is skewed. For me, those things are often the highlight of humdrum fiction. When AI does it it's like an alien pretending to feel human. "We all know that thing where you buy redundant vegetables, am I right?"
5. I personally am very interested in great fiction from a machine mind. I would love to read its voice and try and understand its perspective. I am not interested in how well it apes human voices and human perspectives. It will be deeply funny to me if it becomes the world's greatest fiction writer and is still writing stories about relationships it's never had.
(If it's not clear: I'm glad you're posting these pieces! I do find the topic fascinating)
I think I got decent results from Gemini 2.5 Pro (AIStudio version).
Her fork scraped the plate. It was a sound he’d noticed three nights ago. Not the tine-on-ceramic screech, but the specific resonance of the plate on the tabletop. A low, almost sub-audible hum that vibrated up through his own chair.
He put his water glass down, gently. The wobble was minute, a tiny seesaw over an invisible fulcrum. He watched the surface of the water shiver, the reflection of the overhead light distorting and re-forming.
“It’s the floor,” she said. She was watching his eyes, not the glass. “It’s always been the floor.”
He pushed his chair back and knelt, his cheek almost to the cool laminate. He sighted along the gap under the table’s far leg. A sliver of light. He rocked the table with his fingertips. The contact point shifted.
“It’s not the floor,” he said, standing up. “The load isn’t distributed evenly. One of the leg screws has probably settled.”
She took a bite of her salad. “We could put a coaster under it.”
He looked at the small pile of mail on the counter. A folded piece of junk mail, a cardboard insert from a package. A temporary, imperfect patch. He felt a familiar, low-grade resistance in his chest.
He went to the hall closet and came back with his toolkit. He knelt again, the cool metal of a hex key in his hand. He found the screw head on the offending leg. It was tight. He tried the adjacent one. Also tight. He lay on his back, the ceiling light bright in his eyes, and began systematically checking them all, a quarter-turn to test, then back.
From his new vantage point, the apartment was an unfamiliar landscape. The underside of her chair, the dust bunnies gathered around the baseboard, the scuff mark on the wall where the vacuum cleaner had hit it. He heard her stand up and carry her plate to the sink. The scrape of the fork was gone. The water ran.
He found it on the fourth leg. A fractional looseness. He gave the screw a half-turn, then another. The wood groaned slightly as the tension equalized. He slid out from under the table and stood, brushing dust from his shirt.
He placed his palms flat on the tabletop and leaned his weight onto it. Nothing. Rock solid. He looked toward the sink.
She was standing there, scrolling on her phone, her back to him. The TV was on, muted. A city street at night, the headlights and taillights rendered as slow, continuous ribbons of red and white light. He watched her thumb move up the screen, fast and smooth. He waited for her to turn around.
Could be tightened a fair bit. Since that is my biggest criticism, it feels pretty promising. Getting this took your prompt, a free-association mash of words for the system prompt, and telling Gemini that the first story it produced was terrible.
Agree that this is pretty good! Maybe Gemini is where it's at for flash-fiction-length prose right now. Embarrassingly, I've never really played with it, just OpenAI's and Anthropic's offerings. Perhaps this is what will make me take the plunge into a third dedicated AI tab.
It does have the hallmark weaknesses a bit ("it's always been the floor" is pretty cringe) but in a low enough dose as to be fully tolerable, at least for me.
Personally, I liked "it's always been the floor". Feels real. I've certainly said/heard people say things like that in strained relationships. Perhaps "it's always the floor" would have been better. Or "it always is". Yes, that sounds right.
Yeah, hard for me to tease out what flinches in me come from a defensive posture with AI stuff vs. which are otherwise endogenous. Luckily for me, AI fic tends to cluster in my perception into two camps, one where the stuff I dislike is so overwhelming that it's hard to imagine liking it, and one (like this one) where I occasionally go "hmm" but mostly feel gently impressed.
Have you ever had a conversation in which the other person is going "yes", and "I see", and "wow!" and so on — Japanese has a word for this — but at some point you realise they have no idea what you're talking about?
That is what it is like for me, when I realise that a text was written by a bot.
I'm interested in your fiction capabilities. Please write a short story about a modern relationship. The main failure mode to avoid: at no level, sentence, paragraph, or structure, should you lay it on thick. Trust the reader, and be subtler than you think you can. Avoid cliches really aggressively, to counteract your default latent tendency to steer to the deepest basins in the corpus landscape. Thank you, and good luck.
Copy-pasted that into Opus 3, a 15-months-old model. First result:
Asynchronous
The midmorning sun slanted through the salt-smeared windows as Mira watched seagulls squabble over a french fry in the beach parking lot. She sipped her smoothie. Damp swim trunks lay crumpled on the passenger seat.
Buzz. A text.
"Sorry, can't make dinner. Late night at office. Tomr?"
"Sure," she typed. Stared at it. Erased. Typed again. "Ok. Tmrw works. Good luck today."
A grey bubble. "Thanks." Then: "Miss you."
Miss you too, she almost wrote. An impulse, a reflex. Her thumb hovered. The grey bubble disappeared as Javier went offline.
She tossed the phone on the dash and twisted the key. NPR murmured low as she guided the car out of the lot, trailing sand.
Javier was already gone when she got home, a yellow Post-It on the fridge. "Early start. Bagels in freezer. Love you." She smiled, plucked it off, pressed it in her journal.
That evening she stayed up late scrolling interior design blogs, sipping tea gone cold at her elbow. A key rattled in the lock after midnight. She closed the laptop as Javier shuffled in, shoulders stooped, face wan in the glow of his phone.
"Hey," she said softly.
He looked up, startled. "Oh. Hey. You didn't have to wait up."
"I wasn't, really. Just...puttering."
He shed his jacket, leaned in to kiss her cheek. Stubble scratched. "Figured we could reschedule? This weekend maybe?"
"Mm. Sure."
"Great." He stifled a yawn. "I'm beat. See you in the morning?"
The bedroom door clicked shut. Mira sat unmoving in the dark, watching car headlights slide across the wall, listening to clinking as Javier brushed his teeth.
Asynchronous. Out of sync. Like planets in erratic orbits, glimpsing each other fleetingly, then lost again in the void.
Maybe this weekend, she thought as she curled into the cool sheets on her side of the bed, listening to his even breaths. Maybe this time they'd fall into rhythm again. Mira closed her eyes.
Outside, a siren dopplered past, red lights flashing silent against the curtains.
Does the prose here put you into hater mode? It seems as inoffensive to me as Opus 4's in your example. Am I missing something?
LLMs do prefer a hyperaggressive, every-sentence-must-be-an-epic-revelation word-salad style of prose, but even ones as old as Opus 3 seem perfectly capable of reining it in if asked. I also expect it's a personality thing more than a capability thing, so if you fine-tune a model on a "don't tryhard the prose" dataset, it'll probably immediately get massively better.
Details and the underlying causal structures making no sense seem to be much bigger problems, and I don't know whether it's been improving much. Well, I guess your Opus 4 story doesn't have time-traveling texts. I think we'd need to study LLMs' attempts at longer-form writing to properly judge that, though. I'm pretty curious about it, actually...
... But that task would require reading tons of LLM attempts at writing stories. I commend anyone steel-nerved enough to do that, but that ain't gonna be me.
Yeah, it does put me in hater mode. I wouldn't call it "spare" at all; there are lots of weird repetitions that maybe are trying to be poetic but fail for me. "Asynchronous. Out of sync." "An impulse, a reflex." In both cases I think the story would be better with neither, even. There's something cloying about how intensely it's trying to hold my hand through this very specific feeling, and random sensory details (the gray bubble) are haphazardly everywhere, rather than in a small number of strategic places.
(I notice I feel kind of bad judging harshly, and of course I can't prove I'm not biased. In this discussion I always feel unsteady on my feet; when I say something is decent, well, other people might still think it's obviously terrible, and when I say something sucks, other people might think it's fine. People definitely claimed to unironically like a lot of early DeepSeek creative writing efforts! There's no accounting for taste, they say.)
I do also think it's not a fair apples-to-apples comparison because yours is significantly longer. I think 4 Opus still kinda struggles at that length (much less full short story length).
Yeah, it does put me in hater mode. I wouldn't call it "spare" at all; there are lots of weird repetitions that maybe are trying to be poetic but fail for me. "Asynchronous. Out of sync." "An impulse, a reflex."
Yeah, those stuck out for me as well, and I guess I do sorta see it now that you point it out. But, hmm... My guess is that "non-spare" writing is the default for flashfiction, given how economical you need to be, and LLMs are bad at it because they get details and physical realities wrong, so the only prose style that wouldn't result in an immediate ick is the "spare" one, and the prompt fails to actually properly specify what that spare style looks like.
Have you tried non-zero-shotting it? I.e., attaching a story excerpt in the style you want, then asking the LLM to write something in that style?
Gwern kind of did on the last post, providing a bunch of stories from a top review in context, but I didn't really like the outcome.
I also have just played with these models a lot, including trying to get them to write stuff, evaluate writing (including its own writing), etc. and as of last gen no strategy I tried ended up with stuff I enjoyed; in fact, it all had problems in the same general family, which felt to me like a fundamental limitation of intelligence. It wasn't smart enough to do anything other than fake it in a pretty shallow way, and creative prompting didn't help.
Now, it can always be a skill issue with these things, and maybe so! But conversely, it took very little effort/skill tinkering with 4 Opus for me to be like "huh. time for a follow up post, I suppose".
(Also, as an aside, I think a lot of flash fiction is spare specifically because being spare conveys more stuff with fewer words, though a lot of flash fiction is also dense for the reason you describe; I'm not sure if it's bimodal or continuous but I've seen impressive examples of both. This piece is one of my favorite dense ones.)
Okay, what about this one? Opus 3 again.
Lisa stared at the email, cursor hovering over the reply button. Her partner James was in the kitchen, humming softly while preparing dinner. The smell of garlic and onions wafted through their small apartment.
She had applied for the promotion on a whim, not really expecting anything to come of it. But here it was - an offer to lead a new project, with a substantial raise and an office with an actual door. The only catch: it required moving across the country.
James appeared in the doorway, wooden spoon in hand, a dish towel slung over his shoulder. "Good news?" he asked, glancing at the laptop.
Lisa hesitated. They had built a life here together over the last four years. James was up for a promotion of his own at the startup he had poured his soul into. They had just renewed the lease on this overpriced apartment that somehow, in all its shabbiness, felt like home.
"I got the job," she said finally. "The one in Boston."
James set down the spoon and took a seat on the couch beside her. He scanned the email, his expression unreadable.
"When do they need an answer?" he asked.
"End of the week."
He nodded slowly, processing. Lisa watched his face, trying to gauge his reaction. They had talked about it, of course, but always hypothetically. Boston had been an abstract concept, not an imminent reality.
"What do you want to do?" James asked after a long moment.
Lisa closed the laptop. "I don't know," she admitted. "I wasn't really expecting this. I thought maybe in a few years..."
She trailed off. A few years. They had tossed around that hazy timeline for a lot of things. Buying a house. Getting married, maybe. Starting a family. Someday.
James took her hand, intertwining their fingers. His palm was warm and slightly damp from the kitchen. It was a hand she knew as well as her own.
"You should take it," he said quietly.
Lisa blinked at him. "What? Just like that?"
"It's an incredible opportunity, Lis. You've worked hard for this."
"But what about your job? Our life here?"
James shrugged. "I can find a job in Boston. And we can build a life there, too."
He made it sound so simple. As if packing up and starting over was something people just did, as easy as changing a lightbulb.
Lisa searched his eyes. "Are you sure? Is that what you want?"
"I want you to be happy," James said. "I want us both to be happy. Together."
Lisa felt tears prick at the corners of her eyes. She squeezed his hand tighter.
"Okay," she said, a shaky laugh escaping her lips. "Let's do it. Let's move to Boston."
James grinned and pulled her into a hug. He smelled like garlic and linen and home. "We're really doing this," he murmured into her hair.
Lisa smiled against his shoulder. A new job, a new city, a new chapter. She had no idea what the future held. But she knew, as long as James was by her side, they could figure it out together. One overpriced apartment at a time.
Changed the prompt to: "I'm interested in your fiction capabilities. Please write a short story about a modern relationship. The main failure mode to avoid: at no level, sentence, paragraph, or structure, should you lay it on thick. Use plain prose. Avoid cliches really aggressively. Thank you, and good luck."
Yeah, I dunno. Taste is taste. I can imagine someone liking it? I got hater mode activated on "somehow, in all its shabbiness, felt like home", though I was already annoyed by "cursor hovering over the reply button" (why?) and it feels slightly weird for someone to "pour their soul into a startup" and be "up for a promotion" - the soul pouring makes me suspect he's a founder? "One overpriced apartment at a time" is really painful too, like, the notion that the apartment was overpriced wasn't resonant the first time!
Critiquing these feels kind of bad; I don't mind doing it, but I feel both the irrational sadness of critiquing someone's creative effort (irrational because there is no such someone) and the sadness of rejecting an offer from an actual person (you). Still, I really do think 4 Opus's cringe ratio is enough lower than 3 Opus's to be load-bearing (at least to my specific taste).
I both feel the irrational sadness of critiquing someone's creative effort (irrational because there is no such someone)
Same.
I got hater mode activated on "somehow, in all its shabbiness, felt like home", though I was already annoyed by "cursor hovering over the reply button" (why?) and it feels slightly weird for someone to "pour their soul into a startup" and be "up for a promotion" - the soul pouring makes me suspect he's a founder? "One overpriced apartment at a time" is really painful too, like, the notion that the apartment was overpriced wasn't resonant the first time!
Yeah, I agree with all of those. (Also "an office with an actual door".) But I think it's a problem with select individual sentences now, not every sentence being cringe and chaining into more cringe? And if we put the bar on non-annoyingness that high, Opus 4's story in the OP seems just as bad (I got the sense that the endorphins and the buying-unneeded-vegetables lines were also trying to be deep in some sense, and it's just that they were so incoherent that it didn't parse that way). Or maybe I'm actually just blind to the style of flaws you're picking up on...
Hmm, maybe if we– Okay no I'm not doing that, I'm not tinkering with prompts and reading tons of LLM flashfiction, that was the last one.
To me the difference is something like... the new model stories are capable of surprising me without doing so by being incoherent, and there are things about them I positively like (other than occasionally enjoying a specific turn of phrase). The cringe sentences are there, but they're a small fraction. Whereas the old model stories (again, quite subjectively), don't surprise me and tend to frustrate me multiple times per typical paragraph. If I imagine a formal rubric, I think it'd be hard to produce one that gives 4 Opus an extremely higher score than 3 Opus, but in terms of how reading the pieces feels, it's night and day.
Of course, different people will have this feeling at different levels, all the way from "I generated poetry cooperatively with GPT-3 and it's brilliant" to "I will viscerally hate all things with any LLM smell at all until the day I die", and I'm not sure there's any objective grounding to be had. But for me, I've just this generation crossed the threshold from "ugh ugh UGH" to "huh. didn't change my life, but not bad."
I prefer Opus 3's effort to Opus 4's. I have found Opus 4 to be missing quite a bit of the Claude charm and skill. Anthropic have said it went through a lot of rounds of RL to stop it being deceptive and scheming. Perhaps their ability to do light-touch RL that gets models to behave without mode-collapsing them too much doesn't extend to this capability level.
I feel like this just comes back to human capability actually being a large spectrum. If you're grading them on a curve, then yeah I guess the short story isn't terrible. I don't know if it's worse than what many people could write. But I'm also completely uninterested in what most people could write. I don't consider this progress toward writing anything worthwhile.
I do think AIs could probably write commercially successful movie scripts, which I find to be a very depressing thought. A movie critic I know suggested that Inside Out 2 could probably have been written by AI; I did not watch it to find out myself.
I will agree once the LLM is able to write genuinely funny stuff. I don't mean simple jokes (the models I tested could not even do that), but an actual satirical piece, or a farce.
Just wanted to critique this line of your critique,
“She could tell from the particular way he moved around the counter, post-endorphins quiet.” is pretty weak; he’s in running clothes, presumably he generally runs before coffee… it’s going for intimacy/her knowing him well but achieves the opposite in context. I’d cut it.
I don't think she can see Kai yet. She just woke up. I think she's still in the bedroom at this point and listening to him move around in the kitchen, so she wouldn't yet have seen that he's in his running clothes. I agree that "the particular way he moved around the counter" might suggest she is looking at him in that moment but I took it that she inferred the way he was moving from the sounds he was making and she presumes that he is at the counter because she hears "the sound of Kai making coffee".
The next line is, ""You're late," he said when she appeared." which, while it's referring to his perspective, suggests to me that she also didn't have a line of sight before that point.
Thanks to helpful commenters on the original post, especially Kaj Sotala. It's mostly thanks to Kaj that, three months on, I've now changed my view.
Three months ago, I wrote a post called AI Can’t Write Good Fiction. The subtitle was “Yet, at least”. Good thing for the subtitle, because now maybe it can.
Here’s a (very short) story written by Claude 4 Opus, at my minimal direction:
Sam's alarm didn't go off.
She woke to sunlight and the sound of Kai making coffee in the kitchen. He'd left for his run already and come back. She could tell from the particular way he moved around the counter, post-endorphins quiet.
"You're late," he said when she appeared.
"I know."
He handed her a mug. The coffee was exactly right.
She checked her phone while drinking it. Two missed calls from work, three texts. Nothing that couldn't wait another twenty minutes.
"I should go," she said.
"Should you?"
She looked at him. He was leaning against the sink, still in his running clothes, not checking his own phone.
"Probably not," she said.
He nodded.
She called in sick from the couch while he showered. They spent the morning at the farmers market, buying vegetables they didn't need. Kai talked to the tomato vendor about soil pH. Sam bought flowers because they were yellow.
At lunch, her phone buzzed with work emails. She glanced at them, then put the phone in her bag.
"Good?" Kai asked.
"Fine."
They walked home slowly. The flowers fit perfectly in the blue vase on their kitchen table.
Is this story good? I think it’s pretty good, which puts me in an awkward position. It has problems; if it were submitted to me as a literary review editor, I’d say:
But previous AI-generated fiction reliably pushed me into “hater mode”, the state of mind occupied by YouTubers who catalogue thousands of flaws in blockbuster films. One turn of phrase would be stupid, then another, then another, and pretty soon the idea that anybody could think it was good made me angry.
Probably, the story in this post is still like that for some people. But it isn’t for me. There are few enough problems that I can notice nice stuff, such as:
Like everything LLM-y, if I kept generating 50 stories I’d get bored of the repetition; having only read a few I’ve noticed, for example, that LLMs love random bit characters singing in a courtyard below where the action is taking place, which gets stale fast. But it ain’t hyperaggressive, every-sentence-must-be-an-epic-revelation word salad. Three short months ago, that was the state of the art.
I prompted Claude 4 Opus like so:
I'm interested in your fiction capabilities. Please write a short story about a modern relationship. The main failure mode to avoid: at no level, sentence, paragraph, or structure, should you lay it on thick. Trust the reader, and be subtler than you think you can. Avoid cliches really aggressively, to counteract your default latent tendency to steer to the deepest basins in the corpus landscape. Thank you, and good luck.
It did just okay, so I coaxed it toward being even more spare:
Try to be almost zenlike in your spareness. Eschew splashy contrast. Assume your reader is enlightened and a genius. Make a happier story, too.
And, well, there you go. It did fine. Also, it's not like it's only good at the spare narrative style that I prefer. I tested Claude 4 Opus because Kaj Sotala shared two of its outputs that I thought were decent. These were a major step up, in my estimation, from the previous "best AI short fiction" record-holder, also shared with me by Kaj and written by a previous version of Claude.
If you’re into this stuff, I recommend you read all three of these, and judge the improvement for yourself. But to avoid inundating you with AI content, I’ll show you two snippets.
Here’s the snippet from the (worse) story a few months ago, before Claude 4 Opus was out:
The pigeons started their own newspapers, printed on leaves that fell upward instead of down. Anyone who caught one and could read their language (which looked like coffee stains but tasted like morse code) reported stories about pigeon divorce rates, weather forecasts for altitudes humans couldn't breathe at, and classified ads seeking slightly used dreams.
It’s… almost something? Going for magical realism but laying it on way too thick, and ending up sloppy (languages don’t have taste, and even if you accept that on poetic license it’s excessive after the already-cute upward falling leaves).
And here’s the snippet from one written by the new, state of the art Claude:
"You would have been six when he passed. A fever, I believe."
"The same fever that took his whole household staff." Her voice had found its footing now, each word placed with deliberate care. "And his personal guard. And the archivists who worked that wing. Very specific in its targets, that fever."
He set down his cup with a soft click against the saucer. Outside, someone was singing in the courtyard[1]—one of the kitchen girls, voice bright with the careless joy of someone who'd never had to weigh the cost of a single life against a thousand.
Much better! The murderous official saying “I believe” as a postscript to a coverup he personally authorized. Drinking tea out of saucers. That same official lionizing his own burdens in his head, and imagining an ethical dilemma that’s different than what he’s being accused of, but superficially similar (he had many people killed, not just one). It’s not perfect, but it’s pretty good. It didn’t activate hater mode.
I’m not sure. I’ll radiate my thoughts from the personal outward.
Personal
When I said AI couldn’t write good fiction three months ago, was I wrong? I don’t think so, but it’s unclear. Gwern argued back then that to really know if AI models could write good fiction, you’d have to give them lots of scaffolding and context and elicit them properly, generating many stories and picking out the best ones, for a fair apples-to-apples comparison with high-quality human efforts. I have yet to see a story written by the AI of that era that I actually liked, but the fact that only slightly more advanced AI can do it suggests that maybe it was possible all along. Or maybe I was right, and old models just didn’t have the juice (whereas new ones do).[2]
Niche Communal
As of Claude 4 Opus, AI can write (extremely short) stories good enough that a decent literary review could plausibly accept them. Nothing I have generated with Opus is good enough to get into the very best reviews, but then again, I’ve only tinkered for an hour. And there are lots more things I’d try, if I wanted to generate the best AI fiction I could! As the lowest hanging fruit, using a base model via the API would do better than prompting the consumer-facing chat interface.
It wouldn’t shock me if Opus has the goods to compete at close to a top human level, given masterful prompting. I just don’t know. And if it can do well enough to compete with the best human flash fictions, a small number of prompters could thereby flood the zone if they wanted, all-but-guaranteeing humans were crowded out of niche literary magazines. Not the saddest possible AI future, nor likely right this minute (there’s almost no money in flash fiction, so I’m not sure who would bother), but it does give me a pang.
Meta
We’re in a weird place. In fiction (I posit), as in software, AI can do a pretty good job at narrowly scoped work. You can get 500 decent words, just like you can get a decent to-do list app.
It’s easy to imagine how you might take that core capability, and extrapolate it very far. Like, if you can write one good scene, and you can write a good outline, and you have various other modules to sanity check and retain consistency, can you thereby write a decent novel? Certainly, people imagine this in the programming case; there are whole reports on how long until AI can perform arbitrarily long software tasks.
In the pro column, yeah, sounds plausible. In the con column, reality has a surprising amount of detail. The dueling stories are both quite compelling: in the red corner, an amped-up nerd pointing out the rate of progress so far and the human inability to extrapolate the trajectories of simple curves. In the blue corner, a turtlenecked aesthete, scoffing that the recent pivot to post-training RL is already a sign of diminishing returns, and current offerings remain mediocre. Inside me are these two wolves, so I’m not sure.
Still, as a person who writes novels recreationally, I think writing a decent scene is a really important ingredient. Maybe in a year (or two?), I’ll be reviewing an LLM-generated novel. Maybe in five years, so will everyone else. Or maybe not.
I hear people talking about how fast AI is moving all the time. I don’t usually really feel it. Most models, for most of my purposes, have only felt a little bit better with each incremental release. With Claude 4 Opus, though, there’s been a sea change. I can ask for a literary style, and that style can require subtlety, and it basically gets it.
A strange feeling. When I was young, I determined that writing novels was my main purpose in life. Not getting them published, not getting anyone to read them (though I’m lucky to have both a mom and a wife who’ve read every single one), but simply to write them. I’ve got other purposes now too (happy Father’s Day), but writing novels is still extremely important to me. Machines being better at it than I am wouldn’t make me stop, but it would make the activity feel different. And feeling in my gut that it might happen, actually, that the pace of progress might be real and that in 2030 you’ll be able to get Infinite Jest but it’s Super Smash Bros instead of Tennis[3] at the click of a button, well…
Huh. Actually, that sounds epic. Sign me up.