I think there aren't enough "A day in the life of..." write-ups out there, so for today's Inkhaven post I decided to write my own. You might have a hard time understanding this if you aren't a software developer who's at least minimally familiar with some common but not universal full-stack web tech choices, particularly GraphQL. (If I were less pressed for time, I might have tried to make it more legible to non-engineers, or at least to engineers who haven't used GraphQL, but alas.)
This post was written in real-time while I was conducting a performance investigation on LessWrong. The contents are mostly unedited. This particular issue wasn't that high priority, but it was something I'd had on my mind for a while and seemed like a pretty reasonable thing to live-stream live-tweet live-post. It's not wildly unrepresentative of the kind of work I do at other times.
It started with an email from Vercel telling me that there was a spike in errors on some route, so I pull up the Observability dashboard.
Ok, the errors look like a temporary blip that's since passed. But the /rationality route's error rate (and P75 Duration[1]) catches my eye. Those are pretty horrifying numbers.
/rationality links to our "collection" page for Rationality: A-Z. In the back of my mind, I'd already flagged that page as having performance issues. I don't know exactly what's wrong with it, but my guess is that we're doing some unnecessary N+1 queries[2], or something similarly silly.
My first instinct is to just go look at the client-side code, to check what GraphQL queries it's running, and work backwards from there.
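The page's data requirements boil down to something like this (a rough sketch of the shape rather than the literal code; field and sub-fragment names are approximate):

```ts
import { gql } from '@apollo/client';

// Approximate shape of the collection page's main fragment (not the literal
// code). There's also a separate CollectionsEditFragmentQuery, which only runs
// when the page is in its editing state.
const CollectionsPageFragment = gql`
  fragment CollectionsPageFragment on Collection {
    _id
    title
    slug
    user {
      _id
      displayName
    }
    contents {
      ...RevisionDisplay
    }
    books {
      ...BookPageFragment
    }
  }
`;
```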
Ok, seems straightforward enough. I'm not going to look at CollectionsEditFragmentQuery, since it only gets run if the page is set to an editing state.
Knowing the codebase means I can often skip having to do any reasoning and jump straight to the most likely culprit: the books resolver. (user is also a resolver, but it's a single round-trip with an O(1) lookup. There are some resolver fields inside of the RevisionDisplay fragment used by contents, but those should also all be one O(1) round-trip at worst, with pretty manageable constant factors.) Everything else is a database field.
So I pull up the books field in the Collections schema:
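Stripped of the schema boilerplate, the resolver boils down to roughly this (a sketch, not the literal code; the SQL string is the query it actually compiles down to):

```ts
import type { Pool } from 'pg';

type Collection = { _id: string };
type Book = { _id: string; collectionId: string; number: number };

// Illustrative version of the books resolver (not the actual code): fetch all
// of the books belonging to a single collection, in order.
async function booksResolver(collection: Collection, db: Pool): Promise<Book[]> {
  const { rows } = await db.query<Book>(
    'SELECT * FROM "Books" WHERE "collectionId" = $1 ORDER BY "number" ASC',
    [collection._id],
  );
  return rows;
}
```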
Well, that could be an N+1 query, if we were ever fetching multiple collections at once, but we're not. I'm pretty confident that the query itself, which compiles down to SELECT * FROM "Books" WHERE "collectionId" = $1 ORDER BY "number" ASC, is fast even if it's not indexed, because the table size is going to be tiny[3]. I guess it's time to look at BookPageFragment:
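Roughly (again, a sketch rather than the literal fragment; the sub-fragment names are hypothetical):

```ts
import { gql } from '@apollo/client';

// Approximate shape of BookPageFragment. Note that it pulls in posts directly
// *and* sequences.
const BookPageFragment = gql`
  fragment BookPageFragment on Book {
    _id
    title
    number
    postIds
    posts {
      ...PostsListFragment
    }
    sequences {
      ...SequencesPageFragment
    }
  }
`;
```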
Now I'm starting to get a bad feeling.
For context, the two units of organization that you[4] might be familiar with are posts and sequences, which are composed of posts. Secretly, however, sequences have "chapters", which are (currently) an admin-only feature. By default, when you create a new sequence, an untitled chapter is created for it, and all the posts you think you're adding to the sequence are actually being added to that chapter. (For an example of a sequence which has more than one chapter, see Mysterious Answers, which has an "Interlude" chapter as well as the untitled default.)
So the true hierarchy, starting with "collections" like R: A-Z, The Codex, and HPMOR, is:
Collection > Books > Sequences > Chapters > Posts
(Mistakes were made when designing the data model, here. Also, you might have noticed that BookPageFragment is fetching postIds and posts directly, in addition to sequences - what's up with that?)
So one obvious guess for what's going wrong here is that we're performing a series of sequential queries (colloquially known as a "waterfall"). However, after thinking about it for a second, I don't expect this to explain most of a P75 of 6 seconds, since 4 sequential round trips to the database should only actually present a performance bottleneck if either the latency of the server talking to the database is substantial[5], or if one or more of the database queries themselves are very slow.
I could keep digging into the code, but at this point I have some guesses about likely contributing factors, and think that I'd be better off turning on our server-side database query logging with a local development instance and checking what queries get run in what order, and how long they take.
It turns out that loading /rationality involves 285 database queries. (Mostly not sequential, to be clear!)
The log output looks like this[6]:
This is kind of annoying to read, so I'm going to take a brief detour and use this as an excuse to test out Gemini 3.0's coding chops by having it write a tool to generate a waterfall graph, given this kind of log output.
Here is the prompt I gave it:
Prompt for waterfall graph parser and display
Can you write a tool that will allow me to visualize the kind of query log output in @testlog5.txt in a "waterfall graph"? This probably wants two parts: first, a front-end that accepts some structured data for display, and second, a parser that takes in the raw text log and parses it into that structured format.
I don't care about the log lines indicating individual route requests (i.e. POST /analyticsEvent 200 in 5ms (compile: 1531µs, render: 4ms)), those should be excluded/ignored. I care about matching up postgres query starts (indicated by Running Postgres query #, for the query strings or identifiers) and finishes (indicated by Finished query #, for the timing information). Keep in mind that some query strings will span multiple lines in the raw log, and those still need to be handled correctly.
The waterfall graph should display a chronological left-to-right view of "bars" representing queries. Each bar should show the query "number". The query strings/identifiers should be displayed on hover. For now, skip doing anything with the byte size info in the UI, but do parse it.
Please write the parser in Typescript.
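For concreteness, the parsing half of what I'm asking for boils down to something like this - a minimal sketch to illustrate the spec, not anything the models produced (the exact format of the "Finished query" line is treated loosely, since all it needs to do is pair starts with finishes):

```ts
// Sketch of a parser for the query log format described above. Query strings
// can span multiple lines, so any line that isn't a recognized marker gets
// appended to the most recently started query.

interface ParsedQuery {
  id: number;
  query: string;        // full (possibly multi-line) query text, including args
  finishInfo?: string;  // raw remainder of the "Finished query" line (timing, bytes)
}

const RUNNING_RE = /^Running Postgres query #(\d+):\s*(.*)$/;
const FINISHED_RE = /^Finished query #(\d+)\b(.*)$/;
// Route-request lines like `POST /analyticsEvent 200 in 5ms (...)` get ignored.
const ROUTE_RE = /^(GET|POST|PUT|PATCH|DELETE) \S+ \d{3} in /;

export function parseQueryLog(raw: string): ParsedQuery[] {
  const byId = new Map<number, ParsedQuery>();
  let current: ParsedQuery | undefined;

  for (const line of raw.split('\n')) {
    const running = line.match(RUNNING_RE);
    if (running) {
      current = { id: Number(running[1]), query: running[2] };
      byId.set(current.id, current);
      continue;
    }
    const finished = line.match(FINISHED_RE);
    if (finished) {
      const entry = byId.get(Number(finished[1]));
      if (entry) entry.finishInfo = finished[2].trim();
      current = undefined;
      continue;
    }
    if (ROUTE_RE.test(line)) continue;
    if (current) current.query += '\n' + line; // multi-line query continuation
  }
  return [...byId.values()];
}
```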
While Gemini's pondering, I decide to skim the log to get a high-level sense of the query ordering and batching. I notice some alarming patterns, like this one:
Running Postgres query #348: SELECT "Collections".* FROM "Collections" WHERE "slug" = $1 LIMIT $2: ["rationality",1]
Running Postgres query #349: SELECT "Chapters".* FROM "Chapters" WHERE "sequenceId" = $1 ORDER BY "number" ASC NULLS FIRST: ["5g5TkQTe9rmPS5vvM"]
// Queries 350 - 397 omitted for brevity
Running Postgres query #398: SELECT "Collections".* FROM "Collections" WHERE "slug" = $1 LIMIT $2: ["rationality",1]
Running Postgres query #399: SELECT "Chapters".* FROM "Chapters" WHERE "sequenceId" = $1 ORDER BY "number" ASC NULLS FIRST: ["pvim9PZJ6qHRTMqD3"]

This was a sequence of 26 pairs of "fetch the rationality collection" and "fetch an individual chapter by ID" queries, all sent off concurrently. Now, those are all going to be pretty lightweight queries, and they're being run concurrently, but that's still 52 queries that could be 2 queries (the collection queries are redundant and the chapter-by-ID queries can be grouped into one).
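The standard fix for this pattern in GraphQL-land is DataLoader-style batching. A minimal sketch of the idea applied to the chapter query above (the codebase has its own batching machinery, so this is illustrative rather than a description of the actual fix):

```ts
import DataLoader from 'dataloader';
import type { Pool } from 'pg';

type Chapter = { _id: string; sequenceId: string; number: number | null };

// Batches all of the "chapters for sequence X" lookups that happen in the same
// tick into a single query, instead of one query per sequence.
function makeChaptersBySequenceLoader(db: Pool) {
  return new DataLoader<string, Chapter[]>(async (sequenceIds) => {
    const { rows } = await db.query<Chapter>(
      'SELECT "Chapters".* FROM "Chapters" WHERE "sequenceId" = ANY($1::text[]) ORDER BY "number" ASC NULLS FIRST',
      [[...sequenceIds]],
    );
    // DataLoader expects results in the same order as the input keys.
    return sequenceIds.map((id) => rows.filter((c) => c.sequenceId === id));
  });
}
```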
Oh, look, Gemini's done:
Not a one-shot success, sadly. (I have no idea why that gap between 335 and 336 is there.)
Alas, looks like Google can't keep up with the release-day traffic, so Gemini doesn't have a chance to rescue itself. Switching to GPT-5.1 Codex (from scratch, with the same prompt).
At this point I spot "...despite missing timestamps..." in GPT-5.1 Codex's thinking traces, and metaphorically facepalm. I go add a Date.now() to the Finished query log, grab some fresh log output, and kick off GPT-5.1 Codex again, with an extra sentence in the prompt about the timing information. The CoT traces make me suspect it's going down a rabbit-hole, so I concurrently fire off Sonnet 4.5 with the same prompt.
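(The logging tweak itself is about as small as it sounds - roughly this, though the variable names and exact line format here are invented rather than copied from the code:)

```ts
// Log an absolute timestamp alongside the duration, so the waterfall tool can
// place each bar on a shared timeline instead of having to infer positions.
function logQueryFinished(queryNum: number, durationMs: number): void {
  console.log(`Finished query #${queryNum}: ${durationMs}ms (at ${Date.now()})`);
}
```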
...an hour or two of iteration later, and I have this, which is at least better than scanning through a raw log file:
That 29x expanding to a bunch of "Chapters" queries is what's left of the 26 pairs of "Collection" and "Chapter" queries I noticed above, after I deduplicated the Collection query.
Then, after I cause the "Chapter" queries to be aggregated into a single query, I see this:
We've gone from 264 (total) queries down to 39. But there were only ~26 Chapter queries - collapsing those can't account for a drop that big on its own!
And then I see that one of the two Revisions queries (#892) has 240 query arguments being passed into it.
Before, we had a separate set of 6 queries being sent off for each Chapter, including the one to fetch its latest revision:
So it seems like that earlier failure to batch was also breaking a bunch of downstream batching (unsurprising).
The page performance hasn't improved that much, however. A small part of that is that testing locally adds an extra ~70ms to each round trip that isn't present in the deployment environment, since my local machine is on the west coast and our database is on the east coast. But it's clear that a bunch of the performance bottleneck is from fetching ~240 revisions (which we're doing twice, for some reason), and maybe also the tags; I have no idea why we're fetching those.
Those are almost certainly the latest revisions for each post that we're fetching - 240 sounds like about the right number of posts in R: A-Z. I check, and there are actually over 300 posts. Hmm.
Ok, turns out my query logging had been truncating the arguments (and, LLMs being LLMs, the code "helpfully" and silently handles such invalid input, rather than loudly blowing up). There are in fact 338 revisions being fetched (the correct number).
Now, do we need these revisions at all? Unfortunately, given the way the code is currently structured, the answer seems to be yes - we're rendering each post using a shared component which expects at least some information about the revision, for its tooltip hover-preview.
However, there's no fundamental reason we need to load all of that information before returning the rendered page; the client could fetch it afterwards instead, shaving seconds(!) off the page load time.
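A sketch of the general shape of that change, assuming an Apollo-style client (the query and field names here are hypothetical, and this isn't necessarily how it'll end up being wired):

```ts
import { gql, useQuery } from '@apollo/client';

// Hypothetical query for just the revision-derived bits the page needs.
const BOOK_REVISION_META = gql`
  query BookRevisionMeta($bookId: String!) {
    bookRevisionMeta(bookId: $bookId) {
      postId
      wordCount
      htmlHighlight
    }
  }
`;

// ssr: false keeps this query off the server-side rendering critical path;
// the data gets fetched from the client after hydration instead.
export function useBookRevisionMeta(bookId: string) {
  return useQuery(BOOK_REVISION_META, {
    variables: { bookId },
    ssr: false,
  });
}
```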
So I go start slimming down the relevant fragments, only to discover another thing that the revisions are being used for: the (aggregated) read time and word count estimates for the books and sequences.
At this point, the work becomes a pretty tedious slog of shuffling things around to avoid fetching Revisions while we're on the critical path for rendering the page for the user who requested it. I'm not very happy with the end result, since it involves fetching all of the posts a second time to get their Revisions, and I might just implement a couple custom resolvers for those bits of information instead - they almost certainly aren't going to change, so we can toss them into a long-lived cache and not worry about it.
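Something along these lines, if I do go that route (a sketch of the idea, not an implementation; computeBookReadStats stands in for the real aggregation over Book > Sequences > Chapters > Posts > latest Revisions):

```ts
type BookReadStats = {
  wordCount: number;
  readTimeMinutes: number;
};

// Long-lived in-memory cache: these numbers essentially never change, so
// computing them once per book per server process is fine.
const readStatsCache = new Map<string, BookReadStats>();

export async function cachedBookReadStats(
  bookId: string,
  computeBookReadStats: (bookId: string) => Promise<BookReadStats>,
): Promise<BookReadStats> {
  const cached = readStatsCache.get(bookId);
  if (cached) return cached;

  const stats = await computeBookReadStats(bookId);
  readStatsCache.set(bookId, stats);
  return stats;
}
```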
Anyways, that's it for today. The work definitely took a lot longer than it would have if I hadn't been concurrently writing this post, and I probably spent a little too much time spinning my wheels with the new tooling (though we do go spelunking in those query logs semi-regularly, so I expect it to come in handy in the future). Also, it's not deployed yet :(. Maybe tomorrow.
The 75th percentile response time, meaning 75% of requests complete faster than this.
A common anti-pattern where you make one query to fetch N items, then N additional queries to fetch related data for each item, instead of batching them together.
After writing that, I checked, and yep, we only have 21 "books".
As a LessWrong reader.
They're in the same data center, so the time attributable purely to crossing the relevant distance should be on the order of a millisecond. If we're actually returning large amounts of data from the database, that could also increase the latency.
Starting on #335 because I didn't want any "cold start" delays to muddy the waters, so this was recording the second page load.