The content of this post was a bit hard to follow, so I seriated the sentences based on OpenAI embeddings, for everyone's convenience.
(The filenames may be hash gibberish, and the file-created times arbitrary, or indicate nothing more than when you happened to download a file.)
It is obscure (I had never heard of the term until a year ago or so), but highly useful: it works for everything from seriating Egyptian graves by rough burial time to organizing tag entries by topic.
This could be used to seriate LW2 tag-entries by something more relevant than either just date or upvotes.
It could also be used to present the tags themselves as a big 2D grid.
Then you just seriate them as if they were normal tagged items.)
(A tag embedding is the average of all its members' embeddings.
Since we now have neural embeddings for just about every modality there is, that means you can seriate anything.
I use it on Gwern.net (background) to organize the 'similar links' recommendations in a way much smarter than the naive k-NN embedding distance retrieval approach.
You can do further nice tricks with it, like infer the number of clusters/topics by where distances spike in size, and label them automatically with a LLM.
if you seriate them, however, suddenly you see clear clusters/topics emerge out of the chaos, and it's easier to skim the list.
Seriation also works beautifully for photographs or other big jumbles of images, where usually there is no way to 'sort' them that matters.
The practice of 'sorting by color' can be a simple form of seriation, and better than nothing.
It turns out that it is perfectly possible to loosen the definition of 'sorting' to something more approximate like 'try to minimize how different each item is from the next one'; this approximate or generalized sorting is called 'seriation'.
Seriation [or "ordination"], i.e., finding a suitable linear order for a set of objects given data and a loss or merit function, is a basic problem in data analysis.
The infrastructure comprises data structures to represent linear orders as permutation vectors, a wide array of seriation methods using a consistent interface, a method to calculate the value of various loss and merit functions, and several visualization techniques which build on seriation.
In this paper we present the package seriation which provides an infrastructure for seriation with R.
"Getting Things in Order: An Introduction to the R Package seriation", Hahsler et al 2008:
This is a linkpost for https://www.jstatsoft.org/article/download/v025i03/227
To illustrate how easily the package can be applied for a variety of applications, a comprehensive collection of examples is presented.
What might you use it for?
It works, but the data formatting issues are a bit tricky, so I think I may have to scrap my little prototype and start over with a JSON/'function-calling'-style approach, which I am not too familiar with.
I sometimes seriate regular lists of text in my writing, where I can't come up with something meaningful.
I've also played around with an OA API LLM script for doing seriation on short natural language text lists, which could be used to automatically seriate lists in anything you write, or which could be used to clean up messy notes.
(For example, you could seriate your chaotic incoherent notes, then tell a LLM to rewrite your notes strictly line by line, and then, with the organization and grammar/spelling all cleaned up, start working on it yourself.)
But it's not hard: a GPT-4o-scale LLM understands a prompt like "Reorder them to group similar items, but do NOT rename, add, or remove any items." pretty well, so you can easily make your own tool.
Have you ever wondered how to sort a list or a folder of files where no strict sorting comparison operator like 'newer than' is quite right?
Keep this in mind the next time you see a list or a grid anywhere, where there's not an obviously correct way to sort it: it doesn't have to be sorted in a dumb way or left sorted at random, when it could be... seriated!
If you just sort them by 'distance', it is mostly meaningless and produces a jumble (see for example any algorithmic set of recommendations, like YouTube video lists - if I open a cat video, I see cat/anime/Touhou/cat/CS/music/cat/...);
Because it works so well, and is so simple to implement (simple greedy distance minimization, no need for TSP solvers), I initially called it "sort by magic".
Nevertheless, both exact solution methods and heuristics are available.
Caused by the problem's combinatorial nature, it is hard to solve for all but very small sets.
Very funny, but the OA embeddings were always bad at sentence embedding specifically, compared to other sentence-specialized NN embeddings; and as the original OA embedding paper somewhat defensively argues, it's not even clear a priori what a sentence embedding should do, because a sentence is such a cut-down piece of text that doing well at a sentence-embedding task may only be overfitting, or come at the cost of performance on more meaningful text-embedding tasks. (Similar to word embeddings: they are so polysemantic or context-dependent that it seems like they must have substantial limits - which is part of the motivation for Transformers in the first place, after all...)
That's why I was experimenting with prompting a LLM to do seriation rewrites (instead of just splitting on punctuation to reuse my existing greedy-pairwise approach, and being done with it). A prompted LLM takes full context and purpose into consideration, and avoids the issues with bad embeddings on very small texts, so the seriation outputs aren't crazily random, but sensible. This also helps avoid issues like Algon's, where a general-purpose embedding, blind to context or purpose, winds up emphasizing something you don't care about: if Algon had been able to prompt a seriation, like 'sort by theme', the LLM would almost certainly not seriate it by the 'question formatting', but would organize his little Q&A question set by topic - biology, then chemistry, then physics, say. And if it doesn't, it's easy to add more context or instructions. (There are promptable embedders, but they are much more exotic and not necessary here.)
(Which makes sense, because if you ask a LLM to sort a list of items in a freeform normal way, like a chat session, it is perfectly capable of it; in my poetry selection the other day, "Bell, Crow, Moon: 11 Variations", I had Claude/Gemini/GPT suggest exactly how to sort the 11 poems we curated into a pleasing sequence, and they did come up with a much nicer poetry sequence than the original random one. And why wouldn't they be able to do that, when they were good enough to write most of the poems in the first place?)
I was trying out a hierarchical approach when I stopped, because I wasn't sure if I could trust a LLM to rewrite a whole input without dropping any characters or doing unintended rewrites. Aside from being theoretically more scalable, and potentially better by making each step easier and propagating the sorting top-down, explicitly turning it into a tree means you can easily check that you get back an exact permutation of the list at each step, and so that each rewrite was safe (see the sketch below). I think that might be unnecessary at this point, given the steady improvement in prompt adherence, so maybe the task is now trivial.
There are no explicit distances calculated: you just ask the LLM to sort the list meaningfully.
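One way to realize the tree idea, as a sketch (bottom-up merging rather than strictly top-down, but with the same safety property; `reorder` stands for whatever prompted-LLM call you use, and it assumes the list items are unique):

```python
from collections import Counter

def hierarchical_seriate(items: list[str], reorder, chunk: int = 8) -> list[str]:
    """Seriate small chunks with the LLM, then seriate the chunks
    themselves via their first elements, verifying at every step that
    the LLM returned an exact permutation (nothing dropped or rewritten)."""
    def checked(xs: list[str]) -> list[str]:
        ys = reorder(xs)
        assert Counter(ys) == Counter(xs), "not an exact permutation"
        return ys
    if len(items) <= chunk:
        return checked(items)
    # seriate each chunk, then recursively order the chunks by their heads
    chunks = [checked(items[i:i + chunk]) for i in range(0, len(items), chunk)]
    heads = hierarchical_seriate([c[0] for c in chunks], reorder, chunk)
    by_head = {c[0]: c for c in chunks}  # heads are unique if items are
    return [item for h in heads for item in by_head[h]]
```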
Reporting my experience: I tried this using the greedy approach, and it didn't work. The sorted list was ordered by superficial features of the list elements (e.g. many items start with "what is [...]"). I might use a TSP solver, or just re-centre the embeddings by removing the average value.
Yeah, it's limited by what kind of structure you have. It did seriate your list successfully, it sounds like; it's just that you have a lot of structure in the list that you don't care about, so no embedding is going to prioritize the other stuff, and the distances aren't useful to you in general. This will hurt any embedding-related use-case, not just seriation - presumably your k-NN lookups aren't terribly useful either, and mostly just pull up hits which have superficial syntactic similarities.
This is probably less of a problem with my annotations, because I reformat them before embedding and add in all available metadata - not just the tags or the titles of links in it as a link-bibliography, but also tricks like including the titles of its reverse-citations, so the more an annotation gets linked, the more its embedding reflects its usage. So the formatting is uniform (nothing like "half of them start with 'what is X' and half don't"), and there's a lot of very semantic information.
"Getting Things in Order: An Introduction to the R Package seriation", Hahsler et al 2008:
Have you ever wondered how to sort a list or a folder of files where no strict sorting comparison operator like 'newer than' is quite right? It turns out that it is perfectly possible to loosen the definition of 'sorting' to something more approximate like 'try to minimize how different each item is from the next one'; this approximate or generalized sorting is called 'seriation'. It is obscure (I had never heard of the term until a year ago or so), but highly useful: it works for everything from seriating Egyptian graves by rough burial time to organizing tag entries by topic. Since we now have neural embeddings for just about every modality there is, that means you can seriate anything.
What might you use it for? I use it on Gwern.net (background) to organize the 'similar links' recommendations in a way much smarter than the naive k-NN embedding-distance retrieval approach. If you just sort them by 'distance', the order is mostly meaningless and produces a jumble (see for example any algorithmic set of recommendations, like YouTube video lists - if I open a cat video, I see cat/anime/Touhou/cat/CS/music/cat/...); if you seriate them, however, suddenly clear clusters/topics emerge out of the chaos, and it's easier to skim the list. Because it works so well, and is so simple to implement (a simple greedy distance-minimization heuristic worked surprisingly well for my use-case; no need for TSP solvers), I initially called it "sort by magic".

This could be used to seriate LW2 tag-entries by something more relevant than either just date or upvotes. It could also be used to present the tags themselves as a big 2D grid. (A tag embedding is the average of all its members' embeddings. Then you just seriate them as if they were normal tagged items.) You can do further nice tricks with it, like inferring the number of clusters/topics from where the distances spike in size, and labeling them automatically with a LLM. I sometimes seriate regular lists of text in my writing, when I can't come up with a more meaningful order myself.
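To make the greedy approach concrete, here is a minimal sketch (illustrative code, not the actual Gwern.net implementation; it assumes you already have one non-zero embedding vector per item):

```python
import numpy as np

def greedy_seriate(emb: np.ndarray) -> list[int]:
    """Greedy seriation: start somewhere, then repeatedly append the
    nearest not-yet-used item by cosine similarity. A crude TSP
    approximation, but enough to make the clusters emerge."""
    X = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit-normalize
    order, remaining = [0], set(range(1, len(X)))
    while remaining:
        last = X[order[-1]]
        nxt = max(remaining, key=lambda i: float(last @ X[i]))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def topic_breaks(emb: np.ndarray, order: list[int], z: float = 2.0) -> list[int]:
    """Guess cluster boundaries: positions where the distance to the next
    item spikes more than `z` standard deviations above the mean gap."""
    X = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    gaps = np.array([1 - float(X[a] @ X[b]) for a, b in zip(order, order[1:])])
    return [i + 1 for i, g in enumerate(gaps) if g > gaps.mean() + z * gaps.std()]
```

The greedy ordering depends on the starting item; restarting from each item and keeping the ordering with the lowest total distance is a cheap improvement. Each segment between breaks can then be handed to a LLM for an automatic label.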
Seriation also works beautifully for photographs or other big jumbles of images, where usually there is no way to 'sort' them that matters. (The filenames may be hash gibberish, and the file-created times arbitrary, or indicate nothing more than when you happened to download a file.) The practice of 'sorting by color' can be a simple form of seriation, and better than nothing.
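A quick-and-dirty version of 'sort by color', as a sketch assuming Pillow (the directory name is illustrative; downscaling each image to a single pixel approximates its mean color):

```python
import colorsys
from pathlib import Path
from PIL import Image  # pip install pillow

def hue_of(path: Path) -> float:
    """Mean color of the image (via a 1x1 thumbnail), as an HSV hue."""
    r, g, b = Image.open(path).convert("RGB").resize((1, 1)).getpixel((0, 0))
    return colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[0]

photos = sorted(Path("photos").glob("*.jpg"), key=hue_of)
```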
I've also played around with an OA API LLM script for doing seriation on short natural-language text lists, which could be used to automatically seriate lists in anything you write, or to clean up messy notes. (For example, you could seriate your chaotic incoherent notes, then tell a LLM to rewrite your notes strictly line by line, and then, with the organization and grammar/spelling all cleaned up, start working on it yourself.) It works, but the data formatting issues are a bit tricky, so I think I may have to scrap my little prototype and start over with a JSON/'function-calling'-style approach, which I am not too familiar with. But it's not hard: a GPT-4o-scale LLM understands a prompt like "Reorder them to group similar items, but do NOT rename, add, or remove any items." pretty well, so you can easily make your own tool.
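A minimal sketch of such a tool with the OpenAI Python client (the model name, prompt wording, and function name are illustrative, not my actual script): have the model return a JSON array, then reject anything that isn't an exact permutation of the input:

```python
import json
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_seriate(items: list[str], model: str = "gpt-4o") -> list[str]:
    """Ask the model to reorder a list so similar items are adjacent,
    returning a JSON array so parsing is unambiguous."""
    prompt = (
        "Here is a JSON array of items:\n"
        + json.dumps(items, ensure_ascii=False)
        + "\n\nReorder them to group similar items, but do NOT rename, "
        "add, or remove any items. Reply with only the reordered JSON array."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    result = json.loads(resp.choices[0].message.content)
    # safety check: reject anything that isn't an exact permutation
    if sorted(result) != sorted(items):
        raise ValueError("Model dropped, added, or rewrote items; retry.")
    return result
```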
Keep this in mind the next time you see a list or a grid anywhere, where there's not an obviously correct way to sort it: it doesn't have to be sorted in a dumb way or left sorted at random, when it could be... seriated!