Interactive exploration of LessWrong and other large collections of documents

vpetukhov; FriendlyOwl

TL;DR We made a tool that lets you interactively explore LessWrong, the EA Forum and, potentially, other large collections of documents. This part of the Information Retrieval field seems highly underutilized in practice, so I listed some ideas on where it could be used, including a project on AI Governance research. I have already tried some of them, and the results look promising. Please let me know if you have ideas on how to apply this to other fields, especially if you'd be personally interested in doing so.

Intro

While doing my PhD, and more generally while looking for information on areas that I don't understand, I kept running into the problem that ordinary search engines return only a very small subset of the information on a topic. This leaves you with very little idea of what you are missing, or even of how much missing information there is in total. So, together with Denis Skripov (software development), Elena Plekhanova (ideation) and Alexander Fedorenko (both), we created a tool for building an overview of large collections of documents. We applied this tool to LessWrong and the EA Forum.

The goal of this post is to share the visualizations, to describe the ideas behind the tool and its possible applications, and, hopefully, to gather some feedback. Lots of research on similar topics has been done, so I expect that some of the ideas listed here reinvent the wheel or are known dead ends. If you have some info on that, please share it in the comments or in a PM.

If you're only interested in testing the tool, jump straight to the "Instructions" and "Demonstration on LessWrong" sections. In "Motivation and the Concept description" I give a summary of my views on this part of exploratory search. "Methods" describes the algorithms we used. And in "Further applications" I describe my estimates of where these ideas could be applied. Any feedback is highly appreciated, but the "Input you could provide" section gives some details on what would be the most valuable.

Please, keep in mind that the tool is only a proof of concept and that every aspect of it can be improved.

Motivation and the Concept description

Imagine you're a graduate student who has just started working in a new field and wants to make some sense of it. Or you just want to learn some big new topic, such as "rationality techniques". There are two general ways to approach the problem. One way is to start with what you already know, find its borders and build upon that: list the questions you already have and try to answer them. Let's call this approach "incremental" or "bottom-up". Alternatively, you may start by building an overview of the whole field by reading some intros or reviews. Then, having a map of the field, you go to the most interesting/relevant parts or just sample all parts uniformly (the "top-down" approach). The book Ultralearning suggests that the top-down approach allows learning waaay faster. Moreover, the purely incremental approach suffers heavily from informational bubbles. Nevertheless, the practical information retrieval world is dominated by recommendation systems and search engines that operate in a bottom-up manner. Don't get me wrong, I do think that search engines are among the best inventions of the 1990s, but they assume that I always know the question I need to answer. And that is often unreasonable. How am I supposed to know the question if it's the first time I encounter the area?

The difference between the bottom-up and the top-down approaches can also be explained in terms of stochastic search methods. When you use search engines, creating new search queries and going through their results amounts to sampling pages from the space of relevant information. This sampling gives some estimate of the landscape of the information space. So, in theory, one should wait for the saturation point, where not enough new information pops up. A big problem with this approach is that the samples are quite correlated, and we don't have an overview of the whole space. So lots of samples are required, and there is no way to tell whether we're stuck in a local optimum. This is inherited from Random Walk approaches, which are exactly "incremental". In contrast, having a map of the whole space would allow us to sample from it uniformly, which is a much more efficient strategy.
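A toy simulation makes the contrast concrete. It is only a sketch under made-up assumptions: the information space is a line of 1000 pages split into 10 equally sized topics, "incremental" search is a random walk that always moves to a neighbouring page, and "top-down" search draws pages uniformly from the whole map. All names and numbers are hypothetical and have nothing to do with the tool itself.

```python
import random

N_PAGES = 1000   # hypothetical size of the information space
N_TOPICS = 10    # hypothetical number of equally sized topics
N_SAMPLES = 50   # number of pages we are willing to read


def topic_of(page):
    """Map a page index to its topic (pages are laid out topic by topic)."""
    return page * N_TOPICS // N_PAGES


def random_walk_topics(start, steps, rng):
    """Incremental search: every new page is a neighbour of the previous one."""
    page = start
    seen = {topic_of(page)}
    for _ in range(steps):
        page = min(N_PAGES - 1, max(0, page + rng.choice((-1, 1))))
        seen.add(topic_of(page))
    return seen


def uniform_topics(samples, rng):
    """Top-down search: pages are drawn uniformly from the whole map."""
    return {topic_of(p) for p in rng.sample(range(N_PAGES), samples)}


rng = random.Random(0)
walk = random_walk_topics(start=0, steps=N_SAMPLES, rng=rng)
uniform = uniform_topics(N_SAMPLES, rng)
print(f"topics seen by the random walk: {len(walk)} of {N_TOPICS}")
print(f"topics seen by uniform sampling: {len(uniform)} of {N_TOPICS}")
```

With 50 steps from page 0, the walk cannot leave the first topic (it never gets past page 50), while uniform sampling with the same budget typically touches most of the topics.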

One revolution in my (and many others') mind was Roam Research, which proposed building unstructured collections of documents and inferring structure automatically from the connections between the documents. That seems like the right way to deal with overwhelming amounts of information, but Roam did not proceed further, and I find their Graph Overview to be totally dysfunctional (at the time of writing this post). Obsidian and Foam do better, but their overviews are still limited.

This brought me to the idea of creating a tool that could accept a large collection of documents and show what is actually there, and which of these documents I should read. The Effective Altruism Forum was a perfect target, especially given that when I started the project it still didn't have tags. Here are the three key features I defined for the tool:

Showing the full collection of documents, providing a maximally transparent overview
Being able to capture the whole content in a concise way, so I could get some grasp of the content in less than a minute
Searching across the collection to understand which regions I should investigate further

Finally, the prototype is out and is applied to both LessWrong and the EA Forum.

Instructions

So much for theory, let's get to the exploration. Here are the links:

LessWrong: App, Dendrogram, Word clouds clust1 and clust2
EA Forum: App, Dendrogram, Word clouds clust1 and clust2

For both, login is "lw_user" and password "lw1826". I've set it as a way of blocking access for web scrapers.

The exploration starts with a dendrogram, a representation of the clusters as a tree. The root is on the far left: the whole LessWrong dataset of 20748 articles. The next layer of branches (column) represents the main clusters of the LessWrong articles (clust1), with their sub-clusters on the far right (clust2). Each box contains the keywords of the cluster. The same keywords are shown in more detail in the word clouds. Then, we go to the main app, which has three tabs:

Clusters. Here, each point is an article, and you can look at the clusters in the 3D UMAP space. While the axes are not human-interpretable, you can see the relative positions of the clusters and articles, as well as explore each article by clicking on it. You can choose the color-coding of the space between clust1 and clust2 (corresponding to the two levels of the dendrogram), karma, and days (the number of days since the start of the forum). On the left, you can deselect specific clusters to hide them and focus on the most interesting regions.
Text search. At the bottom of the screen, you can search for a word and see where it appears most frequently (by clicking "Word Search" on the left). The main view is constructed in the same UMAP space, with colors showing the number of occurrences of the word on a log scale. On the right, you see the distribution of this word (and of the documents containing it) across the clusters. There are two color-coding options: the default is the word count, and the second one is the word count smoothed across the UMAP space. The current Word Search implementation is limited to one-word queries. If you want to search for longer phrases, "Semantic Search" is available. It shows the articles that (supposedly) have the closest semantic similarity to the query.
Found documents. The same representation of the clusters as in the Clusters tab, but only for the documents matching the query. There is an additional color-coding option available: "words", representing the number of found words on a log scale. Currently, this tab doesn't play well when multiple people access the server. So please don't be surprised if the visualization changes unexpectedly when you refresh the tab.

Please keep in mind that the tool is a very raw prototype, so bugs are totally expected.

Demonstration on LessWrong

Looking at the dendrogram, we see that Cluster 0 doesn't have any specific keywords, so it probably contains general articles. The Cluster 1 keywords are closely related to decision theory, while Cluster 2 is focused on rationality. All the remaining clusters also have meaningful keywords, suggesting that the text grouping works well.

It's also interesting to color the clusters by days, which immediately shows us a new cluster on COVID-19. Also, it looks like Cluster 1.0 (agent-human-model) is gaining popularity, while Cluster 10 (sequence rerun) consists mostly of old articles.

One more interesting observation is that while the majority of articles belong to several large clouds of points, there are a dozen tiny clouds around. These clouds mainly belong to Clusters 13 and 14, and they represent different recurrent threads. These threads are announced with approximately the same text, so they form very dense clouds.

Now, here is an example of information that you could retrieve with this tool and could not without it. Let's ask: what research projects are relevant to the rationality community?

For that, we can run Semantic Search on "research project" (A). It shows matches to various clusters (B), but we can pick only one of them, for example, "7.cryonics-brain-people" (C). Now there are only five articles:

Persistence of Long-Term Memory in Vitrified and Revived C. elegans worms: a link to a research paper, which shows that long-term memory is not affected by slow freezing in C. elegans.
Influence of scientific research asks for an impactful research topic, with the author describing his current project.
Genosets suggests applying some machine learning methods for genetic research to advance precision medicine.
Why SENS makes sense describes the ideas of Strategies for Engineered Negligible Senescence Research Foundation, as well as their active projects.
Food4Me - personalised nutrition initiative suggests a project for meta-analysis of the personalized nutrition field.

You can see that the matches are not perfect, but they generally make sense. This is to be expected from an algorithm as basic as Latent Semantic Indexing.
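For readers who haven't met Latent Semantic Indexing, here is a minimal sketch of the idea in plain NumPy: build a TF-IDF term-document matrix, keep the top singular vectors, project the query into the same latent space and rank documents by cosine similarity. The toy corpus, the function names (build_lsi, semantic_search) and the rank of 2 are assumptions made for illustration; this is not the tool's actual code.

```python
import numpy as np

# Toy corpus, made up for illustration only.
docs = [
    "cryonics brain preservation research",
    "brain memory research worms",
    "decision theory agent utility",
    "agent decision problem theory",
]


def build_lsi(docs, rank=2):
    """Fit a rank-`rank` LSI model on whitespace-tokenized documents."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            counts[index[w], j] += 1
    df = (counts > 0).sum(axis=1)            # document frequency per term
    idf = np.log(len(docs) / df) + 1.0
    tfidf = counts * idf[:, None]            # terms x documents
    u, s, vt = np.linalg.svd(tfidf, full_matrices=False)
    u_k = u[:, :rank]                        # term directions of the latent space
    doc_vecs = vt[:rank].T * s[:rank]        # documents in latent space (U_k^T A)
    return index, idf, u_k, doc_vecs


def semantic_search(query, model):
    """Rank documents by cosine similarity to the query in latent space."""
    index, idf, u_k, doc_vecs = model
    q = np.zeros(len(index))
    for w in query.split():
        if w in index:                       # unknown words are ignored
            q[index[w]] += idf[index[w]]
    q_latent = q @ u_k                       # same projection as for documents
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_latent)
    scores = np.divide(doc_vecs @ q_latent, norms,
                       out=np.zeros(len(doc_vecs)), where=norms > 0)
    order = np.argsort(-scores)
    return [(int(i), float(scores[i])) for i in order]


model = build_lsi(docs)
ranked = semantic_search("brain research", model)
for i, score in ranked:
    print(f"{score:.2f}  {docs[i]}")
```

Because documents are compared in a low-dimensional latent space rather than word by word, two texts can match without sharing the exact query words, which is also why some matches look only loosely related.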

Finally, I've run the tool on this very text and found the 5 most similar articles:

Methods

I came from the field of single-cell RNA-sequencing, which has a surprising amount in common with Natural Language Processing. There, graph-based algorithms are the gold standard: they're fast, elegant, and transparent. So I used them as the core of the tool here as well. The whole algorithm works as follows:

Documents are encoded as averaged word2vec vectors after stopword filtering.
A k-NN graph is built on them using cosine similarity.
This graph is embedded using UMAP and clustered using Leiden clustering. Each of the clusters is subclustered again, so we get a two-level hierarchy.
Keywords per cluster are determined using TF-IDF.
Latent Semantic Indexing is used for semantic search.
Relevance scores are smoothed using Graph Fourier Filtering to highlight the regions of the graph that are most relevant to the query.
Dash is used for visualization.
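To make the graph-based steps above more tangible, here is a rough NumPy sketch of three of them: averaging word vectors, building a cosine k-NN graph, and smoothing a relevance signal with a graph Fourier low-pass filter. The 2D "word vectors", the function names and the heat-kernel response exp(-tau * lambda) are all assumptions chosen for illustration; the post does not specify which filter is used, and UMAP and Leiden are left out because they need external packages.

```python
import numpy as np


def doc_vector(tokens, word_vectors, stopwords):
    """Average the vectors of all known, non-stopword tokens.

    Assumes every document contains at least one known word.
    """
    vecs = [word_vectors[t] for t in tokens
            if t in word_vectors and t not in stopwords]
    return np.mean(vecs, axis=0)


def knn_graph(X, k):
    """Symmetric 0/1 adjacency matrix of the cosine k-NN graph."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)           # a document is not its own neighbour
    nn = np.argsort(-sim, axis=1)[:, :k]     # indices of the k most similar docs
    A = np.zeros(sim.shape)
    rows = np.repeat(np.arange(len(X)), k)
    A[rows, nn.ravel()] = 1.0
    return np.maximum(A, A.T)                # make the graph undirected


def low_pass_filter(A, signal, tau=1.0):
    """Graph Fourier filtering with the (assumed) response exp(-tau * lambda)."""
    L = np.diag(A.sum(axis=1)) - A           # combinatorial graph Laplacian
    lam, U = np.linalg.eigh(L)               # eigenvectors = graph Fourier basis
    spectrum = U.T @ signal                  # forward graph Fourier transform
    return U @ (np.exp(-tau * lam) * spectrum)


# Hypothetical 2D "word vectors"; real word2vec vectors have hundreds of dimensions.
word_vectors = {"brain": [1.0, 0.1], "memory": [0.9, 0.2],
                "agent": [0.1, 1.0], "utility": [0.2, 0.9]}
stopwords = {"the"}
docs = [["the", "brain", "memory"], ["brain"], ["memory", "the"],
        ["agent", "utility"], ["agent"], ["the", "utility"]]

X = np.array([doc_vector(d, word_vectors, stopwords) for d in docs])
A = knn_graph(X, k=2)
relevance = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # only doc 0 matches a query
smoothed = low_pass_filter(A, relevance)
print(np.round(smoothed, 3))
```

In this toy example the graph splits into two triangles, so the relevance of the single matching document spreads to its two neighbours and never leaks into the unrelated group. That is the effect the smoothing is meant to have: highlighting a region of the graph rather than isolated points.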

We tried various fancy methods for keyword extraction (e.g., WayWard and TextRank), as well as for document embedding (fastText, doc2vec and sentence-transformers). But, surprisingly, none of them could beat the baselines.
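As a reference point for what the TF-IDF keyword baseline might look like, here is a small standard-library sketch. It treats each cluster as one big document and scores terms by term frequency within the cluster times the log inverse cluster frequency. This is one common variant and an assumption on my side; the exact weighting, the function name cluster_keywords and the toy data are illustrative only.

```python
import math
from collections import Counter


def cluster_keywords(docs, labels, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for every cluster label."""
    # Pool the token counts of all documents that share a cluster label.
    term_counts = {}
    for tokens, label in zip(docs, labels):
        term_counts.setdefault(label, Counter()).update(tokens)

    # In how many clusters does each term occur?
    n_clusters = len(term_counts)
    cluster_freq = Counter()
    for counts in term_counts.values():
        cluster_freq.update(counts.keys())

    keywords = {}
    for label, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_clusters / cluster_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        keywords[label] = sorted(scores, key=lambda t: (-scores[t], t))[:top_n]
    return keywords


# Toy tokenized documents and cluster labels, made up for illustration.
docs = [
    ["agent", "decision", "theory", "the"],
    ["decision", "agent", "the", "model"],
    ["brain", "cryonics", "the"],
    ["cryonics", "brain", "people", "the"],
]
labels = [0, 0, 1, 1]

keywords = cluster_keywords(docs, labels, top_n=2)
print(keywords)
```

Terms that occur in every cluster (here "the") get a score of zero, so even without a stopword list the keywords tend towards words that distinguish one cluster from the others.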

The server's compute power is limited, so I used some basic password protection to keep scrapers away. If you think there is a better way to limit abuse of the computational power, let me know.

A few things I'm going to add in the near future:

Better selection of articles by different criteria, such as karma or publishing date.
Visualization of topic evolution. Something close to what newsviz shows.
Adding the dendrogram and wordclouds to the app interactively.
Adding full-text search with Whoosh (we're already working on it)

Further applications

This section lists some large-scale ideas on where such a tool could be applied. If you have more, please write in the comments or in a PM. Any feedback on the listed ideas is also highly appreciated.

Blogs. Similarly to LW, it could visualize any blog with a large number of articles. Something like Medium would be perfect, and, personally, I'm tempted to try RNA-seq blog, as it's the broad topic of my PhD. A Russian tech blog Habr is also a promising candidate (they even had some datasets public).
Scientific articles were the end goal from the start. If this idea worked on arXiv or bioRxiv subsets, it would be a total success. So, in the near future I want to extract AI articles from arXiv, find those relevant to the AI Safety field and build some statistics on them. Visualizing this area together with some statistics on who works there (concrete universities/organizations/labs), citation information, and maybe even correlating it with some AI Progress Metrics could be quite cool, and it also seems to correspond to the FHI AI Gov Agenda. Please comment if you know someone who's working on this, would like to participate, or generally have any input! An easier target could be the Emergence: Complexity and Organization journal. Complexity Science is exactly the kind of area where I have around zero understanding and want to learn much more.
Analysis of news is a huge and complicated topic. Originally I thought that it was too hard to even try, but by accident I got into a collaboration with a group of Belarusian scientists and ran the tool on their collection of Belarusian news (~50k articles). Surprisingly, the tool works, allowing one to navigate the topics there, and the collaborators are quite happy with it. I haven't invested enough time to tell where the limit is, but this area should not be discarded. The ideal outcome would be to identify articles that cover the same topic and to visualize the opinions of different sources on this topic. Perhaps this could be solved with Named Entity Recognition or some other kind of segmentation algorithm. This is not a top priority and looks hard, so I'm not sure if I'll proceed with it in the near future. Anyway, if anyone has a good solution for that, I'd really appreciate it.
User reviews. I find it quite annoying that, when I'm looking for a product, it has hundreds of reviews, and making any sense of them takes ages. Summaries like "60% of users like the screen" help, but still look too much like cherry-picking. As an experiment, we've run aspect extraction on the public dataset of Amazon reviews. Here is one example of the visualization of the subsentences containing the extracted aspects. It still needs some tuning and summarization, but overall it shows the main branches of comments and makes it possible to capture their proportions. The bottleneck here is subsentence extraction, not visualization.
Personal/corporate knowledge bases. I still don't get why Roam does not make any effort to improve the Graph Overview. But it seems like a few basic algorithms could make this part much more usable.
Web annotation tools are a separate kind of personal knowledge base. I personally like Diigo and WorldBrain's Memex. It's quite tempting to store every article you ever found interesting, together with some highlights from each of them. However, storing such an amount of information requires either an advanced organization of knowledge (with all its overhead) or some means of automatically inferring structure. Tools like the one I made could be the latter.
(Radical) Collective intelligence is the idea of bringing many experts together to discuss important topics, with automated transcription and analysis of the dialogues, in order to generate solutions for mankind's problems. At least that's how I understood the point. I can't say that I'm completely on board with the idea, but aggregation and summarization of information seem to be quite important for it.
Packages for programming languages could be a completely orthogonal thing to visualize. For example, I'm subscribed to the Julia package updates and regularly see what's new there. And I think that looking at all packages together could help to find new views on programming that I hadn't thought of before. So how about scraping it together with PyPI and CRAN, and seeing what people have invented over the last decade?

Among these new directions, my priorities are:

arXiv articles about AI Safety: this is a big project I hope to implement at the next AI Safety Camp.
Emergence: Complexity and Organization journal. Though, maybe I'll get to it first, as testing the tool here is much faster than on the arXiv.
User reviews.

Input you could provide

Of course, any input is highly appreciated. Though I don't think it makes much sense to submit bug reports now, as I'm already aware of a lot of them. Here are a few things that could really help:

Ideas on the user interface, to make it easier to use
Cool NLP / Information Retrieval methods that could be relevant for the tool
Possible applications for the idea, especially those you personally are interested in
Information on related projects
Contacts of people who could be interested in this tool or collaboration
Any ideas on the concept as a whole
Is it alright that I scraped LessWrong? I've read the Terms of Use and didn't find anything against it. But if it's prohibited, I'll shut down the tool immediately.
