Open Source Search (Summary)
2025-06-20

Disclaimer

* Quick note
  * I support a complete ban on AI R&D. This app requiring AI doesn't change that.

Summary

* This document describes how to build an open source search engine for the entire internet that runs on a residential server.
* As of 2025 it'll cost between $100k and $1M to build and host this server. This cost will fall with every passing year, as GPU, RAM and disk prices fall.
* The most expensive step is the GPU capex to generate embeddings for the entire internet.
* Most steps can be done using low-complexity software such as bash scripts (curl --parallel, htmlq -tw, curl -X POST "$LLM_URL", etc).

Main

Why?

* I realised my posts on this topic are sprawling all over the place, without one post to summarise it all. Hence this post.
* If someone donates $1M to me, I might consider building this. I've written code for more than half the steps, and no step here seems impossibly hard.

Use cases of open source search

* Censorship-resistant backups
  * aka internet with no delete button aka Liu Cixin's dark forest
  * Any data that reaches any server may end up backed up by people across multiple countries forever.
  * You can read my other posts for more on the implications of censorship-resistant backups and discovery.
* Censorship-resistant discovery
  * Any data that reaches any server may end up searchable by everyone forever.
  * Currently each country's govt bans channels and websites that it finds threatening. It is harder to block a torrent of a qdrant snapshot than to block a static list of IP addresses and domains. This will reduce cost-of-entry/exit for a new youtuber.
  * Since youtubers can potentially run for govt, subscribing to a youtuber is a (weak) vote for their govt.
* Privacy-preserving search
  * In theory, it will become possible to run searches on an airgapped tails machine. Search indices can be stored o