Archiving link posts?

by Said Achmiz, 8th Sep 2018


Link rot is a huge problem. At the same time, many posts on Less Wrong, including some of the most important ones (those that introduce key concepts or otherwise advance our collective knowledge and understanding), are link posts. This means that a non-trivial chunk of our content is hosted elsewhere, across myriad other websites.

If Less Wrong means to be a repository of the rationality community’s canon, we must take seriously the fact that (as gwern’s research indicates) many or most of those externally-hosted pages will, in a few years, no longer be accessible.

I’ve taken the liberty of putting together a quick-and-dirty solution. This is a page that, when loaded, scrapes the external links (i.e., the link-post targets) from the front page of GreaterWrong, and automatically submits them to archive.is (after checking each link to see whether it’s already been submitted). A cronjob that loads the page daily ensures that as new link-posts are posted, they will automatically be captured and submitted to archive.is.
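The scrape, check, and submit steps described above can be sketched roughly as follows. This is a minimal illustration, not the code behind the actual page: the names `LinkCollector`, `external_links`, and `archive_new_links` are made up for this example, and the rule used to spot a link-post target (any absolute http(s) link pointing away from the site's own host) is an assumption, since the post does not say how GreaterWrong marks those links in its HTML. The "already submitted?" check and the submission itself are passed in as functions, so the loop can be tried without touching the network.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def external_links(html, site_host):
    """Return absolute http(s) links that point away from site_host.

    Assumption: "external link" here means "different hostname"; the real
    page may identify link-post targets in some other way.
    Order is preserved and duplicates are dropped.
    """
    collector = LinkCollector()
    collector.feed(html)
    seen, result = set(), []
    for href in collector.hrefs:
        parts = urlparse(href)
        if parts.scheme not in ("http", "https"):
            continue  # relative links, mailto:, etc.
        if parts.hostname is None or parts.hostname == site_host:
            continue  # link stays on the site itself
        if href not in seen:
            seen.add(href)
            result.append(href)
    return result


def archive_new_links(html, site_host, is_archived, submit):
    """Submit every external link that is_archived() does not recognise.

    is_archived and submit are caller-supplied callables (hypothetical
    hooks), e.g. a lookup against the archive and an HTTP POST.
    Returns the list of URLs that were submitted.
    """
    submitted = []
    for url in external_links(html, site_host):
        if not is_archived(url):
            submit(url)
            submitted.append(url)
    return submitted
```

A daily cronjob would fetch the front page, pass its HTML to `archive_new_links`, and supply real implementations of the two callables.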

This solution does not currently have any way to scrape and submit links older than those which are on the front page today (2018-09-08). It is also not especially elegant.

It may be advisable to implement automatic link-post archiving as a feature of Less Wrong itself. (Programmatically submitting URLs to archive.is is extremely simple. You send a POST request to http://archive.is/submit/, with a single field, url, with the URL as its value. The URL of the archived content will then—after some time, as archiving is not instantaneous—be accessible via http://archive.is/timegate/[the complete original URL].)
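For concreteness, here is roughly what that request looks like when built with Python's standard library. Treat it as a sketch under the description above rather than a tested client: the helper names `build_submit_request` and `timegate_url` are invented for this example, the endpoint and field name are taken directly from the preceding paragraph, and whether archive.is accepts exactly this request (without extra headers or rate limiting) is an assumption.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoints as described in the post.
SUBMIT_ENDPOINT = "http://archive.is/submit/"
TIMEGATE_PREFIX = "http://archive.is/timegate/"


def build_submit_request(url):
    """Build (but do not send) the POST request with a single 'url' field."""
    body = urlencode({"url": url}).encode("ascii")
    return Request(SUBMIT_ENDPOINT, data=body, method="POST")


def timegate_url(url):
    """Address where the archived copy should eventually be reachable:
    the timegate prefix followed by the complete original URL."""
    return TIMEGATE_PREFIX + url
```

Sending the request would then be a matter of passing the result of `build_submit_request` to `urllib.request.urlopen`. Since archiving is not instantaneous, the address returned by `timegate_url` may not resolve right away.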