I've been scraping data from the sequences recently, where by sequences I mean all of Eliezer's posts up to and including Practical Advice Backed By Deep Theories. I've been doing this mostly to get some fun data out and maybe some more useful things like the Bring Back the Sequences project, but one of the things I found is that there is breakage from the move from OB (and OB's subsequent reorganization) that remains unfixed.

In particular, 96 links either give 404s (not found), used to link to a comment but now only link to the main article, or link under the summary fold for no apparent reason. To avoid overloading this article, I have posted the list on piratepad here:


Note that I have only checked links that went to overcomingbias.com. This is not necessarily a complete list.

Some of these can be fixed by anyone with editing rights, but the ones pointing to comments can be fixed only by Eliezer or someone who knows what comment was meant to be linked. Alternatively, someone can go through the archive.org WayBack machine, figure out which comments were linked to, then find them in the equivalent LessWrong page, and finally provide the corrected link. I may modify the scraper to do this if someone is willing to make the substitution.

Also, a bunch of links (not in the above list) direct the user to OvercomingBias.com only to be redirected back to LessWrong. While this doesn't actually cause any breakage, it's a pity to be burdening OB's server for no real reason. I can produce a list of these if needed.

If I have managed to attract the attention of anyone with editorial rights, I would really appreciate it if you could help me out by removing certain formatting inconsistencies that greatly slow down and complicate my scraper. I can offer more details on demand, but these links to OB are near the top of the list.

I should be back with more interesting data soon. If you have any particular data-mineable queries about the sequences, let me know.

[Edit: The 4 links that point to a #comments fragment are actually processed correctly. That leaves 92 to be fixed.]


27 comments

Yes, particularly links to particular comments on OB that no longer link to the same comment on LW. This is very annoying when it's part of a point being made by a post.

Then there's dead external links, though fixing those would be slightly more work.

Well done for working on this very annoying problem.

Cheers. If anyone's willing to make the changes to the articles, it's trivial to reconfigure my scraper to test external links for 404s.

Or one could use a regular tool like link-checker to spider LW. I use it on gwern.net to good effect. (I've never thought of running it on LW because it would probably turn up so many broken links.)
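A 404 sweep like this is also simple to script directly; here is a minimal sketch using only Python's standard library (the function names are my own, not from any of the tools mentioned above):

```python
import urllib.request
import urllib.error

def check_link(url, timeout=10):
    """Return the HTTP status code for url, or None if unreachable."""
    req = urllib.request.Request(
        url, method="HEAD",
        headers={"User-Agent": "link-checker-sketch"})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # e.g. 404 for a dead link
    except urllib.error.URLError:
        return None

def dead_links(urls):
    """Filter a list of URLs down to those answering 404."""
    return [u for u in urls if check_link(u) == 404]
```

Some servers reject HEAD requests, so a production checker would fall back to GET; this sketch ignores that wrinkle.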

If you have any particular data-mineable queries about the sequences, let me know.

I have an idle curiosity about comment-patterns. Roughly speaking, I'd be intrigued to see a graph with posts ordered chronologically along one axis and number of comments along the other, with the ability to superimpose N lines one for each commenter.

This is, I repeat, idle curiosity. I don't expect anyone to actually do this, and don't consider it a particularly valuable use of anyone's time. But, you did ask, so there it is.

That's pretty cool. I only have comment totals for the moment, but I could extract individual commenters I suppose. Will consider.
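The data step for that graph is straightforward once per-commenter data exists; a sketch (the `(post_id, commenter)` pair format is an assumption for illustration, not the scraper's actual schema):

```python
from collections import defaultdict

def comment_series(comments, post_order):
    """comments: iterable of (post_id, commenter) pairs.
    post_order: post ids in chronological order.
    Returns {commenter: [comment count per post, in order]},
    i.e. one plottable line per commenter."""
    counts = defaultdict(lambda: defaultdict(int))
    for post_id, commenter in comments:
        counts[commenter][post_id] += 1
    return {c: [by_post[p] for p in post_order]
            for c, by_post in counts.items()}
```

Each returned list is one of the N superimposed lines; feeding them to any plotting library is then trivial.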

Thanks Alexandros! I just noticed your post here. You can email me in the future if you have other fixes for LW until we have a more defined process for doing these things and alerting admins of valuable fixes for LW: louie.helm(at)singinst(dot)org

Also, can you tell me how many words/characters are contained in the sequences?

For my definition of 'sequences', which is everything up to 'Practical Advice Backed By Deep Theories' minus the quotes threads (702 posts in total), the wordcount is 917,854. I know this list can be improved, will get a discussion going about what exactly the sequences are soon enough.

Can you give me more details about the character count? Do you mean alphanumerics and numerals for instance? Total characters including spaces and punctuation? Given a precise definition I can probably get it done in an hour or so (given the above caveats about what constitutes the sequences).

Will get in touch with more fixes as I find them.

I wanted to calibrate myself for how ridiculous it is to ask someone to read the sequences. For example, the King James Bible is 788,280 words. So asking someone to read the sequences is quantitatively similar to asking them to read 1.16 bibles.
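The arithmetic behind that figure, using the wordcount reported above:

```python
sequences_words = 917_854  # wordcount reported above
kjv_words = 788_280        # King James Bible

ratio = sequences_words / kjv_words
print(round(ratio, 2))  # -> 1.16
```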

Even Christians won't read the Bible cover to cover. And they believe the Bible is the word of God, contains all the most important wisdom in existence, and even a magical formula for making them live in paradise...forever! Infinite expected utility: Not enough to make someone read a bible.

We're lucky our target audience is more literate and that our text is way more interesting (and not as morally bankrupt) to read... but still. The number of people willing to read a giant philosophical text the size of the bible based on something like a friend's recommendation is... not so big.

I'd add that the bible ranges from dull to unparseable at points, and is generally a much harder read than the sequences if you want to actually understand what you're reading, but your point is a good one. We do need to boil the sequences down to something more accessible.

One of the things I'm thinking of doing with the parser is to make a sequences reader: it will start by giving you access to all the articles that have no internal dependencies, and as you read more articles, it will open up further articles that are now accessible. It won't make the sequences any shorter to read, but the idea is that this should help manage the 'tab explosion' effect that people have been reporting.
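The unlock rule can be sketched in a few lines (the article names below are hypothetical): an article becomes readable once every post it links to has been read.

```python
def available(deps, read):
    """deps: {article: set of articles it links to (its prerequisites)}.
    read: set of articles already read.
    Returns the unread articles whose prerequisites are all read."""
    return {a for a, reqs in deps.items()
            if a not in read and reqs <= read}
```

Repeatedly reading from the available set walks the dependency graph in a valid topological order, which is exactly the reader behaviour described above.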

This is a great idea.

Perhaps we should be saying "skim the sequences".

I read a Gideon Bible cover-to-cover once when I was stuck in a hotel room for two weeks without my laptop, but I'm neither Christian nor typical. And I'll admit I skimmed a few chapters starting somewhere around II Corinthians.

"Read the Bible" isn't bad advice for anyone that intends to spend a lot of time talking about religion in English, really. It won't give you any moral insight worth speaking of, and on its own it won't give you a deep understanding of Christian doctrine as it's taught in practice, but detailed knowledge of what the Bible actually says and does not say is remarkably useful both in understanding the religion's evolution and in self-defense against the kind of person that likes to throw chapter and verse at you.

On the other hand, that does presuppose a certain fairly well-established level of expected usefulness. From a lay perspective, the Sequences don't have that.

I wonder if some of the links that are failing to link to comments are caused by the comments in question being deleted. In particular Roko, who used to be one of the top contributors, deleted all his posts and comments for reasons that aren't entirely clear following the incident that must not be discussed.

While we're at it, might I make a feature request? I think it would be very helpful if deleted comments still had "permalink" and "parent" buttons like nondeleted comments, for navigational purposes.

I'm doing something not totally unrelated and will post in a couple days when done. For the moment, the comment from "What Would Convince Me That 2+2=3?" should point HERE.

Perhaps the next step could be to get help figuring out the equivalents for the rest of those links so that the implementation is clear (then: the LW article is listed, the broken link address is listed, and what-it-should-be is known in all cases).

This is in principle possible to do in an automated way via archive.org, just haven't sat down to do it yet, and would also like to know that someone is actually willing to implement it if done.
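The archive.org lookup can lean on the Wayback Machine's Availability API, which returns the snapshot closest to a given date; a hedged sketch (the helper names are mine, and resolving a snapshot down to the specific comment would still need per-page parsing):

```python
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"

def wayback_query(url, timestamp=None):
    """Build an Availability API query; timestamp is YYYYMMDD."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urllib.parse.urlencode(params)

def closest_snapshot(url, timestamp=None):
    """Return the URL of the closest archived snapshot, or None."""
    with urllib.request.urlopen(wayback_query(url, timestamp)) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest else None
```

Given a snapshot of the old OB page, a second pass would match the linked comment's text against the equivalent LW thread.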

Indeed -- if the answer is that it just won't be done... there's no point. Though, shouldn't there be someone to talk to who might allow you managerial access of some sort? I would think that LW could stand to have someone who can do things like this... and who actually wants to.

This isn't the only thing, either. Take the discussions on getting something simple like a karma point limit implemented to stop people from selling pandora bracelets. There were several discussions on it and solutions proposed... but sometimes it seems that the "IT" types of topics echo into /dev/null.

That's why I gravitated to scraping: it gives me the opportunity to do interesting things without needing to get permission. This post is just a side-effect I stumbled onto that I think is so close to the core as to have a chance of getting the LW developers activated. And the voting here is also a good signal. I've done my part; what they do is up to them.

I've done my part, what they do is up to them.

Well put.

I don't mean to sound bitter or anything. I myself have a community of which I am supposed to be the technical administrator, which I have neglected for years. I suppose its members are feeling the same. It's more fun to do a guerrilla project than to have to defend an existing codebase from rot.

I think the ideal here would be to have a group of programmers that were of the community so that the pain they feel as users drives them to write code. Some may say that bugs are open to anyone to contribute but there are subtle/crucial differences to what I have in mind. Mostly we need a hero that can motivate the community and has commit access.

Unrelated to my other comment, I'd be interested in what you're doing to scrape. In the cases where I've wanted LW articles, I've been using wget and then a bash script to change all of the obvious html stuff into a different markup form...

If you have already been doing something like this, I would be interested in how you're parsing (post-processing) your "scrapes." Or perhaps you're not and just using ping or wget or something similar to iterate through everything?

I've got a scraper up on scraperwiki. (which means it is being rerun automatically on a periodic basis). Check here. You can see the python source in the edit tab. It ain't pretty, but it works. The post-processing is mostly with lxml. You can also download the data as csv directly from the linked page, and you can run arbitrary sql queries on the data from here. Not sure if that covers it, happy to answer any questions.

Well, you're more advanced than me! I really, really, really want to learn python. Seeing it used is just more inspiration... but oh so much else to do as well.

I just hobble along with bash tools and emacs regexp hunting.

I know some Java, so I can somewhat follow most code, but the methods available in lxml are unknown to me -- it looks like that gives you quite the abilities when trying to digest something like this? For me, it's been using wget and then trying to figure out what tags I need to sed out of things... and LW seems a bit inconsistent: sometimes it's one tag and sometimes it's another.

Very interesting work!

lxml is a bit of a mindtwister and I only know as much as I need to, as more advanced things require XPath. If you're trying to get your head around how all this works, I suggest taking a look at my other two scrapers, which are considerably simpler:

http://scraperwiki.com/scrapers/commonsenseatheism_-_atheism_vs_theism_debates/ http://scraperwiki.com/scrapers/hp_fanfic_reviews/

As I learn more I take on more challenging tasks which leads to more complex scrapers, but if you know java and regex, python should be a breeze. I don't mind answering specific questions or talking on skype if you want to go through the code live. Duplication of effort is a pet peeve of mine, and using scraperwiki/python/lxml has been like having a new superpower I'd love to share. Don't hesitate to ask if you're willing to invest some time.

Don't hesitate to ask if you're willing to invest some time.

Deal! I'll read some about this and look into it more. I'm interested in this in that it seems like it's somehow... well... "scraping" without digging through the actual html? Or is that not right? I have to do all kinds of dumb stuff to the raw html, whereas this seems like you're able to just tell it, "Get td[0] and store it as =variable= for all the tr's."

It's pretty slick. But... maybe the method itself is actually digging through the raw html and collecting stuff that way. Not sure.

Yeah, lxml processes all the html into a tree and gives you an API so you can access it as you like. It takes a lot of the grunt work out of extracting data from HTML.
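lxml's interface mirrors the standard library's xml.etree.ElementTree, so the "get td[0] for every tr" step really is a one-liner; here is the same idea using the stdlib version on a well-formed fragment (the table content is invented for illustration):

```python
import xml.etree.ElementTree as ET

html = """
<table>
  <tr><td>Post A</td><td>42</td></tr>
  <tr><td>Post B</td><td>7</td></tr>
</table>
"""

# Parse the markup into a tree, then, for every <tr>,
# take the text of its first <td>.
root = ET.fromstring(html)
titles = [tr.find("td").text for tr in root.iter("tr")]
print(titles)  # -> ['Post A', 'Post B']
```

Real scraped pages are rarely well-formed XML, which is where lxml's forgiving HTML parser (lxml.html) earns its keep over the stdlib module.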

Which is awesome, as I just felt the pain of hand pruning a heckuva lot of html tags out of something I wanted to transform to a different format. Even with my find-replacing, line breaks would prevent the tag from getting detected fully and I had to do a lot of tedious stuff :)
