LessWrong downtime 2010-05-11, and other recent outages and instability

matt

Incident report and hosting update

In the leadup to 2010-05-11 we (Tricycle) were unhappy with repeated short downtime incidents on the Less Wrong (LW) server (serpent). The apparent cause was the paster process hanging during heavy IO. We had scripted an automatic restart of the process when this problem was detected, but each incident caused up to a minute of downtime and it was obvious that we needed a proper solution. We concluded that IO on serpent was abnormally slow, and that the physical machine at Slicehost that serpent ran on had IO problems (Slicehost was unable to confirm our diagnosis). We requested migration to a new physical machine.
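For readers wondering what a scripted restart of this kind might look like, here is a minimal Python sketch. It is not our actual script: the retry count, the health-check URL you would pass in, and the restart command are all hypothetical, chosen only to illustrate the idea of "probe the site, restart the process if it stops answering".

```python
import http.client
import subprocess
import urllib.request


def is_responsive(url, timeout=10.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers timeouts, refused connections, HTTP error statuses
        # and malformed responses from a hung process.
        return False


def restart_paster():
    """Restart the application server.

    Hypothetical command: the report does not say how paster was
    supervised, so this path is an assumption.
    """
    subprocess.run(["/etc/init.d/paster", "restart"], check=False)


def watchdog_step(check, restart, retries=2):
    """Call restart() only if `retries` consecutive checks all fail.

    Returns True if a restart was triggered, False otherwise.
    """
    for _ in range(retries):
        if check():
            return False
    restart()
    return True
```

A cron entry could call watchdog_step once a minute with a check built from is_responsive and restart_paster as the restart action. Requiring several consecutive failures avoids restarting on a single slow response, at the cost of slightly slower detection.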

Error 1: We requested this migration at the end of our working day, and didn't nurse the migration through.

After the migration LW booted properly, but quickly became unstable. Since we didn’t nurse the migration through, we failed to notice this ourselves. Our website monitoring system (nagios) should have notified us of the failure, but it too failed. We have a website monitoring system monitoring system (who watches the watchers? this system does - it is itself watched by nagios).

Error 2: Our website monitoring system monitoring system (a cron job running on a separate machine) was only capable of reporting nagios failures by email. It "succeeded" in so far as it sent an email to our sysadmin notifying him that nagios was failing. It clearly failed in that it failed to actually notify a human in reasonable time (our sysadmin very reasonably doesn’t check his email during meals).

serpent continued to be unstable through our next morning as we worked on diagnosing and fixing the problem. IO performance did not improve on a new physical server.

On 2010-05-17 we migrated the system again, this time to an AWS server, and saw significant speed and general stability improvements.

Error 3: The new AWS server didn’t include one of the python dependencies the signup captcha relies on. We didn’t notice. Until davidjr raised an issue in the tracker (#207), which notified us, no-one was able to sign up.

What we have achieved:

LW is now significantly faster and more responsive. It also has much more headroom on its server - even large load spikes should not reduce performance.

What has been done to prevent recurrence of errors:

Error 1: Human error. We won’t do that again. Generally “don’t do that again” isn’t a very good systems improvement… but we really should have known better.

Error 2: We improved our monitoring system monitoring system the morning after it failed to notify us so that it now attempts to restart nagios itself, and sends SMS notifications and emails to two of us if it fails.
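The escalation described for Error 2 (check nagios, try a restart, alert several channels only if the restart doesn't help) can be sketched roughly as follows. This is an illustration under assumptions, not the real cron job: the process name passed to pgrep is a guess, and the notifier callables stand in for whatever SMS gateway and mail setup is actually used.

```python
import subprocess


def nagios_is_running():
    """Return True if a process named exactly 'nagios' exists.

    Assumption: the process name may differ on a real install
    (for example 'nagios3'), and pgrep must be available.
    """
    result = subprocess.run(["pgrep", "-x", "nagios"],
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0


def watch_the_watcher(is_running, restart, notifiers):
    """Check the monitor, try one restart, then alert every channel.

    Returns "ok", "restarted" or "alerted".
    """
    if is_running():
        return "ok"
    restart()
    if is_running():
        return "restarted"
    for notify in notifiers:
        try:
            notify("nagios is down and could not be restarted")
        except Exception:
            # One broken channel must not stop the remaining ones.
            continue
    return "alerted"
```

Passing the notifiers in as a list (say, one SMS sender and one email sender per person) means a failure of any single channel, which is what bit us with email-only alerts, does not silence the others.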

Error 3: We’re in the process of building a manual deploy checklist to check for this failure and other failures we think plausible. We generally prefer automated testing, but development on this project is not currently active enough to justify the investment. We’ll add an active reminder to run that checklist to our deploy script (we’ll have to answer “yes, I have run the checklist” or something similar in the deploy script).
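As a rough idea of what the reminder in the deploy script could look like, here is a small Python sketch. Everything in it is an assumption made for illustration: the exact confirmation phrase, the function names, and the idea of also checking that required top-level Python modules are importable (which would have caught the missing captcha dependency). The module list is left to the caller, since this report does not name the dependency.

```python
import importlib.util

# Hypothetical phrase; the report only says "or something similar".
CONFIRMATION = "yes, I have run the checklist"


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names
            if importlib.util.find_spec(name) is None]


def checklist_confirmed(answer):
    """Accept only the full confirmation phrase, ignoring case and padding."""
    return answer.strip().lower() == CONFIRMATION.lower()


def pre_deploy_gate(required_modules, ask=input):
    """Return True only if dependencies are present and the operator confirms."""
    missing = missing_modules(required_modules)
    if missing:
        print("Missing Python modules: " + ", ".join(missing))
        return False
    return checklist_confirmed(ask('Type "' + CONFIRMATION + '" to continue: '))
```

Requiring the whole phrase rather than a bare "y" is deliberate: it is harder to type on autopilot. The deploy script would abort whenever pre_deploy_gate returns False.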

ETA 2010-06-02:

Clearly still some problems. We're working on them.

ETA 2010-06-09:

New deployment through an AWS elastic load balancer. We expect this to be substantially more stable, and after DNS propagates, faster.

Three cheers for Tricycle!

How many times does this traditionally iterate?

Carl:

Three cheers for Tricycle!

I can't ever quite remember if the "hips" count as cheers or not.

Nope, only the "Hooray"s.

We improved our monitoring system monitoring system the morning after it failed to notify us

We apologise again for the fault in the subtitles. Those responsible for sacking the people who have just been sacked have been sacked.

What are the current pain points in serving Less Wrong? Reddit's made some significant performance improvements since the code was forked, including implementing Markdown in C rather than Python, and using Cassandra for in-memory caching -- would it be worth it to look through these changes and see if any make sense to apply to Less Wrong?

ETA: For example, I could probably isolate and apply the Markdown changes if it was likely to be useful.

The largest pain point has been instability in the paster process. The new deployment (live as of a couple of hours ago, 2010-06-09) should roll out a new application server if that happens again.

Processor load has not been a problem, so improvements to the efficiency of the Markdown parser will have minimal impact unless traffic grows a lot.

(Thanks for your offer of assistance, and sorry about the late reply.)

A side-effect of this deployment worth noting: the integrated Anti-Kibitz is now live.

I'm not sure to what extent that was intentional and readers should expect the feature to stay for good - be advised that it has some issues (incompatible with IE; some minor browser compatibility bugs).

On my copy of Safari - Version 4.0.5 (5531.22.7), Mac OS X Version 10.5.8 Build 9L30 - it has the amusing feature of loading windows with names visible rather than invisible.

Edit: Oh, and the formatting got a little amusing when I posted a reply with names hidden.

My apologies for the version which ended up being deployed, which is well below the level of quality I normally aspire to.

I've just committed a major revision to the AK script which I'd design-sketched a while ago but had shelved pending a better handle on how to write unit tests for LW.

The newer version should be compatible with IE, more responsive than previous versions, and should allow more fine-tuning in future versions.

However I can't offer any guarantees as to when Tricycle are likely to pull the changes and deploy them to the live server. I don't know yet how best to coordinate with the Tricycle crew when working on the LW codebase, and the current situation was also a consequence of that.

Definitely needs more testing and tweaking: as I said above I wasn't expecting these changes to get pulled in.

I've already committed a change that corrects the second issue. I've just reproduced the first on my local Safari and am looking into it.

Morendil, your latest patch has been applied and deployed. We took the liberty of inverting the preference, for two reasons. Firstly, it made more sense for the checkbox to be checked when the kibitzer was enabled. Secondly, we added a note that full compatibility requires Firefox for now, and its text was hard to phrase with the uncheck-to-enable behaviour.

Great, thanks!

I note that the revision you've pulled is 0.5, which should be a lot more cross-browser than previous ones. I've done some testing under IE, Chrome and Safari.

The original relied on XPath queries to locate page elements corresponding to scores and names. Now that the script is part of the LW codebase, I was able to simplify the script's operation by including a special CSS stylesheet when the Anti-Kibitzer is active, and acting directly on the display rules. This is faster and more compatible.