We should have an annual "thank you Trike" day so we can shower them with appreciation when things silently keep going right.
Looks like it's July 27 this year. I'll try to remember to send Matt a personal message telling him to create a discussion post and collect his karma.
Our investigation into last week's LW downtime is complete. The full report is here (Google Docs).
Executive summary:
We failed to update our AWS configuration after changes at Amazon, which caused a cycle of servers being spawned and then killed before they could boot properly. Our automated testing should have notified us of this failure immediately, but it had a predictable failure mode of its own (one we identified last year but did not fix). We became aware of the downtime when I checked my email, and we then worked on it until it was resolved.
I personally feel very bad about our multiple failures leading to this incident.
ref. the last time I did this to you: http://lesswrong.com/lw/29v/lesswrong_downtime_20100511_and_other_recent/
Actions:
Further actions - site speed:
We're unhappy with the site's speed. We plan on spending some time next week doing what we can to improve it.
(If you upvote this post, please downvote my "Karma sink" comment below - I would prefer not to earn karma from an event like this.)