Tuesday, October 5, 2010

Why so much downtime lately?

If you've been using OneBusAway much in the last two or three weeks, you've probably noticed a lot more connection errors, Fail Bus sightings, and general downtime. I don't want to waste your time with a ton of excuses... but here they are ; ) The general problem is a combination of:

1) An ongoing server upgrade and flakiness in the backup server.
2) More users than we've ever had before, especially a new bump when UW came back in session.

I know #2 is a problem any website would like to have, and our traffic is a drop in the bucket compared to what some of you engineers out there deal with on a daily basis. That said, we're getting past the point where a single machine can reasonably handle the load. The new server that should help with the traffic is ready and waiting to be put into action, but I'm unfortunately in NYC for the week, so there may be some more bumps this week.

For the more technically oriented among you who have been curious after I posted a plea for help, here's my theory about what's going on:

1) Terracotta, which I'm using to share session / state information between multiple Tomcat instances, is crashing after a segfault in the JVM. I'm not sure what's causing the segfault, but the last time I had JVM segfaults, it was due to bad memory in the machine. It's something I will check when I'm back in Seattle.
2) The Terracotta crash causes my Tomcat instances to hang as well. In some cases, the Tomcat instance seemed to spin up a bunch of threads in response to the Terracotta crash, which ate up the non-heap memory available to my JVM instances and led to the thread creation error messages I posted earlier.
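
For anyone who wants to watch for this kind of thread pile-up in their own JVM, here is a minimal sketch using the standard java.lang.management beans. It is not what OneBusAway actually runs; the class name ThreadWatch and the 500-thread limit are made up for illustration. Note also that the non-heap figure the JVM reports covers areas like the code cache and class metadata, not native thread stacks, so the live thread count is probably the more telling number here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.lang.management.ThreadMXBean;

public class ThreadWatch {
    // Hypothetical limit, chosen only for illustration; tune per deployment.
    static final int MAX_EXPECTED_THREADS = 500;

    /** Returns true when the live thread count is strictly above the limit. */
    static boolean tooManyThreads(int liveThreads, int limit) {
        return liveThreads > limit;
    }

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        MemoryUsage nonHeap =
                ManagementFactory.getMemoryMXBean().getNonHeapMemoryUsage();

        int live = threads.getThreadCount();
        System.out.println("live threads: " + live
                + ", peak: " + threads.getPeakThreadCount());
        System.out.println("non-heap used (bytes): " + nonHeap.getUsed());

        if (tooManyThreads(live, MAX_EXPECTED_THREADS)) {
            System.out.println("WARNING: thread count above "
                    + MAX_EXPECTED_THREADS);
        }
    }
}
```

Running something like this on a schedule (or exposing the same numbers over JMX) would at least show whether the thread count climbs right after a Terracotta crash.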

1 comment:

Leon said...

Thanks for this, I was wondering what was going on.