Don’t wait for your PaaS’s status page to tell you why you’re down

How and why to make your own monitoring

Monday, Mar 23rd, 2015

Mixmax is a communications platform that brings professional communication & email into the 21st century.

When you’re using a platform-as-a-service, it’s tempting to draw a boundary between your application and that platform—to look at your hosting provider’s status page and assume that they’ve got their end all under control. But what do you do when your app goes down and their status page still shows green?

Yesterday morning, we learned the limits of third-party monitoring the hard way. When we started to see errors from our email-sending service, the first thing we did was to check the status page of Compose, our database-as-a-service provider. Our application uses a Redis-based job queue to communicate with the send service, so if our Compose Redis deployment was down, that could have accounted for the problems we were seeing.

But Compose showed all systems green, prompting us to dig through our code in increasing desperation. First, some emails failed to send; then, they all did. We immediately rerouted users to old Gmail so their workflow wouldn’t be disrupted, and requeued the emails so they’d send later. Then we dug in: if the problem wasn’t our hosting provider, it must have been our app… but how?

Only when we tried connecting to Redis one more time did we find that it had gone down completely—despite the status page. The problem wasn’t our app at all!

This was encouraging, but we were still down. And above all, why hadn’t we realized sooner that our app was having difficulty connecting to Redis?

As we traced the timeline of the outage, we saw that Redis hadn’t gone down all at once: the connection failures were intermittent at first. Our monitoring system, Pingdom, polls the application once a minute, so surely it should have caught even an intermittent outage. But of course… we had never told Pingdom to check Redis.

All we had instructed Pingdom to do was to check for an HTTP 200 when it polled our default route. But the app shouldn’t have been signaling that it was healthy if it couldn’t connect to Redis; we needed the monitored route to check whether Redis was responsive. And once we made that change, we figured we might add some other checks, like whether Redis stayed within the expected memory threshold. We even decided to bundle these checks up and publish them to npm as redis-status.
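To make those two checks concrete, here is a minimal sketch in plain Node.js. It is not the code of redis-status. It assumes an unauthenticated Redis reachable over TCP and asks it for `INFO memory` on a raw socket, which covers both questions at once: no reply means Redis is unresponsive, and the `used_memory` field in the reply can be compared against a limit. The function names, the one-second timeout, and the 100 MB limit in the usage comment are illustrative assumptions.

```javascript
'use strict';
const net = require('net');

// Extract `used_memory` (in bytes) from the text of an INFO reply.
// Returns null when the field is absent.
function parseUsedMemory(infoText) {
  const match = /^used_memory:(\d+)/m.exec(infoText);
  return match ? Number(match[1]) : null;
}

// Decide whether an INFO reply is healthy relative to a byte limit.
function evaluateRedisInfo(infoText, maxBytes) {
  const used = parseUsedMemory(infoText);
  if (used === null) {
    return { ok: false, reason: 'used_memory missing from INFO reply' };
  }
  if (used > maxBytes) {
    return { ok: false, reason: `used_memory ${used} exceeds limit ${maxBytes}` };
  }
  return { ok: true, usedMemory: used };
}

// Send `INFO memory` to Redis and resolve with the reply text.
// Rejects on connection errors, timeouts, or an unexpected reply type.
function fetchRedisInfo({ host = '127.0.0.1', port = 6379, timeoutMs = 1000 } = {}) {
  return new Promise((resolve, reject) => {
    const socket = net.createConnection({ host, port });
    let buffer = Buffer.alloc(0);

    socket.setTimeout(timeoutMs, () => socket.destroy(new Error('Redis timed out')));
    socket.once('error', reject);
    socket.once('connect', () => socket.write('INFO memory\r\n'));

    socket.on('data', (chunk) => {
      buffer = Buffer.concat([buffer, chunk]);
      const headerEnd = buffer.indexOf('\r\n');
      if (headerEnd === -1) return; // header line not complete yet

      // A successful INFO reply is a bulk string: "$<length>\r\n<data>\r\n".
      const header = buffer.toString('utf8', 0, headerEnd);
      if (header[0] !== '$') {
        socket.destroy(new Error(`Unexpected reply: ${header}`));
        return;
      }
      const length = Number(header.slice(1));
      const bodyStart = headerEnd + 2;
      if (buffer.length < bodyStart + length) return; // wait for the rest

      socket.end();
      resolve(buffer.toString('utf8', bodyStart, bodyStart + length));
    });
  });
}

// Usage (hypothetical limit of 100 MB):
// fetchRedisInfo({ host: '127.0.0.1', port: 6379 })
//   .then((info) => console.log(evaluateRedisInfo(info, 100 * 1024 * 1024)))
//   .catch((err) => console.error('Redis is unreachable:', err.message));
```

Keeping the parsing and the threshold decision separate from the socket code means the decision logic can be exercised without a running Redis.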

We decided it would be clearer to add a new route to our service, exclusively for running these sorts of self-checks. Here’s how that “health” route looks now:
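As a rough sketch of the idea (not the exact route from our service), a health endpoint built only on Node’s `http` module could look like the following. The `/health` path, the shape of the `checks` map, the choice of 503 for a failed check, and the helper names are all assumptions made for illustration; `pingRedis` in the usage comment is a hypothetical function that rejects when Redis does not answer.

```javascript
'use strict';
const http = require('http');

// Turn per-check results into an HTTP status code and a JSON body.
// Any failed check makes the whole response non-200, so a monitor that
// only looks at the status code still notices the failure.
function summarize(results) {
  const healthy = results.every((result) => result.ok);
  return {
    statusCode: healthy ? 200 : 503,
    body: JSON.stringify({ healthy, checks: results }),
  };
}

// Run every check concurrently. `checks` maps a name to a function that
// returns a promise; a check that throws or rejects counts as failed.
function runChecks(checks) {
  return Promise.all(
    Object.entries(checks).map(async ([name, check]) => {
      try {
        await check();
        return { name, ok: true };
      } catch (err) {
        return { name, ok: false, error: err.message };
      }
    })
  );
}

// Build a server whose /health route reports the outcome of the checks.
function createHealthServer(checks) {
  return http.createServer(async (req, res) => {
    if (req.url !== '/health') {
      res.statusCode = 404;
      res.end();
      return;
    }
    const { statusCode, body } = summarize(await runChecks(checks));
    res.writeHead(statusCode, { 'Content-Type': 'application/json' });
    res.end(body);
  });
}

// Usage (pingRedis is hypothetical):
// createHealthServer({ redis: pingRedis }).listen(3000);
```

Returning the per-check results in the body means that whoever is paged can see which subsystem failed instead of only learning that something did.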

We urge you to follow the same sort of practices: set up Pingdom or Webmon. And don’t settle for their default settings, which just test whether your health route gives an HTTP 200—make a health route that checks subsystems and returns a meaningful status. Unless your project is a static site, there’s a lot more to it “working” than whether or not the server renders something to the page.

This post isn’t meant to condemn the use of database-as-a-service providers (we think the benefits vastly outweigh the risks) but to illustrate the power of double-checking your assumptions and collecting as much information as possible. Status pages are great, but you need to define what makes your application available. A third party can’t ultimately say why your application is down.

Try our redis-status module and let us know what you think. And if you’re looking to tackle similar breakneck engineering challenges and scaling problems, send us a note at careers@mixmax.com. We’re hiring like crazy and would love to grab coffee.