How and why to make your own monitoring
When you’re using a platform-as-a-service, it’s tempting to draw a boundary between
your application and that platform—to look at your hosting provider’s status page and
assume that they’ve got their end all under control. But what do you do when your app goes
down and their status page still shows green?
Yesterday morning, we learned the limits of third-party monitoring the hard way. When we started
to see errors from our email-sending service, the first thing we did was to check the status page of
Compose, our database-as-a-service provider. Our application
uses a Redis-based job queue to communicate with the send service, so if our Compose Redis
deployment was down, that could have accounted for the problems we were seeing.
But Compose showed all systems green, prompting us to dig through our code in increasing
desperation. First, some emails failed to send; then, they all did. We immediately rerouted users to
old Gmail so they wouldn’t experience a disruption of their workflow and requeued the emails
so they’d send later. Then we dug in: if the problem wasn’t our hosting
provider, it had to be our app… but how?
Only when we tried connecting to Redis one more time did we find that it had gone down
completely—despite the status page. The problem wasn’t our app at all!
This was encouraging, but we were still down. And above all, why hadn’t we realized
sooner that our app was having difficulty connecting to Redis?
As we traced the timeline of the outage, we saw that Redis hadn’t gone down all at once: the
connection failure was at first intermittent. Surely our monitoring system,
Pingdom, should have detected even an intermittent outage,
given that it polls the application once a minute? But of course… we had never told Pingdom
to check Redis.
All we had instructed Pingdom to do was to check for an HTTP 200 when it polled our default route.
But the app shouldn’t have been signaling that it was healthy if it couldn’t connect to
Redis; we needed the monitored route to check whether Redis was responsive. And once we made that
change, we figured we might add some other checks, like whether Redis stayed within the expected
memory threshold. We even decided to bundle these checks up and publish them to NPM as redis-status.
We decided it would be clearer to add a new route to our service, exclusively for running these
sorts of self-checks. Here’s how that “health” route looks now:
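A minimal sketch of a route along these lines, written against Node's built-in `http` and `net` modules, and not the production code. Everything named here (`pingRedis`, `healthResponse`, `createHealthServer`, the `/health` path, the default host and port, and the choice of 503 for failure) is an assumption made for illustration. The sketch also assumes a Redis instance that accepts unauthenticated connections, and it does not use the redis-status module.

```javascript
const http = require('http');
const net = require('net');

// Hypothetical helper: send an inline PING to Redis over a raw TCP socket.
// Resolves true only if the reply starts with +PONG; any error, timeout,
// or unexpected reply (for example an auth error) resolves false.
function pingRedis({ host = '127.0.0.1', port = 6379, timeoutMs = 1000 } = {}) {
  return new Promise((resolve) => {
    const socket = net.createConnection({ host, port });
    let settled = false;
    const finish = (ok) => {
      if (settled) return;
      settled = true;
      socket.destroy();
      resolve(ok);
    };
    socket.setTimeout(timeoutMs, () => finish(false));
    socket.on('error', () => finish(false));
    socket.on('close', () => finish(false));
    socket.on('connect', () => socket.write('PING\r\n'));
    socket.on('data', (chunk) => finish(chunk.toString().startsWith('+PONG')));
  });
}

// Turn the Redis check result into a status code and JSON body.
// 503 is an assumed choice; any non-200 code lets the monitor raise an alert.
function healthResponse(redisOk) {
  return {
    statusCode: redisOk ? 200 : 503,
    body: JSON.stringify({ redis: redisOk ? 'ok' : 'unreachable' }),
  };
}

// Build an HTTP server whose /health route reports on Redis.
// checkRedis is injectable so the route can be exercised without a live Redis.
function createHealthServer(checkRedis = pingRedis) {
  return http.createServer(async (req, res) => {
    if (req.url !== '/health') {
      res.writeHead(404);
      res.end();
      return;
    }
    const { statusCode, body } = healthResponse(await checkRedis());
    res.writeHead(statusCode, { 'Content-Type': 'application/json' });
    res.end(body);
  });
}
```

To try it, call `createHealthServer().listen(3000)` (the port is arbitrary) and point the monitor at `/health` instead of the default route.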
We urge you to follow the same practices: set up a monitoring service such as
Pingdom or Webmon, and don’t settle for the default settings, which just test
whether a route returns an HTTP 200. Make a health route that checks subsystems
and returns a meaningful status.
Unless your project is a static site, there’s a lot more to it “working” than
whether or not the server renders something to the page.
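To make "a meaningful status" concrete, here is a hedged sketch of how several subsystem checks, including the Redis memory threshold mentioned earlier, might be combined into one response. The function names (`parseUsedMemory`, `checkMemory`, `summarize`), the response shape, and the 503 status are hypothetical. The only Redis-specific fact relied on is that the `INFO memory` command returns text containing a `used_memory:<bytes>` line.

```javascript
// Extract used_memory (in bytes) from the text returned by Redis `INFO memory`.
// Returns null if the field is missing.
function parseUsedMemory(infoText) {
  const match = /^used_memory:(\d+)\s*$/m.exec(infoText);
  return match ? Number(match[1]) : null;
}

// Hypothetical check: is Redis memory use within the expected threshold?
function checkMemory(infoText, maxBytes) {
  const used = parseUsedMemory(infoText);
  if (used === null) {
    return { ok: false, detail: 'used_memory not found' };
  }
  return { ok: used <= maxBytes, detail: `${used} of ${maxBytes} bytes` };
}

// Combine named check results into a status code and a JSON-ready body,
// so a failing subsystem is named instead of hidden behind a bare 200 or 500.
function summarize(checks) {
  const failing = Object.keys(checks).filter((name) => !checks[name].ok);
  const healthy = failing.length === 0;
  return {
    statusCode: healthy ? 200 : 503,
    body: { status: healthy ? 'ok' : 'degraded', failing, checks },
  };
}

// Example with canned INFO output and an assumed 512 KiB threshold.
const exampleInfo = '# Memory\r\nused_memory:1048576\r\nused_memory_human:1.00M\r\n';
const exampleReport = summarize({
  redis: { ok: true, detail: 'PONG received' },
  memory: checkMemory(exampleInfo, 512 * 1024),
});
console.log(exampleReport.statusCode, JSON.stringify(exampleReport.body.failing));
```

Returning the list of failing checks in the body means that whoever gets paged can see which subsystem to look at first, rather than starting from the application code.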
This post isn’t meant to condemn the use of database-as-a-service providers (we think
the benefits vastly outweigh the risks) but to illustrate the power of double-checking your
assumptions and collecting as much information as possible. Status pages are great, but you
need to define what makes your application available. A third party can’t ultimately say why
your application is down.
Try our redis-status module and let
us know what you think. And if you’re looking to tackle similar breakneck engineering
challenges and scaling problems, send us a note at email@example.com.
We’re hiring like crazy and would love to grab coffee.