Determining why that server is on fire

CPU profiling node.js processes live in production

Friday, Dec 9th, 2016

Mixmax is a communications platform that brings professional communication & email into the 21st century.

This blog post is part of the Mixmax 2016 Advent Calendar. The previous post on December 8th was about the Gmail PubSub API.

tl;dr - We run our node apps so that we can profile their CPU usage on the fly.

The problem

Have you ever watched the CPU usage of your Node.js process spike on your servers? Did it feel like you were helpless to wonder at the reason why?

Servers on fire

One way to figure out what your process is doing is to use a flamegraph—a visualization of the kinds of work its doing (think stack traces), and how long each component of that work is taking (in terms of CPU usage).

There is already a bunch of amazing material on how to build flamegraphs. However, each example is given only in terms of profiling a process locally (once the user has SSH'd into a server)—there isn't a lot of tooling for profiling servers remotely. Today we're going to give you a pair of scripts that allow you to do just that given that you're running on AWS. These scripts also handle some advanced deployment scenarios and mitigate some bugs reported by other material on flamegraphs.

Too many processes

The first issue we faced when building these scripts was having too many processes to profile: we use Node clustering to get the most out of our instances, however there are very few examples of profiling multiple Node processes at the same time in order to build flamegraphs.

Endless perf map growth

To compound the issue, the JIT symbol map required to build flame graphs grows endlessly.

The solution

As it turns out, you can pass multiple process IDs to the -p flag of perf record as long as they are comma separated. To mitigate the symbol map growth issue, we simply run all of our node processes with the --perf_basic_prof_only_functions flag. It does mean that we do get some unknown entries in our flamegraphs, but it also ensures that we won't run out of memory on our servers any time soon #tradeoffs.

Profiling Node apps

To simplify creating these flamegraphs, we wrote a few bash scripts that build the flamegraphs remotely on server of interest and then extract them back to the engineer's machine.

profile.sh

The first script builds the second, by embedding how long we'd like to profile for and if we'd like to profile all node processes on the box or only a single one. That script looks like:

perf.sh

The second script is a template that is run on the remote instance. The profile.sh script places any duration and PID specific options in this script before scping it to the remote EC2 instance. perf.sh downloads all tools necessary to build the flamegraph, profiles the process(es) of interest and then builds the flamegraph.

How big is the fire?

Coincidentally enough, the night before I sat down to write this blog post we had an incident in one of our services. This service in question suddenly began to pin all of its CPUs. No code had been deployed to it recently that could have caused the issue… Neither was there any unusual load on our system… So what could it be? The engineers on call used our profiling script to sample an instance exhibiting the pinned CPU behavior and a healthy one to compare.

Unhealthy server

Unhealthy Server

Healthy server

Healthy Server

Putting out flames

Comparing the two graphs above immediately showed us that the unhealthy server was spending an obscene amount of time trying parsing HTML. This information allowed them to identify a few culprit data pieces that were causing this behavior. They were then able to build a fix and hot-patch the bad data to resolve the issue.

Notes

You may note in the SVGs above that there are many entries of the form: /tmp/perf-$PID.map. This is because we use --perf_basic_prof_only_functions which only outputs the symbols for JS functions and not all symbols (which the --perf-basic-prof flag outputs). We haven't found this to be a large issue in practice however as we still get enough information out of the flamegraphs for our purposes.

Enjoy building greater introspection into infrastructure? Drop us a line.