Determining why that server is on fire
CPU profiling node.js processes live in production
Friday, Dec 9th, 2016
Mixmax is a communications platform that brings professional communication & email into the 21st century.
tl;dr - We run our node apps so that we can profile their CPU usage on the fly.
Have you ever watched the CPU usage of your Node.js process spike on your servers? Did it feel like you were helpless to wonder at the reason why?
One way to figure out what your process is doing is to use a flamegraph—a visualization of the kinds of work its doing (think stack traces), and how long each component of that work is taking (in terms of CPU usage).
There is already a bunch of amazing material on how to build flamegraphs. However, each example is given only in terms of profiling a process locally (once the user has SSH'd into a server)—there isn't a lot of tooling for profiling servers remotely. Today we're going to give you a pair of scripts that allow you to do just that given that you're running on AWS. These scripts also handle some advanced deployment scenarios and mitigate some bugs reported by other material on flamegraphs.
Too many processes
The first issue we faced when building these scripts was having too many processes to profile: we use Node clustering to get the most out of our instances, however there are very few examples of profiling multiple Node processes at the same time in order to build flamegraphs.
Endless perf map growth
As it turns out, you can pass multiple process IDs to the
-p flag of
perf record as long as they are comma separated. To mitigate the symbol map
growth issue, we simply run all of our node processes with the
flag. It does mean that we do get some unknown entries in our flamegraphs, but
it also ensures that we won't run out of memory on our servers any time soon
Profiling Node apps
To simplify creating these flamegraphs, we wrote a few bash scripts that build the flamegraphs remotely on server of interest and then extract them back to the engineer's machine.
The first script builds the second, by embedding how long we'd like to profile for and if we'd like to profile all node processes on the box or only a single one. That script looks like:
The second script is a template that is run on the remote instance. The
profile.sh script places any duration and PID specific options in this script
scping it to the remote EC2 instance.
all tools necessary to build the flamegraph, profiles the process(es) of
interest and then builds the flamegraph.
How big is the fire?
Coincidentally enough, the night before I sat down to write this blog post we had an incident in one of our services. This service in question suddenly began to pin all of its CPUs. No code had been deployed to it recently that could have caused the issue… Neither was there any unusual load on our system… So what could it be? The engineers on call used our profiling script to sample an instance exhibiting the pinned CPU behavior and a healthy one to compare.
Putting out flames
Comparing the two graphs above immediately showed us that the unhealthy server was spending an obscene amount of time trying parsing HTML. This information allowed them to identify a few culprit data pieces that were causing this behavior. They were then able to build a fix and hot-patch the bad data to resolve the issue.
You may note in the SVGs above that there are many entries of the form:
/tmp/perf-$PID.map. This is because we use
which only outputs the symbols for JS functions and not all symbols (which the
--perf-basic-prof flag outputs). We haven't found this to be a large issue in
practice however as we still get enough information out of the flamegraphs for
Enjoy building greater introspection into infrastructure? Drop us a line.