Determining why that server is on fire

This blog post is part of the Mixmax 2016 Advent Calendar. The previous post, on
December 8th, was about the Gmail PubSub API.

tl;dr – We run our node apps so that we can profile their CPU usage on the fly.

The problem

Have you ever watched the CPU usage of your Node.js process spike on your servers?
Did you feel helpless to do anything but wonder why?

[Image: servers on fire]

One way to figure out what your process is doing is to use a flamegraph: a
visualization of the kinds of work it's doing (think stack traces) and how long
each component of that work is taking (in terms of CPU usage).

There is already a bunch of amazing material
on how to build flamegraphs. However, each example only covers profiling a
process locally (once the user has SSH'd into a server); there isn't a lot of
tooling for profiling servers remotely. Today we're going to give you a pair of
scripts that let you do just that, provided you're running on AWS. These scripts
also handle some more advanced deployment scenarios and work around a few of the
bugs reported in other material on flamegraphs.

Too many processes

The first issue we faced when building these scripts was having too many processes
to profile: we use Node clustering
to get the most out of our instances, but there are very few examples of profiling
multiple Node processes at the same time in order to build flamegraphs.
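
For a sense of scale, a clustered service typically runs a master process plus one
worker per core, so even a single app shows up as several node processes on the box.
The listing below is purely illustrative (PIDs and paths are made up):

# List every node process along with its command line; with clustering you see
# one master plus one worker per CPU. (PIDs and paths here are illustrative.)
pgrep -a node
# 3101 /usr/bin/node /srv/app/index.js    <- cluster master
# 3110 /usr/bin/node /srv/app/index.js    <- worker
# 3111 /usr/bin/node /srv/app/index.js    <- worker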

Endless perf map growth

To compound the issue, the JIT symbol map
required to build flame graphs grows endlessly.

The solution

As it turns out, you can pass multiple process IDs to the -p flag of
perf record as long as they are comma-separated. To mitigate the symbol map
growth issue, we simply run all of our node processes with the --perf_basic_prof_only_functions
flag. This does mean we get some unknown entries in our flamegraphs, but
it also ensures that we won't run out of memory on our servers any time soon
#tradeoffs.
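
Concretely, the two pieces fit together something like the sketch below (server.js
is just a stand-in for your app's entry point):

# Start the app so V8 writes only JS function symbols to /tmp/perf-<pid>.map,
# which keeps the symbol map from growing without bound. (server.js is a placeholder.)
node --perf_basic_prof_only_functions server.js &

# perf record takes a comma-separated PID list, so one invocation can sample
# the cluster master and every worker at the same time.
PIDS=$(pgrep node | xargs | sed -e "s/ /,/g")
sudo perf record -F 99 -g -p "$PIDS" -- sleep 30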

Profiling Node apps

To simplify creating these flamegraphs, we wrote a few bash scripts that build
the flamegraphs remotely on the server of interest and then copy them back to
the engineer's machine.

profile.sh

The first script builds the second by embedding how long we'd like to profile
for and whether we'd like to profile all node processes on the box or only a
single one. That script looks like:

#!/bin/bash
#
# See: https://mixmax.com/blog/determining-why-that-server-is-on-fire for details.
#
# Example usage: ./profile.sh $INSTANCE_ID [$DURATION] [$PID]
# Where:
# - $INSTANCE_ID is the EC2 identifier of the instance to profile the node
#   processes on.
# - $DURATION is an optional duration specified in seconds (defaults to 30).
# - $PID is a specific process ID to profile; by default we profile all `node`
#   processes on the remote instance.
#
# It is also assumed that you have set up your AWS credentials so that the
# `aws` cli functions correctly.
#
# This script currently assumes that you are profiling EC2 instances inside
# a private VPC and that you have set up a bastion tier with SSH forwarding
# appropriately. If this is not the case, simply change
# s/PrivateDnsName/PublicDnsName/.
#
##########
set -e

INSTANCE=$1
DURATION=${2:-30}
PID_MATCHER="echo $3"
if [ -z "${3+x}" ]; then
  PID_MATCHER='pgrep node | xargs | sed -e "s/ /,/g"'
fi

echo "building perf script..."
sed -e "s/DURATION/$DURATION/g" perf.sh > perf.tmp.sh
# Use a sed delimiter that can't appear in $PID_MATCHER (which contains slashes).
sed -i -e "s@PID_MATCHER@$PID_MATCHER@g" perf.tmp.sh

IP=$(aws ec2 describe-instances --instance-ids "$INSTANCE" --query 'Reservations[*].Instances[*].PrivateDnsName' --output text)

# Collect a sample from the remote node
echo "running the debug script on the remote node: $INSTANCE"
ssh ec2-user@"$IP" "bash -s" < perf.tmp.sh

# Cleanup after ourselves
rm -f perf.tmp.sh perf.tmp.sh-e

# Retrieve the svg from the remote node
echo "copying the flamegraph back from the remote node..."
scp ec2-user@"$IP":/home/ec2-user/flame.svg .

# Open the image in the browser
echo "opening the flamegraph in Chrome..."
open flame.svg -a "Google Chrome"


perf.sh

The second script is a template that is run on the remote instance. The
profile.sh script substitutes the duration- and PID-specific options into this
template before sending it to the remote EC2 instance. perf.sh downloads
all the tools necessary to build the flamegraph, profiles the process(es) of
interest, and then builds the flamegraph.

#!/bin/bash
#
# See: https://mixmax.com/blog/determining-why-that-server-is-on-fire for usage.
#
###########
echo "installing perf..."
sudo yum install -y perf

# perf (running as root) will ignore symbol maps owned by another user, so
# take ownership of them first.
echo "fixing file perms..."
sudo chown root /tmp/perf-*.map

DUR=DURATION          # this is replaced by profile.sh
PIDS=$(PID_MATCHER)   # this is replaced by profile.sh

echo "collecting info for $DUR seconds..."
echo "collecting info on $PIDS"
sudo perf record -F 99 -o perf.data -p "$PIDS" -g -- sleep "$DUR"

echo "running perf..."
sudo perf script -i perf.data > out.nodestacks01

echo "cloning flamegraph..."
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph

echo "creating flamegraph..."
./stackcollapse-perf.pl < ../out.nodestacks01 | ./flamegraph.pl > ../flame.svg


How big is the fire?

Coincidentally enough, the night before I sat down to write this blog post we
had an incident in one of our services. The service in question suddenly
began to pin all of its CPUs. No code that could have caused the issue had been
deployed recently, and there was no unusual load on our system either.
So what could it be? The engineers on call used our profiling script to sample
both an instance exhibiting the pinned-CPU behavior and a healthy one for comparison.
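
Invoking the script against the suspect instance and a healthy one looks something
like this (the instance IDs and the 60-second duration are placeholders):

# Profile every node process on the suspect instance for 60 seconds; the
# resulting flamegraph is copied back and opened locally.
./profile.sh i-0123456789abcdef0 60

# Repeat for a healthy instance from the same service to get a baseline.
./profile.sh i-0fedcba9876543210 60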

[Flamegraph: unhealthy server]

[Flamegraph: healthy server]

Putting out flames

Comparing the two graphs above immediately showed us that the unhealthy server
was spending an obscene amount of time parsing HTML. This information
allowed the on-call engineers to identify a few pieces of culprit data that were
causing the behavior. They were then able to build a fix and hot-patch the bad
data to resolve the issue.

Notes

You may notice in the SVGs above that there are many entries of the form
/tmp/perf-$PID.map. This is because we use --perf_basic_prof_only_functions,
which only outputs the symbols for JS functions rather than all symbols (which the
--perf-basic-prof flag outputs). We haven't found this to be a large issue in
practice, however, as we still get enough information out of the flamegraphs for
our purposes.
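
To see the tradeoff for yourself, a rough comparison of the two flags looks like
this (app.js stands in for your own entry point):

# Full symbol output: every piece of JIT-compiled code is logged, so the map
# file keeps growing for as long as the process lives.
node --perf-basic-prof app.js &

# Function-only output: a much smaller map, at the cost of some frames showing
# up as /tmp/perf-$PID.map in the flamegraph.
node --perf_basic_prof_only_functions app.js &

# Both variants write their symbol maps where perf expects to find them:
ls -lh /tmp/perf-*.map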

Enjoy building greater introspection into infrastructure? Drop us a line.

Written by Trey Tacon