Over the past few months at Mixmax, we’ve been changing the way we track internal metrics. In this series of blog posts, we’re outlining our journey to metrics bliss. In this post, part 1, we’ll explain why we switched away from CloudWatch. In part 2, we’ll describe the architecture of our Graphite cluster, and in part 3 we’ll dive into how we deploy and maintain that cluster with Terraform.
As we’ve written before, up until this point we used AWS CloudWatch as our primary method for tracking and alerting on metrics across all our production systems. However, in the last six months we began running into limitations of CloudWatch, which led us to explore other solutions that could scale more gracefully with our expanding needs. The two primary limitations we ran into were:
- CloudWatch does not easily support high-cardinality metrics. While CloudWatch supports “dimensions” for searching and graphing individual metrics under a single metric name, it does not easily support aggregations across multiple dimensions. For example, you can display the metrics for five dimensions of a single metric name on the same graph, but you cannot show the aggregate of all five (see the sketch after this list).
- CloudWatch has request, payload, and dimension limits that materially affected our implementation choices. These limits meant that each engineer on the team had to consider whether a new feature or piece of infrastructure would publish too many metrics, or metrics with too many dimensions, to remain useful.
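To make the dimension model concrete, here’s a minimal sketch of publishing a dimensioned data point with the AWS SDK for node.js. The namespace, metric, and dimension names are hypothetical, not our production names:

```js
// Hypothetical example: publishing one data point with dimensions to CloudWatch.
const AWS = require('aws-sdk');
const cloudwatch = new AWS.CloudWatch({ region: 'us-east-1' });

cloudwatch.putMetricData({
  Namespace: 'ExampleApp',
  MetricData: [{
    MetricName: 'EmailsSent',
    Dimensions: [
      { Name: 'Service', Value: 'send-worker' },
      { Name: 'Region', Value: 'us-east-1' }
    ],
    Unit: 'Count',
    Value: 1
  }]
}, (err) => {
  if (err) console.error('Failed to publish metric', err);
});
```

CloudWatch treats each unique combination of metric name and dimension values as a separate metric, so a query like “EmailsSent summed across every Service” requires either publishing a second, undimensioned copy of the data point or graphing each combination individually, which is exactly the aggregation gap described above.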
Combined, these limitations raised the barrier to publishing detailed metrics for new features and forced engineers on our team to think twice before instrumenting new features with metrics. Practically, it meant that we had fewer metrics on important systems than we’d like. Feeling that our tooling choices should empower engineers rather than limit them, we decided to explore options other than CloudWatch for many of our internal metrics.
To direct our exploration, we outlined a set of requirements for our new metrics tool: the things we value most in metrics tooling. First, we knew the new solution must:
- Support a high (and increasing) volume of requests without client-side downsampling.
- Allow aggregating metrics on multiple dimensions.
- Allow alerting on graphed metrics.
In essence, we needed a scalable solution for aggregating high-cardinality metrics and alerting on the results. We also considered a few other attributes that were important, but not absolutely required. We preferred solutions that:
- Require minimal maintenance.
- Are easily scalable with our existing tooling.
- Have an easy-to-use node.js client.
- Are backed by fault-tolerant storage.
- Could be integrated incrementally.
It’s worth keeping in mind that we could tolerate some risk of dropping metrics, and that we didn’t need to store or query full, plaintext payloads. We also ruled out most managed solutions, since they offered more functionality than we needed from this metrics tool and required higher-impact changes to our existing tooling. As a result, we considered three self-hosted solutions that satisfied our three requirements:
Prometheus is a time-series metric collection and aggregation server. It offered some interesting options for monitoring remote processes, but would have required restructuring our backend services to adhere to its polling architecture. We also ruled it out because it lacked a durable storage mechanism of its own and would have required Graphite or InfluxDB to back its metrics storage.
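For context on what that polling architecture implies, here is a minimal sketch (assuming the prom-client and express packages; the endpoint, port, and metric name are illustrative) of how a node.js process exposes metrics for Prometheus to scrape, rather than pushing them out itself:

```js
// Sketch of Prometheus's pull model: the process exposes an HTTP endpoint
// that a Prometheus server scrapes on a schedule.
const express = require('express');
const client = require('prom-client');

// Hypothetical counter; Prometheus reads its current value on each scrape.
const requestCounter = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests handled by this process'
});

const app = express();

app.get('/', (req, res) => {
  requestCounter.inc();
  res.send('ok');
});

// The scrape endpoint every service would need to expose.
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3000);
```

Adopting this model would have meant adding a scrape endpoint to (and opening a scrape path into) every backend service, instead of keeping our existing push-style instrumentation.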
InfluxDB is a time-series database that’s part of a larger stack for ingesting, storing, and alerting on time-series data. We ruled it out because the open-source version didn’t have a solution for scaling or high availability, and it wasn’t well supported in the node.js ecosystem.
Graphite is a time-series database system for ingesting and storing metrics. We ultimately decided on Graphite because it best satisfies our five preferred attributes: it’s flexible, it can durably store data and scale horizontally, and it’s well supported in the node.js ecosystem. Though it required a bit more initial setup, Graphite has well-tested, scalable, open-source implementations that don’t require constant maintenance. Additionally, since Graphite itself is just an API for storing and retrieving data, we gained the flexibility to swap out implementations or data stores as necessary.
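To illustrate why node.js support is so broad: Carbon, Graphite’s ingestion service, accepts a simple plaintext protocol, one `metric.path value timestamp` line per data point over TCP (port 2003 by default). Here’s a minimal node.js sketch, with a hypothetical host and metric path:

```js
// Sketch: sending a single data point to Graphite's Carbon plaintext listener.
// The host and metric path are hypothetical; Carbon's plaintext port defaults to 2003.
const net = require('net');

function sendMetric(path, value, host = 'graphite.internal', port = 2003) {
  const timestamp = Math.floor(Date.now() / 1000); // Graphite expects Unix seconds.
  const line = `${path} ${value} ${timestamp}\n`;
  const socket = net.createConnection(port, host, () => {
    socket.end(line); // Write the data point and close the connection.
  });
  socket.on('error', (err) => console.error('Failed to send metric', err));
}

sendMetric('example.send_worker.emails_sent', 1);
```

In practice you’d typically put a StatsD-compatible aggregator in front of Carbon and use one of the many node.js StatsD clients rather than writing raw lines, which is part of what makes the node.js ecosystem support so strong.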
Continue with part 2, where we go over our clustered Graphite architecture, which handles hundreds of millions of data points per day.
Interested in working on a data-driven team? Join us!