Terraforming all the things


Iteratively migrating to “infrastructure as code”

This blog post is part of the
Mixmax 2017 Advent Calendar. The previous post, from December 3rd, was
Handling 3rd-party JavaScript with Rollup.

tl;dr – We use Terraform for almost everything and we’re never looking back.

The problem

Have you ever tried to navigate around the AWS UI to hunt down a configuration
issue? Perhaps someone accidentally clicked the wrong button, and suddenly one of
your high-throughput Elasticache redis deployments is downsizing to a
t2.medium, just because… This isn't your fault or your team's; nobody can be
blamed for being overwhelmed by the sheer amount of UI in the AWS web console.

AWS UI everywhere

I’m not even going to talk about the hours upon hours it took me to reorient myself
with the AWS services dropdown when it was reorganized earlier this year. However,
managing infrastructure is
hard, and if you're managing it through a UI, can you really expect it to get
any easier? Absolutely not. The sheer number of knobs and dials you can turn
when configuring any system is not only overwhelming, it also makes it easy to
miss a confirmation modal – or to not realize there wasn't one. Configuration
mistakes aren't just simple to make; you're always one click away from one you
might never notice (such as making an S3 bucket publicly available).

Is it all hopeless?

Fear not! Configuration can also be done with, well you know, configuration
files. But why stop there? Why not drink a little more of the Kool-Aid and begin
to version your infrastructure? While “Infrastructure as code” can seem
terrifying and daunting to implement, we’re here to tell you that it’s very easy
to incrementally roll out across your infrastructure.

Side note: What is Infrastructure as code?

Infrastructure as code, technically, means configuring infrastructure with an
automated system instead of configuring it manually. So instead of manually
going to the AWS UI and clicking some buttons, or instead of hopping on a server
and fiddling with some config files, you make changes to machine-readable files
that your automated system can then use to apply those changes for you. The
utility of such a system becomes very apparent when a few additional lines of
configuration code can be used to modify your entire server fleet.
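For instance, a minimal sketch of this idea in Terraform's HCL might look like the following (the bucket name and tag here are hypothetical, purely for illustration):

```hcl
# Hypothetical example – this bucket name and tag are made up.
# Running `terraform apply` would create (or update) the bucket to
# match this description, no console clicks required.
resource "aws_s3_bucket" "example-assets" {
  bucket = "example-assets-bucket"
  acl    = "private"

  tags {
    team = "infrastructure"
  }
}
```

Because the change lives in a file, it can be reviewed before it's applied – `terraform plan` shows exactly what would change, which is a far cry from a silent misclick in a console.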

Moving from a manually managed system to a fully automated one can seem daunting
because it can be incredibly difficult to identify how to even begin the
migration process. Not only that, but it can be difficult to find a low stakes
environment in which to begin to test the waters without committing your entire
infrastructure to the new process.

Infrastructure as code: start with the little pieces

At Mixmax, we use many of AWS's services – from Elastic Beanstalk and Elasticache
all the way through CloudWatch and DynamoDB. We knew it wouldn't be feasible to
move our entire world to a versioned configuration system in one fell swoop, so
we wanted a tool that would let us bring our infrastructure under version
control incrementally. For us, Hashicorp's Terraform was a no-brainer: we could
start by using Terraform to manage small deployments of non-application-level
systems before committing to managing our application services with it. To
migrate incrementally, we began by moving the fairly static components of our
infrastructure under Terraform's control. There are many other tools in this
space, but most are primarily application configuration systems that were
retroactively bootstrapped into provisioning tools, whereas Terraform has been
a flexible provisioning tool from the start.

First we moved our CloudWatch alarms and our SNS topics and subscriptions to be
controlled via Terraform. Using Terraform modules
for this was so successful that engineers who previously never wanted to touch
CloudWatch alarms began to create them with glee! We'd turned a painful part of
our development process into something our team now found a joy to work with.
After that success, we decided to try something with higher stakes, and so
we moved our Elasticache redis deployments to be provisioned and managed via
Terraform. Again, using Terraform modules made this a breeze.

Why do Terraform modules make this so simple? Well, let's look at an example.
We use CloudWatch alarms across our entire infrastructure in many different
applications; one specific use is tracking the number of delayed, inactive,
and failed jobs in our job queueing system, bee-queue.
Before, engineers would have to either manually make alarms in the AWS UI or run
a script that wasn’t fully intuitive to use. More than once, we’d ended up with
only two of the alarms existing, the third having been forgotten. With Terraform
modules though, creating three alarms is super simple:

module "process-cool-event-job-queue-alarms-bee-queue" {
  source = "./modules/job_queue_alarms"

  alarm_name   = "process-cool-event"

  # Note that the `alarm` and `ok` actions are SNS Topic ARNs that we use to hook
  # these alarms up to PagerDuty.
  ok_action    = "${var.high-priority-ok-action}"
  alarm_action = "${var.high-priority-alarm-action}"
}

It’s really that simple – one block of code for three alarms! How does this
work? Well, let’s look at the structure of the job_queue_alarms module.
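Its layout is a root main.tf plus one child module per alarm type – roughly like this (we've sketched the tree from the module sources below; exact file names may vary):

```
modules/job_queue_alarms/
├── main.tf        # declares variables and wires up the three children
├── delayed/
│   └── main.tf
├── failed/
│   └── main.tf
└── inactive/
    └── main.tf
```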


The root main.tf in the job_queue_alarms directory then looks like:

variable "alarm_name" {}
variable "ok_action" {}
variable "alarm_action" {}

variable "delayed_threshold" {
  default = "100.0"
}

variable "failed_threshold" {
  default = "100.0"
}

variable "inactive_threshold" {
  default = "100.0"
}

module "too-many-failed-jobs-bee-queue" {
  source = "./failed"

  alarm_name   = "${var.alarm_name}"
  ok_action    = "${var.ok_action}"
  alarm_action = "${var.alarm_action}"
  threshold    = "${var.failed_threshold}"
}

module "too-many-delayed-jobs-bee-queue" {
  source = "./delayed"

  alarm_name   = "${var.alarm_name}"
  ok_action    = "${var.ok_action}"
  alarm_action = "${var.alarm_action}"
  threshold    = "${var.delayed_threshold}"
}

module "too-many-inactive-jobs-bee-queue" {
  source = "./inactive"

  alarm_name   = "${var.alarm_name}"
  ok_action    = "${var.ok_action}"
  alarm_action = "${var.alarm_action}"
  threshold    = "${var.inactive_threshold}"
}

Each main.tf inside one of the child directories then looks like:

variable "alarm_name" {}

variable "namespace" {
  default = "bee-queue"
}

variable "threshold" {
  default = "100.0"
}

variable "ok_action" {}
variable "alarm_action" {}

variable "treatMissingData" {
  default = "missing"
}

# Note that the `delayed` part here is different between this `main.tf`
# and the `failed` and `inactive` `main.tf` files. Sure, we could refactor
# this into a single module (we're actually doing that ;) ), but this was
# one of our first forays into Terraform and we wanted to show our actual
# first steps.
resource "aws_cloudwatch_metric_alarm" "too-many-delayed-jobs-bee-queue" {
  alarm_name          = "${format("too-many-delayed-%s-jobs-bee-queue", var.alarm_name)}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "5"
  metric_name         = "delayed"
  namespace           = "${var.namespace}"
  period              = "60"
  statistic           = "Maximum"
  threshold           = "${var.threshold}"
  alarm_description   = "${format("Too many delayed %s jobs", var.alarm_name)}"
  ok_actions          = ["${var.ok_action}"]
  alarm_actions       = ["${var.alarm_action}"]

  treat_missing_data = "${var.treatMissingData}"

  dimensions {
    # Hardcoded as we don't create the alarms on our staging environments.
    environment = "production"
  }
}

Phew! There’s a lot going on here! The general gist, though, is that by using
variables, we can create reusable components that we can combine to create
multiple resources at a time! In our previous example of using the
job_queue_alarms module, we used the default threshold values. What if we
wanted to use custom threshold values? In that case, we’d do something like
this:

module "process-cool-event-job-queue-alarms-bee-queue" {
  source = "./modules/job_queue_alarms"

  alarm_name   = "process-cool-event"
  ok_action    = "${var.high-priority-ok-action}"
  alarm_action = "${var.high-priority-alarm-action}"

  delayed_threshold  = "500.0"
  failed_threshold   = "50.0"
  inactive_threshold = "120.0"
}

Et voilà! By using variables with default values, we can provide overriding
values to the module at any time, allowing for a very high degree of control
over otherwise very similar resources.

But wait, there’s more!

As we began to use Terraform for more and more across our
AWS infrastructure, we realized Terraform can be used to provision and configure
anything as long as there’s a Terraform provider for it. Giddy with excitement,
we began to quickly Terraform our PagerDuty schedules and service alarms!
While on the surface this seems excessive, it has huge benefits. By Terraforming
our PagerDuty alarms, we can create brand-new CloudWatch alarms for a new service
at the same time that we create the corresponding alarms in PagerDuty – and
programmatically connect the two in a single step!
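As a rough sketch of what that can look like (the service name, token variable, and escalation policy ID below are all hypothetical placeholders, not our actual configuration), the PagerDuty provider lets us define the service and its CloudWatch integration right alongside the alarms that feed it:

```hcl
# Hypothetical sketch – names, the token variable, and the escalation
# policy ID are placeholders, not our real configuration.
provider "pagerduty" {
  token = "${var.pagerduty_token}"
}

# Look up PagerDuty's built-in Amazon CloudWatch integration type.
data "pagerduty_vendor" "cloudwatch" {
  name = "Amazon CloudWatch"
}

resource "pagerduty_service" "job-queue" {
  name              = "bee-queue job alarms"
  escalation_policy = "${var.escalation_policy_id}"
}

# The integration gives us an endpoint that our SNS topics can deliver
# CloudWatch alarm notifications to.
resource "pagerduty_service_integration" "cloudwatch" {
  name    = "Amazon CloudWatch"
  service = "${pagerduty_service.job-queue.id}"
  vendor  = "${data.pagerduty_vendor.cloudwatch.id}"
}
```

Everything – the alarm, the SNS topic, and the PagerDuty service it pages – lives in one place, versioned together.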

What should I take away from this?

Infrastructure as code is incredible, but you shouldn’t feel like you have to
migrate the world all at once. We’ve found that incrementally moving our
infrastructure to a versioned provisioning system has led not only to
widespread internal adoption but also to increased interest in getting
involved with infrastructure work. At Mixmax, we’re not using Terraform for
everything yet, but we’re enjoying watching it make everyone’s lives easier
as we continue to roll it out across our systems.

Enjoy building smarter infrastructure in an intelligent way instead of wrangling the AWS UI? Drop us a line.


Written By

Trey Tacon
