Terraforming all the things

Iteratively migrating to “infrastructure as code”

Monday, Dec 4th, 2017

Mixmax is a communications platform that brings professional communication & email into the 21st century.

This blog post is part of the Mixmax 2017 Advent Calendar. The previous post, on December 3rd, was Handling 3rd-party JavaScript with Rollup.

tl;dr - We use Terraform for almost everything and we're never looking back.

The problem

Have you ever tried to navigate around the AWS UI to hunt down a configuration issue? Perhaps someone accidentally clicked the wrong button, and suddenly one of your high-throughput ElastiCache Redis deployments is downsizing to a t2.medium just because… This isn't your fault or your team's fault: you can't blame anyone for being overwhelmed by the sheer amount of UI there is in the AWS web console.

AWS UI everywhere

I'm not even going to talk about the hours upon hours it took me to reorient myself with the AWS services dropdown when it was reorganized earlier this year. Managing infrastructure is hard, and if you're managing it through a UI, can you really expect it to get any easier? Absolutely not. The sheer number of knobs and dials you can turn when configuring any system is not only overwhelming, it also makes it easy to miss a confirmation modal or to not realize that there wasn't one. Configuration mistakes are therefore very easy to make, and you're only one click away from a mistake that you might never notice (such as making an S3 bucket publicly available).

Is it all hopeless?

Fear not! Configuration can also be done with, well you know, configuration files. But why stop there? Why not drink a little more of the Kool-Aid and begin to version your infrastructure? While "Infrastructure as code" can seem terrifying and daunting to implement, we're here to tell you that it's very easy to incrementally roll out across your infrastructure.

Side note: What is Infrastructure as code?

Infrastructure as code, technically, means configuring infrastructure with an automated system instead of configuring it manually. So instead of manually going to the AWS UI and clicking some buttons, or instead of hopping on a server and fiddling with some config files, you make changes to machine readable files that your automated system can then use to apply those changes for you. The utility of such a system becomes very apparent when a few additional lines of configuration code can be used to modify your entire server fleet.
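To make that concrete, here is a minimal sketch of what such a machine-readable file can look like in Terraform, using the ElastiCache scenario from earlier. The resource label, cluster ID, node type and parameter group are hypothetical placeholders, not our actual configuration.

```hcl
# Hypothetical example: a single-node Redis cluster described as code.
# All names and sizes below are placeholders chosen for illustration.
resource "aws_elasticache_cluster" "example_job_queue" {
  cluster_id           = "example-job-queue"
  engine               = "redis"
  node_type            = "cache.m4.large"
  num_cache_nodes      = 1
  parameter_group_name = "default.redis3.2"
  port                 = 6379
}
```

Because the node type lives in a versioned file, a resize becomes a reviewed one-line change, and an accidental resize made in the console should show up as a difference the next time `terraform plan` is run.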

Moving from a manually managed system to a fully automated one can seem daunting because it can be incredibly difficult to identify how to even begin the migration process. Not only that, but it can be difficult to find a low stakes environment in which to begin to test the waters without committing your entire infrastructure to the new process.

Infrastructure as code: start with the little pieces

At Mixmax, we use many of AWS's services, from Elastic Beanstalk and ElastiCache all the way through CloudWatch and DynamoDB. We knew it wouldn't be feasible to move our entire world to a versioned configuration system in one fell swoop, so we wanted a tool that would let us incrementally bring our infrastructure under version control. For us, this made HashiCorp's Terraform a no-brainer: we could use it to manage small deployments of non-application-level systems before committing to managing our application services with it. To migrate incrementally, we began by moving the fairly static components of our infrastructure under Terraform's control. There are many other tools in this space, but most are primarily application configuration systems that were retrofitted to also serve as provisioning tools, whereas Terraform has been a flexible provisioning tool from the start.

First we moved our CloudWatch alarms and our SNS topics and subscriptions to be controlled via Terraform. Using Terraform modules for this was so successful that engineers who previously never wanted to touch CloudWatch alarms began to create them with glee! We'd turned a painful part of our development process into something our team found a joy to work with. After that success, we decided to try something with higher stakes, and so we moved our ElastiCache Redis deployments to be provisioned and managed via Terraform. Again, using Terraform modules made this a breeze.

Why do Terraform modules make this so simple? Well, let's look at an example. We use CloudWatch alarms across our entire infrastructure in many different applications, and one specific use is tracking the number of delayed, inactive and failed jobs in our job queueing system, bee-queue. Before, engineers would have to either manually make alarms in the AWS UI or run a script that wasn't fully intuitive to use. More than once, we ended up with only two of the alarms existing, the third having been forgotten. With Terraform modules though, creating three alarms is super simple:

module "process-cool-event-job-queue-alarms-bee-queue" {
  source = "./modules/job_queue_alarms"

  alarm_name   = "process-cool-event"

  # Note that the `alarm` and `ok` actions are SNS Topic ARNs that we use to hook
  # these alarms up to PagerDuty.
  ok_action    = "${var.high-priority-ok-action}"
  alarm_action = "${var.high-priority-alarm-action}"
}

It's really that simple: one segment of code for three alarms! How does this work? Well, let's look at the structure of the job_queue_alarms folder.

  job_queue_alarms/
      main.tf
      delayed/
          main.tf
      failed/
          main.tf
      inactive/
          main.tf

The root main.tf in the job_queue_alarms directory then looks like:

variable "alarm_name" {}
variable "ok_action" {}
variable "alarm_action" {}

# Per-alarm thresholds. Callers can override any of these defaults.
variable "delayed_threshold" {
  default = "100.0"
}

variable "failed_threshold" {
  default = "100.0"
}

variable "inactive_threshold" {
  default = "100.0"
}

module "too-many-failed-jobs-bee-queue" {
  source = "./failed"

  alarm_name   = "${var.alarm_name}"
  ok_action    = "${var.ok_action}"
  alarm_action = "${var.alarm_action}"
  threshold    = "${var.failed_threshold}"
}

module "too-many-delayed-jobs-bee-queue" {
  source = "./delayed"

  alarm_name   = "${var.alarm_name}"
  ok_action    = "${var.ok_action}"
  alarm_action = "${var.alarm_action}"
  threshold    = "${var.delayed_threshold}"
}

module "too-many-inactive-jobs-bee-queue" {
  source = "./inactive"

  alarm_name   = "${var.alarm_name}"
  ok_action    = "${var.ok_action}"
  alarm_action = "${var.alarm_action}"
  threshold    = "${var.inactive_threshold}"
}

Each main.tf inside one of the child directories looks like:

variable "alarm_name" {}
variable "namespace" {
  default = "bee-queue"
}

variable "threshold" {
  default = "100.0"
}

variable "ok_action" {}
variable "alarm_action" {}
variable "treatMissingData" {
  default = "missing"
}

# Note that the `delayed` part here is different between this `main.tf`
# and the `failed` and `inactive` `main.tf` files. Sure, we could refactor
# this into a single module (we're actually doing that ;) ), but this was
# one of our first forays into Terraform and we wanted to show our actual
# first steps.
resource "aws_cloudwatch_metric_alarm" "too-many-delayed-jobs-bee-queue" {
  alarm_name          = "${format("too-many-delayed-%s-jobs-bee-queue", var.alarm_name)}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "5"
  metric_name         = "delayed"
  namespace           = "${var.namespace}"
  period              = "60"
  statistic           = "Maximum"
  threshold           = "${var.threshold}"
  alarm_description   = "${format("Too many delayed %s jobs", var.alarm_name)}"
  ok_actions          = ["${var.ok_action}"]
  alarm_actions       = ["${var.alarm_action}"]

  treat_missing_data = "${var.treatMissingData}"

  dimensions {
    # Hardcoded as we don't create the alarms on our staging environments.
    environment = "production"
  }
}

Phew! There's a lot going on here! The general gist is that by using variables, we can create reusable components that we can then combine to create multiple resources at a time. In our previous example of using the job_queue_alarms module, we used the default threshold values. What if we wanted to use custom threshold values? In that case, we'd do something similar to this:

module "process-cool-event-job-queue-alarms-bee-queue" {
  source = "./modules/job_queue_alarms"

  alarm_name   = "process-cool-event"
  ok_action    = "${var.high-priority-ok-action}"
  alarm_action = "${var.high-priority-alarm-action}"

  delayed_threshold  = "500.0"
  failed_threshold   = "50.0"
  inactive_threshold = "120.0"
}

Et voila! By using variables with default values, we can provide overriding values to the module at any time, allowing for a very high degree of control over otherwise very similar resources.
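The comment in the child module above mentions refactoring the three near-identical child modules into a single one. As a rough sketch of how that might look (this is an illustration, not our actual refactor), the only thing that differs between the three is the metric name, so it can become a variable. The `metric_name` variable and the resource label here are hypothetical.

```hcl
# Hypothetical single child module replacing delayed/, failed/ and inactive/.
variable "alarm_name" {}
variable "ok_action" {}
variable "alarm_action" {}

# Assumed to be one of "delayed", "failed" or "inactive".
variable "metric_name" {}

variable "threshold" {
  default = "100.0"
}

resource "aws_cloudwatch_metric_alarm" "too-many-jobs-bee-queue" {
  alarm_name          = "${format("too-many-%s-%s-jobs-bee-queue", var.metric_name, var.alarm_name)}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "5"
  metric_name         = "${var.metric_name}"
  namespace           = "bee-queue"
  period              = "60"
  statistic           = "Maximum"
  threshold           = "${var.threshold}"
  alarm_description   = "${format("Too many %s %s jobs", var.metric_name, var.alarm_name)}"
  ok_actions          = ["${var.ok_action}"]
  alarm_actions       = ["${var.alarm_action}"]
  treat_missing_data  = "missing"

  dimensions {
    # Hardcoded, as in the original child modules.
    environment = "production"
  }
}
```

The root module would then call this one module three times, passing "delayed", "failed" and "inactive" as `metric_name`, which keeps the generated alarm names identical to the ones produced by the three separate modules.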

But wait, there's more!

As we began to use Terraform for more and more across our AWS infrastructure, we realized Terraform can be used to provision and configure anything as long as there's a Terraform provider for it. Giddy with excitement, we quickly began to Terraform our PagerDuty schedules and service alarms! While on the surface this seems excessive, it has huge benefits. By Terraforming our PagerDuty alarms, we can create the CloudWatch alarms for a new service in the same change that creates the matching alarms in PagerDuty, which means we can connect the two programmatically.
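As a sketch of what that connection can look like, the snippet below creates a PagerDuty service with a CloudWatch integration and subscribes an SNS topic to it. Everything named here is hypothetical (the service name, the topic name and the `escalation_policy_id` variable), and the endpoint URL format is an assumption based on PagerDuty's CloudWatch integration instructions, so check it against the current provider and PagerDuty documentation before relying on it.

```hcl
# Hypothetical sketch: all names and IDs are placeholders.
variable "escalation_policy_id" {}

resource "pagerduty_service" "cool_event" {
  name              = "process-cool-event"
  escalation_policy = "${var.escalation_policy_id}"
}

# Look up PagerDuty's built-in CloudWatch vendor.
data "pagerduty_vendor" "cloudwatch" {
  name = "Cloudwatch"
}

resource "pagerduty_service_integration" "cloudwatch" {
  name    = "${data.pagerduty_vendor.cloudwatch.name}"
  service = "${pagerduty_service.cool_event.id}"
  vendor  = "${data.pagerduty_vendor.cloudwatch.id}"
}

resource "aws_sns_topic" "cool_event_alarms" {
  name = "process-cool-event-alarms"
}

# Assumed URL format for PagerDuty's CloudWatch integration endpoint.
resource "aws_sns_topic_subscription" "pagerduty" {
  topic_arn              = "${aws_sns_topic.cool_event_alarms.arn}"
  protocol               = "https"
  endpoint               = "https://events.pagerduty.com/integration/${pagerduty_service_integration.cloudwatch.integration_key}/enqueue"
  endpoint_auto_confirms = true
}
```

The topic's ARN could then be passed as the `alarm_action` and `ok_action` of an alarm module like the one shown earlier, so that the alarm, the topic and the PagerDuty service are all created in a single change.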

What should I take away from this?

Infrastructure as code is incredible, but you shouldn't feel like you have to migrate the world all at once. We've found that by incrementally moving our infrastructure to a versioned provisioning system, we've had not only widespread adoption internally but also an increase in interest in getting involved with infrastructure work. At Mixmax, we're not using Terraform for everything yet, but we're enjoying the process of seeing how it's making everyone's lives easier while we continue to roll its usage out across our systems.

Enjoy building smarter infrastructure in an intelligent way instead of wrangling the AWS UI? Drop us a line.