this is how i terraform
I started using terraform at work in 2016. In 2018 I wrote a talk called "7 Ways Terraform Will Kill You" which I mothballed because I was certain every one of the gotchas I wanted to highlight would soon be fixed in an upcoming release. Many of them were addressed in 2019 with the major language overhaul of terraform 0.12, but I probably should have given that talk. I'm sure it would have helped someone.
I say all this to say that in the roughly nine years that I have been using terraform to manage large infrastructure projects I have banged my shins on many sharp corners and the bruises have left strong opinions.
If you already have strong opinions about terraform, this post is not for you. Your approach is great, keep using it. My opinion doesn't matter. Don't read another word. Go see what's happening on mastodon or something.
If you do not have strong opinions about terraform and are just getting started, you are my target audience.
Assumptions: everything here is based on the assumption that you're using terraform for something long term and important. If you just want to get something done quickly, or the thing you're working on can be burned down and recreated without causing problems, none of this applies to you.
(aside: all this applies to opentofu as well).
Rule 1. Complexity is the enemy
In a regular software project one can optimize for any number of things: memory or CPU efficiency, the ability to easily refactor and add features, correctness of the implementation (eg. rigorous tests), and so on. With a nontrivial terraform project you should optimize for one thing: the ability to reason about the project.
It is incredibly easy to build a terraform project which is extremely difficult to reason about.
Yes, the plan helps, but do NOT believe the plan! Apply is all that matters, and plan != apply. If you have not yet seen a plan that looked good which turned into an apply that went bad, don't worry, you will.
But before you can even see the plan you have to implement your change. You have to look at the existing codebase, decide how to structure your change and what needs to be modified, and execute. In a nontrivial terraform codebase this can be a daunting task!
Sources of complexity are numerous:
- layers of indirection (eg. nested modules)
  - Speaking of which, do not nest modules.
- disparate inputs (eg. env vars, tfvars files)
- complex logic
At every turn, with every PR, you must push back on anything that increases complexity. When you inevitably cave and add another layer of indirection or piece of logic, leave a comment that explains why.
The comment audience is your future self, who will have long forgotten what you were thinking.
The fewer places you have to look to figure out what's happening, the better.
Don't think of it as "IaC"
The problem with the phrase "Infrastructure as Code" is in the word "code". As soon as you call it code people start to cargo cult in all these software engineering principles that have NO BUSINESS in a terraform project (see rule 1). If you want to control your infrastructure with actual code go use Pulumi or AWS CDK and implement an AbstractLoadBalancerFactoryBaseClass() or whatever.
Terraform is "infrastructure as config files".
Sure, go ahead and write reusable modules. Add loops. Use conditionals. But do so sparingly. It is much easier to reason about 27 github_repo resources with slightly different configs than one module called 27 times or, worse, looping over one module using a list that contains 27 data structures representing each repo config.
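To make that concrete, here's a minimal sketch (github_repository and its arguments are real GitHub provider bits; the module path and var.repos are made up):

# Easy to reason about: every repo is spelled out, even if it's a little repetitive.
resource "github_repository" "api" {
  name       = "api"
  visibility = "private"
}

resource "github_repository" "docs" {
  name       = "docs"
  visibility = "public"
}

# Harder to reason about: the actual configs are hiding inside var.repos,
# and you have to mentally run the for_each to know what exists.
module "repo" {
  source   = "./modules/repo"
  for_each = { for r in var.repos : r.name => r }

  name       = each.value.name
  visibility = each.value.visibility
}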
Don't use community modules
A terraform module is an opinion about how to do things with respect to some group of resources. The problem with community modules is that they need to support the many opinions present in the community, making them more complex by necessity (see rule 1).
Adam Jacob brilliantly explains this using what he calls the "200% knowledge problem" in this talk (deep link to the specific moment, just watch for 40 seconds).
Read them for inspiration and then write your own, with fewer resources and using fewer variables and conditions. Instead of swallowing community opinions wholesale, capture the opinions of your organization within your own modules. A module should contain opinions like "This ALB configuration suits the needs of our app" or "replicating data across two zones in a single region is sufficiently robust for our needs".
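For example, a tiny in-house ALB module might look something like this (aws_lb and its arguments are real, but the specific opinions and names here are just an illustration):

# modules/alb/main.tf — bakes in our opinions and exposes almost no knobs
variable "name" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

resource "aws_lb" "this" {
  name               = var.name
  load_balancer_type = "application"
  internal           = true        # our opinion: nothing is internet-facing by default
  subnets            = var.subnet_ids

  access_logs {
    bucket  = "example-alb-logs"   # hypothetical logging bucket
    enabled = true
  }
}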
Disregard anyone who argues this point by invoking phrases like "reinventing the wheel". Those people are not on the hook for maintaining your infrastructure, you are.
Don't add tools until you're forced to
- TF Cloud is fine.
- Spacelift is fine.
- Atlantis is fine.
- Terragrunt is fine I guess, I've never used it.
- I do not understand the point of terratest at all.
  - Please do not try to explain it to me, I don't care.
- I'm sure the others I've forgotten are cool too.
Don't use them unless you ABSOLUTELY NEED THEM (see rule 1). For some reason everyone is happy to say "Kubernetes is overkill for most teams" but nobody wants to say "TF Cloud is overkill for most teams".
Well I'm sayin it!
Many platform / SRE teams are 1-4 people and you can get most of the value of collaboration tools like Atlantis from a little team communication and good PR hygiene.
Iterating on a broken plan/apply cycle with a tool like Atlantis in the developer loop sucks. Just tell the team you're planning / applying from your laptop while you iterate, trust the state lock to avoid collisions, and communicate the results when you're finished.
With respect to the tools that don't focus on collaboration: you can get very far using just terraform and some project structure. If you do not have a rock solid reason to adopt them, don't.
Pursue consistency when "threading"
A big terraform project involves repeatedly passing information from one place to the next, and then when terraform is complete it is often necessary to pass outputs into something else like an ansible playbook or a helm chart.
eg.
variable "domain" {
type = string
default = "foo.com"
}
module "dns" {
root = var.domain
}
module "loadbalancer" {
hostname = module.dns.registered_domain
}
output "monitoring_endpoint" {
value = module.loadbalancer.external_name
}
Here the domain for a project gets "threaded" from a variable called domain into the DNS module as a parameter called root, which outputs a value called registered_domain (maybe having prefixed www or something), which is passed to the LB module as a parameter called hostname, then output again as external_name (maybe having added https://), and finally exposed as the terraform output monitoring_endpoint, which gets handed off to some monitoring tool. This (not as contrived as you might think) example passes the same data around under six different labels.
Could you make the argument that each use of this data happens in its own domain and has its own internal data model that makes sense for that use case? Sure.
Fight back against that argument.
Try to use a consistent name across the entire lifecycle of the data as it gets passed around. Reduce the number of things you have to look at to understand what's happening (see rule 1).
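A sketch of the same chain with one name threaded all the way through (the module paths and their outputs are hypothetical):

variable "domain" {
  type = string
}

module "dns" {
  source = "./modules/dns"
  domain = var.domain
}

module "loadbalancer" {
  source = "./modules/loadbalancer"
  domain = module.dns.domain
}

output "domain" {
  value = module.loadbalancer.domain
}

Now there is exactly one word to grep for.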
Don't rely on memory
Avoid anything that requires you to remember to do something before you apply. Don't set it up so you have to export the correct AWS_SECRET_ACCESS_KEY. Instead, hard code a named profile or a specific IAM role in the provider config.
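Something like this in the provider config (the profile name and role ARN are made up):

provider "aws" {
  region  = "us-east-1"
  profile = "prod-admin"   # hypothetical named profile; nothing to export before you apply

  # or pin a specific role instead:
  # assume_role {
  #   role_arn = "arn:aws:iam::123456789012:role/terraform"
  # }
}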
Don't require your team to switch to the correct workspace before applying; in fact, don't use workspaces at all. It's too easy to accidentally bulldoze production this way. Make it impossible to fuck up.
Style
This is just a random smattering of preferences which help improve maintainability.
Do not put everything into one huge pile; separate things into separate terraform states. There are lots of guides on how to structure this. This one is fine but there are plenty of others. Choose one.
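One hypothetical way that shakes out is a directory per component, each with its own backend key:

# prod/network/backend.tf
terraform {
  backend "s3" {
    bucket = "example-tf-state"   # made-up bucket
    key    = "prod/network/terraform.tfstate"
    region = "us-east-1"
  }
}

# prod/app/backend.tf — a separate state, so a bad day here can't touch the network
terraform {
  backend "s3" {
    bucket = "example-tf-state"
    key    = "prod/app/terraform.tfstate"
    region = "us-east-1"
  }
}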
Never use count to create more than one of something. count should only ever be used when you want to create something if some condition is true, and not create it otherwise. If you want to create more than one of something, use for_each.
I searched for a better source to explain why but this was the most succinct thing I could find.
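In case that link rots, the short version in code (the resource types are real AWS ones; the names and variables are made up):

# count as an on/off switch: fine.
resource "aws_cloudwatch_log_group" "app" {
  count = var.enable_logging ? 1 : 0
  name  = "/example/app"
}

# count for multiples: removing one item from the middle of the list shifts
# every index after it, and terraform wants to destroy and recreate the rest.
# resource "aws_s3_bucket" "bad" {
#   count  = length(var.bucket_names)
#   bucket = var.bucket_names[count.index]
# }

# for_each for multiples: each instance is keyed by name, so adding or
# removing one leaves the others alone.
resource "aws_s3_bucket" "good" {
  for_each = toset(var.bucket_names)
  bucket   = each.key
}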
Arguments to resources are the domain of the provider and you are beholden to them. AWS tags can only contain certain characters while K8s labels can only contain others. Don't fret about what goes in there, but terraform resource names are yours. You decide how they look and feel.
- Never capitalize them.
- Never use anything but underscores as separators.
- Never put the resource type in the name.
  - The type is right there next to it. This tip brought to you by the department of redundancy department.
🤮
resource "aws_ec2_instance" "Nginx-Instance"
🤩
resource "aws_ec2_instance" "nginx"