How We Used Terraform to Craft Sym's Onboarding Experience

By Ari Spikerman
November 28, 2023

Terraform is not a particularly opinionated tool. It behaves like both a programming language and a framework: 

  • Like most languages, Terraform provides first-class abstractions that can encapsulate shared resources and avoid repetition.
  • Like most frameworks, Terraform provides a ton of optionality in how those abstractions fit together.

This means that you’re free to do almost anything “wrong” and still achieve your desired end state; it also means that, beyond syntax, correctness is largely about secondary concerns like readability, testability, and durability. Terraform is an incredibly flexible tool.

In practice, this means that we, the users, are left to make a lot of decisions ourselves when building out infrastructure in Terraform. With so many decisions to make, the lower-level details of what to organize into modules may not seem like a top priority. Underutilize modules, however, and your Terraform can quickly get out of control. Over-abstract into too many modules too soon, on the other hand, and you may find yourself staring down a surprise refactor.

Terraform is the primary language for configuring the Sym platform, the deployment of which relies on the correct configuration of a number of different, co-dependent resources. That means that as our goals have evolved and we've figured out the right level of abstraction to present to our customers, each decision we've made along the way has materially affected the developer experience for every new Sym customer. And of course, we use Terraform for our own infrastructure as well, so the learnings have flowed in both directions.

What follows is the story of Sym’s journey as we evolved our understanding of our own complex Terraform inputs into a simple, understandable, and efficient deployment and onboarding process.

Starting Out

A bit of important context before we dig in: to set up Sym, a user has to configure a mixture of Sym and AWS resources in Terraform. For example, a Terraform configuration might contain an `aws_iam_role` as well as a `sym_target` that points at that role (to see a full example of what this might look like, check out our AWS IAM Access example on GitHub). One of the most important Sym resources is the "Flow". It represents an approval workflow in Sym, allowing users to request temporary and auto-expiring access to sensitive resources. It's declared using the `sym_flow` resource in Terraform, and managed like any other Terraform resource.
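
For instance, the two might be wired together roughly like this (a sketch only; the `sym_target` type and settings keys here are illustrative, so check the linked example for the real schema):

```hcl
variable "sym_runtime_role_arn" {
  type        = string
  description = "Role that Sym's runtime assumes (illustrative)"
}

# The AWS IAM role that Sym will grant temporary access to.
resource "aws_iam_role" "prod_access" {
  name = "sym-prod-access"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = var.sym_runtime_role_arn }
    }]
  })
}

# The Sym Target that points at that role. Attribute names are
# illustrative, not the exact Sym provider schema.
resource "sym_target" "prod_access" {
  type = "aws_iam_role"
  name = "prod-access"
  settings = {
    role_arn = aws_iam_role.prod_access.arn
  }
}
```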

With this in mind, our initial goal was relatively simple: get customers onboarded – by any means necessary. This meant that we were very hands-on with our customers, going as far as pairing with them on their Terraform configuration, sometimes even writing it for them. We wanted people to get Sym up and running so we could start getting feedback on the rest of the product. We also started to form some opinions about how people should manage their Sym Terraform:

  • Flows should be isolated from one another.
  • There should be at least two different "environments" in a Sym configuration: a sandbox that provides access to non-essential resources for experimentation, and a production deployment that people will use every day.
  • Different people care about, and have access to configure, different parts of a system. The person who installs Sym's app for Slack or wires it up to the production AWS account is different from the person who writes the Python rules that determine who can have access and when.

With those opinions in mind, we ended up with a Terraform structure made up of several modules, split up by their Sym purpose. It looked roughly like this (directory names below are illustrative):
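
```
terraform/
├── sandbox/
│   ├── integrations/     # AWS wiring + shared Sym configuration
│   └── flows/
│       └── aws-iam/      # one Flow module per service
└── prod/
    ├── integrations/
    └── flows/
        ├── aws-iam/
        └── okta/
```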

This structure separated the different environments from each other, and isolated Sym Flows (one Flow module configures access management to one service) from Integrations (where all of the AWS bits and bobs, as well as more general Sym configuration, lived).

Down to Essentials

This structure worked for a while, but as we learned and started to see cracks, our goals changed:

Operationally, owning our customers' Sym Terraform code wasn't sustainable in the long term for anyone, so we wanted to move away from that. But even for our customers who already knew Terraform, the level of abstraction of this initial structure made it difficult to understand what was going on. New to Terraform and Sym? Forget it. We needed to figure out how to make our onboarding easy enough for customers to manage on their own, while ideally softening the learning curve for Sym's resource hierarchy.

We also learned that – for onboarding – a quick proof of concept (POC) was more important than giving different people fine-grained control over different parts of their system. And, if you’re just one person, running `terraform apply` in five different directories ends up being unnecessary, tedious work.

With simplicity, education, and a quick POC in mind, we moved from a very nested, module-based structure to one that used fewer modules and had a little less going on at the start. It looked something like this (again, the names are illustrative):
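
```
environments/
└── prod/
    └── main.tf           # wires the modules together
modules/
├── aws-sso-flow/         # AWS + Sym resources for one Flow
└── sym-core/             # Sym resources shared across Flows
```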

Using the `environments/prod` folder combined with a couple of major modules meant that adding a second environment later would be relatively easy, if needed. There were fewer directories to traverse to understand what was going on. Modules like `aws-sso-flow` combined AWS resources with Sym resources when they served the same purpose (e.g. providing access to AWS SSO), which meant that everything needed to manage access to an external system could be viewed in one place. The `sym-core` module contained resources that might need to be shared across Flows.

This setup was quite a bit easier to understand and work with. However, it still wasn't enough to make onboarding as easy as we wanted it to be for our users. 

Enter: Codegen

At this point, Sym's onboarding experience had gone through several overhauls. We'd cut out everything we could (reduced the steps it took to onboard, removed as much unnecessary boilerplate as possible), and made significant improvements to our documentation. However, we were still left with quite a bit of unavoidable Terraform configuration, and most of it wasn’t important for users to understand. So rather than building more resources to help new users write that code, we thought: what if we just... wrote it for them? (But smarter this time.)

Taking inspiration from Ruby on Rails, we planned to use our existing `symflow` CLI to generate the boilerplate on behalf of our users. Out of 100 lines of Terraform, there were only a handful of values that depended on the user's input. So we would just ask for those, and generate the rest of the Terraform files ourselves. This would save time for the user – since they wouldn't need to find the right place in the documentation to copy – and it would also reduce errors, since `symflow` CLI would always generate a Flow that worked right out of the box. You want an AWS IAM access Flow? Done. Okta? Also done. All you have to do is run a couple `symflow` CLI commands, supply a few required values, and `terraform apply`.

There was just one problem: generating code is hard.

Well, generating code to go from 0 to 1 isn't too hard, but going beyond that? Reading existing code that someone may have modified and generating code that works with it? That's very hard. And it's especially difficult if the Terraform configuration being generated is complex. Since we had already cut down the actual code as much as possible, we weren't left with many ways to make it easier. Our two variables were the structure of the Terraform and the code we would write to generate it. So as we thought about the code we'd write, we re-evaluated the Terraform structure yet again. Our goal this time was to find a balance between Terraform that was easy to understand and Terraform that would be easy to generate.

To achieve this, we restructured the Terraform to be broken into many files, each of which served a different purpose. For example: all of the very basic, always-required configuration for an environment like "prod" would go into an `environment.tf`, all the code required for secrets management would go into a `secrets.tf`, and each Flow would get its own `<name>_flow.tf`. Since a lot of these files would just need to be generated and then never touched again, we could relatively safely assume that if you had a `secrets.tf`, secrets were set up properly, and we could reference its contents when generating a new Flow that needed to use secrets. Not a foolproof approach, but certainly a start.
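
For example, a generated Flow file could reference a resource address that the generator assumes lives in `secrets.tf`, without ever parsing that file (the resource and attribute names here are illustrative, not the exact Sym provider schema):

```hcl
# okta_flow.tf (generated)
resource "sym_secret" "okta_api_key" {
  # `sym_secrets.this` is assumed to be declared in secrets.tf.
  source_id = sym_secrets.this.id
  path      = "/sym/prod/okta-api-key"
}
```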

We also wanted to focus on how generating code could help educate users about how Sym's Terraform worked. To do that, we decided not to use any modules to start. All resources would be in plain sight. The idea was that if we put everything in front of the user, they would gradually learn by tweaking things as needed to fit their specific use case.

At this point, we had some Terraform that didn't use any modules and was laid out something like this (file names are illustrative):
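
```
environments/
└── prod/
    ├── environment.tf    # basic, always-required configuration
    ├── secrets.tf        # secrets management
    ├── aws_sso_flow.tf   # one file per Flow
    └── okta_flow.tf
```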

But wait, what about people who don't use codegen?

Codegen is pretty cool, but that doesn't mean everyone can or wants to use it. To make sure codegen wasn't required, we still wanted to document how to configure everything yourself from scratch. If you're doing it yourself, though, the result should still structurally match the generated code, so that there's only one recommended way to structure Sym Terraform. We quickly found ourselves writing documentation that required users to copy over quite a bit of boilerplate, putting us right back where we started.

The solution to this, unsurprisingly, was to reintroduce modules. Instead of generating everything in plain sight, some things would be provided by Terraform modules published by Sym.

The big question then became: if some things are plain resources and some are modules, how do we decide what's what?

The answer turned out to be pretty simple: boilerplate. What we would consider "boilerplate" in a Sym Terraform configuration is code that is required (e.g. for Sym to access a customer's AWS account) but doesn't need much tweaking to work for a particular organization. Everybody's going to end up with the same thing, and education is less important because it doesn't need to be touched. Boilerplate should be in a module. It's Sym-owned code, it can be versioned, and organizations can update it as they require new features – without adding more Terraform resources manually. 

The other type of Terraform is organization-specific configuration. This is the bulk of what makes Sym work for a particular customer. This is the Terraform that says, "I want employees in group X to be able to access resource Y, and here is a set of rules to manage that access". This part of the configuration is where the Sym SDK comes in. The SDK is what makes Sym special, and it's what we want to educate people about. It's important to be able to tweak it, because every organization has their own infrastructure and specific rules around how to access it.

So this is where we are now (sketched below with illustrative file names):
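
```
environments/
└── prod/
    ├── connectors.tf     # Sym-published modules (the boilerplate)
    ├── environment.tf
    ├── secrets.tf
    ├── aws_sso_flow.tf   # organization-specific: one file per
    └── okta_flow.tf      #   thing Sym manages access to
```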

Boilerplate lives in modules published on the public Terraform registry by Sym (like our runtime-connector module) and is generated by our CLI in `connectors.tf`. The rest of the configuration is organized by what it allows access to, and all the resources are easy to find. This way, if someone doesn't want to use codegen, they only need to copy a small module and then write their own Terraform for exactly what they want Sym to do. If they do want to use codegen, all they have to do is run a couple commands in a terminal, and they'll end up with a working Flow. Since it's so easy to get started, users can slowly learn how Sym works by making tweaks based on what they can already see happening as they make Sym requests. 
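
A `connectors.tf` generated this way might contain little more than a versioned module call (the source address, version, and inputs below are assumptions, not the module's exact interface):

```hcl
# connectors.tf (generated) -- the module source, version, and inputs
# are illustrative; see Sym's runtime-connector module on the Terraform
# registry for the real interface.
module "runtime_connector" {
  source  = "symopsio/runtime-connector/aws"
  version = "~> 1.0"

  environment = "prod"
}
```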

Final Thoughts

The lesson we’ve learned about separating boilerplate from what really matters also extends beyond our public Terraform. Our platform’s infrastructure is configured using Terraform, and it contains both boilerplate that rarely gets updated (e.g. our database configuration) as well as configuration that needs to be changed frequently (e.g. what version of our platform gets deployed). By applying what we’ve learned through iterating on our product and creating modules for our boilerplate, we can make our own infrastructure easier to work with, too.

While this answer seems obvious in retrospect, we had to learn a lot and iterate to get to the right level of abstraction. This won’t be the last iteration, either. We care deeply about developer experience, and are always trying to make Sym easier to use without taking away any of the platform’s power.
