This is the first part of a series of articles on the Terraform E+A Pattern. Before getting into the details, let's establish some terminology. For the purpose of this article, the overall system consists of a number of environments that provide some shared, foundational infrastructure, and then several applications that are separately deployable into those environments.
The goal of environments is to create a situation where some or all of the system components are duplicated so that changes (whether to infrastructure or application code) can be tested without risking the primary "production" environment. For the sake of our discussion here we will consider just two environments, which I'll call "QA" and "Production". Other teams might refer to the former of these as "staging", "test", or "pre-production"; whatever you call it, it's a duplicate of production that is entirely separate from it, so that changes to QA can be made with the confidence that the "real system" (used by customers) won't be affected.
Applications are system components that usually implement some business logic for your product, that are deployed once for each environment, and that may interact both with shared environment infrastructure and with other applications. Other teams might refer to applications as "services", "micro-services", etc. The intent is just that each application is deployed independently of the others, and interacts with the environment and other applications via well-defined interfaces to manage risk as things change over time.
It's easier to discuss the pattern with practical examples, so for the sake of this article I'm going to use network and compute resources from AWS, Consul as a configuration store, and some other ancillary providers here and there to illustrate different ideas. The general pattern is provider-agnostic, however, and should map onto other products with comparable capabilities; the only strong requirement is that there be a data store that Terraform can both read from and write to.
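To make that read/write requirement concrete, here's a minimal sketch of the idea using the Consul provider. The address, key path, and vpc_id variable are placeholders of my own invention: one configuration publishes a value into Consul, and a separate configuration reads it back without needing to know how it was produced.

```hcl
provider "consul" {
  address = "consul.example.com:8500" # placeholder address
}

variable "vpc_id" {
  type = string
}

# In an environment-level configuration: publish a value for applications to consume.
resource "consul_keys" "environment" {
  key {
    path  = "env/qa/vpc_id" # hypothetical key path
    value = var.vpc_id
  }
}

# In an application-level configuration: read the same value back,
# referring to it later as data.consul_keys.environment.var.vpc_id.
data "consul_keys" "environment" {
  key {
    name = "vpc_id"
    path = "env/qa/vpc_id"
  }
}
```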
I'm also going to use a fictitious placeholder system for my examples. This system is a typical web content management system consisting of a public-facing "renderer" application that produces the pages seen by end users, an internal-team-facing "editor" application used by content authors to produce content, and a "store" application that provides a common backend API for the renderer and editor and acts as the interface to the system's core data.
Limits of Scale
No pattern is a magic bullet, so rather than promising you huge success I will first discuss the limits of my experience with this pattern and the challenges I'd anticipate with growing beyond those limits.
This pattern has been successfully deployed as the organizational principle behind a system consisting of tens of applications deployed across three environments.
The scaling limits hit in that system were largely those of underlying technology used by those specific applications rather than of the pattern itself.
This pattern has not yet been applied to a system with hundreds or thousands of distinct applications sharing common infrastructure. Many new challenges emerge at this scale which often warrant a higher level of abstraction for applications interacting with infrastructure, and it's not known to what extent those new challenges can be addressed within this pattern.
My suspicion -- not yet tested in practice! -- is that the pattern can scale to the extent that the underlying technology choices are appropriate. For example, although this article uses AWS EC2 compute infrastructure directly for application deployment, the environments could instead provide a smarter workload scheduler such as Kubernetes or Nomad and yet still use Terraform to deploy workloads to that scheduler.
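As a purely hypothetical sketch of that idea (the Nomad address and jobspec filename here are invented for illustration), an application could be deployed onto an environment-provided Nomad cluster using the Nomad provider, rather than creating EC2 instances directly:

```hcl
provider "nomad" {
  address = "http://nomad.qa.example.com:4646" # placeholder cluster address
}

# Submit the "renderer" application as a job to the environment's scheduler.
resource "nomad_job" "renderer" {
  jobspec = file("${path.module}/renderer.nomad") # imagined jobspec file
}
```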
Regardless of technology choice, I'd encourage prototyping with some real applications adapted to this pattern before going all-in. Fundamentally, how well this pattern suits your system will depend on many different concerns, such as how your team is structured and what other technology choices you have already made.
Environment-Agnostic Infrastructure
Although the ultimate goal is for each environment to be distinct, practical concerns will usually lead to at least some components being shared across environments. This might include your AWS account itself (though a separate account per environment is certainly possible too), some user accounts used by engineers to access the system, and most importantly some infrastructure that you use to manage the environments themselves.
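As a rough illustration of what that environment-agnostic layer might contain (the bucket and group names below are placeholders, and your own foundational needs may be quite different), it could be as little as a state-storage bucket and an IAM group for the engineers who operate the environments:

```hcl
provider "aws" {
  region = "us-west-2"
}

# A single bucket to hold Terraform state for every environment and application.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "example-terraform-state" # placeholder; bucket names are globally unique
}

# A group for the engineers who will operate the environments.
resource "aws_iam_group" "operators" {
  name = "environment-operators"
}
```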
When getting started it's often fine and pragmatic to set up these foundational services manually, or just stand up the environments by running Terraform on your local computer. You can always come back and put automation around this later.
Since the global infrastructure is not the focus of this article, I will just assume that you have already created the accounts you need with whatever cloud providers you intend to use, or, if you're running on your own hardware, that your physical network and datacenter management infrastructure is racked and powered on. I'll also assume you have somewhere to run Terraform that has the appropriate access to those resources.
As you get your system more firmly in place and want to automate it further, you can start running Terraform through tools such as Jenkins or HashiCorp's own Terraform Enterprise, but when starting entirely from scratch I find it easiest to work locally to get the basics in place and then retroactively automate the setup of the foundational elements. We won't go into this any deeper for now.