In this article, we explain how you can deploy your Terragrunt infrastructure with this paid Terraform pipeline module.
Shortcomings in Vanilla Terraform
When you have a relatively complex Terraform infrastructure, you will find certain shortcomings in vanilla Terraform:
- Module dependencies are a pain: you often have to apply certain modules before you can finish a plan or apply.
- Module versioning is a pain, which makes it harder to test changes in pre-production environments.
- You will find you are repeating yourself a lot (provider blocks or backend blocks, for instance).
- Plans and applies take a very long time because all state has to be refreshed. A plan for your full AWS environment can take up to 60 minutes, even if you have only changed one module.
- You will find that 95% of the code surrounding your infrastructure is the same, with maybe some small differences between production and pre-production.
Terragrunt is a layer on top of Terraform that aims to fix these things.
However, when you want to deploy Terragrunt in your own AWS infrastructure, you need to figure this out on your own, or use external services like env0. You cannot use Terraform Cloud right now, as it does not support Terragrunt.
Furthermore, you should know that all these external services need access to your Terraform state file. This state file can hold a lot of sensitive information (parameter store values, database passwords, etc.). In my opinion it is better to keep this state information within your organisation.
On top of that, using the run-all command in a CI/CD pipeline with Terragrunt has some shortcomings as well:
- You will find you need to mock outputs for dependencies in your terragrunt.hcl files, because the outputs of certain modules are required to run a plan or validate step.
- Run-all deploys all modules, including modules that might not have changed.
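A dependency with mocked outputs in a terragrunt.hcl might look like this (the vpc module name and output are illustrative):

```hcl
# terragrunt.hcl of a module that depends on a "vpc" module
dependency "vpc" {
  config_path = "../vpc"

  # Mocked outputs are only used when the real outputs are not
  # available yet (for example during validate or an initial plan)
  mock_outputs = {
    vpc_id = "vpc-00000000"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  vpc_id = dependency.vpc.outputs.vpc_id
}
```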
This is why we constructed a Terraform module specifically for this use. It includes an example repository so you can hit the ground running.
Terraform module features
The Terraform module is packed with features that allow you to deploy Terragrunt code in a consistent manner:
- Support for all major VCS providers (GitHub, GitLab and Bitbucket).
- Initial deploy mode that uses run-all to deploy the initial infrastructure (optionally resolving dependencies automatically).
- Includes two repositories with example Terragrunt code and modules.
- Uses AWS best practices to separate workloads into multiple AWS accounts.
- Only deploys changed modules on subsequent runs, making for lightning-fast deployments and shorter plan outputs.
- Uses existing AWS infrastructure (AWS CodePipeline).
- Keeps your state internal to your own AWS accounts (versioned S3 with DynamoDB locking).
- Optional approval of deployments.
- Is able to detect deleted modules and plan a destroy for them.
- Uses AWS roles to deploy the code into your accounts.
- No limitations on the number of environments you can deploy.
- One time payment instead of monthly recurring costs.
- Includes an optional destroy pipeline to easily destroy infrastructure in accounts that are no longer needed.
- Plan outputs are shared between steps, so the plan you approve is the plan that gets applied.
- Includes an ECS clone module that speeds up Docker Hub clones and circumvents 429 rate limits.
- Ongoing costs for this module are mostly CodeBuild costs, so in the order of dollars, not thousands of dollars.
- Includes a Docker-based CLI to prepare your own repositories and to provision the organisation structure in AWS.
Two GitHub (private) repositories
Starting requires two private GitHub repositories:
- An infrastructure repository (in our case elasticscale_infrastructure); this holds the environments in folders like dev and prod.
- A modules repository (in our case elasticscale_infrastructure_modules); this holds all the modules and the version tags you want to deploy.
In my opinion your infrastructure code should be stored in a separate repository from your application code, because the rate of change of these repositories is vastly different. After some time your infrastructure code stabilises, but your application gets changed on a daily basis. Deploying them together gives you no direct benefits, only more operational overhead.
Further down you will find a CLI tool that can automatically generate the right repositories for you (based on these two example repositories).
The infrastructure repository has several branches (one branch per environment), and there are folders for every environment. You cannot use a single folder for all your infrastructure because AWS CodePipeline does not support this.
Furthermore, you will find that you tend to use different services in your production environments. For example, in production you might run a MongoDB Atlas cluster, while in staging you run MongoDB on ECS Fargate.
You would also have an infrastructure folder and infrastructure branch (the folder and branch name always need to match). In this infrastructure account you would just have some IAM information and this specific pipeline module (and maybe a VPC).
Having separate folders per environment allows you to do just that. It also allows you to have a common configuration in the root of the repository (for instance the AWS region).
Having a separate modules repository allows you to test modules locally before you push them into the repository. When you push the commit you can also add a semantic versioning tag to that specific commit.
This allows you to re-use modules but switch versions on an environment-by-environment basis. It also supports the concept of DRY (Don't Repeat Yourself). You can make a module behave differently based on whether it is used in production or not. Often most of the code actually overlaps (it might just use bigger instances and enable backups in the production setup).
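For example, each environment can pin its own module version via the source ref (the module path and version tags are illustrative):

```hcl
# staging/tf/app/terragrunt.hcl -- test the new version in staging first
terraform {
  source = "git::ssh://git@github.com/elasticscale/elasticscale_infrastructure_modules.git//app?ref=1.1.0"
}

# prod/tf/app/terragrunt.hcl -- production stays on the stable tag
terraform {
  source = "git::ssh://git@github.com/elasticscale/elasticscale_infrastructure_modules.git//app?ref=1.0.0"
}
```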
Chicken and egg problem
Setting up the module is straightforward, but we need to overcome the so-called chicken-and-egg problem: we need to deploy this Terragrunt CI pipeline module, but ideally we would like to use a pipeline to deploy this pipeline.
We are going to solve this by setting up the module manually in a brand new infrastructure account, using regular Terraform commands on our local machine. After that, we will deploy the infrastructure account with the pipeline we have just created. Once that deploys successfully, we can run terraform destroy again on our local machine.
Using the CLI
You can also use the CLI tool we’ve built to create the organisation structure and roles for you, directly in your own AWS account:
Just run the following command and follow the steps outlined:
docker run -it alexjeen/terragrunt-helper-cli:latest php cli.php aws
It will create a new OU and five new AWS accounts, and will also create the required role in the accounts that need it (infrastructure, security, staging and production).
Make a note of the account IDs as you will need them in the next step.
Doing these steps manually
If you do not want to use the CLI tool you can execute these steps manually.
AWS best practices advise us to use separate accounts to split our workloads. This might seem a bit daunting if you are not used to it, but by switching roles you can easily move between AWS accounts.
The following structure is a good one to start with and corresponds with the module examples:
- security account – This would hold all the IAM users (or SSO solution) that can log in to this account; from there they can switch roles into the other accounts (this is also the admin account for services like GuardDuty, Security Hub and Macie)
- infra account – This holds everything needed to deploy the infrastructure (for example this pipeline)
- prod account – This holds all production resources
- staging account – This holds all the staging resources
- local account – This account is purely for locally run Terragrunt commands; you can see it as a sandbox where you can test your full infrastructure and then tear it down
You also have the option to add other accounts, which are not included in this example:
- log account – This collects logs from all accounts
- backup account – This collects backups from all accounts
Ideally your root organisation is empty and is not used for anything except billing. You can additionally use AWS Service Control Policies (SCPs) to restrict what can happen in certain AWS accounts (e.g. you might make an SCP that restricts the usage of certain regions or instance sizes in staging).
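As a sketch, a region-restricting SCP could be managed with Terraform like this (a real policy would also need to exempt global services such as IAM and CloudFront):

```hcl
resource "aws_organizations_policy" "deny_other_regions" {
  name = "deny-regions-outside-eu-west-1"
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid      = "DenyOutsideEuWest1"
      Effect   = "Deny"
      Action   = "*"
      Resource = "*"
      Condition = {
        StringNotEquals = {
          "aws:RequestedRegion" = ["eu-west-1"]
        }
      }
    }]
  })
}
```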
Create the following accounts
To get started with this structure, log in to your AWS account as an IAM user (not as the root user), go to AWS Organizations, tick the box next to Root and create a new Organisational Unit. You can name it any way you want.
Now go ahead and create the following accounts:
- security account
- infra account
- production account
- staging account
- local account
Make sure not to change the IAM role name (keep it as OrganizationAccountAccessRole). This role allows you to log in to the newly created accounts.
You need to use a unique email address per account, but most providers support aliasing, meaning you can use an email address like this: email@example.com
The account will then be created and the email will be sent to firstname.lastname@example.org.
After creating the accounts move them into the OU you just created. You are now the proud owner of five brand new AWS accounts!
Creating the Docker Hub credentials
Sign in to Docker Hub to get Docker Hub credentials (for the ECS clone module); this will make the pipeline more reliable and prevent 429 errors from Docker Hub:
Setting up the repositories
To get the repositories you need, there is a lot of replacement involved: you must replace certain things in certain files and create branches. Hence I created a simple Docker image that does the heavy lifting for you.
First create an empty folder and note the absolute path (mine is /Users/alex/Desktop/output).
Then run the following Docker command:
docker run -it --volume /Users/alex/Desktop/output:/var/app/output alexjeen/terragrunt-helper-cli:latest
If you want to review the code of this CLI tool you can do that here.
Follow the instructions in the CLI and it will generate a repository structure tailor made for your organisation:
Make sure to review the output before continuing, because making a mistake here can send you on a wild goose chase!
When both repositories have been generated, feel free to review them; otherwise just publish both to your private Git repos:
For the infrastructure repository make sure to publish all branches to the remote.
Deploying the temp pipeline
We will set up a temp pipeline from our local machine. This will allow us to deploy our definitive pipeline into the infrastructure account. After this deployment is complete we can delete the temp pipeline.
Now we will create temporary security credentials for an IAM user to deploy into the infrastructure account. My infrastructure account ID is 136431940157 and I will be using eu-west-1 going forward. I will also assume you are using the standard prefix of the module going forward (terragruntci). If you change the prefix variable of the module, some steps might need to be changed to reflect that.
Click on your name in the AWS console and click Switch role, then fill in these details, replacing the Account with your infrastructure account ID; the Role is OrganizationAccountAccessRole:
We have now landed in the infrastructure account. Create an IAM user in this account with the following details:
- Username: temp-terragrunt
- No need for management console access
- Attach the AdministratorAccess policy directly
After adding this user we can generate some security credentials for them. Go to the security credentials of the newly created user and create access keys. Ignore the warnings, as we will only use this user for the initial run and afterwards remove the user and its credentials.
Now run aws configure and fill in the access key details (note this will overwrite your default profile). After running this command, check that your credentials are working properly by running:
aws iam get-user
You should get an output of IAM stating your user details (which should be temp-terragrunt).
Now extract the .zip file with the module (location does not matter for now) and navigate to examples/getting-started.
If you used my CLI tool you should have a temp.tfvars in the infrastructure_modules/pipeline folder. You can use this file as it has already been configured properly for your account.
Otherwise, create a temp.tfvars file in this folder with the following contents, replacing your infrastructure account_id, Docker Hub and repository details below:
```hcl
region                                 = "eu-west-1"
docker_hub_username                    = "dockerhubusername"
docker_hub_access_token                = "dckr_pat_xxxxx"
repository_name_infrastructure         = "elasticscale/acmesystems_infrastructure"
repository_name_infrastructure_modules = "elasticscale/acmesystems_infrastructure_modules"
full_modules_url                       = "git::ssh://email@example.com/elasticscale/acmesystems_infrastructure_modules.git"
infrastructure_account_id              = "120789697310"
```
Now let us deploy the temporary pipeline by running:
terraform apply --var-file temp.tfvars
Approve the apply and we will have some pipelines in AWS now:
However, the pipelines will fail until we finalise our VCS connection (it is created in a pending state):
Open the connection and finalise the connection to your VCS system so it can access the repositories that contain the infrastructure and modules.
The last step before we can deploy
The last step before we can deploy is to create IAM roles for the temp and infrastructure pipelines.
Go to IAM > Roles and create two IAM roles with the following names:
- temp-infra-role with AdministratorAccess attached; set the trusted entity type to AWS account and select this AWS account (the infrastructure account)
- terragruntci-infra-role with AdministratorAccess attached; set the trusted entity type to AWS account and select this AWS account (if you used the CLI this role should already be in the account)
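If you prefer Terraform over clicking through the console, the temp role could be sketched like this (replace the account ID with your own infrastructure account ID):

```hcl
resource "aws_iam_role" "temp_infra_role" {
  name = "temp-infra-role"

  # Trust the infrastructure account itself
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = "arn:aws:iam::136431940157:root" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "temp_infra_admin" {
  role       = aws_iam_role.temp_infra_role.name
  policy_arn = "arn:aws:iam::aws:policy/AdministratorAccess"
}
```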
We can now deploy the infrastructure branch so it can deploy its own permanent pipeline (via the temp pipeline we’ve just created).
Extract the pipeline module .zip into the pipeline folder of your new infrastructure_modules repository; the structure should look like this:
Commit these changes to the repository, add a tag named 1.0.1 to the commit and push the change (including tags) to the repository.
Now on the infra branch in the infrastructure repo, go to infra/pipeline/terragrunt.hcl and bump the version to 1.0.1:
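The bumped reference in infra/pipeline/terragrunt.hcl would look something like this (the exact repository URL depends on your own setup):

```hcl
terraform {
  source = "git::ssh://git@github.com/elasticscale/acmesystems_infrastructure_modules.git//pipeline?ref=1.0.1"
}
```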
Now push to the infra branch and the temp-apply-infra pipeline should start running the changes to create the new pipelines.
Note that the destroy pipeline is off by default, because it will destroy all your infrastructure when approved (this can be dangerous). It can be enabled on a per-environment basis.
Now that we have a working pipeline, we want it to run successfully on itself, because that will tag the pipeline with a GitHub commit hash. This way the pipeline knows what infrastructure is available in the target AWS account.
First update the new code connection so it is not in a pending state anymore. This is the last time this is needed.
Then trigger a deployment on the terragruntci-apply-infra pipeline and wait until it succeeds, approving the (not yet known) plan for it to complete.
When the pipeline succeeds we can now see a commit hash on the pipeline tags:
From now on, only the changed infrastructure will be deployed
A git diff-tree is run between commits, and only the modules whose terragrunt.hcl files have changed will be planned and applied.
If the CodePipeline ever gets recreated (for instance when the prefix changes), the commit hash might get lost and a run-all will be executed. This should be harmless; just make sure to set initial_auto_apply to false so you can inspect the plan. The plan and apply will take a bit longer in this case.
Optionally, after deploying you might want to set initial_auto_apply = false for the pipelines, because this setting is mainly used to force an apply to new environments without adding mock dependencies between the modules (it makes initialisation of a new infrastructure much easier).
Cleanup the temp pipeline
Now it is time to clean up the pipeline we do not need anymore. Go to the getting-started folder you began in and run:
terraform destroy --var-file temp.tfvars
Accept the destroying of infrastructure. Then go to the IAM console and delete:
- The IAM user with username temp-terragrunt
- The IAM role with name temp-infra-role
Our account is now free of the temporary details we required to overcome the chicken and egg problem.
Creating the other account roles
If you used the CLI tool you can skip this step. The required IAM roles are already created in the accounts.
We now need to create the IAM roles for the other accounts. Switch roles to the other accounts via the OrganizationAccountAccessRole and create:
- In the prod account create an IAM role named terragruntci-prod-role with AdministratorAccess and a trust policy with the infrastructure account
- In the security account create an IAM role named terragruntci-security-role with AdministratorAccess and a trust policy with the infrastructure account
- In the staging account create an IAM role named terragruntci-staging-role with AdministratorAccess and a trust policy with the infrastructure account
An example in the prod account:
And attach the permissions:
And the name:
Please note that this trust policy allows anybody who can assume roles from the infrastructure account to log in to this account. Check out the FAQ below if you want to lock this trust policy down further, so that only the CodePipeline can assume this role.
Your own modules and Terraform code
After creating these IAM roles we can now also deploy the pipelines of the other environments. You should see the demo infrastructure appear in all the accounts.
You can now develop your own modules and Terraform code from this starting point.
There are quite a lot of steps involved to get this module to work. The main reason being that we wanted to give you an example infrastructure as well.
If you get stuck during the setup of this module do not hesitate to contact us! Onboarding support for the pipeline is included in the price.
Testing changes to your infrastructure
You need to separate two workflows:
1. Developing a new module in the modules repository
We like to develop a module separately using regular Terraform. We then test the module locally with a vars.tfvars file to see that it is working properly. Building modules blindly is not a good idea, because you can run into AWS API issues (constraints that seem to work fine but fail when they are deployed).
So create a new folder in the modules repository, then cd into this folder and run:
terraform apply --var-file vars.tfvars
This way you can determine the module inputs, outputs and make sure it works properly. You can use the local account for the testing.
When you are done developing the module run:
terraform destroy --var-file vars.tfvars
Then commit, add a tag to the commit, and deploy the module via the infrastructure pipeline.
Some tips for your module development:
- Add a prefix so your resources have unique names that do not conflict
- Do not include provider or configuration (backend) blocks in your module
- Use semantic versioning to communicate module breaking changes
- Try to infer data instead of passing it in directly to the module (e.g. just pass in the vpc_id and infer the subnets via a data call)
- On production resources use flags like deletion_protection so an accidental destroy does not cause you to lose data
- Add a variable named is_production to your modules so you can differentiate instance sizes and production setup per module.
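A few of these tips combined in a minimal module sketch:

```hcl
variable "prefix" {
  type        = string
  description = "Unique prefix so resource names do not conflict"
}

variable "is_production" {
  type    = bool
  default = false
}

variable "vpc_id" {
  type = string
}

# Infer the subnets from the vpc_id instead of passing them in
data "aws_subnets" "all" {
  filter {
    name   = "vpc-id"
    values = [var.vpc_id]
  }
}

locals {
  # Differentiate sizing between production and the other environments
  instance_type = var.is_production ? "m6i.large" : "t3.micro"
}
```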
2. Integration testing
First, switch roles into the local account, create some IAM credentials and configure them on your machine. The state will be stored in S3, so you can work with multiple people on this sandbox environment.
You can go to the infrastructure repo, then go to the local/tf folder and run this command:
terragrunt run-all apply
This way you can test your full infrastructure deployment and its dependencies in the local account. It is also helpful for generating .terraform.lock.hcl files you can use in the other environments. These lock files make your deployments more consistent, as the providers are locked to a certain version. Copying the lock files to other environments is optional but advised; otherwise your pipeline might have issues deploying with newer provider versions in the future.
When you are done testing your infrastructure run:
terragrunt run-all destroy
You might need to mock some outputs to get the destroy to work.
Frequently asked questions
I get the error BUILD_CONTAINER_UNABLE_TO_PULL_IMAGE
If you get the error: BUILD_CONTAINER_UNABLE_TO_PULL_IMAGE: Unable to pull customer's container image. CannotPullContainerError: Error response from daemon: manifest for 136431940157.dkr.ecr.eu-west-1.amazonaws.com/dockerhub/devopsinfra/docker-terragrunt:aws-tf-1.5.5-tg-0.50.1 not found: manifest unknown: Requested image not found
Find the elasticscale-infra-clone-codebuild AWS CodeBuild project and run it manually; this will pull the images from Docker Hub and push them to the private ECR registry. It should run automatically, but sometimes it has issues. If the images are still not appearing, check the output logs of this build job (your Docker Hub credentials could be wrong).
I need to make an emergency fix and can't go through the pipeline
You can go to the infrastructure repository, then to the prod/tf folder, and run terragrunt run-all plan or terragrunt apply in the module folders (e.g. s3). This works because the state is stored in S3. Of course, you need to have access to the organisation to run these commands, because the Terragrunt assume role is not in place for local runs.
Can I also deploy in other regions than eu-west-1?
Sure: the env_vars.yml file is merged with the common_vars.yml file, which means you can overwrite the region in env_vars.yml. This way you could have a prodeu and a produs branch and deploy them in either the same AWS account or two different ones.
I am trying to destroy a pipeline and terragrunt tells me that certain dependencies have to be applied before it can fully destroy
This is a bug in Terragrunt where it does not destroy in the right order. You can add some mock_outputs to the dependencies to get the destroy to succeed.
I think AdministratorAccess policy on the IAM roles is too permissive
You're right; you can analyse the role with IAM Access Analyzer to determine a correct set of permissions for the infrastructure deployment roles.
Is it wise to further lock down the infrastructure role principal in the accounts?
Yes, it is wise to do that; for the sake of simplicity I did not. You could change the trust policies of the IAM roles in the accounts that provision the infrastructure to include the CodeBuild project ARN:
Make sure the ARN is set to the ARN of the CodeBuild project that deploys the infrastructure, something like:
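A hedged sketch of such a trust policy (the account ID and role name pattern are illustrative; adjust them to the service role your CodeBuild project actually runs with):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "sts:AssumeRole",
      "Principal": {
        "AWS": "arn:aws:iam::136431940157:root"
      },
      "Condition": {
        "ArnLike": {
          "aws:PrincipalArn": "arn:aws:iam::136431940157:role/terragruntci-codebuild-*"
        }
      }
    }
  ]
}
```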
That way only the CodeBuild project can assume the role. If you do not lock it down and your infrastructure account gets compromised, an attacker might be able to switch into the other accounts through this role (this is also why you should lock down the sts:AssumeRole API call).
Note that if the attacker is able to change the buildspec.yml file they could still run code in the other accounts!
I want to redeploy a slightly older version of a branch
Simply change the commit tag of the pipeline to the commit hash you want to compare to and then run a deploy. This is also helpful if the deploy errored out but the tag was changed anyway.
Can I stop the pipeline from running on creation?
Unfortunately this (odd) behaviour cannot be controlled, so the destroy and apply pipelines will always run once after they have been created.
I run into Error calling startBuild: Cannot have more than 1 builds in queue for the account (Service: AWSCodeBuild; Status Code: 400; Error Code: AccountLimitExceededException; Request ID: eaf60c43-12a6-4a0c-afb0-379e97b0949a; Proxy: null)
Request a higher limit for CodeBuild under "Concurrently running builds for Linux/Small environment" in the Service Quotas menu (you only need to do this for the infra account).
I run into Error: Error acquiring the state lock
This can sometimes happen when a pipeline errors out (maybe a permission issue where it is able to lock but not unlock afterwards) and Terraform cannot release the DynamoDB state lock. Go into the account you are deploying into (for example via the root account or OrganizationAccountAccessRole) and find the DynamoDB table.
Then remove the locking row and try again:
I do not want to have my Docker Hub token in the infrastructure repository
The damage an attacker could do with this token is limited, hence I did not make it private (and it made the steps a lot easier without adding another dependency). But you could add a parameter_variables entry that points to the SSM parameter that the ECS Docker Hub clone module creates:
```hcl
parameter_variables = [
  {
    name  = "TF_VAR_docker_hub_access_token"
    value = "elasticscale-infra-clone-docker"
  }
]
```
Rotate the token and then remove it from the infrastructure branch. You can only do this after the CodePipeline for infra has deployed successfully for the first time.
How can I decide who can approve plans and deploy infrastructure?
You can use IAM policies for that. In this case you must change the IAM roles in the infrastructure account that get assumed by the IAM users from the security account. You need to create multiple roles:
- Role X can be assumed and has codepipeline:PutApprovalResult permissions (you can scope this to certain pipelines)
- Role Y, the default role, has an implicit deny on codepipeline:PutApprovalResult, meaning its users cannot approve deployments
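The approval permission for Role X could be granted with a policy like this (the pipeline name, region and account ID are illustrative):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "codepipeline:PutApprovalResult",
      "Resource": "arn:aws:codepipeline:eu-west-1:136431940157:terragruntci-apply-prod/*"
    }
  ]
}
```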
In my application code I need to know a bucket name, how can I communicate that if it is in separate repositories?
There are situations where you have dependencies between your application code and your infrastructure. If you settle on a common naming scheme, you can communicate these dependencies through that scheme (or use an AWS SSM parameter structure to communicate these infrastructure settings). For example, your S3 bucket name would be:
- elasticscale-prod-bucket in production
- elasticscale-staging-bucket in staging
You could put the prefix, or the full bucket name, in SSM and read it from there when the application starts.
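For instance, the module that creates the bucket could publish its name to SSM (assuming a bucket resource named aws_s3_bucket.this and a prefix variable):

```hcl
# Publish the bucket name so applications can look it up at startup
resource "aws_ssm_parameter" "bucket_name" {
  name  = "/${var.prefix}/bucket_name"
  type  = "String"
  value = aws_s3_bucket.this.bucket
}
```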
I want to add another environment
This consists of several steps for an environment named qa:
- Add a new account under the OU named qa
- Switch into the new account and add the required role for the Terraform infrastructure provisioning (see "Creating the other account roles"); the name of this role should be terragruntci-qa-role
- Change to the infra branch and add the account ID and name of the environment in the list
- Deploy the infrastructure pipeline so you get a new pipeline for the environment
- Copy one of the environments like staging in the infrastructure repo to the folder qa and create a new branch named qa
- Change the vars.yaml to reflect the new environment info
- Push to the branch and it will run the initial apply for you so the account would be provisioned
- Add the account to the security account configuration (in the infrastructure repo security/tf/security/terragrunt.hcl) so you can switch into this account from the security account
Make sure your environment name is not too long, otherwise you will run into length issues with your resource names (for instance, elasticscale-preproduction-connection exceeds 32 characters and can cause issues further on).
Can I use this module without the ECS clone module?
Yes, but your builds will be unreliable (you will get a lot of 429 errors) and the Docker containers for the builds can take up to 3 minutes to start.
How do I login to the accounts via the security account?
First you need to get the username and password of the IAM user that was created in the security account. We can do this by using the OrganizationAccountAccessRole again. After logging in to the security account, navigate to S3 and to the state bucket (there should only be one). Then find the terraform.tfstate file in the security folder and download it to your computer. You can see the password in the state file and use it to log in to the account.
I also included this step to show you why you need to protect your state file: it holds a lot of sensitive information!
After logging in to the security account you can assume a role in the other accounts, for example switching to my staging account via the elasticscale-staging-security role (or acmesystems-staging-security role depending on your prefix):
I have another question
Send us an e-mail on firstname.lastname@example.org, we are glad to assist you!