Shh… It’s a Secret: Managing Secrets at Betterment
Opinionated secrets management that helps us sleep at night.
Secrets management is one of those things that is talked about quite frequently, but there seems to be little consensus on how to actually go about it. In order to understand our journey, we first have to establish what secrets management means (and doesn’t mean) to us.
What is Secrets Management?
Secrets management is the process of ensuring passwords, API keys, certificates, etc. are kept secure at every stage of the software development lifecycle. Secrets management does NOT mean attempting to write our own crypto libraries or cipher algorithms. Rolling your own crypto isn’t a great idea. Suffice it to say, crypto will not be the focus of this post.
There is a wide spectrum of secrets management implementations out there, ranging from powerful solutions that require a significant amount of operational overhead, like HashiCorp Vault, to solutions that require little to no operational overhead, like a .env file. No matter where they fall on that spectrum, each of these solutions has tradeoffs in its approach. Understanding these tradeoffs is what helped our Engineering team at Betterment decide on a solution that made the most sense for our applications. In this post, we’ll be sharing that journey.
How it used to work
We started out using Ansible Vault. One thing we liked about Ansible Vault is that it allows you to encrypt a whole file or just a string. We valued the ability to encrypt just the secret values themselves and leave the variable names in plain-text. We believe this is important so that we can quickly tell which secrets an app depends on just by opening the file. So the string option was appealing to us, but that workflow didn’t have the best editing experience: it required multiple steps to encrypt a value, insert it into the correct file, and then export it into the environment like the 12-factor app methodology tells us we should.
At the time, we also couldn’t find a way to federate permissions with Ansible Vault in a way that didn’t hinder our workflow by causing a bottleneck for developers. To assist us in expediting this workflow, we had an alias in our bash_profiles that allowed us to run a shortcut at the command line to encrypt the secret value from our clipboard and then insert that secret value in the appropriate Ansible variables file for the appropriate environment.
alias prod-encrypt="pbpaste | ansible-vault encrypt_string --vault-password-file=~/ansible-vault/production.key"
This wasn’t the worst setup, but didn’t scale well as we grew. As we created more applications and hired more engineers, this workflow became a bit much for our small SRE team to manage and introduced some key-person risk, also known as the Bus Factor. We needed a workflow with less of a bottleneck, but allowing every developer access to all the secrets across the organization was not an acceptable answer. We needed a solution that not only maintained our security posture throughout the software development lifecycle, but also enforced our opinions about how secrets should be managed across environments.
Decisions, decisions…
While researching our options, we happened upon a tool called sops. Maintained and open-sourced by Mozilla, sops is a command line utility written in Go that facilitates slick encryption and decryption workflows by using your terminal’s default editor. Sops encrypts and decrypts your secret values using your cloud provider’s Key Management Service (AWS KMS, GCP KMS, Azure Key Vault) and PGP as a backup in the event those services are not available. It leaves the variable name in plain-text while only encrypting the secret value itself and supports YAML, JSON, or binary format. We use the YAML format because of its readability and terseness.
We think this tool works well with the way we think about secrets management. Secrets are code. Code defines how your application behaves. Secrets also define how your application behaves. So if you can encrypt them safely, you can ship your secrets with your code and have a single change management workflow. GitHub pull request reviews do software change management right. YAML does human readable key/value storage right. AWS KMS does anchored encryption right. AWS Regions do resilience right. PGP does irreversible encryption better than anything else readily available and is broadly supported. In sops, we’ve found a tool that combines all of these things, enabling a workflow that makes secrets management easier.
Who’s allowed to do what?
Sops is a great tool by itself, but operations security is hard. Key handling and authorization policy design are tricky to get right, and sops doesn’t do it all for us. To help us with that, we took things a step further and wrote a wrapper around sops we call sopsorific. Sopsorific, also written in Go, makes a few assumptions about application environments. Most teams need to deploy to multiple environments: production, staging, feature branches, sales demos, etc. Sopsorific uses the term “ecosystem” to describe this concept, as well as to collectively describe a suite of apps that make up a working Betterment system. Some ecosystems are ephemeral and some are durable, but there is only one true production ecosystem holding sensitive PII (Personally Identifiable Information), and that ecosystem must be held to a higher standard of access control than all others. To capture that idea, we introduced a concept we call “security zones” into sopsorific. There are only two security zones per GitHub repository, sensitive and non-sensitive, even if there are multiple apps in a repository. In the case of mono-repos, if an app in that repository shouldn’t have its secrets visible to all engineers who work in that repository, then the app belongs in a different repository. With sopsorific, secrets for the non-sensitive zone can be made accessible to a broader subset of the app team than sensitive zone secrets, helping to eliminate some of the bottleneck issues we experienced with our previous workflow.
By default, sopsorific wants to be configured with a production (sensitive zone) secrets file and a default (non-sensitive zone) secrets file. The default file makes it easy to spin up new non-sensitive one-off ecosystems without having to redefine every secret in every ecosystem. It should “just work” unless there are secrets that have different values than already configured in the default file. In that case, we would just need to define the secrets that have different values in a separate secrets file like devin_test.yml below, where devin_test is the name of the ecosystem. Here’s an example of the basic directory structure:
.sops.yaml
app/
|_ deployment_secrets/
   |_ sensitive/
      |_ production.yml
   |_ nonsensitive/
      |_ default.yml
      |_ devin_test.yml
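The relationship between default.yml and an ecosystem file such as devin_test.yml can be pictured as a simple overlay. The following is a minimal sketch of that idea, not sopsorific's actual implementation: the function name `resolve_secrets` and all key names and values are hypothetical, and the inputs are assumed to be already-decrypted key/value mappings.

```python
def resolve_secrets(default, overrides=None):
    """Overlay ecosystem-specific values on top of the shared defaults.

    Hypothetical helper: models "define only the secrets that differ".
    """
    resolved = dict(default)          # copy so the shared defaults are not mutated
    resolved.update(overrides or {})  # ecosystem-specific values win on conflict
    return resolved


# Placeholder values for illustration only.
default = {"DATABASE_PASSWORD": "default-db", "API_TOKEN": "default-token"}
devin_test = {"API_TOKEN": "devin-test-token"}

# An ecosystem with its own file gets its overrides plus everything else from default.
print(resolve_secrets(default, devin_test))

# A brand-new one-off ecosystem with no file of its own just uses the defaults.
print(resolve_secrets(default))
```

Under this model, a new non-sensitive ecosystem needs no secrets file at all until one of its values has to differ from the default.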
The security zone concept allows a more granular access control policy as we can federate decrypt permissions on a per application and per security zone basis by granting or revoking access to KMS keys with AWS Identity and Access Management (IAM) roles. Sopsorific bootstraps these KMS keys and IAM roles for a given application. It generates a secret-editor role that privileged humans can assume to manage the secrets and an application role for the application to assume at runtime to decrypt the secrets.
Following the principle of least privilege, our engineering team leads are app owners of the specific applications they maintain. App owners have permissions to assume the secret-editor role for sensitive ecosystems of their specific application. Non app owners have the ability to assume the secret-editor role for non-sensitive ecosystems only.
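As a rough model of that rule, the decision reduces to a check on the security zone and on app ownership. This sketch is only an illustration: in practice the policy is enforced by IAM roles and KMS key grants rather than application code, and the function name, the zone constants, and the engineer names are all assumptions. It also assumes every engineer passed in already works on the repository in question.

```python
SENSITIVE = "sensitive"
NON_SENSITIVE = "nonsensitive"


def can_edit_secrets(engineer, app, zone, app_owners):
    """Return True if `engineer` may assume the secret-editor role for `zone`.

    `app_owners` maps an app name to the set of engineers who own it
    (a hypothetical structure used only for this sketch).
    """
    if zone == NON_SENSITIVE:
        # Non-sensitive ecosystems are open to owners and non-owners alike.
        return True
    if zone == SENSITIVE:
        # Sensitive (production) secrets are limited to the app's owners.
        return engineer in app_owners.get(app, set())
    raise ValueError(f"unknown security zone: {zone!r}")


app_owners = {"coach": {"alice"}}

print(can_edit_secrets("alice", "coach", SENSITIVE, app_owners))    # True
print(can_edit_secrets("bob", "coach", SENSITIVE, app_owners))      # False
print(can_edit_secrets("bob", "coach", NON_SENSITIVE, app_owners))  # True
```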
How it works now
Now that we know who can do what, let’s talk about how they can do what they can do. Explaining how we use sopsorific is best done by exploring how our secrets management workflow plays out for each stage of the software development lifecycle.
Development
Engineers have permissions to assume the secret-editor role for the security zones they have access to. Secret-editor roles are named after their corresponding IAM role, which includes the security zone and the name of the GitHub repository: for example, secret_editor_sensitive_coach, where coach is the name of the repository. Engineers use a little command line utility to assume the role and are dropped into a secret-editor session, where they use sops to add or edit secrets with their editor in the same way they add or edit code in a feature branch.
Figure: assuming a secret-editor role
The sops command will open and decrypt the secrets in their editor and, if changed, encrypt them and save them back to the file’s original location. All of these steps, apart from the editing, are transparent to the engineer editing the secret. Any changes are then reviewed in a pull request along with the rest of the code. Editing a file is as simple as:
sops deployment_secrets/sensitive/production.yml
Testing
We built a series of validations into sopsorific to further enforce our opinions about secrets management. Some of these are:
- Secrets are unguessable: Short strings like “password” are not really secrets, so this check requires strings with at least 128 bits of entropy, expressed in unpadded base64.
- Each ecosystem defines a comprehensive set of secrets: The 12-factor app methodology reminds us that all environments should resemble production as closely as possible. When a secret is added to production, we have a check that makes sure that same secret is also added to all other ecosystems so that they continue to function properly.
- All crypto keys match: There are checks to ensure the multi-region KMS key ARNs and backup PGP key fingerprint in the sops config file match the intended security zones.
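To make the “unguessable” rule concrete, here is a minimal sketch of generating and checking a 128-bit secret in unpadded base64. It illustrates the rule itself rather than sopsorific's implementation: the function names are hypothetical, the check works on a plaintext value, and decoded length is used only as a proxy, since the true entropy of a string cannot be measured after the fact.

```python
import base64
import binascii
import re
import secrets

MIN_BITS = 128
# Standard base64 alphabet with no "=" padding characters allowed.
_UNPADDED_B64 = re.compile(r"^[A-Za-z0-9+/]+$")


def generate_secret(bits=MIN_BITS):
    """Generate a random secret and encode it as unpadded base64."""
    raw = secrets.token_bytes((bits + 7) // 8)
    return base64.b64encode(raw).decode("ascii").rstrip("=")


def looks_unguessable(value, min_bits=MIN_BITS):
    """Check that `value` is unpadded base64 decoding to at least `min_bits` bits."""
    if not _UNPADDED_B64.match(value):
        return False
    # Restore the padding that was stripped so the decoder accepts the value.
    padded = value + "=" * (-len(value) % 4)
    try:
        raw = base64.b64decode(padded, validate=True)
    except binascii.Error:
        return False
    return len(raw) * 8 >= min_bits


print(looks_unguessable("password"))         # False: decodes to only 48 bits
print(looks_unguessable(generate_secret()))  # True: 16 random bytes, 22 characters
```

A length-based check like this cannot tell a random value from a long predictable one, so it works best when paired with generating secrets from a cryptographic random source, as `generate_secret` does.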
These validations are run as a step in our Continuous Integration suite. Running these checks is a completely offline operation and doesn’t require access to the KMS keys making it trivially secure. Developers can also run these validations locally:
sopsorific check
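Because sops leaves key names in plain-text, a completeness check like the one described above needs only the key names and not the KMS keys. The sketch below is one hedged way to express it, not sopsorific's code: `find_missing_secrets` and the key names are hypothetical, and it assumes each ecosystem's effective secrets are the default file's keys plus its own overrides.

```python
def find_missing_secrets(production_keys, default_keys, ecosystem_keys):
    """Return {ecosystem: [missing key, ...]} for ecosystems lacking production keys.

    `ecosystem_keys` maps an ecosystem name to the keys in its override file.
    """
    required = set(production_keys)
    shared = set(default_keys)

    # Every ecosystem inherits the default keys and adds its own overrides.
    effective = {"default": shared}
    for name, keys in ecosystem_keys.items():
        effective[name] = shared | set(keys)

    return {
        name: sorted(required - keys)
        for name, keys in effective.items()
        if required - keys
    }


production = ["API_TOKEN", "DATABASE_PASSWORD", "SMTP_PASSWORD"]
default = ["API_TOKEN", "DATABASE_PASSWORD"]
ecosystems = {"devin_test": ["API_TOKEN", "SMTP_PASSWORD"]}

# SMTP_PASSWORD was added to production but not to the default file.
print(find_missing_secrets(production, default, ecosystems))
# {'default': ['SMTP_PASSWORD']}
```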
Deployment
The application server is configured with the instance profile generated by sopsorific so that it can assume the IAM role that it needs to decrypt the secrets at runtime. Then, we configure our init system, upstart, to execute the process wrapped in the sopsorific run command. sopsorific run is another custom command we built to make our usage of sops seamless. When the app starts up, the decrypted secrets will be available as environment variables only to the process running the application instead of being available system wide. This makes our secrets less likely to unintentionally leak and our security team a little happier. Here’s a simplified version of our upstart configuration.
start on starting web-app
stop on stopping web-app
respawn

exec su -s /bin/bash -l -c '\
  cd /var/www/web-app; \
  exec "$0" "$@"' web-app-owner -- sopsorific run 'bundle exec puma -C config/puma.rb' >> /var/log/upstart.log 2>&1
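The idea of exposing secrets only to the application process can be sketched with a small wrapper that passes them in the child's environment and nowhere else. This is an illustration of the general technique, not how sopsorific run is implemented: `run_with_secrets` is a hypothetical name, and the secrets are assumed to have been decrypted already into a plain mapping.

```python
import os
import subprocess
import sys


def run_with_secrets(command, secrets, **run_kwargs):
    """Run `command` with `secrets` visible only in the child's environment."""
    child_env = dict(os.environ)  # start from a copy of the parent's environment
    child_env.update(secrets)     # add the secrets to the copy only
    # The parent's os.environ is never modified, so nothing leaks system wide.
    return subprocess.run(command, env=child_env, **run_kwargs)


# Placeholder secret for illustration only.
result = run_with_secrets(
    [sys.executable, "-c", "import os; print(os.environ['DEMO_SECRET'])"],
    {"DEMO_SECRET": "s3cr3t"},
    capture_output=True,
    text=True,
)

print(result.stdout.strip())        # the child sees the secret
print("DEMO_SECRET" in os.environ)  # False: the parent's environment is untouched
```

A production wrapper would more likely replace its own process with the application, for example via `os.execvpe`, so that the init system supervises the app directly; `subprocess.run` is used here only to keep the sketch easy to observe.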
Operations
The 12-factor app methodology reminds us that sometimes developers need to be able to run one-off admin tasks by starting up a console on a live running server. This can be accomplished by establishing a secure session on the server and running what you would normally run to get a console with the sopsorific run command. For our Ruby on Rails apps, that looks like this:
sopsorific run 'bundle exec rails c'
What did we learn?
We learned many things along the way. One was that having an opinionated tool to manage secrets helped make sure we didn’t accidentally leave low-entropy secrets around from when we were developing or testing out a feature. Having a tool to protect ourselves from ourselves is vital to our workflow. Another was that some vendors provide API tokens or access keys with lower entropy than we’d like and don’t offer the option to choose stronger ones. As a result, we had to build features into sopsorific so that its checks accept vendor-provided secrets that don’t meet its default standards.
In the process of adopting sops and building sopsorific, we discovered the welcoming community and thoughtful maintainers of sops. We had the pleasure of contributing a few changes to sops, and that left us feeling like we left the community a little bit better than we found it. In doing all of these things, we’ve reduced bottlenecks for developers so they can focus more on shipping features and less on managing secrets.