How we use Ansible for automated, quick and reliable deployment
July 22, 2021
Deploying code across all of NordVPN’s infrastructure is a considerable feat. Some time ago, I presented one cornerstone of our infrastructure - Gitlab Runners. Now I want to tell you about Ansible and how we use it to create unattended deployment for all of our major applications.
Deployment is fundamentally easy - you just need to take code and copy it somewhere else, like to a web server. But there are a lot of strategies for how to do this, like “Multi-Service Deployment”, “Rolling Deployment”, “Blue-Green Deployment”, “Canary Deployment” and many more. Some can be hard to implement because they don’t come with clear instructions for your specific case. That’s why I want to explain our process.
You’ll find out:
How we went from theory to practice
Where we failed
What we learned
And how we do it now - hundreds of times per day without downtime
We have failed or had to stop and fix problems at every step of our deployment process. At each of those steps, we found a way to improve and move on.
Who should deploy?
Before we answer “How?”, let’s answer “Who?” Who should copy files? Usually, you’ll be choosing between a System Administrator/Operator (someone with access to your servers) or a Developer (someone who codes or tests applications). On the one hand, most developers can’t (or shouldn’t) access the production environment for security reasons (with the exception of debugging scenarios). On the other hand, administrators don't usually know anything about application code changes or testing processes. They probably won’t be familiar with future business plans or what is currently deployed or planned for deployment.
The answer is that no one should do it. It should be automated. This can be difficult to implement while avoiding configuration drift and keeping your systems up to date, but it is possible. We did it.
After more than 5 years of experimentation, we built a solid foundation and currently use four environments.
Production: What you can see when using our services.
Pre-Production: You can only see this environment when we run A/B tests. Otherwise, it is only open to internal traffic. It can access production data and can be used as an integration environment.
Review: This hybrid of pre-production and staging lets website front-end developers quickly review their work.
Staging: Production on a small scale and without production data.
We always test everything in at least one additional environment before sending it to production, depending on the application in question. And this is just code testing before rollout to production. It doesn’t include QA or Secure Coding, which are huge additional fields.
Application health check
If your application fails, you need a developer who remembers all of its dependencies and can quickly check all of them. But what if you have 50+ developers and 100+ applications? The problem quickly becomes unmanageable, so we must ask the application to check itself. Whenever the application starts to use something new, or we find a part that can impact availability, we just add self-check functionality to the code.
Status checks help make our deployment process more reliable and avoid downtime. This check is used for final tests before switching to the new version and for faster debugging. Any application we deploy knows about all of its dependencies and can check all of them. It’s like documentation, but without the time spent reading it.
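As a hedged sketch of what such self-check functionality can look like (the registry, function names, and dependencies below are illustrative, not our actual code), each new dependency registers a small check, and the application reports healthy only when every check passes:

```python
# Minimal self-check registry: each dependency the app starts using
# registers a callable that returns True when that dependency is healthy.
from typing import Callable, Dict

CHECKS: Dict[str, Callable[[], bool]] = {}

def register_check(name: str, check: Callable[[], bool]) -> None:
    """Register a health check for a newly added dependency."""
    CHECKS[name] = check

def selfcheck() -> dict:
    """Run every registered check; the app is OK only if all pass."""
    results = {}
    for name, check in CHECKS.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a crashing check counts as unhealthy
    return {"ok": all(results.values()), "dependencies": results}

# Example: two hypothetical dependencies, both currently healthy.
register_check("database", lambda: True)
register_check("cache", lambda: True)
```

Exposing the result of `selfcheck()` over an HTTP endpoint is one common way to let a load balancer and the deployment tooling read the same status.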
Depending on the application, we use one of two methods - either switch all servers to the new application version at the same time or use rolling deployment. Rolling deployment can be done in any manner you choose: one by one, in 30% segments, etc. And this is where we always use application health checks. If there are problems, we need to find them as soon as possible and stop the deployment so we can quickly fix it or roll it back.
We use maintenance mode for our applications to make this process invisible to the end user. In the event of a problem, the load balancer stops sending traffic to the server, allowing us to do what we need to do. After our changes are done, if the health check comes up OK, the balancer automatically returns traffic to the server.
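A rolling deployment driven by health checks can be sketched roughly like this (a simplified model, not our Ansible playbooks; the balancer's drain/restore behavior is implied by the deploy and health-check callbacks):

```python
# Sketch of a rolling deployment over server batches. While a server is
# being deployed to, the balancer sends it no traffic; traffic returns
# automatically once the health check passes.
from typing import Callable, List

def batches(servers: List[str], size: int) -> List[List[str]]:
    """Split servers into deployment batches (one by one, ~30%, etc.)."""
    return [servers[i:i + size] for i in range(0, len(servers), size)]

def rolling_deploy(
    servers: List[str],
    batch_size: int,
    deploy: Callable[[str], None],
    healthcheck: Callable[[str], bool],
) -> bool:
    """Deploy batch by batch; stop at the first failing health check."""
    for batch in batches(servers, batch_size):
        for server in batch:
            deploy(server)        # server is drained by the balancer here
            if not healthcheck(server):
                return False      # stop the rollout so it can be fixed
        # health checks OK: the balancer returns traffic to this batch
    return True
```

Stopping on the first failure is what keeps a bad version confined to one batch instead of the whole fleet.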
Auto-scaling, or why we double our capacity
To meet our growth plans, fight scrapers, enjoy invisible rolling deployments, and always be ready for anything, we use auto-scaling and maintain at least twice the required infrastructure capacity for our applications. During deployments, we can remove 30% of our resources without impacting service or performance and finish everything backstage.
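The arithmetic behind that headroom is simple; a small illustration (the numbers are only an example):

```python
def capacity_during_deploy(required_units: int,
                           overprovision: float,
                           removed_fraction: float) -> float:
    """How many units still serve traffic while a deploy batch is drained."""
    total = required_units * overprovision
    return total * (1 - removed_fraction)

# With double capacity, draining 30% of the fleet still leaves 1.4x the
# required units serving traffic, so users notice nothing.
```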
Rolling back simply returns an app to an older version that we know worked. But what if an application is in a failed state and we have a network or CDN problem? Or what if we need to download many megabytes of code to many servers? We always keep a few old versions ready on our infrastructure servers, so we can switch versions in less than a minute. Or we can redeploy as needed with the regular process.
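One common way to make such a switch near-instant (the directory layout below is an assumption for illustration, not necessarily ours) is to keep each release unpacked under a `releases/` directory and repoint a `current` symlink:

```python
# Keep the last few releases unpacked on the server and switch a
# "current" symlink between them, so rollback needs no downloads.
import os

def switch_version(app_root: str, version: str) -> None:
    """Point app_root/current at app_root/releases/<version>."""
    target = os.path.join(app_root, "releases", version)
    if not os.path.isdir(target):
        raise FileNotFoundError(f"version {version} is not on this server")
    link = os.path.join(app_root, "current")
    tmp = link + ".tmp"
    os.symlink(target, tmp)
    os.replace(tmp, link)  # rename over the old link in one step
```

Because only a symlink changes, switching back to a known-good version takes seconds, even when the network or CDN is unavailable.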
Application password management
An additional challenge is keeping production passwords and sensitive data secure. GitLab theoretically has everything we need in the form of Variables, but this functionality doesn't hold up at a larger scale. We created our own placeholder replacement mechanism that fetches all passwords from our internal password management software and replaces them on the end server before the health check runs.
With this mechanism, we don’t need to keep configuration files in Ansible or use its template functionality. No more mismatches between new configuration in the code and stale copies in Ansible - we just keep everything in one place.
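The core of such a mechanism can be sketched in a few lines (the `%%NAME%%` marker syntax and the secret names are assumptions for illustration; the real fetch would talk to the password manager):

```python
# Sketch of placeholder replacement: config files ship with markers like
# %%DB_PASSWORD%% and real secrets are substituted on the end server,
# just before the health check runs.
import re
from typing import Dict

PLACEHOLDER = re.compile(r"%%([A-Z0-9_]+)%%")

def render_config(text: str, secrets: Dict[str, str]) -> str:
    """Replace every %%NAME%% marker with the secret fetched for NAME."""
    def substitute(match: "re.Match") -> str:
        name = match.group(1)
        if name not in secrets:
            raise KeyError(f"no secret for placeholder {name}")
        return secrets[name]
    return PLACEHOLDER.sub(substitute, text)
```

Failing loudly on an unknown placeholder is deliberate: a config that reaches the health check with a marker still in it should never pass.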
We use Semantic Versioning for our applications. After building something, we send an archived package to our CDN. This package is fully tested and ready for extraction on the end servers. It’s also fully ready for password replacement.
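Picking the newest package by semantic version needs numeric, not lexical, comparison ("1.10.0" is newer than "1.9.0"). A minimal sketch, ignoring pre-release and build tags that full semver tooling would handle:

```python
# Order MAJOR.MINOR.PATCH versions numerically to find the newest package.
from typing import List, Tuple

def semver_key(version: str) -> Tuple[int, int, int]:
    major, minor, patch = version.split(".")
    return (int(major), int(minor), int(patch))

def latest(versions: List[str]) -> str:
    """Return the newest version among the archived packages."""
    return max(versions, key=semver_key)
```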
As I mentioned before, we use Ansible to automate all of these tasks, but Ansible doesn't have an API, and we cannot give everyone access to run playbooks everywhere. That’s where Rundeck comes in: it works as an API wrapper. We send all deployment parameters directly from GitLab to Rundeck, which prepares and runs the process on our Ansible servers. With Rundeck, we also get redundancy across several Ansible servers and regions.
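Rundeck exposes a "run job" endpoint (`POST /api/<version>/job/<id>/run`, authenticated with an `X-Rundeck-Auth-Token` header), so a CI job can trigger a deployment with one HTTP call. A hedged sketch (host, job ID, token, and options below are made-up placeholders):

```python
# Build the request a CI job would send to ask Rundeck to run a deploy job.
import json
import urllib.request

def build_run_request(base_url: str, job_id: str, token: str,
                      options: dict) -> urllib.request.Request:
    """Prepare the POST that triggers a Rundeck job with given options."""
    url = f"{base_url}/api/41/job/{job_id}/run"
    body = json.dumps({"options": options}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "X-Rundeck-Auth-Token": token,
            "Content-Type": "application/json",
        },
    )

# Usage (not executed here):
#   urllib.request.urlopen(build_run_request(
#       "https://rundeck.example.com", "deploy-app", token,
#       {"version": "1.2.3", "environment": "production"}))
```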
We sometimes need to perform additional post-deploy or pre-deploy actions. Our deploy configuration script can be unique for each application and can do anything, from correcting file permissions to flushing Redis caches or processing webhooks at any point in the deployment process.
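A per-application hook mechanism like that can be modeled as named stages with registered actions (stage names and hook contents below are illustrative assumptions):

```python
# Sketch of per-application deploy hooks: each app registers actions for
# named stages, and the deployment runs the matching stage at that point.
from typing import Callable, Dict, List

HOOKS: Dict[str, List[Callable[[], None]]] = {
    "pre_deploy": [],
    "post_deploy": [],
}

def add_hook(stage: str, action: Callable[[], None]) -> None:
    """Attach an action (fix permissions, flush a cache, ...) to a stage."""
    HOOKS[stage].append(action)

def run_stage(stage: str) -> int:
    """Run every hook registered for a stage; return how many ran."""
    for action in HOOKS[stage]:
        action()
    return len(HOOKS[stage])
```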
The current process is pretty comfortable and has worked without major changes for many years. We’ve used it to deploy 300+ applications written in PHP, Go, Ruby, NodeJS, Python... and for Docker as well. However, we do run up against Ansible restrictions and limitations when supporting many different servers. And there are additional components like Rundeck... Maybe it's time to think about the next level of automation and start using systems with orchestration support, like SaltStack.