What to do when your app is down

Dmitrijus Glezeris

May 26, 2021

NordVPN uses a range of software to improve user experience and make sure its services run smoothly. However, if something goes wrong, it can take time to debug the website and fix the problem. How can you bring your app back to life, and where should you start?

Into the fire 

You get a phone call in the middle of the night: your website is broken. You kiss goodbye to a good night’s sleep and open up the app in your browser. Everything seems to be working fine: there are no visual glitches, the assets are loaded — and yet, the complaints from users are still coming in.

First and foremost, you need to check whether all the services are up and running. You start your investigation as a diligent IT detective. 

MySQL seems to be working, as is RabbitMQ. Then you see it: Redis appears to be down for one of the server nodes. You restart Redis and expect everything to go back to normal. 

Unfortunately, the problem still persists.

Debugging the problem... again

You check meticulously whether all the services are available. A sudden realization occurs — there could be a problem with the code! 

After digging through the logs, it seems that the last deployment broke the code that connects to RabbitMQ. 

So you decide to do what any sane person would do: roll back to a previous version. However, unbeknownst to you, there’s a little issue with the previous release.

Surprise! The previous release is completely broken 

You’ve just deployed a completely broken app to production. Your customers are unhappy, your managers are furious, and you’re frustrated and exhausted. 

You roll back the app by one more release and finally “fix” the issue. In some alternate universe, things could’ve gone differently.

Meanwhile, in an alternate universe 

You get woken up by a phone call. You visit the /status page of your website — {"status":"OK"}.

This means the main functionality of the app is available. You then get more details from the /status/details page:

{
    "database.default": "OK",
    "redis.local": "FAIL",
    "rabbitmq.server": "FAIL"
}

You first reload the Redis server and check the status page again: 

{
    "database.default": "OK",
    "redis.local": "OK",
    "rabbitmq.server": "FAIL"
}

Then, you take a look at the RabbitMQ server — it is definitely online. It looks like those pesky developers broke the code! Well, no worries, things like that happen sometimes.

You roll back the code using a rolling deployment method. Your deployment fails during the first phase because the load balancer health check detects your app is down. A broken release? Well then, just roll back to an even earlier version.
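To make the "fails during the first phase" part concrete, here is a minimal sketch of a rolling deployment that stops at the first unhealthy node. Everything in it is hypothetical: rollingDeploy, the node names, and the two callbacks are illustration only and are not part of any real deployment tool or of the library introduced later.

```php
<?php
declare(strict_types=1);

/**
 * Deploys to nodes one at a time and stops at the first node that fails
 * its health check, so the remaining nodes keep serving the old release.
 * Hypothetical helper for illustration only.
 *
 * @param string[] $nodes
 * @return array{updated: string[], failedAt: ?string}
 */
function rollingDeploy(array $nodes, callable $deploy, callable $isHealthy): array
{
    $updated = [];

    foreach ($nodes as $node) {
        $deploy($node);

        // Abort as soon as a freshly deployed node reports unhealthy.
        if (!$isHealthy($node)) {
            return ['updated' => $updated, 'failedAt' => $node];
        }

        $updated[] = $node;
    }

    return ['updated' => $updated, 'failedAt' => null];
}

// Simulate a broken release: every node fails its health check after deploy.
$result = rollingDeploy(
    ['node-1', 'node-2', 'node-3'],
    static function (string $node): void {
        // A real deployment step would go here.
    },
    static function (string $node): bool {
        return false;
    }
);

echo $result['failedAt'], PHP_EOL; // node-1
```

Because the loop returns on the first failed check, node-2 and node-3 are never touched, which is why a broken release does not take the whole site down in this version of the story.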

The rollback is a success, everything is working as expected. You get praise from management and go back to bed to catch some sleep. 

If only... 

If only there were a PHP library for making this parallel universe a reality. Well, now there is!

Introducing the NordSec/StatusChecker library!

Main features 

The library comes with a controller that provides you with a status page: 

  • /status — provides the overall status of the app, useful for load balancer health checks. 

  • /status/details — provides more detailed information. 

We’ve also included a CLI command that outputs those details to stdout: php bin/console status:check
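If you want to consume the /status/details output from a script, a small helper along these lines could pull out the failing checks. This is only a sketch: failingChecks is a hypothetical function that is not shipped with the library, and it assumes the response is a flat JSON object of check name to "OK" or "FAIL", as in the examples above.

```php
<?php
declare(strict_types=1);

/**
 * Returns the names of all checks whose status is not "OK".
 * Hypothetical helper; it is not part of the library.
 *
 * @return string[]
 */
function failingChecks(string $detailsJson): array
{
    // Decode into an associative array; malformed JSON throws JsonException.
    $details = json_decode($detailsJson, true, 512, JSON_THROW_ON_ERROR);

    if (!is_array($details)) {
        throw new InvalidArgumentException('Expected a JSON object of check results.');
    }

    $failing = [];
    foreach ($details as $name => $status) {
        if ($status !== 'OK') {
            $failing[] = (string) $name;
        }
    }

    return $failing;
}

$body = '{"database.default":"OK","redis.local":"FAIL","rabbitmq.server":"FAIL"}';
echo implode(', ', failingChecks($body)), PHP_EOL; // redis.local, rabbitmq.server
```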

Configuration 

So, let’s turn our parallel universe scenario into reality using the following basic configuration:

$container[StatusCheckerService::class] = function (
    Container $container
) {
    $configuration = $container['config'];

    return new StatusCheckerService([
        new DatabaseChecker(
            'database.default',
            $configuration['database']['default']
        ),
        new RabbitMqChecker('rabbitmq.server', $configuration['queue']),
        new RedisChecker('redis.local', $configuration['redis']),
    ]);
};

Then, configure your load balancer health check to expect a {"status": "OK"} from the /status page. 
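Most load balancers do this matching for you, but if you ever need the same check in a script of your own, it might look like the sketch below. The function name isHealthyStatusBody is made up for this example, and the sketch only inspects the response body, since that is all the /status contract described here promises.

```php
<?php
declare(strict_types=1);

/**
 * Returns true only when the /status response body is a JSON object
 * whose "status" field is exactly "OK". Hypothetical helper.
 */
function isHealthyStatusBody(string $body): bool
{
    // json_decode returns null on malformed input, which counts as unhealthy.
    $payload = json_decode($body, true);

    return is_array($payload) && ($payload['status'] ?? null) === 'OK';
}

var_dump(isHealthyStatusBody('{"status":"OK"}'));   // bool(true)
var_dump(isHealthyStatusBody('{"status":"FAIL"}')); // bool(false)
var_dump(isHealthyStatusBody('Bad Gateway'));       // bool(false)
```

Treating anything that is not a clean OK as unhealthy is the safer default: an error page or a truncated response should take the node out of rotation just like an explicit FAIL.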

There is a caveat though. 

If your RabbitMQ server goes down, the status page will change to {"status": "FAIL"}. This will cause the load balancer to assume that all your nodes are broken. 

Let’s fix that by marking some services as non-critical:

$container[StatusCheckerService::class] = function (
    Container $container
) {
    $configuration = $container['config'];

    // Redis is non-critical: its failure should not flip /status to FAIL.
    $redisChecker = new RedisChecker('redis.local', $configuration['redis']);
    $redisChecker->setCritical(false);

    // RabbitMQ is non-critical as well.
    $rabbitMqChecker = new RabbitMqChecker(
        'rabbitmq.server',
        $configuration['queue']
    );
    $rabbitMqChecker->setCritical(false);

    return new StatusCheckerService([
        // The database check is left critical.
        new DatabaseChecker(
            'database.default',
            $configuration['database']['default']
        ),
        $rabbitMqChecker,
        $redisChecker,
    ]);
};

Now, whenever non-essential services go down, the /status page keeps reporting OK and your app stays in rotation, while /status/details still gives your monitoring software useful output.
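The idea behind the critical flag can be sketched in a few lines. This is not the library's actual implementation: overallStatus, detailedStatus, and the array shape used for results are all assumptions made for illustration. The point is only that the overall status ignores failures of non-critical checks, while the detailed view reports every check.

```php
<?php
declare(strict_types=1);

/**
 * Overall status: FAIL only if at least one critical check failed.
 * Hypothetical sketch, not the library's code.
 *
 * @param array<string, array{ok: bool, critical: bool}> $results
 */
function overallStatus(array $results): string
{
    foreach ($results as $result) {
        if ($result['critical'] && !$result['ok']) {
            return 'FAIL';
        }
    }

    return 'OK';
}

/**
 * Detailed status: every check is reported, critical or not.
 *
 * @param array<string, array{ok: bool, critical: bool}> $results
 * @return array<string, string>
 */
function detailedStatus(array $results): array
{
    $details = [];
    foreach ($results as $name => $result) {
        $details[$name] = $result['ok'] ? 'OK' : 'FAIL';
    }

    return $details;
}

$results = [
    'database.default' => ['ok' => true,  'critical' => true],
    'redis.local'      => ['ok' => false, 'critical' => false],
    'rabbitmq.server'  => ['ok' => false, 'critical' => false],
];

echo json_encode(['status' => overallStatus($results)]), PHP_EOL;
// {"status":"OK"}
echo json_encode(detailedStatus($results)), PHP_EOL;
// {"database.default":"OK","redis.local":"FAIL","rabbitmq.server":"FAIL"}
```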

Great! What’s next? 

Here at NordVPN, we’ve been using open source to build exceptional products, and we now wish to give back to the community!

Expect more open source libraries and helpful tips from us in the future!