Reason For Outage for October 25th, 2019

This page describes the circumstances of the outage on October 25th, 2019 and the actions that we have taken, and will take, to prevent a recurrence. This article was published on 2019-10-30.

Background

Early on 2019-10-25 we observed unusual Client traffic which was causing a large number of transactions per second, and correspondingly high load, on one of our database servers. We also received a number of support requests from customers who were experiencing delays in printing. We concluded that a correctly functioning Client could not generate this traffic, and that it therefore came either from a faulty Client or from a malicious source.

We analysed the API logs but didn't find anything out of the ordinary. In particular, our API was not responding slowly compared to its baseline. We asked the customers who were experiencing problems for details but didn't receive any specific information on the nature of the delays. However, we observed a number of long-running database transactions that appeared to be linked to the problems the customers were experiencing.

We analysed the server logs to pinpoint the origin of the unusual Client traffic and decided that the correct course of action was to reconfigure the server to disallow that traffic.
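
As an illustration only, log analysis of this kind can be sketched as follows. The log format (source address as the first whitespace-separated field), the sample addresses and the function name top_sources are assumptions made for this sketch; they do not reflect the actual logs or tooling involved.

```python
from collections import Counter


def top_sources(log_lines, limit=3):
    """Count requests per source address.

    Assumes (hypothetically) that each log line starts with the
    source address, followed by whitespace-separated fields.
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    # Most frequent sources first
    return counts.most_common(limit)


# Made-up sample lines using documentation-only address ranges
sample = [
    "203.0.113.7 GET /jobs",
    "198.51.100.2 GET /jobs",
    "203.0.113.7 GET /jobs",
    "203.0.113.7 POST /jobs",
]

print(top_sources(sample))
```

A source that accounts for a disproportionate share of requests would be a candidate for closer inspection before any blocking rule is applied.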

Reason for outage

At 2019-10-25 19:19 UTC, in an attempt to block the unusual Client traffic, the affected application server was erroneously reconfigured. This led to loss of control over the server. We decided against failing over to the back-up application server, as that would have resulted in the loss of some of the print jobs that had been sent to us. Instead we took steps to regain control of the server. Once we had succeeded in doing so, the original settings were restored and normal service resumed.

Remediation

The following actions are being taken:

The actions that are being taken will enable us to dramatically reduce both the impact of problematic Client traffic and the impact of losing an application server. In particular, we are confident that this set of circumstances can no longer result in a service outage.