Root Cause Analysis of August and September 2019 Outages

This is the root cause analysis of the service disruptions that occurred on 2019-08-18, 2019-09-10 and 2019-09-22. It was published on 2019-10-08.

Timeline of events

2019-08-18

00:18 UTC – Our monitoring system reported disconnection of a small number of Clients; initial analysis pointed to a wider Internet outage.
11:54 UTC – We raised a ticket with our data centre provider.
12:16 UTC – The data centre's analysis mistakenly pointed to a BT network outage, which was also occurring at the time.
15:04 UTC – After more analysis, the data centre team identified a problem within their own network and implemented a fix; connectivity was restored for some of the previously disconnected Clients.
15:08 UTC – The data centre team implemented a second fix; full connectivity was restored.

2019-09-10

12:22 UTC – Our monitoring system reported the disconnection of a large proportion of Client traffic: around 27% was lost.
12:38 UTC – After initial analysis, a ticket was raised with the data centre team.
13:18 UTC – The data centre team applied a fix; around 13% of connections were regained.
13:53 UTC – The data centre team applied a second fix; a further 13% of connections were regained.
14:11 UTC – The data centre team applied a third fix; 11% of Client software instances experienced intermittent connectivity for two minutes, and the final 1% of connections were regained.
22:11 UTC – The data centre team reported further diagnostics of the issue and a plan for a long-term solution.
23:30 UTC – The data centre team reported that a long-term solution was in place; monitoring of the hardware would continue.

2019-09-22

10:35 UTC – Our monitoring system showed that a significant proportion of Client traffic had been lost.
11:07 UTC – After initial analysis, a ticket was raised with the data centre team.
11:48 UTC – The data centre team identified the issue and made changes to routing tables.
12:05 UTC – Connectivity was restored following the routing table changes and a restart of the core router.

Root cause

The three failures shared a single root cause: a bug in Cisco routers that caused them to handle routing tables incorrectly (https://quickview.cloudapps.cisco.com/quickview/bug/CSCvb42724).

The bug was difficult for the data centre to track down because each occurrence affected only a small portion of traffic and a different physical appliance. It was sufficiently subtle that their monitoring systems did not pick it up, which meant that their responses to the outages were not as rapid as they should have been.

Short term remediation

The data centre team first decided to remove a large number of IPv6 entries from their routers by moving that traffic to a default route only; this was done to keep the routing tables small and avoid triggering the bug.
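
We do not have visibility into the data centre's router configuration, so the following Python sketch is only illustrative (the prefixes and next-hop names are hypothetical). It shows why a default-route-only table is a reasonable stopgap: longest-prefix matching still resolves every destination, while the table itself shrinks to a single entry.

```python
import ipaddress

def longest_prefix_match(table, destination):
    """Return the most specific route in `table` covering `destination`, or None."""
    dest = ipaddress.ip_address(destination)
    candidates = [ipaddress.ip_network(prefix) for prefix in table]
    matches = [net for net in candidates if dest in net]
    # The most specific match is the one with the longest prefix length.
    return max(matches, key=lambda net: net.prefixlen, default=None)

# A table carrying many specific IPv6 prefixes (the kind of state that triggered the bug).
full_table = {"2001:db8:1::/48": "peer-a", "2001:db8:2::/48": "peer-b", "::/0": "upstream"}

# The short-term fix: carry only a default route, so the table stays at one entry.
default_only = {"::/0": "upstream"}

for table in (full_table, default_only):
    route = longest_prefix_match(table, "2001:db8:1::42")
    print(f"{len(table)} entries -> {route} via {table[str(route)]}")
```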

The PrintNode team decided to add more monitoring infrastructure to identify any future issues as quickly as possible and minimise their impact.
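
The exact shape of that monitoring is still being decided, but the core check is straightforward to sketch. The Python fragment below (with hypothetical thresholds and a pluggable alert callback) flags a sudden drop in the number of connected Clients, which is the signal that revealed all three incidents.

```python
from datetime import datetime, timezone

# Hypothetical values; real thresholds would be tuned against normal Client churn.
DROP_THRESHOLD = 0.05   # alert if more than 5% of Clients disconnect...
WINDOW_SECONDS = 60     # ...within a one-minute window

def check_connectivity(samples, alert):
    """`samples` is a list of (timestamp, connected_client_count) tuples, newest last.
    Calls `alert(message)` when the connected count drops by more than
    DROP_THRESHOLD relative to the recent peak within WINDOW_SECONDS."""
    if len(samples) < 2:
        return
    latest_time, latest_count = samples[-1]
    recent = [count for when, count in samples
              if (latest_time - when).total_seconds() <= WINDOW_SECONDS]
    baseline = max(recent)
    if baseline == 0:
        return
    drop = (baseline - latest_count) / baseline
    if drop > DROP_THRESHOLD:
        alert(f"{drop:.0%} of Clients disconnected within {WINDOW_SECONDS}s "
              f"({baseline} -> {latest_count})")

# Example: a drop of the size seen on 2019-09-10 (~27%) triggers an alert immediately.
now = datetime.now(timezone.utc)
check_connectivity([(now, 10_000), (now, 7_300)], alert=print)
```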

Long term remediation

The data centre team decided to bring its planned upgrade of the Cisco routers forward from 12 months' time to as soon as possible. The change (to the Juniper MX platform) is now planned for the coming weeks and is expected to eliminate the possibility of this issue recurring.

The PrintNode team is exploring options, including enhanced monitoring, geographic redundancy, automatic failover and a change of data centre provider, to add redundancy and make PrintNode more resilient against outages of this kind.
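
None of these options has been selected yet, so the sketch below is only a minimal illustration, in Python with hypothetical hostnames, of what Client-side automatic failover could look like: each Client tries a list of endpoints in order and falls back to the next when a connection attempt fails. Combined with geographic redundancy, those endpoints would sit in separate data centres so a single provider outage cannot disconnect Clients entirely.

```python
import socket

# Hypothetical endpoints; with geographic redundancy these would live in
# different data centres so one provider outage cannot take both down.
ENDPOINTS = [("primary.example.net", 443), ("backup.example.net", 443)]

def connect_with_failover(endpoints, timeout=5.0):
    """Try each endpoint in order and return the first socket that connects."""
    last_error = None
    for host, port in endpoints:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as exc:
            last_error = exc  # note why this endpoint failed and try the next one
    raise ConnectionError(f"all endpoints unreachable, last error: {last_error}")

if __name__ == "__main__":
    sock = connect_with_failover(ENDPOINTS)
    print("connected to", sock.getpeername())
    sock.close()
```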