All times are PST.
Over a 24-hour period spanning November 17th and 18th, Discord was completely offline for two 30-minute periods and suffered fourteen additional minor interruptions of service for portions of the userbase. The root cause was a bug introduced to the Google Cloud Platform networking stack. This bug caused 16 separate loss-of-network events on nodes in Discord's "sessions" cluster. Diagnosis and resolution (by rolling back the offending change) took Google engineers over 24 hours. Contributing greatly to the severity of the two major incidents were performance and stability issues in Discord's own systems: this particular network failure mode was able to trigger a cascading failure of our service, twice necessitating a full restart. The rest of this postmortem presents the timeline, details the stability issues and actions taken on our end, and finally discusses the changes we're making to limit the impact of this class of event in the future.
Sequence of Events
For the sake of brevity and clarity, some networking incidents that did not have a significant or noticeable impact have been omitted from this timeline.
- 11:39 - A node of the sessions cluster, sessions-1-16, experiences a networking incident, triggering an application bug which causes the service to die and automatically restart.
- 11:50 - Another node, sessions-1-9, observes a similar networking incident briefly separating it from members of another cluster (guilds). This incident did not trigger the aforementioned application bug, thus the node did not restart and was left in an invalid state.
- 11:52 - A service impact is noticed by engineers and escalated within our incident chat.
- 11:53 - During investigation, engineers notice the earlier loss of clients on sessions-1-16.
- 11:56 - Engineers discover that the service running on sessions-1-16 restarted at 11:40.
- 11:56 - Service stabilizes.
- 11:58 - Through investigation engineers identify the root cause of the issues on sessions-1-16 to be a networking level problem that triggered an application bug.
- 12:01 - Discord’s primary on-call engineer is paged for a low memory alert on sessions-1-9.
- 12:02 - Engineers identify that the “Erlang distribution port”, which nodes use to communicate with one another, appears to have failed -- a state which has been observed in previous outages.
- 12:02 - 12:07 - Engineers attempt to safely decommission the impacted sessions node without causing adverse effects on the rest of Discord's clusters.
- 12:11 - An engineer notes that the observed behavior appears to resemble a network split, something that has been previously observed on Google’s compute engine network.
- 12:14 - The safe decommission process fails, and we decide to execute a forceful termination of the node. Engineers brace for a possible repeat of the cascading service failure observed during a prior outage.
- 12:18 - Engineers declare an all-clear, having observed no adverse effects of the forced termination.
- 12:27 - An engineer opens a P2 ticket with Google’s support team to investigate the networking issues they’ve observed.
- 12:38 - The on-call engineer is paged for low memory on sessions-1-14.
- 12:39 - Engineers confirm that this node has also observed a failed distribution port caused by networking issues and must be forcefully terminated.
- 12:42 - Discord escalates the Google support ticket to P1, the highest level available, and notifies our Technical Account Manager.
- 12:48 - Engineers declare another all-clear, yet again having observed no adverse effects of the forceful stop.
- 14:06 - The on-call engineer is paged for another memory issue, this time affecting sessions-1-13.
- 14:09 - Engineers confirm that this failure resembles the previous issues and execute another forceful termination.
- 14:13 - This forceful termination triggers the cascading service failure. We update the status page to communicate these problems.
- 14:15 - The emergency communications bridge is started to coordinate service recovery. All Discord backend engineers and SREs are now working on the incident.
- 14:21 - Engineers begin an operation which will restart several core Discord services. This operation is generally referred to as “rebooting” Discord, as it forces the millions of connected Discord clients to reconnect over a period of around 20 minutes.
- 14:33 - Engineers complete the restart operation and begin to monitor and manage the recovery process. Service begins recovering.
- 14:33 - 15:07 - Engineers continue to monitor and tackle individual issues that crop up during the reboot phase.
- 14:59 - Discord gets Google on the phone and confirms that they’re investigating, but Google does not have any indications that there is a problem on their end. Discord engineers are certain this is a Google networking issue.
- 15:07 - The initial all-clear is given and the status board is updated.
- 15:08 - Engineers safely decommission two misbehaving nodes of the sessions cluster.
- 15:33 - 15:37 - Engineers identify another misbehaving node of the sessions cluster and forcefully terminate it.
- 16:00 - Google engineers request more data and work with Discord engineers to attempt to identify a root cause.
- 17:10 - Discord engineers observe another networking incident, now on sessions-1-15, and forcefully terminate the node.
- 22:00 - 23:30 - Google continues to investigate the problem but does not believe there is an issue on their end. Discord staff notices that no machine has failed again after a forceful termination (and reboot) and begins executing a plan to terminate and reboot the entire sessions cluster.
- 11:38 (November 18th) - The on-call engineer is paged for another networking incident, now on sessions-1-11.
- 11:40 - Discord’s customer experience team escalates issue reports from support channels.
- 11:41 - Engineers update the status page to note they are investigating another incident.
- 11:44 - Engineers observe similar behavior to the incidents observed the day before, and note that they may be forced to reboot the system again.
- 11:46 - The system cascades its way towards failure. Engineers spin up another communications bridge to again coordinate service recovery.
- 11:48 - Engineers begin to restart members of the sessions cluster.
- 11:56 - Engineers observe a failure on a node of the guilds cluster.
- 11:59 - Discord SRE spins up a voice bridge with Google engineering.
- 12:04 - Engineers globally disable message sending to aid in service recovery.
- 12:04 - Engineers restart all members of the sessions cluster.
- 12:11 - As engineers observe an abnormally slow recovery, they begin to investigate and discover what they believe to be another application-level bug delaying normal recovery.
- 12:11 - 12:31 - Engineers continue to monitor the system as service recovers.
- 12:31 - Engineers give an initial all-clear and re-enable message sending.
- 13:30 - Google engineers advise that they have begun to roll back a software change made to their networking stack on Friday.
- ~15:30 - Google confirms that the rollback of the component they believe may be causing issues has completed.
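The “reboot” operation above works because reconnecting clients spread themselves out over a window of around 20 minutes rather than stampeding back all at once. A minimal sketch of the standard technique for this, jittered exponential backoff, is shown below in Python. This is illustrative only, not Discord's actual client code, and the `base` and `cap` values are invented for the example.

```python
import random

def reconnect_delay(attempt, base=1.0, cap=300.0):
    """Return a jittered exponential backoff delay (in seconds) for the
    given reconnect attempt (0, 1, 2, ...).

    "Full jitter" picks a delay uniformly in [0, ceiling], where the
    ceiling doubles with each failed attempt up to `cap`. Millions of
    clients using independent random delays desynchronize instead of
    reconnecting in lockstep and overwhelming the servers again.
    Constants here are hypothetical, chosen only for illustration.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

A client would sleep for `reconnect_delay(attempt)` before each retry, resetting `attempt` to zero once a connection is successfully re-established.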
Investigation and Analysis
The root cause of this sequence of outages was instability in the networking layer of our VMs hosted on Google’s Cloud Platform. This instability presented itself as brief losses of network connectivity and was caused by an upgrade Google rolled out. The bug affected only a subset of our nodes (the sessions cluster) because of how heavily that particular cluster utilizes the network.
On top of the networking instability caused by Google’s rollout, Discord engineers identified multiple issues that contributed to or exacerbated the impact of the root cause, including:
- The specific network instability that was observed caused a component of the Erlang VM which handles internode communication (net_kernel) to become overloaded, due to the way our clusters and nodes talk to each other. This overload would cause the Erlang distribution port to become “wedged”, leading to runaway memory growth on the node as messages queued up behind the stuck port.
- An internal component used for service discovery experienced degradation due to performance issues in the implementation. This degradation contributed to the issues observed during recovery.
- The HTTP library which Discord uses internally in our Erlang stack exhibited various levels of instability in these network conditions, which drastically slowed our recovery time and exacerbated other problems.
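A wedged distribution port tends to announce itself as a message queue that keeps growing while nothing drains it, which is why the failures above surfaced first as low-memory pages. The heuristic below sketches how a watchdog might distinguish "wedged" from merely busy; it is a hypothetical illustration in Python, and the threshold and window values are invented, not Discord's actual alerting rules.

```python
def looks_wedged(queue_samples, threshold=100_000, window=5):
    """Heuristic check for a wedged port from periodic samples of a
    process's message-queue length.

    We call the port "wedged" (rather than just busy) when the queue has
    stayed above `threshold` AND grown strictly across the last `window`
    samples: a healthy port drains its queue, so sustained monotonic
    growth suggests nothing is being written out. All constants are
    illustrative assumptions.
    """
    if len(queue_samples) < window:
        return False
    recent = queue_samples[-window:]
    stuck_high = all(s > threshold for s in recent)
    still_growing = all(b > a for a, b in zip(recent, recent[1:]))
    return stuck_high and still_growing
```

On the Erlang VM itself, the equivalent raw signal is available via `erlang:process_info(Pid, message_queue_len)`; a node flagged by a check like this would then be decommissioned (or forcefully terminated) as described in the timeline.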
Action Items / Response
This outage was one of the longest and most severe in Discord’s history. As such, our internal investigation and analysis have resulted in a multitude of items we will be addressing in the coming weeks. The major components are:
- During a previous outage when we were forced to reboot Discord, we identified some issues with the way our Erlang clusters handle node failure. We began a project to improve this behavior and are continuing that work. To ensure we can roll out this upgrade safely, we are expanding the way we load test upgrades to these services.
- We are working on replacing the HTTP library Discord uses for its Elixir/Erlang services. The current library has caused numerous issues over time and we have lost confidence in its stability.
- We have reproduced and implemented a fix for the instability we saw with the net_kernel component of Erlang.
- We are working to replace pieces of the service discovery component which degraded during the outage with more performant implementations.
- We are actively working with Google to improve on the issues we observed when working with them on this issue. Additionally, we’ve expanded our internal procedure for communicating with Google to ensure we properly track the progress on P1 issues.
We believe the steps outlined above will resolve the majority of the problems that escalated this incident from isolated issues affecting subsets of our userbase into two full system outages. We’ve placed all of the items above at the top of our priority list and will be working to complete them in the coming weeks.
Finally, we’d like to apologize for any instability or interruption you experienced due to this incident. Outages like this which span multiple days and cause noticeable issues for a large number of users are particularly painful for us. Everyone at Discord cares greatly about the uptime and reliability of our service, and we hope to demonstrate this over the coming weeks.