All times are PST.
Over a 24-hour period spanning November 17th and 18th, Discord was completely offline for two 30-minute periods and suffered fourteen additional minor interruptions of service for portions of the userbase. The root cause was a bug introduced to the Google Cloud Platform networking stack. This bug caused 16 separate loss-of-network events on nodes in Discord's "sessions" cluster. Diagnosis and resolution (by rolling back the offending change) took Google engineers over 24 hours. Contributing greatly to the severity of the two major incidents were performance and stability issues in Discord's own systems: this particular network failure mode was able to trigger a cascading failure of our service, twice necessitating a full restart. The rest of this postmortem presents the timeline, details the stability issues and actions taken on our end, and finally discusses the changes we're making to limit the impact of this class of event in the future.
Sequence of Events
For the sake of brevity and clarity, some networking incidents that did not have a significant or noticeable impact have been omitted from this timeline.
- 11:39 - A node of the sessions cluster, sessions-1-16, experiences a networking incident, triggering an application bug which causes the service to die and automatically restart.
- 11:50 - Another node, sessions-1-9, observes a similar networking incident briefly separating it from members of another cluster (guilds). This incident did not trigger the aforementioned application bug, thus the node did not restart and was left in an invalid state.
- 11:52 - A service impact is noticed by engineers and escalated within our incident chat.
- 11:53 - During investigation, engineers notice the earlier loss of clients on sessions-1-16.
- 11:56 - Engineers discover that the service running on sessions-1-16 restarted at 11:40.
- 11:56 - Service stabilizes.
- 11:58 - Through investigation engineers identify the root cause of the issues on sessions-1-16 to be a networking level problem that triggered an application bug.
- 12:01 - Discord’s primary on-call engineer is paged for a low memory alert on sessions-1-9.
- 12:02 - Engineers identify that the “Erlang distribution port”, which nodes use to communicate with one another, appears to have failed -- a state which has been observed in previous outages.
- 12:02 - 12:07 - Engineers attempt to safely decommission the impacted sessions node without causing adverse effects on the rest of Discord's clusters.
- 12:11 - An engineer notes that the observed behavior appears to resemble a network split, something that has been previously observed on Google’s compute engine network.
- 12:14 - The safe decommission process fails, and we decide to execute a forceful termination of the node. Engineers brace for a possible repeat of the cascading service failure observed during a prior outage.
- 12:18 - Engineers declare an all-clear, having observed no adverse effects of the forced termination.
- 12:27 - An engineer opens a P2 ticket with Google’s support team to investigate the networking issues they’ve observed.
- 12:38 - The on-call engineer is paged for low memory on sessions-1-14.
- 12:39 - Engineers confirm that this node has also observed a failed distribution port caused by networking issues and must be forcefully terminated.
- 12:42 - Discord escalates the Google support ticket to P1, the highest level available, and notifies our Technical Account Manager.
- 12:48 - Engineers declare another all-clear, yet again having observed no adverse effects of the forceful stop.
- 14:06 - The on-call engineer is paged for another memory issue, this time affecting sessions-1-13.
- 14:09 - Engineers confirm that this failure resembles the previous issues and execute another forceful termination.
- 14:13 - This forceful termination triggers the cascading service failure. We update the status page to communicate these problems.
- 14:15 - The emergency communications bridge is started to coordinate service recovery. All Discord backend engineers and SREs are now working on the incident.
- 14:21 - Engineers begin an operation which will restart several core Discord services. This operation is generally referred to as “rebooting” Discord, as it forces the millions of connected Discord clients to reconnect over a period of around 20 minutes.
- 14:33 - Engineers complete the restart operation and begin to monitor and manage the recovery process. Service begins recovering.
- 14:33 - 15:07 - Engineers continue to monitor and tackle individual issues that crop up during the reboot phase.
- 14:59 - Discord gets Google on the phone and confirms that they’re investigating, but Google does not have any indications that there is a problem on their end. Discord engineers are certain this is a Google networking issue.
- 15:07 - The initial all-clear is given and the status board is updated.
- 15:08 - Engineers safely decommission two misbehaving nodes of the sessions cluster.
- 15:33 - 15:37 - Engineers identify another misbehaving node of the sessions cluster and forcefully terminate it.
- 16:00 - Google engineers request more data and work with Discord engineers to attempt to identify a root cause.
- 17:10 - Discord engineers observe another networking incident, now on sessions-1-15, and forcefully terminate the node.
- 22:00 - 23:30 - Google continues to investigate the problem but does not believe there is an issue on their end. Discord staff notices that no machine has failed again after a forceful termination (and reboot) and begins executing a plan to terminate and reboot the entire sessions cluster.
- 11:38 (November 18th) - The on-call engineer is paged for another networking incident, now on sessions-1-11.
- 11:40 - Discord’s customer experience team escalates issue reports from support channels.
- 11:41 - Engineers update the status page to note they are investigating another incident.
- 11:44 - Engineers observe similar behavior to the incidents observed the day before, and note that they may be forced to reboot the system again.
- 11:46 - The system cascades its way towards failure. Engineers spin up another communications bridge to again coordinate service recovery.
- 11:48 - Engineers begin to restart members of the sessions cluster.
- 11:56 - Engineers observe a failure on a node of the guilds cluster.
- 11:59 - Discord SRE spins up a voice bridge with Google engineering.
- 12:04 - Engineers globally disable message sending to aid in service recovery.
- 12:04 - Engineers restart all members of the sessions cluster.
- 12:11 - As engineers observe an abnormally slow recovery, they begin to investigate and discover what they believe to be another application-level bug delaying normal recovery.
- 12:11 - 12:31 - Engineers continue to monitor the system as service recovers.
- 12:31 - Engineers give an initial all-clear and re-enable message sending.
- 13:30 - Google engineers advise that they have begun to roll back a software change made to their networking stack on Friday.
- ~15:30 - Google confirms that the rollback of the component they believe may be causing issues has completed.
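The “reboot” operation above works because reconnecting clients spread themselves out over a window of around 20 minutes rather than stampeding back all at once. A minimal sketch of the standard technique for this, jittered exponential backoff, is shown below in Python. This is illustrative only, not Discord's actual client code, and the `base` and `cap` values are invented for the example.

```python
import random

def reconnect_delay(attempt, base=1.0, cap=300.0):
    """Return a jittered exponential backoff delay (in seconds) for the
    given reconnect attempt (0, 1, 2, ...).

    "Full jitter" picks a delay uniformly in [0, ceiling], where the
    ceiling doubles with each failed attempt up to `cap`. Millions of
    clients using independent random delays desynchronize instead of
    reconnecting in lockstep and overwhelming the servers again.
    Constants here are hypothetical, chosen only for illustration.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)
```

A client would sleep for `reconnect_delay(attempt)` before each retry, resetting `attempt` to zero once a connection is successfully re-established.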
Investigation and Analysis
The root cause of this sequence of outages was instability in the networking layer of our VMs hosted on Google’s Cloud Platform. This instability presented itself as brief losses of network connectivity and was caused by an upgrade Google rolled out. The bug affected only a subset of our nodes (the sessions cluster) because of how heavily that particular cluster utilizes the network.
On top of the networking instability caused by Google’s rollout, Discord engineers identified multiple issues that contributed to or exacerbated the impact of the root cause, including:
- The specific network instability that was observed caused a component of the Erlang VM which handles internode communication (net_kernel) to become overloaded, due to the way our clusters and nodes talk to each other. This overload would cause the Erlang distribution port to become “wedged”, leading to runaway memory growth on the node as messages queued up behind the stuck port.
- An internal component used for service discovery experienced degradation due to performance issues in the implementation. This degradation contributed to the issues observed during recovery.
- The HTTP library which Discord uses internally in our Erlang stack exhibited various levels of instability in these network conditions, which drastically slowed our recovery time and exacerbated other problems.
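A wedged distribution port tends to announce itself as a message queue that keeps growing while nothing drains it, which is why the failures above surfaced first as low-memory pages. The heuristic below sketches how a watchdog might distinguish "wedged" from merely busy; it is a hypothetical illustration in Python, and the threshold and window values are invented, not Discord's actual alerting rules.

```python
def looks_wedged(queue_samples, threshold=100_000, window=5):
    """Heuristic check for a wedged port from periodic samples of a
    process's message-queue length.

    We call the port "wedged" (rather than just busy) when the queue has
    stayed above `threshold` AND grown strictly across the last `window`
    samples: a healthy port drains its queue, so sustained monotonic
    growth suggests nothing is being written out. All constants are
    illustrative assumptions.
    """
    if len(queue_samples) < window:
        return False
    recent = queue_samples[-window:]
    stuck_high = all(s > threshold for s in recent)
    still_growing = all(b > a for a, b in zip(recent, recent[1:]))
    return stuck_high and still_growing
```

On the Erlang VM itself, the equivalent raw signal is available via `erlang:process_info(Pid, message_queue_len)`; a node flagged by a check like this would then be decommissioned (or forcefully terminated) as described in the timeline.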
Action Items / Response
This outage was one of the longest and most severe in Discord’s history. As such, our internal investigation and analysis have resulted in a multitude of items we will be addressing in the coming weeks. The major components are:
- During a previous outage when we were forced to reboot Discord, we identified some issues with the way our Erlang clusters handle node failure. We began a project to improve this behavior and are continuing that work. To ensure we can roll out this upgrade safely, we are expanding the way we load test upgrades to these services.
- We are working on replacing the HTTP library Discord uses for its Elixir/Erlang services. The current library has caused numerous issues over time and we have lost confidence in its stability.
- We have reproduced and implemented a fix for the instability we saw with the net_kernel component of Erlang.
- We are working to replace pieces of the service discovery component which degraded during the outage with more performant implementations.
- We are actively working with Google to improve on the issues we observed when working with them on this issue. Additionally, we’ve expanded our internal procedure for communicating with Google to ensure we properly track the progress on P1 issues.
We believe the steps outlined above will resolve the majority of the problems that escalated this incident from isolated issues affecting subsets of our userbase into two full system outages. We’ve placed all of the items above at the top of our priority list and will be working to complete them in the coming weeks.
Finally, we’d like to apologize for any instability or interruption you experienced due to this incident. Outages like this which span multiple days and cause noticeable issues for a large number of users are particularly painful for us. Everyone at Discord cares greatly about the uptime and reliability of our service, and we hope to demonstrate this over the coming weeks.