Traffic and latency are fully restored, and Discord is fully operational.
The underlying issue was an operation we performed on our authentication database cluster, which reduced the cluster's capacity too far. We performed some load shedding to enable some traffic to recover, but due to the nature of this cluster's topology and the incident, only about 50% of our userbase was able to get back online in the interim. We had to wait for the maintenance to complete on the impacted database nodes before we were able to fully restore the service.
We'll be following up internally with a postmortem and plan to publish a public blog post describing the incident and our follow-ups to ensure that this issue cannot recur. Our goal is that Discord just works when you need it; we know many of you (and all of us!) rely on Discord. Please accept our apologies, and thank you for your patience today as we worked through this incident.
Posted Nov 06, 2023 - 12:00 PST
The cluster is healthy and we are now seeing major recovery of users and latency, but it will take another 5-10 minutes for all users to come online. We remain online and will continue to monitor through the recovery.
Posted Nov 06, 2023 - 11:45 PST
The databases that were impacted have come back online and we are beginning to see recovery of the graphs on the authentication service. We will continue to apply rate limits to the service until every user is back online and latency has recovered.
Posted Nov 06, 2023 - 11:43 PST
We are continuing to bring users online as we are able, while we balance that against bringing some of the impacted database nodes back online. We appreciate your patience while we work through this.
The underlying issue was an operation the team performed to upgrade some of the database nodes. We will conduct a full postmortem analysis to understand what went wrong in the operation, but at the moment we are fully focused on bringing the service back online.
Posted Nov 06, 2023 - 11:36 PST
We have approximately 50% of users back in and using Discord, but latency is still elevated. If you still aren't able to log in, you should be able to get in soon. Latency will continue to be elevated until we are able to address the underlying database issue limiting throughput.
Posted Nov 06, 2023 - 11:26 PST
We are beginning to allow traffic back into the service and are seeing users reconnect, but we are applying a rate limit so that we can ramp traffic up at a rate that doesn't overwhelm the authentication service.
Posted Nov 06, 2023 - 11:16 PST
We have identified the underlying issue impacting our authentication service and we are working to resolve it now. We have the relevant teams online and engaged in our incident response process.
Posted Nov 06, 2023 - 11:08 PST
We are aware of an issue causing elevated latency and errors on Discord. The team is online and investigating now.