Increased API Error Rate and Push Error Rate

Incident Report for Discord

Resolved

Service is working as expected at this time.

Engineering has root-caused this issue to a Google Cloud component called "Traffic Director", which is responsible for configuring our load balancing layer. When it malfunctioned, our internal load balancing layer was left without a valid configuration, which caused a loss of availability of the API. Engineering remediated this by moving to an alternate proxy configuration that did not use Traffic Director. Switching to a different load balancing topology took some time, but we were able to restore service before the Traffic Director issues were resolved.
Posted Mar 08, 2022 - 13:01 PST

Monitoring

Typing events have been re-enabled. At this point all functionality has been restored and the service appears to be operating as designed. Oncall Engineering will continue to monitor the service and will work to understand the root cause of this issue with our provider partners.
Posted Mar 08, 2022 - 12:41 PST

Update

Oncall Engineering has brought the media infrastructure back online, and media embeds should be functional again. Message acknowledgement has also been re-enabled. We are continuing to work to restore full functionality. Typing events are still disabled at this time, and we are investigating reports from bot developers regarding unexpected 403 errors.
Posted Mar 08, 2022 - 12:19 PST

Identified

Remediations are working and traffic is coming back online. While we work to restore full service, some functionality (typing events and message acknowledgement) will remain intentionally disabled until the service stabilizes. Other functionality has yet to be restored; in particular, media embeds may not work correctly at this time.
Posted Mar 08, 2022 - 11:49 PST

Update

We are working with our providers to correct the root cause. We believe the cause is upstream of our service, and our providers are working to determine and correct it. In the interim, we are implementing a set of remediations to work around the issue. As the service comes back up, some functionality will be intentionally disabled, namely typing events and message acknowledgement.
Posted Mar 08, 2022 - 11:43 PST

Update

While we continue to investigate the root cause, work has begun on restoring service by working around the issue. Oncall Engineering will begin allowing more traffic through as we restore service.
Posted Mar 08, 2022 - 11:29 PST

Update

Oncall Engineering continues to investigate the root cause of this issue. We have engaged our partners and are preparing contingencies to restore service.
Posted Mar 08, 2022 - 10:59 PST

Update

We are continuing to investigate the issue impacting the API to find the root cause.
Posted Mar 08, 2022 - 10:29 PST

Investigating

While we were monitoring this issue, a new issue occurred, causing a major outage of the API. Oncall Engineering is working to correct this situation.
Posted Mar 08, 2022 - 10:12 PST

Update

As part of recovery, the root cause was also detected in our streaming service. A controlled restart of this service was performed, which would have caused a temporary disruption of streaming. Streaming should be operating correctly at this time.
Posted Mar 08, 2022 - 10:08 PST

Monitoring

Remediations appear to have restored service to normal operation. Oncall Engineering will monitor for full recovery.
Posted Mar 08, 2022 - 09:54 PST

Identified

The root cause has been determined, and remediations have been executed to restore service.
Posted Mar 08, 2022 - 09:53 PST

Investigating

We are currently investigating an increase in API Errors and Push Notification Errors.
Posted Mar 08, 2022 - 09:16 PST
This incident affected: API and Push Notifications.