Hello all,
Many of you experienced issues with Relayers over the course of the last week. We know Relayer performance is business critical, and an outage like this is something we take very seriously. Below we account for the events that led to the outage and describe the mitigation measures we have already taken or plan to take in the near future to reduce the likelihood of recurrence.
October 7
Starting between 8:00 and 9:00 UTC, Polygon Relayer users began receiving periodic 504 responses (gateway timeout) on Relayer send transaction calls. We traced these errors to degraded performance in Infura's Polygon API, where calls either did not respond or responded with very high latency, causing Defender's API Gateway to time out. Many calls still processed successfully during this period, but we ultimately decided to disable this provider at 18:41 UTC, which provided a short-term resolution (mitigation #1). Defender monitoring triggered during this period, but there was excessive noise, partly due to legacy resources on recently deprecated networks. See mitigations #1, #3, and #6 below.
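For illustration, here is a minimal sketch of the kind of timeout-and-fallback behavior described above. This is not Defender's actual code; the provider URLs, timeout value, and function names are placeholders.

```typescript
// Minimal sketch (illustrative only): call a JSON-RPC provider with a hard
// timeout and fall back to a secondary provider if it stalls or errors.

type JsonRpcRequest = { jsonrpc: '2.0'; id: number; method: string; params: unknown[] };

async function callWithTimeout(url: string, req: JsonRpcRequest, timeoutMs: number): Promise<unknown> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(req),
      signal: controller.signal, // abort if the provider does not respond in time
    });
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return await res.json();
  } finally {
    clearTimeout(timer);
  }
}

// Try the primary provider first; if it times out or errors, fall back.
async function sendRpc(req: JsonRpcRequest): Promise<unknown> {
  const providers = ['https://primary.example/rpc', 'https://fallback.example/rpc'];
  let lastError: unknown;
  for (const url of providers) {
    try {
      return await callWithTimeout(url, req, 5_000); // keep well under the gateway timeout
    } catch (err) {
      lastError = err; // degraded provider: move on to the next one
    }
  }
  throw lastError;
}
```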
October 9
At approximately 7:40 UTC, Polygon Relayer users across multiple tenants again received 504 responses on send transaction calls. These were caused by delayed responses from a different provider (Alchemy), which led to the same symptoms reported above for October 7. Again, Defender monitoring triggered, but the alerts were not immediately identified by the support team due to excessive noise. See mitigations #3 and #6.
Simultaneously, in an unrelated incident starting at approximately 8:00 UTC, a Relayer on the Fuse network began throwing errors via a Defender queue shared across all production networks. The failed messages were returned to the queue and re-invoked, creating a loop in which the queue filled with transactions far faster than they could be processed, which vastly slowed processing for Relayers across all networks. The intended purpose of this queue is simply to limit transaction throughput; Defender implements a separate queuing mechanism, isolated per Relayer, to handle the business logic.
As a temporary measure, we manually purged the affected queue (Defender's alternate queuing mechanism picks up any pending Relayer transactions) and patched what we believed to be the code logic causing the retries. The vast majority of clogged transactions processed over the course of the following hour. Defender did not have monitoring or alerts configured for this kind of queue overload. Some internal monitoring alerts did fire during this time, but the severity of the incident was not clear. See mitigations #2, #4, and #5.
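To illustrate the kind of retry restriction referred to above, the sketch below caps how many times a failed message can be returned to a shared queue before it is parked. The interfaces, handler names, and attempt limit are hypothetical; Defender's actual queue implementation differs.

```typescript
// Minimal sketch (assumed logic, not Defender's implementation): cap how many
// times a failed message can go back onto a shared throughput queue, so a
// single failing network cannot flood it indefinitely.

interface QueueMessage {
  relayerId: string;
  payload: unknown;
  attempts: number; // how many times this message has already been processed
}

const MAX_ATTEMPTS = 3; // placeholder limit

async function handleMessage(
  msg: QueueMessage,
  process: (m: QueueMessage) => Promise<void>,
  requeue: (m: QueueMessage) => Promise<void>,
  deadLetter: (m: QueueMessage) => Promise<void>,
): Promise<void> {
  try {
    await process(msg);
  } catch {
    if (msg.attempts + 1 >= MAX_ATTEMPTS) {
      // Stop retrying: park the message instead of re-queueing it forever.
      await deadLetter(msg);
    } else {
      await requeue({ ...msg, attempts: msg.attempts + 1 });
    }
  }
}
```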
From 10:00 UTC to 12:00 UTC, Polygon experienced a significant spike in gas prices. Defender has transaction replacement logic to adjust gas prices in such a circumstance. However, because these transactions were processing extremely slowly (see directly above), their gas prices had been calculated at an earlier point in time, prior to the spike. Many transactions remained in the mempool priced too low until gas prices dropped back far enough for them to be mined.
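The following sketch illustrates the general idea of replacement pricing: read the current gas price immediately before broadcasting, and outbid a stuck transaction by the roughly 10% minimum bump most nodes require. The types, names, and bump percentage are assumptions for illustration, not Defender's implementation.

```typescript
// Minimal sketch (illustrative only): price a transaction at send time rather
// than reusing a gas price computed when it entered the queue, and bump a
// stuck transaction's price enough to replace it in the mempool.

interface PendingTx {
  nonce: number;
  gasPrice: bigint; // price the tx is currently sitting in the mempool with
}

const REPLACEMENT_BUMP_NUM = 110n; // ~10% bump, typically required to replace a tx
const REPLACEMENT_BUMP_DEN = 100n;

function priceForSend(currentGasPrice: bigint, stuck?: PendingTx): bigint {
  // Always start from the price reported by the network *now*, not the price
  // observed when the transaction was queued.
  if (!stuck) return currentGasPrice;
  // To replace a pending tx with the same nonce, outbid it by the minimum bump.
  const minReplacement = (stuck.gasPrice * REPLACEMENT_BUMP_NUM) / REPLACEMENT_BUMP_DEN;
  return currentGasPrice > minReplacement ? currentGasPrice : minReplacement;
}
```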
October 10
Relayer transactions again began to process at very slow speeds across multiple networks. The Fuse Relayer issue recurred, which indicated that the October 9 patch did not solve the queue overload problem. The queue was manually purged again, and a further patch was applied to correct the logic and avoid future queue overloads. See mitigation #5.
Mitigations (timeline):
1. Disable Infura Polygon JSON-RPC provider during degradation
2. Manually purge clogged queue to increase transaction throughput
3. Shorten timeout for JSON-RPC calls (in progress; partially available in next week's release and fully available by end of October)
4. Alter transaction handler logic and restrict retries to prevent unnecessary re-queueing of transactions (in progress; partially available in next week's release and fully available by end of October)
5. Implement monitoring for excessive pending transactions (already implemented; see the sketch below)
6. Improve monitoring for provider failures (ongoing; this is our top site reliability priority and will be implemented over the course of the next 2 months)
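As an example of what monitoring for excessive pending transactions can look like (item #5), the sketch below compares a Relayer address's pending and latest nonces via the standard eth_getTransactionCount JSON-RPC call and flags the address when the gap exceeds a threshold. The RPC URL, address, threshold, and function names are illustrative only and do not reflect Defender's internal monitoring.

```typescript
// Illustrative sketch only: flag a Relayer address with too many pending
// (not-yet-mined) transactions by comparing its "pending" and "latest" nonces.

async function getNonce(rpcUrl: string, address: string, blockTag: 'pending' | 'latest'): Promise<number> {
  const res = await fetch(rpcUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      jsonrpc: '2.0',
      id: 1,
      method: 'eth_getTransactionCount', // standard JSON-RPC method
      params: [address, blockTag],
    }),
  });
  const { result } = (await res.json()) as { result: string };
  return parseInt(result, 16); // nonces are returned as hex strings
}

async function checkPendingBacklog(rpcUrl: string, address: string, threshold = 20): Promise<void> {
  const [pending, latest] = await Promise.all([
    getNonce(rpcUrl, address, 'pending'),
    getNonce(rpcUrl, address, 'latest'),
  ]);
  const backlog = pending - latest; // transactions known to the node but not yet mined
  if (backlog > threshold) {
    console.warn(`Relayer ${address} has ${backlog} pending transactions`);
    // In a real system this would page an on-call engineer instead of logging.
  }
}
```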