Analysis of delayed relayer transactions on 2022-06-15

Hey Defender users! Today (2022-06-15), for a period of about 2 hours, the Defender relayers experienced a significant delay in sending transactions. The problem is now fixed, and we share our analysis of the issue here.

At approximately 6:40pm UTC we were notified of delays in transaction processing. Further inspection revealed that the main queue for transaction delivery was becoming clogged due to errors in the handler function that picks up requests to submit transactions and delivers them to the multiple infrastructure providers we use. These errors started occurring at 6pm UTC, which is when the queue began to fill up.

The queue consumer works by submitting each transaction request to every provider we have registered for that transaction's network. This gives us a guarantee that, if a provider starts failing or becomes disconnected from the chain, the transactions still go through another provider. Because of this broadcast-like behaviour, the handler has specific logic to ignore any errors related to duplicated transactions. Other errors are retried a few times before the transaction request is discarded; we have additional logic to re-send the transaction if we don't see it on the blockchain after a while.
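To make the broadcast behaviour concrete, here is a minimal sketch in Python. It is not the actual Defender handler: `ProviderError`, `DUPLICATE_MARKERS`, `broadcast`, and the provider callables are hypothetical names, the duplicate marker text is only an example, and retries are shown as a simple in-process loop rather than as re-enqueued queue messages.

```python
from typing import Callable, Dict


class ProviderError(Exception):
    """Hypothetical error raised by a provider client when a send fails."""


# Assumption: duplicates are recognised by matching known message fragments.
# The real marker strings differ per provider and are not given in this post.
DUPLICATE_MARKERS = ("already known",)


def is_duplicate(err: Exception) -> bool:
    """Return True if the error message looks like a duplicate-transaction report."""
    message = str(err).lower()
    return any(marker in message for marker in DUPLICATE_MARKERS)


def broadcast(
    raw_tx: str,
    providers: Dict[str, Callable[[str], None]],
    max_attempts: int = 3,
) -> Dict[str, str]:
    """Send raw_tx to every provider and return the outcome per provider."""
    outcomes: Dict[str, str] = {}
    for name, send in providers.items():
        outcome = "failed"  # kept if every attempt raises a non-duplicate error
        for _ in range(max_attempts):
            try:
                send(raw_tx)
                outcome = "sent"
                break
            except ProviderError as err:
                if is_duplicate(err):
                    # Another provider already delivered it: ignore, do not retry.
                    outcome = "duplicate"
                    break
                # Any other error: fall through and try again.
        outcomes[name] = outcome
    return outcomes


# Usage with three fake providers (all hypothetical).
def healthy(_tx: str) -> None:
    return None


def already_has_it(_tx: str) -> None:
    raise ProviderError("already known")


def unreachable(_tx: str) -> None:
    raise ProviderError("connection reset")


result = broadcast(
    "0xabc", {"a": healthy, "b": already_has_it, "c": unreachable}
)
print(result)
```

In this sketch a duplicate report ends processing for that provider immediately, while any other error consumes the retry budget. That distinction is what the handler depends on.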

However, at 6pm UTC, one of the providers we rely on went through an upgrade that changed how duplicated transactions were reported. The handler interpreted the new error format as a provider error and started retrying each of those transactions instead of discarding them. As a result, a significant number of transaction requests were re-enqueued, and the growing number of in-flight messages in the queue delayed transaction delivery.
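The failure mode can be illustrated with a small, self-contained Python sketch. The error strings here are invented for illustration; the post does not state the provider's real old or new message, nor how the hotfix matched it. `classify` is a hypothetical helper, not the production code.

```python
from typing import Tuple


def classify(message: str, duplicate_markers: Tuple[str, ...]) -> str:
    """Return 'discard' for duplicate-transaction reports, 'retry' otherwise."""
    text = message.lower()
    if any(marker in text for marker in duplicate_markers):
        return "discard"
    return "retry"


# Hypothetical messages: the format before and after the provider's upgrade.
old_format = "already known"
new_format = "Transaction already exists"

# Markers the handler knew about before the incident (assumed).
markers_before = ("already known",)
# Markers after a hotfix that also recognises the new wording (assumed).
markers_after = markers_before + ("transaction already exists",)

before_old = classify(old_format, markers_before)
before_new = classify(new_format, markers_before)
after_new = classify(new_format, markers_after)

print(before_old)  # discard
print(before_new)  # retry: the duplicate is mistaken for a provider failure
print(after_new)   # discard
```

Because every broadcast produces duplicate reports from all but the first provider to accept a transaction, a misclassification like the middle case turns routine, ignorable responses into retries.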

At 8:15pm UTC, we rolled out a hotfix to handle this new error format, which caused the queue to empty out almost immediately.

Screenshot from 2022-06-15 18-10-55

Moving forward, we'll introduce additional alerting to notify us of similar situations by monitoring queue sizes and new error messages from providers, and we'll make the queue handler's retries less aggressive. We apologize for any inconvenience this delay may have caused.