Post-mortem: Defender Relay partial service disruption on November 5th, 2021

On November 5th 2021, at 10:37am UTC, we received a user report that their Defender relayers were not working as expected. Upon investigation, we realised the Relay service was partially disrupted and many other relayers across teams were affected to some extent. By 1:49pm UTC (slightly over 3 hours after that initial report), we had stabilised the most critically affected relayers in the system. By 5:03pm UTC (slightly under 7 hours after the initial report), we had introduced a number of enhancements to prevent future recurrence.

In this post, we share details on what happened, how we fixed it, and what measures we’ve taken to make sure that this doesn’t happen again in the future.

Defender Relay

Defender Relay is a service that allows users to send transactions via a regular HTTPS API or from Autotasks, while solving a number of complex problems related to this task: private key secure storage, transaction signing, nonce management, gas price estimation, resubmissions, and more.

Users consume the service by creating and managing relayers, which can be understood as the combination of an EOA (externally owned account) stored in a secure key-management system and a high-level API to send transactions through it.

Under the hood, each relayer is associated with a private key. When a request to send a transaction is received, the service validates the request, atomically assigns it a nonce, reserves balance for paying for its gas fees, resolves its speed to a gas price, signs it with its private key, and enqueues it for submission to the blockchain. The response is sent back to the client only after this process has finished.
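
To make this flow concrete, here is a minimal TypeScript sketch of such a send path. Every type and helper name below (SendTxRequest, assignNonce, reserveBalance, and so on) is an assumption for illustration, not Defender’s actual API, and the in-memory maps stand in for the real key-management and persistence layers.

    // Hypothetical sketch of the relayer send path: validate, assign a nonce
    // atomically, reserve balance for gas, resolve speed to a gas price, sign,
    // and enqueue. All names and values are illustrative, not Defender's API.
    type Speed = 'safeLow' | 'average' | 'fast' | 'fastest';

    interface SendTxRequest {
      relayerId: string;
      to: string;
      data: string;
      gasLimit: bigint;
      speed: Speed;
    }

    interface SignedTx extends SendTxRequest {
      nonce: number;
      gasPrice: bigint;
      rawTx: string;
    }

    // In-memory stand-ins for per-relayer nonce counters, balances, and the queue.
    const nonces = new Map<string, number>();
    const balances = new Map<string, bigint>();
    const submissionQueue: SignedTx[] = [];

    function validateRequest(req: SendTxRequest): void {
      if (req.gasLimit <= 0n) throw new Error('invalid gas limit');
    }

    function assignNonce(relayerId: string): number {
      const next = nonces.get(relayerId) ?? 0;
      nonces.set(relayerId, next + 1);
      return next;
    }

    function reserveBalance(relayerId: string, maxCost: bigint): void {
      const balance = balances.get(relayerId) ?? 0n;
      if (balance < maxCost) throw new Error('insufficient funds to cover gas');
      balances.set(relayerId, balance - maxCost);
    }

    function resolveGasPrice(speed: Speed): bigint {
      // A real implementation would query a gas price oracle; fixed values here.
      const table: Record<Speed, bigint> = {
        safeLow: 1_000_000_000n,
        average: 2_000_000_000n,
        fast: 5_000_000_000n,
        fastest: 10_000_000_000n,
      };
      return table[speed];
    }

    function signWithRelayerKey(req: SendTxRequest, nonce: number, gasPrice: bigint): string {
      // Placeholder: the real private key never leaves the key-management system.
      return `0xsigned(${req.relayerId},${nonce},${gasPrice})`;
    }

    function sendTransaction(req: SendTxRequest): SignedTx {
      validateRequest(req);
      const nonce = assignNonce(req.relayerId);
      const gasPrice = resolveGasPrice(req.speed);
      reserveBalance(req.relayerId, req.gasLimit * gasPrice);
      const rawTx = signWithRelayerKey(req, nonce, gasPrice);
      const tx: SignedTx = { ...req, nonce, gasPrice, rawTx };
      submissionQueue.push(tx); // a sender worker later submits it to the network
      return tx;                // the client only gets a response after all steps succeed
    }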

Every minute, the system checks every relayer’s in-flight transactions. If a transaction has not been mined and more than a certain amount of time has passed (which depends on the transaction speed), its gas price is adjusted and it is resubmitted. On the other hand, if the transaction has been mined, it is still monitored for several blocks, after which it is confirmed.
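
The per-minute check can be pictured roughly as follows; the resubmission thresholds and confirmation depth are assumed values for illustration, not Defender’s actual parameters.

    // Illustrative sketch of the per-minute in-flight check. Thresholds and the
    // confirmation depth are assumed values, not Defender's actual parameters.
    type TxSpeed = 'safeLow' | 'average' | 'fast' | 'fastest';

    interface InFlightTx {
      hash: string;
      speed: TxSpeed;
      lastSubmittedAt: number; // ms epoch of the last (re)submission
      minedAt?: number;        // set once the tx appears in a block
      confirmations: number;
    }

    // How long to wait before bumping the gas price, per speed (assumed values).
    const RESUBMIT_AFTER_MS: Record<TxSpeed, number> = {
      safeLow: 10 * 60_000,
      average: 5 * 60_000,
      fast: 2 * 60_000,
      fastest: 60_000,
    };

    const CONFIRMATION_BLOCKS = 12; // assumed confirmation depth

    function nextAction(tx: InFlightTx, now: number): 'confirm' | 'resubmit' | 'wait' {
      if (tx.minedAt !== undefined) {
        // Mined: keep monitoring until the tx is deep enough to be confirmed.
        return tx.confirmations >= CONFIRMATION_BLOCKS ? 'confirm' : 'wait';
      }
      // Not mined: bump the gas price and resubmit once the speed-dependent delay passes.
      return now - tx.lastSubmittedAt > RESUBMIT_AFTER_MS[tx.speed] ? 'resubmit' : 'wait';
    }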

The architecture of Defender Relay is designed to be resilient to high load peaks while optimizing resource consumption. A key component of this architecture is a global transaction queue that decouples transaction requests from transaction processing. Let’s briefly review the journey of a relayer TX to understand the role of this global transaction queue in the system. All Defender Relay transactions coming to every relayer, on every mainnet and testnet in the system, go roughly through the following steps (sketched in code after the list):

  1. Client code creates the TX to be sent through the relayer R1.
  2. Defender puts the TX into the global transaction queue.
  3. A transaction sender worker gets the TX from the queue, and attempts to send it to the corresponding network via one or many providers. If the attempt to send fails, the worker enqueues the TX again. This is retried multiple times.
  4. A periodic process determines which relayers have in-flight transactions and checks them for status updates. In this case, it finds that the TX has not yet been confirmed, so it triggers a status update on relayer R1.
  5. For each TX, the process continues until there’s evidence that it reached a terminal state (confirmed or failed).
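
A sender worker along the lines of steps 2 and 3 could look roughly like the sketch below; the queue and provider interfaces are assumptions, not Defender’s actual implementation.

    // Hypothetical sketch of a transaction sender worker (steps 2-3 above):
    // take a TX from the global queue, try each provider for its network, and
    // re-enqueue the TX if every attempt fails. Names are illustrative only.
    interface QueuedTx {
      rawTx: string;
      network: string; // e.g. 'mainnet', 'rinkeby'
    }

    interface Queue<T> {
      dequeue(): Promise<T | undefined>;
      enqueue(item: T): Promise<void>;
    }

    interface Provider {
      sendRawTransaction(rawTx: string): Promise<string>; // returns the tx hash
    }

    async function senderWorker(
      queue: Queue<QueuedTx>,
      providersByNetwork: Map<string, Provider[]>,
    ): Promise<void> {
      while (true) {
        const tx = await queue.dequeue();
        if (!tx) {
          // Nothing to do right now; back off briefly before polling again.
          await new Promise<void>((resolve) => setTimeout(resolve, 1000));
          continue;
        }
        const providers = providersByNetwork.get(tx.network) ?? [];
        let sent = false;
        for (const provider of providers) {
          try {
            await provider.sendRawTransaction(tx.rawTx);
            sent = true;
            break; // one successful submission is enough
          } catch {
            // Try the next provider for this network.
          }
        }
        if (!sent) {
          // Transient failure (e.g. a network outage): put it back for a later retry.
          await queue.enqueue(tx);
        }
      }
    }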

Rinkeby outage and its effects on Defender Relay

On the day of the service disruption, November 5th 2021, the Rinkeby testnet suffered an outage that lasted several hours. During this time, every single attempt to submit transactions from Defender relayers to Rinkeby failed. This incident in turn caused other relayers across all Defender networks to fail, including mainnet ones. So how did this happen?

As mentioned above, when Defender fails to send a transaction due to what looks like a transient error, it re-enqueues it so it eventually gets retried (note that a network outage, as is the case in this incident, usually qualifies as a transient error).

At the time of the incident, Defender was set up to re-enqueue each TX 5 times before moving it to a dead letter queue. So, as a result of Rinkeby’s prolonged outage and Defender’s retry policy, every Rinkeby transaction introduced (at least) a 5X load pressure for many hours on the system.
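
Under that policy, the retry decision for a failed send looks roughly like this sketch; the attempt counter and the dead letter queue wiring are assumptions for illustration.

    // Sketch of the retry policy in place at the time: re-enqueue a failed TX
    // up to 5 times, then move it to a dead letter queue. Illustrative only.
    const MAX_RETRIES = 5;

    interface FailedTx {
      rawTx: string;
      attempts: number; // how many sends have already failed
    }

    function routeFailedTx(
      tx: FailedTx,
      enqueue: (tx: FailedTx) => void,
      deadLetter: (tx: FailedTx) => void,
    ): void {
      if (tx.attempts < MAX_RETRIES) {
        // During a prolonged outage every attempt fails, so each TX gets processed
        // up to MAX_RETRIES extra times: the "(at least) 5X load" described above.
        enqueue({ ...tx, attempts: tx.attempts + 1 });
      } else {
        deadLetter(tx);
      }
    }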

But this wasn’t the only additional pressure in play: enter transaction replacements. Defender lets users specify a validUntil attribute on their transactions. When a transaction has not been mined by the validUntil time provided by the user, Defender will try to replace it with a NOOP.

Rinkeby’s extended outage meant that every Rinkeby transaction with a validUntil of less than a few hours was being replaced by a NOOP. At the end of the day, a NOOP is a new transaction to submit, resubmit, and monitor.
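
The replacement behavior can be pictured as follows. The particular NOOP construction shown (a zero-value self-transfer reusing the same nonce, with a bumped gas price) and all helper names are assumptions for illustration.

    // Illustrative sketch of validUntil handling: once a transaction's validUntil
    // has passed without it being mined, replace it with a NOOP that reuses the
    // same nonce. The zero-value self-transfer shown here is an assumption.
    interface PendingTx {
      relayerAddress: string;
      nonce: number;
      gasPrice: bigint;
      validUntil: number; // ms epoch
      mined: boolean;
    }

    interface ReplacementTx {
      to: string;
      value: bigint;
      data: string;
      nonce: number;
      gasPrice: bigint;
    }

    function maybeReplaceWithNoop(tx: PendingTx, now: number): ReplacementTx | undefined {
      if (tx.mined || now < tx.validUntil) return undefined;
      // The NOOP is itself a new transaction, so it still has to be submitted,
      // resubmitted, and monitored like any other TX.
      return {
        to: tx.relayerAddress,               // send to the relayer itself
        value: 0n,
        data: '0x',
        nonce: tx.nonce,                     // same nonce, so it replaces the original
        gasPrice: tx.gasPrice * 110n / 100n, // assumed 10% bump to be accepted as a replacement
      };
    }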

All these factors together, extended for hours, resulted in a system overload sizable enough that the workers could no longer cope with the rate at which new transactions came in.

Overload-induced timeouts in the periodic process described in step 4 of the TX journey explained above made the problem worse, since that meant some relayers weren’t getting their transaction statuses checked.

At this point, another Defender feature started playing a role: low funds management. To optimize resources and delays, Defender calculates the total amount of gas that a relayer needs to successfully run all in-flight transactions. When the relayer’s balance goes below that total, Defender starts rejecting transactions sent by the client, replying with a low funds error. With so many transactions in-flight, the handful of relayers affected by this particular stage of the incident became unusable as they permanently entered this low funds error condition. These relayers were the most severely affected by the disruption.
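
In simplified form, and under assumed field names, the low funds check amounts to something like this:

    // Simplified sketch of the low funds check: estimate the gas cost of all
    // in-flight transactions and reject new ones once the balance cannot cover it.
    interface InFlight {
      gasLimit: bigint;
      gasPrice: bigint;
    }

    function requiredBalance(inFlight: InFlight[]): bigint {
      // Total gas the relayer needs to run every in-flight transaction.
      return inFlight.reduce((sum, tx) => sum + tx.gasLimit * tx.gasPrice, 0n);
    }

    function acceptNewTx(balance: bigint, inFlight: InFlight[]): void {
      if (balance < requiredBalance(inFlight)) {
        // With a huge in-flight backlog this condition never clears, which is
        // how some relayers got stuck permanently rejecting new transactions.
        throw new Error('low funds: relayer balance cannot cover in-flight transactions');
      }
    }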

This impacted multiple teams in the form of consistently failing relayers, which again, was particularly concerning in the case of mainnet transactions.

Stabilization

Once we understood the situation and had a more complete picture of the forces at play, it became clear that our first and most urgent goal was to unclog all mainnet relayers, starting with the ones blocked by the low funds condition. The solution (sketched in code after the list below) was to:

  1. Pause those relayers, to prevent inconsistencies.
  2. Mark all their in-flight transactions as failed, so they stopped counting towards the low funds limit (and also to kill all the transaction lifecycle processing pressure that they added to the system).
  3. Reset their nonces.
  4. Unpause them.
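
Under hypothetical admin-level helpers, the procedure amounts to the following sketch of the sequence of operations, not a depiction of Defender’s actual tooling:

    // Sketch of the stabilization procedure, with hypothetical admin helpers.
    interface RelayerAdmin {
      pause(relayerId: string): Promise<void>;
      markInFlightAsFailed(relayerId: string): Promise<void>;
      resetNonce(relayerId: string): Promise<void>;
      unpause(relayerId: string): Promise<void>;
    }

    async function unclogRelayer(admin: RelayerAdmin, relayerId: string): Promise<void> {
      await admin.pause(relayerId);                // 1. prevent inconsistencies while we operate
      await admin.markInFlightAsFailed(relayerId); // 2. clear the backlog counted against low funds
      await admin.resetNonce(relayerId);           // 3. realign the nonce with the chain
      await admin.unpause(relayerId);              // 4. resume normal operation
    }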

With a cleaner transaction backlog, these critically affected relayers started working normally again. We could now turn our attention to the other affected relayers and to applying longer term measures.

In order to prevent cascading failures from testnets, we created a dedicated queue for mainnet transactions, and changed our transaction routing logic to select the corresponding queue based on the underlying relayer’s network. This change provides a natural isolation layer so that back pressure derived from testnet outages will not affect mainnet transactions.
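
The routing change can be pictured as selecting a queue by network class; the queue identifiers and the network classification below are assumptions for illustration.

    // Sketch of queue selection by network: mainnet traffic goes to its own queue
    // so testnet back pressure cannot delay it. Names are illustrative only.
    type NetworkClass = 'mainnet' | 'testnet';

    // Assumed classification of supported test networks.
    const TESTNETS = new Set(['rinkeby', 'ropsten', 'kovan', 'goerli', 'mumbai']);

    function classify(network: string): NetworkClass {
      return TESTNETS.has(network) ? 'testnet' : 'mainnet';
    }

    function selectQueueUrl(network: string): string {
      // Hypothetical queue identifiers; the real system's names are not shown here.
      return classify(network) === 'mainnet'
        ? 'queue://relay-transactions-mainnet'
        : 'queue://relay-transactions-testnet';
    }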

Furthermore, we identified many other system optimizations (omitted here for the sake of brevity) that yielded performance improvements, as well as improvements to system monitoring, which in turn leaves Defender Relay a much more scalable and robust system than before the incident.

Since completing the initial corrective actions we have continued to monitor the relayers and queues. The system has been working normally since.

Conclusion

This event was due to the lack of isolation between mainnet and testnet relayers. Consequently, we introduced a dedicated mainnet transaction queue to isolate mainnet transactions from testnet back pressure, and implemented additional performance and reliability improvements.

We know that transaction relaying is a hard problem. Each lesson learned while maintaining and hardening Defender Relay becomes code, which then goes on to benefit the broader community of Defender users. Through this process, we remain committed to providing a top class transaction relaying service, so the teams who rely on Defender can focus on solving their own domain specific challenges.
