Relayer in deadlock

Hello support,

Our relayer for Arbitrum Goerli is stuck, and using replaceTransactionByNonce is not working properly. The relayer seems to be in some kind of deadlock: judging by the output the Defender API returns when replacing transactions by nonce, the transactions shown as pending in Defender have actually failed.

Would it be possible to look at this?
Relayer: 762bd006-c7a6-427a-98b8-d4c06345117b

Kind regards

Any news on this one?

Kind regards

Support...

It's been a week and there is still no response or solution?
I would appreciate at least a status update.

Kind regards,

Simon

Hi @Archethect_Archethec - sorry for this issue. We identified a problem in Arbitrum Goerli where providers return an error/revert on the estimate gas call we use for validation, even though the transaction would not actually revert. We have implemented a fix and it should be in production today. Please let us know if you are still unable to replace transactions after this fix is in place.

Hi @dan_oz ,

Yes, it seems to work, although each autotask now has timeout errors (this almost never happened before).

Kind regards,

Simon

Sorry to hear that! If you send more details on the Autotask we can investigate.

FWIW we've had many issues with the Arbitrum Goerli JSON RPC Provider, so there may be some latency issues there. Unfortunately, with Arbitrum, other providers (such as Infura) just forward requests to the public node, so it is somewhat of a single point of failure.

@dan_oz ,

This is the relayer: https://defender.openzeppelin.com/#/relay/762bd006-c7a6-427a-98b8-d4c06345117b
I restarted 2 autotasks that had been auto-paused because of too many failures, but they just don't go through. At the moment it's simply unworkable on the Arbitrum Goerli testnet. All these autotasks work on Arbitrum One, but if we can't use autotasks on testnet anymore, we'll be forced to move to other solutions, I'm afraid. FWIW, our own AWS-based cron scheduler has no issues, and we're using Alchemy as the RPC endpoint.

Kind regards

@dan_oz or anybody else?
Sorry to say it but we're thinking about moving away from Defender if this continues :frowning:

Hi @Archethect_Archethec -

Sorry for the delay. As I mentioned, we've had a number of issues with Arbitrum Goerli JSON RPC Providers that we've been working through. We made a release just now that should help improve performance on Arbitrum networks.

Your Relayer had a failed transaction that in all likelihood was the result of too low a gas limit being set (this has been one of the main issues we've had with Arbitrum JSON RPC Providers). This failed transaction caused the Relayer to stall and blocked future transactions. One of the optimizations we made with the release today is that a failed transaction should no longer clog the Relayer - rather, the Relayer should replace the tx with a NOOP so that it can continue to process future transactions.

I've resynced your Relayer with the nonce on chain, so it should be able to process new transactions again. One piece of guidance is to ensure gasLimit is set sufficiently high on all Arbitrum networks. If gasLimit is not provided by the caller, we fall back to using eth_estimateGas, and this call has been very problematic on these networks - in some cases throwing errors and in others providing too low a gas estimate.

Hope that is helpful

Hi @dan_oz ,

Thanks for the explanation. This actually makes sense, as we were not setting the gasLimit ourselves and relied on Defender for it. Thanks to this insight I changed it to 1.2x the estimated gas, and it seems to work flawlessly for now.
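For anyone hitting the same problem, the 1.2x buffer can be sketched roughly like this. This is only an illustration, not Defender's own logic: `withGasBuffer` and `bufferPercent` are hypothetical names, and the estimate is assumed to come from your own eth_estimateGas call.

```typescript
// Hypothetical helper: pad a gas estimate by a percentage before passing it
// as gasLimit. The estimate itself is assumed to come from your own
// eth_estimateGas call; nothing here talks to Defender or a node.
function withGasBuffer(estimatedGas: number, bufferPercent: number = 20): number {
  if (!Number.isSafeInteger(estimatedGas) || estimatedGas <= 0) {
    throw new Error("estimatedGas must be a positive integer");
  }
  if (!Number.isInteger(bufferPercent) || bufferPercent < 0) {
    throw new Error("bufferPercent must be a non-negative integer");
  }
  // Multiply before dividing, then round up so the limit never ends up
  // below the padded estimate.
  return Math.ceil((estimatedGas * (100 + bufferPercent)) / 100);
}

// Example: a 100,000 gas estimate padded by the default 20%.
const paddedGasLimit = withGasBuffer(100000);
```

The padded value would then be passed as gasLimit when sending the transaction, so the relayer does not have to fall back to its own estimate.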

Just one thing though: it was not clear to me that this was the exact error. Maybe it would be good to propagate such errors to the autotask dashboard somewhere?

Kind regards

That makes sense as a feature, but with our current architecture it's tricky. When you send a Relayer transaction, Defender performs some internal validation and, if it passes, the transaction enters a queue. The sendTransaction call then receives a 200 response - the transaction was submitted successfully. The Autotask may well terminate at this point if there is no further code to execute.

Another process then picks up the transaction and actually sends it to the JSON RPC Provider, at which point this error is received. This architecture has a lot of benefits around separating concerns, allowing for high throughput, and handling errors internally.
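As a rough illustration of why the error cannot reach the caller, here is a toy model of the two stages. It is a sketch under assumptions, not our actual implementation: `submit`, `drain`, and the `Tx` shape are hypothetical, and the only validation shown is the intrinsic gas check (21000 for a plain transfer).

```typescript
// Toy model of the two-stage flow. All names here are hypothetical.
type Tx = { to: string; gasLimit: number };
type SubmitResult = { status: number; queued: boolean };

const INTRINSIC_GAS = 21000; // base cost of a plain transfer

// Stage 1: validate and enqueue. The caller gets its response here,
// before the transaction has been sent to any node.
function submit(tx: Tx, queue: Tx[]): SubmitResult {
  if (tx.gasLimit < INTRINSIC_GAS) {
    return { status: 400, queued: false };
  }
  queue.push(tx);
  return { status: 200, queued: true };
}

// Stage 2: a separate worker drains the queue and sends each transaction
// to the provider. Provider errors only surface here, after the caller
// has already received its 200.
function drain(queue: Tx[], sendToProvider: (tx: Tx) => void): string[] {
  const errors: string[] = [];
  while (queue.length > 0) {
    const tx = queue.shift() as Tx;
    try {
      sendToProvider(tx);
    } catch (e) {
      errors.push(e instanceof Error ? e.message : String(e));
    }
  }
  return errors;
}
```

In this model a transaction with a gas limit of 50000 passes stage 1 and gets a 200, yet a sequencer that needs more gas can still reject it in stage 2, where the original caller is no longer listening.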

With all of our networks aside from rollups, this works great because our internal validation is identical to the validation the node would perform (checking sufficient balance, gas limit above intrinsic gas limit, etc.). With rollups, it's trickier because the sequencer can reject the transaction for a reason we did not anticipate, such as this gas issue. And of course on Arbitrum the gas requirements are constantly changing due to L1 fees - plus the nodes are not always reliable in estimating gas in the first place.

At one time we were running eth_estimateGas for Arbitrum with the gas parameter in our internal validation as a sort of light simulation, but that was throwing errors for valid transactions. I think this would ultimately be the best way for us to give feedback on the send transaction call, but we would need the JSON RPC Provider to give the correct response on these calls, and thus far it's been spotty.

At least for the moment, we want to prevent a Relayer from getting stuck in one of these cases, so the solution is to retry the transaction up to our max number of retries (50), at which point it is replaced with a NOOP and mined, and the Relayer can then continue to process further transactions.
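The retry-then-NOOP behaviour can be sketched as below. Again this is a simplified model, not our production code: `processTx` and `trySend` are hypothetical names, and treating the limit of 50 as the total number of send attempts is an assumption.

```typescript
// Simplified model of "retry up to 50 times, then replace with a NOOP".
// MAX_RETRIES comes from the post above; everything else is hypothetical.
const MAX_RETRIES = 50;

type Outcome =
  | { kind: "mined"; attempts: number }
  | { kind: "noop"; attempts: number };

// trySend returns true when the transaction is accepted and mined,
// and false when the provider or sequencer rejects it.
function processTx(trySend: () => boolean, maxRetries: number = MAX_RETRIES): Outcome {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    if (trySend()) {
      return { kind: "mined", attempts: attempt };
    }
  }
  // Out of retries: the nonce is consumed by a NOOP replacement so that
  // later transactions are no longer blocked behind this one.
  return { kind: "noop", attempts: maxRetries };
}
```

The point of the NOOP is that it uses up the stuck nonce, so the queue behind it can move again even though the original transaction never landed.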