Iterating over elements on `EnumerableSet` in OpenZeppelin Contracts

I'm a bit concerned about the get() function in EnumerableSet.

It's purpose, along with length() is to allow retrieving all elements in the set when the set grows too big and enumerate() runs out of gas.

As pointed out in the code comments, the order of the elements inside a set is not guaranteed. More specifically each time an element is removed its position is occupied by the one with the highest index.

Looping through all the elements in the set using length() and get() is not guaranteed to return all the elements, but knowing whether this happened or not will be very difficult for the developer.

I'm very concerned about developers thinking that iterating over an EnumerableSet is fine, and that leading to bugs, vulnerabilities, and a lot of frustration.

I can see two alternatives to enumerate() when the set grows too big for a block.

  1. Implement a state machine, and lock down the set while being iterated.
  2. Encourage developers to emit events when changing the set, and keeping track of the set contents off-chain.

The first approach is the one I would take when I need to iterate over large data sets to do a state change for each set element. For example when distributing dividends for all token holders.

The second approach is the one I would take when I need to reveal the contents of the set to the front end.

I would leave length() in EnumerableSet.sol, but I would remove get() after pointing out the limitations carefully.

4 Likes

Hi @albertocuestacanada,

I am new to EnumerableSet, so apologies if I am missing something obvious.
I understand that there is no guarantee on ordering but why are length() and get() not guaranteed to return all the elements?

Hi @abcoathup.

It’s not an obvious issue, that’s why I am worried.

Under the hood, EnumerableSet.AddressSet is an array of addresses. To make it work like a set we also keep a mapping(address => uint256) that tells us in which position of the array we can find each address.

That way we can query if an address is in the array with O(1), just by checking in the mapping.

We can also retrieve all the addresses in the set, by returning the array.

The issue is that for simplicity and efficiency when an address is removed from the set, what we do is to move the last address from the array to the position that was occupied by the removed address. Then we update the mapping which acts as an index.

length() works fine, it will return how many addresses are in the set.

get(uint256 i) will return the address from the set at the i position in the array.

The issue is that if someone decides to combine those two to retrieve all elements in the set he might not get all of them. Consider the pseudocode below:

for (i = 0; i < set.length(); i++) {
    results.append(set.get(i));
}

We can’t control who is doing what to the set while we are retrieving the elements. If our set is large our calls to get(i) will most likely span a number of blocks.

If by the time we are executing get(1000) someone removes an address whose index was 999, for example, the last address of the array will move to position 999.

That means that the person executing the loop above will never know which address was at the last position, because it was moved around. He doesn’t even have a way of knowing how many elements he has missed, or if he has missed any at all.

Using enumerate() is fine, because it’s just one transaction. However, it is O(N) so it will run out of gas for large sets.

EnumerateSet.AddressSet is quite useful (I’ve been using it quite a lot lately), but needs to be used carefully. If you plan to have really large sets and really need to retrieve them whole you should either construct a copy of the set off-chain from events, or lock the set down while you are retrieving the positions one by one.

2 Likes

Hi @albertocuestacanada,

Thanks for the detailed explanation. I wasn’t thinking about someone changing the set whilst we enumerated large sets.

My initial thought would be to keep get but document its use, and either emit events or encourage developers to do so.

Would be great to get @nventuro’s input when they are back from vacation :desert_island:

Thanks for creating this @albertocuestacanada!

You are indeed correct, performing multiple calls over a span of blocks can lead to issues when querying an EnumerableSet's values.

However, this is true for all blockchain queries: any set of calls that look at blockchain state (be it Ether balancers, view functions or any other piece of state) should always be performed on the same block. It’d be an error to, for example, display the Ether balance of a number of accounts, and not query those balances at the same point in time (the same block).

web3 supports passing a block number to call (though this seems to be undocumented), which lets you implement this pattern. Your application would likely have an ‘update state’ function that first queries the latest block number (web3.eth.getBlock('latest')), and then performs a series of queries (myContract.methods.myMethod.call()) using that same block number.

Note that uncle blocks and reorgs may still cause issues, but these are harder to solve :stuck_out_tongue:

4 Likes

I had no idea you could query the state for a specific block using web3. That’s really cool.

Using block-specific calls my concerns disappear, I would only mention that in the documentation. It seems it was documented in the web3 docs a couple of weeks ago as well.

3 Likes

hi, is there any more example about this library in those different scenarios? Thanks.