Deconstructing a Solidity Contract — Part II: Creation vs. Runtime

Post originally written by @ajsantander on Medium (Aug 13, 2018)

By Alejandro Santander in collaboration with Leo Arias.

Image from commons.wikimedia.org

Note: This article is part of a series. If you haven’t read the introduction, please have a look at it first. We’re deconstructing the EVM bytecode of a simple Solidity contract.

  1. Deconstructing a Solidity Contract — Part I: Introduction :heavy_check_mark:
  2. Deconstructing a Solidity Contract — Part II: Creation vs. Runtime :arrow_left:
  3. Deconstructing a Solidity Contract — Part III: The Function Selector
  4. Deconstructing a Solidity Contract — Part IV: Function Wrappers
  5. Deconstructing a Solidity Contract — Part V: Function Bodies
  6. Deconstructing a Solidity Contract — Part VI: The Metadata Hash

Let’s get started by attacking the disassembled gibberish of our contract with our divide-and-conquer lightsaber. As we saw in the introductory article, this disassembled code is very low-level, but quite readable compared to the raw bytecode. Make sure you’ve followed along in the Introduction, and that you have the BasicToken contract compiled and deployed in Remix. Debug the creation transaction and open up the Instructions panel. Also, have the deconstruction diagram at hand while we go along.

DISCLAIMER: All instructions provided in this article are subject to my own interpretation of how things work. Please feel free to make comments below in case I need to be corrected in any way, and I’ll be sure to update the article accordingly.

For now, let’s focus on the JUMP, JUMPI, JUMPDEST, RETURN, and STOP opcodes, and ignore all others. Whenever we find an opcode that is not one of these, we will ignore it and skip to the next instruction, pretending that nothing intervened.

When the EVM executes code, it does so top down with no exceptions: there are no other entry points to the code. It always starts from the top. It can jump around, yes, and that’s exactly what JUMP and JUMPI do. JUMP takes the topmost value from the stack and moves execution to that location. The target location must contain a JUMPDEST opcode, though, otherwise execution will fail. That is the sole purpose of JUMPDEST: to mark a location as a valid jump target. JUMPI works the same way, but it is a conditional jump: it also takes the second value from the stack, and if that value is 0, there is no jump and execution simply continues with the next instruction. STOP completely halts execution of the contract, and RETURN halts execution too, but returns data from a portion of the EVM’s memory, which is handy.
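To make the jump rules concrete, here is a minimal Python sketch of how JUMP and JUMPI pick the next position. The function names, the list used as a stack (top of the stack is the last list item), and the toy code are all illustrative assumptions, not how any real EVM client is written.

```python
JUMPDEST = 0x5B

def check_destination(code: bytes, target: int) -> int:
    """A jump may only land on a JUMPDEST byte; anything else fails."""
    if target >= len(code) or code[target] != JUMPDEST:
        raise RuntimeError("invalid jump destination")
    return target

def jump(code: bytes, stack: list) -> int:
    """JUMP: pop the target and continue execution there."""
    return check_destination(code, stack.pop())

def jumpi(code: bytes, stack: list, pc: int) -> int:
    """JUMPI: pop the target, then the condition; jump only if it is nonzero."""
    target = stack.pop()       # topmost item: where to jump
    condition = stack.pop()    # second item: whether to jump
    if condition == 0:
        return pc + 1          # no jump: fall through to the next instruction
    return check_destination(code, target)

# Toy code: position 3 holds a JUMPDEST, the rest is filler (0x00 is STOP).
toy_code = bytes([0x00, 0x00, 0x00, JUMPDEST])

taken = jumpi(toy_code, [1, 3], pc=0)       # condition 1 -> lands on 3
not_taken = jumpi(toy_code, [0, 3], pc=0)   # condition 0 -> continues at 1
```

Note that the target is validated only when the jump is actually taken.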

So, let’s start interpreting the code with all of this in mind. In Remix’s debugger, move the Transaction slider all the way to the left and open up the Instructions section. You can walk through the instructions with the Step Into button (the one that looks like a little down-pointing arrow). The first instructions can be ignored, but at instruction 11 we find our first JUMPI. If it doesn’t jump, it will continue through instructions 12 to 15 and end up in a REVERT, which would halt execution. But if it does jump, it will skip these instructions and land on location 16 (hex 0x0010, which was pushed to the stack at instruction 8). Instruction 16 is a JUMPDEST. So far so good.

Keep on stepping through the opcodes until the Transaction slider is all the way to the right. A lot of blah-blah just happened, but only in location 68 do we find a RETURN opcode (and a STOP opcode in instruction 69, just in case). This is rather curious. If you think about it, the control flow of this contract will always end at instructions 15 or 68. We’ve just walked through it and determined that there are no other possible flows, so what are the remaining instructions for? (If you slide down the Instructions panel, you’ll see that the code ends at location 566).

The set of instructions we’ve just traversed (0 to 69) is what’s known as the “creation code” of a contract. It will never be a part of the contract’s code per se, but is only executed by the EVM once during the transaction that creates the contract. As we will soon discover, this piece of code is in charge of setting the created contract’s initial state, as well as returning a copy of its runtime code. The remaining 497 instructions (70 to 566), which, as we saw, will never be reached by the execution flow, hold the code that will be part of the deployed contract, followed by 32 bytes of constructor arguments that we will meet shortly.

If you open the deconstruction diagram, you should see how we’ve just made our first split: we’ve differentiated creation-time vs. runtime code.

We will now take a deep dive into the creation part of the code.

Figure 1. Deconstruction of the creation-time EVM bytecode of BasicToken.sol.

This is the most important concept to understand in this article. The creation code gets executed in a transaction, which returns a copy of the runtime code, which is the actual code of the contract. As we will see, the constructor is part of the creation code, not the runtime code, so it will not be present in the contract’s code once it is deployed.

How does this magic happen? That’s what we’ll analyze now, step by step.

Alright. So now our problem is reduced to understanding these ~70 instructions corresponding to the creation-time code.

Let’s re-take our top-down approach, this time understanding all the instructions as we go along, not skipping any of them. First, let’s focus on instructions 0 to 4, which use the PUSH1 and MSTORE opcodes.

Figure 2. The free memory pointer EVM bytecode structure.

PUSH1 simply pushes one byte onto the top of the stack, and MSTORE grabs the two topmost items from the stack and stores one of them in memory, at the position given by the other:

mstore(0x40, 0x80)
       |     |
       |     What to store.
       Where to store.

(in memory)

NOTE: The above snippet is Yul-ish code. Notice how it consumes elements from the stack from left to right, always consuming what’s on the top of the stack first.

This basically stores the number 0x80 (decimal 128) into memory at position 0x40 (decimal 64). What for? At this point of our narrative, who cares ¯_(ツ)_/¯?! There must be a reason (which we’ll actually see later). For now, open the Stack and Memory panels in Remix’s Debugger tab to visualize the values as you step back and forth through these instructions.
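If the Memory panel is hard to read at first, a small model can help. Below is a minimal Python sketch of MSTORE and MLOAD, assuming memory is a growable byte array and every write is a 32-byte big-endian word. The helper names are made up for illustration.

```python
memory = bytearray()

def _grow(offset: int, size: int) -> None:
    """Extend memory with zero bytes so that [offset, offset + size) exists."""
    needed = offset + size
    if len(memory) < needed:
        memory.extend(bytes(needed - len(memory)))

def mstore(offset: int, value: int) -> None:
    """Write value as a 32-byte big-endian word starting at offset."""
    _grow(offset, 32)
    memory[offset:offset + 32] = value.to_bytes(32, "big")

def mload(offset: int) -> int:
    """Read the 32-byte word starting at offset."""
    _grow(offset, 32)
    return int.from_bytes(memory[offset:offset + 32], "big")

mstore(0x40, 0x80)
print(memory[0x40:0x60].hex())   # 31 zero bytes followed by 80
```

The word occupies positions 0x40 to 0x5f, so the 0x80 byte itself ends up at 0x5f. That is why the value appears at the right-hand end of the row in Remix’s Memory panel.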

You might be wondering: what happened to instructions 1 and 3? PUSH instructions are the only EVM instructions that are actually composed of two or more bytes. So, PUSH1 0x80 really takes up two positions: the opcode itself and the byte it pushes. The mystery is revealed, then: position 1 is 0x80 and position 3 is 0x40.
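Here is a rough Python sketch of how a disassembler arrives at that numbering. It assumes the only rule that matters here: opcodes 0x60 to 0x7f are PUSH1 to PUSH32 and carry 1 to 32 bytes of inline data. The `disassemble` function and its tiny opcode table are illustrative, not Remix’s implementation.

```python
def disassemble(code: bytes):
    """Yield (position, mnemonic, immediate_hex) for each instruction in code."""
    names = {0x52: "MSTORE"}                 # only what this example needs
    pc = 0
    while pc < len(code):
        op = code[pc]
        if 0x60 <= op <= 0x7F:               # PUSH1 .. PUSH32
            n = op - 0x5F                    # number of inline data bytes
            immediate = code[pc + 1:pc + 1 + n]
            yield pc, f"PUSH{n}", immediate.hex()
            pc += 1 + n                      # skip the opcode and its data
        else:
            yield pc, names.get(op, f"0x{op:02x}"), ""
            pc += 1

listing = list(disassemble(bytes.fromhex("6080604052")))
for position, mnemonic, immediate in listing:
    print(position, mnemonic, immediate)
```

The positions printed are 0, 2, and 4: the data bytes at 1 and 3 are consumed by the PUSH1 before them and never show up as instructions of their own.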

Next up are instructions 5 to 15.

Figure 3. The non-payable check EVM bytecode structure.

Here we have a bunch of new opcodes: CALLVALUE, DUP1, ISZERO, PUSH2, and REVERT. CALLVALUE pushes the amount of wei involved in the creation transaction, DUP1 duplicates the first element on the stack, ISZERO replaces the topmost value of the stack with 1 if it is zero and with 0 otherwise, PUSH2 is just like PUSH1 but it pushes two bytes to the stack instead of just one, and REVERT halts execution and undoes any state changes.

So what’s going on here? In Solidity, we could write this chunk of assembly like this:

if(msg.value != 0) revert();

This code was not actually part of our original Solidity source, but was instead injected by the compiler because we did not declare the constructor as payable. In the most recent versions of Solidity, functions that do not explicitly declare themselves as payable cannot receive ether. Going back to the assembly code, the JUMPI at instruction 11 will skip instructions 12 through 15 and jump to 16 if there is no ether involved. Otherwise, REVERT will execute with both parameters as 0 (meaning that no useful data will be returned).
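To replay this check outside Remix, here is a small Python sketch that steps through positions 0 to 16 for a given msg.value. The byte string is assembled from the opcodes described so far (using the standard EVM opcode values); the `run_prologue` function and its return format are assumptions made for this illustration, and it only knows the handful of opcodes involved.

```python
# Positions 0..16: free memory pointer setup, then the non-payable check.
PROLOGUE = bytes.fromhex("6080604052" "348015610010576000" "80fd5b")

def run_prologue(callvalue: int):
    """Return (outcome, position) after running the check with the given wei."""
    stack, memory, pc = [], {}, 0
    while pc < len(PROLOGUE):
        op = PROLOGUE[pc]
        if op == 0x60:                        # PUSH1
            stack.append(PROLOGUE[pc + 1])
            pc += 2
        elif op == 0x61:                      # PUSH2
            stack.append(int.from_bytes(PROLOGUE[pc + 1:pc + 3], "big"))
            pc += 3
        elif op == 0x52:                      # MSTORE (toy: one value per offset)
            offset, value = stack.pop(), stack.pop()
            memory[offset] = value
            pc += 1
        elif op == 0x34:                      # CALLVALUE
            stack.append(callvalue)
            pc += 1
        elif op == 0x80:                      # DUP1
            stack.append(stack[-1])
            pc += 1
        elif op == 0x15:                      # ISZERO
            stack.append(1 if stack.pop() == 0 else 0)
            pc += 1
        elif op == 0x57:                      # JUMPI
            target, condition = stack.pop(), stack.pop()
            pc = target if condition != 0 else pc + 1
        elif op == 0xFD:                      # REVERT
            return "revert", pc
        elif op == 0x5B:                      # JUMPDEST: the check passed
            return "jumpdest", pc
        else:
            raise ValueError(f"unhandled opcode 0x{op:02x} at {pc}")
    return "end", pc

print(run_prologue(0))   # no ether sent
print(run_prologue(1))   # 1 wei sent
```

With no ether the run ends on the JUMPDEST at 16; with any nonzero value it falls through to the REVERT at 15.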

Alright! Coffee break. The next part will be a bit trickier, so it might be a good idea to take a few minutes off. Go ahead and prepare yourself a nice cup of coffee while you summon your powers of concentration, and make sure you understand what we’ve seen so far before moving on.

If you’d like yet another way to visualize what we’ve just done, try out this simple tool I’ve built: solmap. It allows you to compile Solidity code in real time and then click on EVM opcodes to highlight the associated Solidity code. The disassembly is a bit different from Remix’s, but you should be able to understand it by comparison.

Coffee break!

Ready to move on? Great! Next up are instructions 16 to 37. Keep following with Remix’s debugger. (Remember, Remix is your friend ^^).

Figure 4. EVM bytecode structure for retrieving constructor parameters from code appended at the end of a contract’s bytecode.

The first four instructions (17 to 20) read whatever is in memory at position 0x40 and push that to the stack. If you recall from a little earlier, that should be the number 0x80. The following instructions then push 0x20 (decimal 32) to the stack (instruction 21), duplicate that value (instruction 23), push 0x0217 (decimal 535) (instruction 24), and finally duplicate the fourth value (instruction 27), which should be 0x80 again. Phew! I almost ran out of breath while writing that sentence. When looking at EVM instructions like this, it’s okay to not understand what’s going on for a while. Don’t worry, it will suddenly click in your mind.

On instruction 28, CODECOPY is executed, which takes three arguments: the target memory position to copy the code to, the position in the code to copy from, and the number of bytes of code to copy. So, in this case, it copies 32 bytes of code, starting at byte position 535, into memory at position 0x80. Why?

If you look at the entire disassembled code, there are 566 instructions. So why is this code trying to copy the last 32 bytes of code? Actually, when deploying a contract whose constructor takes parameters, the arguments are appended to the end of the code as raw hex data. Scroll down the Instructions panel all the way to see this. In this case, the constructor takes one uint256 parameter, so all this code is doing is copying the argument to memory from the value appended at the end of the code. These 32 bytes don’t make sense as disassembled code, but they do in raw hex: 0x0000000000000000000000000...0000000000000000000002710. Which is, of course, the decimal value 10000 we passed to the constructor when we deployed the contract!

See why you needed that coffee? Again, feel free to repeat this part in Remix step by step, making sure that you understand what just happened. The end result should be that you see the number 0x00..002710 stored in memory at position 0x80.
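The same copy can be reproduced in a few lines of Python. This is only a sketch: the `codecopy` helper is a made-up stand-in for the opcode, and the first 535 bytes of `code` are zero filler rather than the real bytecode, since only the offsets matter here.

```python
def codecopy(memory: bytearray, code: bytes, dest: int, offset: int, size: int) -> None:
    """Copy size bytes of code, starting at offset, into memory at dest."""
    if len(memory) < dest + size:
        memory.extend(bytes(dest + size - len(memory)))
    chunk = code[offset:offset + size]
    chunk += bytes(size - len(chunk))      # reading past the end of the code yields zeros
    memory[dest:dest + size] = chunk

# Filler for positions 0..534 (an assumption), then the argument as a 32-byte word.
code = bytes(535) + (10000).to_bytes(32, "big")

memory = bytearray()
codecopy(memory, code, dest=0x80, offset=0x0217, size=0x20)

argument = int.from_bytes(memory[0x80:0xA0], "big")
print(hex(argument), argument)
```

Decoding the word at 0x80 gives back 0x2710, the 10000 that was passed to the constructor.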

Okay. For the next part, you might want to fix yourself a nice double measure of whisky. It’s about to get weird.

Whisky break!

Just kidding. No more magic, I promise. It’s all downhill from here.

The next set of instructions (29 to 35) update the value stored in memory at position 0x40 from the number 0x80 to the number 0xa0: that is, they offset the value by 0x20 (32) bytes. Now we can start making sense of instructions 0 to 4 (remember when we shrugged?). Solidity keeps track of something called a “free memory pointer”: that is, a place in memory we can use to store stuff, with the guarantee that no one will overwrite it (except us if we make a mistake, of course, using inline assembly or Yul). So, since we stored the number 10000 in the old free memory position, we updated the free memory pointer by shifting it 32 bytes forward.

Even experienced Solidity developers can be confused when they see the expression “free memory pointer” or the code mstore(0x40, 0x80). These are just saying, “We’ll be writing to memory from this point on and keeping a record of the offset, each time we write a new entry.” Both the creation code and the runtime code generated by Solidity start by initializing this pointer.
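One way to picture the free memory pointer is as a tiny bump allocator. The Python sketch below is an analogy under that assumption, not compiler output: `allocate` is a hypothetical helper, and memory is reduced to a dictionary holding just the pointer word.

```python
FREE_MEMORY_POINTER = 0x40

# State right after mstore(0x40, 0x80): the pointer says "free space starts at 0x80".
memory = {FREE_MEMORY_POINTER: 0x80}

def allocate(size: int) -> int:
    """Reserve size bytes: return where they start and move the pointer past them."""
    start = memory[FREE_MEMORY_POINTER]
    memory[FREE_MEMORY_POINTER] = start + size
    return start

first = allocate(0x20)    # where the constructor argument went
second = allocate(0x20)   # where the next 32-byte value would go
print(hex(first), hex(second), hex(memory[FREE_MEMORY_POINTER]))
```

The first allocation returns 0x80 and leaves the pointer at 0xa0, matching the update performed by instructions 29 to 35.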

What’s in memory between 0x00 and 0x40, you may wonder. Nothing. It’s just a chunk of memory that Solidity reserves for calculating hashes, which, as we’ll see soon, are necessary for mappings and other types of dynamic data.

Now, on instruction 37, MLOAD reads from memory at position 0x80 (the old free memory position, where the argument was just copied; position 0x40 now holds 0xa0) and basically downloads our 10000 value from memory into the stack, where it will be fresh and ready for consumption in the next set of instructions.

This is a common pattern in EVM bytecode generated by Solidity: before a function’s body is executed, the function’s parameters are loaded into the stack (whenever possible), so that the upcoming code can consume them — and that’s exactly what’s going to happen next.

Let’s continue with instructions 38 to 55.

Figure 5. The constructor’s body EVM code.

These instructions are nothing more and nothing less than the constructor’s body: that is, the Solidity code:

totalSupply_ = _initialSupply;
balances[msg.sender] = _initialSupply;

The first four instructions (38 to 42) are pretty self-explanatory. First, 0 is pushed to the stack, then the second item in the stack is duplicated (that’s our 10000 number), and then the number 0 is duplicated and pushed to the stack, which is the storage slot of totalSupply_. Now, SSTORE can consume the values and still keep 10000 lying around for future use:

sstore(0x00, 0x2710)
       |     |
       |     What to store.
       Where to store.

(in storage)

Voila! We stored the number 10000 in the variable totalSupply_. Isn’t that amaaaazing???

Be sure to visualize this value in Remix’s Debugger tab too. You’ll find it in the Storage completely loaded panel.
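The stack juggling in instructions 38 to 42 can be traced by hand in Python. In this sketch the stack is a list whose last item is the top, and storage is a plain dictionary. Reading both duplications as DUP2 is my interpretation of “the second item is duplicated”, so treat the opcode names in the comments as an assumption.

```python
storage = {}
stack = [0x2710]            # the constructor argument, loaded earlier

stack.append(0x00)          # PUSH1 0x00: the slot of totalSupply_
stack.append(stack[-2])     # DUP2: copy the second item (10000)
stack.append(stack[-2])     # DUP2: copy the second item again (now the 0)

slot = stack.pop()          # SSTORE pops the slot first...
value = stack.pop()         # ...and then the value to write
storage[slot] = value

print(storage, stack)
```

After the SSTORE, storage holds 10000 at slot 0, and the stack still has the original 10000 and a 0 left over for the instructions that follow.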

The next set of instructions (43 to 54) are a bit disconcerting, but will basically deal with storing 10000 in the balances mapping for the key of msg.sender. Before moving forward, make sure you understand this part of the Solidity documentation, which explains how a mapping is saved in storage. Long story short, it will concatenate the key used (in this case, msg.sender, obtained with the opcode CALLER) with the slot of the mapping (in this case the number 1, because it’s the second variable declared in the contract), then hash that with the SHA3 opcode and use that as the target position in storage. Storage, in the end, is just a simple dictionary or hash table.

Moving on with instructions 43 to 45, the msg.sender address is stored in memory (this time at position 0x00), and then in instructions 46 to 50, the value 1 (the slot of the mapping) is stored at memory position 0x20. Finally, the SHA3 opcode calculates the Keccak-256 hash of whatever is in memory from position 0x00 to position 0x40, that is, the concatenation of the key used with the mapping’s slot. This is precisely where the value 10000 will be stored in our mapping:

sstore(hash..., 0x2710)
       |        |
       |        What to store.
       Where to store.

And that’s it. At this point, the constructor’s body has been fully executed.
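For readers who want to reproduce the mapping position by hand, the Python sketch below builds the 64 bytes that get hashed. Python’s standard library does not ship the original Keccak-256 (`hashlib.sha3_256` uses different padding and gives different digests), so the hash is passed in as a function; any Keccak-256 implementation you trust can be plugged in. The function names and the sample address are hypothetical.

```python
from typing import Callable

def mapping_preimage(key: int, slot: int) -> bytes:
    """The 64 hashed bytes: the key at 0x00..0x1f, the mapping's slot at 0x20..0x3f."""
    return key.to_bytes(32, "big") + slot.to_bytes(32, "big")

def mapping_storage_slot(key: int, slot: int,
                         keccak256: Callable[[bytes], bytes]) -> int:
    """Storage position of mapping[key] for a mapping declared at the given slot."""
    return int.from_bytes(keccak256(mapping_preimage(key, slot)), "big")

sender = 0xDEADBEEF                      # placeholder standing in for msg.sender
preimage = mapping_preimage(sender, 1)   # balances lives in slot 1
print(preimage.hex())
```

The printed bytes should match what the Memory panel shows between 0x00 and 0x40 just before SHA3 runs, with your own account’s address in place of the placeholder.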

All this may be a bit overwhelming at first, but it’s a fundamental part of how storage works in Solidity. If you didn’t quite get it, I recommend that you go over it a few times following along with Remix’s debugger, keeping the Stack and Memory panels in sight. Also, feel free to ask questions below. This pattern is used all over the place in EVM bytecode generated by Solidity, and you will quickly learn to identify it effortlessly. In the end, it is nothing more than calculating where to save, in storage, the value for a certain key for a mapping.

Alright, we’re almost done here. If you got this far, the next part will be a piece of cake.

Figure 6. The runtime code copy structure.

In instructions 56 to 65, we’re performing a code copy again. Only this time, we’re not copying the last 32 bytes of the code to memory; we’re copying 0x01d1 (decimal 465) bytes starting from position 0x0046 (decimal 70) into memory at position 0. That’s one big chunk of code to copy!

If you slide the Transaction slider all the way to the right again, you’ll notice that position 70 is right after our creation-time EVM code, where execution stopped. The runtime bytecode is contained in those 465 bytes. This is the part of the code that will be saved in the blockchain as the contract’s runtime code, which will be the code that will be executed every time someone or something interacts with the contract. (We’ll cover the runtime code in future parts of this series).
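The numbers we have collected so far fit together neatly, as the quick Python check below shows. The constants come straight from the two CODECOPY calls; the `split_deployment` helper and the filler bytes are assumptions made for illustration.

```python
CREATION_END = 0x0046      # 70: first position after the creation code
RUNTIME_LENGTH = 0x01D1    # 465 bytes copied by the second CODECOPY
ARGS_OFFSET = 0x0217       # 535: where the first CODECOPY read from
ARGS_LENGTH = 0x20         # one uint256 constructor argument

def split_deployment(code: bytes):
    """Cut the deployed bytes into creation code, runtime code, and arguments."""
    creation = code[:CREATION_END]
    runtime = code[CREATION_END:CREATION_END + RUNTIME_LENGTH]
    arguments = code[ARGS_OFFSET:ARGS_OFFSET + ARGS_LENGTH]
    return creation, runtime, arguments

# Filler bytes stand in for the real bytecode; only the lengths are meaningful.
sample = b"\xaa" * 70 + b"\xbb" * 465 + (10000).to_bytes(32, "big")
creation, runtime, arguments = split_deployment(sample)

print(len(creation), len(runtime), len(arguments))
print(CREATION_END + RUNTIME_LENGTH == ARGS_OFFSET)
```

The runtime code ends exactly where the constructor argument begins (70 + 465 = 535), and the argument’s 32 bytes run to position 566, the last one in the Instructions panel.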

And this is exactly what instructions 66 to 69 do: the code that we copied to memory is returned.

Figure 7. The runtime code return EVM bytecode structure.

RETURN grabs the code copied to memory and hands it over to the EVM. If this creation code is executed in the context of a contract creation transaction (one sent with an empty recipient field rather than to an existing address), the EVM will execute the code and store the return value as the created contract’s runtime code.
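Seen from the outside, contract creation boils down to “run the code once, keep what it returns”. The Python sketch below captures only that idea. `create_contract`, `fake_execute`, the address string, and the byte strings are all made up; a real EVM does far more (gas accounting, address derivation, and so on).

```python
def create_contract(accounts: dict, address: str, init_code: bytes, execute) -> None:
    """Run the creation code once and keep only what it returns."""
    returned = execute(init_code)     # the EVM runs the creation code...
    accounts[address] = returned      # ...and its RETURN data becomes the contract's code

# Stand-ins: b"CREATION" plays the creation code, b"RUNTIME" the runtime code.
init_code = b"CREATION" + b"RUNTIME"

def fake_execute(code: bytes) -> bytes:
    """Pretend to run CODECOPY and RETURN: hand back everything after the creation part."""
    return code[len(b"CREATION"):]

accounts = {}
create_contract(accounts, "0xNewContract", init_code, fake_execute)
print(accounts)
```

Nothing from the creation part survives in `accounts`, which is the whole point: the constructor runs once and is then gone.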

That’s it! By now, our BasicToken contract instance will be created and deployed with its initial state and runtime code ready for use. If you take a step back and look at Figure 1, you’ll see that all the EVM bytecode structures we analyzed are generic, except the one highlighted in purple: that is, they will show up in creation-time bytecode generated by the Solidity compiler. What will differ from constructor to constructor is just the purple part: the constructor function’s actual body. The structure that fetches parameters embedded at the end of the bytecode, as well as the structures that copy the runtime code and return it, can be considered boilerplate code and generic EVM opcode structures. You should be able to look at any constructor now, and before studying it instruction by instruction, you should have a general idea of the components that make it up.

In the next article of this series, we’ll look at the actual runtime code, starting with how you can interact with a contract’s EVM code at different entry points. For now, give yourself a well-deserved pat on the back, because you have just digested the most difficult part of the series. You should also have developed a strong ability to read through and debug EVM bytecode, understand generic structures, and, most importantly, understand the difference between creation-time and runtime EVM bytecode. This is what makes the constructor function of a contract so special in Solidity.

See you in the next episode!
