How linear is the relationship between Solidity code and compiled bytecode?

Hello all, amateur Solidity programmer and Python guy here. I've been thinking about ways of applying some of my machine learning experience to blockchain applications, and I decided to see whether some of the recent advances in transformers might be useful for transforming bytecode (or decompiled bytecode) into more human-readable functions.

Unfortunately, I don't have the raw compsci knowledge to answer a pretty fundamental question about whether this will work at all: how closely does compiled bytecode mirror the smart contract it was compiled from? Specifically, if the two are simply compared as raw strings, do the variables and functions appear in the same order in both?

This is important because transformers are generally designed to match encodings that sit near each other, sequentially speaking. If, say, the fallback() function is declared at the beginning or the middle of the Solidity code, but is always placed at the back of the bytecode, this would introduce noise, especially in longer contracts.
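To make "how linear" a bit more concrete, here is a rough sketch of how I imagine measuring it: take the order in which functions appear in the source and the order in which they appear in the bytecode, then compute a Kendall-tau-style agreement score over all pairs. The function names and both orderings below are made up for illustration (they are not real compiler output), and `order_agreement` is a hypothetical helper of my own, not part of any library.

```python
from itertools import combinations


def order_agreement(source_order, bytecode_order):
    """Return a score in [-1, 1] for how well two orderings agree.

    1.0 means every pair of names keeps its relative order,
    -1.0 means every pair is flipped. Assumes names are unique.
    """
    position = {name: i for i, name in enumerate(bytecode_order)}
    # Only compare names that show up in both orderings.
    shared = [name for name in source_order if name in position]
    pairs = list(combinations(shared, 2))
    if not pairs:
        raise ValueError("need at least two shared names to compare")
    # A pair is concordant if it appears in the same relative order in both.
    concordant = sum(1 for a, b in pairs if position[a] < position[b])
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)


# Hypothetical example: fallback is declared first in the source
# but ends up last in the bytecode.
source_order = ["fallback", "deposit", "withdraw", "balanceOf"]
bytecode_order = ["deposit", "withdraw", "balanceOf", "fallback"]
score = order_agreement(source_order, bytecode_order)
print(score)
```

A score near 1 across many contracts would suggest the two are close to 1:1 in ordering; a score near 0 or below would suggest the compiler reorders things enough to matter.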

So how linear are they, really? Does the compiler like to move stuff around, or are the two fairly close to 1:1? If the latter, there's a good chance that a transformer-based solution could provide a very useful tool for reverse-engineering bytecode.
