An Empirical Assessment on the Limits of Binary Code Summarization with Transformer-based Models

Abstract

Reverse engineering binaries is required to understand and analyse programs for which the source code is unavailable. Decompilers can transform the largely unreadable binaries into a more readable, source code-like representation. However, many aspects of source code, such as variable names and comments, are lost during the compilation and decompilation processes. Furthermore, stripping a binary removes even more informative symbols, including the function names.

Reverse engineering is time-consuming, and much of that time is spent labelling functions with semantic information. Therefore, we propose a novel code summarisation method for decompiled and stripped decompiled code. First, we leverage the existing BinSwarm dataset and extend it with aligned source code summaries. Next, we create an artificial demi-stripped dataset by removing the identifiers from unstripped decompiled code. We then fine-tune a pre-trained CodeT5 model on these datasets for the code summarisation task. Furthermore, we investigate the performance of the different input types, the impact of data duplication, and the importance of each aspect of the source code for model performance. Finally, we design and present intermediate-training objectives to increase model performance.
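The demi-stripped dataset is built by removing identifiers from unstripped decompiled code. A minimal sketch of that idea, assuming a simple regex-based renamer and an illustrative keyword list (the actual preprocessing pipeline may differ):

```python
import re

# Hypothetical sketch of "demi-stripping": replace every non-keyword
# identifier in unstripped decompiled code with a generic placeholder,
# mimicking what stripping removes while keeping the code structure.
# The keyword list and the sub_N naming scheme are illustrative assumptions.

C_KEYWORDS = {
    "int", "char", "long", "void", "return", "if", "else", "while",
    "for", "unsigned", "struct", "sizeof", "const", "break",
}

def demi_strip(code: str) -> str:
    """Replace each distinct non-keyword identifier with sub_0, sub_1, ..."""
    mapping: dict[str, str] = {}

    def rename(match: re.Match) -> str:
        name = match.group(0)
        if name in C_KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = f"sub_{len(mapping)}"  # stripper-style name
        return mapping[name]

    return re.sub(r"\b[A-Za-z_][A-Za-z0-9_]*\b", rename, code)

snippet = "int count_lines(char *buf) { int total = 0; return total; }"
print(demi_strip(snippet))
# -> int sub_0(char *sub_1) { int sub_2 = 0; return sub_2; }
```

Renaming consistently (the same identifier always maps to the same placeholder) preserves data flow, so only the semantic hints carried by the names themselves are lost.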

We present the following findings:
Firstly, we find that the model generates good summaries for decompiled code, with performance similar to that on source C code. Compared to decompiled code, the quality of the demi-stripped model's summaries is significantly lower but still usable. The stripped model performs worst, producing mostly incorrect and unusable summaries.
Secondly, we find that deduplication greatly reduces the performance of the model, bringing the results on decompiled code roughly in line with other decompiled datasets. Thirdly, we find that the loss of identifiers causes a drop in the BLEU-4 score of 35%, with another 25% decrease attributable to the increase in decompilation faults caused by stripping. Lastly, we show that our proposed deobfuscation intermediate-training objective improves the model's performance by 0.54 and 1.54 BLEU-4 on stripped and demi-stripped code, respectively.
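The findings above are reported in BLEU-4, the geometric mean of modified 1- to 4-gram precisions times a brevity penalty. A plain re-implementation for illustration (with simple add-one smoothing, an assumption on our part), not the exact evaluation script used in the study:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of all n-grams in the token sequence
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(reference: list[str], candidate: list[str]) -> float:
    """Sentence-level BLEU-4: smoothed n-gram precisions x brevity penalty."""
    precisions = []
    for n in range(1, 5):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())   # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        # add-one smoothing so short candidates do not zero out the score
        precisions.append((overlap + 1) / (total + 1))
    # brevity penalty: punish candidates shorter than the reference
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

ref = "copies the input buffer into a newly allocated string".split()
print(bleu4(ref, ref))  # identical summaries score 1.0
```

Because the score is dominated by exact 4-gram overlap, even a fluent summary that paraphrases the reference can score low, which is worth keeping in mind when interpreting absolute BLEU-4 differences between the input types.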