An Empirical Assessment on the Limits of Binary Code Summarization with Transformer-based Models

Abstract

Reverse engineering binaries is required to understand and analyse programs for which the source code is unavailable. Decompilers can transform the largely unreadable binaries into a more readable, source code-like representation. However, many aspects of source code, such as variable names and comments, are lost during the compilation and decompilation processes. Furthermore, stripping a binary removes even more informative symbols, including the function names.

Reverse engineering is time-consuming, and much of that time is spent labelling functions with semantic information. Therefore, we propose a novel code summarisation method for decompiled and stripped decompiled code. First, we leverage the existing BinSwarm dataset and extend it with aligned source code summaries. Next, we create an artificial demi-stripped dataset by removing the identifiers from unstripped decompiled code. We then fine-tune a pre-trained CodeT5 model on these datasets for the code summarisation task. Furthermore, we investigate the performance of the different input types, the impact of data duplication, and the importance of each aspect of the source code for model performance. Finally, we design and present intermediate-training objectives to increase model performance.
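The demi-stripped dataset is built by removing identifiers from unstripped decompiled code. A minimal sketch of that idea, assuming a simple regex-based renamer and an illustrative keyword list (the actual preprocessing pipeline may differ):

```python
import re

# Hypothetical sketch of "demi-stripping": replace every non-keyword
# identifier in unstripped decompiled code with a generic placeholder,
# mimicking what stripping removes while keeping the code structure.
# The keyword list and the sub_N naming scheme are illustrative assumptions.

C_KEYWORDS = {
    "int", "char", "long", "void", "return", "if", "else", "while",
    "for", "unsigned", "struct", "sizeof", "const", "break",
}

def demi_strip(code: str) -> str:
    """Replace each distinct non-keyword identifier with sub_0, sub_1, ..."""
    mapping: dict[str, str] = {}

    def rename(match: re.Match) -> str:
        name = match.group(0)
        if name in C_KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = f"sub_{len(mapping)}"  # stripper-style name
        return mapping[name]

    return re.sub(r"\b[A-Za-z_][A-Za-z0-9_]*\b", rename, code)

snippet = "int count_lines(char *buf) { int total = 0; return total; }"
print(demi_strip(snippet))
# -> int sub_0(char *sub_1) { int sub_2 = 0; return sub_2; }
```

Renaming consistently (the same identifier always maps to the same placeholder) preserves data flow, so only the semantic hints carried by the names themselves are lost.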

We present the following findings:
Firstly, we find that the model generates good summaries for decompiled code, with performance similar to that on source C code. Compared to decompiled code, the quality of the demi-stripped model's summaries is significantly lower but still usable. The stripped model performs worst, producing mostly incorrect and unusable summaries.
Secondly, we find that deduplication greatly reduces the performance of the model, bringing the results on decompiled code roughly in line with other decompiled datasets. Thirdly, we find that the loss of identifiers causes a drop in the BLEU-4 score of 35%, with another 25% decrease attributable to the increase in decompilation faults caused by stripping. Lastly, we show that our proposed deobfuscation intermediate-training objective improves the model's performance by 0.54 and 1.54 BLEU-4 on stripped and demi-stripped code, respectively.
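The findings above are reported in BLEU-4, the geometric mean of modified 1- to 4-gram precisions times a brevity penalty. A plain re-implementation for illustration (with simple add-one smoothing, an assumption on our part), not the exact evaluation script used in the study:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of all n-grams in the token sequence
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(reference: list[str], candidate: list[str]) -> float:
    """Sentence-level BLEU-4: smoothed n-gram precisions x brevity penalty."""
    precisions = []
    for n in range(1, 5):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum((cand & ref).values())   # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        # add-one smoothing so short candidates do not zero out the score
        precisions.append((overlap + 1) / (total + 1))
    # brevity penalty: punish candidates shorter than the reference
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / 4)

ref = "copies the input buffer into a newly allocated string".split()
print(bleu4(ref, ref))  # identical summaries score 1.0
```

Because the score is dominated by exact 4-gram overlap, even a fluent summary that paraphrases the reference can score low, which is worth keeping in mind when interpreting absolute BLEU-4 differences between the input types.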