A. Al-Kaswan
Please Note
This page displays the records of the person named above and is not linked to a unique person identifier. This record may need to be merged to a profile.
2 records found
1
Master thesis
(2022)
-
A. Al-Kaswan, A. van Deursen, Prem Devanbu, M. Izadi, Anand Ashok Sawant, S.E. Verwer
Reverse engineering binaries is required to understand and analyse programs for which the source code is unavailable. Decompilers can transform the largely unreadable binaries into a more readable source code-like representation. However, many aspects of source code, such as variable names and comments, are lost during the compilation and decompilation processes. Furthermore, stripping a binary removes additional informative symbols, including function names.
Reverse engineering is time-consuming, and much of that time is spent labelling functions with semantic information. Therefore, we propose a novel code summarisation method for decompiled and stripped decompiled code. First, we leverage the existing BinSwarm dataset and extend it with aligned source code summaries. Next, we create an artificial demi-stripped dataset by removing the identifiers from unstripped decompiled code. We then fine-tune a pre-trained CodeT5 model for the code summarisation task on this dataset. Furthermore, we investigate the performance of the input types, the impact of data duplication, and the importance of each aspect present in the source code on the model performance. Moreover, we design and present intermediate-training objectives to increase the model performance.
We present the following findings:
Firstly, we find that the model generates good summaries for decompiled code, with similar performance to source C code. The summaries for demi-stripped code are of significantly lower quality than those for decompiled code but remain usable. The model performed worst on stripped code, producing mostly incorrect and unusable summaries.
Secondly, we find that deduplication greatly reduces the performance of the model, bringing performance on decompiled code roughly in line with that on other decompiled datasets. Thirdly, we find that the loss of identifiers causes a 35% drop in the BLEU-4 score, with another 25% decrease attributable to the increase in decompilation faults caused by stripping. Lastly, we show that our proposed deobfuscation intermediate-training objective improves the model's performance by 0.54 and 1.54 BLEU-4 on stripped and demi-stripped code, respectively.
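As an illustration of the demi-stripping step described in the abstract (removing identifiers from unstripped decompiled code), the following is a minimal sketch and not the thesis implementation. The `ID_n` placeholder scheme, the `demi_strip` function name, and the short keyword list are assumptions made for this example; the sketch also ignores string literals and comments, which a real pipeline would need to handle.

```python
import re

# Hypothetical, deliberately incomplete set of C keywords and type names
# that should survive stripping; everything else is treated as an identifier.
C_KEYWORDS = {
    "int", "char", "void", "long", "short", "unsigned", "signed",
    "float", "double", "if", "else", "for", "while", "do", "return",
    "break", "continue", "switch", "case", "default", "struct",
    "sizeof", "const", "static",
}

# Matches C-style identifiers; numeric literals such as 0x10 are not matched
# because there is no word boundary between the digit and the letter.
IDENTIFIER = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")


def demi_strip(code: str) -> str:
    """Replace every non-keyword identifier with a stable placeholder.

    The same original name always maps to the same placeholder, so the
    data flow of the function stays readable while the names are gone.
    """
    mapping = {}

    def rename(match):
        name = match.group(0)
        if name in C_KEYWORDS:
            return name
        if name not in mapping:
            mapping[name] = f"ID_{len(mapping)}"
        return mapping[name]

    return IDENTIFIER.sub(rename, code)


example = "int add_totals(int count, int price) { return count * price; }"
stripped = demi_strip(example)
print(stripped)
```

Each original name maps to a single placeholder throughout the function, so the structure of the code is preserved while the semantic hints carried by the names are lost, which is the kind of information loss the abstract attributes to missing identifiers.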
DataFlex
Educational game about data centers for children
Bachelor thesis
(2020)
-
A. Al-Kaswan, B. El Attar, G. Wiemers, L.J. Kronstadt, G. d' Abreu de Paulo, W.P. Brinkman, S. De Wit, T.A.R. Overklift Vaupel Klein
Women are largely underrepresented in IT, and girls’ interest in STEM and IT fields tends to drop throughout secondary education. Educational games are a great tool for changing the perception of certain topics as well as the behavior of the players. Thus, this report describes the development of a game to make the field of IT more appealing to girls between the ages of 10 and 14.
After collecting requirements with the client and conducting a literature study, a design is proposed. The final product is a two-player 2D role-playing game with puzzle elements, specifically designed to be played in a classroom environment. The game takes place in a data center and shows the players the societal importance of data centers as well as the diversity of the work done there. The gameplay consists of exploring a data center, talking with both male and female employees in various roles, helping them with their work through minigames, and solving a mystery. The game was designed specifically to cater to girls and to break stereotypes regarding women in IT.