Authorship Identification and Verification of JavaScript Source Code

An Evaluation of Techniques

Master thesis (2014)

Authors

W.C. Wilco

Contributors

A.E. Zaidman (mentor)

Programme

Software Engineering (TU Delft)

JavaScript, N-gram, Authorship analysis, Authorship identification, Authorship verification, Source code, Minification

To reference this document use:

http://resolver.tudelft.nl/uuid:f6aa2f88-e657-4fef-b684-188a212c71ad

Published Date

18-12-2014

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

The growing number of criminals who exploit the speed and anonymity of the Web is a cause for increasing concern, yet little effort has been spent on tracing the authors of malicious code. To that end, we investigated authorship identification and verification of JavaScript source code. We evaluated three character-based approaches and propose a new domain-specific approach. The novelty of the domain-specific approach is that it represents code as a parse tree from which structural features are extracted. An evaluation of the techniques on open source code from GitHub showed that the approaches using character n-gram features achieved the best performance. However, n-gram and domain-specific features turned out to be complementary, and combining them resulted in higher performance. Techniques that used similarity-based classification were especially successful when only a limited amount of training data was available, while feature-vector-based techniques were mainly successful when a large amount of training data was available and in an authorship verification context. By means of code minification, we evaluated how classification accuracy is affected by removing authorship information from the source code. Code minification was shown to significantly degrade the performance of the authorship analysis methods. Of the techniques evaluated, the compression-based technique was the most robust against code minification.
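
To make the idea of similarity-based classification on character n-grams concrete, the following is a minimal sketch in Python. It is not the thesis's implementation: the function names `ngram_profile` and `identify_author`, the n-gram length of 3, the profile size of 500, and the use of profile intersection as the similarity measure are all assumptions chosen for illustration. Each author is represented by the set of most frequent character n-grams in their known code, and an unknown file is attributed to the author whose profile overlaps most with its own.

```python
from collections import Counter


def ngram_profile(source: str, n: int = 3, size: int = 500) -> set:
    """Return the `size` most frequent character n-grams of `source`.

    `n` and `size` are illustrative defaults, not values from the thesis.
    """
    counts = Counter(source[i:i + n] for i in range(len(source) - n + 1))
    return {gram for gram, _ in counts.most_common(size)}


def identify_author(unknown: str, samples: dict, n: int = 3, size: int = 500) -> str:
    """Attribute `unknown` to the candidate with the largest profile overlap.

    `samples` maps a (hypothetical) author name to that author's known code.
    """
    target = ngram_profile(unknown, n, size)
    return max(
        samples,
        key=lambda author: len(target & ngram_profile(samples[author], n, size)),
    )
```

Because the profiles are built from raw characters, they capture layout habits such as spacing and indentation, which is one plausible reason why minification, which strips that layout, hurts this kind of method.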

Files

MSc_Thesis_Wilco_Wisse.pdf

(pdf | 4.37 Mb)