Authorship Identification and Verification of JavaScript Source Code

An Evaluation of Techniques

Abstract

The growing number of criminals who exploit the speed and anonymity of the Web is a cause of increasing concern, yet little effort has been spent on tracing the authors of malicious code. To that end, we investigated authorship identification and verification of JavaScript source code. We evaluated three character-based approaches and propose a new domain-specific approach. The novelty of the domain-specific approach is that it represents code as a parse tree from which structural features are extracted. In an evaluation on open source code from GitHub, the approaches that use character n-gram features achieved the best performance. However, n-gram and domain-specific features turned out to be complementary, and combining them resulted in higher performance. Techniques that used similarity-based classification were especially successful when only a limited amount of training data was available, while feature-vector-based techniques were mainly successful when a large amount of training data was available and in an authorship verification context. Using code minification, we evaluated how classification accuracy is affected when authorship information is removed from the source code. Code minification was shown to significantly degrade the performance of the authorship analysis methods, although the compression-based technique proved notably robust against it.
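
To make the idea of character n-gram features with similarity-based classification concrete, the following is a minimal sketch and not the implementation evaluated in this work. It assumes a simple profile-intersection similarity: each author is represented by the set of their most frequent character n-grams, and an unknown sample is attributed to the author whose profile overlaps most with it. The function names `ngram_profile` and `attribute`, the n-gram length `n=3`, and the profile size `top_k=500` are illustrative assumptions.

```python
from collections import Counter


def ngram_profile(code: str, n: int = 3, top_k: int = 500) -> set[str]:
    """Return the set of the top_k most frequent character n-grams in code.

    n and top_k are illustrative defaults, not values taken from the paper.
    """
    counts = Counter(code[i:i + n] for i in range(len(code) - n + 1))
    return {gram for gram, _ in counts.most_common(top_k)}


def attribute(unknown: str, known: dict[str, str],
              n: int = 3, top_k: int = 500) -> str:
    """Return the candidate author whose n-gram profile shares the most
    n-grams with the profile of the unknown sample.

    known maps a (hypothetical) author name to the concatenation of that
    author's training source code.
    """
    target = ngram_profile(unknown, n, top_k)
    return max(
        known,
        key=lambda author: len(target & ngram_profile(known[author], n, top_k)),
    )
```

Because the profile is built from raw characters, it captures layout habits such as spacing and semicolon use, which is also the kind of information that minification strips away.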