Finding defects in proposed changes is one of the main motivations for and expected outcomes of code review, yet reviews do not uncover defects as often as expected. Just-in-time (JIT) defect prediction aims to predict bug-introducing changes, which can help allocate inspection time efficiently according to the defect-proneness of the changed parts of the software. Despite the promising results achieved by DeepJIT and CC2Vec, two deep learning-based JIT defect prediction models, industry-based JIT defect prediction studies have not yet adopted deep models. In this work, our goal is to build and evaluate several JIT defect prediction models that can help Adyen developers spot defective changes during code review. To construct a new dataset with a sufficiently large set of labels, we identify four sources of potential bug-fixing commits by analysing Adyen's way of working. We make several practical adaptations to DeepJIT and CC2Vec and compare their performance with that of three traditional metric-based models when making predictions at both commit level and file level. Our results indicate that the deep models outperform the metric-based models across all three datasets. All models performed slightly worse on Adyen data than in an open-source setting, but both deep models still achieved respectable performance and significantly outperformed the metric-based models. When evaluated in a real-world setting on bugs manually collected by Adyen developers, DeepJIT performed consistently with earlier findings at commit level, but its performance dropped at file level. Lastly, we find that although including each bug source generally does not lead to worse performance, whether it leads to better performance depends on both the type of model used and the granularity at which predictions are made.