Enhancing Issue Tracking Efficiency with AI-Driven Natural Language Processing: Improving Classification, Association and Resolution
V.A. Pocheva (TU Delft - Electrical Engineering, Mathematics and Computer Science)
N. Yorke-Smith – Mentor (TU Delft - Algorithmics)
Maliheh Izadi – Mentor (TU Delft - Software Engineering)
René van den Berg – Mentor (NXP Semiconductors)
M.A. Costea – Graduation committee member (TU Delft - Programming Languages)
D. Spinellis – Mentor (TU Delft - Software Engineering)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
In large-scale engineering environments, efficient issue tracking is essential for timely problem resolution and knowledge reuse. However, manual classification and association of issue reports present scalability challenges, further complicated by inconsistent annotations and the absence of semantic linking mechanisms. This project investigates the application of Natural Language Processing and Artificial Intelligence to automate multi-label classification and to discover meaningful semantic associations between technical issues. Over 70 model configurations were evaluated on a real-world industrial dataset, comparing classical models with transformer-based and deep learning approaches. DistilBERT achieved the highest Recall@5 (0.93), indicating strong performance in identifying relevant categories. Classical methods, such as TF-IDF combined with Logistic Regression, also performed well, offering a computationally efficient and interpretable option. For association discovery, approaches including lexical retrieval, embedding-based similarity, clustering-based filtering, and topic modelling were assessed using both quantitative metrics and expert review. Lexical (BM25) and embedding-based (SBERT + cosine similarity) methods offered complementary strengths, retrieving overlapping yet distinct sets of associations. Associations identified by both models were rated as useful in over 70% of cases by domain experts, suggesting that agreement between methods may serve as an indicator of relevance. While Copilot provided consistent relevance assessments, its ratings were often higher than those of human evaluators and did not always reflect their detailed assessments. These findings highlight the potential of combining lexical and semantic methods with human-in-the-loop validation to support scalable and accurate issue tracking in industrial settings.
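To make the classical baseline concrete, the sketch below shows one plausible setup for the TF-IDF + Logistic Regression multi-label classifier and the Recall@5 metric mentioned above. The issue texts, category names, and helper function are hypothetical stand-ins (the industrial dataset is not public), and the exact pipeline used in the project may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Toy issue reports with multiple category labels each (illustrative only).
docs = [
    "timer interrupt fires twice after wakeup from sleep mode",
    "flash driver returns checksum error on block erase",
    "UART driver drops bytes at high baud rate",
    "watchdog reset triggered during flash write",
]
labels = [
    ["timers", "power"],
    ["flash", "drivers"],
    ["drivers", "uart"],
    ["flash", "watchdog"],
]

# Binarize the label sets and fit a one-vs-rest classifier on TF-IDF features.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

def recall_at_k(scores, y_true, k=5):
    """Fraction of a report's true categories found among its top-k
    scored categories, averaged over reports."""
    vals = []
    for s, t in zip(scores, y_true):
        topk = set(np.argsort(s)[::-1][:k])
        relevant = set(np.flatnonzero(t))
        if relevant:
            vals.append(len(topk & relevant) / len(relevant))
    return float(np.mean(vals))

scores = clf.predict_proba(X)  # per-category membership probabilities
print(f"Recall@5: {recall_at_k(scores, Y, k=5):.2f}")
```

Evaluating on held-out reports rather than the training set, as here, would of course be required for a meaningful score; the snippet only illustrates how Recall@5 ranks each report's predicted categories and checks coverage of the annotated ones.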