Utilizing Lingual Structures to Enhance Transformer Performance in Source Code Completions

Master Thesis (2022)
Author(s)

J.B. Katzy (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Maurício Aniche – Mentor (TU Delft - Software Engineering)

S.A.M. Mir – Graduation committee member (TU Delft - Software Engineering)

Publication Year
2022
Language
English
Copyright
© 2022 Jonathan Katzy
Graduation Date
23-08-2022
Awarding Institution
Delft University of Technology
Programme
Computer Science
Faculty
Electrical Engineering, Mathematics and Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

We explored the effect of augmenting a standard language model architecture (BERT) with a structural component based on the Abstract Syntax Trees (ASTs) of the source code. We created a universal abstract syntax tree structure that can be applied to multiple languages, enabling the model to work in a multilingual setting. We adapted the general graph transformer architecture to function as the structural component of the transformer. Furthermore, we extended the Embeddings from Language Models (ELMo) style embeddings to work in a multilingual setting on incomplete source code. The final results showed that the multilingual setting was beneficial to the embedding model, yielding higher-quality embeddings; for the transformer model, however, monolingual models performed better in most cases. The addition of ASTs increased the performance of the best-performing models on all languages, while also reducing the need for a pre-training task to achieve the best performance. Compared to their baseline counterparts, the largest average increases in performance on the test set were 3.0% for the Java model, 1.1% for the Julia model, and 5.7% for the C++ model.
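To illustrate the kind of structural component the abstract describes, the sketch below shows a single self-attention layer whose attention logits are biased by an AST adjacency matrix, in the spirit of a graph transformer. This is a minimal sketch in PyTorch, not the thesis's actual implementation: the class name ASTBiasedAttention, the learned scalar bias ast_bias, and the dense 0/1 adjacency input are all assumptions made for the example.

```python
# A minimal sketch of AST-biased self-attention, assuming a graph-transformer-style
# design: standard scaled dot-product attention over token embeddings, with the
# attention logits shifted by a learned bias wherever two tokens' AST nodes are
# adjacent. Names and shapes are illustrative, not the thesis's implementation.
import math
import torch
import torch.nn as nn

class ASTBiasedAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Learned scalar added to the logits of AST-adjacent token pairs (assumption).
        self.ast_bias = nn.Parameter(torch.zeros(1))
        self.d_model = d_model

    def forward(self, x: torch.Tensor, ast_adj: torch.Tensor) -> torch.Tensor:
        # x:       (batch, seq_len, d_model) token embeddings
        # ast_adj: (batch, seq_len, seq_len) 0/1 adjacency derived from the AST
        scores = self.q(x) @ self.k(x).transpose(-2, -1) / math.sqrt(self.d_model)
        scores = scores + self.ast_bias * ast_adj  # structural component
        return torch.softmax(scores, dim=-1) @ self.v(x)

# Usage with random data, standing in for token embeddings and AST adjacency:
layer = ASTBiasedAttention(d_model=64)
tokens = torch.randn(2, 10, 64)
adj = (torch.rand(2, 10, 10) > 0.8).float()
out = layer(tokens, adj)
print(out.shape)  # torch.Size([2, 10, 64])
```

A design of this shape keeps the token pathway identical to a standard BERT layer, so the structural signal can be added or ablated without changing the rest of the model, which is consistent with the baseline comparisons reported in the abstract.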

Files

Thesis_Jonathan_Katzy.pdf
(PDF | 14.2 MB)
License info not available