Utilizing Lingual Structures to Enhance Transformer Performance in Source Code Completions
J.B. Katzy (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Maurício Aniche – Mentor (TU Delft - Software Engineering)
S.A.M. Mir – Graduation committee member (TU Delft - Software Engineering)
Abstract
We explored the effect of augmenting a standard language model architecture (BERT) with a structural component based on the Abstract Syntax Trees (ASTs) of the source code. We created a universal abstract syntax tree structure that can be applied to multiple languages, enabling the model to operate in a multilingual setting. We adapted the general graph transformer architecture to serve as the structural component of the transformer. Furthermore, we extended Embeddings from Language Models (ELMo)-style embeddings to handle incomplete source code in a multilingual setting. The final results showed that the multilingual setting helped the embedding model achieve higher-quality embeddings; for the transformer model, however, monolingual models performed better in most cases. Adding ASTs increased the performance of the best-performing models on all languages, while also reducing the need for a pre-training task to achieve the best performance. Relative to their baseline counterparts, the largest average improvements on the test set were 3.0% for the Java model, 1.1% for the Julia model, and 5.7% for the C++ model.
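The thesis details the exact architecture; purely as an illustration of the core idea, the following hypothetical PyTorch sketch shows one way a transformer block can combine ordinary token self-attention with a second attention branch restricted to AST edges. The class name `ASTAugmentedBlock`, the layer sizes, and the way the two branches are summed are all assumptions for illustration, not the author's implementation.

```python
import torch
import torch.nn as nn

class ASTAugmentedBlock(nn.Module):
    """Transformer block with an extra attention branch over AST edges (illustrative sketch)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.n_heads = n_heads
        self.token_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.graph_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor, ast_adj: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) token embeddings.
        # ast_adj: (batch, seq, seq) bool, True where two tokens are linked in
        # the AST; it must include self-loops so no row is fully masked.
        mask = ~ast_adj  # MultiheadAttention blocks positions where mask is True.
        mask = mask.repeat_interleave(self.n_heads, dim=0)  # (batch*heads, seq, seq)
        tok, _ = self.token_attn(x, x, x)                    # unrestricted branch
        struct, _ = self.graph_attn(x, x, x, attn_mask=mask)  # AST-restricted branch
        x = self.norm1(x + tok + struct)
        return self.norm2(x + self.ffn(x))

# Toy usage: 8 tokens whose AST links form a chain, plus self-loops.
seq = 8
adj = torch.eye(seq, dtype=torch.bool)
adj |= torch.diag(torch.ones(seq - 1), 1).bool()
adj |= torch.diag(torch.ones(seq - 1), -1).bool()
block = ASTAugmentedBlock()
out = block(torch.randn(1, seq, 256), adj.unsqueeze(0))
print(out.shape)  # torch.Size([1, 8, 256])
```

In this sketch the structural branch can only attend along AST edges, so syntactic neighbors influence each other directly regardless of their distance in the token sequence, which is one plausible reading of how an AST component could complement standard self-attention.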