Utilizing Lingual Structures to Enhance Transformer Performance in Source Code Completions

Master Thesis (2022)
Author(s)

J.B. Katzy (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

M. Finavaro Aniche – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

S.A.M. Mir – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2022
Language
English
Graduation Date
23-08-2022
Awarding Institution
Delft University of Technology
Programme
Computer Science
Collections
thesis
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

We explored the effect of augmenting a standard language model architecture (BERT) with a structural component based on the Abstract Syntax Trees (ASTs) of the source code. We created a universal AST structure that can be applied to multiple programming languages, which enables the model to work in a multilingual setting. We adapted the general graph transformer architecture to serve as the structural component of the transformer. Furthermore, we extended Embeddings from Language Models (ELMo) style embeddings to work in a multilingual setting on incomplete source code. The final results showed that the multilingual setting helped the embedding model produce higher-quality embeddings, whereas for the transformer model, monolingual models performed better in most cases. Adding ASTs improved the best-performing models for all languages and also reduced the need for a pre-training task to reach the best performance. Compared with its baseline counterpart, the largest average improvement on the test set was 3.0% for a Java model, 1.1% for a Julia model, and 5.7% for a C++ model.
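To make the idea of a language-independent AST input more concrete, the following is a minimal sketch, not the thesis implementation. It parses a Python snippet with the standard `ast` module (Python is used here only for convenience; the thesis targets Java, Julia, and C++), relabels nodes with a small, hypothetical `UNIVERSAL_LABELS` mapping, and returns node labels plus parent-child edges of the kind a graph-based structural component could consume. The function name `universal_ast_graph` and every label in the mapping are assumptions made for illustration.

```python
import ast

# Hypothetical mapping from Python-specific node types to shared,
# language-independent labels. The thesis defines its own universal
# AST; these names are illustrative assumptions only.
UNIVERSAL_LABELS = {
    "FunctionDef": "function_definition",
    "Return": "return_statement",
    "BinOp": "binary_expression",
    "Name": "identifier",
    "Constant": "literal",
    "Call": "call_expression",
    "Assign": "assignment",
}


def universal_ast_graph(source):
    """Return (labels, edges) for the AST of `source`.

    labels[i] is the shared label of node i (in pre-order);
    edges is a list of (parent_index, child_index) pairs.
    """
    tree = ast.parse(source)
    labels = []
    edges = []

    def visit(node, parent_index):
        index = len(labels)
        node_type = type(node).__name__
        # Node types without a shared label fall back to "other".
        labels.append(UNIVERSAL_LABELS.get(node_type, "other"))
        if parent_index is not None:
            edges.append((parent_index, index))
        for child in ast.iter_child_nodes(node):
            visit(child, index)

    visit(tree, None)
    return labels, edges


labels, edges = universal_ast_graph("def add(a, b):\n    return a + b\n")
```

Note that `ast.parse` only accepts syntactically complete input, so this sketch does not address the incomplete source code that arises during code completion.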

Files

Thesis_Jonathan_Katzy.pdf
(pdf | 14.2 Mb)
License info not available