CodeFill

Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences

Conference Paper (2022)
Authors

Maliheh Izadi (TU Delft - Software Engineering)

Roberta Gismondi (Student TU Delft)

Georgios Gousios (TU Delft - Software Technology, TU Delft - Software Engineering)

Research Group
Software Engineering
Copyright
© 2022 M. Izadi, R. Gismondi, G. Gousios
To reference this document use:
https://doi.org/10.1145/3510003.3510172
Publication Year
2022
Language
English
Pages (from-to)
401-412
ISBN (electronic)
9781450392211
DOI:
https://doi.org/10.1145/3510003.3510172
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward, or distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Code completion is an essential feature of IDEs, yet current autocompleters are restricted to either grammar-based or NLP-based single-token completions. Both approaches have significant drawbacks: grammar-based autocompletion is restricted in dynamically typed language environments, whereas NLP-based autocompleters struggle to understand the semantics of the programming language and the developer's code context. In this work, we present CodeFill, a language model for autocompletion that combines learned structure and naming information. Using a parallel Transformer architecture and multi-task learning, CodeFill consumes sequences of source code token names and their equivalent AST token types. Uniquely, CodeFill is trained both for single-token and multi-token (statement) prediction, which enables it to learn long-range dependencies among grammatical and naming elements. We train CodeFill on two datasets, consisting of 29M and 425M lines of code, respectively. To make the evaluation more realistic, we develop a method to automatically infer points in the source code at which completion matters. We compare CodeFill against four baselines and two state-of-the-art models, GPT-C and TravTrans+. CodeFill surpasses all baselines in single-token prediction (MRR: 70.9% vs. 66.2% and 67.8%) and outperforms the state of the art in multi-token prediction (ROUGE-L: 63.7% vs. 52.4% and 59.2%, for n=4 tokens). We publicly release our source code and datasets.
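
To make the abstract's description concrete, below is a minimal sketch in PyTorch of a parallel Transformer with multi-task heads, plus a Mean Reciprocal Rank helper matching the reported MRR metric. The module layout, layer sizes, concatenation-based fusion of the two streams, and the mrr function are illustrative assumptions for this sketch, not the authors' actual CodeFill architecture.

# Illustrative sketch only: a two-stream Transformer over aligned sequences
# of token names and AST token types, with multi-task prediction heads.
# All sizes and the fusion strategy are assumptions, not CodeFill's design.
import torch
import torch.nn as nn

class ParallelCompletionSketch(nn.Module):
    def __init__(self, name_vocab, type_vocab, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        self.name_emb = nn.Embedding(name_vocab, d_model)
        self.type_emb = nn.Embedding(type_vocab, d_model)
        # One Transformer stack per input stream: token names and AST types.
        self.name_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.type_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # Multi-task heads: predict the next token name and its AST token type.
        self.name_head = nn.Linear(2 * d_model, name_vocab)
        self.type_head = nn.Linear(2 * d_model, type_vocab)

    def forward(self, names, types):
        # names, types: (batch, seq) ids of aligned token names and AST types.
        causal = nn.Transformer.generate_square_subsequent_mask(names.size(1))
        h_names = self.name_enc(self.name_emb(names), mask=causal)
        h_types = self.type_enc(self.type_emb(types), mask=causal)
        h = torch.cat([h_names, h_types], dim=-1)  # fuse the two streams
        return self.name_head(h), self.type_head(h)

def mrr(ranks):
    # Mean Reciprocal Rank over 1-based ranks of each correct prediction.
    return sum(1.0 / r for r in ranks) / len(ranks)

# Smoke test with random ids and hypothetical vocabulary sizes.
model = ParallelCompletionSketch(name_vocab=1000, type_vocab=50)
names = torch.randint(0, 1000, (2, 16))
types = torch.randint(0, 50, (2, 16))
name_logits, type_logits = model(names, types)  # (2, 16, vocab) each

Jointly supervising both heads is one plausible way to realize the multi-task learning the abstract describes: the shared gradient signal from type prediction can regularize name prediction, which matters most for rare identifiers in dynamically typed code.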