CodeFill

Multi-token Code Completion by Jointly Learning from Structure and Naming Sequences

Conference Paper (2022)
Authors

Maliheh Izadi (TU Delft - Software Engineering)

Roberta Gismondi (Student TU Delft)

Georgios Gousios (TU Delft - Software Technology, TU Delft - Software Engineering)

Research Group
Software Engineering
Copyright
© 2022 M. Izadi, R. Gismondi, G. Gousios
To reference this document use:
https://doi.org/10.1145/3510003.3510172
Publication Year
2022
Language
English
Pages (from-to)
401-412
ISBN (electronic)
9781450392211
DOI:
https://doi.org/10.1145/3510003.3510172
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward, or distribute the text or part of it without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Code completion is an essential feature of IDEs, yet current autocompleters are restricted to either grammar-based or NLP-based single-token completions. Both approaches have significant drawbacks: grammar-based autocompletion is restricted in dynamically typed language environments, whereas NLP-based autocompleters struggle to understand the semantics of the programming language and the developer's code context. In this work, we present CodeFill, a language model for autocompletion that combines learned structure and naming information. Using a parallel Transformer architecture and multi-task learning, CodeFill consumes sequences of source code token names and their equivalent AST token types. Uniquely, CodeFill is trained both for single-token and multi-token (statement) prediction, which enables it to learn long-range dependencies among grammatical and naming elements. We train CodeFill on two datasets, consisting of 29M and 425M lines of code, respectively. To make the evaluation more realistic, we develop a method to automatically infer points in the source code at which completion matters. We compare CodeFill against four baselines and two state-of-the-art models, GPT-C and TravTrans+. CodeFill surpasses all baselines in single-token prediction (MRR: 70.9% vs. 66.2% and 67.8%) and outperforms the state of the art in multi-token prediction (ROUGE-L: 63.7% vs. 52.4% and 59.2%, for n=4 tokens). We publicly release our source code and datasets.
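
To make the abstract's description concrete, below is a minimal sketch in PyTorch of a parallel Transformer with multi-task heads, plus a Mean Reciprocal Rank helper matching the reported MRR metric. The module layout, layer sizes, concatenation-based fusion of the two streams, and the mrr function are illustrative assumptions for this sketch, not the authors' actual CodeFill architecture.

# Illustrative sketch only: a two-stream Transformer over aligned sequences
# of token names and AST token types, with multi-task prediction heads.
# All sizes and the fusion strategy are assumptions, not CodeFill's design.
import torch
import torch.nn as nn

class ParallelCompletionSketch(nn.Module):
    def __init__(self, name_vocab, type_vocab, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        self.name_emb = nn.Embedding(name_vocab, d_model)
        self.type_emb = nn.Embedding(type_vocab, d_model)
        # One Transformer stack per input stream: token names and AST types.
        self.name_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.type_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # Multi-task heads: predict the next token name and its AST token type.
        self.name_head = nn.Linear(2 * d_model, name_vocab)
        self.type_head = nn.Linear(2 * d_model, type_vocab)

    def forward(self, names, types):
        # names, types: (batch, seq) ids of aligned token names and AST types.
        causal = nn.Transformer.generate_square_subsequent_mask(names.size(1))
        h_names = self.name_enc(self.name_emb(names), mask=causal)
        h_types = self.type_enc(self.type_emb(types), mask=causal)
        h = torch.cat([h_names, h_types], dim=-1)  # fuse the two streams
        return self.name_head(h), self.type_head(h)

def mrr(ranks):
    # Mean Reciprocal Rank over 1-based ranks of each correct prediction.
    return sum(1.0 / r for r in ranks) / len(ranks)

# Smoke test with random ids and hypothetical vocabulary sizes.
model = ParallelCompletionSketch(name_vocab=1000, type_vocab=50)
names = torch.randint(0, 1000, (2, 16))
types = torch.randint(0, 50, (2, 16))
name_logits, type_logits = model(names, types)  # (2, 16, vocab) each

Jointly supervising both heads is one plausible way to realize the multi-task learning the abstract describes: the shared gradient signal from type prediction can regularize name prediction, which matters most for rare identifiers in dynamically typed code.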