PyHintSearch: Validated Combinatorial Search for Probabilistic Type Inference Models

Master Thesis (2024)
Author(s)

M.J. Bekooy (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Georgios Gousios – Mentor (TU Delft - Software Technology)

C.B. Bach – Graduation committee member (TU Delft - Programming Languages)

S. Proksch – Graduation committee member (TU Delft - Software Engineering)

Faculty
Electrical Engineering, Mathematics and Computer Science
Copyright
© 2024 Mark Bekooy
More Info
expand_more
Publication Year
2024
Language
English
Copyright
© 2024 Mark Bekooy
Graduation Date
05-03-2024
Awarding Institution
Delft University of Technology
Programme
['Computer Science']
Faculty
Electrical Engineering, Mathematics and Computer Science
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Type annotations in Python are an integral part of static analysis. They can be used for code documentation, error detection and the development of cleaner architectures. By enhancing code quality, they contribute to the robustness, maintainability and comprehensibility of codebases. Tools like static type checkers use type annotations to detect bugs early, with some type checkers like Pyright being capable of inferring annotations statically from source code.

This thesis uses an innovative approach to further enhance type annotation coverage in Python codebases by using a combination of machine learning predictions and combinatorial search. To do this, PyHintSearch was developed. PyHintSearch constructs a search tree to which a depth-first search is applied to systematically explore potential combinations of predicted type annotations and validate them using feedback from the Pyright static type checker. Ultimately, the goal is to identify a branch containing a valid combination of type annotations. These annotations can then be integrated into Python code, thereby enhancing the type annotation coverage, which leads to improved static analysis and ultimately better code quality.

PyHintSearch's effectiveness is evaluated based on type annotation coverage and correctness, performance, and practical usability. Experimental results demonstrate different improvements in type annotation coverage, depending on the machine learning model used for type inference. Type4Py showed an improvement of 62.45% and TypeT5 of 79.93%. The precision of type annotations from these models are 0.36 and 0.51, respectively. Performance-wise, PyHintSearch can efficiently explore the exponential search space, annotating 16 diverse projects, ranging from small to large, in approximately 13.75 hours when using the Type4Py model. Regarding practical usability, the impact of type annotations on downstream program analysis is examined through the generation of call graphs. The additional information that type annotations provide can be used to refine the call graph by eliminating irrelevant calls to make it more precise.

Files

License info not available