An empirical study of large language models for type and call graph analysis in Python and JavaScript

Journal Article (2025)
Author(s)

Ashwin Prasad Shivarpatna Venkatesh (Paderborn University)

Rose Sunil (Paderborn University)

Samkutty Sabu (Paderborn University)

Amir M. Mir (TU Delft - Software Engineering)

Sofia Reis (Universidade de Lisboa, INESC-ID)

Eric Bodden (Paderborn University)

DOI
https://doi.org/10.1007/s10664-025-10704-3 Final published version
Publication Year
2025
Language
English
Journal title
Empirical Software Engineering
Issue number
6
Volume number
30
Article number
167
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large Language Models (LLMs) are increasingly being explored for their potential in software engineering, particularly in static analysis tasks. In this study, we investigate the potential of current LLMs to enhance call-graph analysis and type inference for Python and JavaScript programs. We empirically evaluated 24 LLMs, including OpenAI’s GPT series and open-source models like LLaMA and Mistral, using existing and newly developed benchmarks. Specifically, we enhanced TypeEvalPy, a micro-benchmarking framework for type inference in Python, with auto-generation capabilities, expanding its scope from 860 to 77,268 type annotations. Additionally, we introduce SWARM-CG and SWARM-JS, comprehensive benchmarking suites for evaluating call-graph construction tools across multiple programming languages. Our findings reveal contrasting performance of LLMs across static analysis tasks. For call-graph generation, traditional static analysis tools such as PyCG for Python and Jelly for JavaScript consistently outperform LLMs. While advanced models like mistral-large-it-2407-123b and gpt-4o show promise, they still struggle with completeness and soundness in call-graph analysis across both languages. In contrast, LLMs demonstrate a clear advantage in type inference for Python, surpassing traditional tools like HeaderGen and hybrid approaches such as HiTyper. These results suggest that, while LLMs hold promise in type inference, their limitations in call-graph analysis highlight the need for further research. Our study provides a foundation for integrating LLMs into static analysis workflows, offering insights into their strengths and current limitations.
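To make the two analysis tasks concrete, the following illustrative Python snippet (not taken from TypeEvalPy, SWARM-CG, or SWARM-JS; all names are hypothetical) shows the kind of micro-benchmark program such suites evaluate: a type-inference tool must recover the parameter and return types, while a call-graph tool must resolve which function a dynamically assigned callable refers to.

```python
# Hypothetical micro-benchmark program: small, but with a dynamically
# bound callable that static call-graph tools must resolve.

def greet(name):
    # Type inference should recover: name -> str, return -> str
    return "Hello, " + name

def shout(name):
    # Call-graph construction should record the edge shout -> greet
    return greet(name).upper()

# Dynamic binding: the ground truth resolves `fn` to `shout`, so the
# expected call-graph edges are <module> -> shout and shout -> greet.
fn = shout
result = fn("world")  # -> "HELLO, WORLD"

# A benchmark entry would pair this program with its ground truth, e.g.:
ground_truth_types = {"greet": {"name": "str", "return": "str"},
                      "shout": {"name": "str", "return": "str"}}
ground_truth_edges = {("<module>", "shout"), ("shout", "greet")}
```

A tool (or an LLM prompted on this program) is then scored by comparing its reported types and call edges against the ground truth, which is how completeness and soundness gaps show up in the evaluation.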