An empirical study of large language models for type and call graph analysis in Python and JavaScript

Journal Article (2025)
Author(s)

Ashwin Prasad Shivarpatna Venkatesh (Paderborn University)

Rose Sunil (Paderborn University)

Samkutty Sabu (Paderborn University)

Amir M. Mir (TU Delft - Software Engineering)

Sofia Reis (Universidade de Lisboa, INESC-ID)

Eric Bodden (Paderborn University)

DOI
https://doi.org/10.1007/s10664-025-10704-3 Final published version
Publication Year
2025
Language
English
Journal title
Empirical Software Engineering
Issue number
6
Volume number
30
Article number
167
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Large Language Models (LLMs) are increasingly being explored for their potential in software engineering, particularly in static analysis tasks. In this study, we investigate the potential of current LLMs to enhance call-graph analysis and type inference for Python and JavaScript programs. We empirically evaluated 24 LLMs, including OpenAI’s GPT series and open-source models like LLaMA and Mistral, using existing and newly developed benchmarks. Specifically, we enhanced TypeEvalPy, a micro-benchmarking framework for type inference in Python, with auto-generation capabilities, expanding its scope from 860 to 77,268 type annotations. Additionally, we introduce SWARM-CG and SWARM-JS, comprehensive benchmarking suites for evaluating call-graph construction tools across multiple programming languages. Our findings reveal contrasting performance of LLMs across static analysis tasks. For call-graph generation, traditional static analysis tools such as PyCG for Python and Jelly for JavaScript consistently outperform LLMs. While advanced models like mistral-large-it-2407-123b and gpt-4o show promise, they still struggle with completeness and soundness in call-graph analysis across both languages. In contrast, LLMs demonstrate a clear advantage in type inference for Python, surpassing traditional tools like HeaderGen and hybrid approaches such as HiTyper. These results suggest that, while LLMs hold promise in type inference, their limitations in call-graph analysis highlight the need for further research. Our study provides a foundation for integrating LLMs into static analysis workflows, offering insights into their strengths and current limitations.
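To make the two analysis tasks concrete, the following illustrative Python snippet (not taken from TypeEvalPy, SWARM-CG, or SWARM-JS; all names are hypothetical) shows the kind of micro-benchmark program such suites evaluate: a type-inference tool must recover the parameter and return types, while a call-graph tool must resolve which function a dynamically assigned callable refers to.

```python
# Hypothetical micro-benchmark program: small, but with a dynamically
# bound callable that static call-graph tools must resolve.

def greet(name):
    # Type inference should recover: name -> str, return -> str
    return "Hello, " + name

def shout(name):
    # Call-graph construction should record the edge shout -> greet
    return greet(name).upper()

# Dynamic binding: the ground truth resolves `fn` to `shout`, so the
# expected call-graph edges are <module> -> shout and shout -> greet.
fn = shout
result = fn("world")  # -> "HELLO, WORLD"

# A benchmark entry would pair this program with its ground truth, e.g.:
ground_truth_types = {"greet": {"name": "str", "return": "str"},
                      "shout": {"name": "str", "return": "str"}}
ground_truth_edges = {("<module>", "shout"), ("shout", "greet")}
```

A tool (or an LLM prompted on this program) is then scored by comparing its reported types and call edges against the ground truth, which is how completeness and soundness gaps show up in the evaluation.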