Can Small Beat Big?
Evaluating Fine-Tuned CodeT5 Models on Assertion Generation Quality and Efficiency
T.A. van Leeuwen (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. Panichella – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
M.J.G. Olsthoorn – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. Voulimeneas – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Abstract
Testing is a core practice in software development for detecting faults and checking that code behaves as expected. With the recent advent of Large Language Models (LLMs), code generation has never been more widespread. In assertion generation, where the focus is on the oracles that assess the state of the program, fine-tuned code language models have emerged. One such model, AsserT5, is a CodeT5-large (770M parameters) fine-tuned on focal-method and test-method pairs. Although it achieves state-of-the-art performance when measured by exact match to the ground truth, it remains unclear how the top-1 predictions of the smaller variants (CodeT5-small, 60M; CodeT5-base, 220M) perform on mutation score when the same fine-tuning procedure is applied.
Across ten real-world Java projects and 541 assertion-generation tasks, we find that the fine-tuned 60M CodeT5-small matches the 220M and 770M variants on mutation score (within 0.2 p.p.) and achieves the highest score of the three because more of its generated assertions compile. Among the larger code-specific baselines (Qwen2.5-Coder 3B, 7B, and 14B), only the 14B model outperforms CodeT5-small, and only by 0.6 p.p. The 14B model's advantage is concentrated in just two of the ten projects, and it comes at the cost of roughly 38x more memory (9.00 GB vs 0.24 GB) and 2.6x slower inference. Because the difference is small and confined to two of the ten projects, we recommend the fine-tuned CodeT5-small to practitioners seeking local assertion-generation assistance at reasonable computational cost.
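The following is a minimal sketch of the arithmetic behind these comparisons. It is not the evaluation code used in this work. It assumes the conventional definition of mutation score (killed mutants divided by total mutants, as a percentage). The function names and the mutant counts are hypothetical; only the memory figures (9.00 GB and 0.24 GB) come from the abstract.

```python
def mutation_score(killed: int, total: int) -> float:
    """Percentage of mutants killed by the test suite.

    Assumes the conventional definition; returns 0.0 when no mutants exist.
    """
    if killed < 0 or total < 0 or killed > total:
        raise ValueError("require 0 <= killed <= total")
    return 100.0 * killed / total if total else 0.0


def percentage_point_gap(score_a: float, score_b: float) -> float:
    """Absolute difference between two percentage scores, in p.p."""
    return abs(score_a - score_b)


# Hypothetical mutant counts, for illustration only.
small_score = mutation_score(killed=300, total=1000)
large_score = mutation_score(killed=306, total=1000)
gap = percentage_point_gap(small_score, large_score)

# Memory figures as reported in the abstract.
memory_ratio = 9.00 / 0.24  # 37.5, reported as roughly 38x
```

A gap expressed in percentage points is a plain difference of two scores, which is why a 0.6 p.p. gap can be weighed directly against a memory ratio of about 37.5.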