Can Small Beat Big?

Evaluating Fine-Tuned CodeT5 Models on Assertion Generation Quality and Efficiency

Bachelor Thesis (2026)

Author(s)

T.A. van Leeuwen (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

A. Panichella – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

M.J.G. Olsthoorn – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

A. Voulimeneas – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Faculty

Electrical Engineering, Mathematics and Computer Science

Keywords

LLM, Mutation Testing, Test-Assertion Generation

To reference this document use

https://resolver.tudelft.nl/uuid:36bfefb0-c601-4b3a-9fbe-12d9abc8f9a4

Publication Year

2026

Language

English

Graduation Date

23-06-2026

Awarding Institution

Delft University of Technology

Project

CSE3000 Research Project

Programme

Computer Science and Engineering

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Testing is a core practice in software development for detecting faults and checking that code behaves as expected. With the recent advent of Large Language Models (LLMs), code generation has never been more widespread. In assertion generation, where the focus is on the oracles that assess the state of the program, fine-tuned code language models have emerged. One such model, AsserT5, is a CodeT5-large model (770M parameters) fine-tuned on pairs of focal methods and test methods. Although it achieves state-of-the-art performance when measured by exact match to the ground truth, it remains unclear how well the top-1 predictions of the smaller variants (CodeT5-small, 60M; CodeT5-base, 220M) perform in terms of mutation score when the same fine-tuning procedure is applied.

Across ten real-world Java projects and 541 assertion-generation tasks, we find that the fine-tuned 60M CodeT5-small matches the 220M and 770M variants on mutation score (within 0.2 percentage points) and achieves the highest score of the three by generating more assertions that compile. Among the larger code-specific baselines (Qwen2.5-Coder 3B, 7B, and 14B), CodeT5-small underperforms only the 14B model, and only by 0.6 percentage points. The 14B model's advantage is concentrated in just two of the ten projects, and it comes at the cost of 38x more memory (9.00 GB vs 0.24 GB) and 2.6x slower inference. Because the difference is small and confined to two of the ten projects, we recommend the fine-tuned CodeT5-small to practitioners seeking local assertion-generation assistance at reasonable computational cost.
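The comparison above rests on two simple quantities: mutation score and a memory ratio. The following is a minimal sketch of how they can be computed, assuming the common definition of mutation score as the percentage of generated mutants that the tests kill. The function names and the mutant counts are hypothetical and chosen only for illustration; they are not taken from the thesis. Only the two memory figures come from the abstract.

```python
def mutation_score(killed: int, total: int) -> float:
    """Percentage of mutants killed by the tests (assumed standard definition)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= killed <= total:
        raise ValueError("killed must be between 0 and total")
    return 100.0 * killed / total


def gap_in_percentage_points(score_a: float, score_b: float) -> float:
    """Absolute difference between two percentage scores, in percentage points."""
    return abs(score_a - score_b)


# Hypothetical mutant counts for two models, for illustration only.
score_model_a = mutation_score(killed=50, total=200)
score_model_b = mutation_score(killed=51, total=200)
gap = gap_in_percentage_points(score_model_a, score_model_b)

# Memory figures quoted in the abstract (GB): 9.00 for the 14B model, 0.24 for CodeT5-small.
memory_ratio = 9.00 / 0.24  # about 37.5, which the abstract reports as 38x
```

Expressing the gap in percentage points rather than as a relative percentage keeps differences such as 0.2 or 0.6 directly comparable across models, independent of the baseline score.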

Files

Can_small_beat_big.pdf

(pdf | 0.728 Mb)

License info not available