Beyond the Exact Match
Investigating the Relationship Between Syntactic and Semantic Equivalence in Human and LLM Test Assertions
R.H. van der Giessen (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. Panichella – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
M.J.G. Olsthoorn – Mentor (TU Delft - Electrical Engineering, Mathematics and Computer Science)
A. Voulimeneas – Graduation committee member (TU Delft - Electrical Engineering, Mathematics and Computer Science)
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Test assertions are a critical component of software tests, because they are the part that actually verifies whether the code under test exhibits the desired behaviour. However, writing test assertions is time-consuming, and research has therefore explored how to automate this task. Since the emergence of Large Language Models (LLMs) in recent years, interest in applying them to assertion generation has grown. LLMs have shown promise, with LLM-generated assertions achieving mutation scores similar to those of human-written assertions. However, existing research evaluates assertions using either exact matches or mutation scores in isolation, and so does not investigate the relationship between syntactic and semantic equivalence. This matters because syntactically different assertions can have the same semantics. Penalising the LLM for writing assertEquals(a, b) instead of assertTrue(a.equals(b)) leads to systematically under-reported LLM performance.
In this paper we investigate the extent to which LLM-generated assertions differ syntactically from human-written reference assertions while remaining semantically equivalent to them. We construct a dataset of 177 filtered entries drawn from open-source projects and generate assertions using gpt-oss-20b. We then measure syntactic similarity via normalised tree edit distance and related metrics. We approximate semantic similarity using the Jaccard and Ochiai similarity between the sets of mutants killed under PIT mutation testing. We find a moderately strong correlation between normalised tree edit distance and the Jaccard similarity of the killed mutants (ρ = -0.685, p < 0.001), indicating that the two metrics are related but not interchangeable. Open coding of 41 semantically equivalent but syntactically different pairs revealed ten transformation categories. The LLM showed a universal preference for omitting assertion messages and for replacing boolean checks with equality assertions. Finally, we use a decision tree to derive a threshold that effectively distinguishes datapoints that are likely to be semantically equivalent from those that are not. We find this threshold to be 0.41 for the normalised tree edit distance, with a median Jaccard similarity of 0.5290 below it and a median of 1.000 above it. Our findings suggest that exact match evaluation significantly underestimates LLM assertion generation performance, and that syntactic similarity with a fixed threshold offers a more useful metric for assertion quality.
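To make the semantic-similarity measure concrete, the following is a minimal sketch of how Jaccard and Ochiai similarity could be computed between two sets of killed mutants. It is not the tooling used in this work: the function names, the mutant identifiers, and the handling of empty sets are assumptions made for illustration only.

```python
from math import sqrt


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    if not a and not b:
        # Assumption: two assertions that kill no mutants are treated as identical.
        return 1.0
    return len(a & b) / len(a | b)


def ochiai(a: set, b: set) -> float:
    """Ochiai similarity: |A ∩ B| / sqrt(|A| * |B|)."""
    if not a and not b:
        # Assumption: same convention as for Jaccard above.
        return 1.0
    if not a or not b:
        # One side kills nothing, so there is no overlap to measure.
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


# Hypothetical mutant identifiers killed by each assertion.
human_killed = {"m1", "m2", "m3", "m4"}
llm_killed = {"m2", "m3", "m4", "m5"}

jaccard_score = jaccard(human_killed, llm_killed)
ochiai_score = ochiai(human_killed, llm_killed)
print(f"Jaccard: {jaccard_score:.2f}, Ochiai: {ochiai_score:.2f}")
```

In this hypothetical example the two assertions share three of five distinct killed mutants, so neither score reaches 1. Two assertions with identical kill sets score 1 on both measures regardless of how different they look syntactically.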