M. Berzins


Please Note

This page displays the records of the person named above and is not linked to a unique person identifier. This record may need to be merged to a profile.

Bachelor thesis (1)

1 record found

Can LLMs Consistently Describe Programs Across Source Code, Assembly, and Binary Representations?

Evaluating the Quality of Generated High-Level Descriptions of Benign and Malware Programs

Bachelor thesis (2026) - M. Berzins, S.S. Chakraborty, P. Pawelczak, A. van Deursen

Large language models (LLMs) are increasingly used to summarize and reason about software artifacts. This is especially true in cybersecurity, where analysts must often interpret low-level code such as assembly or binary. If an LLM describes the same program differently depending on its representation, analysts may receive inconsistent or incomplete explanations. This paper evaluates whether an LLM generates descriptions that are both consistent across representations of the same program and aligned with human reference descriptions. We use a balanced subset of SBAN, a dataset that provides aligned high-level source code, disassembled assembly, and raw hexadecimal binary representations of the same programs together with a natural-language reference. For every representation, we generate high-level descriptions with Qwen3.5-2B using a fixed prompt and low-temperature stochastic decoding, repeated over five runs for 75,000 descriptions in total. To prevent context-level reference leakage and support reproducibility, each description is generated independently, without conversation history, the reference description, dataset labels, or the other representations. We measure cross-representation consistency and reference alignment with complementary metrics: sentence-transformer cosine similarity and ROUGE-L over the full dataset, BERTScore against the references, and Prometheus, an independent LLM judge, on a fixed 600-sample subset. Source-code descriptions align best with the references. Among representation pairs, assembly-source is the most consistent and binary-source the least. A Friedman test confirms a statistically significant representation effect on reference-based quality. The absolute differences are small, however, and cross-representation consistency is only moderate across all metrics.
These results indicate that representation choice measurably affects both the quality and consistency of LLM-generated descriptions, likely because each representation exposes a different level of semantic information. ...
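To make two of the metrics named in the abstract concrete, the following is a minimal sketch of ROUGE-L (as an F1 score over the longest common subsequence of tokens) and of cosine similarity between two embedding vectors. It is not the thesis's implementation. The function names, the lowercase whitespace tokenization, and the hand-written vectors are assumptions made for illustration, and the sentence-transformer model that would normally produce the embeddings is left out.

```python
from math import sqrt
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between two descriptions.

    Assumption: lowercase whitespace tokenization; real toolkits may
    tokenize or stem differently.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Hypothetical descriptions of the same program from two representations.
    from_source = "the program reads a file and prints it"
    from_binary = "reads a file"
    print(rouge_l_f1(from_binary, from_source))
    # Hypothetical embedding vectors standing in for sentence-transformer output.
    print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

In this sketch, cross-representation consistency would correspond to scoring two generated descriptions of the same program against each other, and reference alignment to scoring a generated description against the human reference.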