Evaluating SURP MIA performance on code samples

Bachelor Thesis (2026)
Author(s)

Ísak Jónsson (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

M. Izadi – Mentor (TU Delft - Software Engineering)

A. Al-Kaswan – Mentor (TU Delft - Software Engineering)

J.B. Katzy – Mentor (TU Delft - Software Engineering)

R.L. Lagendijk – Graduation committee member (TU Delft - Cyber Security)

Publication Year
2026
Language
English
Graduation Date
28-01-2026
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Code language models are pretrained on massive datasets scraped from public repositories, and these datasets are rarely disclosed. Membership Inference Attacks (MIAs) aim to predict whether specific samples were used in training, but attack performance is contested. Previous work has shown that many attacks on LLMs perform randomly when evaluated on independent and identically distributed (i.i.d.) members and non-members. We consider three MIAs on StarCoder2-3B and Mellum-4B: LOSS, MinK%, and SURP, where each attack extends the previous one by further filtering the tokens that contribute to the membership signal. We use the AISE MIA dataset, which contains 100,000 Java files with verified membership labels, and address a gap in the evaluation of these attacks on i.i.d. code samples and in the detailed comparison of SURP and MinK%. A bag-of-words (BoW) classifier is used to measure distribution shift; under i.i.d. conditions its expected ROC-AUC is 0.5. The classifier reaches a ROC-AUC of 0.91, confirming substantial distribution shift. We apply two debiasing procedures to construct evaluation subsets: selecting samples close to the BoW decision boundary reduces the BoW ROC-AUC to 0.66, while selecting BoW-misclassified samples fails to reduce the shift. After debiasing, all attacks perform at or below the bag-of-words baseline, with ROC-AUC between 0.55 and 0.63 and TPR at 5% FPR between 0.05 and 0.16, suggesting random performance under strict i.i.d. conditions. Hyperparameter ablation reveals that SURP collapses to MinK% under optimization: optimal configurations either disable SURP filtering or, with one outlier excluded, agree with MinK% classifications on more than 94% of samples. These results extend prior natural-language findings to code: reference-free attacks exploit distributional differences rather than detecting membership.
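
For concreteness, the three membership scores can be read as nested token filters over per-token log-probabilities: LOSS averages all tokens, MinK% averages only the k% least likely tokens, and SURP applies a further filter before that average. Below is a minimal PyTorch sketch, assuming a Hugging Face-style causal LM; the entropy-threshold form of the SURP filter and all hyperparameter values (k, entropy_max) are illustrative assumptions, not the thesis configuration.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def mia_scores(model, input_ids, k=0.2, entropy_max=2.0):
        """Per-sample LOSS, MinK%, and SURP scores; higher = more non-member-like."""
        logits = model(input_ids).logits[0, :-1]    # (T-1, vocab): next-token logits
        targets = input_ids[0, 1:]                  # (T-1,): the tokens actually observed
        logp = F.log_softmax(logits, dim=-1)
        tok_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)  # per-token log-prob

        # LOSS: mean negative log-likelihood over all tokens.
        loss_score = -tok_logp.mean().item()

        # MinK%: average only over the k% least likely tokens.
        k_n = max(1, int(k * tok_logp.numel()))
        mink_score = -tok_logp.topk(k_n, largest=False).values.mean().item()

        # SURP (assumed form): before the MinK%-style average, keep only tokens
        # at positions where the model's predictive entropy is below a threshold.
        entropy = -(logp.exp() * logp).sum(-1)      # (T-1,)
        kept = tok_logp[entropy < entropy_max]
        if kept.numel() == 0:
            surp_score = loss_score                 # fall back if the filter empties the sample
        else:
            surp_score = -kept.topk(min(k_n, kept.numel()), largest=False).values.mean().item()
        return loss_score, mink_score, surp_score

Note how setting the filter wide open (large entropy_max) makes SURP coincide with MinK%, which is the collapse the ablation in this thesis observes.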

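The distribution-shift check and the boundary-proximity debiasing step can be sketched together: a simple classifier on token counts should score near ROC-AUC 0.5 on a truly i.i.d. split, and the samples it is least sure about form the debiased subset. A minimal scikit-learn sketch follows; the vectorizer settings, classifier choice, and kept fraction are illustrative assumptions, not the thesis configuration.

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict
    from sklearn.metrics import roc_auc_score

    def bow_shift_probs(files, labels):
        """files: list of source strings; labels: 1 = member, 0 = non-member."""
        X = CountVectorizer(max_features=50_000).fit_transform(files)
        clf = LogisticRegression(max_iter=1000)
        # Cross-validated probabilities so the AUC reflects distribution shift,
        # not the classifier memorizing the evaluation samples.
        return cross_val_predict(clf, X, labels, cv=5, method="predict_proba")[:, 1]

    def near_boundary_subset(probs, frac=0.3):
        """Boundary-proximity debiasing: keep the samples the BoW classifier is
        least sure about (predicted probability closest to 0.5)."""
        order = np.argsort(np.abs(np.asarray(probs) - 0.5))
        return order[: int(frac * len(order))]      # indices of the debiased subset

    # probs = bow_shift_probs(files, labels)
    # roc_auc_score(labels, probs)   # ~0.5 under i.i.d.; 0.91 reported in this thesis
    # keep = near_boundary_subset(probs)  # re-check AUC on this subset (0.66 reported)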