Benchmark Blindspots: A systematic audit of documentation decay in TPAMI’s datasets

Bachelor Thesis (2025)
Author(s)

A. Despan (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Andrew Demetriou – Mentor (TU Delft - Multimedia Computing)

Cynthia C.S. Liem – Mentor (TU Delft - Multimedia Computing)

J. Yang – Graduation committee member (TU Delft - Web Information Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
24-06-2025
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project; Annotation Practices in Societally Impactful Machine Learning Applications: What are these automated systems actually trained on?
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

High-impact
vision research still rests on datasets whose labels arrive via opaque, rarely documented
pipelines. To understand how serious the problem is inside a large venue, we
audited 75 TPAMI papers (2009-2024) that introduce or rely on datasets. Each
dataset was coded against a 27-item checklist adapted from "Garbage In, Garbage
Out", spanning annotator recruitment, training, compensation, overlap resolution,
and more. Across the corpus, 37% of the expected annotation metadata is
missing; the rate changes little between recent (2022-24) and older cohorts.
The scarcest fields are labeller-population rationale (76.6% absent),
prescreening criteria (73.4%), total annotators (68.8%), compensation (67.2%)
and training procedures (62.5%). Documentation quality shows virtually no
correlation with a paper’s citation impact, suggesting community prestige does
not buy transparency. A handful of well-curated datasets achieve >75%
completeness, proving that thorough documentation is possible when incentives
align. The median TPAMI benchmark still ships with an unverifiable "ground
truth", threatening the reproducibility and fairness claims of downstream
models. We advocate that journals and conferences require a concise,
checklist-based annotation statement, mirroring existing ethics and
reproducibility forms, to ensure future vision systems are built (and
evaluated) on transparent, trustworthy data foundations.



Files

Research_paper-1.pdf
(pdf | 0.77 MB)
License info not available