Benchmark Blindspots: A systematic audit of documentation decay in TPAMI’s datasets

Bachelor Thesis (2025)
Author(s)

A. Despan (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Andrew Demetriou – Mentor (TU Delft - Multimedia Computing)

Cynthia C.S. Liem – Mentor (TU Delft - Multimedia Computing)

J. Yang – Graduation committee member (TU Delft - Web Information Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2025
Language
English
Graduation Date
24-06-2025
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project; Annotation Practices in Societally Impactful Machine Learning Applications: What are these automated systems actually trained on?
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

High-impact
vision research still rests on datasets whose labels arrive via opaque, rarely documented
pipelines. To understand how serious the problem is inside a large venue, we
audited 75 TPAMI papers (2009-2024) that introduce or rely on datasets. Each
dataset was coded against a 27-item checklist adapted from "Garbage In, Garbage
Out", spanning annotator recruitment, training, compensation, overlap resolution,
and more. Across the corpus, 37% of the expected annotation metadata is
missing; the rate changes little between recent (2022-24) and older cohorts.
The scarcest fields are labeller-population rationale (76.6% absent),
prescreening criteria (73.4%), total annotators (68.8%), compensation (67.2%)
and training procedures (62.5%). Documentation quality shows virtually no
correlation with a paper’s citation impact, suggesting community prestige does
not buy transparency. A handful of well-curated datasets achieve >75%
completeness, proving that thorough documentation is possible when incentives
align. The median TPAMI benchmark still ships with an unverifiable "ground
truth", threatening the reproducibility and fairness claims of downstream
models. We advocate that journals and conferences require a concise,
checklist-based annotation statement, mirroring existing ethics and
reproducibility forms, to ensure future vision systems are built (and
evaluated) on transparent, trustworthy data foundations.



Files

Research_paper-1.pdf
(pdf | 0.77 MB)
License info not available