Beyond Accuracy: A Mixed-Method Exploration of Hash Database Verification
Focusing on the Detection of Child Sexual Abuse Material and Terrorist Content Online
M.J. Rottier (TU Delft - Technology, Policy and Management)
Savvas Zannettou – Mentor (TU Delft - Organisation & Governance)
Michel van Eeten – Graduation committee member (TU Delft - Organisation & Governance)
M. Kroesen – Graduation committee member (TU Delft - Transport and Logistics)
Arda Gerkens – Mentor
Ellen Janssen – Mentor
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
The spread of Child Sexual Abuse Material (CSAM) and Terrorist Content Online (TCO) remains a pressing societal issue. Various organizations rely on hash databases to detect, flag, and remove harmful content. These databases store digital fingerprints (hashes) of previously identified illegal material, enabling platforms to filter matching content automatically. However, the effectiveness and reliability of such databases depend on the verification processes used to determine which content qualifies for inclusion.
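As a minimal sketch of this matching step (the hash value, file paths, and function names below are hypothetical, and production systems typically rely on perceptual hashes such as PhotoDNA or PDQ rather than plain cryptographic digests), a database lookup can be as simple as:

import hashlib

def file_hash(path: str) -> str:
    """Compute a SHA-256 digest of a file's raw bytes."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Hypothetical fingerprints of previously verified illegal material.
known_hashes = {
    "3a7bd3e2360a3d29eea436fcfb7e44c735d117c42d1c1835420b6b9942dd4f1b",
}

def is_known_illegal(path: str) -> bool:
    """Flag a file when its fingerprint matches a verified database entry."""
    return file_hash(path) in known_hashes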
This thesis investigates the characteristics of verification processes in CSAM and TCO hash databases, with a particular focus on triple verification. Using a multiphase mixed-methods design, the study integrates qualitative insights from stakeholder interviews with a quantitative annotation experiment and a follow-up focus group with annotators.
The interviews with experts highlighted variations in verification workflows, ranging from single-rater decisions to triple-verification models. While triple verification is seen as a standard for increasing trust and minimizing false positives, its feasibility was questioned in terms of emotional toll and the volume of material to be reviewed. Thematic insights centered on benefits (e.g., legal considerations), challenges (e.g., emotional toll, inconsistent thresholds), necessity (e.g., utility and impact), future opportunities (e.g., automation), and differences between CSAM and TCO workflows.
In the experiment, two raters from the Dutch National Police classified 2,031 real, potentially illegal items under two conditions. In the blind phase, raters voted independently, whereas in the non-blind phase, prior votes were visible. Overall inter-rater agreement rose from 89.4% in the blind condition to 97.1% in the non-blind condition. A statistically significant association was found between voting order and agreement rates, suggesting that seeing one or two prior votes can subtly influence rater alignment.
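As a rough illustration of these measures (only the 89.4% blind-condition rate is taken from the study; all counts below are hypothetical), percent agreement and a chi-square test of association between voting order and agreement could be computed as follows:

from scipy.stats import chi2_contingency

def percent_agreement(n_agree: int, n_total: int) -> float:
    """Share of items on which the raters assigned the same label."""
    return n_agree / n_total

# Roughly reproduces the reported blind-condition rate (1816 / 2031 ≈ 0.894).
print(round(percent_agreement(1816, 2031), 3))

# Hypothetical counts: rows = number of prior votes visible (0, 1, 2),
# columns = (agree, disagree); the study's real contingency table is not shown here.
table = [[600, 70],
         [680, 25],
         [650, 15]]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}")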
The focus group offered further insight into the observed disagreements. A key theme was the importance of recognizing image series: individual images were often reclassified as illegal once identified as part of a known CSAM series. Age estimation was also a recurring source of ambiguity, particularly when visual quality was poor or when victims’ physical development and ethnicity made assessment difficult. Raters relied on indicators such as skin texture, body proportions, and dental features, though these cues were often interpreted differently.
The findings emphasize the need for verification systems that are both flexible and context-sensitive. Not all cases require the same level of scrutiny: baseline CSAM can be classified with fewer checks, whereas ambiguous cases warrant additional review. Rather than enforcing uniformity, organizations should accommodate interpretive differences while safeguarding consistency and accountability.
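One way to operationalize such context-sensitive scrutiny (the conditions, thresholds, and reviewer counts below are illustrative assumptions, not the thesis's prescriptions) is a simple routing rule that assigns more reviewers to more ambiguous items:

def required_reviewers(matches_known_series: bool, annotator_confidence: float) -> int:
    """Route clear-cut items to fewer checks and ambiguous items to more."""
    if matches_known_series:          # part of an already verified series
        return 1
    if annotator_confidence >= 0.9:   # unambiguous baseline material
        return 2
    return 3                          # ambiguous cases keep full triple verification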