Measurement by Proxy: On the Accuracy of Online Marketplace Measurements

Conference Paper (2022)
Author(s)

Alejandro Cuevas (Carnegie Mellon University)

F.E.G. Miedema (TU Delft - Organisation & Governance)

Kyle Soska (University of Illinois)

Nicolas Christin (Carnegie Mellon University)

Rolf van Wegberg (TU Delft - Organisation & Governance)

Research Group
Organisation & Governance
Copyright
© 2022 Alejandro Cuevas, F.E.G. Miedema, Kyle Soska, Nicolas Christin, R.S. van Wegberg
Publication Year
2022
Language
English
Pages (from-to)
2153-2170
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

A number of recent studies have investigated online anonymous (“dark web”) marketplaces. Almost all leverage a “measurement-by-proxy” design, in which researchers scrape market public pages and take buyer reviews as a proxy for actual transactions, to gain insights into market size and revenue. Yet, we do not know if and how this method biases results. We build a framework to reason about marketplace measurement accuracy, and use it to contrast estimates projected from scrapes of Hansa Market with data from a back-end database seized by the police. We further investigate, by simulation, the impact of scraping frequency, consistency, and rate-limits. We find that, even with a decent scraping regimen, one might miss approximately 46% of objects, with scraped listings differing significantly from non-scraped listings on price, views, and product categories. This bias also impacts revenue calculations. We find Hansa’s total market revenue to be US $50M, which projections based on our scrapes underestimate by a factor of four. Simulations further show that studies based on one or two scrapes are likely to suffer from very poor coverage (on average, 14% and 30%, respectively). A high scraping frequency is crucial to achieve reliable coverage, even without a consistent scraping routine. When high-frequency scraping is difficult, e.g., due to deployed anti-scraping countermeasures, innovative scraper design, such as scraping the most popular listings first, helps improve coverage. Finally, abundance estimators can provide insights on population coverage when population sizes are unknown.
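
The abstract refers to abundance estimators without naming a specific one. The sketch below pairs a toy scrape model with the incidence-based Chao estimator (often called Chao2), a standard choice for repeated-sample sighting data; whether this is the estimator used in the paper is an assumption, and the population size and per-scrape capture probability below are illustrative values, not figures from the study.

import random
from collections import Counter

def chao2(sightings, n_scrapes):
    """Incidence-based Chao estimate of total population size.

    `sightings` maps each observed object (e.g., a listing ID) to the
    number of scrapes in which it appeared. Uses the bias-corrected
    form, which stays defined when no object was seen exactly twice.
    """
    s_obs = len(sightings)
    f1 = sum(1 for c in sightings.values() if c == 1)  # seen in one scrape
    f2 = sum(1 for c in sightings.values() if c == 2)  # seen in two scrapes
    k = n_scrapes
    return s_obs + ((k - 1) / k) * f1 * (f1 - 1) / (2 * (f2 + 1))

def simulate_scrapes(population=10_000, n_scrapes=2, p_seen=0.15, seed=0):
    """Toy model: each scrape independently captures each listing with
    probability `p_seen`; returns per-listing sighting counts."""
    rng = random.Random(seed)
    sightings = Counter()
    for _ in range(n_scrapes):
        for listing in range(population):
            if rng.random() < p_seen:
                sightings[listing] += 1
    return sightings

if __name__ == "__main__":
    N = 10_000
    for k in (2, 4, 16, 64):
        seen = simulate_scrapes(population=N, n_scrapes=k)
        print(f"{k:>2} scrapes: coverage {len(seen) / N:5.0%}, "
              f"Chao estimate {chao2(seen, k):>9,.0f} (true {N:,})")

Under this simplified model of independent, uniform capture, coverage grows with the number of scrapes, while the estimator recovers the true population size well before coverage is complete, illustrating the abstract’s point that abundance estimators give insight into population coverage when the true population size is unknown.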

Files

Sec22_cuevas.pdf
(pdf | 1.03 MB)
License info not available