Watermarking of numerical datasets used for ML

A DWT approach for watermarking numerical datasets

Bachelor Thesis (2024)
Author(s)

M.C. Crăciun (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Contributor(s)

Z. Erkin – Mentor (TU Delft - Cyber Security)

Devris Isler – Mentor

A. Katsifodimos – Graduation committee member (TU Delft - Data-Intensive Systems)

Faculty
Electrical Engineering, Mathematics and Computer Science
Publication Year
2024
Language
English
Graduation Date
20-06-2024
Awarding Institution
Delft University of Technology
Project
CSE3000 Research Project
Programme
Computer Science and Engineering
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

AI and machine learning have attracted great interest in recent years, with applications in many domains. Training these models into useful tools requires a large amount of data. This data is expensive to collect, which has made it one of the most valuable commodities of this century. As the value of data increases, protecting it as intellectual property becomes more and more relevant. Watermarking is widely used to protect media data, but its non-media counterpart has not been researched as thoroughly. In this paper, an adaptation of a common watermarking technique, discrete wavelet transform (DWT) watermarking, is applied to two datasets used for machine learning. The technique is invisible and robust in signal watermarking, but its performance on numerical datasets has not previously been researched. A previously devised algorithm was used and adjusted to better fit dataset watermarking. To assess the quality of the watermark, the marked data was subjected to create, remove, update and zero-out attacks. In addition, multiple machine-learning models were trained on the marked data. Initial results show that the proposed technique performs well in terms of invisibility: models trained on the marked data obtain similar or better accuracy than models trained on the original data. The watermark is, however, quite sensitive to attacks. Even small modifications, affecting less than 1% of the data, can break the signature.
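To make the idea of embedding a signature in the wavelet domain of a numerical column more concrete, the following is a minimal sketch. It is not the algorithm used in the thesis, which the abstract does not describe. It assumes a single-level Haar DWT and a simple parity-based quantisation of the detail coefficients; the function names, the `step` parameter and the sample values are all hypothetical and chosen only for illustration.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_dwt(values):
    """Single-level Haar DWT of an even-length list of numbers."""
    approx = [(values[i] + values[i + 1]) / SQRT2 for i in range(0, len(values), 2)]
    detail = [(values[i] - values[i + 1]) / SQRT2 for i in range(0, len(values), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuilds the original sequence."""
    values = []
    for a, d in zip(approx, detail):
        values.append((a + d) / SQRT2)
        values.append((a - d) / SQRT2)
    return values


def embed(values, bits, step=0.1):
    """Hide one bit per detail coefficient (assumed scheme, not the thesis's).

    Each marked coefficient is moved to a multiple of `step` whose index
    has the same parity as the bit, so it changes by at most `step`.
    """
    approx, detail = haar_dwt(values)
    if len(bits) > len(detail):
        raise ValueError("more bits than detail coefficients")
    marked = list(detail)
    for i, bit in enumerate(bits):
        q = math.floor(detail[i] / step)
        if q % 2 != bit:
            q += 1
        marked[i] = q * step
    return haar_idwt(approx, marked)


def extract(values, n_bits, step=0.1):
    """Read the bits back from the parity of the quantised coefficients."""
    _, detail = haar_dwt(values)
    return [round(d / step) % 2 for d in detail[:n_bits]]


# Hypothetical numerical column and signature, for illustration only.
column = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0]
signature = [1, 0, 1, 1]

marked_column = embed(column, signature)
recovered = extract(marked_column, len(signature))

# A small "update attack": one value is changed slightly.
attacked_column = list(marked_column)
attacked_column[0] += 0.15
recovered_after_attack = extract(attacked_column, len(signature))
```

In this sketch each marked value moves by at most `step / sqrt(2)`, which is why such a mark can stay nearly invisible to a downstream model. The same property makes it fragile: a single updated value can shift a coefficient into a neighbouring quantisation cell and flip the corresponding bit. A larger `step` would tolerate bigger modifications at the cost of more distortion.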

Files

Research_paper.pdf
(pdf | 0.431 Mb)
License info not available