Generating labeled datasets for schema matching
Abstract
Schema matching is a fundamental task in data integration and semantic web applications. However, generating labeled data for schema matching is challenging and calls for an approach that is both efficient and effective. This thesis addresses this challenge by investigating schema matching techniques and crowdsourcing solutions. We developed Crowdie, a prototype crowdsourcing platform for schema matching. The platform uses a novel pre-filtering algorithm to reduce the number of candidate correspondences, improving the platform’s efficiency while minimizing the cost of crowdsourcing.
Additionally, we designed a simple yet effective task interface to ensure high-quality labeled data. Our findings demonstrate that crowdsourcing is a viable way to generate labeled data for schema matching. Overall, this work contributes to reducing the search space of candidate correspondences and to developing crowdsourcing solutions for schema matching.