Europeana is a digital library of Europe's cultural heritage, housing a large corpus of data representing artworks, literature, historical locations, and many other culturally significant items. Europeana currently relies on traditional text-matching retrieval, such as BM25, to facilitate search and discovery across millions of multilingual metadata-based records. However, these models are not capable of semantic understanding and require additional treatments to support multilingual retrieval, which costs Europeana resources. These treatments entail translating queries and data from other languages into English and enriching content by adding entities from linked open data. Europeana's current methodology is ultimately limited in its ability to provide semantically relevant multilingual search results.
This thesis investigates the application of Neural Information Retrieval (NIR) to enhance Europeana's search capabilities. The investigation aims to assess the impact of NIR on multilingual retrieval and retrieval performance, while also determining the value of the existing translation and enrichment processes. To support it, we contribute a structured and preprocessed dataset built specifically for NIR, as no such dataset previously existed. We conduct an extensive evaluation of NIR models, analyzing the impact of fine-tuning, query treatments, and document treatments on retrieval quality. Additionally, we assess the computational requirements, scalability, and practicality of deploying NIR, identifying trade-offs in retrieval efficiency and resource consumption, to give an indication of the infrastructure Europeana would need to implement NIR.
This research required meticulous planning across all stages, from data collection and formatting to model training and evaluation, since applying NIR at this scale for metadata search is new for Europeana. It therefore not only provides insights into the viability of NIR as a replacement for, or enhancement to, Europeana's existing search system, but also lays the foundation for future advancements in multilingual retrieval for Europeana.
In this thesis, we found that NIR models can offer promising improvements in multilingual retrieval and semantic search, reducing reliance on exact term matching. Our analysis suggests that not all of Europeana’s current preprocessing treatments are necessary for NIR models, as they inherently capture cross-lingual relationships more effectively than BM25, though the benefits vary depending on the model and configuration used. Overall, we suggest that a hybrid retrieval system that leverages both lexical and neural approaches may be the most practical solution for Europeana and that it warrants further exploration.
The integration of NIR presents several challenges, particularly in terms of infrastructure and evaluation. NIR models are sensitive to changes in document structure and content, requiring careful consideration of indexing and fine-tuning. Furthermore, while these models improve semantic search, they may struggle with entity-based queries, where BM25’s exact matching approach remains valuable.
A major limitation of this study was the absence of explicit relevance judgements in our dataset, which constrained our ability to draw definitive conclusions about retrieval effectiveness. Future work should prioritize the development of a comprehensive evaluation framework, incorporating expert and user-based relevance assessments, to enable a more robust analysis of NIR’s impact.