C. Lofi | TU Delft Repository

Relationships between geo-spatial features and COVID-19 hospitalisations revealed by machine learning models and SHAP values

Journal article (2024) - Lixia Chu, Jeroen Nelen, Alessandro Crivellari, Dainius Masiliūnas, Carola Hein, Christoph Lofi

Uncovering relationships between geospatial features and COVID-19 features is a comprehensive, confounding, cross-disciplinary and challenging topic, as the spread and effects of COVID-19 are related to many aspects of our lives, including socio-economic, cultural, and environmental features. Our research aims to provide an innovative data-driven method to uncover the relationships between heterogeneous, cross-disciplinary geospatial features and COVID-19 features at the municipality scale in Germany. We explore these relationships using supervised machine learning, explainable AI and spatial analysis in Germany from March 2020 to October 2021. First, we integrated multi-source data, including social, economic, cultural, air pollution and COVID-19 data, into one spatiotemporally harmonised dataset. Second, we trained three machine learning models (a Support Vector Regressor, a Random Forest, and a Light Gradient Boosting Machine) on the integrated dataset to learn the relationships between the spatial features and the COVID-19 features. Third, we used Shapley Additive exPlanations (SHAP) to rank the relevance of each feature. After that, we illustrated the results by visualising the spatial differences among municipalities. The output delivers key information regarding the Covid hospitalisation rate with the control of NO2 concentration and education level in Germany with transferable methods. ...
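
The feature-ranking step rests on Shapley values. The following is a minimal sketch of that idea only, not the authors' pipeline: it computes exact Shapley values for a toy three-feature function instead of a trained regressor, and it does not use the SHAP library. The feature names, the toy model and the baseline values are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x` relative to `baseline`.

    Features outside a coalition are replaced by their baseline value.
    """
    names = list(x)
    n = len(names)

    def value(coalition):
        inputs = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return model(inputs)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {name}) - value(s))
        phi[name] = total
    return phi


# Hypothetical toy model standing in for a trained regressor.
def toy_model(f):
    return 2.0 * f["no2"] - 1.0 * f["education"] + 0.5 * f["no2"] * f["income"]


x = {"no2": 3.0, "education": 2.0, "income": 4.0}
baseline = {"no2": 1.0, "education": 1.0, "income": 2.0}

phi = shapley_values(toy_model, x, baseline)
# Rank features by the magnitude of their contribution.
ranking = sorted(phi, key=lambda k: abs(phi[k]), reverse=True)
```

The contributions sum to the difference between the model output at `x` and at the baseline, which is the property that makes the ranking interpretable. Exact enumeration grows exponentially with the number of features, so practical tools rely on approximations or model-specific algorithms.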

Crowd's performance on temporal activity detection of musical instruments in polyphonic music

Conference paper (2023) - Ioannis Petros Samiotis, Christoph Lofi, Alessandro Bozzon

Musical instrument recognition enables applications such as instrument-based music search and audio manipulation, which are highly sought-after processes in everyday music consumption and production. Despite continuous progress, advances in automatic musical instrument recognition are hindered by the lack of large, diverse and publicly available annotated datasets. As studies have shown, there is potential to scale up music data annotation processes through crowdsourcing. However, the extent to which untrained crowdworkers can effectively detect when a musical instrument is active in an audio excerpt is still unclear. In this study, we explore the performance of non-experts on online crowdsourcing platforms in detecting the temporal activity of instruments in audio extracts of selected genres. We study the factors that can affect their performance, and we also analyse user characteristics that could predict it. Our results bring further insights into the general crowd's capabilities to detect instruments. ...

ImECGnet

Cardiovascular Disease Classification from Image-Based ECG Data Using a Multi-branch Convolutional Neural Network

Journal article (2023) - Amir Ghahremani, Christoph Lofi

Reliable Cardiovascular Disease (CVD) classification performed by a smart system can assist medical doctors in recognizing heart illnesses in patients more efficiently and effectively. Electrocardiogram (ECG) signals are an important diagnostic tool as they are already available early in the patients’ health diagnosis process and contain valuable indicators for various CVDs. Most ECG processing methods represent ECG data as a time series, often as a matrix with each row containing the measurements of a sensor lead, and/or as transforms of such time series, like wavelet power spectra. While methods processing such time-series data have been shown to work well in benchmarks, they are still highly dependent on factors like input noise and sequence length, and cannot always correlate lead data from different sensors well. In this paper, we propose to represent ECG signals incorporating all lead data plotted as a single image, an approach not yet explored in the literature. We will show that such an image representation combined with our newly proposed convolutional neural network specifically designed for CVD classification can overcome the aforementioned shortcomings. The proposed Convolutional Neural Network (CNN) is designed to extract features representing both the proportional relationships of different leads to each other and the characteristics of each lead separately. Empirical validation on the publicly available PTB, MIT-BIH, and St.-Petersburg benchmark databases shows that the proposed method outperforms time series-based state-of-the-art approaches, yielding classification accuracy of 97.91%, 99.62%, and 98.70%, respectively. ...

Scriptoria

A Crowd-powered Music Transcription System

Conference paper (2022) - Ioannis Petros Samiotis, Christoph Lofi, Shaad Alaka, Cynthia C. S. Liem, Alessandro Bozzon

In this demo we present Scriptoria, an online crowdsourcing system to tackle the complex transcription process of classical orchestral scores. The system’s requirements are based on experts’ feedback from classical orchestra members. The architecture enables an end-to-end transcription process (from PDF to MEI) using a scalable microtask design. Reliability, stability, task and UI design were also evaluated and improved through Focus Group Discussions. Finally, we gathered valuable comments on the transcription process itself alongside future additions that could greatly enhance current practices in their field. ...

Response of air quality to Covid-19 lockdown policies from Sentinel-5P TROPOMI sensor

Poster (2022) - Lixia Chu, Dainius Masiliunas, Alessandro Crivellari, Christoph Lofi

The outbreak of the coronavirus disease 19 (Covid-19) has posed a worldwide threat to human beings, economic activities, and society. Enforced lockdowns for limiting the spread of the Covid-19 virus also substantially reduce air pollutant emissions from vehicle traffic, industrial plants, etc. The lockdown restrictions have brought beneficial environmental implications, such as improvement of air quality. Previous studies recorded the reduction of air pollutants during the short-term lockdown in some cities and areas in India, China, and the U.S. [1-5]. However, some studies argue that the improvement of air quality is due not to lockdown but to seasonal influence or a coincidental temporary change [6]. Therefore, there is not enough evidence that the improvement of air quality is mainly due to reduced human activities. It is beneficial to answer this question by investigating and comparing the air pollution changes within countries with multiple pandemic waves and different lockdown measures. Our research chose Germany and the Netherlands to investigate the air pollutant changes during their multiple lockdowns. Both countries have gone through several pandemic waves, while their imposed strategies differ, ranging from lockdown light, partial lockdown and full lockdown to curfew in different stages of the pandemic. Our research investigates changes in air quality during their multiple pandemic waves and compares seasonal and monthly changes with the historical (pre-pandemic) records from ground stations to analyze the anomalies. For the pandemic period, the research will compare the disparities in air quality improvement across the several pandemic waves among mega urban agglomerations within the two countries. For the pre-pandemic period, this research analyzes the anomaly in comparison with the historical records of the air quality index.
In particular, we adopt the datasets produced by the space-borne air pollution sensor TROPOMI on the Sentinel-5P satellite, provided in the Google Earth Engine data catalog. We process the data and extract information about air pollution, including CO, NO2, SO2, O3, and CH4, for analyzing the air pollutant composition changes during the several pandemic waves. First, the decline values of air pollutant composition will be calculated and analyzed between the pandemic waves to prove the different changes following every wave in main urban areas within the two countries. Second, by aggregating the air pollutant concentrations from the satellite-based air pollution data into monthly, seasonal, and annual data and comparing them with corresponding historical records from ground stations at the same periods of pre-pandemic time, the anomalies will be calculated and analyzed to illustrate the improvement of air quality because of pandemic lockdowns at the country level. The historical record data will be collected from the air quality index based on the ground station measurements. Third, the disparities of air pollutant reduction during the pandemic will also be analyzed between the Netherlands and Germany, considering their different lockdown strategies. The result will provide strong evidence on the air quality improvement due to the reduction of human activities during lockdown periods and highlight the influence of anthropogenic activities on air pollution. The results will inform policymakers concerning emission control and sustainable urban development.
Keywords: Air quality changes, lockdowns, pre-pandemic, Google Earth Engine
References:
1. Parida, B.R., et al., Impact of COVID-19 induced lockdown on land surface temperature, aerosol, and urban heat in Europe and North America. Sustainable Cities and Society, 2021. 75: p. 103336.
2. Naqvi, H.R., et al., Improved air quality and associated mortalities in India under COVID-19 lockdown. Environmental Pollution, 2021. 268: p. 115691.
3. Berman, J.D. and K. Ebisu, Changes in U.S. air pollution during the COVID-19 pandemic. Science of The Total Environment, 2020. 739: p. 139864.
4. Sahani, N., S.K. Goswami, and A. Saha, The impact of COVID-19 induced lockdown on the changes of air quality and land surface temperature in Kolkata city, India. Spatial Information Research, 2021. 29(4): p. 519-534.
5. Li, L., et al., Air quality changes during the COVID-19 lockdown over the Yangtze River Delta Region: An insight into the impact of human activity pattern changes on air pollution variation. Science of The Total Environment, 2020. 732: p. 139282.
6. Etchie, T.O., et al., Season, not lockdown, improved air quality during COVID-19 State of Emergency in Nigeria. Science of The Total Environment, 2021. 768: p. 145187. ...
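
The anomaly calculation described in this abstract (aggregate to monthly values, then compare with the same months of a pre-pandemic period) can be sketched as follows. This is an illustrative outline under stated assumptions, not the poster's actual workflow: the function names are hypothetical, the numbers are made up, and the anomaly is expressed here as a percent change, which the abstract does not specify.

```python
from collections import defaultdict
from statistics import mean


def monthly_means(records):
    """Average (date 'YYYY-MM-DD', value) records per calendar month ('MM')."""
    buckets = defaultdict(list)
    for date, value in records:
        buckets[date[5:7]].append(value)
    return {month: mean(values) for month, values in buckets.items()}


def relative_anomaly(pandemic, baseline):
    """Percent change of pandemic monthly means against baseline monthly means."""
    return {
        month: 100.0 * (pandemic[month] - baseline[month]) / baseline[month]
        for month in pandemic
        if month in baseline
    }


# Made-up NO2 values for illustration only (arbitrary units).
pre_pandemic = [("2019-03-05", 40.0), ("2019-03-20", 60.0), ("2019-04-10", 50.0)]
lockdown = [("2020-03-07", 30.0), ("2020-03-21", 50.0), ("2020-04-12", 45.0)]

anomaly = relative_anomaly(monthly_means(lockdown), monthly_means(pre_pandemic))
```

Comparing like-for-like months is what separates a lockdown effect from ordinary seasonal variation, which is the objection raised in reference [6].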

An Analysis of Music Perception Skills on Crowdsourcing Platforms

Journal article (2022) - Ioannis Petros Samiotis, Sihang Qiu, Christoph Lofi, Jie Yang, Ujwal Gadiraju, Alessandro Bozzon

Music content annotation campaigns are common on paid crowdsourcing platforms. Crowd workers are expected to annotate complex music artifacts, a task often demanding specialized skills and expertise; selecting the right participants is thus crucial for campaign success. However, there is a general lack of deeper understanding of the distribution of musical skills, and especially auditory perception skills, in the worker population. To address this knowledge gap, we conducted a user study (N = 200) on Prolific and Amazon Mechanical Turk. We asked crowd workers to indicate their musical sophistication through a questionnaire and assessed their music perception skills through an audio-based skill test. The goal of this work is to better understand the extent to which crowd workers possess higher perception skills, beyond their own musical education level and self-reported abilities. Our study shows that untrained crowd workers can possess high perception skills on the music elements of melody, tuning, accent, and tempo; skills that can be useful in a plethora of annotation tasks in the music domain. ...

How can Explainability Methods be Used to Support Bug Identification in Computer Vision Models?

Conference paper (2022) - Agathe Balayn, Natasa Rikalo, Christoph Lofi, Jie Yang, Alessandro Bozzon

Deep learning models for image classification suffer from dangerous issues often discovered after deployment. The process of identifying bugs that cause these issues remains limited and understudied. In particular, explainability methods are often presented as obvious tools for bug identification. Yet, the current practice lacks an understanding of what kind of explanations can best support the different steps of the bug identification process, and how practitioners could interact with those explanations. Through a formative study and an iterative co-creation process, we build an interactive design probe providing various potentially relevant explainability functionalities, integrated into interfaces that allow for flexible workflows. Using the probe, we perform 18 user studies with a diverse set of machine learning practitioners. Two-thirds of the practitioners engage in successful bug identification. They use multiple types of explanations, e.g. visual and textual ones, through non-standardized sequences of interactions including queries and exploration. Our results highlight the need for interactive, guiding interfaces with diverse explanations, shedding light on future research directions. ...

Framework of visualising and analysing urban transformation features responding to Covid 19 pandemic

Poster (2022) - Lixia Chu, Jeroen Nelen, Lukas Höller, Hülya Lasch, Dirk Schubert, Carola Hein, Christoph Lofi

Long-term exposure to ambient air pollution is one of the main public health concerns worldwide. Exposure to air pollution is highly related to a range of diseases including respiratory and cardiovascular diseases, such as lung cancers, asthma, diabetes, irregular heartbeat, stroke and obesity [1-3]. The outbreak of the pathogenic agent of coronavirus disease 19 (Covid-19) has led to a large number of deaths worldwide, and previous studies have pointed out how long-term exposure to air pollution may have an impact on its high death rate [4]. Moreover, the hospitalization rate and infected population numbers are central indicators for lock-down policy-making, indicating whether the local medical system is able to handle the increasing infected population number through its available intensive care facilities. In fact, predicting hospitalization is vital for authorities and policymakers. We hereby hypothesize that high air pollutant concentrations lead to a rise in the hospitalization rate under the influence of Covid-19 outbreaks. We attempt to predict such hospitalization numbers for past data by means of a task-specific optimized machine learning model, to which we will later add social, economic, cultural, and other environmental features as part of an ongoing project. While such a prediction model cannot directly be used for predicting the future development of the pandemic, analysing it still gives valuable insights into the influence various environmental features had on it in the past. Air pollution is a mixture of a large number of chemical compounds such as CO2, CO, NOx, SO2, O3, heavy metals, and respirable particulate matter (PM2.5 and PM10); the main sources of such pollutants are identified as vehicle traffic, heating systems, and industrial plants [5]. Previous studies focused on the relationships between pandemic variables and air pollutant information.
Among all the air pollutants, NO2 and respirable particulate matter are highly related to the pandemic variables [6-8]. In our research, we extract the air pollutant information (CO, NO2, CH4, SO2) from the Sentinel-5P TROPOMI sensor, and integrate it with open-access data on Covid-19 features (mortality, infection rate, intensive care rate, etc.). The air pollutant data is processed from the Sentinel-5P data catalog provided in Google Earth Engine. We therefore aim to ascertain the relationships between hospitalization and air pollutant concentrations with the incidence of Covid-19. In particular, our ultimate research purpose is to develop a machine learning model to uncover the relationships between a mixture of features derived from air pollutants and Covid-19 related information, at municipality scales in Germany and the Netherlands. The relationships provide important clues for understanding how air pollution may affect the hospitalization rate and other features of Covid-19, through the evidence of potential low hospitalization or low mortality with better air quality. The output will deliver key information regarding public health effects and control of emission in Germany and the Netherlands. Specifically, on a temporal scale, we aggregated daily Covid-19 data and four air pollutant measures into weekly measures. On a spatial scale, the air pollutants were aggregated based on each municipality in Germany and the Netherlands to match the Covid-19 features. A selection of machine learning models was trained and evaluated on historical data (from March 2020 to October 2021), using features comprising weekly hospitalizations, death rate, infection rate, and tropospheric NO2, CO, SO2 and CH4 concentrations. In addition, a post-processing analysis using machine-learning explainability methodologies was carried out to mine potential relationships between hospitalization attributes and specific air pollution concentration features.
By processing municipalities as separate spatial entities, the results are intended to highlight hospitalization disparities and pollutants’ effect diversities among different geographic areas. By highlighting the relationships between air pollutant concentrations and incidence of Covid-19 with the hospitalization rate, and illustrating the hospitalization disparities among municipalities, our results provide key information regarding policymaking on urban emission control and public health at municipality level. When integrating other Covid-related features, our models could offer support to policymakers on effective lock-down decisions and health system management.
Keywords: Air pollutant, Covid-19, supervised machine learning models, Google Earth Engine.
References:
1. Bernstein, J.A., et al., Health effects of air pollution. Journal of Allergy and Clinical Immunology, 2004. 114(5): p. 1116-1123.
2. Brunekreef, B. and S.T. Holgate, Air pollution and health. The Lancet, 2002. 360(9341): p. 1233-1242.
3. Strak, M., et al., Long-term exposure to particulate matter, NO2 and the oxidative potential of particulates and diabetes prevalence in a large national health survey. Environment International, 2017. 108: p. 228-236.
4. Ogen, Y., Assessing nitrogen dioxide (NO2) levels as a contributing factor to coronavirus (COVID-19) fatality. Science of The Total Environment, 2020. 726: p. 138605.
5. Vineis, P., et al., Air pollution and risk of lung cancer in a prospective study in Europe. International Journal of Cancer, 2006. 119(1): p. 169-174.
6. Gautam, S., COVID-19: air pollution remains low as people stay at home. Air Quality, Atmosphere & Health, 2020. 13: p. 853-857.
7. Vîrghileanu, M., et al., Nitrogen Dioxide (NO2) Pollution monitoring with Sentinel-5P satellite imagery over Europe during the coronavirus pandemic outbreak. Remote Sensing, 2020. 12(21): p. 3575.
8. Omrani, H., et al., Spatio-temporal data on the air pollutant nitrogen dioxide derived from Sentinel satellite for France. Data in Brief, 2020. 28: p. 105089. ...
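
The temporal harmonisation mentioned in this abstract (daily values aggregated into weekly measures per municipality) might look like the sketch below. It is an assumption-laden illustration rather than the authors' code: the row layout, the use of ISO calendar weeks, the choice of the mean as the weekly statistic, and the sample values are all hypothetical.

```python
from collections import defaultdict
from datetime import date


def weekly_means(daily):
    """Aggregate (municipality, date, value) rows into ISO-week means.

    Returns a dict keyed by (municipality, iso_year, iso_week).
    """
    sums = defaultdict(lambda: [0.0, 0])
    for municipality, day, value in daily:
        iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        key = (municipality, iso[0], iso[1])
        sums[key][0] += value
        sums[key][1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Made-up daily values for two municipalities, for illustration only.
daily_rows = [
    ("Berlin", date(2020, 3, 2), 10.0),
    ("Berlin", date(2020, 3, 4), 20.0),
    ("Berlin", date(2020, 3, 9), 30.0),
    ("Hamburg", date(2020, 3, 3), 5.0),
]

weekly = weekly_means(daily_rows)
```

Keying on the ISO year as well as the week number keeps weeks from different years apart, which matters for a study period that spans 2020 and 2021.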

What do You Mean? Interpreting Image Classification with Crowdsourced Concept Extraction and Analysis

Conference paper (2021) - Agathe Balayn, Panagiotis Soilis, Christoph Lofi, Jie Yang, Alessandro Bozzon

Global interpretability is a vital requirement for image classification applications. Existing interpretability methods mainly explain a model's behavior by identifying salient image patches, which require manual effort from users to make sense of, and typically do not support model validation with questions that investigate multiple visual concepts. In this paper, we introduce a scalable human-in-the-loop approach for global interpretability. Salient image areas identified by local interpretability methods are annotated with semantic concepts, which are then aggregated into a tabular representation of images to facilitate automatic statistical analysis of model behavior. We show that this approach answers interpretability needs for both model validation and exploration, and provides semantically more diverse, informative, and relevant explanations while still allowing for scalable and cost-efficient execution. ...

Valentine: Evaluating Matching Techniques for Dataset Discovery

Conference paper (2021) - Christos Koutras, George Siachamis, Andra Ionescu, Kyriakos Psarakis, Jerry Brons, Marios Fragkoulis, Christoph Lofi, Angela Bonifati, Asterios Katsifodimos

Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. Traditionally, schema matching has been used to find matching pairs of columns between a source and a target schema. However, the use of schema matching in dataset discovery methods differs from its original use. Nowadays schema matching serves as a building block for indicating and ranking inter-dataset relationships. Surprisingly, although a discovery method’s success relies highly on the quality of the underlying matching algorithms, the latest discovery methods employ existing schema matching algorithms in an ad-hoc fashion due to the lack of openly-available datasets with ground truth, reference method implementations, and evaluation metrics. In this paper, we aim to rectify the problem of evaluating the effectiveness and efficiency of schema matching methods for the specific needs of dataset discovery. To this end, we propose Valentine, an extensible open-source experiment suite to execute and organize large-scale automated matching experiments on tabular data. Valentine includes implementations of seminal schema matching methods that we either implemented from scratch (due to absence of open source code) or imported from open repositories. The contributions of Valentine are: i) the definition of four schema matching scenarios as encountered in dataset discovery methods, ii) a principled dataset fabrication process tailored to the scope of dataset discovery methods and iii) the most comprehensive evaluation of schema matching techniques to date, offering insight on the strengths and weaknesses of existing techniques, that can serve as a guide for employing schema matching in future dataset discovery methods. ...
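
To make the notion of schema matching concrete, here is a small sketch that ranks column pairs between two tables by the Jaccard similarity of their value sets. It is a simple instance-based matcher written for illustration; it does not use Valentine's API and is not claimed to be one of the methods evaluated in the paper. The table contents, column names and function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_columns(source, target):
    """Rank (score, source_col, target_col) triples by value overlap."""
    scores = [
        (jaccard(s_vals, t_vals), s_col, t_col)
        for s_col, s_vals in source.items()
        for t_col, t_vals in target.items()
    ]
    return sorted(scores, key=lambda item: item[0], reverse=True)


# Two made-up tables, each given as a mapping of column name to values.
source = {"city": ["Delft", "Berlin", "Paris"], "pop": [100, 200, 300]}
target = {
    "town": ["Delft", "Berlin", "Paris", "Rome"],
    "inhabitants": [100, 200, 400],
}

ranked = match_columns(source, target)
```

A ranked list of candidate correspondences, rather than a single hard assignment, is the form of output that suits the ranking of inter-dataset relationships described above.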

Hybrid Annotation Systems for Music Transcription

Conference paper (2021) - Ioannis Petros Samiotis, Christoph Lofi, Alessandro Bozzon

Automated methods and human annotation are being extensively utilized to scale up modern classification systems. However, processes such as music transcription pose certain challenges due to the complexity of the domain and the expertise needed to read and process music scores. In this work, we examine how music transcription could benefit from systems that utilize hybrid annotation workflows, where automated methods are trained, evaluated, or have their output fixed by crowdworkers, using microtask designs. We argue that through careful task design utilizing microtask crowdsourcing principles, the general crowd can meaningfully contribute to such hybrid transcription systems. ...

Exploring the Music Perception Skills of Crowd Workers

Journal article (2021) - I.P. Samiotis, S. Qiu, C. Lofi, J. Yang, Ujwal Gadiraju, Alessandro Bozzon

Music content annotation campaigns are common on paid crowdsourcing platforms. Crowd workers are expected to annotate complicated music artefacts, which can demand certain skills and expertise. Traditional methods of participant selection are not designed to capture these kinds of domain-specific skills and expertise, and often domain-specific questions fall under the general demographics category. Despite the popularity of such tasks, there is a general lack of deeper understanding of the distribution of musical properties - especially auditory perception skills - among workers. To address this knowledge gap, we conducted a user study (N=100) on Prolific. We asked workers to indicate their musical sophistication through a questionnaire and assessed their music perception skills through an audio-based skill test. The goal of this work is to better understand the extent to which crowd workers possess higher perception skills, beyond their own musical education level and self-reported abilities. Our study shows that untrained crowd workers can possess high perception skills on the music elements of melody, tuning, accent and tempo; skills that can be useful in a plethora of annotation tasks in the music domain. ...

Managing bias and unfairness in data for decision support: a survey of machine learning and data engineering approaches to identify and mitigate bias and unfairness within data management and analytics systems

Journal article (2021) - A.M.A. Balayn, C. Lofi, G.J.P.M. Houben

The increasing use of data-driven decision support systems in industry and governments is accompanied by the discovery of a plethora of bias and unfairness issues in the outputs of these systems. Multiple computer science communities, and especially machine learning, have started to tackle this problem, often developing algorithmic solutions to mitigate biases to obtain fairer outputs. However, one of the core underlying causes for unfairness is bias in training data, which is not fully covered by such approaches. In particular, bias in data is not yet a central topic in data engineering and management research. We survey research on bias and unfairness in several computer science domains, distinguishing between data management publications and other domains. This covers the creation of fairness metrics, fairness identification and mitigation methods, software engineering approaches, and biases in crowdsourcing activities. We identify relevant research gaps and show which data management activities could be repurposed to handle biases and which ones might reinforce such biases. In the second part, we argue for a novel data-centered approach overcoming the limitations of current algorithmic-centered methods. This approach focuses on eliciting and enforcing fairness requirements and constraints on data that systems are trained, validated, and used on. We argue for the need to extend database management systems to handle such constraints and mitigation methods. We discuss the associated future research directions regarding algorithms, formalization, modelling, users, and systems. ...

LOREM

Language-consistent Open Relation Extraction from Unstructured Text

Conference paper (2020) - Tom Harting, Sepideh Mesbah, Christoph Lofi

We introduce a Language-consistent multi-lingual Open Relation Extraction Model (LOREM) for finding relation tuples of any type between entities in unstructured texts. LOREM does not rely on language-specific knowledge or external NLP tools such as translators or PoS-taggers, and exploits information and structures that are consistent over different languages. This allows our model to be easily extended to new languages with only limited training effort, and also provides a boost to performance for a given single language. An extensive evaluation performed on 5 languages shows that LOREM outperforms state-of-the-art mono-lingual and cross-lingual open relation extractors. Moreover, experiments on languages with little or no training data indicate that LOREM generalizes to languages other than those it is trained on. ...

Microtask crowdsourcing for music score Transcriptions: an experiment with error detection

Conference paper (2020) - I.P. Samiotis, S. Qiu, A. Mauri, C.C.S. Liem, C. Lofi, A. Bozzon

Human annotation is still an essential part of modern transcription workflows for digitizing music scores, either as a standalone approach where a single expert annotator transcribes a complete score, or for supporting an automated Optical Music Recognition (OMR) system. Research on human computation has shown the effectiveness of crowdsourcing for scaling out human work by defining a large number of microtasks which can easily be distributed and executed. However, microtask design for music transcription is a research area that remains unaddressed. This paper focuses on the design of a crowdsourcing task to detect errors in a score transcription which can be deployed in either automated or human-driven transcription workflows. We conduct an experiment where we study two design parameters: 1) the size of the score to be annotated and 2) the modality in which it is presented in the user interface. We analyze the performance and reliability of non-specialised crowdworkers on Amazon Mechanical Turk with respect to these design parameters, differentiated by worker experience and types of transcription errors. Results are encouraging and pave the way for scalable and efficient crowd-assisted music transcription systems. ...

REMA

Graph embeddings-based relational schema matching

Abstract (2020) - Christos Koutras, Marios Fragkoulis, Asterios Katsifodimos, Christoph Lofi

Schema matching is the process of capturing correspondences between attributes of different datasets, and it is one of the most important prerequisite steps for analyzing heterogeneous data collections. State-of-the-art schema matching algorithms that use simple schema- or instance-based similarity measures struggle with finding matches beyond the trivial cases. Semantics-based algorithms require the use of domain-specific knowledge encoded in a knowledge graph or an ontology. As a result, schema matching still remains a largely manual process, which is performed by few domain experts. In this paper we present the Relational Embeddings MAtcher, or rema for short. rema is a novel schema matching approach which captures semantic similarity of attributes using relational embeddings: a technique which embeds database rows, columns and schema information into multidimensional vectors that can reveal semantic similarity. This paper aims at communicating our latest findings, and at demonstrating rema's potential with a preliminary experimental evaluation. ...

Evaluating Neural Text Simplification in the Medical Domain

Conference paper (2019) - Laurens van den Bercken, Robert-Jan Sips, Christoph Lofi

Health literacy, i.e. the ability to read and understand medical text, is a relevant component of public health. Unfortunately, many medical texts are hard for the general population to grasp, as they are targeted at highly-skilled professionals and use complex language and domain-specific terms. Here, automatic text simplification that makes text commonly understandable would be very beneficial. However, research and development into medical text simplification is hindered by the lack of openly available training and test corpora which contain complex medical sentences and their aligned simplified versions. In this paper, we introduce such a dataset to aid medical text simplification research. The dataset is created by filtering aligned health sentences using expert knowledge from an existing aligned corpus and a novel, simple, language-independent monolingual text alignment method. Furthermore, we use the dataset to train a state-of-the-art neural machine translation model, and compare it to a model trained on a general simplification dataset using an automatic evaluation and an extensive human-expert evaluation. ...

Perceptual relational attributes

Navigating and discovering shared perspectives from user-generated reviews

Conference paper (2019) - Manuel Valle Torre, Mengmeng Ye, Christoph Lofi

Effectively modelling and querying experience items like movies, books, or games in databases is challenging because these items are better described by their resulting user experience or perceived properties than by factual attributes. However, such information is often subjective, disputed, or unclear. Thus, social judgments like comments, reviews, discussions, or ratings have become a ubiquitous component of most Web applications dealing with such items, especially in the e-commerce domain. However, they usually do not play a major role in the query process, and are typically just shown to the user. In this paper, we will discuss how to use unstructured user reviews to build a structured semantic representation of database items such that these perceptual attributes are (at least implicitly) represented and usable for navigational queries. In particular, we argue that a central challenge when extracting perceptual attributes from social judgments is respecting the subjectivity of expressed opinions. We claim that no representation consisting of only a single tuple will be sufficient. Instead, such systems should aim at discovering shared perspectives, representing dominant perceptions and opinions, and exploiting those perspectives for query processing. ...

Coner

A Collaborative Approach for Long-Tail Named Entity Recognition in Scientific Publications

Conference paper (2019) - Daniel Vliegenthart, Sepideh Mesbah, Christoph Lofi, Akiko Aizawa, Alessandro Bozzon

Named Entity Recognition (NER) for rare long-tail entities, such as those often found in domain-specific scientific publications, is a challenging task, as the extensive training and test data needed for fine-tuning NER algorithms are typically lacking. Recent approaches presented promising solutions relying on training NER algorithms in an iterative weakly-supervised fashion, thus limiting human interaction to only providing a small set of seed terms. Such approaches heavily rely on heuristics in order to cope with the limited training data size. As these heuristics are prone to failure, the overall achievable performance is limited. In this paper, we therefore introduce a collaborative approach which incrementally incorporates human feedback on the relevance of extracted entities into the training cycle of such iterative NER algorithms. This approach, called Coner, still allows training new domain-specific rare long-tail NER extractors at low cost, but with ever-increasing performance while the algorithm is actively used in an application. ...

Training Data Augmentation for Detecting Adverse Drug Reactions in User-Generated Content

Conference paper (2019) - Sepideh Mesbah, Jie Yang, Robert-Jan Sips, Manuel Valle Torre, Christoph Lofi, Alessandro Bozzon, Geert-Jan Houben

Social media provides a timely yet challenging data source for adverse drug reaction (ADR) detection. Existing dictionary-based, semi-supervised learning approaches are intrinsically limited by the coverage and maintainability of laymen health vocabularies. In this paper, we introduce a data augmentation approach that leverages variational autoencoders to learn high-quality data distributions from a large unlabeled dataset, and subsequently, to automatically generate a large labeled training set from a small set of labeled samples. This allows for efficient social-media ADR detection with low training and re-training costs to adapt to the changes and emergence of informal medical laymen terms. An extensive evaluation performed on Twitter and Reddit data shows that our approach matches the performance of fully-supervised approaches while requiring only 25% of training data. ...