Multi-Source Data Modelling to Understand the Effects of Tourism Demand on Air Quality in Italy

Abstract

The goal of this research is to model and understand the effects of tourism demand on air quality by integrating multi-source data. It is aimed at researchers and practitioners conducting multidisciplinary research in data science and geoscience, and presents the methods and challenges that arise in such an analysis. A data processing pipeline describes the research from a data integration perspective, covering the data retrieval and pre-processing tasks. This pipeline enables the construction of datasets for machine learning modelling and prediction of air pollutant levels based on tourism data. The study area is Italy, chosen for its significant tourism industry and the wide availability of data on tourism development. For this study, satellite air quality data sampled using Google Earth Engine (GEE) around accommodation, transportation and tourism attraction locations is modelled together with tourist arrival numbers, nights spent and average length of stay. Long short-term memory (LSTM) multivariate time series modelling is then performed to understand the predictability of air quality at the national and regional level. To this end, this research examines three stages of modelling tourism together with air quality: (i) retrieving accommodation, transportation and tourism attraction locations using the RDF model, (ii) identifying which pollutants are correlated with, and Granger-caused by, the different tourism demand features, using satellite air quality data sampled at the identified tourism locations, and (iii) understanding the performance characteristics of LSTM time series models trained on tourism demand and air quality data. Correlation analysis indicates potential for modelling the relation between tourism demand indicators and PM2.5 in regions where PM2.5 levels are lower overall.
In these regions, Granger-causality testing suggests that PM2.5 time series are more likely to be predictable from the previous month's tourism demand data. Training an LSTM model on this lagged relationship suggests that regions with high overall PM2.5 levels are challenging to model, as shown by high root mean square error (RMSE) scores. LSTM models for these regions also required more training epochs than those for cleaner regions to model the effects of tourism demand on air quality.
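
As a rough illustration of the one-month lagged relationship described above, the following Python sketch pairs a tourism demand series with the following month's PM2.5 values, computes their correlation, reshapes the data into look-back windows of the kind an LSTM typically takes as input, and scores a naive baseline with RMSE. It is a minimal sketch under stated assumptions, not the pipeline used in this research: the monthly values are synthetic, the function names (`lagged_pairs`, `make_windows`, `pearson`, `rmse`) are hypothetical, and a plain Pearson correlation stands in for the Granger-causality test.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def lagged_pairs(feature, target, lag=1):
    """Pair feature[t - lag] with target[t] for every month t that has both."""
    return feature[:len(feature) - lag], target[lag:]


def make_windows(rows, target, lookback=1):
    """Build (samples, timesteps, features) inputs and next-step targets."""
    X, y = [], []
    for t in range(lookback, len(rows)):
        X.append(rows[t - lookback:t])  # the previous `lookback` months
        y.append(target[t])             # the value to predict for month t
    return X, y


def rmse(y_true, y_pred):
    """Root mean square error between observed and predicted values."""
    squared = [(a - b) ** 2 for a, b in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Synthetic monthly values, invented for illustration only (assumption).
arrivals = [10.0, 12.0, 15.0, 30.0, 45.0, 60.0, 80.0, 85.0, 50.0, 25.0, 12.0, 11.0]
# PM2.5 is constructed here to depend on the previous month's arrivals.
pm25 = [6.0] + [5.0 + 0.1 * a for a in arrivals[:-1]]

# Correlation between last month's arrivals and this month's PM2.5.
x_lag, y_now = lagged_pairs(arrivals, pm25, lag=1)
lag1_corr = pearson(x_lag, y_now)

# One-month look-back windows over [arrivals, PM2.5] rows.
rows = [[a, p] for a, p in zip(arrivals, pm25)]
X, y = make_windows(rows, pm25, lookback=1)

# Persistence baseline: predict this month's PM2.5 as last month's value.
baseline_rmse = rmse(pm25[1:], pm25[:-1])
```

A trained model's RMSE can be compared against `baseline_rmse` to judge whether the lagged tourism features add predictive value for a given region.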