Urban Air Quality Modeling Using Low-Cost Sensor Network and Data Assimilation in the Aburra Valley, Colombia

13 The use of low air quality networks has been increasing in recent years to study urban pollution 14 dynamics. Here we show the evaluation of the operational Aburrá Valley’s low-cost network 15 against the official monitoring network. The results show that the PM2.5 low-cost measurements 16 are very close to those observed by the official network. Additionally, the low-cost allows a 17 higher spatial representation of the concentrations across the valley. We integrate low-cost ob18 servations with the chemical transport model LOTOS-EUROS using data assimilation. Two dif19 ferent configurations of the low-cost networkwere assimilated: using thewhole low-cost network 20 (255 sensors), and a high-quality using just the sensors with a correlation factor greater than 0.8 21 with respect to the official network (115 sensors). The official stations were also assimilated to 22 compare the more dense low-cost network’s impact on the model performance. Both simulations 23 assimilating the low-cost model outperform the model without assimilation and assimilating the 24 official network. The model capability to predict high concentration events’ warnings is also 25 improved by assimilating the low-cost network with respect to the other simulations. Finally, the 26 simulation using the high-quality configuration has lower error values than using the complete 27 low-cost network, showing that it is essential to consider the quality and location and not just the 28 total number of sensors. Our results suggest that with the current advance in low-cost sensors, 29 it is possible to improve model performance with low-cost network data assimilation. 30


Introduction
Particulate matter (PM) is one of the most problematic pollutants in urban air [1]. The effects of PM on human health, associated especially with PM of ≤2.5 µm in diameter, include asthma, lung cancer and cardiovascular disease [2]. Consequently, major urban centers commonly monitor PM 2.5 as part of their air quality management strategies. There are several techniques to measure the PM concentration in the air, such as filter-based gravimetric method (GMM), β-attenuation absorption method (BAM), and laser-based optical method [3]. The filter-based GMM is the most accurate method but requires parallel processing of blank and sample filter performed in laboratory conditions, and thus is difficult for field application [4]. The BAM PM analyzers employ the absorption of beta radiation by solid particles extracted from air flow [3]. These analyzers have gone through strict field evaluations and demonstrated that they could provide concentration measures

Materials and Methods
The period of interest for all data evaluations, simulations and data assimilation experiments spans from 25 February to 15 March 2019. During these days, the PM concentrations were higher than usual due to the Northbound transit of the Inter-Tropical Convergence Zone.

Hyper-Dense Low-Cost Sensor Network
In Medellín and its greater metropolitan area inside the Aburrá Valley, the Sistema de Alerta Temprana del Valle de Aburrá (SIATA) project operates the official high-end air quality monitoring network (henceforth official network), and a hyper-dense, low-cost air quality network developed within the Citizen Scientist program (henceforth low-cost network).
The official network provides high quality measurements for different pollutants in the atmosphere over the Aburrá Valley such as O 3 , SO 2 , PM 10 , PM 2.5 and PM 1 . The official network is distributed among the 10 municipalities of the valley, with the majority of the stations located within the city of Medellín ( Figure 1, panel a). The PM measurement equipment consists of Met One Instruments BAM-1020 and BAM-1022 that produce averaged hourly data [16]. The low-cost network was created with the aim of engaging the community in issues surrounding air quality, and as an extension of the official network. As of writing, the low-cost network consists of 255 real-time PM 2.5 sensors across the Aburrá Valley and its surrounding hills. The sensors are located in the premises of private homes and public or private institutions ( Figure 1, panel b). The measuring equipment was developed by SIATA based on the well-known laser-based optical Shinyei PPD42NS, NOVA SDS011, and Bjhike HK-A5 sensors [16]. The description of the network deployment is presented in [16]. Data were downloaded from SIATA's data portal (available at https://siata.gov.co/descarga_siata/index.php/index2/. Last accessed, December 2020). Data from the official network for the corresponding dates were used for validation of both the low-cost network and the model simulations before and after data assimilation. Each station from the official network served as a reference point for all low-cost network sensors within a 2-km radius of the former. Performance of the latter was evaluated using as metrics the Mean Fractional Bias (MFB), the Root Mean Square Error (RMSE) and the Pearson correlation coefficient (R) [20][21][22]. When a low-cost sensor had more than one official station within a 2-km radius, the average value of the official measurements was used.   [23] is a chemical transport model that simulates concentrations of gasses and aerosols in the lower troposphere on a 3D grid. The simulated species include ozone, nitrogen oxides, volatile organic compounds, secondary inorganic aerosols, dust, and sea-salt [24]. The dynamics are regulated by processes such as chemical reactions, diffusion, drag, dry and wet deposition, emissions and advection [25].
Simulations were conducted using a one-way nested domain configuration as shown in Figure 2 and detailed in Table 1. The innermost domain (D4), the focus of the present study, covered the Aburrá Valley with a model resolution of 0.01°(about 1 × 1 km) as shown in Figure 1. The anthropogenic emissions input for D4 was updated with a highresolution local emissions inventory constructed as described in Section 2.2.2. The model setup is summarized in Table 2 (for details, see [18]).  Table 1.

Local Emissions Inventory
An anthropogenic urban emissions inventory for 2016 specific to Medellín and the other nine municipalities of the Aburrá Valley was used for the simulations on the D4 domain. This inventory provides a complete set of emitted trace gases such as carbon monoxide (CO), nitrogen oxides (NO x ), sulphur oxides (SO x ), and volatile organic compounds (VOCs), as well as particulate matter with diameter less than 2.5 µm (PM 2.5 ) or less than 10 µm (PM 10 ). The construction of the inventory followed a bottom-up methodology, combining activity data (traffic intensities, industrial production) with emission factors. Only traffic and industrial point sources were considered, without accounting for neither household nor commercial emissions [26].
For integration into LOTOS-EUROS, the emission inventory was disaggregated over the Aburrá Valley (76°W-75°W and 5.7°N-6.8°N) at a resolution of 0.01°× 0.01°(approximately 1 × 1 km), using a method based on road density as in [27]. The road network map was obtained from the OpenStreetMap database [28], and simplified by removing residential road segments as recommended in [29,30]. The simplification of the road network can reduce errors in the spatial disaggregation since residential roads correspond to a high portion of the road network length but carry a low percentage of total vehicular traffic. For each grid cell j, the corresponding disaggregation factor DF was calculated as in [27]. Namely, where S i,j is the length of road segment i in the grid cell j, I is the number of road segments in cell j, and J is the total number of grid cells. The point-source emissions were distributed on the grid using their known location, obtained from the official emissions inventory [26]. Figure 3 shows the resulting emissions maps for PM 2.5 and PM 10 .

Ensemble Kalman Filter
The Ensemble Kalman Filter (EnKF) is a Monte Carlo ensemble method, based on the approximation of the state probability density through an ensemble of model realizations [15]. The EnKF is initialized by generating a random ensemble of model states that represents the model's uncertainty: Since emissions are a major source of uncertainty in air quality modelling, we generated the ensembles from perturbations in the emissions. Each ensemble member was propagated in time by the model M to obtain a forecast ensemble: where ξ is the i-th member of the forecast ensemble at time k. The forecast ensemble describes a stochastic distribution with mean and covariance available from: with N being the number of ensemble members. The matrix L is formed by deviations of the ensemble members from the mean: Most of the data assimilation applications do not calculate the matrix P f directly due to its large size. Instead, a consistent square root formulation that only uses and stores L f is computed [31] in the operational code. The EnKF uses observations y k to update the forecast ensemble into a corrected or analysis ensemble. Observations collected in a vector y k are represented as a linear mapping from the state vector plus an observation representation error: The observation operator H maps the state into the observations. In this application, H selects the concentration in locations where the observations are available. The representation error v k describes the difference between observation and simulation due to both instrument and sampling errors. v k is defined as a Gaussian noise with mean 0 and standard deviation depending on the measurement instrument. The analysis ensemble members are calculated as follows: with The EnKF system in this application was configured to obtain estimates of both concentrations and emissions. An augmented state vector was used combining the PM 2.5 concentrations (c), propagated in time by LOTOS-EUROS, and the emission correction factors (δe), propagated in time by a colored noise model [32]: where M LE is the LOTOS-EUROS model, τ and σ are the correlation length and variance of the stochastic process, and w k is a standard white noise sample. The emissions (ê) were calculated as:ê where e represents the nominal emissions from the emissions inventory. For all the simulations we used a τ of 1 day and a σ of 0.5 following previous results [18]. Additionally, we used a covariance localization scheme to reduce spurious correlations among distant states. The covariance localization technique artificially reduces the covariance between states that are separated by longer distances than a threshold radius ρ [33,34]. The parameter ρ defines the area of influence of a given observation on the concentrations and emissions to be estimated. We defined a localization radius ρ = 5 km for all the simulations. We used an ensemble of N = 25 members. Additional experiments with larger ensembles were performed without improvements in performance (not shown). Two sets of low-cost sensors data were assembled: The first one included 255 sensors from the low-cost network that had a station from the official network within a 2-km radius. The second, higher quality one consisted of a subset of the previous set, including only those sensors whose data showed an R value equal or greater than 0.8 when evaluated against the official network.
a simulation with assimilation of data (observations) from the 14 stations of the official network (henceforth LE-official); 3.
a simulation with assimilation of the data from the entire low-cost network (henceforth LE-lowcost) 4.
a simulation with assimilation only of high-quality data from the low-cost network (henceforth LE-lowcost-HQ).
The seven stations from the official network that were not used for data assimilation were used as validation stations for all simulations.

Forecast Experiments
Data assimilation can improve forecast performance mainly for two reasons: First, if the simulation is initialized with an assimilated field value, initial conditions at the start of the forecast window may be a closer representation of reality than what the model alone may provide; second, the emission correction factors that were included in the assimilation state (10) can be applied to the model during the forecast window to adjust the emissions in the same direction as during assimilation.
Forecasting experiments were conducted to evaluate the capabilities of the model with data assimilation to forecast PM concentrations in the valley up to three days. Simulations were carried out as above, with the assimilation schedule illustrated in Figure 4. Data assimilation was conducted up to the indicated date, with the three subsequent days representing the forecast window. The forecasting window started at 00:00 hours of the first day after the end of data assimilation. To bring the information obtained in the assimilation window into the forecast window, we used the hourly profile of the correction factor calculated from the last 24 h of data assimilation. The experiments continued until all days between 9 March and 13 March (inclusive) had predictions as the first, second and third day of the forecast. The performance of the forecast was evaluated by calculating the Air Quality Index (AQI) according to the ranges established by the Metropolitan Area (available in Spanish https://www.metropol.gov.co/ambiental/calidad-del-aire/Documents/POECA/ Plan_de_Acci%C3%B3n_POECA_Metropolitano_2019.pdf. Last accessed, October 2020.) illustrated in Table 3, and comparing it to the AQI observed for the corresponding day. The comparison against the AQI rather than against plain PM concentrations facilitates the interpretation of the model forecast performance by decision makers and the general public.
Additionally, this representation affords evaluating the ability of the model to predict warning-triggering episodes (AQI in orange, red or purple levels). Forecasts missing warning-triggering episodes (false negatives) are especially problematic in air quality management because the resulting inaction can lead to human exposure to dangerous concentrations of pollutants.

Evaluation with Low-Cost Sensor Network
The performance of 145 sensors from the low-cost network was evaluated against data from the official network. The remaining 110 sensors did not have an official monitoring station within a 2-km radius. Figure 5 shows the histograms of the MFB, RMSE and R, and the geographical distribution of the performance values. For the majority (67%) of the low-cost sensors an MFB between −0.25 and 0.25 was obtained, with an average of about 0.2. The average RMSE was close to 8 µg/m 3 , with most sensors presenting values below 15 µg/m 3 . The majority (88%) of the sensors showed correlations with R values above 0.7. Observed errors fell within acceptable ranges (as in [21,22]). Zonal differences in measurement errors were observed. Locations in the South-central part of the city of Medellín (green ellipse on Figure 5d-f) contained most of the sensors with a R values lower than 0.5 and RMSE values grater than 15 µg/m 3 . Those sensors were located in a dense urban area, while the closest monitoring station is located in the outskirts of the city. Figure 6 shows the spatial distribution of the complete low-cost network and the subset of 115 low-cost sensors with the highest quality data (as defined in Section 2.3). The selection of the low-cost high quality stations was based on the results shown in Figure 5b,e.    Figure 6. Spatial distribution of the different sets of sensors (complete network (a) and high-quality network (b) ) used for data assimilation and validation. Blue dots indicate the location of the low-cost sensors. Red squares correspond to the locations of the official monitoring stations that were used for data assimilation. Green stars indicate the stations from the official network whose data where used for validation of all model simulations.

Evaluation of Data Assimilation Runs
The concentration fields generated by the model simulations with or without data assimilation were compared to the observations from seven of the official monitoring stations (validation stations, green stars in Figure 6) to evaluate the performance of the data assimilation schemes. Figure 7 shows the temporal series for the simulated and observed PM 2.5 concentrations at four of the validation stations. The four selected stations represent downtown Medellín (station 25), residential areas (station 86), areas with high vehicular flow (station 88), and a peri-urban area in the outskirts of the city (station 85). Those stations summarize the behavior of all seven validation stations. The LE simulation consistently underestimated the concentrations observed at stations 85 and 88. At stations 25 and 86, the LE simulation results were close in magnitude between 24 February and 3 March and 10 March to 15 March; between 3 March and 10 March, the model presented values much lower than those observed. The day-to-day variability was reduced for this same period, as seen in stations 85 and 86. This inconsistent behavior suggests a poor representation of the meteorological dynamics that governed the dispersion and accumulation of PM 2.5 within the valley. Simulations using data assimilation showed noisier behaviors than the LE simulation. This process is commonly observed when applying the EnKF and obeys the stochastic nature and the handling of uncertainty inherent to the method [15]. However, those simulations managed to correct the large discrepancies present in the LE simulation. Both LE-official, LE-lowcost, and LE-lowcost-HQ represented more accurately the day-today variability of the observations than LE. In general terms, there was no evidence of a sizeable and persistent difference among the simulations with data assimilation throughout the entire period. A spin-up period consisting of the 5 days prior to the start date was used for each simulation. Figure 8 shows the diurnal cycles during the simulation period in the four selected validations stations. The diurnal cycle of the LE simulation differed from the observations in both magnitude and temporal behavior. The highest concentration peak that appeared around 09:00 in all the stations is mainly due to traffic dynamics. In stations 25 and 88, the LE morning peak corresponded in time but not in magnitude with the observations; in stations 85 and 86, said peak appeared later in the simulations than in the observations. This time lag suggests a poor spatial representation of mobile emissions by the emissions inventory; or a deficiency it the wind fields in reproducing the valley dynamics, showing a late transport of the particulate material to these areas. The LE simulation did not capture the evening peak shown by the observations around 21:00 hours. The simulations using data assimilation presented diurnal cycles closer to the observations than did the LE simulation. The LE-official simulation captured the time and magnitude of the morning peak in stations 85 and 86. In station 88, LE-official corrected the time lag in the morning peak seen in LE, and improved the estimated magnitudes albeit still falling short of the observed values. A different behavior was seen for station 25, where LE-official had low diurnal variability, with a slight underestimation in the morning, and an overestimation in the afternoon. The LE-lowcost and LE-lowcost-HQ simulations results resembled closely the diurnal behavior of the observations, especially the temporal component. In all the stations, both the morning and the evening peaks matched the observations. The observed concentrations for stations 25 and 88 fell inside the standard deviation range for the LElowcost simulation; the same simulation overestimated the concentrations between 11:00 and 19:00 for station 85, and underestimated the concentrations between 01:00 and 13:00 for station 86. The LE-lowcost-HQ simulation results were overall the closest to observations.  The averaged evaluation statistics among all the validation station are shown in Table 4. The simulation results without data assimilation (LE) underestimated the observed concentrations in all the validation stations. This was also seen in previous related works [18,35]. The RMSE value reflected a low correspondence between the observed and simulated concentrations when using the model without data assimilation. The correlation coefficient was low, meaning that the model was not able to capture the variations in diurnal and day-to-day concentrations. In contrast, the three simulations using data assimilation had MFB values close to 0, without a significant difference among them. The data assimilation was thus effective in reducing the difference between the model and reality. The RMSE also decreased when using data assimilation, by 24.4% in the LE-official, 32.8% in the LE-lowcost, and 36.2% in the LE-lowcost-HQ simulations relative to the RMSE of the LE simulation. The R values were all above the criteria of good performance according with [36] Table 2, and based in [21,37]. Assimilation of either data set from the low-cost network resulted in improved error statistics when compared to the LE-official simulation.  Figure 9 shows a graphical evaluation of the model forecasts for 12 March as Day 1, 2 or 3 within the forecasting window. Forecasts for all other days within the forecasting experiment behaved similarly. The observed AQIs and the values for the LE simulation are the same in all the graphs since all graphs illustrate the same calendar day (12 March). Similar to the results shown in Section 3.2, the LE simulation underestimated PM 2.5 concentrations throughout the valley, yielding in most cases AQI lower than those reported. The AQI forecasts of the three simulations with data assimilation were consistently more accurate than the estimates from the simulation without assimilation (LE). There were no significant differences in performance among the three data assimilation simulations through the three forecast days. The forecast accuracy decreased as the forecasting window advanced, as could be expected from the uncertainty inherent in the meteorological fields and nominal emission factors. All three simulations with data assimilation had similar spatial behavior, with a tendency to underestimate the AQI in the Northern and Eastern areas of the valley.

Evaluation of Forecasts
For public information on air quality, it is essential that a forecast correctly warns of a critical pollution event. Figure 10 shows the confusion matrix for LE-official, LElowcost, and LE-lowcost-HQ simulations in the data assimilation and forecast windows. The confusion matrix summarizes the percentage of true negatives, true positives, false negatives, and false positives [38]. Data assimilation evaluation was performed just in the seven validation stations shown in Figure 6. The LE simulation did not offer a warning in any station in the assimilation nor in the forecast window; for that reason, its confusion matrix is not presented. In the assimilation window, data assimilation simulations have a percentage of true negatives and true positives higher than 80%, and even higher than 90% in the case of the LE-lowcost-HQ. Both simulations using the low-cost network showed lower false negative values than LE-official. The LE-lowcost-HQ had the highest accuracy in reproducing the warning-triggering events within the data assimilation window. The accuracy of the three simulations was lower in the forecast window than in the assimilation window. The small percentage of false positives and high percentage of false negatives suggests that even using the estimated emissions inventory, the simulations continue to underestimate the observations. As observed within the data assimilation window, the two simulations assimilating data from the low-cost network (LE-lowcost and LE-lowcost-HQ) had better warning forecast performance than the LE-official simulation.   Table 3.
(a) Data assimilation window

Discussion
The results presented herein support the feasibility of using networks of low-cost measuring stations for monitoring air quality in cities such as Medellín. The high spatial density of the low-cost network allowed for a much higher spatial resolution than that attained with the official network. The errors in the low-cost sensors located within the green ellipse in Figure 5d-f represented spatial outliers. The increased errors observed in this sector of the valley may be attributed to specific factors such as maintenance, characteristics of the infrastructure in which the sensors are located, differences in elevation relative to the official station against which they were evaluated, or particular meteorological conditions within the subregion of the valley that may yield local heterogeneity in PM concentrations. Said green ellipse corresponds to a transition zone between peri-urban terrain and an expanding horizon of high-density residential buildings. The low-cost sensors are located in buildings, while the official monitoring station is located in a school surrounded by forests. This may explain the apparent overestimation of the PM levels by the low-cost sensors and the low correlation values of their data.
Our results displayed low correlation values and a marked tendency to underestimate the observed concentrations by the LOTOS-EUROS model without assimilation. Similar behaviors were observed in previous works [18,35]. In [35] the WRF-Chem model in a sub-kilometer configuration was used to reproduce the CO dynamics in the valley. The emission inventory was obtained from the AMVA Official Emission Inventory [26] following a methodology similar to the presented in Section 2.2.2. Although the meteorological fields showed a high similarity with observations, the model underestimated the CO con-centrations. The underestimation in both cases is attributed to mismatches in the official emissions inventory and uncertainties generated by the simplifications of disaggregation methodologies, such as the removal of residential road segments. Additionally, the official emissions inventory does not include commercial sources, biogenic sources, and resuspended dust, which can constitute a considerable part of particulate emissions [39]. However, data assimilation notably improved the ability of LOTOS-EUROS to represent the magnitude and dynamics of PM 2.5 within the Aburrá Valley. The assimilation of data from the low-cost network improved the correlation between the observed and the simulated concentrations to a greater extent than when data from the sparse official network was assimilated, both in terms of the RMSE and the R values. The errors left in the simulated concentrations after the assimilation of the low-cost network were within the performance goals for PM 2.5 representation formulated in [21,22,37,40]. The uncertainty present in the model caused the percentage of predicted warning-triggering events related to high concentration of PM 2.5 , to decrease to almost one half of the events observed within the forecasting window ( Figure 10). Our results highlight the persistent need to improve the inventories of emissions, the meteorological data used as inputs, and to reduce other sources of uncertainty in the model in order to increase forecasting capacity. Nevertheless, the model's forecasting capacity was increased when observations were assimilated. The greater spatial coverage of the low-cost network contributed significantly to the improvements against the simulations assimilating data from the official network. The higher density of observations also allowed estimating emissions in more detail, as seen in Figure 8. The more detailed emission estimations in turn allowed a better reproduction of the concentrations in the forecast window even in the absence of data assimilation.
Although the LE-lowcost simulation used more observations than the LE-lowcost-HQ simulation (255 and 115, respectively), the location and quality of the additional observations played an important role. The LE-lowcost-HQ was defined using a high similarity criterion to the official network, so it was not as affected by observations with low quality as LE-lowcost. Comparisons between Figure 6a,b revealed that the additional locations do not increase the spacial density considerably relative to the high-quality lowcost subset. Our results suggest that while a high observation density is essential for improving the performance of a model with data assimilation, it is crucial to consider other factors such as the quality of the data and the location of the sensors. Different techniques of observation localization allow optimizing the number of sensors to improve the data assimilation or other data fusion techniques [41][42][43][44]. We highly recommend implementing these techniques in the development of new low-cost monitoring networks. Apart from minimizing the number of sensors and associated costs, the processing of a reduced number of observations requires less computational resources. As an example, the LE-lowcost simulation was 3.2 times slower than the LE-lowcost-HQ using the same computation configuration. Optimization of computational and time resources are especially important for operational systems.
Jointly with previous work [9,12,14,[45][46][47], our results can support and motivate the development of future low-cost networks and their integration in data fusion applications. According to the literature, North America, Europe, and China harbor most of the current low-cost implementations, with experimental, citizen, and data dissemination purposes [8,48]. In developing countries, a low-cost network, together with a CTM and data assimilation can provide a valuable first approach to monitoring PM without the high cost of an official air quality network.

Conclusions
We present the assimilation of data from a hyper-dense low-cost PM network into the chemical transport model LOTOS-EUROS in a urban setting. The low-cost network provided high quality data comparable to those provided by the official monitoring network. The performance of the model with assimilation of the spatially-dense data from the low-cost network improved both in terms of its representation of the observed dynamics, as well as in its forecast capabilities, highlighting its value as an air-quality management tool. Our results support the idea than with the current advances in the low-cost sensors, it is possible to use low-cost networks and data assimilation to model and predict air quality in urban areas.
Although one of the main advantages of a low-cost networks is the possibility of implementing hyper-dense networks at relatively low costs, it is recommended to prioritize the quality of the data (sensor quality, calibration, maintenance) and the study of optimal localization. High quality data, and the correct number and localization of sensors improves the data assimilation process and minimizes operational and computational costs.