Figure 5: Inversion and pollution as a result of topography of Sofia (Paspaldzhiev, 2018).
Figure 1: Urban population exposure to concentrations above EU standards. Source: (European Environmental Agency, 2017)
Figure 2: Comparative visualization of fine particles with diameter less than 10µm mixture of solid particles and liquid droplets found in the air (PM10) Source: (US EPA, 2017)
Figure 3: AirSofia.Info citizen measurement network (AirBG.info / Code Foundation - Bulgaria, 2017-2019).
Figure 4: Locations of official measurement stations (World Air Quality Index, 2019).
Firstly there is bias correction of citizen science measurements, checked against the official measurement stations.
Secondly a prediction model for next-day forecast of PM10 is built, using additional factors from meteorological parameters (from a weather forecast) and topography satellite data.
The research is as much data oriented as it is public communication oriented. So in that regard this is a relatively small research oriented towards the general public, and it address some of the hotly discussed topics in the public circles:
1) Wide mistrust by the public to the official predictions (note: the research team does not endorse this type of mistrust in viable scientific results);
2) Popularity of the civic system of air quality sensors is based mostly on the fact that the data is oriented locally (neighborhood by neighborhood), and they give some local context and understanding to for the citizens of Sofia.
3) While no one disputes the vast technical superiority of the official measurement stations over the civic network sensors, the popular opinion is that the five official stations do not meet the needs of the citizens for in-time and on-spot predictions.
Figure 7: Methodology framework. Source: Own
Bias correction model for the data of citizen science measurements of AirSofia.info, checked against the official measurement stations of EEA. This is a most useful result in itself, making the widely available data usable for research and information purposes.
There have been some public concern regarding the sensors in the civic network of stations – all of them use the SDS011 optical sensor (https://airbg.info/).
A background check concludes that most of the critique is towards them is at PM2.5 level, while at PM10 level there are not too many objections to their functioning.
There is uncertainty of the measurements of these sensors under certain conditions (temperature and moisture) so that the measurements start to drift in one direction.
Under such hypothesis a bias-correction algorithm is a must along with all sorts of data cleansing as a standard approach for big data research.
Figure 6. Dissection of a citizen network station (OK Lab Stuttgart, 2017)
Preliminary data cleaning
First of all, we cap levels of PM10 concentration by citizens’ stations taking into account official hourly measurements. We apply a threshold of 125%, i.e. all observations with values above 125% of the maximum official measurement for the particular hour are capped to this level. This choice suggests that approximately 10% of all observations are capped.
Secondly, we remove stations with history of less than a preset threshold value. We select a threshold of 90 days, i.e. we analyze further only citizens’ stations with history of at least 3 months.
The following procedure was used in order to identify citizen stations, where the data quality might be questionable and remove them from the dataset used in module 3:
- Step 1. Calculate the distances between all the station pairs.
- Step 2. Create а group for each station, which include the station (will be referred to as main station) and all the station within a certain distance of it (will be referred as group station).
- Step 3. Calculate a dissimilarity measurement for each pair of main station-group station of in the group.
- Step 4. Based on this dissimilarity measurement, identify the station which has the most main station-group station pairs with a big dissimilarity measurement. In case of a tie, pick one of the tied at random.
- Step 5. Remove the station from the dataset and repeat from step 2.
- Step 6. Stop when some condition is met.
After an exploratory data analysis we concluded the following key findings using the official EEA data:
Table 5: List of features used for prediction purposes.
|Variable name||Variable Label|
|TASMAX||Daily maximum temperature|
|TASMIN||Daily minimum temperature|
|RHAVG||Daily average relative humidity|
|PSLAVG||Daily average surface pressure|
|lagP1||Previous day concentration of PM10|
|CP||Cross-product of current and previous day wind speed|
|R||Ratio of the Previous day concentration of PM10 and the cross-product|
|D1||Dummy variable reflecting the case of 100% maximum humidity|
|D2||Dummy variable reflecting the case of 0 km/h minimum wind speed|
|D3||Dummy variable reflecting the case of 0 mm average precipitation amount|
|Day||Day of the week|
|Month||Month of the year|
This module explains the algorithms and techniques used to predict the PM10 particles of the so-called citizen stations defined in Module 1 by using their relations with the official data. Key findings here are:
The open-source web representation of the research includes methodology, programming code in R for reproducibility, data transformation, numerical simulations, statistical modeling, data visualization, and interactive maps.
Figure 17. Flowchart of Module 3 analysis.
As part of our research, a standalone beta version of web application has been built in order to allow end users to visualize the result of the predictive model and get better understanding of what level of PM10 particles in Sofia to expect. This application (see Figure 16) could be used as a Proof of Concept to be further developed into a fully automated app with real time data feed, which would serve as a predictor of air pollution in different locations of Sofia. However, further development is not part of the current research.
Figure 16 Proof of concept for interactive map with predictions