Predict the Air Pollution by Geo-locations in Sofia

Science meets Regions: Energy and Climate Change. Social Effects of Climate Change, 29-31 March 2019

Executive summary

Defining the study


Figure 5: Inversion and pollution as a result of topography of Sofia (Paspaldzhiev, 2018).


Figure 1: Urban population exposure to concentrations above EU standards. Source: (European Environmental Agency, 2017)

Figure 2: Comparative visualization of fine particles with diameter less than 10µm mixture of solid particles and liquid droplets found in the air (PM10) Source: (US EPA, 2017)


Figure 3: AirSofia.Info citizen measurement network ( / Code Foundation - Bulgaria, 2017-2019).

Figure 4: Locations of official measurement stations (World Air Quality Index, 2019).

First, the citizen science measurements are bias-corrected by checking them against the official measurement stations.

Second, a prediction model for the next-day forecast of PM10 is built, using additional factors: meteorological parameters (from a weather forecast) and satellite topography data.

Local Context

The research is oriented as much toward public communication as toward data. In that regard it is a relatively small study aimed at the general public, and it addresses some of the topics hotly discussed in public circles:

1) Wide public mistrust of the official predictions (note: the research team does not endorse this type of mistrust in viable scientific results);

2) The popularity of the civic system of air quality sensors rests mostly on the fact that its data is local (neighborhood by neighborhood), which gives the citizens of Sofia some local context and understanding;

3) While no one disputes the vast technical superiority of the official measurement stations over the civic network sensors, the popular opinion is that the five official stations do not meet the citizens' need for timely, on-the-spot predictions.

Key findings


Figure 7: Methodology framework. Source: Own

Bias correction

Figure 6. Dissection of a citizen network station (OK Lab Stuttgart, 2017)

Preliminary data cleaning

First, we cap the PM10 concentrations reported by citizens’ stations using the official hourly measurements. We apply a threshold of 125%: every observation above 125% of the maximum official measurement for that hour is capped at that level. With this choice, approximately 10% of all observations are capped.
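The capping rule can be illustrated with a minimal sketch. The tuple layout, the function name `cap_citizen_readings`, and the choice to leave a reading untouched when no official value exists for its hour are assumptions made for illustration, not details taken from the study.

```python
def cap_citizen_readings(citizen, official, factor=1.25):
    """Cap citizen PM10 readings at factor * (max official reading for the same hour).

    citizen, official: lists of (station_id, hour, pm10) tuples (assumed layout).
    """
    # Highest official reading observed in each hour.
    hourly_max = {}
    for _, hour, pm10 in official:
        hourly_max[hour] = max(pm10, hourly_max.get(hour, pm10))

    capped = []
    for station, hour, pm10 in citizen:
        limit = hourly_max.get(hour)
        # Assumption: hours without any official value are left uncapped.
        if limit is not None:
            pm10 = min(pm10, factor * limit)
        capped.append((station, hour, pm10))
    return capped
```

For example, if the highest official reading in a given hour is 80 µg/m³, a citizen reading of 250 µg/m³ in that hour would be capped at 100 µg/m³, while a reading of 60 µg/m³ would pass through unchanged.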

Second, we remove stations whose history is shorter than a preset threshold. We select a threshold of 90 days, i.e. we analyze further only citizens’ stations with at least 3 months of history.
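A short sketch of the history filter follows. The text does not say how "history" is measured, so this example assumes it is the span between a station's first and last observation; the function name and input layout are likewise hypothetical.

```python
from datetime import date

def stations_with_enough_history(observations, min_days=90):
    """Return the station ids whose observation span is at least min_days.

    observations: iterable of (station_id, datetime.date) pairs (assumed layout).
    Assumption: history = days between first and last observation.
    """
    first, last = {}, {}
    for station, day in observations:
        first[station] = min(day, first.get(station, day))
        last[station] = max(day, last.get(station, day))
    return {s for s in first if (last[s] - first[s]).days >= min_days}
```

Counting distinct days with data instead of the first-to-last span would be a stricter alternative, since it would also exclude stations with long outages.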

Data preparation

The following procedure was used to identify citizen stations whose data quality might be questionable and to remove them from the dataset used in Module 3:

  • Step 1. Calculate the distances between all station pairs.
  • Step 2. Create a group for each station, consisting of the station itself (referred to as the main station) and all stations within a certain distance of it (referred to as group stations).
  • Step 3. Calculate a dissimilarity measure for each main station-group station pair in the group.
  • Step 4. Based on this dissimilarity measure, identify the station with the most main station-group station pairs that have a large dissimilarity. In case of a tie, pick one of the tied stations at random.
  • Step 5. Remove that station from the dataset and repeat from step 2.
  • Step 6. Stop when some condition is met.
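The steps above can be sketched as follows. The text leaves several choices open, so this example fills them with labeled assumptions: great-circle distance for step 1, mean absolute difference on shared timestamps as the dissimilarity measure, a fixed threshold `max_dissim` for what counts as "large", and "no remaining pair exceeds the threshold" as the stopping condition. All function and parameter names are hypothetical.

```python
import math
import random

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def mean_abs_diff(x, y):
    """Assumed dissimilarity: mean absolute difference on shared timestamps."""
    shared = x.keys() & y.keys()
    if not shared:
        return 0.0
    return sum(abs(x[t] - y[t]) for t in shared) / len(shared)

def prune_stations(coords, series, radius_km, max_dissim, rng=None):
    """coords: {station: (lat, lon)}; series: {station: {timestamp: pm10}}."""
    rng = rng or random.Random(0)  # seeded so the tie-break is reproducible
    ids = sorted(coords)
    # Step 1: distances between all station pairs (computed once).
    dist = {(a, b): haversine_km(coords[a], coords[b])
            for a in ids for b in ids if a != b}
    keep, removed = set(ids), []
    while True:
        # Steps 2-3: for each main station, count dissimilar group stations.
        bad = {}
        for main in keep:
            group = [g for g in keep
                     if g != main and dist[(main, g)] <= radius_km]
            bad[main] = sum(
                mean_abs_diff(series[main], series[g]) > max_dissim
                for g in group
            )
        worst = max(bad.values(), default=0)
        # Step 6 (assumed condition): stop when no pair is too dissimilar.
        if worst == 0:
            return keep, removed
        # Step 4: station with the most dissimilar pairs, ties broken at random.
        tied = sorted(s for s, n in bad.items() if n == worst)
        victim = rng.choice(tied)
        # Step 5: remove it and repeat.
        keep.remove(victim)
        removed.append(victim)
```

Recomputing the counts after every removal matters here: once an outlying station is dropped, its neighbours lose a dissimilar pair and may no longer look suspicious.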

Analysis of factors and features

After an exploratory analysis of the official EEA data, we arrived at the following key findings:

Table 5: List of features used for prediction purposes.

Variable name    Variable label
TASMAX           Daily maximum temperature
TASMIN           Daily minimum temperature
RHAVG            Daily average relative humidity
PSLAVG           Daily average surface pressure
lagP1            Previous day concentration of PM10
CP               Cross-product of current and previous day wind speed
R                Ratio of the previous day concentration of PM10 and the cross-product
D1               Dummy variable reflecting the case of 100% maximum humidity
D2               Dummy variable reflecting the case of 0 km/h minimum wind speed
D3               Dummy variable reflecting the case of 0 mm average precipitation amount
D                D1*D2*D3
Day              Day of the week
Month            Month of the year
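The derived features in Table 5 can be illustrated with a minimal sketch. The input keys (`wind_avg`, `rh_max`, and so on) are hypothetical names, the use of the daily average wind speed for CP is an assumption, and the table does not say how R is handled when the cross-product is zero, so this example returns `None` in that case.

```python
from datetime import date

def build_features(today, yesterday):
    """Build the Table 5 features from two daily records (dicts, assumed keys)."""
    lag = yesterday["pm10"]
    # Assumption: CP multiplies the daily average wind speeds of both days.
    cp = today["wind_avg"] * yesterday["wind_avg"]
    d1 = int(today["rh_max"] >= 100)     # 100% maximum humidity
    d2 = int(today["wind_min"] == 0)     # 0 km/h minimum wind speed
    d3 = int(today["precip_avg"] == 0)   # 0 mm average precipitation
    return {
        "TASMAX": today["tasmax"],
        "TASMIN": today["tasmin"],
        "RHAVG": today["rh_avg"],
        "PSLAVG": today["psl_avg"],
        "lagP1": lag,
        "CP": cp,
        # Assumption: R is left undefined when the wind product is zero.
        "R": lag / cp if cp else None,
        "D1": d1,
        "D2": d2,
        "D3": d3,
        "D": d1 * d2 * d3,
        "Day": today["date"].weekday(),  # 0 = Monday
        "Month": today["date"].month,
    }
```

The combined dummy D equals 1 only on days that are simultaneously saturated, calm, and dry, which is why it is built as a product rather than a sum.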

Prediction Model

This module explains the algorithms and techniques used to predict PM10 concentrations at the so-called citizen stations defined in Module 1, using their relation to the official data. Key findings here are:

Figure 17. Flowchart of Module 3 analysis.

As part of our research, a standalone beta version of a web application has been built to let end users visualize the results of the predictive model and better understand what level of PM10 particles to expect in Sofia. This application (see Figure 16) could serve as a proof of concept to be developed further into a fully automated app with a real-time data feed, which would predict air pollution at different locations in Sofia. However, further development is not part of the current research.

Figure 16: Proof of concept for an interactive map with predictions


Purchase Order: A.B610473 on Request to tender: Ares(2018)5990107