Covering 300 major cities across the world with hourly updates
Air pollution kills more than 7 million people every year according to the WHO — more than malaria, tuberculosis and AIDS combined. Yet the data needed to monitor, analyze and act on this problem is scattered across dozens of heterogeneous sources, in incompatible formats, with very different update frequencies.
This Big Data project addresses this challenge by building an end-to-end platform that ingests data from three complementary sources, enriches it with machine learning, and exposes it in an interactive near-real-time dashboard.
WAQI: measurements from certified physical monitoring stations, the same source as IQAir and the WHO. US EPA AQI scale (0–500). Uneven coverage, data sometimes missing.
Open-Meteo: European atmospheric model (Copernicus). Never any missing data. Also provides weather variables: wind, rain, humidity, boundary layer height.
WHO GHO: air-pollution-attributable disease burden by country (COPD, lung cancer, lower respiratory infections), in DALYs per 100k inhabitants. Contextualizes the health impact of pollution.
The strategy: WAQI is the primary source (certified data), Open-Meteo is the automatic fallback (universal coverage), and WHO enriches each record with the country's health profile. Every document carries a data_source field tracing which source was used.
[Pipeline diagram: s3://airquality-datalake/raw/… via LocalStack; air_quality_realtime]
A Random Forest (scikit-learn, 150 trees) predicts the dominant pollution source for each city at each hour. The five classes: traffic, industry, biomass_burning, dust, mixed.
The model is trained on synthetic data based on known pollution chemistry (high NO₂ → traffic, high SO₂ → industry, high PM2.5/PM10 ratio → biomass). The prediction confidence (source_confidence) is exposed in the dashboard.
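The training setup described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the three features, the class centers, the noise levels and the helper names (synthetic_dataset, train_model, predict_source) are all assumptions chosen to mirror the stated rules (high NO₂ → traffic, high SO₂ → industry, high PM2.5/PM10 ratio → biomass). Only the 150-tree Random Forest and the five class names come from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical class centers: (NO2 µg/m³, SO2 µg/m³, PM2.5/PM10 ratio).
# The numbers are illustrative assumptions, not the project's values.
CENTERS = {
    "traffic": (80.0, 10.0, 0.5),
    "industry": (30.0, 60.0, 0.5),
    "biomass_burning": (30.0, 10.0, 0.9),
    "dust": (20.0, 10.0, 0.2),
    "mixed": (40.0, 25.0, 0.5),
}
NOISE = (10.0, 8.0, 0.05)  # assumed standard deviation per feature


def synthetic_dataset(n_per_class=200, seed=0):
    """Draw noisy samples around each class center."""
    rng = np.random.default_rng(seed)
    features, labels = [], []
    for label, center in CENTERS.items():
        features.append(rng.normal(center, NOISE, size=(n_per_class, 3)))
        labels.extend([label] * n_per_class)
    return np.vstack(features), np.array(labels)


def train_model():
    """Fit a 150-tree Random Forest on the synthetic samples."""
    X, y = synthetic_dataset()
    model = RandomForestClassifier(n_estimators=150, random_state=0)
    model.fit(X, y)
    return model


def predict_source(model, no2, so2, pm_ratio):
    """Return (predicted class, confidence) for one observation."""
    proba = model.predict_proba([[no2, so2, pm_ratio]])[0]
    best = int(np.argmax(proba))
    return str(model.classes_[best]), float(proba[best])


model = train_model()
label, source_confidence = predict_source(model, no2=120.0, so2=10.0, pm_ratio=0.5)
print(label, round(source_confidence, 2))
```

Here the confidence is simply the highest class probability returned by predict_proba, which is one plausible way to derive a source_confidence value.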
pm_score = min(100, (PM2.5 / 12) × 20)
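Current pollution level relative to a reference of 12 µg/m³. This value matches the former US EPA annual PM2.5 standard rather than a WHO guideline (the WHO 2021 annual guideline is 5 µg/m³).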
disease_score = min(100, (total_dalys / 2500) × 100)
The country's air-pollution-attributable disease burden (WHO GHO), normalized against the worst case among the countries represented in our 300 cities.
mixing_penalty = max(0, 1 − boundary_layer_height / 2000)
A low boundary layer traps pollution near the ground instead of letting it disperse.
heat_factor = 1 + max(0, (temp − 30°C) / 20) × max(0, (humidity − 60%) / 40) × 0.30
Hot, humid conditions make the same level of pollution more dangerous to breathe.
The risk score built from these terms combines exposure (PM2.5), population vulnerability (WHO GHO total DALYs per country), and dispersion conditions (atmospheric boundary layer height). The heat factor amplifies risk during humid heatwaves.
Concrete example: Lagos and Paris may have the same PM2.5, but Lagos scores much higher because Nigeria's total_dalys ≈ 2339/100k versus ~426 for France. The highest value among the countries of our 300 cities (≈2461/100k) belongs to Ukraine; it is rounded up to 2500 and used as the normalization ceiling.
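The four terms defined above translate directly into code. The sketch below only implements the individual components as written; how they are combined into the final score is not stated here, so no combination is attempted. The function and parameter names are assumptions.

```python
def pm_score(pm25):
    """Exposure term: PM2.5 in µg/m³ against the 12 µg/m³ reference, capped at 100."""
    return min(100.0, pm25 / 12 * 20)


def disease_score(total_dalys):
    """Vulnerability term: DALYs per 100k against the 2500 ceiling, capped at 100."""
    return min(100.0, total_dalys / 2500 * 100)


def mixing_penalty(boundary_layer_height_m):
    """Dispersion term: 1 at ground level, falling to 0 at 2000 m and above."""
    return max(0.0, 1 - boundary_layer_height_m / 2000)


def heat_factor(temp_c, humidity_pct):
    """Amplifier: exceeds 1 only when it is both hotter than 30 °C and above 60 % humidity."""
    heat = max(0.0, (temp_c - 30) / 20)
    humid = max(0.0, (humidity_pct - 60) / 40)
    return 1 + heat * humid * 0.30


# Same PM2.5, very different vulnerability (DALY figures quoted in the text).
print(round(disease_score(2339), 2))  # Nigeria
print(round(disease_score(426), 2))   # France
print(mixing_penalty(200))            # shallow winter boundary layer
print(heat_factor(35, 80))            # humid heatwave
```

With the figures quoted above, Nigeria's disease term comes out at about 93.6 against about 17.0 for France, which is why identical PM2.5 readings can lead to very different risk.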
Every JSON file is automatically replicated into an S3 bucket airquality-datalake hosted by LocalStack, a local emulation of AWS. Migrating to real AWS S3 requires changing a single parameter (S3_ENDPOINT). The mechanism is non-blocking: an upload failure never stalls the main pipeline.
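One way to obtain this non-blocking behaviour is to hand each upload to a background worker and swallow failures there. The sketch below is an assumption about how that could look, not the project's code: replicate_async and make_s3_uploader are hypothetical names, and boto3 is assumed as the S3 client. Only the bucket name and the S3_ENDPOINT parameter come from the text.

```python
import logging
import os
from concurrent.futures import ThreadPoolExecutor

log = logging.getLogger("datalake")
_executor = ThreadPoolExecutor(max_workers=2)  # background upload workers


def make_s3_uploader(bucket="airquality-datalake"):
    """Build an upload(key, body) function backed by boto3 (assumed client).

    S3_ENDPOINT points at LocalStack; leaving it unset makes boto3
    use the real AWS endpoint.
    """
    import boto3  # imported lazily so the rest of the sketch runs without it

    client = boto3.client("s3", endpoint_url=os.environ.get("S3_ENDPOINT"))

    def upload(key, body):
        client.put_object(Bucket=bucket, Key=key, Body=body)

    return upload


def replicate_async(upload, key, body):
    """Schedule an upload and return immediately with a Future.

    The Future resolves to True on success and False on failure;
    exceptions are logged instead of propagating to the pipeline.
    """
    def task():
        try:
            upload(key, body)
            return True
        except Exception as exc:
            log.warning("S3 replication failed for %s: %s", key, exc)
            return False

    return _executor.submit(task)
```

The main loop would call replicate_async(upload, key, payload) and carry on without waiting for the result, so a LocalStack outage only produces warnings in the log.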
Delhi and Lahore regularly show AQI > 300 in winter, correlated with boundary layers below 200 m that trap pollution like a lid. Lagos has one of the highest risk scores despite a moderate AQI: the high prevalence of air-pollution-attributable diseases (WHO GHO) makes all the difference. Tokyo shows the opposite effect: favorable atmospheric conditions and a high boundary layer keep risk low despite extreme urban density.
The main challenge: missing WAQI data (offline stations, "-" values). Solution: graceful fallback to Open-Meteo with data_source traceability.
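The fallback logic can be illustrated with a small sketch. The record layout, the function names and the data_source labels ("waqi", "open-meteo") are assumptions; the text only specifies that WAQI is preferred, that missing values such as "-" trigger the fallback, and that the chosen source is recorded.

```python
import math


def parse_waqi_value(raw):
    """Return a float, or None when the station value is missing ("-", None, NaN)."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    return None if math.isnan(value) else value


def build_record(city, waqi_aqi_raw, open_meteo_aqi):
    """Prefer the WAQI reading; fall back to Open-Meteo and record which was used."""
    aqi = parse_waqi_value(waqi_aqi_raw)
    if aqi is not None:
        return {"city": city, "aqi": aqi, "data_source": "waqi"}
    return {"city": city, "aqi": float(open_meteo_aqi), "data_source": "open-meteo"}


print(build_record("Paris", 42, 40))
print(build_record("Delhi", "-", 180))
```

Keeping the parsing separate from the source selection means any new "missing" marker only has to be handled in one place.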
The WAQI × Open-Meteo join required a unit conversion: WAQI provides an aggregated AQI (0–500), Open-Meteo provides raw concentrations (µg/m³). The EPA conversion was implemented against official tables with cross-validation.
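For PM2.5, the EPA conversion is a piecewise-linear interpolation between concentration breakpoints. The sketch below assumes the pre-2024 EPA PM2.5 breakpoint table, which is consistent with the 12 µg/m³ reference used elsewhere in this project. The project's actual table and the other pollutants it converts are not shown here, and pm25_to_aqi is a hypothetical name.

```python
import math

# (C_low, C_high, I_low, I_high): pre-2024 US EPA PM2.5 breakpoints, 24-hour, µg/m³.
# Assumption: the project may use a different or newer table.
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def pm25_to_aqi(concentration):
    """Convert a PM2.5 concentration (µg/m³) to the US EPA AQI (0-500)."""
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    # EPA truncates PM2.5 to one decimal before the lookup; the inner
    # round() only guards against float noise such as 81.99999999999999.
    c = math.floor(round(concentration * 10, 6)) / 10
    for c_low, c_high, i_low, i_high in PM25_BREAKPOINTS:
        if c_low <= c <= c_high:
            return round((i_high - i_low) / (c_high - c_low) * (c - c_low) + i_low)
    return 500  # beyond the table: capped at the top of the scale (sketch choice)


for value in (6.0, 12.0, 35.5, 150.4):
    print(value, pm25_to_aqi(value))
```

Checking that each breakpoint edge maps to its band limit (12.0 → 50, 35.5 → 101, and so on) is a quick sanity check against the official table.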
Ordered Docker startup (a dozen services with complex dependencies) is handled by Compose health checks and wait loops in start.sh.