Global Air Quality & Health Risk Dashboard

Covering 300 major cities across the world with hourly updates

Apache Spark · Apache Airflow · Apache Kafka · Elasticsearch · LocalStack S3 · scikit-learn

Introduction

Air pollution kills more than 7 million people every year according to the WHO — more than malaria, tuberculosis and AIDS combined. Yet the data needed to monitor, analyze and act on this problem is scattered across dozens of heterogeneous sources, in incompatible formats, with very different update frequencies.

This Big Data project addresses this challenge by building an end-to-end platform that ingests data from three complementary sources, enriches it with machine learning, and exposes it in an interactive near-real-time dashboard.

Why Combine Three Sources?

WAQI / aqicn.org

Measurements from certified physical monitoring stations — the same source as IQAir and the WHO. US EPA AQI scale (0–500). Uneven coverage, data sometimes missing.

Open-Meteo / CAMS

European atmospheric model (Copernicus). Being model output rather than station readings, it has no missing data. Also provides weather variables: wind, rain, humidity, boundary layer height.

WHO GHO (AIR_9)

Air-pollution-attributable disease burden by country: COPD, lung cancer, lower respiratory infections, in DALYs/100k inhabitants. Contextualizes the health impact of pollution.

The strategy: WAQI is the primary source (certified data), Open-Meteo is the automatic fallback (universal coverage), and WHO enriches each record with the country's health profile. Every document carries a data_source field tracing which source was used.
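The fallback rule can be illustrated with a minimal sketch. The function names, the record layout, and the data_source values ("waqi", "open_meteo") are assumptions made for illustration; only the data_source field itself comes from the project.

```python
from typing import Optional


def parse_waqi_aqi(raw) -> Optional[float]:
    """Return the WAQI AQI as a float, or None when it is unusable.

    Offline stations may report None or the placeholder string "-".
    """
    if raw is None:
        return None
    try:
        return float(raw)
    except (TypeError, ValueError):
        return None


def select_record(city: str, waqi_raw_aqi, open_meteo_aqi: float) -> dict:
    """Prefer the WAQI station value; fall back to the Open-Meteo value."""
    waqi = parse_waqi_aqi(waqi_raw_aqi)
    if waqi is not None:
        return {"city": city, "aqi": waqi, "data_source": "waqi"}
    # Hypothetical label: the actual data_source values are not documented here.
    return {"city": city, "aqi": float(open_meteo_aqi), "data_source": "open_meteo"}


if __name__ == "__main__":
    print(select_record("Paris", 42, 55))
    print(select_record("Delhi", "-", 180))
```

Keeping the choice in one small function means every downstream document records which source produced its value.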

Pipeline Architecture

WAQI: hourly · real station AQI
Open-Meteo / CAMS: hourly · concentrations + weather
WHO GHO (AIR_9): annual · disease burden by country
data/raw/…: dual-written to s3://airquality-datalake/raw/… via LocalStack
Random Forest training: cached after the first run, skipped if a model already exists
Spark, JSON → Parquet: typed schema enforcement, UTC timestamps
Spark join + 12 enrichments: US EPA AQI · ML pollution source · health risk score · heat amplification · anomaly detection
data/usage/air_quality_enriched/: partitioned by year / month / day
Elasticsearch → Kibana: Global Air Quality & Health Risk Dashboard

Real-time pipeline (Kafka)

WAQI API: 10 cities, polled every 60 seconds
kafka-producer: publishes to topic air_quality_realtime
kafka-consumer: indexes into Elasticsearch air_quality_realtime
Kibana: auto-refresh every 30 s, end-to-end latency < 90 s

Machine Learning: Pollution Source Classification

A Random Forest (scikit-learn, 150 trees) predicts the dominant pollution source for each city at each hour. The five classes: traffic, industry, biomass_burning, dust, mixed.

The model is trained on synthetic data based on known pollution chemistry (high NO₂ → traffic, high SO₂ → industry, high PM2.5/PM10 ratio → biomass). The prediction confidence (source_confidence) is exposed in the dashboard.
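A minimal sketch of how such synthetic training data could be labeled from the chemistry rules above. All thresholds, value ranges, and function names are hypothetical, and the dust rule (coarse particles dominating, so a low PM2.5/PM10 ratio) is an assumption that the text does not spell out.

```python
import random

CLASSES = ("traffic", "industry", "biomass_burning", "dust", "mixed")

# Hypothetical thresholds, chosen only to make the rules concrete.
NO2_HIGH = 60.0          # µg/m³
SO2_HIGH = 40.0          # µg/m³
FINE_RATIO_HIGH = 0.7    # PM2.5/PM10 above this: mostly fine particles
FINE_RATIO_LOW = 0.3     # PM2.5/PM10 below this: mostly coarse particles


def label_source(no2: float, so2: float, pm25: float, pm10: float) -> str:
    """Assign a pollution-source label from simple chemistry rules."""
    if so2 >= SO2_HIGH:
        return "industry"
    if no2 >= NO2_HIGH:
        return "traffic"
    if pm10 > 0:
        ratio = pm25 / pm10
        if ratio >= FINE_RATIO_HIGH:
            return "biomass_burning"
        if ratio <= FINE_RATIO_LOW:
            return "dust"  # assumption: coarse-dominated aerosol
    return "mixed"


def make_synthetic_samples(n: int, seed: int = 0):
    """Draw n random pollutant vectors and label each with label_source."""
    rng = random.Random(seed)
    features, labels = [], []
    for _ in range(n):
        no2 = rng.uniform(0, 120)
        so2 = rng.uniform(0, 80)
        pm10 = rng.uniform(5, 200)
        pm25 = pm10 * rng.uniform(0.1, 0.95)
        features.append([no2, so2, pm25, pm10])
        labels.append(label_source(no2, so2, pm25, pm10))
    return features, labels


if __name__ == "__main__":
    X, y = make_synthetic_samples(5)
    for row, label in zip(X, y):
        print([round(v, 1) for v in row], label)
```

The resulting feature and label lists could then be passed to scikit-learn's RandomForestClassifier(n_estimators=150).fit(X, y), with predict_proba supplying a confidence value such as source_confidence.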

Health Risk Score

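PM2.5 Exposure (50%)

pm_score = min(100, (PM2.5 / 12) × 20)

Current pollution level relative to a reference of 12 µg/m³. Note that 12 µg/m³ corresponds to the former US EPA annual standard; the WHO's 2021 annual guideline for PM2.5 is 5 µg/m³. The score saturates at 100 once PM2.5 reaches 60 µg/m³.

Disease Burden (30%)

disease_score = min(100, (total_dalys / 2500) × 100)

The country's air-pollution-attributable disease burden (WHO GHO), normalized against a ceiling of 2500 DALYs/100k, which sits just above the highest value among the countries of the 300 cities covered.

Atmospheric Mixing (20%)

mixing_penalty = max(0, 1 − boundary_layer_height / 2000)

A low boundary layer traps pollution near the ground instead of letting it disperse. With boundary_layer_height in meters, the penalty falls to 0 at 2000 m and above.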

↓ weighted sum ↓
health_risk = 0.50 × pm_score + 0.30 × disease_score + 0.20 × (mixing_penalty × 100)
Range: 0–100
↓ amplified by heat & humidity ↓

Heat Amplification

heat_factor = 1 + max(0, (temp − 30°C) / 20) × max(0, (humidity − 60%) / 40) × 0.30

Hot, humid conditions make the same level of pollution more dangerous to breathe.

adjusted_health_risk = health_risk × heat_factor
Capped at 100. This is the score that drives the dashboard.

This formula combines exposure (PM2.5), population vulnerability (WHO GHO total DALYs per country), and dispersion conditions (atmospheric boundary layer height). The heat factor amplifies risk during humid heatwaves.

Concrete example: Lagos and Paris may have the same PM2.5, but Lagos scores much higher because Nigeria's total_dalys is ≈ 2339/100k versus ≈ 426/100k for France. The highest value among the countries covered (≈ 2461/100k, Ukraine) is rounded up to 2500 to serve as the normalization ceiling.
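The formulas in this section translate directly into a few lines of code. The sketch below is illustrative only: the function names are hypothetical, and the PM2.5, boundary layer, temperature, and humidity values in the demo are made-up inputs. Only the DALY figures come from the example above.

```python
def health_risk(pm25: float, total_dalys: float, boundary_layer_height: float) -> float:
    """Weighted 0-100 score: 50% exposure, 30% disease burden, 20% mixing."""
    pm_score = min(100.0, (pm25 / 12.0) * 20.0)
    disease_score = min(100.0, (total_dalys / 2500.0) * 100.0)
    mixing_penalty = max(0.0, 1.0 - boundary_layer_height / 2000.0)
    return 0.50 * pm_score + 0.30 * disease_score + 0.20 * (mixing_penalty * 100.0)


def heat_factor(temp_c: float, humidity_pct: float) -> float:
    """Amplification that is only above 1 when it is both hot (> 30 °C) and humid (> 60%)."""
    heat = max(0.0, (temp_c - 30.0) / 20.0)
    humid = max(0.0, (humidity_pct - 60.0) / 40.0)
    return 1.0 + heat * humid * 0.30


def adjusted_health_risk(pm25, total_dalys, boundary_layer_height, temp_c, humidity_pct) -> float:
    """Heat-adjusted score, capped at 100."""
    base = health_risk(pm25, total_dalys, boundary_layer_height)
    return min(100.0, base * heat_factor(temp_c, humidity_pct))


if __name__ == "__main__":
    # Same hypothetical conditions for both cities; only the country DALYs differ.
    conditions = dict(pm25=30.0, boundary_layer_height=1000.0, temp_c=25.0, humidity_pct=50.0)
    lagos = adjusted_health_risk(total_dalys=2339.0, **conditions)
    paris = adjusted_health_risk(total_dalys=426.0, **conditions)
    print(f"Lagos: {lagos:.1f}  Paris: {paris:.1f}")
```

Under these assumed conditions the two cities differ by about 23 points, all of it coming from the 30% disease-burden term.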

LocalStack S3: Distributed Storage

Every JSON file is automatically replicated into an S3 bucket airquality-datalake hosted by LocalStack, a local emulation of AWS. Migrating to real AWS S3 requires changing a single parameter (S3_ENDPOINT). The mechanism is non-blocking: an upload failure never stalls the main pipeline.
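One way to obtain this behavior is to treat the local write as authoritative and the S3 copy as best effort. The sketch below assumes a hypothetical dual_write helper that receives the uploader as a callable; in practice that callable might wrap a boto3 S3 client created with endpoint_url set to the S3_ENDPOINT value, but that wiring is an assumption and is not shown.

```python
import json
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger("datalake")


def dual_write(payload: dict, local_path: Path, upload: Callable[[Path], None]) -> bool:
    """Write payload locally, then try to replicate it.

    Returns True if the replica was uploaded, False if the upload failed.
    An upload failure is logged and swallowed so the pipeline keeps going.
    """
    local_path.parent.mkdir(parents=True, exist_ok=True)
    local_path.write_text(json.dumps(payload))
    try:
        upload(local_path)
        return True
    except Exception as exc:  # deliberately broad: replication is best effort
        logger.warning("S3 replication failed for %s: %s", local_path, exc)
        return False


if __name__ == "__main__":
    def unreachable(path: Path) -> None:
        raise ConnectionError("LocalStack not reachable")

    ok = dual_write({"city": "Delhi", "aqi": 180}, Path("data/raw/demo.json"), unreachable)
    print("replicated:", ok)
```

This version is synchronous: the failure is tolerated rather than moved to the background. A queue or thread could decouple the upload further if upload latency ever became an issue.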

Key Insights

Delhi and Lahore regularly show AQI > 300 in winter, correlated with boundary layers below 200 m that trap pollution like a lid. Lagos has one of the highest risk scores despite a moderate AQI: the high prevalence of air-pollution-attributable diseases (WHO GHO) makes all the difference. Tokyo shows the opposite effect: favorable atmospheric conditions and a high boundary layer keep risk low despite extreme urban density.

Technical Challenges

The main challenge: missing WAQI data (offline stations, "-" values). Solution: graceful fallback to Open-Meteo with data_source traceability.

The WAQI × Open-Meteo join required a scale conversion: WAQI provides an aggregated AQI (0–500), while Open-Meteo provides raw concentrations (µg/m³), so the two are not directly comparable. The Open-Meteo concentrations are converted to the US EPA AQI using the official breakpoint tables, and the results were cross-validated.
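For illustration, here is a minimal sketch of the EPA conversion for PM2.5 alone. It uses the breakpoint table that was in force before the EPA's 2024 revision, which is an assumption about what the project uses; the function name is hypothetical, and the other pollutants follow the same interpolation with their own tables.

```python
import math

# US EPA PM2.5 (24-hour) breakpoints before the 2024 revision:
# (conc_low, conc_high, aqi_low, aqi_high), concentrations in µg/m³.
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def pm25_to_aqi(concentration: float) -> int:
    """Convert a PM2.5 concentration (µg/m³) to a US EPA AQI value."""
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    # EPA convention: truncate PM2.5 to one decimal place before lookup.
    c = math.floor(round(concentration * 10, 6)) / 10
    if c > 500.4:
        return 500  # beyond the table: clamp to the top of the scale
    for c_lo, c_hi, i_lo, i_hi in PM25_BREAKPOINTS:
        if c_lo <= c <= c_hi:
            # Linear interpolation inside the matching breakpoint band.
            return round((c - c_lo) / (c_hi - c_lo) * (i_hi - i_lo) + i_lo)
    raise ValueError(f"no breakpoint band for {c}")


if __name__ == "__main__":
    for value in (5.0, 12.0, 35.5, 180.0):
        print(value, "->", pm25_to_aqi(value))
```

Once both sources are on the same 0–500 scale, the join can compare or substitute them directly.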

Ordered Docker startup (a dozen services with complex dependencies) is handled by Compose health checks and wait loops in start.sh.

What's Next