Global Air Quality & Health Risk Dashboard

Introduction

Air pollution kills more than 7 million people every year according to the WHO, more than malaria, tuberculosis and AIDS combined. Yet the data needed to monitor, analyze and act on this problem is scattered across dozens of heterogeneous sources, in incompatible formats, with very different update frequencies.

This Big Data project addresses this challenge by building an end-to-end platform that ingests data from three complementary sources, enriches it with machine learning, and exposes it in an interactive near-real-time dashboard.

Why Combine Three Sources?

WAQI / aqicn.org

Measurements from certified physical monitoring stations - the same source as IQAir and the WHO. US EPA AQI scale (0–500). Uneven coverage, data sometimes missing.

Open-Meteo / CAMS

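European atmospheric model (Copernicus). Because it is model output rather than station measurements, it has no station gaps and returns a value for every location. Also provides weather variables: wind, rain, humidity, boundary layer height.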

WHO GHO (AIR_9)

Air-pollution-attributable disease burden by country: COPD, lung cancer, lower respiratory infections, in DALYs/100k inhabitants. Contextualizes the health impact of pollution.

The strategy: WAQI is the primary source (certified data), Open-Meteo is the automatic fallback (universal coverage), and WHO enriches each record with the country's health profile. Every document carries a data_source field tracing which source was used.
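The fallback logic can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `pick_reading` and the field names `aqi`, `us_aqi` and `total_dalys` are assumptions chosen for the example; only `data_source` comes from the description above.

```python
from typing import Optional


def pick_reading(waqi: Optional[dict], open_meteo: dict,
                 who_profile: Optional[dict]) -> dict:
    """Prefer WAQI, fall back to Open-Meteo, then attach the WHO context."""
    # A missing payload or a non-numeric value such as "-" counts as unusable.
    aqi = waqi.get("aqi") if waqi else None
    if isinstance(aqi, (int, float)) and not isinstance(aqi, bool):
        record = {"aqi": aqi, "data_source": "waqi"}
    else:
        # Hypothetical Open-Meteo field name for the modelled AQI.
        record = {"aqi": open_meteo["us_aqi"], "data_source": "open_meteo"}
    # The country health profile is optional; keep the key so the schema is stable.
    record["total_dalys"] = who_profile.get("total_dalys") if who_profile else None
    return record


if __name__ == "__main__":
    print(pick_reading({"aqi": "-"}, {"us_aqi": 90}, {"total_dalys": 426}))
```

Keeping the `total_dalys` key even when no WHO profile is available is a design choice of this sketch: it keeps every record on the same schema, which simplifies typed downstream processing.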

Pipeline Architecture

WAQI: hourly · real station AQI

Open-Meteo / CAMS: hourly · concentrations + weather

WHO GHO (AIR_9): annual · disease burden by country

↓

data/raw/…: dual-written to s3://airquality-datalake/raw/… via LocalStack

↓

Random Forest training: cached after the first run, skipped if a model already exists

↓

Spark - JSON → Parquet: typed schema enforcement, UTC timestamps

↓

Spark join + 12 enrichments: US EPA AQI · ML pollution source · health risk score · heat amplification · anomaly detection

↓

data/usage/air_quality_enriched/: partitioned by year / month / day

↓

Elasticsearch → Kibana: Global Air Quality & Health Risk Dashboard

Real-time pipeline (Kafka)

WAQI API: 10 cities, polled every 60 seconds

↓

kafka-producer: publishes to topic air_quality_realtime

↓

kafka-consumer: indexes into Elasticsearch air_quality_realtime

↓

Kibana: auto-refresh every 30 s, end-to-end latency < 90 s

Machine Learning: Pollution Source Classification

A Random Forest (scikit-learn, 150 trees) predicts the dominant pollution source for each city at each hour. The five classes: traffic, industry, biomass_burning, dust, mixed.

The model is trained on synthetic data based on known pollution chemistry (high NO₂ → traffic, high SO₂ → industry, high PM2.5/PM10 ratio → biomass). The prediction confidence (source_confidence) is exposed in the dashboard.
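One way such synthetic training rows could be generated is sketched below. Every numeric threshold, the rule order, and the "low PM2.5/PM10 ratio → dust" rule are assumptions made for this illustration; the source only states the three rules listed above. The labelled rows would then be passed to a classifier such as scikit-learn's `RandomForestClassifier`.

```python
import random

CLASSES = ("traffic", "industry", "biomass_burning", "dust", "mixed")


def label_source(no2: float, so2: float, pm25: float, pm10: float) -> str:
    """Rule-based label; all thresholds are illustrative assumptions."""
    ratio = pm25 / pm10 if pm10 > 0 else 0.0
    if no2 > 80:        # high NO2 -> traffic
        return "traffic"
    if so2 > 40:        # high SO2 -> industry
        return "industry"
    if ratio > 0.7:     # fine particles dominate -> biomass burning
        return "biomass_burning"
    if pm10 > 0 and ratio < 0.3:  # coarse particles dominate -> dust (assumed rule)
        return "dust"
    return "mixed"


def make_synthetic_rows(n: int, seed: int = 0) -> list:
    """Draw n random pollutant vectors and label them with the rules above."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        pm10 = rng.uniform(5, 300)
        pm25 = pm10 * rng.uniform(0.1, 0.95)  # PM2.5 is a fraction of PM10
        no2 = rng.uniform(0, 150)
        so2 = rng.uniform(0, 80)
        features = (no2, so2, pm25, pm10)
        rows.append((features, label_source(*features)))
    return rows


if __name__ == "__main__":
    for features, label in make_synthetic_rows(3):
        print([round(x, 1) for x in features], label)
```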

Health Risk Score

50%

PM2.5 Exposure

pm_score = min(100, (PM2.5 / 12) × 20)

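Current pollution level relative to a 12 µg/m³ reference. This value matches the former US EPA annual standard and the top of the EPA "Good" AQI band; the WHO annual guideline itself is stricter (5 µg/m³ since 2021).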

30%

Disease Burden

disease_score = min(100, (total_dalys / 2500) × 100)

The country's air-pollution-attributable disease burden (WHO GHO), normalized against the worst case among the countries covered by our 300 cities.

20%

Atmospheric Mixing

mixing_penalty = max(0, 1 − boundary_layer_height / 2000)

A low boundary layer traps pollution near the ground instead of letting it disperse.

↓ weighted sum ↓

health_risk = 0.50 × pm_score + 0.30 × disease_score + 0.20 × (mixing_penalty × 100)

Range: 0 – 100

↓ amplified by heat & humidity ↓

Heat Amplification

heat_factor = 1 + max(0, (temp − 30°C) / 20) × max(0, (humidity − 60%) / 40) × 0.30

Hot, humid conditions make the same level of pollution more dangerous to breathe.

↓

adjusted_health_risk = health_risk × heat_factor

Capped at 100. This is the score that drives the dashboard.

This formula combines exposure (PM2.5), population vulnerability (WHO GHO total DALYs per country), and dispersion conditions (atmospheric boundary layer height). The heat factor amplifies risk during humid heatwaves.

Concrete example: Lagos and Paris may have the same PM2.5, but Lagos scores much higher because Nigeria's total_dalys ≈ 2339/100k versus ~426 for France. The highest value among the countries covered by our 300 cities (≈2461/100k) belongs to Ukraine; rounded up, it gives the normalization ceiling of 2500.
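The formulas of this section translate directly into a few lines of Python. The sketch below follows them term by term; the function names and the sample inputs (PM2.5 of 30 µg/m³, a 1000 m boundary layer, 25 °C, 50% humidity) are illustrative assumptions, while the DALY values are the ones quoted above.

```python
def health_risk(pm25: float, total_dalys: float, boundary_layer_height: float) -> float:
    """Weighted sum of exposure, disease burden and atmospheric mixing (0-100)."""
    pm_score = min(100.0, (pm25 / 12.0) * 20.0)
    disease_score = min(100.0, (total_dalys / 2500.0) * 100.0)
    mixing_penalty = max(0.0, 1.0 - boundary_layer_height / 2000.0)
    return 0.50 * pm_score + 0.30 * disease_score + 0.20 * (mixing_penalty * 100.0)


def heat_factor(temp_c: float, humidity_pct: float) -> float:
    """Amplification above 30 °C and 60% humidity; equals 1.0 otherwise."""
    heat = max(0.0, (temp_c - 30.0) / 20.0)
    humid = max(0.0, (humidity_pct - 60.0) / 40.0)
    return 1.0 + heat * humid * 0.30


def adjusted_health_risk(pm25: float, total_dalys: float, boundary_layer_height: float,
                         temp_c: float, humidity_pct: float) -> float:
    """Heat-adjusted score, capped at 100."""
    base = health_risk(pm25, total_dalys, boundary_layer_height)
    return min(100.0, base * heat_factor(temp_c, humidity_pct))


if __name__ == "__main__":
    # Same air, same weather; only the country's disease burden differs.
    lagos = adjusted_health_risk(30, 2339, 1000, 25, 50)
    paris = adjusted_health_risk(30, 426, 1000, 25, 50)
    print(round(lagos, 1), round(paris, 1))  # 63.1 40.1
```

With identical pollution and weather inputs, the gap of about 23 points comes entirely from the disease term: 0.30 × (93.56 − 17.04).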

LocalStack S3: Distributed Storage

Every JSON file is automatically replicated into an S3 bucket airquality-datalake hosted by LocalStack, a local emulation of AWS. Migrating to real AWS S3 requires changing a single parameter (S3_ENDPOINT). The mechanism is non-blocking: an upload failure never stalls the main pipeline.
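A dual write that tolerates upload failures might look like the sketch below. It assumes `S3_ENDPOINT` is read from the environment and that the upload goes through boto3; the helper names `make_s3_uploader` and `dual_write` are hypothetical. When `S3_ENDPOINT` is unset, boto3 targets real AWS S3, which is the single-parameter switch described above.

```python
import json
import logging
import os
from pathlib import Path
from typing import Callable

logger = logging.getLogger("datalake")


def make_s3_uploader(bucket: str = "airquality-datalake") -> Callable[[str, bytes], None]:
    """Build an uploader; with LocalStack, S3_ENDPOINT is e.g. http://localhost:4566."""
    import boto3  # imported lazily so the local write path works without it

    client = boto3.client("s3", endpoint_url=os.environ.get("S3_ENDPOINT"))

    def upload(key: str, body: bytes) -> None:
        client.put_object(Bucket=bucket, Key=key, Body=body)

    return upload


def dual_write(local_path: Path, key: str, payload: dict,
               upload: Callable[[str, bytes], None]) -> bool:
    """Write locally first, then try S3. Returns False if the upload failed."""
    body = json.dumps(payload).encode("utf-8")
    local_path.parent.mkdir(parents=True, exist_ok=True)
    local_path.write_bytes(body)  # the local copy is the source of truth
    try:
        upload(key, body)
        return True
    except Exception as exc:  # any upload error is logged, never re-raised
        logger.warning("S3 upload failed for %s: %s", key, exc)
        return False
```

Passing the uploader in as a callable keeps the pipeline code independent of boto3 and makes the failure path easy to exercise with a stub.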

Key Insights

Delhi and Lahore regularly show AQI > 300 in winter, correlated with boundary layers below 200 m that trap pollution like a lid.

Lagos has one of the highest risk scores despite a moderate AQI: the high prevalence of air-pollution-attributable diseases (WHO GHO) makes all the difference.

Tokyo shows the opposite effect: favorable atmospheric conditions and a high boundary layer keep risk low despite extreme urban density.

Technical Challenges

The main challenge: missing WAQI data (offline stations, "-" values). Solution: graceful fallback to Open-Meteo with data_source traceability.

The WAQI × Open-Meteo join required a scale conversion: WAQI provides an aggregated AQI (0–500), while Open-Meteo provides raw concentrations (µg/m³). The EPA conversion was implemented from the official breakpoint tables and cross-checked.
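For PM2.5, the EPA conversion is a piecewise-linear interpolation between concentration breakpoints. The sketch below assumes the pre-2024 PM2.5 breakpoint table; the EPA revised these breakpoints in 2024, and which version the project uses is not stated here. The function name and the decision to cap out-of-range values at 500 are also assumptions.

```python
from decimal import Decimal, ROUND_DOWN

# (C_low, C_high, I_low, I_high): pre-2024 US EPA PM2.5 breakpoints, 24-hour, µg/m³.
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]


def pm25_to_aqi(concentration: float) -> int:
    """Convert a PM2.5 concentration (µg/m³) to a US EPA AQI value."""
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    # The EPA method truncates PM2.5 to one decimal place before the lookup.
    truncated = Decimal(str(concentration)).quantize(Decimal("0.1"), rounding=ROUND_DOWN)
    c = float(truncated)
    for c_lo, c_hi, i_lo, i_hi in PM25_BREAKPOINTS:
        if c_lo <= c <= c_hi:
            # Linear interpolation inside the matching band.
            return round((i_hi - i_lo) / (c_hi - c_lo) * (c - c_lo) + i_lo)
    return 500  # beyond the table: cap at the top of the scale (assumption)


if __name__ == "__main__":
    for value in (8.0, 12.0, 35.5, 180.0):
        print(value, pm25_to_aqi(value))
```

Truncating through `Decimal` avoids floating-point surprises at band edges such as 12.0 versus 12.1.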

Ordered Docker startup (a dozen services with complex dependencies) is handled by Compose health checks and wait loops in start.sh.

What's Next

Spark Structured Streaming - replace the hourly batch with a streaming pipeline consuming from Kafka
Delta Lake / Iceberg - ACID transactions and time travel on the Gold layer
LSTM forecasting - forecast air quality 24 h ahead from the historical time series
AWS production - real S3, EMR for Spark, MSK for Kafka, OpenSearch
500+ cities - extend geographic coverage further via the WAQI global API