Industry: Public Health / Global Development · Stack: MySQL (cleaning + EDA) ·


💡 Bottom line. Life expectancy rose ~5 years globally (66.75 in 2007 → 71.62 in 2022), yet a 12-year gap between Developed and Developing nations persists — and wealth tracks longevity by ~9 years. The story isn't that everyone improved; it's that the bottom is converging upward while the top sits near a ceiling.

Table of Contents

The problem

A global health organization holds 15 years of country-level health data — life expectancy, mortality, disease rates, GDP, schooling — across 193 countries (2007–2022). But the raw data can't be trusted yet: it has duplicate records, missing development-status labels, and gaps in the headline metric itself. Before anyone can ask "what drives a long life?", the data has to be made analysis-ready. This project does both, in two acts: clean it, then explore it.

Dataset: 2,941 rows · 193 countries · 2007–2022 · 18 columns.

Why SQL, not Python — and how it maps to the Microsoft stack

In a modern data stack, data this size lives in a warehouse (Snowflake, BigQuery, Azure Synapse / Microsoft Fabric) and is cleaned in place with SQL — because the data is too big to move and the cleaning logic must re-run on every refresh. Python/pandas is for files on your laptop; SQL is data at warehouse scale. This project deliberately works the way production cleaning actually works: the transform layer.

This is built in MySQL, but the shape is deliberately production-grade — a staging → clean → serve flow, which is the medallion architecture:

Layer This project Microsoft stack
Bronze (raw / landing) untouched staging copy of the raw import landing zone, ingested + orchestrated by Azure Data Factory
Silver (cleaned) Act 1 cleaning → trusted worldlifeexpectancy same logic as T-SQL in a Fabric Warehouse, run by ADF
Gold (serve) Act 2 EDA queries — the business logic warehouse views feeding the Power BI semantic model

On Azure/Fabric the work splits cleanly: ADF orchestrates and lands raw data, the Act 1 cleaning runs as T-SQL transformations in a Fabric Warehouse, and the Act 2 queries become warehouse views a Power BI model builds on. MySQL → T-SQL is a dialect change, not a re-skill — the logic ports directly, written once and inherited by every downstream report.

The deeper point holds even with AI in the loop: an agent doing end-to-end analysis on warehouse data is still writing and running SQL against it. The durable skill isn't typing the query fast — it's knowing what correct looks like in the data and being able to verify the output. This project is built to demonstrate exactly that judgment.

Note: this is a learning project built to production shape, not a live Azure pipeline — the mapping shows where each step would live in a real Fabric environment. This piece deliberately focuses on the SQL transform layer; the BI / visualization layer is a separate piece of work.


Act 1 — Data cleaning

The raw import was 2,941 rows across 193 countries. Before trusting a single average, the data had to be made reliable — so cleaning ran in a deliberate order: structural issues first (duplicates), then categorical gaps (Status), then the headline metric (Life expectancy). Every fix was previewed as a SELECT before being promoted to an UPDATE/DELETE — the discipline that keeps cleaning reversible.

Step 1 · Remove duplicate Country + Year records