CTEs for Data Cleaning and Transformation

Real datasets are messy. CTEs are the natural tool for multi-step data cleaning: each CTE applies one transformation, and the final result is clean data ready for analysis. This is far more maintainable than a single tangled query.


The data cleaning pipeline pattern

Raw data
  → CTE 1: fix data types, handle NULLs
  → CTE 2: normalise values, trim whitespace
  → CTE 3: apply business rules
  → CTE 4: remove duplicates
  → Final: analysis-ready result

Example 1 — cleaning and validating order data

WITH
-- Step 1: identify and flag data quality issues
raw_with_flags AS (
    SELECT
        o.order_id,
        o.customer_id,
        o.created_at,
        o.status,
        CASE WHEN o.customer_id IS NULL              THEN 'missing_customer'
             WHEN o.status NOT IN ('completed','refunded','pending','processing')
                                                     THEN 'invalid_status'
             WHEN o.created_at > NOW()               THEN 'future_date'
             ELSE NULL
        END AS data_issue
    FROM ec_orders o
),
-- Step 2: separate clean from dirty records
clean_orders AS (
    SELECT order_id, customer_id, created_at, status
    FROM raw_with_flags
    WHERE data_issue IS NULL
),
dirty_orders AS (
    SELECT order_id, customer_id, created_at, status, data_issue
    FROM raw_with_flags
    WHERE data_issue IS NOT NULL
)
-- Summary of data quality
SELECT
    (SELECT COUNT(*) FROM clean_orders)  AS clean_records,
    (SELECT COUNT(*) FROM dirty_orders)  AS dirty_records,
    (SELECT COUNT(*) FROM dirty_orders WHERE data_issue = 'missing_customer') AS missing_customer,
    (SELECT COUNT(*) FROM dirty_orders WHERE data_issue = 'invalid_status')   AS invalid_status,
    (SELECT COUNT(*) FROM dirty_orders WHERE data_issue = 'future_date')      AS future_dates;

Example 2 — normalising and standardising

Clean the stock price data to remove any prices where high < low (data entry errors):

WITH
raw_prices AS (
    SELECT
        sp.*,
        c.ticker,
        CASE
            WHEN sp.high_price < sp.low_price    THEN 'high_below_low'
            WHEN sp.close_price < 0              THEN 'negative_price'
            WHEN sp.open_price < 0               THEN 'negative_open'
            WHEN sp.volume = 0                   THEN 'zero_volume'
            ELSE NULL
        END AS quality_issue
    FROM stock_prices sp
    JOIN companies c ON c.company_id = sp.company_id
),
cleaned AS (
    SELECT
        ticker,
        price_date,
        -- Clamp OHLC to logical bounds
        LEAST(open_price, high_price)  AS open_price,
        high_price,
        low_price,
        close_price,
        volume
    FROM raw_prices
    WHERE quality_issue IS NULL
),
with_daily_return AS (
    SELECT
        *,
        ROUND(
            (close_price - LAG(close_price) OVER (PARTITION BY ticker ORDER BY price_date))
            * 100.0
            / NULLIF(LAG(close_price) OVER (PARTITION BY ticker ORDER BY price_date), 0),
            4
        ) AS daily_return_pct
    FROM cleaned
)
SELECT *
FROM with_daily_return
WHERE price_date >= '2025-01-01'
ORDER BY ticker, price_date;

Purchase this course to unlock the full lesson.

Sign up