CTEs for Data Cleaning and Transformation
Real datasets are messy. CTEs are the natural tool for multi-step data cleaning: each CTE applies one transformation, and the final result is clean data ready for analysis. This is far more maintainable than a single tangled query.
The data cleaning pipeline pattern
Raw data
→ CTE 1: fix data types, handle NULLs
→ CTE 2: normalise values, trim whitespace
→ CTE 3: apply business rules
→ CTE 4: remove duplicates
→ Final: analysis-ready result
Example 1 — cleaning and validating order data
WITH
-- Step 1: identify and flag data quality issues
raw_with_flags AS (
SELECT
o.order_id,
o.customer_id,
o.created_at,
o.status,
CASE WHEN o.customer_id IS NULL THEN 'missing_customer'
WHEN o.status NOT IN ('completed','refunded','pending','processing')
THEN 'invalid_status'
WHEN o.created_at > NOW() THEN 'future_date'
ELSE NULL
END AS data_issue
FROM ec_orders o
),
-- Step 2: separate clean from dirty records
clean_orders AS (
SELECT order_id, customer_id, created_at, status
FROM raw_with_flags
WHERE data_issue IS NULL
),
dirty_orders AS (
SELECT order_id, customer_id, created_at, status, data_issue
FROM raw_with_flags
WHERE data_issue IS NOT NULL
)
-- Summary of data quality
SELECT
(SELECT COUNT(*) FROM clean_orders) AS clean_records,
(SELECT COUNT(*) FROM dirty_orders) AS dirty_records,
(SELECT COUNT(*) FROM dirty_orders WHERE data_issue = 'missing_customer') AS missing_customer,
(SELECT COUNT(*) FROM dirty_orders WHERE data_issue = 'invalid_status') AS invalid_status,
(SELECT COUNT(*) FROM dirty_orders WHERE data_issue = 'future_date') AS future_dates;
Example 2 — normalising and standardising
Clean the stock price data to remove any prices where high < low (data entry errors):
WITH
raw_prices AS (
SELECT
sp.*,
c.ticker,
CASE
WHEN sp.high_price < sp.low_price THEN 'high_below_low'
WHEN sp.close_price < 0 THEN 'negative_price'
WHEN sp.open_price < 0 THEN 'negative_open'
WHEN sp.volume = 0 THEN 'zero_volume'
ELSE NULL
END AS quality_issue
FROM stock_prices sp
JOIN companies c ON c.company_id = sp.company_id
),
cleaned AS (
SELECT
ticker,
price_date,
-- Clamp OHLC to logical bounds
LEAST(open_price, high_price) AS open_price,
high_price,
low_price,
close_price,
volume
FROM raw_prices
WHERE quality_issue IS NULL
),
with_daily_return AS (
SELECT
*,
ROUND(
(close_price - LAG(close_price) OVER (PARTITION BY ticker ORDER BY price_date))
* 100.0
/ NULLIF(LAG(close_price) OVER (PARTITION BY ticker ORDER BY price_date), 0),
4
) AS daily_return_pct
FROM cleaned
)
SELECT *
FROM with_daily_return
WHERE price_date >= '2025-01-01'
ORDER BY ticker, price_date;