PERCENT_RANK() and CUME_DIST() — Statistical Distribution
PERCENT_RANK() and CUME_DIST() measure where a row stands relative to all other rows in the window — returning a value between 0.0 and 1.0. They're essential for distribution analysis, scoring systems, and identifying statistical outliers.
PERCENT_RANK() — relative rank as a fraction
PERCENT_RANK() returns the percentile rank of each row: what fraction of rows in the window have a lower value.
PERCENT_RANK = (rank - 1) / (total rows - 1)
The first row is always 0.0. The last row is always 1.0.
WITH product_revenue AS (
SELECT
p.product_id,
p.name,
p.category,
COALESCE(SUM(oi.quantity * oi.unit_price), 0) AS revenue
FROM products p
LEFT JOIN order_items oi ON oi.product_id = p.product_id
GROUP BY p.product_id, p.name, p.category
)
SELECT
name,
category,
revenue,
ROUND(PERCENT_RANK() OVER (ORDER BY revenue)::NUMERIC, 4) AS pct_rank_global,
ROUND(PERCENT_RANK() OVER (PARTITION BY category ORDER BY revenue)::NUMERIC, 4) AS pct_rank_in_cat
FROM product_revenue
ORDER BY revenue DESC;
A pct_rank_global of 0.90 means: 90% of all products have lower revenue than this one. The product is in the 90th percentile.
CUME_DIST() — cumulative distribution
CUME_DIST() returns the fraction of rows with a value less than or equal to the current row's value:
CUME_DIST = (number of rows with value <= current row's value) / total rows
The key difference from PERCENT_RANK:
PERCENT_RANKfor the top row is always 1.0CUME_DISTfor the top row is also 1.0PERCENT_RANKfor the bottom row is always 0.0CUME_DISTfor the bottom row is1 / total_rows(never 0)
WITH product_revenue AS (
SELECT
p.name,
COALESCE(SUM(oi.quantity * oi.unit_price), 0) AS revenue
FROM products p
LEFT JOIN order_items oi ON oi.product_id = p.product_id
GROUP BY p.product_id, p.name
)
SELECT
name,
revenue,
ROUND(PERCENT_RANK() OVER (ORDER BY revenue)::NUMERIC, 4) AS pct_rank,
ROUND(CUME_DIST() OVER (ORDER BY revenue)::NUMERIC, 4) AS cume_dist
FROM product_revenue
ORDER BY revenue DESC;
CUME_DIST is often more intuitive for answering "what fraction of values are at or below this level?" — useful for threshold-based classification.