Hypothetical-Set Aggregates — Rank and Percentile for Hypothetical Values
Hypothetical-Set Aggregates — Rank and Percentile for Hypothetical Values
Hypothetical-set aggregates answer a very specific but practically useful question: "If I added this hypothetical value to my dataset, where would it rank?"
Without these functions, answering "what rank would a £50,000 invoice have if it existed?" requires running a RANK() window function over the full dataset and then filtering. The hypothetical-set aggregate computes this directly in one pass.
The four hypothetical-set aggregates
| Function | What it returns |
|---|---|
RANK(hypothetical_value) WITHIN GROUP (ORDER BY column) |
Rank of the hypothetical value (with gaps for ties) |
DENSE_RANK(hypothetical_value) WITHIN GROUP (ORDER BY column) |
Rank without gaps |
PERCENT_RANK(hypothetical_value) WITHIN GROUP (ORDER BY column) |
Fraction of rows the value exceeds (0.0 to 1.0) |
CUME_DIST(hypothetical_value) WITHIN GROUP (ORDER BY column) |
Fraction of rows ≤ the hypothetical value |
These are called "hypothetical" because the value being ranked doesn't actually have to exist in the data.
Example 1: Where would a new invoice rank?
A client is about to pay a £45,000 invoice. Where does this rank among all invoices in the same industry?
SELECT
c.industry,
COUNT(*) AS existing_invoices,
-- Where would a £45,000 invoice rank?
RANK(45000) WITHIN GROUP (
ORDER BY (SELECT SUM(ii.quantity * ii.unit_price) FROM invoice_items ii WHERE ii.invoice_id = i.invoice_id)
) AS rank_of_45k,
ROUND((PERCENT_RANK(45000) WITHIN GROUP (
ORDER BY (SELECT SUM(ii.quantity * ii.unit_price) FROM invoice_items ii WHERE ii.invoice_id = i.invoice_id)
) * 100)::NUMERIC, 1) AS percentile_of_45k,
ROUND((MIN((SELECT SUM(ii.quantity * ii.unit_price) FROM invoice_items ii WHERE ii.invoice_id = i.invoice_id)))::NUMERIC, 0) AS min_invoice,
ROUND((MAX((SELECT SUM(ii.quantity * ii.unit_price) FROM invoice_items ii WHERE ii.invoice_id = i.invoice_id)))::NUMERIC, 0) AS max_invoice,
ROUND((PERCENTILE_CONT(0.5) WITHIN GROUP (
ORDER BY (SELECT SUM(ii.quantity * ii.unit_price) FROM invoice_items ii WHERE ii.invoice_id = i.invoice_id)
))::NUMERIC, 0) AS median_invoice
FROM clients c
JOIN invoices i ON i.client_id = c.client_id
GROUP BY c.industry
ORDER BY c.industry;
What This Returns
| industry | existing_invoices | rank_of_45k | percentile_of_45k | median_invoice |
|---|---|---|---|---|
| Energy | 45 | 8 | 82.6% | 18,200 |
| Software | 52 | 12 | 77.1% | 14,800 |
A £45,000 invoice ranks in the 82nd percentile for Energy — meaning it's larger than 82.6% of existing Energy invoices. This is immediately actionable: it's a large-ish invoice that deserves closer collection attention.
Example 2: Benchmarking a company against its peer group
This pattern is used in competitive analysis: given a company's revenue, where do they stand relative to their sector peers?
WITH company_totals AS (
SELECT
c.company_id,
c.ticker,
c.sector,
c.market_cap_b,
-- calculate average daily return as a proxy for performance
ROUND(AVG(100.0 * (sp.close_price - sp.open_price) / NULLIF(sp.open_price, 0))::NUMERIC, 3)
AS avg_daily_return_pct
FROM companies c
JOIN stock_prices sp ON sp.company_id = c.company_id
GROUP BY c.company_id, c.ticker, c.sector, c.market_cap_b
)
SELECT
ct.ticker,
ct.sector,
ct.avg_daily_return_pct AS this_company_return,
-- Rank relative to sector peers
RANK(ct.avg_daily_return_pct) WITHIN GROUP (
ORDER BY peer.avg_daily_return_pct
) AS sector_rank,
COUNT(*) AS sector_peers,
ROUND((PERCENT_RANK(ct.avg_daily_return_pct) WITHIN GROUP (
ORDER BY peer.avg_daily_return_pct
) * 100)::NUMERIC, 1) AS sector_percentile
FROM company_totals ct
JOIN company_totals peer ON peer.sector = ct.sector
GROUP BY ct.ticker, ct.sector, ct.avg_daily_return_pct
ORDER BY ct.sector, sector_rank;
Example 3: Detecting if a new order would be an outlier
Before inserting a new order, check whether its value would be an outlier (top 5%) relative to historical orders for that acquisition channel:
-- What's the threshold that puts an order in the top 5% per channel?
WITH channel_distributions AS (
SELECT
c.acquisition_channel,
COUNT(*) AS order_count,
-- The 95th percentile threshold
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY o.total_amount) AS p95_threshold,
-- Where would a £500 order rank (as a hypothetical check)?
CUME_DIST(500) WITHIN GROUP (ORDER BY o.total_amount) AS cume_dist_of_500
FROM ec_orders o
JOIN ec_customers c ON c.customer_id = o.customer_id
WHERE o.status = 'completed'
GROUP BY c.acquisition_channel
)
SELECT
acquisition_channel,
order_count,
ROUND(p95_threshold::NUMERIC, 2) AS top_5pct_threshold,
ROUND(cume_dist_of_500::NUMERIC * 100, 1) AS pct_orders_at_or_below_500
FROM channel_distributions
ORDER BY p95_threshold DESC;
CUME_DIST(500) returns the fraction of orders with value ≤ £500. If this is 0.72, then £500 is at the 72nd percentile — a fairly typical order. If it's 0.97, then £500 is unusually large and might warrant additional review.
Key Takeaway
Hypothetical-set aggregates are niche but precise: they answer "where would X rank in this distribution?" without materialising the full ranking. They're most useful for benchmarking, threshold analysis, and data quality checks. The PERCENT_RANK and CUME_DIST variants are particularly useful for expressing rankings as percentages for non-technical stakeholders.