Hypothetical-Set Aggregates — Rank and Percentile for Hypothetical Values

Hypothetical-Set Aggregates — Rank and Percentile for Hypothetical Values

Hypothetical-set aggregates answer a very specific but practically useful question: "If I added this hypothetical value to my dataset, where would it rank?"

Without these functions, answering "what rank would a £50,000 invoice have if it existed?" requires running a RANK() window function over the full dataset and then filtering. The hypothetical-set aggregate computes this directly in one pass.

The four hypothetical-set aggregates

Function What it returns
RANK(hypothetical_value) WITHIN GROUP (ORDER BY column) Rank of the hypothetical value (with gaps for ties)
DENSE_RANK(hypothetical_value) WITHIN GROUP (ORDER BY column) Rank without gaps
PERCENT_RANK(hypothetical_value) WITHIN GROUP (ORDER BY column) Fraction of rows the value exceeds (0.0 to 1.0)
CUME_DIST(hypothetical_value) WITHIN GROUP (ORDER BY column) Fraction of rows ≤ the hypothetical value

These are called "hypothetical" because the value being ranked doesn't actually have to exist in the data.


Example 1: Where would a new invoice rank?

A client is about to pay a £45,000 invoice. Where does this rank among all invoices in the same industry?

SELECT
    c.industry,
    COUNT(*) AS existing_invoices,
    -- Where would a £45,000 invoice rank?
    RANK(45000) WITHIN GROUP (
        ORDER BY (SELECT SUM(ii.quantity * ii.unit_price) FROM invoice_items ii WHERE ii.invoice_id = i.invoice_id)
    ) AS rank_of_45k,
    ROUND((PERCENT_RANK(45000) WITHIN GROUP (
            ORDER BY (SELECT SUM(ii.quantity * ii.unit_price) FROM invoice_items ii WHERE ii.invoice_id = i.invoice_id)
        ) * 100)::NUMERIC, 1) AS percentile_of_45k,
    ROUND((MIN((SELECT SUM(ii.quantity * ii.unit_price) FROM invoice_items ii WHERE ii.invoice_id = i.invoice_id)))::NUMERIC, 0) AS min_invoice,
    ROUND((MAX((SELECT SUM(ii.quantity * ii.unit_price) FROM invoice_items ii WHERE ii.invoice_id = i.invoice_id)))::NUMERIC, 0) AS max_invoice,
    ROUND((PERCENTILE_CONT(0.5) WITHIN GROUP (
        ORDER BY (SELECT SUM(ii.quantity * ii.unit_price) FROM invoice_items ii WHERE ii.invoice_id = i.invoice_id)
    ))::NUMERIC, 0) AS median_invoice
FROM clients c
JOIN invoices i ON i.client_id = c.client_id
GROUP BY c.industry
ORDER BY c.industry;

What This Returns

industry existing_invoices rank_of_45k percentile_of_45k median_invoice
Energy 45 8 82.6% 18,200
Software 52 12 77.1% 14,800

A £45,000 invoice ranks in the 82nd percentile for Energy — meaning it's larger than 82.6% of existing Energy invoices. This is immediately actionable: it's a large-ish invoice that deserves closer collection attention.


Example 2: Benchmarking a company against its peer group

This pattern is used in competitive analysis: given a company's revenue, where do they stand relative to their sector peers?

WITH company_totals AS (
    SELECT
        c.company_id,
        c.ticker,
        c.sector,
        c.market_cap_b,
        -- calculate average daily return as a proxy for performance
        ROUND(AVG(100.0 * (sp.close_price - sp.open_price) / NULLIF(sp.open_price, 0))::NUMERIC, 3)
            AS avg_daily_return_pct
    FROM companies c
    JOIN stock_prices sp ON sp.company_id = c.company_id
    GROUP BY c.company_id, c.ticker, c.sector, c.market_cap_b
)
SELECT
    ct.ticker,
    ct.sector,
    ct.avg_daily_return_pct AS this_company_return,
    -- Rank relative to sector peers
    RANK(ct.avg_daily_return_pct) WITHIN GROUP (
        ORDER BY peer.avg_daily_return_pct
    ) AS sector_rank,
    COUNT(*) AS sector_peers,
    ROUND((PERCENT_RANK(ct.avg_daily_return_pct) WITHIN GROUP (
            ORDER BY peer.avg_daily_return_pct
        ) * 100)::NUMERIC, 1) AS sector_percentile
FROM company_totals ct
JOIN company_totals peer ON peer.sector = ct.sector
GROUP BY ct.ticker, ct.sector, ct.avg_daily_return_pct
ORDER BY ct.sector, sector_rank;

Example 3: Detecting if a new order would be an outlier

Before inserting a new order, check whether its value would be an outlier (top 5%) relative to historical orders for that acquisition channel:

-- What's the threshold that puts an order in the top 5% per channel?
WITH channel_distributions AS (
    SELECT
        c.acquisition_channel,
        COUNT(*) AS order_count,
        -- The 95th percentile threshold
        PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY o.total_amount) AS p95_threshold,
        -- Where would a £500 order rank (as a hypothetical check)?
        CUME_DIST(500) WITHIN GROUP (ORDER BY o.total_amount) AS cume_dist_of_500
    FROM ec_orders o
    JOIN ec_customers c ON c.customer_id = o.customer_id
    WHERE o.status = 'completed'
    GROUP BY c.acquisition_channel
)
SELECT
    acquisition_channel,
    order_count,
    ROUND(p95_threshold::NUMERIC, 2) AS top_5pct_threshold,
    ROUND(cume_dist_of_500::NUMERIC * 100, 1) AS pct_orders_at_or_below_500
FROM channel_distributions
ORDER BY p95_threshold DESC;

CUME_DIST(500) returns the fraction of orders with value ≤ £500. If this is 0.72, then £500 is at the 72nd percentile — a fairly typical order. If it's 0.97, then £500 is unusually large and might warrant additional review.

Key Takeaway

Hypothetical-set aggregates are niche but precise: they answer "where would X rank in this distribution?" without materialising the full ranking. They're most useful for benchmarking, threshold analysis, and data quality checks. The PERCENT_RANK and CUME_DIST variants are particularly useful for expressing rankings as percentages for non-technical stakeholders.