Ordered-Set Aggregates — Percentiles and Mode

Ordered-Set Aggregates — Percentiles and Mode

Ordered-set aggregates are a category of aggregate functions that require the input rows to be sorted before the aggregation is computed. Unlike SUM or COUNT (which don't care about order), functions like percentile and mode are inherently positional — the result depends on where values fall in the sorted distribution.

PostgreSQL provides three ordered-set aggregates:

  • PERCENTILE_CONT(fraction) — continuous interpolated percentile
  • PERCENTILE_DISC(fraction) — discrete (actual existing value) percentile
  • MODE() — most frequently occurring value

These are written with the WITHIN GROUP (ORDER BY ...) syntax, which looks different from regular aggregates but follows a simple pattern.

Continuous vs Discrete Percentile

The distinction between PERCENTILE_CONT and PERCENTILE_DISC matters when the percentile falls between two values:

  • PERCENTILE_DISC(0.5) returns the value that is at or above 50% of sorted rows — always a value that actually exists in the data
  • PERCENTILE_CONT(0.5) interpolates between the two middle values if the dataset has an even number of rows — can return a value not present in the data (e.g. 47.5 when actual values are 45 and 50)

For payment amounts, PERCENTILE_DISC is usually more meaningful (you want an actual payment amount, not an interpolated one). For continuous measurements like temperature or duration, PERCENTILE_CONT is more appropriate.


Example 1: Invoice amount percentiles by industry

This query computes the full percentile distribution of invoice amounts per industry — the 25th percentile (lower quartile), median, 75th percentile (upper quartile), and 90th percentile. These four numbers fully characterise the distribution without needing to display every raw value.

SELECT
    c.industry,
    COUNT(DISTINCT i.invoice_id) AS invoices,
    ROUND((MIN(inv_total.total))::NUMERIC, 2)  AS min_invoice,
    ROUND((PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY inv_total.total))::NUMERIC, 2) AS p25,
    ROUND((PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY inv_total.total))::NUMERIC, 2) AS median,
    ROUND((PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY inv_total.total))::NUMERIC, 2) AS p75,
    ROUND((PERCENTILE_CONT(0.90) WITHIN GROUP (ORDER BY inv_total.total))::NUMERIC, 2) AS p90,
    ROUND((MAX(inv_total.total))::NUMERIC, 2)  AS max_invoice,
    ROUND((AVG(inv_total.total))::NUMERIC, 2)  AS mean_invoice
FROM clients c
JOIN invoices i ON i.client_id = c.client_id
JOIN (
    SELECT invoice_id, SUM(quantity * unit_price) AS total
    FROM invoice_items GROUP BY invoice_id
) inv_total ON inv_total.invoice_id = i.invoice_id
GROUP BY c.industry
ORDER BY median DESC;

What This Returns

industry invoices min p25 median p75 p90 max mean
Energy 45 2,100 8,400 18,200 42,000 78,000 145,000 24,100

When mean > median, the distribution is right-skewed (a few very large invoices pull the average up). The median is the more honest "typical invoice" for skewed data. The p90 tells you the threshold above which the top 10% of invoices fall — useful for identifying strategic accounts.


Example 2: PERCENTILE_DISC for SLA analysis

When analysing days-to-payment (DSO), you want discrete values — actual payment lag times, not interpolated ones. PERCENTILE_DISC gives you the actual number of days that 50% (or 90%) of invoices were paid within.

SELECT
    c.account_manager,
    COUNT(*) AS paid_invoices,
    ROUND((AVG(p.payment_date - i.issue_date))::NUMERIC, 1) AS avg_days_to_pay,
    PERCENTILE_DISC(0.50) WITHIN GROUP (ORDER BY p.payment_date - i.issue_date)
        AS median_days_to_pay,
    PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY p.payment_date - i.issue_date)
        AS p75_days_to_pay,
    PERCENTILE_DISC(0.90) WITHIN GROUP (ORDER BY p.payment_date - i.issue_date)
        AS p90_days_to_pay,
    -- SLA: what % paid within 30 days
    ROUND((100.0 * COUNT(*) FILTER (WHERE p.payment_date - i.issue_date <= 30)
        / COUNT(*))::NUMERIC, 1) AS pct_paid_within_30d
FROM clients c
JOIN invoices i ON i.client_id = c.client_id
JOIN payments p ON p.invoice_id = i.invoice_id
GROUP BY c.account_manager
ORDER BY median_days_to_pay;

What This Returns

The p90_days_to_pay column answers "what payment lag should we expect in the worst case?" — 90% of invoices are paid within this many days. When p90 significantly exceeds the invoice payment terms (typically 30 days), it signals a collection problem that average DSO alone would under-report.


Example 3: MODE() — most frequent value

MODE() returns the most commonly occurring value in the group. It's the aggregate equivalent of "what value appears most often?" — useful for finding the typical payment method, the most common purchase category, the most frequent order size, etc.

SELECT
    c.acquisition_channel,
    COUNT(DISTINCT c.customer_id) AS customers,
    MODE() WITHIN GROUP (ORDER BY p.category) AS top_product_category,
    MODE() WITHIN GROUP (ORDER BY o.promo_code) AS most_used_promo_code,
    MODE() WITHIN GROUP (ORDER BY EXTRACT(DOW FROM DATE_TRUNC('day', o.created_at)::DATE)::INT)
        AS most_common_order_day,  -- 0=Sunday, 6=Saturday
    ROUND((AVG(o.total_amount))::NUMERIC, 2) AS avg_order_value,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY o.total_amount)
        AS median_order_value
FROM ec_customers c
JOIN ec_orders o ON o.customer_id = c.customer_id AND o.status = 'completed'
JOIN ec_order_items oi ON oi.order_id = o.order_id
JOIN ec_products p ON p.product_id = oi.product_id
GROUP BY c.acquisition_channel
ORDER BY customers DESC;

MODE() always returns one value even if there are ties. When multiple values are equally frequent, it returns the last one in sort order. If this matters, use PERCENTILE_DISC(0.5) with ORDER BY value, COUNT(value) DESC in a subquery to control tie-breaking.


Example 4: Multiple percentiles in one pass with ARRAY

Computing many percentiles without repeating the WITHIN GROUP clause:

SELECT
    p.category,
    PERCENTILE_CONT(ARRAY[0.1, 0.25, 0.5, 0.75, 0.9, 0.99])
        WITHIN GROUP (ORDER BY oi.quantity * oi.unit_price) AS revenue_percentiles,
    -- Access individual elements:
    (PERCENTILE_CONT(ARRAY[0.1, 0.25, 0.5, 0.75, 0.9, 0.99])
        WITHIN GROUP (ORDER BY oi.quantity * oi.unit_price))[3] AS median_revenue,
    (PERCENTILE_CONT(ARRAY[0.1, 0.25, 0.5, 0.75, 0.9, 0.99])
        WITHIN GROUP (ORDER BY oi.quantity * oi.unit_price))[5] AS p90_revenue
FROM ec_order_items oi
JOIN ec_products p ON p.product_id = oi.product_id
JOIN ec_orders o ON o.order_id = oi.order_id
WHERE o.status = 'completed'
GROUP BY p.category;

Passing an ARRAY of fractions to PERCENTILE_CONT returns an array of results — all percentiles computed in a single pass. Individual elements are accessed with [n] (1-based in PostgreSQL). This is significantly more efficient than writing six separate PERCENTILE_CONT calls.

Key Takeaway

Ordered-set aggregates are the SQL equivalent of the numpy.percentile or pandas.quantile function — they characterise distributions rather than just computing means. PERCENTILE_CONT for smooth interpolation, PERCENTILE_DISC for actual existing values, MODE for the most common value. The array input form of PERCENTILE_CONT is the most efficient way to compute a full distribution summary in one pass.