Statistical Aggregates — Correlation, Regression, and Variance

Statistical Aggregates — Correlation, Regression, and Variance

PostgreSQL includes a full suite of statistical aggregate functions that go far beyond AVG and STDDEV. These functions enable in-database statistical analysis — computing correlation between variables, fitting linear regression lines, and testing for statistical significance — without exporting to Python or R.

The statistical aggregate functions

Function Returns Use case
CORR(y, x) Pearson correlation coefficient (-1 to 1) How strongly do two variables move together?
REGR_SLOPE(y, x) Slope of the least-squares regression line For each unit increase in x, how much does y change?
REGR_INTERCEPT(y, x) Y-intercept of the regression line What is the predicted y when x = 0?
REGR_R2(y, x) R² coefficient of determination How much of y's variance is explained by x?
REGR_COUNT(y, x) Count of non-null (y, x) pairs Sample size for the regression
STDDEV_SAMP(x) Sample standard deviation Variance of x corrected for sample size
STDDEV_POP(x) Population standard deviation Variance when you have the full population
VARIANCE(x) Sample variance Square of sample standard deviation
COVAR_SAMP(y, x) Sample covariance How y and x vary together

The distinction between _SAMP and _POP matters when your data is a sample: STDDEV_SAMP uses n-1 in the denominator (Bessel's correction), while STDDEV_POP uses n. For most analytics work with subsets of data, use _SAMP.


Example 1: Price-to-volume correlation in stock data

Does higher trading volume correlate with larger price moves? This query computes the Pearson correlation between daily volume and the absolute percentage price change across all stocks:

SELECT
    c.ticker,
    c.sector,
    ROUND(CORR(sp.volume,
               ABS(100.0 * (sp.close_price - sp.open_price) / NULLIF(sp.open_price, 0))
          )::NUMERIC, 3) AS volume_volatility_corr,
    ROUND(REGR_SLOPE(
               ABS(100.0 * (sp.close_price - sp.open_price) / NULLIF(sp.open_price, 0)),
               sp.volume
          )::NUMERIC, 8) AS slope,
    ROUND(REGR_R2(
               ABS(100.0 * (sp.close_price - sp.open_price) / NULLIF(sp.open_price, 0)),
               sp.volume
          )::NUMERIC, 3) AS r_squared,
    COUNT(*) AS trading_days
FROM stock_prices sp
JOIN companies c ON c.company_id = sp.company_id
GROUP BY c.ticker, c.sector
ORDER BY volume_volatility_corr DESC;

What This Returns

ticker sector volume_volatility_corr slope r_squared trading_days
TSLA Automotive 0.634 0.0000003 0.402 522
NVDA Technology 0.521 0.0000002 0.272 522

Interpretation:

  • corr = 0.634 means moderate positive correlation — high volume days tend to have larger price swings for TSLA
  • r_squared = 0.402 means volume explains 40% of the variance in daily price movement
  • slope = 0.0000003 means each additional 1,000,000 shares of volume adds ~0.3% to the expected daily price move

Example 2: Revenue regression model — predicting order value from basket size

Does adding more items to an order reliably increase the total amount? This regression tells you the "marginal revenue per additional item":

SELECT
    p.category,
    COUNT(DISTINCT o.order_id) AS orders,
    ROUND(REGR_SLOPE(o.total_amount, item_count)::NUMERIC, 2) AS revenue_per_extra_item,
    ROUND(REGR_INTERCEPT(o.total_amount, item_count)::NUMERIC, 2) AS base_order_value,
    ROUND(REGR_R2(o.total_amount, item_count)::NUMERIC, 3) AS r_squared,
    ROUND(CORR(o.total_amount, item_count)::NUMERIC, 3) AS correlation,
    -- Predict order value for 3-item basket
    ROUND((REGR_SLOPE(o.total_amount, item_count)
           * 3 + REGR_INTERCEPT(o.total_amount, item_count))::NUMERIC, 2) AS predicted_3_items
FROM ec_orders o
JOIN (
    SELECT order_id, COUNT(*) AS item_count
    FROM order_items GROUP BY order_id
) item_counts ON item_counts.order_id = o.order_id
JOIN ec_order_items oi ON oi.order_id = o.order_id
JOIN ec_products p ON p.product_id = oi.product_id
WHERE o.status = 'completed'
GROUP BY p.category
ORDER BY r_squared DESC;

The predicted_3_items column applies the regression formula y = slope * 3 + intercept directly in SQL — no need to export to Python to make predictions from a linear model.


Example 3: Standard deviation for anomaly detection

The Z-score anomaly detection pattern uses standard deviation to find statistical outliers. This query identifies invoices that are more than 2 standard deviations above the mean for their industry:

WITH industry_stats AS (
    SELECT
        c.industry,
        ROUND((AVG(inv_total.total))::NUMERIC, 2)      AS mean_invoice,
        ROUND(STDDEV_SAMP(inv_total.total), 2) AS stddev_invoice,
        COUNT(*) AS invoice_count
    FROM clients c
    JOIN invoices i ON i.client_id = c.client_id
    JOIN (SELECT invoice_id, SUM(quantity * unit_price) AS total
          FROM invoice_items GROUP BY invoice_id) inv_total
        ON inv_total.invoice_id = i.invoice_id
    GROUP BY c.industry
    HAVING COUNT(*) >= 5  -- enough data for reliable stats
)
SELECT
    c.industry,
    c.company_name,
    i.invoice_no,
    ROUND(inv_total.total, 2) AS invoice_total,
    ROUND(s.mean_invoice, 2) AS industry_mean,
    ROUND(s.stddev_invoice, 2) AS industry_stddev,
    ROUND((inv_total.total - s.mean_invoice) / NULLIF(s.stddev_invoice, 0), 2) AS z_score
FROM clients c
JOIN invoices i ON i.client_id = c.client_id
JOIN (SELECT invoice_id, SUM(quantity * unit_price) AS total
      FROM invoice_items GROUP BY invoice_id) inv_total
    ON inv_total.invoice_id = i.invoice_id
JOIN industry_stats s ON s.industry = c.industry
WHERE (inv_total.total - s.mean_invoice) / NULLIF(s.stddev_invoice, 0) > 2.0
ORDER BY z_score DESC;

What This Returns

Any invoice with z_score > 2.0 is a statistical outlier within its industry — worth reviewing as a potential data entry error, an unusual deal, or a new large client relationship.


Example 4: Covariance matrix for portfolio risk analysis

In financial analysis, the covariance of returns between stocks determines portfolio risk. This query computes pairwise daily return correlations for all stock pairs:

WITH daily_returns AS (
    SELECT
        company_id,
        price_date,
        100.0 * (close_price - LAG(close_price) OVER (PARTITION BY company_id ORDER BY price_date))
            / NULLIF(LAG(close_price) OVER (PARTITION BY company_id ORDER BY price_date), 0)
            AS daily_return_pct
    FROM stock_prices
),
aligned AS (
    SELECT a.price_date, a.company_id AS id_a, b.company_id AS id_b,
        a.daily_return_pct AS ret_a, b.daily_return_pct AS ret_b
    FROM daily_returns a
    JOIN daily_returns b ON b.price_date = a.price_date AND b.company_id > a.company_id
    WHERE a.daily_return_pct IS NOT NULL AND b.daily_return_pct IS NOT NULL
)
SELECT
    ca.ticker AS stock_a,
    cb.ticker AS stock_b,
    ca.sector AS sector_a,
    cb.sector AS sector_b,
    ROUND(CORR(ret_a, ret_b)::NUMERIC, 3) AS correlation,
    ROUND(COVAR_SAMP(ret_a, ret_b)::NUMERIC, 4) AS covariance,
    COUNT(*) AS trading_days
FROM aligned
JOIN companies ca ON ca.company_id = aligned.id_a
JOIN companies cb ON cb.company_id = aligned.id_b
GROUP BY ca.ticker, cb.ticker, ca.sector, cb.sector
ORDER BY ABS(CORR(ret_a, ret_b)::NUMERIC) DESC;

Key Takeaway

PostgreSQL's statistical functions bring the most common operations from statistics packages directly into SQL. CORR and REGR_* functions are particularly valuable for analytics: they let you fit linear models and measure variable relationships without leaving the database. The Z-score pattern ((value - mean) / stddev) is the universal anomaly detection formula and works the same across every domain.