Statistical Aggregates — Correlation, Regression, and Variance
Statistical Aggregates — Correlation, Regression, and Variance
PostgreSQL includes a full suite of statistical aggregate functions that go far beyond AVG and STDDEV. These functions enable in-database statistical analysis — computing correlation between variables, fitting linear regression lines, and testing for statistical significance — without exporting to Python or R.
The statistical aggregate functions
| Function | Returns | Use case |
|---|---|---|
CORR(y, x) |
Pearson correlation coefficient (-1 to 1) | How strongly do two variables move together? |
REGR_SLOPE(y, x) |
Slope of the least-squares regression line | For each unit increase in x, how much does y change? |
REGR_INTERCEPT(y, x) |
Y-intercept of the regression line | What is the predicted y when x = 0? |
REGR_R2(y, x) |
R² coefficient of determination | How much of y's variance is explained by x? |
REGR_COUNT(y, x) |
Count of non-null (y, x) pairs | Sample size for the regression |
STDDEV_SAMP(x) |
Sample standard deviation | Variance of x corrected for sample size |
STDDEV_POP(x) |
Population standard deviation | Variance when you have the full population |
VARIANCE(x) |
Sample variance | Square of sample standard deviation |
COVAR_SAMP(y, x) |
Sample covariance | How y and x vary together |
The distinction between _SAMP and _POP matters when your data is a sample: STDDEV_SAMP uses n-1 in the denominator (Bessel's correction), while STDDEV_POP uses n. For most analytics work with subsets of data, use _SAMP.
Example 1: Price-to-volume correlation in stock data
Does higher trading volume correlate with larger price moves? This query computes the Pearson correlation between daily volume and the absolute percentage price change across all stocks:
SELECT
c.ticker,
c.sector,
ROUND(CORR(sp.volume,
ABS(100.0 * (sp.close_price - sp.open_price) / NULLIF(sp.open_price, 0))
)::NUMERIC, 3) AS volume_volatility_corr,
ROUND(REGR_SLOPE(
ABS(100.0 * (sp.close_price - sp.open_price) / NULLIF(sp.open_price, 0)),
sp.volume
)::NUMERIC, 8) AS slope,
ROUND(REGR_R2(
ABS(100.0 * (sp.close_price - sp.open_price) / NULLIF(sp.open_price, 0)),
sp.volume
)::NUMERIC, 3) AS r_squared,
COUNT(*) AS trading_days
FROM stock_prices sp
JOIN companies c ON c.company_id = sp.company_id
GROUP BY c.ticker, c.sector
ORDER BY volume_volatility_corr DESC;
What This Returns
| ticker | sector | volume_volatility_corr | slope | r_squared | trading_days |
|---|---|---|---|---|---|
| TSLA | Automotive | 0.634 | 0.0000003 | 0.402 | 522 |
| NVDA | Technology | 0.521 | 0.0000002 | 0.272 | 522 |
Interpretation:
corr = 0.634means moderate positive correlation — high volume days tend to have larger price swings for TSLAr_squared = 0.402means volume explains 40% of the variance in daily price movementslope = 0.0000003means each additional 1,000,000 shares of volume adds ~0.3% to the expected daily price move
Example 2: Revenue regression model — predicting order value from basket size
Does adding more items to an order reliably increase the total amount? This regression tells you the "marginal revenue per additional item":
SELECT
p.category,
COUNT(DISTINCT o.order_id) AS orders,
ROUND(REGR_SLOPE(o.total_amount, item_count)::NUMERIC, 2) AS revenue_per_extra_item,
ROUND(REGR_INTERCEPT(o.total_amount, item_count)::NUMERIC, 2) AS base_order_value,
ROUND(REGR_R2(o.total_amount, item_count)::NUMERIC, 3) AS r_squared,
ROUND(CORR(o.total_amount, item_count)::NUMERIC, 3) AS correlation,
-- Predict order value for 3-item basket
ROUND((REGR_SLOPE(o.total_amount, item_count)
* 3 + REGR_INTERCEPT(o.total_amount, item_count))::NUMERIC, 2) AS predicted_3_items
FROM ec_orders o
JOIN (
SELECT order_id, COUNT(*) AS item_count
FROM order_items GROUP BY order_id
) item_counts ON item_counts.order_id = o.order_id
JOIN ec_order_items oi ON oi.order_id = o.order_id
JOIN ec_products p ON p.product_id = oi.product_id
WHERE o.status = 'completed'
GROUP BY p.category
ORDER BY r_squared DESC;
The predicted_3_items column applies the regression formula y = slope * 3 + intercept directly in SQL — no need to export to Python to make predictions from a linear model.
Example 3: Standard deviation for anomaly detection
The Z-score anomaly detection pattern uses standard deviation to find statistical outliers. This query identifies invoices that are more than 2 standard deviations above the mean for their industry:
WITH industry_stats AS (
SELECT
c.industry,
ROUND((AVG(inv_total.total))::NUMERIC, 2) AS mean_invoice,
ROUND(STDDEV_SAMP(inv_total.total), 2) AS stddev_invoice,
COUNT(*) AS invoice_count
FROM clients c
JOIN invoices i ON i.client_id = c.client_id
JOIN (SELECT invoice_id, SUM(quantity * unit_price) AS total
FROM invoice_items GROUP BY invoice_id) inv_total
ON inv_total.invoice_id = i.invoice_id
GROUP BY c.industry
HAVING COUNT(*) >= 5 -- enough data for reliable stats
)
SELECT
c.industry,
c.company_name,
i.invoice_no,
ROUND(inv_total.total, 2) AS invoice_total,
ROUND(s.mean_invoice, 2) AS industry_mean,
ROUND(s.stddev_invoice, 2) AS industry_stddev,
ROUND((inv_total.total - s.mean_invoice) / NULLIF(s.stddev_invoice, 0), 2) AS z_score
FROM clients c
JOIN invoices i ON i.client_id = c.client_id
JOIN (SELECT invoice_id, SUM(quantity * unit_price) AS total
FROM invoice_items GROUP BY invoice_id) inv_total
ON inv_total.invoice_id = i.invoice_id
JOIN industry_stats s ON s.industry = c.industry
WHERE (inv_total.total - s.mean_invoice) / NULLIF(s.stddev_invoice, 0) > 2.0
ORDER BY z_score DESC;
What This Returns
Any invoice with z_score > 2.0 is a statistical outlier within its industry — worth reviewing as a potential data entry error, an unusual deal, or a new large client relationship.
Example 4: Covariance matrix for portfolio risk analysis
In financial analysis, the covariance of returns between stocks determines portfolio risk. This query computes pairwise daily return correlations for all stock pairs:
WITH daily_returns AS (
SELECT
company_id,
price_date,
100.0 * (close_price - LAG(close_price) OVER (PARTITION BY company_id ORDER BY price_date))
/ NULLIF(LAG(close_price) OVER (PARTITION BY company_id ORDER BY price_date), 0)
AS daily_return_pct
FROM stock_prices
),
aligned AS (
SELECT a.price_date, a.company_id AS id_a, b.company_id AS id_b,
a.daily_return_pct AS ret_a, b.daily_return_pct AS ret_b
FROM daily_returns a
JOIN daily_returns b ON b.price_date = a.price_date AND b.company_id > a.company_id
WHERE a.daily_return_pct IS NOT NULL AND b.daily_return_pct IS NOT NULL
)
SELECT
ca.ticker AS stock_a,
cb.ticker AS stock_b,
ca.sector AS sector_a,
cb.sector AS sector_b,
ROUND(CORR(ret_a, ret_b)::NUMERIC, 3) AS correlation,
ROUND(COVAR_SAMP(ret_a, ret_b)::NUMERIC, 4) AS covariance,
COUNT(*) AS trading_days
FROM aligned
JOIN companies ca ON ca.company_id = aligned.id_a
JOIN companies cb ON cb.company_id = aligned.id_b
GROUP BY ca.ticker, cb.ticker, ca.sector, cb.sector
ORDER BY ABS(CORR(ret_a, ret_b)::NUMERIC) DESC;
Key Takeaway
PostgreSQL's statistical functions bring the most common operations from statistics packages directly into SQL. CORR and REGR_* functions are particularly valuable for analytics: they let you fit linear models and measure variable relationships without leaving the database. The Z-score pattern ((value - mean) / stddev) is the universal anomaly detection formula and works the same across every domain.