Query Rewriting — CTEs, Subqueries, and Lateral Joins Compared

Query Rewriting — CTEs, Subqueries, and Lateral Joins Compared

The same logical query can often be expressed in multiple syntactically different ways: correlated subquery, uncorrelated subquery, CTE, lateral join, or window function. Each has different performance characteristics and different readability trade-offs. Knowing which form the planner handles best — and when to override the planner's choice — is an advanced optimisation skill.


Example 1: Correlated vs uncorrelated subqueries

A correlated subquery re-executes for each outer row. An uncorrelated subquery executes once. The difference in performance can be enormous:

-- CORRELATED: executes once per client row (N executions total)
-- This is O(N×M) — extremely slow for large tables
SELECT c.company_name AS name,
    (SELECT COUNT(*) FROM invoices WHERE client_id = c.client_id) AS invoice_count
FROM clients c;

-- UNCORRELATED via JOIN: executes once (O(N+M))
SELECT c.company_name AS name, COALESCE(cnt.invoice_count, 0) AS invoice_count
FROM clients c
LEFT JOIN (
    SELECT client_id, COUNT(*) AS invoice_count
    FROM invoices GROUP BY client_id
) cnt ON cnt.client_id = c.client_id;

-- WINDOW FUNCTION: most elegant for "aggregate + keep detail"
SELECT DISTINCT c.company_name AS name,
    COUNT(i.invoice_id) OVER (PARTITION BY c.client_id) AS invoice_count
FROM clients c
LEFT JOIN invoices i ON i.client_id = c.client_id;

When correlated subqueries are acceptable:

  • EXISTS/NOT EXISTS checks (often optimised to a semi-join by the planner)
  • LATERAL subqueries that reference outer columns (legitimate correlated pattern)
  • Subqueries in SELECT returning a single scalar for a small outer result set

Example 2: CTE optimisation fence — when CTEs hurt performance

In PostgreSQL 12+, non-recursive CTEs are "inlined" by default — the planner can push predicates through them, making them equivalent to subqueries. In PostgreSQL 11 and earlier, every CTE was an "optimisation fence" that prevented predicate pushdown.

-- In PostgreSQL 12+: these are functionally identical in performance
WITH recent_orders AS (
    SELECT * FROM orders WHERE created_at >= '2025-01-01'
)
SELECT * FROM recent_orders WHERE customer_id = 5;

-- Equivalent plan to:
SELECT * FROM orders WHERE created_at >= '2025-01-01' AND customer_id = 5;

-- But MATERIALIZED forces the old fence behaviour (useful sometimes):
WITH expensive_computation AS MATERIALIZED (
    SELECT customer_id,
           CORR(EXTRACT(EPOCH FROM created_at)::float, customer_id::float) AS corr
    FROM orders
    GROUP BY customer_id
)
SELECT c.name AS customer_name, ec.corr
FROM customers c JOIN expensive_computation ec ON ec.customer_id = c.customer_id
WHERE ec.corr > 0.5;
-- Without MATERIALIZED, PostgreSQL 12+ might re-compute ec for each outer row join

-- When MATERIALIZED is good: when the CTE is referenced multiple times in the query
-- Without it, it might be re-evaluated each time it's referenced

Example 3: LATERAL joins — correlated subqueries done right

LATERAL allows a subquery in the FROM clause to reference columns from earlier in the FROM clause. Unlike a correlated subquery in SELECT, a LATERAL join can return multiple rows per outer row and the planner can optimise it more effectively:

-- Top 3 most recent orders per customer (classic top-N per group problem)

-- Old way: correlated subquery (O(N) subquery executions)
SELECT c.customer_id, c.email,
    (SELECT order_id FROM orders o WHERE o.customer_id = c.customer_id
     ORDER BY created_at DESC LIMIT 1) AS latest_order_id
FROM customers c;

-- LATERAL way: same semantics, one reference per outer row, multi-row capable
SELECT c.customer_id, c.email, recent.*
FROM customers c
CROSS JOIN LATERAL (
    SELECT order_id, created_at, status
    FROM orders o
    WHERE o.customer_id = c.customer_id
    ORDER BY created_at DESC
    LIMIT 3
) recent;
-- Returns up to 3 rows per customer

-- LEFT JOIN LATERAL (include customers with no orders as NULL row)
SELECT c.customer_id, c.email, recent.order_id, recent.status
FROM customers c
LEFT JOIN LATERAL (
    SELECT order_id, created_at, status
    FROM orders o
    WHERE o.customer_id = c.customer_id
    ORDER BY created_at DESC
    LIMIT 1
) recent ON TRUE;

The ON TRUE clause is required for LEFT JOIN LATERAL — it means "always join regardless of condition" (the WHERE inside the LATERAL already handles filtering).


Example 4: EXISTS vs IN vs JOIN — choosing the right set membership test

-- Finding clients that have at least one overdue invoice

-- IN: simple but can be slow if subquery returns many rows (duplicates issue)
SELECT company_name FROM clients
WHERE client_id IN (SELECT client_id FROM invoices WHERE status = 'overdue');

-- EXISTS: terminates as soon as the first match is found — fastest for existence tests
SELECT company_name FROM clients c
WHERE EXISTS (
    SELECT 1 FROM invoices i WHERE i.client_id = c.client_id AND i.status = 'overdue'
);

-- SEMI-JOIN via JOIN + DISTINCT: explicit but verbose
SELECT DISTINCT c.company_name AS name FROM clients c
JOIN invoices i ON i.client_id = c.client_id AND i.status = 'overdue';

-- NOT EXISTS vs LEFT JOIN IS NULL (finding clients with NO overdue invoices)
-- NOT EXISTS: clearest intent
SELECT company_name FROM clients c
WHERE NOT EXISTS (
    SELECT 1 FROM invoices i WHERE i.client_id = c.client_id AND i.status = 'overdue'
);

-- LEFT JOIN IS NULL: equivalent, sometimes chosen by planner anyway
SELECT DISTINCT c.company_name AS name FROM clients c
LEFT JOIN invoices i ON i.client_id = c.client_id AND i.status = 'overdue'
WHERE i.invoice_id IS NULL;

Performance rule of thumb:

  • The PostgreSQL planner converts IN (subquery) and EXISTS to semi-joins internally — they're usually identical in performance
  • NOT IN with NULLs is dangerous: NOT IN (1, 2, NULL) returns no rows because x NOT IN (NULL) is unknown
  • NOT EXISTS is always safe with NULLs and is the preferred form

Key Takeaway

Avoid correlated subqueries in SELECT for large outer sets — use LEFT JOIN aggregation instead. In PostgreSQL 12+, CTEs are inlined by default; use MATERIALIZED when you need caching. LATERAL joins are the correct tool for top-N-per-group and correlated-but-set-returning patterns. EXISTS/NOT EXISTS are the safest set membership tests and typically the fastest due to early termination.