JSONB Performance and Production Patterns

JSONB Performance and Production Patterns

JSONB is powerful, but it has performance characteristics that differ significantly from regular columns. Understanding these characteristics — and the indexing strategies to address them — is the difference between JSONB queries that take milliseconds and those that take minutes on large tables.

The fundamental performance trade-off

JSONB is flexible at the cost of type safety and query optimisation. The PostgreSQL query planner cannot use statistics about JSONB values the same way it can for typed columns — it cannot know that attributes ->> 'brand' = 'Apple' matches 20% of rows unless you provide a functional index. This means:

  1. Always index JSONB columns that appear in WHERE clauses
  2. Choose the right index type (GIN for containment/existence, BTREE for equality/range on specific keys)
  3. Use generated columns for frequently extracted JSONB fields

Example 1: Index strategy decision tree

-- SCENARIO 1: Checking if a key exists or if a value is contained
-- → Use GIN index on the whole column
CREATE INDEX idx_gin_attributes ON product_catalog USING GIN (attributes);
-- Queries that use this index:
-- WHERE attributes ? 'discount_pct'
-- WHERE attributes @> '{"brand":"Apple"}'
-- WHERE attributes @> '{"sizes":["L"]}'

-- SCENARIO 2: Equality or range on a specific extracted key (frequently queried)
-- → Use BTREE functional index on the extracted expression
CREATE INDEX idx_brand ON product_catalog USING BTREE ((attributes ->> 'brand'));
CREATE INDEX idx_ram   ON product_catalog USING BTREE (((attributes ->> 'ram_gb')::INT));
-- Queries that use these indexes:
-- WHERE attributes ->> 'brand' = 'Apple'
-- WHERE (attributes ->> 'ram_gb')::INT > 16

-- SCENARIO 3: Frequently extracted field used in many queries
-- → Use a GENERATED ALWAYS AS STORED column + BTREE index
ALTER TABLE product_catalog
    ADD COLUMN brand_extracted TEXT
    GENERATED ALWAYS AS (attributes ->> 'brand') STORED;
CREATE INDEX idx_brand_gen ON product_catalog (brand_extracted);
-- Now use brand_extracted directly — no extraction overhead:
-- WHERE brand_extracted = 'Apple'

-- SCENARIO 4: Partial index — only for a subset of rows
CREATE INDEX idx_electronics_ram ON product_catalog
    USING BTREE (((attributes ->> 'ram_gb')::INT))
    WHERE category = 'Electronics';
-- Smaller index, faster for category-constrained queries

Example 2: Benchmarking JSONB query patterns

-- Set up a large test table to see real performance differences
CREATE TABLE perf_test AS
SELECT
    n AS id,
    jsonb_build_object(
        'user_id', (random() * 10000)::INT,
        'event',   (ARRAY['click','view','purchase','signup'])[1+(random()*3)::INT],
        'value',   ROUND((random() * 1000)::NUMERIC, 2),
        'channel', (ARRAY['email','organic','paid','social'])[1+(random()*3)::INT]
    ) AS data
FROM generate_series(1, 1000000) n;

-- Without any index: sequential scan (slow)
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM perf_test WHERE data @> '{"event":"purchase"}';
-- Expected: Seq Scan, ~500ms on 1M rows

-- Add GIN index
CREATE INDEX idx_perf_gin ON perf_test USING GIN (data);

-- With GIN index: bitmap index scan (fast)
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM perf_test WHERE data @> '{"event":"purchase"}';
-- Expected: Bitmap Index Scan, ~5ms

-- Add functional BTREE for equality on extracted key
CREATE INDEX idx_perf_event ON perf_test USING BTREE ((data ->> 'event'));

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM perf_test WHERE data ->> 'event' = 'purchase';
-- Expected: Index Scan, ~3ms (faster than GIN for high-selectivity equality)

Example 3: JSONB update performance patterns

UPDATE on JSONB columns rewrites the entire document — even when changing a single field. For large JSONB documents, this can be expensive:

-- Anti-pattern: updating a large JSONB column when only one field changes
-- This rewrites the entire document for each row:
UPDATE events SET payload = jsonb_set(payload, '{processed}', 'true')
WHERE event_type = 'purchase'  -- ❌ illustrative; assumes a `payload` JSONB column;  -- could be millions of rows

-- Better pattern 1: Use a separate column for frequently-updated fields
ALTER TABLE events ADD COLUMN processed BOOLEAN DEFAULT FALSE;
UPDATE events SET processed = TRUE WHERE event_type = 'purchase'  -- ❌ illustrative; assumes a `payload` JSONB column;
-- Now updates don't touch the JSONB at all

-- Better pattern 2: Batch JSONB updates to avoid HOT update penalty
-- Process in chunks to limit write amplification
DO $$
DECLARE batch_size INT := 10000;
DECLARE updated_count INT;
BEGIN
    LOOP
        WITH batch AS (
            SELECT event_id FROM events
            WHERE event_type = 'purchase'  -- ❌ illustrative; assumes a `payload` JSONB column
              AND NOT (payload ? 'processed')
            LIMIT batch_size
            FOR UPDATE SKIP LOCKED
        )
        UPDATE events SET payload = payload || '{"processed":true}'
        FROM batch WHERE events.event_id = batch.event_id;

        GET DIAGNOSTICS updated_count = ROW_COUNT;
        EXIT WHEN updated_count = 0;
        PERFORM pg_sleep(0.01);  -- brief pause to avoid lock contention
    END LOOP;
END;
$$;

Example 4: JSONB in partitioned tables

For very large JSONB event tables, partitioning by time combined with JSONB indexing gives the best of both worlds — partition pruning for time-range queries and GIN indexes per partition for content searches:

-- Partitioned events table
CREATE TABLE events_partitioned (
    event_id    BIGSERIAL,
    customer_id INT,
    event_type  TEXT NOT NULL,
    occurred_at TIMESTAMPTZ NOT NULL,
    payload     JSONB
) PARTITION BY RANGE (occurred_at);

-- Create monthly partitions
CREATE TABLE events_2025_01 PARTITION OF events_partitioned
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE events_2025_02 PARTITION OF events_partitioned
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');
CREATE TABLE events_2025_03 PARTITION OF events_partitioned
    FOR VALUES FROM ('2025-03-01') TO ('2025-04-01');

-- Index each partition individually
CREATE INDEX idx_events_2025_01_gin ON events_2025_01 USING GIN (payload);
CREATE INDEX idx_events_2025_02_gin ON events_2025_02 USING GIN (payload);
CREATE INDEX idx_events_2025_03_gin ON events_2025_03 USING GIN (payload);

-- Query benefits from both partition pruning AND GIN index
EXPLAIN
SELECT * FROM events_partitioned
WHERE occurred_at >= '2025-03-01' AND occurred_at < '2025-04-01'
  AND payload @> '{"event":"purchase"}';
-- Planner will access only the 2025_03 partition and use its GIN index

Production checklist for JSONB columns

-- 1. Audit your JSONB key distribution regularly
SELECT
    key,
    COUNT(*) AS occurrences,
    ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM product_catalog), 1) AS coverage_pct,
    (SELECT jsonb_typeof(attributes -> key) FROM product_catalog WHERE attributes ? key LIMIT 1) AS sample_type
FROM product_catalog, jsonb_object_keys(attributes) AS key
GROUP BY key ORDER BY occurrences DESC;

-- 2. Check JSONB column sizes (bloated documents slow down everything)
SELECT
    AVG(pg_column_size(attributes)) AS avg_bytes,
    MAX(pg_column_size(attributes)) AS max_bytes,
    PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY pg_column_size(attributes)) AS p99_bytes
FROM product_catalog;

-- 3. Identify unused keys (candidates for removal)
SELECT key, COUNT(*) AS rows_with_key
FROM product_catalog, jsonb_object_keys(attributes) AS key
GROUP BY key
HAVING COUNT(*) < 10  -- appears in fewer than 10 rows — may be cruft
ORDER BY rows_with_key;

Key Takeaway

JSONB performance is determined almost entirely by index strategy: GIN for containment/existence, BTREE functional indexes for specific key equality/range, generated columns for hot extraction paths. Always measure with EXPLAIN (ANALYZE, BUFFERS) before and after adding indexes. For high-volume JSONB update workloads, prefer adding a separate typed column over updating JSONB in place. Partitioning by time + per-partition GIN indexes is the production pattern for billion-row event tables.