Fuzzy Matching — pg_trgm, levenshtein & Similarity Search

Fuzzy Matching — pg_trgm, levenshtein & Similarity Search

Exact string matching fails on typos, abbreviations, and inconsistent data entry. Fuzzy matching finds strings that are close enough — essential for search, deduplication, and record linkage.

-- Enable extensions
CREATE EXTENSION IF NOT EXISTS pg_trgm;     -- trigram similarity
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch; -- levenshtein, soundex, metaphone

-- GIN index on tw_users.username for fast trigram search
CREATE INDEX IF NOT EXISTS idx_tw_users_username_trgm
    ON tw_users USING gin (username gin_trgm_ops);

CREATE INDEX IF NOT EXISTS idx_tw_users_display_trgm
    ON tw_users USING gin (display_name gin_trgm_ops);

pg_trgm — Trigram Similarity

A trigram is every sequence of 3 consecutive characters. Similarity is the fraction of trigrams two strings share. similarity('cat','cats') = 0.5

-- Find users with similar usernames (deduplication)
SELECT
    a.username AS user_a,
    b.username AS user_b,
    ROUND(similarity(a.username, b.username)::numeric, 3) AS sim_score
FROM tw_users a
JOIN tw_users b
    ON a.id < b.id
   AND similarity(a.username, b.username) > 0.7
ORDER BY sim_score DESC
LIMIT 20;

-- Fuzzy search: find users whose display_name is close to a typo
SELECT username, display_name,
       ROUND(similarity(display_name, 'Morgn Smith')::numeric, 3) AS score
FROM tw_users
WHERE display_name % 'Morgn Smith'   -- % operator = similarity > pg_trgm.similarity_threshold
ORDER BY score DESC
LIMIT 10;

-- ILIKE with trigram index — fast pattern search
SELECT username, display_name
FROM tw_users
WHERE display_name ILIKE '%morgan%'  -- uses GIN index when pg_trgm is installed
LIMIT 10;

Purchase this course to unlock the full lesson.

Sign up