Fuzzy Matching — pg_trgm, levenshtein & Similarity Search
Fuzzy Matching — pg_trgm, levenshtein & Similarity Search
Exact string matching fails on typos, abbreviations, and inconsistent data entry. Fuzzy matching finds strings that are close enough — essential for search, deduplication, and record linkage.
-- Enable extensions
CREATE EXTENSION IF NOT EXISTS pg_trgm; -- trigram similarity
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch; -- levenshtein, soundex, metaphone
-- GIN index on tw_users.username for fast trigram search
CREATE INDEX IF NOT EXISTS idx_tw_users_username_trgm
ON tw_users USING gin (username gin_trgm_ops);
CREATE INDEX IF NOT EXISTS idx_tw_users_display_trgm
ON tw_users USING gin (display_name gin_trgm_ops);
pg_trgm — Trigram Similarity
A trigram is every sequence of 3 consecutive characters. Similarity is the
fraction of trigrams two strings share. similarity('cat','cats') = 0.5
-- Find users with similar usernames (deduplication)
SELECT
a.username AS user_a,
b.username AS user_b,
ROUND(similarity(a.username, b.username)::numeric, 3) AS sim_score
FROM tw_users a
JOIN tw_users b
ON a.id < b.id
AND similarity(a.username, b.username) > 0.7
ORDER BY sim_score DESC
LIMIT 20;
-- Fuzzy search: find users whose display_name is close to a typo
SELECT username, display_name,
ROUND(similarity(display_name, 'Morgn Smith')::numeric, 3) AS score
FROM tw_users
WHERE display_name % 'Morgn Smith' -- % operator = similarity > pg_trgm.similarity_threshold
ORDER BY score DESC
LIMIT 10;
-- ILIKE with trigram index — fast pattern search
SELECT username, display_name
FROM tw_users
WHERE display_name ILIKE '%morgan%' -- uses GIN index when pg_trgm is installed
LIMIT 10;