Postgres Fuzzy Matching Notes

Cached this excellent StackOverflow Answer by Erwin Brandsetter.


Pattern matching operators

All of the above can use a trigram index. See below.
For left-anchored patterns, also a B-tree index using COLLATE "C". Or with any other collation and the operator class text_pattern_ops. See below. Or even with an SP-GiST index.

Basics about pattern matching in the manual.

Related operators

Your query

… is pretty much the optimum. Syntax won’t get much shorter, query won’t get much faster:

SELECT name FROM spelers
WHERE  name LIKE 'B%' OR name LIKE 'D%'
ORDER  BY 1;

Or equivalent, using RegEx (slightly more expensive):

... WHERE name ~ '^B' OR name ~ '^D'

A bit shorter (can use the same index in modern Postgres):

... WHERE name LIKE ANY ('{B%,D%}')
... WHERE name ~ ANY ('{^B,^D}')

A regular expression with branches shortens the syntax some more:

... WHERE name ~ '^(B|D).*'

Or a character class (only for the case with a single character):

... WHERE name ~ '^[BD].*'

For bigger tables, index support improves performance by orders of magnitude.

In Postgres 11 or later the new ^@ is more convenient as we can use the unadorned prefix directly - and fast when supported with an SP-GiST index:

... WHERE name ^@ 'B' OR name ^@ 'D'

Or:

... WHERE name ^@ ANY ('{B,D}')

Since Postgres 15, the first variant can also use a B-tree index using COLLATE "C".

db<>fiddle here

Index

If concerned with performance, create an index like this for bigger tables to support left-anchored search patterns (matching from the start of the string):

CREATE INDEX spelers_name_special_idx ON spelers (name COLLATE "C");

Requires per-column collation support added with Postgres 9.1.

See:

In DBs running with the “C” locale (not typical), a plain B-tree index does the job.

In older versions (or still today), you can use the special operator class text_pattern_ops for the same purpose:

CREATE INDEX spelers_name_special_idx ON spelers (name text_pattern_ops);

SIMILAR TO or regular expressions with basic left-anchored expressions can use this index, too. Very old versions failed to use the index for branches (B|D), character classes [BD], or constructs with LIKE ANY or ~ ANY.

Or use an expression index based on a truncated substring for columns with long strings, and repeat the same expression in queries. Or use the ^@ operator with a matching index.

Trigram matching

Trigram matches or text search use special GIN or GiST indexes.

You can install the additional module pg_trgm to provide index support for any LIKE / ILIKE pattern (and simple regexp patterns with ~ / ~*) using a GIN or GiST index.

Details, example and links:

pg_trgm provides additional operators like:

Text search

Is a special type of pattern matching with separate infrastructure and index types. It uses dictionaries and stemming and is a great tool to find words in documents, especially for natural languages.

Prefix matching is also supported:

As well as phrase search since Postgres 9.6:

Consider the introduction in the manual and the overview of operators and functions.

Additional tools for fuzzy string matching

The additional module fuzzystrmatch offers some more options, but performance is generally inferior to all of the above.

In particular, various implementations of the levenshtein() function may be instrumental.

Why are regular expressions (~) always faster than SIMILAR TO?

SIMILAR TO expressions are rewritten into regular expressions internally. For every SIMILAR TO expression, there is at least one faster regular expression (saving the overhead of rewriting the expression). There is no performance gain in using SIMILAR TO ever.

Simple expressions that can make do with LIKE (~~) are faster with LIKE anyway.

SIMILAR TO is only supported in PostgreSQL because it ended up in early drafts of the SQL standard. They still haven’t gotten rid of it. But there are plans to remove it and include regexp matches instead - or so I heard.

EXPLAIN ANALYZE reveals it. Just try with any table yourself!

EXPLAIN ANALYZE SELECT * FROM spelers WHERE name SIMILAR TO 'B%';

Reveals:

...
Seq Scan on spelers  (cost= ...
  Filter: (name ~ '^(?:B.*)$'::text)

SIMILAR TO has been rewritten with a regular expression (~).