Fuzzy String Matching

Annotated references on edit-distance, token-based, trigram, and empirical approaches to approximate string comparison.

Fuzzy string matching sits at the intersection of data cleaning, search, and record linkage. Different comparators reward different kinds of similarity: some focus on character edits, some on token overlap, and some on shared substrings. Each family therefore encodes a different assumption about how strings get corrupted, so no metric is "best" independent of a concrete error model.

This topic page collects sources that explain the main families of comparators and how to choose among them in practice. Together they cover definitions, tradeoffs, implementation-oriented guidance, and one empirical study that shows how threshold choice and application domain shape real-world performance.

Sources

Splink: String Comparators

This is one of the clearest compact overviews of common string comparators in practical use. It covers Levenshtein, Damerau-Levenshtein, Jaro, Jaro-Winkler, and Jaccard in one place, which makes it especially useful when you need to compare what each metric is actually sensitive to rather than reading isolated algorithm descriptions.

Its main strength is readability. Because the page is part of toolkit documentation, it is implementation-oriented and pragmatic: it helps you connect each abstract metric to the kinds of matching problems it handles well, such as transpositions, typos, or token overlap. It is a strong first reference when you want a technical overview before deciding which comparator deserves deeper study.
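
To make the difference in sensitivity concrete, here is a minimal Python sketch of two of the edit-distance comparators named above. It is not taken from Splink; the function names are illustrative, and the second function implements optimal string alignment, a restricted variant of Damerau-Levenshtein that counts a swap of two adjacent characters as one edit.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment: Levenshtein plus adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            # An adjacent swap such as "th" -> "ht" counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


# A transposition costs two edits under Levenshtein but one under OSA.
print(levenshtein("martha", "marhta"), osa_distance("martha", "marhta"))
```

If transposed keystrokes are a common error in your data, the second comparator penalizes them less, which is the kind of distinction the overview page is useful for.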

Splink: Choosing String Comparators

This page is the practical companion to the comparator overview because it shifts the question from definition to selection. Instead of only explaining what the algorithms are, it discusses how to choose among them for names, typos, aliases, and thresholded matching decisions.

The useful part is its attention to failure modes and tradeoffs. It explicitly notes where simple metrics break down, especially around nicknames and aliases, so it helps prevent the common mistake of treating a distance score as a universal notion of semantic sameness. Although the framing is still record-linkage oriented, the decision logic transfers well to search and data-normalization tasks.
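
The nickname failure mode can be illustrated with a short Python sketch. Everything here is an assumption for illustration: the `ALIASES` table is hypothetical, the 0.8 threshold is arbitrary, and the score comes from the standard library's `difflib.SequenceMatcher` (a Ratcliff/Obershelp-style ratio), not from any comparator Splink documents.

```python
from difflib import SequenceMatcher

# Hypothetical alias table; a real one would come from a curated nickname list.
ALIASES = {"bill": "william", "bob": "robert", "liz": "elizabeth"}


def canonical(name: str) -> str:
    """Map a known nickname to its formal form; otherwise return the cleaned name."""
    key = name.strip().lower()
    return ALIASES.get(key, key)


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from difflib (stand-in scorer)."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def names_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Check the alias table first, then fall back to a thresholded score."""
    if canonical(a) == canonical(b):
        return True
    return similarity(a, b) >= threshold


# "Bill" and "William" share few characters, so the score alone rejects them,
# while the alias lookup accepts them.
print(similarity("Bill", "William") >= 0.8, names_match("Bill", "William"))
```

The design point is that the alias lookup carries knowledge a character-level score cannot recover, so it runs before the fuzzy comparison rather than replacing it.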

Microsoft Learn: Fuzzy Merge in Power Query

This is a good reference for token-based fuzzy matching in a real workflow rather than in the abstract. It explains Jaccard similarity, thresholds, and preprocessing controls in the context of fuzzy merge operations, which makes the strengths of set-based matching concrete and easy to reason about.

Its scope is narrower than the Splink material because it is tied to Power Query, but that narrowness is also the benefit. The examples make it clear when token overlap works well, how normalization choices affect outcomes, and why preprocessing can matter as much as the similarity function itself in messy data-cleaning pipelines.
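
A small Python sketch shows why normalization can matter as much as the similarity function. This is not Power Query's implementation (Power Query is not Python, and its exact tokenization is not reproduced here); the `tokens` and `jaccard` helpers and the normalization rule (lowercase, strip punctuation) are assumptions chosen for illustration.

```python
import re


def tokens(text: str, normalize: bool = True) -> set:
    """Split text into a set of whitespace-delimited tokens."""
    if normalize:
        # Lowercase and replace punctuation with spaces before splitting.
        text = re.sub(r"[^\w\s]", " ", text.lower())
    return set(text.split())


def jaccard(a: str, b: str, normalize: bool = True) -> float:
    """Size of the token intersection divided by the size of the token union."""
    ta, tb = tokens(a, normalize), tokens(b, normalize)
    if not ta and not tb:
        return 1.0  # convention chosen here: two empty strings are identical
    return len(ta & tb) / len(ta | tb)


# Same pair of strings, scored with and without normalization.
print(jaccard("Acme Corp.", "corp acme"))
print(jaccard("Acme Corp.", "corp acme", normalize=False))
```

With normalization the two strings share every token regardless of order; without it, case and trailing punctuation leave no tokens in common.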

PostgreSQL pg_trgm Documentation

This documentation is the strongest reference in the set for substring-based matching at scale. It explains trigram similarity as a way to compare strings through their overlapping three-character sequences, which is valuable when exact tokenization is unreliable or when you want robust partial-match behavior for misspellings and fragments.

It is not a tutorial, and it assumes some database familiarity, but it is still a very useful conceptual reference because it ties the comparator directly to indexed similarity search. That makes it a strong citation when the question is not only how to score approximate matches, but how to do so efficiently over large text collections.
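
The idea behind trigram similarity can be sketched in a few lines of Python. This is a simplified approximation written for illustration, not pg_trgm's code: it assumes lowercasing, splitting on non-alphanumeric characters, padding each word with two leading spaces and one trailing space, and scoring by shared trigrams over the union. Exact scores from PostgreSQL may differ, and the sketch says nothing about indexing.

```python
import re


def trigrams(text: str) -> set:
    """Set of three-character sequences from each padded, lowercased word."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # assumed padding: two spaces before, one after
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by the size of the union of both trigram sets."""
    ga, gb = trigrams(a), trigrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


# A one-letter misspelling keeps most trigrams; an unrelated word keeps none.
print(trigram_similarity("postgres", "postgras"))
print(trigram_similarity("postgres", "mysql"))
```

Because the score depends only on set membership of short fragments, it tolerates local misspellings without needing the strings to align end to end.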

Real World Performance of Approximate String Comparators for Use in Patient Matching

This paper is the evidence-oriented source in the group. Rather than stopping at definitions, it compares approximate string comparators in a real patient-matching setting and reports practical behavior under thresholded linkage decisions.

It is especially useful because it anchors metric choice to observed outcomes. In that domain and at a threshold of 0.8, the authors report the highest linkage sensitivity for Jaro-Winkler, which makes the paper a helpful reminder that comparator performance is application-dependent and should be validated empirically. The tradeoff is scope: the study is older and specific to one domain, so it does not substitute for a general survey of fuzzy matching methods.
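
How a threshold turns scores into a sensitivity figure can be sketched as follows. This is an illustration only: the Jaro-Winkler code is a standard textbook formulation (prefix weight 0.1, prefix capped at four characters), not the paper's implementation, and the four name pairs are made-up toy examples, not data from the study.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matches within a sliding window, penalized for transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i in range(len1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not m2[j] and s1[i] == s2[j]:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro score boosted for a shared prefix of up to four characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * p * (1 - score)


def sensitivity(true_pairs, scorer, threshold: float) -> float:
    """Fraction of known true matches whose score reaches the threshold."""
    hits = sum(1 for a, b in true_pairs if scorer(a, b) >= threshold)
    return hits / len(true_pairs)


# Toy pairs assumed to be true matches (illustrative, not from the paper).
TRUE_PAIRS = [("martha", "marhta"), ("dwayne", "duane"),
              ("jon", "john"), ("bill", "william")]

for threshold in (0.8, 0.95):
    print(threshold, sensitivity(TRUE_PAIRS, jaro_winkler, threshold))
```

Raising the threshold lowers sensitivity on the same pairs, which is why a reported result at one threshold should be re-validated on your own data before you adopt it.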