Tag Page

information retrieval

Three sources across the archive use this tag. The list below groups them by source while keeping the original topic context visible.

Fuzzy String Matching

Splink: String Comparators

Also listed on Fuzzy String Matching.

This is one of the clearest compact overviews of common string comparators in practical use. It covers Levenshtein, Damerau-Levenshtein, Jaro, Jaro-Winkler, and Jaccard in one place, which makes it especially useful when you need to compare what each metric is actually sensitive to rather than reading isolated algorithm descriptions.

Its main strength is readability. Because the page is part of toolkit documentation, it is implementation-oriented, and that makes it pragmatic: it connects each abstract metric to the kinds of matching problems it handles well, such as transpositions (Damerau-Levenshtein), single-character typos (Levenshtein), or token overlap (Jaccard). It is a strong first reference when you want a technical overview before deciding which comparator deserves deeper study.
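
To make the transposition point concrete, here is a minimal Python sketch comparing plain Levenshtein distance with the restricted (optimal string alignment) variant of Damerau-Levenshtein. It is not taken from Splink; the function names are illustrative, and Splink's own comparators may differ in detail.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein: also counts adjacent swaps as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            # Adjacent transposition, e.g. "th" <-> "ht"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(levenshtein("martha", "marhta"))   # 2
print(osa_distance("martha", "marhta"))  # 1
```

On the pair "martha" / "marhta", plain Levenshtein counts two substitutions, while the transposition-aware variant counts a single swap. That difference is the kind of sensitivity the page helps you compare across metrics.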

Fuzzy String Matching

Microsoft Learn: Fuzzy Merge in Power Query

Also listed on Fuzzy String Matching.

This is a good reference for token-based fuzzy matching in a real workflow rather than in the abstract. It explains Jaccard similarity, thresholds, and preprocessing controls in the context of fuzzy merge operations, which makes the strengths of set-based matching concrete and easy to reason about.

Its scope is narrower than the Splink material because it is tied to Power Query, but that narrowness is also the benefit. The examples make it clear when token overlap works well, how normalization choices affect outcomes, and why preprocessing can matter as much as the similarity function itself in messy data-cleaning pipelines.
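
The interaction between normalization, token overlap, and a threshold can be sketched in a few lines of Python. This is an illustration of the general technique only, not Power Query's implementation; the helper names, the word-based tokenizer, and the 0.8 threshold are assumptions chosen for the example.

```python
import re


def tokenize(text: str, ignore_case: bool = True) -> set:
    """Split text into a set of word tokens, optionally lowercasing first."""
    if ignore_case:
        text = text.lower()
    return set(re.findall(r"\w+", text))


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def is_fuzzy_match(left: str, right: str,
                   threshold: float = 0.8,
                   ignore_case: bool = True) -> bool:
    """Hypothetical match rule: token Jaccard score at or above the threshold."""
    score = jaccard(tokenize(left, ignore_case), tokenize(right, ignore_case))
    return score >= threshold


print(is_fuzzy_match("Apple Inc.", "apple inc"))                     # True
print(is_fuzzy_match("Apple Inc.", "apple inc", ignore_case=False))  # False
```

The same pair of strings moves from a perfect score to zero overlap when case normalization is switched off, which shows how preprocessing can matter as much as the similarity function. An abbreviation such as "Microsoft Corp" against "Microsoft Corporation" scores only 1/3 here, because whole-token overlap does not recognize partial words.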

Fuzzy String Matching

PostgreSQL pg_trgm Documentation

Also listed on Fuzzy String Matching.

This documentation is the strongest reference in the set for substring-based matching at scale. It explains trigram similarity as a way to compare strings through overlapping character n-grams, which is valuable when exact tokenization is unreliable or when you want robust partial-match behavior for misspellings and fragments.

It is not a tutorial, and it assumes some database familiarity, but it is still a very useful conceptual reference because it ties the comparator directly to indexed similarity search through GiST and GIN index support. That makes it a strong citation when the question is not only how to score approximate matches, but how to do so efficiently over large text collections.
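
For intuition about what trigram similarity measures, the following Python sketch approximates the scoring idea outside the database. It assumes lowercasing, alphanumeric-only words, padding of each word with two leading spaces and one trailing space, and a score of shared trigrams over all distinct trigrams. It is a teaching aid with illustrative function names, not a reimplementation of pg_trgm, and it says nothing about the indexing side.

```python
import re


def trigrams(text: str) -> set:
    """Collect padded character trigrams for each alphanumeric word."""
    grams = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "  # two leading spaces, one trailing space
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def trigram_similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams across both strings."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


print(sorted(trigrams("cat")))  # ['  c', ' ca', 'at ', 'cat']
print(round(trigram_similarity("word", "two words"), 4))  # 0.3636
```

Because the comparison works on character fragments rather than whole tokens, "word" still shares four of eleven distinct trigrams with "two words". That is the partial-match behavior that makes the approach tolerant of misspellings and fragments.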