Tag Page

data cleaning

Two sources in the archive use this tag. They are grouped below by source, and each entry keeps its original topic label.

Fuzzy String Matching

Splink: Choosing String Comparators

This page is the practical companion to the comparator overview because it shifts the question from definition to selection. Instead of only explaining what the algorithms are, it discusses how to choose among them for names, typos, aliases, and thresholded matching decisions.

Its main strength is its attention to failure modes and tradeoffs. It explicitly notes where simple metrics break down, especially around nicknames and aliases, which helps prevent the common mistake of treating a distance score as a measure of whether two strings mean the same thing. Although the framing is record-linkage oriented, the decision logic transfers well to search and data-normalization tasks.
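The nickname failure mode is easy to reproduce. The following is a minimal sketch in plain Python, not Splink's API: the `levenshtein`, `similarity`, and `names_match` helpers, the `NICKNAMES` table, and the 0.75 threshold are all assumptions made for illustration. It shows an edit-distance score handling a typo well, scoring a nickname pair poorly, and an alias lookup applied before scoring to close that gap.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions, substitutions cost 1."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


# Hypothetical alias table; a real one would be curated for the dataset.
NICKNAMES = {"bill": "william", "peggy": "margaret"}


def names_match(a: str, b: str, threshold: float = 0.75) -> bool:
    """Resolve known aliases first, then apply a thresholded score."""
    a, b = a.strip().lower(), b.strip().lower()
    a, b = NICKNAMES.get(a, a), NICKNAMES.get(b, b)
    return similarity(a, b) >= threshold


print(levenshtein("jon", "john"))      # typo-like pair: one edit apart
print(similarity("bill", "william"))   # nickname pair: low score
print(names_match("Bill", "William"))  # alias lookup rescues the match
```

The design point is that the alias table and the distance metric answer different questions: the metric measures spelling closeness, while the table encodes knowledge that no spelling metric can infer.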

Fuzzy String Matching

Microsoft Learn: Fuzzy Merge in Power Query

Also listed on Fuzzy String Matching.

This is a good reference for token-based fuzzy matching in a real workflow rather than in the abstract. It explains Jaccard similarity, thresholds, and preprocessing controls in the context of fuzzy merge operations, which makes the strengths of set-based matching concrete and easy to reason about.

Its scope is narrower than the Splink material because it is tied to Power Query, but that narrowness is also the benefit. The examples make it clear when token overlap works well, how normalization choices affect outcomes, and why preprocessing can matter as much as the similarity function itself in messy data-cleaning pipelines.
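To make the preprocessing point concrete, here is a small sketch of token-based Jaccard similarity in plain Python. It is not Power Query's implementation: the `tokens` and `jaccard` helpers and the specific normalization (lowercasing and replacing punctuation with spaces) are assumptions chosen for illustration.

```python
import re


def tokens(text: str, normalize: bool = True) -> set[str]:
    """Split text into a set of whitespace-separated tokens."""
    if normalize:
        # Assumed normalization: lowercase, punctuation replaced by spaces.
        text = re.sub(r"[^\w\s]", " ", text.lower())
    return set(text.split())


def jaccard(a: str, b: str, normalize: bool = True) -> float:
    """Jaccard similarity: |intersection| / |union| of the token sets."""
    ta, tb = tokens(a, normalize), tokens(b, normalize)
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


# Same company, different casing and punctuation.
print(jaccard("Acme Corp.", "ACME corp", normalize=False))  # no shared raw tokens
print(jaccard("Acme Corp.", "ACME corp", normalize=True))   # identical after cleanup

# Token overlap alone does not know that "corp" abbreviates "corporation".
print(jaccard("Acme Corp", "Acme Corporation Ltd"))
```

With raw tokens the first pair shares nothing, while after normalization the two token sets are identical, so the preprocessing step changes the outcome more than any threshold choice could. The last pair shows the limit of set overlap: abbreviations need a mapping step of their own before comparison.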