Deduplication
dedupe(source) finds duplicates within a single dataset. It links the dataset
against itself and suppresses self-pairs internally, so there is no
suppress_self_pairs knob to set. The same DenseLinker config works for both
link and dedupe; only the method name changes.
src = Source(df, id_column="id", serializer=TemplateSerializer("{name} — {city}"))
result = linker.dedupe(src) # -> LinkageResult; self-pairs suppressed
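To make "links the dataset against itself and suppresses self-pairs" concrete, here is a minimal standalone sketch. It does not use denselinkage: `jaccard` and `self_link` are hypothetical helpers, and the token-overlap score and the 0.3 threshold are assumptions chosen only for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def self_link(records: dict[str, str], threshold: float) -> list[tuple[str, str]]:
    """Compare every record with every record in the same dataset,
    skipping the trivial case where a record meets itself."""
    pairs = []
    for left_id, left_text in records.items():
        for right_id, right_text in records.items():
            if left_id == right_id:
                continue  # self-pair: always a perfect match, never a duplicate
            if jaccard(left_text, right_text) >= threshold:
                pairs.append((left_id, right_id))
    return pairs


records = {"1": "Apple Inc", "2": "Apple Incorporated", "3": "Microsoft Corp"}
pairs = self_link(records, threshold=0.3)
print(pairs)  # [('1', '2'), ('2', '1')]
```

This sketch keeps both directions of each match. Whether the library's result does the same is not stated here, which is one reason the evaluation treats dedup pairs as unordered.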
From pairs to entities
Pairwise matches are transitive evidence, not final groups. Collapse them into
entity clusters with connected_components():
from denselinkage import connected_components
clusters = connected_components(result) # -> ClusteringResult
clusters.to_frame()
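Conceptually, collapsing pairs into clusters is a connected-components computation over the match graph. The sketch below shows one common way to do it with a union-find structure. `cluster_pairs` is a hypothetical helper for illustration, not the library's implementation, and it works on plain `(id, id)` tuples rather than a LinkageResult.

```python
def cluster_pairs(pairs):
    """Group record ids into connected components of the match graph."""
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root,
        # shortening the path along the way (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = {}
    for record_id in list(parent):
        groups.setdefault(find(record_id), []).append(record_id)
    return sorted(sorted(members) for members in groups.values())


# A~B and B~C put A, B and C in one cluster, even though A~C was never matched.
print(cluster_pairs([("A", "B"), ("B", "C"), ("D", "E")]))
# [['A', 'B', 'C'], ['D', 'E']]
```

Only ids that appear in at least one pair end up in the output of this sketch.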
Tip
A record that produced no candidate pair is absent from result, so it is also
missing from the clustering. clustering_metrics would then score only the
records that did appear, silently inflating B³ recall. Pass the full id set so
that unmatched records count as singletons and B³ covers every record:
ids = src.frame[src.id_column].astype(str)
clusters = connected_components(result, all_record_ids=ids)
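The effect of missing singletons can be checked by hand with the standard B³ definition: per-record precision and recall, averaged over the scored records. The `b_cubed` function below is a hypothetical standalone sketch, and the assumption that clustering_metrics averages in exactly this way is ours. The data mirrors a case where only the pair ("1", "2") was found.

```python
def b_cubed(predicted, gold):
    """B-cubed precision and recall over the records present in `predicted`.

    `predicted` and `gold` are lists of sets of record ids.
    """
    pred_of = {r: cluster for cluster in predicted for r in cluster}
    gold_of = {r: cluster for cluster in gold for r in cluster}
    precisions, recalls = [], []
    for record, pred_cluster in pred_of.items():
        gold_cluster = gold_of[record]
        overlap = len(pred_cluster & gold_cluster)
        precisions.append(overlap / len(pred_cluster))
        recalls.append(overlap / len(gold_cluster))
    n = len(pred_of)
    return sum(precisions) / n, sum(recalls) / n


gold = [{"1", "2"}, {"3", "4"}, {"5"}]

# Only matched records are clustered: records 3, 4 and 5 are never scored.
partial = [{"1", "2"}]
# Every record is present; unmatched ones are singletons.
complete = [{"1", "2"}, {"3"}, {"4"}, {"5"}]

print(b_cubed(partial, gold))   # (1.0, 1.0)
print(b_cubed(complete, gold))  # (1.0, 0.8)
```

With the full id set, the missed pair ("3", "4") pulls recall down to 0.8. Without it, the miss is invisible.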
Warning
connected_components is transitive: if A matches B and B matches C, all
three land in one cluster even if A and C were never matched directly. With a
noisy matcher this can snowball into oversized clusters. Use the clustering
metrics to watch for the telltale sign: B³ recall far above B³ precision
(recall ≫ precision) together with unusually large clusters.
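A quick size check can complement the metrics when no gold labels are available. The sketch below assumes cluster assignments are available as a plain mapping from record id to cluster label. That layout, the `oversized_clusters` helper, and the size cap are all hypothetical and should be adapted to whatever clusters.to_frame() actually returns.

```python
from collections import Counter


def oversized_clusters(cluster_of, max_size):
    """Return {cluster_label: size} for clusters larger than `max_size`."""
    sizes = Counter(cluster_of.values())
    return {label: size for label, size in sizes.items() if size > max_size}


# Hypothetical assignments: A, B and C were chained into cluster 0.
cluster_of = {"A": 0, "B": 0, "C": 0, "D": 1}
print(oversized_clusters(cluster_of, max_size=2))  # {0: 3}
```

A sensible cap depends on the data: if real entities rarely have more than a handful of duplicates, clusters well beyond that size are worth inspecting by hand.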
Evaluate against order-insensitive gold
For dedup, a pair is unordered: ("1","2") and ("2","1") are the same pair. The
metrics canonicalize each pair, so list each gold pair once, in either order,
and let the evaluator handle direction (see the directed flag in Evaluation).
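One way to picture canonicalization is to sort the two ids of each pair before comparing sets. The sketch below illustrates that idea only: `canonical` and `pairwise_pr` are hypothetical helpers, and the evaluator's actual internals are not shown in this guide.

```python
def canonical(pair):
    """Order the two ids so that (a, b) and (b, a) map to the same tuple."""
    a, b = pair
    return (a, b) if a <= b else (b, a)


def pairwise_pr(predicted, gold):
    """Pairwise precision and recall over unordered pairs."""
    pred = {canonical(p) for p in predicted}
    truth = {canonical(p) for p in gold}
    true_positives = len(pred & truth)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# The match 1~2 is reported in both directions; it still counts once.
predicted = [("2", "1"), ("1", "2"), ("3", "5")]
gold = [("1", "2"), ("3", "4")]
print(pairwise_pr(predicted, gold))  # (0.5, 0.5)
```

Without the canonical step, ("2", "1") would count as a false positive against a gold set that lists only ("1", "2").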
Full example
Runnable on the dependency-free stack (dedupe → connected_components → B³):
"""Example 04 — Deduplication on the dependency-free stack (runnable).
The same dedup workflow as ``02``, but wired on the lexical reference stack
(``HashedNGramEmbedder`` + ``NumpyFlatIndex`` + ``ThresholdMatcher`` via
``DenseLinker.with_defaults``), so it runs with no extras: dedupe -> cluster ->
B3 quality, all on numpy + pandas. ``02`` shows the production semantic + LLM
shape (needs the ``[all]`` extras + a live LLM); this is the runnable counterpart.
"""
import pandas as pd

from denselinkage import (
    DenseLinker,
    LabeledPairs,
    Source,
    TemplateSerializer,
    connected_components,
)
from denselinkage.metrics import clustering_metrics


def main() -> None:
    df = pd.DataFrame(
        {
            "id": ["1", "2", "3", "4", "5"],
            "name": [
                "Apple Inc",
                "Apple Incorporated",
                "Microsoft Corp",
                "Microsoft Corporation",
                "Google LLC",
            ],
            "city": ["Cupertino", "Cupertino", "Redmond", "Redmond", "Mountain View"],
        }
    )
    # Gold dedup pairs are order-insensitive (see 02). Google (5) is a singleton.
    gold = LabeledPairs.from_pairs([("1", "2"), ("3", "4")])

    linker = DenseLinker.with_defaults()  # lexical reference stack
    src = Source(df, id_column="id", serializer=TemplateSerializer("{name} — {city}"))
    result = linker.dedupe(src)  # -> LinkageResult; self-pairs suppressed internally

    print("--- Pairwise matches ---")
    print(result.to_frame())
    # result.errors is empty here: the threshold matcher decides every scored
    # pair; an LLM matcher (see 02) surfaces undecided pairs as MatchErrors.

    # connected_components is TRANSITIVE (A~B, B~C -> one cluster); with a noisy
    # matcher this can over-merge. Watch for B3 recall >> precision + big clusters.
    clusters = connected_components(result)  # -> ClusteringResult
    print(f"\n--- {clusters.n_clusters} clusters ---")
    print(clusters.to_frame())

    cm = clustering_metrics(clusters, gold=gold)  # B3 (Bagga-Baldwin) quality
    print(f"\nB3 P/R/F1: {cm.b3_precision:.3f} {cm.b3_recall:.3f} {cm.b3_f1:.3f}")


if __name__ == "__main__":
    main()
Note
For the production semantic + LLM shape, see examples/02_deduplication.py.
It uses the FAISS / sentence-transformers / LangChain adapters, so it needs
those extras (pip install "denselinkage[all]") and an OPENAI_API_KEY to run.