Deduplication

dedupe(source) finds duplicates within one dataset. It links the dataset against itself and suppresses self-pairs internally — there is no suppress_self_pairs knob. The same DenseLinker config works for both link and dedupe; only the method name changes.

src = Source(df, id_column="id", serializer=TemplateSerializer("{name} — {city}"))
result = linker.dedupe(src)         # -> LinkageResult; self-pairs suppressed

From pairs to entities

Pairwise matches are transitive evidence, not final groups. Collapse them into entity clusters with connected_components():

from denselinkage import connected_components

clusters = connected_components(result)   # -> ClusteringResult
clusters.to_frame()
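To make the pair-to-cluster step concrete, here is a minimal union-find sketch in plain Python. It illustrates the general technique only and is not the library's implementation; `cluster_pairs` is a hypothetical helper that takes bare id tuples rather than a LinkageResult.

```python
from collections import defaultdict


def cluster_pairs(pairs):
    """Group ids into clusters by following matched pairs transitively."""
    parent = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    groups = defaultdict(set)
    for x in list(parent):
        groups[find(x)].add(x)
    # Sort members and clusters so the output is deterministic.
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


matches = [("1", "2"), ("3", "4")]
print(cluster_pairs(matches))  # [['1', '2'], ['3', '4']]
```

Only ids that appear in at least one pair end up in the output, which is the gap the tip below addresses.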

Tip

A record that produced no candidate pair is absent from result, so it is missing from the clustering — and clustering_metrics would then score only the records that did appear, silently inflating B³ recall. Pass the full id set so unmatched records count as singletons and B³ is complete:

ids = src.frame[src.id_column].astype(str)
clusters = connected_components(result, all_record_ids=ids)
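The effect on recall can be reproduced by hand. The sketch below applies the standard B³ definition (per-record precision and recall, averaged over the scored records). It is an illustration under that definition, not the code behind clustering_metrics, and `b3` is a hypothetical helper.

```python
def b3(pred_clusters, gold_clusters, record_ids):
    """B-cubed precision and recall, averaged over record_ids."""
    pred_of = {r: frozenset(c) for c in pred_clusters for r in c}
    gold_of = {r: frozenset(c) for c in gold_clusters for r in c}
    precision = recall = 0.0
    for r in record_ids:
        # A record absent from a clustering is treated as a singleton.
        p = pred_of.get(r, frozenset([r]))
        g = gold_of.get(r, frozenset([r]))
        overlap = len(p & g)
        precision += overlap / len(p)
        recall += overlap / len(g)
    n = len(record_ids)
    return precision / n, recall / n


gold = [{"1", "2"}, {"3", "4"}, {"5"}]
pred = [{"1", "2"}]  # the 3-4 match was missed, so 3, 4 and 5 appear in no pair

seen_only = b3(pred, gold, ["1", "2"])            # scores matched records only
full = b3(pred, gold, ["1", "2", "3", "4", "5"])  # scores every record

print(seen_only)  # (1.0, 1.0)
print(full)       # (1.0, 0.8)
```

Scoring only the records that appeared hides the missed 3-4 match entirely; scoring the full id set brings recall down to 0.8.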

Warning

connected_components is transitive: if A matches B and B matches C, all three land in one cluster even if A and C were never matched directly. With a noisy matcher this can snowball into oversized clusters. Use clustering_metrics to watch for B³ recall far above B³ precision, together with unusually large clusters.
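A small stand-alone illustration of the snowball effect: one false-positive pair is enough to fuse two otherwise separate entities. The `components` function below is a hypothetical breadth-first sketch over bare id tuples, not the library's connected_components.

```python
from collections import defaultdict, deque


def components(pairs):
    """Return connected components of the undirected graph defined by pairs."""
    adj = defaultdict(set)
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)

    seen, out = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        queue, comp = deque([start]), []
        seen.add(start)
        while queue:
            node = queue.popleft()
            comp.append(node)
            for nxt in sorted(adj[node] - seen):
                seen.add(nxt)
                queue.append(nxt)
        out.append(sorted(comp))
    return out


clean = [("a1", "a2"), ("a2", "a3"), ("b1", "b2")]
noisy = clean + [("a3", "b1")]  # one false positive bridges two entities

print(components(clean))  # [['a1', 'a2', 'a3'], ['b1', 'b2']]
print(components(noisy))  # [['a1', 'a2', 'a3', 'b1', 'b2']]
```

The merged cluster of five records implies 10 pairwise links, of which only 4 are true in this toy data, which is why a single bad match tends to show up as a sharp drop in precision while recall stays high.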

Evaluate against order-insensitive gold

For dedup, a pair is unordered — ("1","2") and ("2","1") are the same. The metrics canonicalize each pair, so pass your gold once and let the evaluator handle direction (see the directed flag in Evaluation).
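As a rough picture of what canonicalizing a pair means, the sketch below orders each pair's ids before comparing. How the library does this internally is not stated here; sorting by string order and the `canonical` helper are assumptions made for illustration.

```python
def canonical(pair):
    """Return the pair with its two ids in a fixed (sorted) order."""
    a, b = pair
    return (a, b) if a <= b else (b, a)


gold = {canonical(p) for p in [("1", "2"), ("3", "4")]}

# Predictions may arrive in either direction, or in both.
predicted = [("2", "1"), ("3", "4"), ("4", "3")]
pred = {canonical(p) for p in predicted}

true_positives = len(pred & gold)
print(sorted(pred))     # [('1', '2'), ('3', '4')]
print(true_positives)   # 2
```

Because both sides are reduced to the same canonical form, listing each gold pair once is enough, and a reversed or duplicated prediction is counted once.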

Full example

Runnable with no extras, on numpy + pandas only (dedupe → connected_components → B³):

examples/04_dedupe.py

"""Example 04 — Deduplication on the dependency-free stack (runnable).

The same dedup workflow as ``02``, but wired on the lexical reference stack
(``HashedNGramEmbedder`` + ``NumpyFlatIndex`` + ``ThresholdMatcher`` via
``DenseLinker.with_defaults``), so it runs with no extras: dedupe -> cluster ->
B3 quality, all on numpy + pandas. ``02`` shows the production semantic + LLM
shape (needs the ``[all]`` extras + a live LLM); this is the runnable counterpart.
"""

import pandas as pd

from denselinkage import (
    DenseLinker,
    LabeledPairs,
    Source,
    TemplateSerializer,
    connected_components,
)
from denselinkage.metrics import clustering_metrics


def main() -> None:
    df = pd.DataFrame(
        {
            "id": ["1", "2", "3", "4", "5"],
            "name": [
                "Apple Inc",
                "Apple Incorporated",
                "Microsoft Corp",
                "Microsoft Corporation",
                "Google LLC",
            ],
            "city": ["Cupertino", "Cupertino", "Redmond", "Redmond", "Mountain View"],
        }
    )
    # Gold dedup pairs are order-insensitive (see 02). Google (5) is a singleton.
    gold = LabeledPairs.from_pairs([("1", "2"), ("3", "4")])

    linker = DenseLinker.with_defaults()  # lexical reference stack
    src = Source(df, id_column="id", serializer=TemplateSerializer("{name} — {city}"))

    result = linker.dedupe(src)  # -> LinkageResult; self-pairs suppressed internally
    print("--- Pairwise matches ---")
    print(result.to_frame())
    # result.errors is empty here: the threshold matcher decides every scored
    # pair; an LLM matcher (see 02) surfaces undecided pairs as MatchErrors.

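    # connected_components is TRANSITIVE (A~B, B~C -> one cluster); with a noisy
    # matcher this can over-merge. Watch for B3 recall >> precision + big clusters.
    # Pass every id so unmatched records (Google, 5) count as singleton clusters.
    ids = src.frame[src.id_column].astype(str)
    clusters = connected_components(result, all_record_ids=ids)  # -> ClusteringResult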
    print(f"\n--- {clusters.n_clusters} clusters ---")
    print(clusters.to_frame())

    cm = clustering_metrics(clusters, gold=gold)  # B3 (Bagga-Baldwin) quality
    print(f"\nB3 P/R/F1: {cm.b3_precision:.3f} {cm.b3_recall:.3f} {cm.b3_f1:.3f}")


if __name__ == "__main__":
    main()

Note

For the production semantic + LLM shape, see examples/02_deduplication.py — it uses the FAISS / sentence-transformers / LangChain adapters, so it needs those extras (pip install "denselinkage[all]") and an OPENAI_API_KEY to run.