End-to-end tutorial: record linkage with dense blocking

Who this is for. You already know entity resolution — the blocking → comparison → classification → clustering pipeline, pair completeness vs. reduction ratio, Fellegi–Sunter, B³. This tutorial maps dense blocking onto that mental model so you can recognise what denselinkage does and drive it in a few lines. We link two company tables end to end, then deduplicate one.

If you just want to run something first, do the 5-minute quickstart; for the API mental model, see Key concepts. This page is the bridge: the why and the how, stage by stage.

The pipeline you already know

denselinkage does not invent a new pipeline. It is the same four stages you have built before — it only changes how blocking and comparison are done.

flowchart LR
    R[Records] --> B["Blocking<br/><i>dense: embed → ANN top-k</i>"]
    B --> C["Comparison<br/><i>cosine similarity from the embeddings</i>"]
    C --> D["Classification<br/><i>ThresholdMatcher or LLM</i>"]
    D --> E["Clustering<br/><i>connected components</i>"]

Here is the direct translation from the vocabulary you know to the package:

| ER stage | Classic approach | denselinkage | API surface | Quality metric |
| --- | --- | --- | --- | --- |
| Representation | comparison attributes / blocking-key construction | serialise each row to one text string | `Serializer` | n/a |
| Blocking / indexing | standard blocking, sorted neighbourhood, canopy, q-gram | embed records, retrieve top-k nearest in a vector index | `Embedder` + `VectorIndex` (via `DenseBlocker`) | pair completeness@k, reduction ratio |
| Comparison | per-field similarity → comparison vector | cosine similarity, carried on each pair | `CandidatePair.similarity_score` | n/a |
| Classification | Fellegi–Sunter, threshold, supervised ML | similarity threshold or an LLM | `Matcher` (`ThresholdMatcher`, `LangChainMatcher`) | precision / recall / F1 |
| Clustering / merge | transitive closure, hierarchical | connected components (pluggable) | `connected_components()` / `Clusterer` | B³ precision / recall / F1 |

If you can read that table, you already know the library. The rest of this page is detail.

What dense blocking changes (and what it doesn’t)

Traditional blocking buckets records by a discrete key — Soundex of a surname, a q-gram, a ZIP prefix — and compares only within a bucket. It is fast and exact, but the key is hand-crafted, and a record whose key is wrong (a typo in the surname, a missing ZIP) silently never meets its true match.

Dense blocking replaces the key with a vector: serialise the record to text, embed it, and retrieve the top-k nearest neighbours from an approximate nearest-neighbour (ANN) index. “Same bucket” becomes “close in vector space.”

flowchart TB
    subgraph trad["Traditional blocking"]
        direction LR
        t1[Records] --> t2["Blocking key<br/>(Soundex, q-gram, ZIP)"]
        t2 --> t3["Exact-match buckets"]
        t3 --> t4["Pairs within a bucket"]
    end
    subgraph dense["Dense blocking"]
        direction LR
        d1[Records] --> d2["Embedding<br/>(lexical or semantic)"]
        d2 --> d3["Vector index (ANN)"]
        d3 --> d4["Top-k nearest as pairs"]
    end
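The contrast in the diagram can be sketched in a few lines of plain Python. This is an illustration only, not how denselinkage is implemented: `prefix_key`, `trigrams`, and `top_k_neighbours` are hypothetical helpers, the "embedding" is a sparse character-trigram count vector, and retrieval is an exact brute-force scan rather than an ANN index.

```python
import math
from collections import Counter, defaultdict

def trigrams(text: str) -> Counter:
    """Character 3-gram counts of a lower-cased, space-padded string."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

reference = {"A1": "Microsoft Corp", "A2": "Apple Inc", "A3": "Google LLC"}
query = {"B1": "Mcirosoft"}  # transposition typo in the first three letters

# Traditional blocking: bucket by a discrete key, compare only inside a bucket.
def prefix_key(name: str) -> str:
    return name[:3].lower()

buckets = defaultdict(list)
for ref_id, name in reference.items():
    buckets[prefix_key(name)].append(ref_id)
key_candidates = {qid: buckets.get(prefix_key(name), []) for qid, name in query.items()}

# Dense-style blocking: rank every reference record by similarity, keep the top k.
def top_k_neighbours(name: str, k: int) -> list:
    q = trigrams(name)
    scored = sorted(
        ((cosine(q, trigrams(ref)), ref_id) for ref_id, ref in reference.items()),
        reverse=True,
    )
    return [ref_id for _, ref_id in scored[:k]]

dense_candidates = {qid: top_k_neighbours(name, k=1) for qid, name in query.items()}

print("key blocking:  ", key_candidates)
print("dense blocking:", dense_candidates)
```

The typo changes the key from `mic` to `mci`, so the key-based blocker returns no candidates for `B1`, while the similarity ranking still places `A1` first because most trigrams survive the typo.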

What carries over: the recall/cost trade-off is the same one you tune today. top_k and similarity_threshold play the role your blocking key’s selectivity played, but they pull in opposite directions: raise top_k or lower similarity_threshold for higher pair completeness; lower top_k or raise similarity_threshold for a better reduction ratio (fewer candidate pairs). You still measure the blocker with pair completeness.
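To make the two knobs concrete, here is a small sketch that computes pair completeness and reduction ratio from their standard definitions. The similarity scores in `ranked` are made up for illustration, and `candidates`, `pair_completeness`, and `reduction_ratio` are hypothetical helpers, not the package's `pair_completeness_at_k()`.

```python
def pair_completeness(candidate_pairs: set, gold_pairs: set) -> float:
    """Fraction of gold pairs that survive blocking."""
    return len(candidate_pairs & gold_pairs) / len(gold_pairs) if gold_pairs else 0.0

def reduction_ratio(n_candidates: int, n_left: int, n_right: int) -> float:
    """Fraction of the full cross product that blocking avoided comparing."""
    return 1.0 - n_candidates / (n_left * n_right)

gold = {("A1", "B1"), ("A2", "B2"), ("A3", "B3")}

# Neighbours of each query record, best first, with illustrative similarities.
ranked = {
    "B1": [("A1", 0.91), ("A3", 0.22), ("A2", 0.15)],
    "B2": [("A3", 0.41), ("A2", 0.38), ("A1", 0.10)],
    "B3": [("A3", 0.77), ("A1", 0.20), ("A2", 0.12)],
}

def candidates(ranked: dict, top_k: int, similarity_threshold: float) -> set:
    """Take the top_k neighbours per query, then keep those at or above the threshold."""
    return {
        (left_id, right_id)
        for right_id, neighbours in ranked.items()
        for left_id, score in neighbours[:top_k]
        if score >= similarity_threshold
    }

for top_k, threshold in [(1, 0.0), (2, 0.0), (2, 0.3), (2, 0.5)]:
    pairs = candidates(ranked, top_k, threshold)
    pc = pair_completeness(pairs, gold)
    rr = reduction_ratio(len(pairs), n_left=3, n_right=3)
    print(f"top_k={top_k} threshold={threshold:.1f}  PC={pc:.3f}  RR={rr:.3f}")
```

In this toy data, going from `top_k=1` to `top_k=2` recovers the pair that was ranked second, at the price of a lower reduction ratio; raising the threshold to 0.5 wins the reduction ratio back but drops that pair again.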

What changes — there are no keys to design, and “closeness” is fuzzy by construction, so abbreviations, punctuation, and (with semantic embeddings) rewordings cluster together without bespoke rules. The cost is that retrieval is approximate: a true match can fall outside the top-k, exactly as a bad key can put it in the wrong bucket. denselinkage counts that honestly — recall in linkage_metrics() charges every missed gold pair as a false negative, including those the blocker never surfaced, so your reported recall is end-to-end, not conditional on blocking.
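The difference between end-to-end recall and recall conditional on blocking is easiest to see with numbers. The sets below are invented for illustration; the arithmetic follows the accounting described above, where a gold pair the blocker never surfaced still counts as a false negative.

```python
gold = {("A1", "B1"), ("A2", "B2"), ("A3", "B3"), ("A4", "B4")}

# The blocker surfaced three of the four gold pairs plus one non-match.
candidates = {("A1", "B1"), ("A2", "B2"), ("A3", "B3"), ("A1", "B2")}

# The classifier accepted two of the candidates.
predicted = {("A1", "B1"), ("A2", "B2")}

true_positives = len(predicted & gold)

# Conditional recall ignores the gold pair the blocker dropped.
conditional_recall = true_positives / len(candidates & gold)

# End-to-end recall charges it as a false negative.
end_to_end_recall = true_positives / len(gold)

print(f"conditional on blocking: {conditional_recall:.3f}")
print(f"end-to-end:              {end_to_end_recall:.3f}")
```

The conditional figure (2 of 3) looks better than the end-to-end one (2 of 4) only because it hides the blocker's miss.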

One more recognition point: in dense blocking the comparison vector collapses to a single number — the embedding cosine similarity — carried on every CandidatePair. The richer per-field comparison you may be used to moves into the classifier when that classifier is an LLM reading the record text.

The data flow, end to end

flowchart TB
    subgraph build["Index the reference table — built once"]
        direction TB
        FA[DataFrame A] --> SA["Serializer<br/>row → text"]
        SA --> EA["Embedder<br/>text → vector"]
        EA --> VB["VectorIndex.build()"]
        VB --> IDX[("SearchableIndex<br/>immutable artifact")]
    end
    subgraph query["Query — per link() call"]
        direction TB
        FB[DataFrame B] --> SB["Serializer<br/>row → text"]
        SB --> EB["Embedder<br/>text → vector"]
    end
    EB -- "top-k nearest" --> IDX
    IDX --> CP["CandidatePairs<br/>(carry similarity)"]
    CP --> MM["Matcher<br/>threshold / LLM"]
    MM --> RES["LinkageResult<br/>decisions + errors"]
    RES --> CC["connected_components()"]
    CC --> CL[ClusteringResult]

The split into two subgraphs is deliberate: the reference table is embedded and indexed once into an immutable artifact, which then answers many query batches. That is the design-time/runtime separation, DenseLinker (config) vs LinkageIndex (prepared state), covered in Key concepts and Reusing an index.
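The build-once, query-many idea can be sketched with a frozen dataclass. This is a toy, assuming an exact flat scan over unit-normalised vectors; `FlatIndex` and `build_index` are hypothetical names and do not reflect the package's `VectorIndex` or `SearchableIndex` signatures.

```python
import math
from dataclasses import dataclass
from typing import List, Mapping, Sequence, Tuple

Vector = Tuple[float, ...]

def normalise(vector: Sequence[float]) -> Vector:
    """Scale to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return tuple(x / norm for x in vector) if norm else tuple(vector)

@dataclass(frozen=True)
class FlatIndex:
    """Immutable artifact: built once, then only searched."""
    ids: Tuple[str, ...]
    vectors: Tuple[Vector, ...]

    def search(self, query: Sequence[float], k: int) -> List[Tuple[str, float]]:
        q = normalise(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), ref_id)
            for ref_id, vec in zip(self.ids, self.vectors)
        ]
        scored.sort(reverse=True)
        return [(ref_id, score) for score, ref_id in scored[:k]]

def build_index(vectors: Mapping[str, Sequence[float]]) -> FlatIndex:
    """The expensive step, paid once for the reference table."""
    return FlatIndex(
        ids=tuple(vectors),
        vectors=tuple(normalise(v) for v in vectors.values()),
    )

index = build_index({"A1": (1.0, 0.0), "A2": (0.0, 1.0), "A3": (1.0, 1.0)})

# The same prepared index answers independent query batches.
batch_1 = index.search((0.9, 0.1), k=2)
batch_2 = index.search((0.0, 2.0), k=1)
print(batch_1)
print(batch_2)
```

Because the dataclass is frozen and holds tuples, a query cannot alter the index, which is what makes it safe to reuse across calls.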

Build it: link two tables, stage by stage

We will link two tables of companies with different schemas. The runnable path below uses the dependency-free lexical stack and is exactly what the quickstart runs; afterwards we swap in the semantic + LLM stack for production.

Stage 0 — the data

import pandas as pd

df_a = pd.DataFrame({
    "id":   ["A1", "A2", "A3"],
    "name": ["Apple Inc", "Microsoft Corp", "Google LLC"],
    "city": ["Cupertino", "Redmond", "Mountain View"],
})
df_b = pd.DataFrame({
    "id":   ["B1", "B2", "B3"],
    "name": ["Apple Incorporated", "Microsoft", "Google"],
    "city": ["Cupertino", "Redmond", "Mountain View"],
})

Stage 1 — representation: rows → text

ER analog: choosing comparison attributes / building the blocking representation. Instead of selecting fields and a key, you serialise each row to one string. A Source binds a frame to its id column and a serializer; with serializer=None it uses the whole-row default.

from denselinkage import Source

left  = Source(df_a, id_column="id")   # whole-row serializer by default
right = Source(df_b, id_column="id")

When schemas differ, give each source its own serializer and a column_mapping to a shared template — see Linking two tables. The linker never learns either schema; the schema travels with the frame.
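As a rough picture of what serialisation does, the sketch below renders rows (plain dicts here) to text in two ways. The exact string format of the package's whole-row default is not shown on this page, so `whole_row` is an assumed format, and `render_template` is a hypothetical stand-in for a template serializer with a column mapping.

```python
from typing import Mapping, Optional

def whole_row(row: Mapping[str, object], id_column: str) -> str:
    """Assumed whole-row format: every non-id column as 'name: value'."""
    return ", ".join(f"{col}: {val}" for col, val in row.items() if col != id_column)

def render_template(
    template: str,
    row: Mapping[str, object],
    column_mapping: Optional[Mapping[str, str]] = None,
) -> str:
    """Rename source columns to the template's field names, then fill the template."""
    mapping = column_mapping or {}
    renamed = {mapping.get(col, col): val for col, val in row.items()}
    return template.format(**renamed)

template = "Name: {name}, City: {city}"

row_a = {"id_a": "A1", "name": "Apple Inc", "city": "Cupertino"}
row_b = {"id_b": "B1", "company_name": "Apple Incorporated", "headquarters": "Cupertino, CA"}

text_default = whole_row(row_a, id_column="id_a")
text_a = render_template(template, row_a)
text_b = render_template(
    template,
    row_b,
    column_mapping={"company_name": "name", "headquarters": "city"},
)
print(text_default)
print(text_a)
print(text_b)
```

Both tables end up as strings with the same shape, so the embedder compares like with like even though the source column names differ.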

Stage 2 — blocking representation: text → vectors

ER analog: this is where a discrete blocking key would be computed. Here an Embedder turns text into a vector.

Lexical (HashedNGramEmbedder): character n-gram feature hashing. Recovers Apple Inc ≈ Apple Incorporated, Microsoft Corp ≈ Microsoft. Dependency-free. Misses semantic rewrites (Google → Alphabet).
Semantic (SentenceTransformerEmbedder): sentence embeddings that capture meaning, behind the [sentence-transformers] extra.
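For intuition about the lexical option, here is a minimal sketch of character n-gram feature hashing. It is an assumption about the general technique, not the internals of `HashedNGramEmbedder`; the `n_features` and `ngram` parameter names are borrowed from the constructor call shown in Stage 3, while `hashed_ngram_vector` and the CRC32 hash are illustrative choices.

```python
import math
import zlib
from typing import List

def hashed_ngram_vector(text: str, n_features: int = 1024, ngram: int = 3) -> List[float]:
    """Count character n-grams into a fixed number of hashed buckets, then L2-normalise."""
    vector = [0.0] * n_features
    padded = f" {text.lower()} "
    for i in range(len(padded) - ngram + 1):
        gram = padded[i:i + ngram]
        bucket = zlib.crc32(gram.encode("utf-8")) % n_features
        vector[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector

def cosine(u: List[float], v: List[float]) -> float:
    """Dot product of unit vectors, which equals cosine similarity."""
    return sum(a * b for a, b in zip(u, v))

apple_a = hashed_ngram_vector("Apple Inc")
apple_b = hashed_ngram_vector("Apple Incorporated")
microsoft = hashed_ngram_vector("Microsoft")

near = cosine(apple_a, apple_b)
far = cosine(apple_a, microsoft)
print(f"Apple Inc vs Apple Incorporated: {near:.3f}")
print(f"Apple Inc vs Microsoft:          {far:.3f}")
```

Strings that share many character n-grams land close together; a rename such as Google to Alphabet shares almost none, which is why the lexical stack cannot recover it.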

Stage 3 — blocking: index and retrieve top-k

ER analog: indexing and candidate generation. A VectorIndex is a spec; its build() produces an immutable SearchableIndex over the reference vectors. DenseBlocker composes the embedder and the index and exposes top_k / similarity_threshold — your recall knobs. Each surviving neighbour becomes a CandidatePair carrying its cosine similarity.

from denselinkage import DenseLinker
from denselinkage.blocking import DenseBlocker
from denselinkage.embedding import HashedNGramEmbedder
from denselinkage.indexing import NumpyFlatIndex
from denselinkage.matching import ThresholdMatcher

# This is exactly what DenseLinker.with_defaults() assembles for you:
linker = DenseLinker(
    blocker=DenseBlocker(
        embedder=HashedNGramEmbedder(n_features=1024, ngram=3),
        vector_index=NumpyFlatIndex(),
        similarity_threshold=0.0,   # keep all top_k neighbours; let the matcher gate
        top_k=5,
    ),
    matcher=ThresholdMatcher(threshold=0.5),
)

Tip

Measure the blocker, like any ER pipeline. The blocker is your recall ceiling. denselinkage exposes pair_completeness_at_k() and blocking_metrics() (PC@k) over candidate pairs for exactly this check — sweep top_k and watch pair completeness before you spend anything on the classifier.

Stage 4 — classification: decide each pair

ER analog: the classifier. Matcher takes candidate pairs and returns one outcome each:

ThresholdMatcher gates on the carried similarity — the dense-blocking analog of a single Fellegi–Sunter threshold.
LangChainMatcher (extra [langchain]) asks an LLM to read the two records and return a typed decision with a rationale — the modern replacement for a hand-built comparison-vector classifier.

A pair the matcher cannot decide becomes a MatchError, not an exception and not a false match — it is parked in a separate channel (next stage).
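The two-channel idea can be sketched without the library. The classes below (`Pair`, `Decision`, `Failure`) and the `decide` function are hypothetical stand-ins, and treating a missing or NaN similarity as undecidable is an assumption made only to give the error channel something to hold.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class Pair:
    left_id: str
    right_id: str
    similarity: Optional[float]

@dataclass(frozen=True)
class Decision:
    left_id: str
    right_id: str
    match: bool

@dataclass(frozen=True)
class Failure:
    left_id: str
    right_id: str
    reason: str

def decide(pairs: List[Pair], threshold: float) -> Tuple[List[Decision], List[Failure]]:
    """Gate each pair on its carried similarity; park undecidable pairs separately."""
    decisions: List[Decision] = []
    failures: List[Failure] = []
    for pair in pairs:
        if pair.similarity is None or math.isnan(pair.similarity):
            failures.append(Failure(pair.left_id, pair.right_id, "no usable similarity"))
        else:
            decisions.append(Decision(pair.left_id, pair.right_id, pair.similarity >= threshold))
    return decisions, failures

pairs = [
    Pair("A1", "B1", 0.82),
    Pair("A1", "B2", 0.12),
    Pair("A2", "B2", None),  # could not be scored
]
decisions, failures = decide(pairs, threshold=0.5)
print(decisions)
print(failures)
```

The undecided pair is neither raised as an exception nor silently turned into a non-match; it stays countable in its own list.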

Stage 5 — read the result

result = linker.link(left, right)   # -> LinkageResult; one call, no mutation
print(result.to_frame())            # left_id, right_id, similarity, match, confidence, reason

ER analog: the linked set with match status — but LinkageResult keeps decisions and failures in separate channels. to_frame() is decided pairs only; undecided pairs are MatchErrors in result.errors, counted but never merged in. There is no eager “golden record” merge: provenance is preserved, which is the audit-friendly posture you want in regulated linkage.

Stage 6 — evaluate

from denselinkage import LabeledPairs
from denselinkage.metrics import linkage_metrics

gold = LabeledPairs.from_pairs([("A1", "B1"), ("A2", "B2"), ("A3", "B3")])
m = linkage_metrics(result, gold=gold)        # -> LinkageMetrics
print(f"P/R/F1: {m.precision:.3f} {m.recall:.3f} {m.f1:.3f}")
print(f"undecided (errors): {m.n_errors}")

On the lexical stack above this prints P/R/F1 = 1.000 — the gold here is lexically recoverable on purpose. For dedupe, pass directed=False so ("1","2") and ("2","1") count as the same pair (see Evaluation).
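Why pair direction matters for dedupe is easy to show with a toy scorer. `precision_recall_f1` and `canonical` are hypothetical helpers; the `directed` flag only mirrors the name used above and is not the package's implementation.

```python
from typing import Iterable, Tuple

PairT = Tuple[str, str]

def canonical(pair: PairT) -> PairT:
    """Order the two ids so (a, b) and (b, a) collapse to one key."""
    a, b = pair
    return (a, b) if a <= b else (b, a)

def precision_recall_f1(
    predicted: Iterable[PairT], gold: Iterable[PairT], directed: bool = True
) -> Tuple[float, float, float]:
    pred_set = set(predicted) if directed else {canonical(p) for p in predicted}
    gold_set = set(gold) if directed else {canonical(p) for p in gold}
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

predicted = [("2", "1"), ("3", "4")]  # the first pair came out in reverse order
gold = [("1", "2"), ("3", "4")]

print(precision_recall_f1(predicted, gold, directed=True))
print(precision_recall_f1(predicted, gold, directed=False))
```

Scored as directed pairs, the reversed pair counts as both a false positive and a false negative; scored as undirected pairs, it is the correct match it actually is.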

Stage 7 — deduplicate and cluster

The within-one-table variant is the same config, a different verb. Pairwise matches are transitive evidence; collapse them into entities with connected_components(), then score with B³:

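from denselinkage import Source, connected_components
from denselinkage.metrics import clustering_metrics

src = Source(df_a, id_column="id")         # the single table to deduplicate
# NOTE: `gold` here must be the gold labelling for this one table, not the
# cross-table pairs from Stage 6; see Evaluation for how to build it.

result   = linker.dedupe(src)              # self-pairs suppressed internally
clusters = connected_components(result)    # -> ClusteringResult (transitive closure)
cm = clustering_metrics(clusters, gold=gold)
print(f"B3 P/R/F1: {cm.b3_precision:.3f} {cm.b3_recall:.3f} {cm.b3_f1:.3f}")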
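For readers who want to see the transitive closure itself, here is a minimal union-find sketch over matched pairs. `components_from_matches` is a hypothetical helper working on plain id pairs; it is not the package's `connected_components()`, which takes a result object.

```python
from typing import Dict, Iterable, List, Set, Tuple

def components_from_matches(matches: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group record ids into clusters: any chain of matches joins its members."""
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving keeps trees shallow
            node = parent[node]
        return node

    for left, right in matches:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    clusters: Dict[str, Set[str]] = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in clusters.values())

# r1~r2 and r2~r3 were matched; r1 and r3 were never compared directly.
matches = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
clusters = components_from_matches(matches)
print(clusters)
```

Records that appear in no matched pair are simply absent from the output here; a fuller version would add them as singleton clusters.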

Warning

Connected components is transitive: A~B and B~C put A, B, C in one cluster even if A and C never matched. With a noisy classifier this snowballs into mega-clusters, the classic dedup failure mode. Watch for B³ recall far above B³ precision: that gap is the signature of over-merging.
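The over-merging signature shows up clearly in a hand-computed B³ score. The sketch below uses the standard per-record B³ definitions; `b_cubed` is a hypothetical helper and the clusters are invented for illustration.

```python
from typing import Iterable, Set, Tuple

def b_cubed(predicted: Iterable[Set[str]], gold: Iterable[Set[str]]) -> Tuple[float, float]:
    """Average, over records, of the overlap between predicted and gold clusters."""
    pred_of = {record: frozenset(cluster) for cluster in predicted for record in cluster}
    gold_of = {record: frozenset(cluster) for cluster in gold for record in cluster}
    records = list(gold_of)
    precision_sum = 0.0
    recall_sum = 0.0
    for record in records:
        # A record missing from the prediction is treated as its own singleton cluster.
        pred_cluster = pred_of.get(record, frozenset({record}))
        overlap = len(pred_cluster & gold_of[record])
        precision_sum += overlap / len(pred_cluster)
        recall_sum += overlap / len(gold_of[record])
    return precision_sum / len(records), recall_sum / len(records)

gold = [{"a", "b"}, {"c", "d"}]

# One spurious match b~c is enough for transitive closure to fuse both entities.
over_merged = [{"a", "b", "c", "d"}]

precision, recall = b_cubed(over_merged, gold)
print(f"B3 precision={precision:.3f} recall={recall:.3f}")
```

Every record still sits with all of its true duplicates, so recall is perfect, but half of each record's cluster is wrong, so precision drops to one half.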

Note

The entire dependency-free evaluation path shown here (linkage_metrics, blocking_metrics / pair_completeness_at_k, and B³ clustering_metrics) runs today on numpy + pandas. The semantic / FAISS / LLM components (example 01) are implemented too, but they need their extras ([all]) and a live LLM; see Semantic + LLM matching.

The runnable example and the production target

The lexical, dependency-free version you just walked through is the quickstart — run it today:

examples/00_quickstart.py — runs on numpy + pandas alone

"""Example 00 — Quickstart (the shortest real path).

Schema-aligned data + ``Source`` defaulting to the whole-row serializer means
this is genuinely minimal: no template, no column mapping, one call.

``DenseLinker.with_defaults()`` is the low-floor entry point: it wires the
dependency-free reference stack (``HashedNGramEmbedder`` + ``NumpyFlatIndex``
behind a ``DenseBlocker``, plus ``ThresholdMatcher``). Pass ``blocker=`` /
``matcher=`` to override either half (see ``01``/``03`` for full control).

The default stack is **lexical** (``HashedNGramEmbedder`` is character n-gram
feature hashing): it recovers abbreviations, punctuation and typos
(``Apple Inc`` / ``Apple Incorporated``; ``Microsoft Corp`` / ``Microsoft``;
``Google LLC`` / ``Google``) but not semantic renames such as
``Google`` -> ``Alphabet``. The gold below is lexically recoverable on purpose;
for semantic matches reach for ``SentenceTransformerEmbedder`` (see ``01``).
"""

import pandas as pd

from denselinkage import DenseLinker, LabeledPairs, Source
from denselinkage.metrics import linkage_metrics


def main() -> None:
    df_a = pd.DataFrame(
        {
            "id": ["A1", "A2", "A3"],
            "name": ["Apple Inc", "Microsoft Corp", "Google LLC"],
            "city": ["Cupertino", "Redmond", "Mountain View"],
        }
    )
    df_b = pd.DataFrame(
        {
            "id": ["B1", "B2", "B3"],
            "name": ["Apple Incorporated", "Microsoft", "Google"],
            "city": ["Cupertino", "Redmond", "Mountain View"],
        }
    )
    gold = LabeledPairs.from_pairs([("A1", "B1"), ("A2", "B2"), ("A3", "B3")])

    linker = DenseLinker.with_defaults()  # picks a sensible embedder/index/matcher
    left = Source(df_a, id_column="id")  # serializer=None -> whole-row default
    right = Source(df_b, id_column="id")

    result = linker.link(left, right)  # -> LinkageResult

    # columns: left_id, right_id, similarity, match, confidence, reason
    print(result.to_frame())
    metrics = linkage_metrics(result, gold=gold)  # -> LinkageMetrics
    print(f"P/R/F1: {metrics.precision:.3f} {metrics.recall:.3f} {metrics.f1:.3f}")


if __name__ == "__main__":
    main()

The production assembly swaps in semantic embeddings, a FAISS index, and an LLM matcher (extra [all]). Same pipeline, heavier components:

examples/01_end_to_end_linkage.py — semantic + FAISS + LLM

"""Example 01 — End-to-End Dense Linkage (full control).

Explicitly assembled components: a dense blocker (SentenceTransformer
embeddings + a FAISS index) then an LLM matcher. Vector indexes live in
``denselinkage.indexing`` (their own port, parallel to embedders).

The prompt carries ONLY the semantic question: the matcher owns output
structure and returns typed ``MatchDecision``s, so a brittle "Answer YES or
NO" instruction is neither needed nor wanted.

NOTE: Runs with the heavy extras installed (``pip install "denselinkage[all]"``)
and an ``OPENAI_API_KEY`` — it uses ``SentenceTransformerEmbedder``,
``FaissFlatIndex``, and a ``LangChainMatcher`` over a live LLM. It is type-checked
and compiled in CI but not executed there (needs the extras + a key). For a
runnable end-to-end on the dependency-free stack, see ``00`` / ``04``.
"""

import pandas as pd
from langchain_openai import ChatOpenAI

from denselinkage import DenseLinker, LabeledPairs, Source, TemplateSerializer
from denselinkage.blocking import DenseBlocker
from denselinkage.embedding import SentenceTransformerEmbedder
from denselinkage.indexing import FaissFlatIndex
from denselinkage.matching import LangChainMatcher, RetryPolicy
from denselinkage.metrics import linkage_metrics


def main() -> None:
    df_a = pd.DataFrame(
        {
            "id_a": ["A1", "A2", "A3"],
            "name": ["Apple Inc", "Microsoft Corp", "Google LLC"],
            "city": ["Cupertino", "Redmond", "Mountain View"],
        }
    )
    df_b = pd.DataFrame(
        {
            "id_b": ["B1", "B2", "B3"],
            "company_name": ["Apple Incorporated", "Microsoft", "Alphabet"],
            "headquarters": ["Cupertino, CA", "Redmond, WA", "Mountain View, CA"],
        }
    )
    gold = LabeledPairs.from_pairs([("A1", "B1"), ("A2", "B2"), ("A3", "B3")])

    # Blocker: embedder and vector index injected independently.
    blocker = DenseBlocker(
        embedder=SentenceTransformerEmbedder("all-MiniLM-L6-v2"),
        vector_index=FaissFlatIndex(),
        similarity_threshold=0.80,  # retrieve top_k, then keep >= this
        top_k=5,
    )

    # Matcher: the LLM is injected; model / operational / domain config stay
    # separate. The prompt is just the question — no format instruction.
    matcher = LangChainMatcher(
        llm=ChatOpenAI(model="gpt-4o-mini", temperature=0.0, seed=42),
        prompt=(
            "Are these two records the same real-world entity?\n"
            "Record A: {record_a}\n"
            "Record B: {record_b}"
        ),
        retry=RetryPolicy(max_retries=3),
        max_concurrency=8,
    )

    # The linker is pure config — no data, no schema, nothing fitted.
    linker = DenseLinker(blocker=blocker, matcher=matcher)

    # Schema travels with each frame. df_b maps its columns onto the shared
    # template; the linker never learns either schema.
    template = "Name: {name}, City: {city}"
    left = Source(df_a, id_column="id_a", serializer=TemplateSerializer(template))
    right = Source(
        df_b,
        id_column="id_b",
        serializer=TemplateSerializer(
            template,
            column_mapping={"company_name": "name", "headquarters": "city"},
        ),
    )

    result = linker.link(left, right)  # one call, no mutation

    print("\n--- Match Results ---")
    # Fixed schema, independent of input column names:
    # left_id, right_id, similarity (float), match (bool),
    # confidence (float|None), reason (str|None). One row per *decided*
    # pair (matches AND non-matches); pairs the matcher could not decide
    # are in result.errors, not rows here.
    print(result.to_frame())

    print("\n--- Evaluation Metrics ---")
    metrics = linkage_metrics(result, gold=gold)  # -> LinkageMetrics
    print(f"Precision: {metrics.precision:.4f}")
    print(f"Recall:    {metrics.recall:.4f}")
    print(f"F1 Score:  {metrics.f1:.4f}")
    if metrics.n_errors:
        print(f"(errored pairs excluded from P/R: {metrics.n_errors})")


if __name__ == "__main__":
    main()

Tuning and evaluation, per stage

| Stage | Knob | Measure with |
| --- | --- | --- |
| Blocking | `top_k`, `similarity_threshold` on `DenseBlocker` | `pair_completeness_at_k()` (PC@k), reduction ratio |
| Classification | `ThresholdMatcher(threshold=...)` or the LLM prompt | `linkage_metrics()` (P/R/F1) |
| Clustering | the `Clusterer` strategy | `clustering_metrics()` (B³) |

The discipline is the one you already follow: fix blocking recall first (a match the blocker drops is unrecoverable downstream), then tune the classifier for precision, then check clustering for over-merging.

Where denselinkage fits, versus tools you know

| Tool | Blocking | Classifier |
| --- | --- | --- |
| Splink | blocking rules / keys | Fellegi–Sunter (EM-trained) |
| dedupe | learned predicates | logistic regression (active learning) |
| Magellan / recordlinkage | keys + feature engineering | supervised ML on comparison vectors |
| denselinkage | dense retrieval (embeddings + ANN) | threshold or LLM, pluggable |

Bring your ER instincts: swap the hand-tuned blocking key for dense retrieval, and pick a threshold or an LLM as the classifier. denselinkage is single-node by design (batched encoding; no Spark/Flink layer) and contract-first — every stage is a typing.Protocol port you can replace, which is the subject of Custom components and Architecture.
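To show what a `typing.Protocol` port means in practice, here is a generic sketch of structural typing. The `TextEmbedder` protocol and its `embed` method are hypothetical; the actual port names and method signatures are defined in Custom components and may differ.

```python
from typing import List, Protocol, Sequence, runtime_checkable

@runtime_checkable
class TextEmbedder(Protocol):
    """Hypothetical port: anything with a matching embed() method satisfies it."""

    def embed(self, texts: Sequence[str]) -> List[List[float]]:
        ...

class LengthEmbedder:
    """Toy implementation; note that it does not inherit from TextEmbedder."""

    def embed(self, texts: Sequence[str]) -> List[List[float]]:
        return [[float(len(text)), float(text.count(" "))] for text in texts]

def embed_all(embedder: TextEmbedder, texts: Sequence[str]) -> List[List[float]]:
    """Code written against the port accepts any conforming object."""
    return embedder.embed(texts)

custom = LengthEmbedder()
print(isinstance(custom, TextEmbedder))
print(embed_all(custom, ["Apple Inc"]))
```

No base class or registration is involved: conformance is decided by the shape of the object, which is what lets you swap in your own component at any stage.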

Next steps

Linking two tables — full component control and schema mapping.
Deduplication — the within-one-table variant in depth.
Evaluation — pair completeness, P/R/F1, and B³.
Custom components — implement a port of your own.