End-to-end tutorial: record linkage with dense blocking

Who this is for. You already know entity resolution — the blocking → comparison → classification → clustering pipeline, pair completeness vs. reduction ratio, Fellegi–Sunter, B³. This tutorial maps dense blocking onto that mental model so you can recognise what denselinkage does and drive it in a few lines. We link two company tables end to end, then deduplicate one.

If you just want to run something first, do the 5-minute quickstart; for the API mental model, see Key concepts. This page is the bridge: the why and the how, stage by stage.

The pipeline you already know

denselinkage does not invent a new pipeline. It is the same four stages you have built before — it only changes how blocking and comparison are done.

flowchart LR
    R[Records] --> B["Blocking<br/><i>dense: embed → ANN top-k</i>"]
    B --> C["Comparison<br/><i>cosine similarity from the embeddings</i>"]
    C --> D["Classification<br/><i>ThresholdMatcher or LLM</i>"]
    D --> E["Clustering<br/><i>connected components</i>"]

Here is the direct translation from the vocabulary you know to the package:

| ER stage | Classic approach | denselinkage | API surface | Quality metric |
| --- | --- | --- | --- | --- |
| Representation | comparison attributes / blocking-key construction | serialise each row to one text string | Serializer | |
| Blocking / indexing | standard blocking, sorted neighbourhood, canopy, q-gram | embed records, retrieve top-k nearest in a vector index | Embedder + VectorIndex (via DenseBlocker) | pair completeness@k, reduction ratio |
| Comparison | per-field similarity → comparison vector | cosine similarity, carried on each pair | CandidatePair.similarity_score | |
| Classification | Fellegi–Sunter, threshold, supervised ML | similarity threshold or an LLM | Matcher (ThresholdMatcher, LangChainMatcher) | precision / recall / F1 |
| Clustering / merge | transitive closure, hierarchical | connected components (pluggable) | connected_components() / Clusterer | B³ precision / recall / F1 |

If you can read that table, you already know the library. The rest of this page is detail.

What dense blocking changes (and what it doesn’t)

Traditional blocking buckets records by a discrete key — Soundex of a surname, a q-gram, a ZIP prefix — and compares only within a bucket. It is fast and exact, but the key is hand-crafted, and a record whose key is wrong (a typo in the surname, a missing ZIP) silently never meets its true match.
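To make that failure mode concrete, here is a minimal sketch of key-based blocking. The key (first four letters of the name plus a three-digit ZIP prefix), the helper names, and the toy records are all illustrative assumptions; none of this is denselinkage code.

```python
from collections import defaultdict
from itertools import product


def blocking_key(record: dict) -> str:
    """Hypothetical hand-crafted key: first 4 letters of the name + ZIP prefix."""
    return record["name"].lower()[:4] + "|" + record["zip"][:3]


def key_blocking(left: list, right: list) -> set:
    """Return the (left_id, right_id) pairs whose blocking keys are identical."""
    buckets = defaultdict(lambda: ([], []))
    for rec in left:
        buckets[blocking_key(rec)][0].append(rec["id"])
    for rec in right:
        buckets[blocking_key(rec)][1].append(rec["id"])
    return {pair for l_ids, r_ids in buckets.values() for pair in product(l_ids, r_ids)}


left = [
    {"id": "A1", "name": "Microsoft Corp", "zip": "98052"},
    {"id": "A2", "name": "Apple Inc", "zip": "95014"},
]
right = [
    {"id": "B1", "name": "Microsoft", "zip": "98052"},
    {"id": "B2", "name": "Aplpe Incorporated", "zip": "95014"},  # typo in the name
]

candidates = key_blocking(left, right)
print(sorted(candidates))  # (A2, B2) is absent: the typo changed the key
```

The typo moves B2 into a bucket of its own, so the true pair (A2, B2) is never compared and nothing downstream can recover it.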

Dense blocking replaces the key with a vector: serialise the record to text, embed it, and retrieve the top-k nearest neighbours from an approximate nearest-neighbour (ANN) index. “Same bucket” becomes “close in vector space.”
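The three steps in that paragraph can be sketched with numpy alone. This is an illustration under stated assumptions, not the library's implementation: `embed` is a hypothetical hashed character-trigram encoder, `top_k_candidates` is a hypothetical helper, and the search is brute-force exact rather than a real ANN index.

```python
import zlib

import numpy as np


def embed(texts, dim=1024, n=3):
    """Hash character n-grams into a fixed-width vector, then L2-normalise each row."""
    out = np.zeros((len(texts), dim), dtype=np.float64)
    for row, text in enumerate(texts):
        padded = f" {text.lower()} "
        for i in range(len(padded) - n + 1):
            bucket = zlib.crc32(padded[i : i + n].encode("utf-8")) % dim
            out[row, bucket] += 1.0
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.maximum(norms, 1e-12)


def top_k_candidates(left, right, k=1):
    """Return (left_id, right_id, cosine) for each right record's k nearest left records."""
    left_ids, left_texts = zip(*left)
    right_ids, right_texts = zip(*right)
    sims = embed(right_texts) @ embed(left_texts).T  # unit rows, so dot product = cosine
    nearest = np.argsort(-sims, axis=1)[:, :k]
    return [
        (left_ids[j], right_ids[i], float(sims[i, j]))
        for i in range(len(right_ids))
        for j in nearest[i]
    ]


# Each record is already serialised to one text string.
left = [
    ("A1", "Apple Inc, Cupertino"),
    ("A2", "Microsoft Corp, Redmond"),
    ("A3", "Google LLC, Mountain View"),
]
right = [
    ("B1", "Aplpe Incorporated, Cupertino"),  # typo that would break an exact key
    ("B2", "Microsoft, Redmond"),
    ("B3", "Google, Mountain View"),
]

pairs = top_k_candidates(left, right, k=1)
for left_id, right_id, score in pairs:
    print(left_id, right_id, round(score, 3))
```

Because most trigrams survive the typo, the misspelled record still lands nearest to its true match; no key had to anticipate that particular error.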

flowchart TB
    subgraph trad["Traditional blocking"]
        direction LR
        t1[Records] --> t2["Blocking key<br/>(Soundex, q-gram, ZIP)"]
        t2 --> t3["Exact-match buckets"]
        t3 --> t4["Pairs within a bucket"]
    end
    subgraph dense["Dense blocking"]
        direction LR
        d1[Records] --> d2["Embedding<br/>(lexical or semantic)"]
        d2 --> d3["Vector index (ANN)"]
        d3 --> d4["Top-k nearest as pairs"]
    end

What carries over: the recall/cost trade-off is the same one you tune today. top_k and similarity_threshold play the role your blocking key’s selectivity played, but they pull in opposite directions. Raise top_k or lower similarity_threshold for higher pair completeness; lower top_k or raise similarity_threshold for a tighter reduction ratio. You still measure the blocker with pair completeness.
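The effect of the two knobs can be checked on a toy similarity table. The sketch below uses stand-alone definitions of pair completeness and reduction ratio, not the library's own functions; `block`, the similarity values, and the gold pairs are made up for illustration. The order of operations (take the top_k, then apply the threshold as a floor) follows the comment in the end-to-end example.

```python
def block(sims, top_k, threshold):
    """Keep each right record's top_k left neighbours, then drop those below threshold."""
    pairs = set()
    for right_id, row in sims.items():
        ranked = sorted(row.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
        pairs |= {(left_id, right_id) for left_id, s in ranked if s >= threshold}
    return pairs


def pair_completeness(candidates, gold):
    """Share of gold pairs that survive blocking."""
    return len(candidates & gold) / len(gold)


def reduction_ratio(candidates, n_left, n_right):
    """Share of the full cross product that blocking avoided."""
    return 1.0 - len(candidates) / (n_left * n_right)


# Hypothetical cosine similarities: right record -> {left record: score}.
sims = {
    "B1": {"A1": 0.91, "A2": 0.40, "A3": 0.35},
    "B2": {"A1": 0.42, "A2": 0.78, "A3": 0.30},
    "B3": {"A1": 0.55, "A2": 0.33, "A3": 0.52},  # true match A3 ranks second
}
gold = {("A1", "B1"), ("A2", "B2"), ("A3", "B3")}

results = {}
for top_k, threshold in [(1, 0.0), (2, 0.0), (2, 0.5), (2, 0.8)]:
    cands = block(sims, top_k, threshold)
    results[(top_k, threshold)] = (
        pair_completeness(cands, gold),
        reduction_ratio(cands, n_left=3, n_right=3),
    )
    pc, rr = results[(top_k, threshold)]
    print(f"top_k={top_k} threshold={threshold:.1f}  PC={pc:.3f}  RR={rr:.3f}")
```

Going from top_k=1 to top_k=2 recovers the pair that ranked second, at the cost of a lower reduction ratio. Raising the threshold to 0.5 trims candidates without losing gold pairs, while 0.8 starts cutting true matches.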

What changes: there are no keys to design, and “closeness” is fuzzy by construction, so abbreviation and punctuation variants and (with semantic embeddings) rewordings land near each other without bespoke rules. The cost is that retrieval is approximate: a true match can fall outside the top-k, exactly as a bad key can put it in the wrong bucket. denselinkage counts that honestly: recall in linkage_metrics() charges every missed gold pair as a false negative, including those the blocker never surfaced, so your reported recall is end-to-end, not conditional on blocking.
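The difference between end-to-end recall and recall conditional on blocking is easy to see with plain sets. This is a worked illustration with made-up pairs and hypothetical helper names; it does not call linkage_metrics().

```python
def recall_end_to_end(predicted, gold):
    """Every gold pair counts, including ones the blocker never surfaced."""
    return len(predicted & gold) / len(gold)


def recall_given_blocking(predicted, gold, candidates):
    """Only gold pairs that reached the matcher count (the flattering number)."""
    reachable = gold & candidates
    return len(predicted & reachable) / len(reachable)


gold = {("A1", "B1"), ("A2", "B2"), ("A3", "B3"), ("A4", "B4")}
# The blocker surfaced three gold pairs and one non-match; it missed (A4, B4).
candidates = {("A1", "B1"), ("A2", "B2"), ("A3", "B3"), ("A1", "B2")}
# The matcher accepted two of the candidates.
predicted = {("A1", "B1"), ("A2", "B2")}

e2e = recall_end_to_end(predicted, gold)
conditional = recall_given_blocking(predicted, gold, candidates)
print(f"end-to-end recall:       {e2e:.3f}")
print(f"recall given blocking:   {conditional:.3f}")
```

The conditional figure hides the pair lost at blocking time; the end-to-end figure charges it as a false negative, which is the convention the paragraph above describes.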

One more recognition point: in dense blocking the comparison vector collapses to a single number — the embedding cosine similarity — carried on every CandidatePair. The richer per-field comparison you may be used to moves into the classifier when that classifier is an LLM reading the record text.

The data flow, end to end

flowchart TB
    subgraph build["Index the reference table — built once"]
        direction TB
        FA[DataFrame A] --> SA["Serializer<br/>row → text"]
        SA --> EA["Embedder<br/>text → vector"]
        EA --> VB["VectorIndex.build()"]
        VB --> IDX[("SearchableIndex<br/>immutable artifact")]
    end
    subgraph query["Query — per link() call"]
        direction TB
        FB[DataFrame B] --> SB["Serializer<br/>row → text"]
        SB --> EB["Embedder<br/>text → vector"]
    end
    EB -- "top-k nearest" --> IDX
    IDX --> CP["CandidatePairs<br/>(carry similarity)"]
    CP --> MM["Matcher<br/>threshold / LLM"]
    MM --> RES["LinkageResult<br/>decisions + errors"]
    RES --> CC["connected_components()"]
    CC --> CL[ClusteringResult]

The split into a build stage and a query stage is deliberate: the reference table is embedded and indexed once into an immutable artifact, which then answers many query batches. That is the design-time/runtime separation, DenseLinker (config) vs LinkageIndex (prepared state), covered in Key concepts and Reusing an index.
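The build-once, query-many shape can be sketched in a few lines of numpy. `FlatIndex`, `build`, and `unit` are hypothetical stand-ins written for this illustration; they are not the library's SearchableIndex or LinkageIndex, and the vectors are hand-picked rather than produced by an embedder.

```python
from dataclasses import dataclass

import numpy as np


def unit(rows) -> np.ndarray:
    """L2-normalise each row so that a dot product equals cosine similarity."""
    m = np.asarray(rows, dtype=np.float64)
    return m / np.linalg.norm(m, axis=1, keepdims=True)


@dataclass(frozen=True)
class FlatIndex:
    """Hypothetical immutable index artifact: ids plus read-only unit vectors."""

    ids: tuple
    vectors: np.ndarray

    def search(self, queries: np.ndarray, k: int) -> list:
        """Return, per query row, the k nearest (id, cosine) pairs."""
        sims = queries @ self.vectors.T
        top = np.argsort(-sims, axis=1)[:, :k]
        return [[(self.ids[j], float(sims[i, j])) for j in row] for i, row in enumerate(top)]


def build(ids, rows) -> FlatIndex:
    """Build once: normalise, freeze the array, and wrap it in a frozen dataclass."""
    vectors = unit(rows)
    vectors.setflags(write=False)
    return FlatIndex(ids=tuple(ids), vectors=vectors)


# Built once from the reference table ...
index = build(["A1", "A2", "A3"], [[1, 0, 0], [0, 1, 0], [0, 0, 1]])

# ... then reused for any number of query batches.
batch_1 = index.search(unit([[0.9, 0.1, 0.0]]), k=1)
batch_2 = index.search(unit([[0.0, 0.2, 0.8], [0.1, 0.9, 0.0]]), k=1)
print(batch_1)
print(batch_2)
```

Freezing both the dataclass and the underlying array means a query can never alter what later batches see, which is the property that makes the artifact safe to share.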

The runnable example and the production target

The lexical, dependency-free version you just walked through is the quickstart — run it today:

examples/00_quickstart.py — runs on numpy + pandas alone
"""Example 00 — Quickstart (the shortest real path).

Schema-aligned data + ``Source`` defaulting to the whole-row serializer means
this is genuinely minimal: no template, no column mapping, one call.

``DenseLinker.with_defaults()`` is the low-floor entry point: it wires the
dependency-free reference stack (``HashedNGramEmbedder`` + ``NumpyFlatIndex``
behind a ``DenseBlocker``, plus ``ThresholdMatcher``). Pass ``blocker=`` /
``matcher=`` to override either half (see ``01``/``03`` for full control).

The default stack is **lexical** (``HashedNGramEmbedder`` is character n-gram
feature hashing): it recovers abbreviations, punctuation and typos
(``Apple Inc`` / ``Apple Incorporated``; ``Microsoft Corp`` / ``Microsoft``;
``Google LLC`` / ``Google``) but not semantic renames such as
``Google`` -> ``Alphabet``. The gold below is lexically recoverable on purpose;
for semantic matches reach for ``SentenceTransformerEmbedder`` (see ``01``).
"""

import pandas as pd

from denselinkage import DenseLinker, LabeledPairs, Source
from denselinkage.metrics import linkage_metrics


def main() -> None:
    df_a = pd.DataFrame(
        {
            "id": ["A1", "A2", "A3"],
            "name": ["Apple Inc", "Microsoft Corp", "Google LLC"],
            "city": ["Cupertino", "Redmond", "Mountain View"],
        }
    )
    df_b = pd.DataFrame(
        {
            "id": ["B1", "B2", "B3"],
            "name": ["Apple Incorporated", "Microsoft", "Google"],
            "city": ["Cupertino", "Redmond", "Mountain View"],
        }
    )
    gold = LabeledPairs.from_pairs([("A1", "B1"), ("A2", "B2"), ("A3", "B3")])

    linker = DenseLinker.with_defaults()  # picks a sensible embedder/index/matcher
    left = Source(df_a, id_column="id")  # serializer=None -> whole-row default
    right = Source(df_b, id_column="id")

    result = linker.link(left, right)  # -> LinkageResult

    # columns: left_id, right_id, similarity, match, confidence, reason
    print(result.to_frame())
    metrics = linkage_metrics(result, gold=gold)  # -> LinkageMetrics
    print(f"P/R/F1: {metrics.precision:.3f} {metrics.recall:.3f} {metrics.f1:.3f}")


if __name__ == "__main__":
    main()

The production assembly swaps in semantic embeddings, a FAISS index, and an LLM matcher; it needs the [all] extra. Same pipeline, heavier components:

examples/01_end_to_end_linkage.py — semantic + FAISS + LLM
"""Example 01 — End-to-End Dense Linkage (full control).

Explicitly assembled components: a dense blocker (SentenceTransformer
embeddings + a FAISS index) then an LLM matcher. Vector indexes live in
``denselinkage.indexing`` (their own port, parallel to embedders).

The prompt carries ONLY the semantic question: the matcher owns output
structure and returns typed ``MatchDecision``s, so a brittle "Answer YES or
NO" instruction is neither needed nor wanted.

NOTE: Runs with the heavy extras installed (``pip install "denselinkage[all]"``)
and an ``OPENAI_API_KEY`` — it uses ``SentenceTransformerEmbedder``,
``FaissFlatIndex``, and a ``LangChainMatcher`` over a live LLM. It is type-checked
and compiled in CI but not executed there (needs the extras + a key). For a
runnable end-to-end on the dependency-free stack, see ``00`` / ``04``.
"""

import pandas as pd
from langchain_openai import ChatOpenAI

from denselinkage import DenseLinker, LabeledPairs, Source, TemplateSerializer
from denselinkage.blocking import DenseBlocker
from denselinkage.embedding import SentenceTransformerEmbedder
from denselinkage.indexing import FaissFlatIndex
from denselinkage.matching import LangChainMatcher, RetryPolicy
from denselinkage.metrics import linkage_metrics


def main() -> None:
    df_a = pd.DataFrame(
        {
            "id_a": ["A1", "A2", "A3"],
            "name": ["Apple Inc", "Microsoft Corp", "Google LLC"],
            "city": ["Cupertino", "Redmond", "Mountain View"],
        }
    )
    df_b = pd.DataFrame(
        {
            "id_b": ["B1", "B2", "B3"],
            "company_name": ["Apple Incorporated", "Microsoft", "Alphabet"],
            "headquarters": ["Cupertino, CA", "Redmond, WA", "Mountain View, CA"],
        }
    )
    gold = LabeledPairs.from_pairs([("A1", "B1"), ("A2", "B2"), ("A3", "B3")])

    # Blocker: embedder and vector index injected independently.
    blocker = DenseBlocker(
        embedder=SentenceTransformerEmbedder("all-MiniLM-L6-v2"),
        vector_index=FaissFlatIndex(),
        similarity_threshold=0.80,  # retrieve top_k, then keep >= this
        top_k=5,
    )

    # Matcher: the LLM is injected; model / operational / domain config stay
    # separate. The prompt is just the question — no format instruction.
    matcher = LangChainMatcher(
        llm=ChatOpenAI(model="gpt-4o-mini", temperature=0.0, seed=42),
        prompt=(
            "Are these two records the same real-world entity?\n"
            "Record A: {record_a}\n"
            "Record B: {record_b}"
        ),
        retry=RetryPolicy(max_retries=3),
        max_concurrency=8,
    )

    # The linker is pure config — no data, no schema, nothing fitted.
    linker = DenseLinker(blocker=blocker, matcher=matcher)

    # Schema travels with each frame. df_b maps its columns onto the shared
    # template; the linker never learns either schema.
    template = "Name: {name}, City: {city}"
    left = Source(df_a, id_column="id_a", serializer=TemplateSerializer(template))
    right = Source(
        df_b,
        id_column="id_b",
        serializer=TemplateSerializer(
            template,
            column_mapping={"company_name": "name", "headquarters": "city"},
        ),
    )

    result = linker.link(left, right)  # one call, no mutation

    print("\n--- Match Results ---")
    # Fixed schema, independent of input column names:
    # left_id, right_id, similarity (float), match (bool),
    # confidence (float|None), reason (str|None). One row per *decided*
    # pair (matches AND non-matches); pairs the matcher could not decide
    # are in result.errors, not rows here.
    print(result.to_frame())

    print("\n--- Evaluation Metrics ---")
    metrics = linkage_metrics(result, gold=gold)  # -> LinkageMetrics
    print(f"Precision: {metrics.precision:.4f}")
    print(f"Recall:    {metrics.recall:.4f}")
    print(f"F1 Score:  {metrics.f1:.4f}")
    if metrics.n_errors:
        print(f"(errored pairs excluded from P/R: {metrics.n_errors})")


if __name__ == "__main__":
    main()

Tuning and evaluation, per stage

| Stage | Knob | Measure with |
| --- | --- | --- |
| Blocking | top_k, similarity_threshold on DenseBlocker | pair_completeness_at_k() (PC@k), reduction ratio |
| Classification | ThresholdMatcher(threshold=...) or the LLM prompt | linkage_metrics() (P/R/F1) |
| Clustering | the Clusterer strategy | clustering_metrics() (B³) |

The discipline is the one you already follow: fix blocking recall first (a match the blocker drops is unrecoverable downstream), then tune the classifier for precision, then check clustering for over-merging.
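The over-merging check in the last step is worth seeing in numbers. The sketch below uses a stand-alone union-find and a textbook B³ computation, both written for this illustration under the assumption that clusters are plain sets of record ids; it does not call connected_components() or clustering_metrics().

```python
from collections import defaultdict


def union_find_components(nodes, edges):
    """Group nodes into connected components using union-find with path halving."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())


def b_cubed(predicted, truth):
    """Per-record B-cubed precision and recall, averaged over all records."""
    pred_of = {r: c for c in predicted for r in c}
    true_of = {r: c for c in truth for r in c}
    records = list(true_of)
    p = sum(len(pred_of[r] & true_of[r]) / len(pred_of[r]) for r in records) / len(records)
    rc = sum(len(pred_of[r] & true_of[r]) / len(true_of[r]) for r in records) / len(records)
    return p, rc, 2 * p * rc / (p + rc)


nodes = ["A1", "B1", "A2", "B2", "A3", "B3"]
truth = [{"A1", "B1"}, {"A2", "B2"}, {"A3", "B3"}]

good_links = [("A1", "B1"), ("A2", "B2"), ("A3", "B3")]
bad_links = good_links + [("B1", "A2")]  # one false-positive match

clean = b_cubed(union_find_components(nodes, good_links), truth)
merged = b_cubed(union_find_components(nodes, bad_links), truth)
print("clean :", [round(x, 3) for x in clean])
print("merged:", [round(x, 3) for x in merged])
```

A single false-positive link fuses two true clusters into one group of four, so B³ precision drops to about 0.667 while recall stays at 1.0. That asymmetry is why classifier precision is tuned before clustering is judged.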

Where denselinkage fits, versus tools you know

| Tool | Blocking | Classifier |
| --- | --- | --- |
| Splink | blocking rules / keys | Fellegi–Sunter (EM-trained) |
| dedupe | learned predicates | logistic regression (active learning) |
| Magellan / recordlinkage | keys + feature engineering | supervised ML on comparison vectors |
| denselinkage | dense retrieval (embeddings + ANN) | threshold or LLM, pluggable |

Bring your ER instincts: swap the hand-tuned blocking key for dense retrieval, and pick a threshold or an LLM as the classifier. denselinkage is single-node by design (batched encoding; no Spark/Flink layer) and contract-first — every stage is a typing.Protocol port you can replace, which is the subject of Custom components and Architecture.
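For readers who have not used typing.Protocol ports before, here is a generic sketch of the pattern. `TextEmbedder`, `LengthEmbedder`, and `encode_all` are hypothetical names invented for this example; the method signature is an assumption and is not denselinkage's actual Embedder contract, which Custom components documents.

```python
from typing import Protocol, Sequence, runtime_checkable

import numpy as np


@runtime_checkable
class TextEmbedder(Protocol):
    """Hypothetical port: anything with a matching embed() method satisfies it."""

    def embed(self, texts: Sequence[str]) -> np.ndarray: ...


class LengthEmbedder:
    """Toy implementation. Note that it does not inherit from TextEmbedder."""

    def embed(self, texts: Sequence[str]) -> np.ndarray:
        # Two crude features per text: character count and number of spaces.
        return np.array([[len(t), t.count(" ")] for t in texts], dtype=np.float32)


def encode_all(embedder: TextEmbedder, texts: Sequence[str]) -> np.ndarray:
    """Code written against the port works with any conforming implementation."""
    return embedder.embed(texts)


vectors = encode_all(LengthEmbedder(), ["Apple Inc", "Microsoft"])
print(isinstance(LengthEmbedder(), TextEmbedder))
print(vectors.shape)
```

Conformance is structural: a replacement component only needs the right method shape, so it can live in your own code base without importing or subclassing anything from the package.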

Next steps