Quickstart

The shortest real path: two schema-aligned tables, the default stack, one call. This example runs today — we’ll build it up a few lines at a time, then show the complete script.

Step 1: Imports

import pandas as pd

from denselinkage import DenseLinker, LabeledPairs, Source
from denselinkage.metrics import linkage_metrics

Everything you need for a basic run comes from the package root (the prelude) plus the metrics module. You only reach into submodules when you swap components — see Choosing components.

Step 2: The two tables

df_a = pd.DataFrame({
    "id":   ["A1", "A2", "A3"],
    "name": ["Apple Inc", "Microsoft Corp", "Google LLC"],
    "city": ["Cupertino", "Redmond", "Mountain View"],
})
df_b = pd.DataFrame({
    "id":   ["B1", "B2", "B3"],
    "name": ["Apple Incorporated", "Microsoft", "Google"],
    "city": ["Cupertino", "Redmond", "Mountain View"],
})

Two ordinary DataFrames describing the same companies with surface differences (Apple Inc vs Apple Incorporated, Microsoft Corp vs Microsoft). The columns line up here, so no serializer template is needed — when they don’t, each Source carries its own serializer and a column_mapping (see Linking two tables).
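To see why aligned columns matter, here is a rough standalone sketch of what a whole-row serializer and a column mapping could do. It is not the library's implementation: `serialize_row`, the `column: value` text format, skipping the id column, and the `company`/`town` column names are all hypothetical choices made for illustration.

```python
from typing import Optional


def serialize_row(
    row: dict,
    id_column: str = "id",
    column_mapping: Optional[dict] = None,
) -> str:
    """Turn one record into a single string (hypothetical format).

    Columns are optionally renamed through `column_mapping`, and the id
    column is skipped so that it does not influence similarity.
    """
    mapping = column_mapping or {}
    parts = []
    for column, value in row.items():
        if column == id_column:
            continue
        parts.append(f"{mapping.get(column, column)}: {value}")
    return " | ".join(parts)


# Aligned schema: both sides serialize with the same field names.
row_a = {"id": "A1", "name": "Apple Inc", "city": "Cupertino"}
text_a = serialize_row(row_a)

# Misaligned schema (made-up column names): a mapping restores alignment.
row_b = {"id": "B1", "company": "Apple Incorporated", "town": "Cupertino"}
text_b = serialize_row(row_b, column_mapping={"company": "name", "town": "city"})

print(text_a)
print(text_b)
```

With the mapping applied, both records serialize under the same field names, so any remaining difference in the text comes from the values rather than from the schema.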

Step 3: Ground truth (optional)

gold = LabeledPairs.from_pairs([("A1", "B1"), ("A2", "B2"), ("A3", "B3")])

LabeledPairs holds the known-correct matches. It is only used for scoring in Step 7; linking itself needs no labels.

Step 4: Configure the linker and wrap the inputs

linker = DenseLinker.with_defaults()   # picks a sensible embedder + index + matcher
left  = Source(df_a, id_column="id")   # serializer=None -> whole-row default
right = Source(df_b, id_column="id")

DenseLinker.with_defaults() wires the dependency-free reference stack (a HashedNGramEmbedder and NumpyFlatIndex behind a DenseBlocker, plus a ThresholdMatcher). The linker is immutable config — no data, nothing fitted. Each Source binds a frame to its id column; the schema travels with the frame, not the linker (the design-time / runtime split in Key concepts).

Step 5: Link

result = linker.link(left, right)   # -> LinkageResult

One call — no fit, no predict, no mutation of the linker or the inputs. Under the hood this embeds both tables, retrieves nearest neighbours as candidate pairs, and decides each one. You get back a LinkageResult holding decisions and any failures in separate channels.
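To make the retrieve-then-decide stages concrete, here is a minimal standalone sketch using only the standard library. It does not call denselinkage: the toy unit vectors, the `link_sketch` helper, `k=2`, and the 0.7 threshold are illustrative assumptions, not the library's internals or defaults.

```python
from typing import NamedTuple


class Decision(NamedTuple):
    left_id: str
    right_id: str
    similarity: float
    match: bool


def dot(u, v):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def link_sketch(left, right, k=2, threshold=0.7):
    """Brute-force top-k retrieval followed by a threshold decision.

    `left` and `right` map record ids to unit vectors (already embedded).
    """
    decisions = []
    for left_id, left_vec in left.items():
        # Blocking: score every right record, keep the k most similar.
        scored = sorted(
            ((dot(left_vec, right_vec), right_id)
             for right_id, right_vec in right.items()),
            reverse=True,
        )
        # Matching: decide each surviving candidate pair.
        for similarity, right_id in scored[:k]:
            decisions.append(
                Decision(left_id, right_id, similarity, similarity >= threshold)
            )
    return decisions


# Toy embeddings (made up): two left records, three right records.
left = {"A1": (1.0, 0.0), "A2": (0.0, 1.0)}
right = {"B1": (0.8, 0.6), "B2": (0.6, 0.8), "B3": (-1.0, 0.0)}

decisions = link_sketch(left, right)
matches = [(d.left_id, d.right_id) for d in decisions if d.match]
print(matches)
```

Note that B3 never reaches the matcher: retrieval prunes it before any decision is made, which is the point of blocking on larger tables.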

Step 6: Inspect the results

print(result.to_frame())

  left_id right_id  similarity  match confidence reason
    A1       B1    0.762443   True       None   None
    A2       B1    0.188329  False       None   None
    A3       B1    0.151794  False       None   None
    A2       B2    0.833908   True       None   None
    A3       B2    0.183309  False       None   None
    A1       B2    0.160128  False       None   None
    A3       B3    0.864126   True       None   None
    A1       B3    0.188713  False       None   None
    A2       B3    0.178685  False       None   None

to_frame() returns a fixed schema, independent of your input column names: left_id, right_id, similarity, match, confidence, reason. Each row is one decided candidate pair. The blocker surfaced nine pairs (each query record against its nearest neighbours, which with three records per table covers every combination), and the matcher flagged the three real matches: A1/B1, A2/B2 and A3/B3. confidence and reason are None because the threshold matcher doesn’t produce them (an LLM matcher would). Pairs the matcher couldn’t decide would appear in result.errors, never in this frame.

Step 7: Score it

metrics = linkage_metrics(result, gold=gold)   # -> LinkageMetrics
print(f"P/R/F1: {metrics.precision:.3f} {metrics.recall:.3f} {metrics.f1:.3f}")

P/R/F1: 1.000 1.000 1.000

linkage_metrics() scores the run against your gold labels. metrics.n_errors reports undecided pairs separately: they are excluded from precision and recall but never silently dropped. For a dedupe run, pass directed=False (see Evaluation).
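For readers who want the arithmetic behind the three numbers, here is a small standalone sketch of pair-level precision, recall and F1 over sets of id pairs. It is not the library's code: `precision_recall_f1` is a hypothetical helper, and reading `directed=False` as "treat (a, b) and (b, a) as the same pair" is an assumption about what that flag means.

```python
def precision_recall_f1(predicted, gold, directed=True):
    """Pair-level precision, recall and F1.

    With directed=False, pairs are compared as unordered (an assumption
    about dedupe-style scoring, where (a, b) and (b, a) are the same link).
    """
    def normalise(pairs):
        return {p if directed else tuple(sorted(p)) for p in pairs}

    pred, truth = normalise(predicted), normalise(gold)
    true_positives = len(pred & truth)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [("A1", "B1"), ("A2", "B2"), ("A3", "B3")]

# The three pairs flagged match=True in Step 6.
perfect = precision_recall_f1(gold, gold)

# A made-up run that misses A3/B3 and wrongly links A1/B3.
imperfect = precision_recall_f1([("A1", "B1"), ("A2", "B2"), ("A1", "B3")], gold)

# Dedupe-style comparison: the same link written in the opposite order.
as_directed = precision_recall_f1([("x", "y")], [("y", "x")])
as_undirected = precision_recall_f1([("x", "y")], [("y", "x")], directed=False)

print(perfect, imperfect, as_directed, as_undirected)
```

In the imperfect run two of three predictions are correct and two of three gold pairs are found, so all three scores come out at two thirds.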

The complete script

All seven steps, as the runnable example file:

examples/00_quickstart.py

"""Example 00 — Quickstart (the shortest real path).

Schema-aligned data + ``Source`` defaulting to the whole-row serializer means
this is genuinely minimal: no template, no column mapping, one call.

``DenseLinker.with_defaults()`` is the low-floor entry point: it wires the
dependency-free reference stack (``HashedNGramEmbedder`` + ``NumpyFlatIndex``
behind a ``DenseBlocker``, plus ``ThresholdMatcher``). Pass ``blocker=`` /
``matcher=`` to override either half (see ``01``/``03`` for full control).

The default stack is **lexical** (``HashedNGramEmbedder`` is character n-gram
feature hashing): it recovers abbreviations, punctuation and typos
(``Apple Inc`` / ``Apple Incorporated``; ``Microsoft Corp`` / ``Microsoft``;
``Google LLC`` / ``Google``) but not semantic renames such as
``Google`` -> ``Alphabet``. The gold below is lexically recoverable on purpose;
for semantic matches reach for ``SentenceTransformerEmbedder`` (see ``01``).
"""

import pandas as pd

from denselinkage import DenseLinker, LabeledPairs, Source
from denselinkage.metrics import linkage_metrics


def main() -> None:
    df_a = pd.DataFrame(
        {
            "id": ["A1", "A2", "A3"],
            "name": ["Apple Inc", "Microsoft Corp", "Google LLC"],
            "city": ["Cupertino", "Redmond", "Mountain View"],
        }
    )
    df_b = pd.DataFrame(
        {
            "id": ["B1", "B2", "B3"],
            "name": ["Apple Incorporated", "Microsoft", "Google"],
            "city": ["Cupertino", "Redmond", "Mountain View"],
        }
    )
    gold = LabeledPairs.from_pairs([("A1", "B1"), ("A2", "B2"), ("A3", "B3")])

    linker = DenseLinker.with_defaults()  # picks a sensible embedder/index/matcher
    left = Source(df_a, id_column="id")  # serializer=None -> whole-row default
    right = Source(df_b, id_column="id")

    result = linker.link(left, right)  # -> LinkageResult

    # columns: left_id, right_id, similarity, match, confidence, reason
    print(result.to_frame())
    metrics = linkage_metrics(result, gold=gold)  # -> LinkageMetrics
    print(f"P/R/F1: {metrics.precision:.3f} {metrics.recall:.3f} {metrics.f1:.3f}")


if __name__ == "__main__":
    main()

The default stack is lexical

HashedNGramEmbedder is character-n-gram feature hashing: it recovers abbreviations, punctuation, and typos (Apple Inc / Apple Incorporated) but not semantic renames such as Google → Alphabet. For semantic matching, swap in a SentenceTransformerEmbedder — see Choosing components.
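The lexical behaviour described above can be illustrated with a generic sketch of character n-gram feature hashing. This shows the general technique only and is not HashedNGramEmbedder's actual code: the trigram size, the 4096-bucket dimension, the CRC32 hash and the space padding are all assumptions chosen for the example.

```python
import math
import zlib


def char_ngrams(text, n=3):
    """Lowercase the text, pad it with spaces, and return its character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def hashed_embedding(text, dim=4096, n=3):
    """Count n-grams into `dim` hash buckets, then L2-normalise the counts."""
    vec = [0.0] * dim
    for gram in char_ngrams(text, n):
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(u, v):
    """Cosine similarity of two already-normalised vectors."""
    return sum(a * b for a, b in zip(u, v))


# Surface variation: the two names share most of their trigrams.
sim_apple = cosine(
    hashed_embedding("Apple Inc"), hashed_embedding("Apple Incorporated")
)

# Semantic rename: the two names share no trigrams at all.
sim_google = cosine(hashed_embedding("Google"), hashed_embedding("Alphabet"))

print(f"Apple Inc vs Apple Incorporated: {sim_apple:.3f}")
print(f"Google vs Alphabet:              {sim_google:.3f}")
```

The first pair scores high because the strings overlap character by character; the second scores near zero because nothing in the spelling connects the names, which is the gap a semantic embedder is meant to close.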

Next steps

End-to-end tutorial — the full pipeline explained with diagrams, for readers who know other entity-resolution methods.
Key concepts — the mental model behind these seven steps.
Linking two tables — full component control.
Custom components — implement your own port.