Quickstart

The shortest real path: two schema-aligned tables, the default stack, one call. This example runs today — we’ll build it up a few lines at a time, then show the complete script.

Step 1 — Imports

import pandas as pd

from denselinkage import DenseLinker, LabeledPairs, Source
from denselinkage.metrics import linkage_metrics

Everything you need for a basic run comes from the package root (the prelude) plus the metrics module. You only reach into submodules when you swap components — see Choosing components.

Step 2 — The two tables

df_a = pd.DataFrame({
    "id":   ["A1", "A2", "A3"],
    "name": ["Apple Inc", "Microsoft Corp", "Google LLC"],
    "city": ["Cupertino", "Redmond", "Mountain View"],
})
df_b = pd.DataFrame({
    "id":   ["B1", "B2", "B3"],
    "name": ["Apple Incorporated", "Microsoft", "Google"],
    "city": ["Cupertino", "Redmond", "Mountain View"],
})

Two ordinary DataFrames describing the same companies with surface differences (Apple Inc vs Apple Incorporated, Microsoft Corp vs Microsoft). The columns line up here, so no serializer template is needed — when they don’t, each Source carries its own serializer and a column_mapping (see Linking two tables).

Step 3 — Ground truth (optional)

gold = LabeledPairs.from_pairs([("A1", "B1"), ("A2", "B2"), ("A3", "B3")])

LabeledPairs holds the known-correct matches. It is used only for scoring in Step 7; linking itself needs no labels.

Step 4 — Configure the linker and wrap the inputs

linker = DenseLinker.with_defaults()   # picks a sensible embedder + index + matcher
left  = Source(df_a, id_column="id")   # serializer=None -> whole-row default
right = Source(df_b, id_column="id")

DenseLinker.with_defaults() wires the dependency-free reference stack (a HashedNGramEmbedder and NumpyFlatIndex behind a DenseBlocker, plus a ThresholdMatcher). The linker is immutable config — no data, nothing fitted. Each Source binds a frame to its id column; the schema travels with the frame, not the linker (the design-time / runtime split in Key concepts).
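To make the blocker/matcher split concrete, here is a minimal NumPy sketch of the general idea: a brute-force ("flat") nearest-neighbour search proposes candidate pairs, and a similarity threshold decides them. This is an illustration only, not denselinkage's implementation; the function name, the toy vectors, and the 0.9 cut-off are all hypothetical.

```python
import numpy as np


def top_k_neighbours(queries: np.ndarray, corpus: np.ndarray, k: int):
    """Brute-force cosine search: compare every query row with every corpus row.

    Assumes no row is the zero vector (the norms are used as divisors).
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = q @ c.T                              # shape: (n_queries, n_corpus)
    order = np.argsort(-sims, axis=1)[:, :k]    # k most similar corpus rows per query
    return order, np.take_along_axis(sims, order, axis=1)


# Toy 2-D "embeddings" standing in for embedded records (hypothetical values).
left_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
right_vecs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.7]])

# Blocking step: keep only the 2 nearest right-hand rows for each left-hand row.
idx, sims = top_k_neighbours(left_vecs, right_vecs, k=2)

# Matching step: a plain threshold on similarity (0.9 is an arbitrary choice).
THRESHOLD = 0.9
decisions = sims >= THRESHOLD

for left_row, (neighbours, flags) in enumerate(zip(idx, decisions)):
    for right_row, is_match in zip(neighbours, flags):
        print(f"left {left_row} vs right {right_row}: match={bool(is_match)}")
```

The point of the split is that the search only has to be cheap and high-recall, while the decision rule can be swapped independently.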

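Step 5 — Run the link

result = linker.link(left, right)   # -> LinkageResult

link() is the one call: the blocker surfaces candidate pairs and the matcher decides each one. It returns a LinkageResult, which the next two steps inspect and score.

Step 6 — Inspect the results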

print(result.to_frame())
  left_id right_id  similarity  match confidence reason
0      A1       B1    0.762443   True       None   None
1      A2       B1    0.188329  False       None   None
2      A3       B1    0.151794  False       None   None
3      A2       B2    0.833908   True       None   None
4      A3       B2    0.183309  False       None   None
5      A1       B2    0.160128  False       None   None
6      A3       B3    0.864126   True       None   None
7      A1       B3    0.188713  False       None   None
8      A2       B3    0.178685  False       None   None

to_frame() is a fixed schema, independent of your input column names: left_id, right_id, similarity, match, confidence, reason. Each row is one decided candidate pair — the blocker surfaced nine (every query record against its nearest neighbours), and the matcher flagged the three real matches on the diagonal. confidence / reason are None because the threshold matcher doesn’t produce them (an LLM matcher would). Pairs the matcher couldn’t decide would be in result.errors, never in this frame.
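Because the frame is an ordinary pandas DataFrame with a fixed schema, the usual pandas idioms apply. The sketch below rebuilds three rows of the output above by hand (so it runs without the library) and keeps only the matched pairs, best first. With a real run you would presumably start from result.to_frame() instead.

```python
import pandas as pd

# Hand-built stand-in with the same six columns as the output shown above.
frame = pd.DataFrame(
    {
        "left_id": ["A1", "A2", "A3"],
        "right_id": ["B1", "B1", "B3"],
        "similarity": [0.762443, 0.188329, 0.864126],
        "match": [True, False, True],
        "confidence": [None, None, None],
        "reason": [None, None, None],
    }
)

# Boolean-mask on the `match` column, then keep the columns of interest.
matches = frame.loc[frame["match"], ["left_id", "right_id", "similarity"]]

# Highest-similarity pairs first, with a clean 0..n-1 index.
best_first = matches.sort_values("similarity", ascending=False).reset_index(drop=True)
print(best_first)
```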

Step 7 — Score it

metrics = linkage_metrics(result, gold=gold)   # -> LinkageMetrics
print(f"P/R/F1: {metrics.precision:.3f} {metrics.recall:.3f} {metrics.f1:.3f}")
P/R/F1: 1.000 1.000 1.000

linkage_metrics() scores the run against your gold. metrics.n_errors reports undecided pairs separately (they’re excluded from precision/recall, never silently dropped). For a dedupe run, pass directed=False — see Evaluation.
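For intuition about the three numbers, the following is a plain-Python sketch of pairwise precision, recall, and F1 over sets of (left_id, right_id) pairs. It is not linkage_metrics() itself: the helper name pair_metrics is hypothetical, and it ignores the library's handling of undecided pairs and of directed=False.

```python
def pair_metrics(predicted: set, gold: set) -> tuple:
    """Return (precision, recall, f1) for predicted vs. gold match pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold_pairs = {("A1", "B1"), ("A2", "B2"), ("A3", "B3")}

# The quickstart run: every predicted match is in the gold set and none is missed.
perfect = pair_metrics({("A1", "B1"), ("A2", "B2"), ("A3", "B3")}, gold_pairs)
print(perfect)

# One spurious extra match lowers precision but leaves recall untouched.
noisy = pair_metrics(gold_pairs | {("A1", "B2")}, gold_pairs)
print(noisy)
```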

The complete script

All seven steps, as the runnable example file:

examples/00_quickstart.py
"""Example 00 — Quickstart (the shortest real path).

Schema-aligned data + ``Source`` defaulting to the whole-row serializer means
this is genuinely minimal: no template, no column mapping, one call.

``DenseLinker.with_defaults()`` is the low-floor entry point: it wires the
dependency-free reference stack (``HashedNGramEmbedder`` + ``NumpyFlatIndex``
behind a ``DenseBlocker``, plus ``ThresholdMatcher``). Pass ``blocker=`` /
``matcher=`` to override either half (see ``01``/``03`` for full control).

The default stack is **lexical** (``HashedNGramEmbedder`` is character n-gram
feature hashing): it recovers abbreviations, punctuation and typos
(``Apple Inc`` / ``Apple Incorporated``; ``Microsoft Corp`` / ``Microsoft``;
``Google LLC`` / ``Google``) but not semantic renames such as
``Google`` -> ``Alphabet``. The gold below is lexically recoverable on purpose;
for semantic matches reach for ``SentenceTransformerEmbedder`` (see ``01``).
"""

import pandas as pd

from denselinkage import DenseLinker, LabeledPairs, Source
from denselinkage.metrics import linkage_metrics


def main() -> None:
    df_a = pd.DataFrame(
        {
            "id": ["A1", "A2", "A3"],
            "name": ["Apple Inc", "Microsoft Corp", "Google LLC"],
            "city": ["Cupertino", "Redmond", "Mountain View"],
        }
    )
    df_b = pd.DataFrame(
        {
            "id": ["B1", "B2", "B3"],
            "name": ["Apple Incorporated", "Microsoft", "Google"],
            "city": ["Cupertino", "Redmond", "Mountain View"],
        }
    )
    gold = LabeledPairs.from_pairs([("A1", "B1"), ("A2", "B2"), ("A3", "B3")])

    linker = DenseLinker.with_defaults()  # picks a sensible embedder/index/matcher
    left = Source(df_a, id_column="id")  # serializer=None -> whole-row default
    right = Source(df_b, id_column="id")

    result = linker.link(left, right)  # -> LinkageResult

    # columns: left_id, right_id, similarity, match, confidence, reason
    print(result.to_frame())
    metrics = linkage_metrics(result, gold=gold)  # -> LinkageMetrics
    print(f"P/R/F1: {metrics.precision:.3f} {metrics.recall:.3f} {metrics.f1:.3f}")


if __name__ == "__main__":
    main()

The default stack is lexical

HashedNGramEmbedder is character-n-gram feature hashing: it recovers abbreviations, punctuation, and typos (Apple Inc / Apple Incorporated) but not semantic renames such as Google -> Alphabet. For semantic matching, swap in a SentenceTransformerEmbedder (see Choosing components).
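A rough sketch of character-n-gram feature hashing shows why lexical variants score high and renames do not. This is not HashedNGramEmbedder's actual code: the trigram size, the 256-bucket dimension, the CRC32 hash, and the function names are all assumptions chosen for illustration.

```python
import zlib

import numpy as np


def char_ngrams(text: str, n: int = 3) -> list:
    """Lower-case the text, pad it with one space on each side, and slice n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def hashed_ngram_vector(text: str, dim: int = 256, n: int = 3) -> np.ndarray:
    """Count n-grams into `dim` hash buckets, then L2-normalise the counts."""
    vec = np.zeros(dim, dtype=np.float64)
    for gram in char_ngrams(text, n):
        # CRC32 is deterministic across runs, unlike Python's built-in hash() for str.
        bucket = zlib.crc32(gram.encode("utf-8")) % dim
        vec[bucket] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def cosine(a: str, b: str) -> float:
    """Cosine similarity of the two hashed n-gram vectors."""
    return float(hashed_ngram_vector(a) @ hashed_ngram_vector(b))


lexical = cosine("Apple Inc", "Apple Incorporated")   # shares most of its trigrams
semantic = cosine("Google", "Alphabet")               # shares no trigrams
print(f"lexical variant: {lexical:.3f}   semantic rename: {semantic:.3f}")
```

The two Apple strings overlap on nearly every trigram of the shorter name, so their vectors point in similar directions. Google and Alphabet share no trigrams at all, so any similarity between them can only come from accidental hash collisions.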

Next steps