Linking two tables

link(left, right) matches records across two datasets. The quickstart uses with_defaults(); this page shows how to assemble the blocker and matcher yourself for full control over each component.

Assemble the components

A DenseLinker is pure configuration. You inject a Blocker (here a DenseBlocker composing an embedder and a vector index) and a Matcher:

from denselinkage import DenseLinker, Source, TemplateSerializer
from denselinkage.blocking import DenseBlocker
from denselinkage.embedding import SentenceTransformerEmbedder
from denselinkage.indexing import FaissFlatIndex
from denselinkage.matching import LangChainMatcher

linker = DenseLinker(
    blocker=DenseBlocker(
        embedder=SentenceTransformerEmbedder("all-MiniLM-L6-v2"),
        vector_index=FaissFlatIndex(),
        similarity_threshold=0.80,   # retrieve top_k, then keep >= this
        top_k=5,
    ),
    matcher=LangChainMatcher(llm=...),
)

The embedder and the vector index are independent ports: you can swap the index (NumPy ↔ FAISS) without touching the embedder, and vice versa.
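
To make the "independent ports" idea and the "retrieve top_k, then keep >= threshold" comment concrete, here is a minimal, dependency-free sketch. Everything in it is hypothetical and for illustration only: the `Embedder` and `VectorIndex` protocols, the toy `CharCountEmbedder` and `BruteForceIndex`, and the `block` function are assumptions, not denselinkage's actual interfaces. The threshold is lowered to 0.7 because the toy letter-count embedder scores far less sharply than a real sentence-embedding model.

```python
import math
from typing import Protocol, Sequence


class Embedder(Protocol):
    """Hypothetical embedder port: texts in, vectors out."""

    def embed(self, texts: Sequence[str]) -> list[list[float]]: ...


class VectorIndex(Protocol):
    """Hypothetical index port: stores vectors, returns nearest neighbours."""

    def add(self, ids: Sequence[str], vectors: Sequence[Sequence[float]]) -> None: ...

    def search(self, query: Sequence[float], top_k: int) -> list[tuple[str, float]]: ...


class CharCountEmbedder:
    """Toy embedder: L2-normalised letter counts (stand-in for a real model)."""

    def embed(self, texts):
        vectors = []
        for text in texts:
            counts = [0.0] * 26
            for ch in text.lower():
                if "a" <= ch <= "z":
                    counts[ord(ch) - ord("a")] += 1.0
            norm = math.sqrt(sum(c * c for c in counts)) or 1.0
            vectors.append([c / norm for c in counts])
        return vectors


class BruteForceIndex:
    """Toy exact index: cosine similarity by exhaustive scan."""

    def __init__(self):
        self._items = []

    def add(self, ids, vectors):
        self._items.extend(zip(ids, vectors))

    def search(self, query, top_k):
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (item_id, sum(q * v for q, v in zip(query, vec)))
            for item_id, vec in self._items
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


def block(embedder: Embedder, index: VectorIndex, left, right, top_k, threshold):
    """Retrieve top_k neighbours per left record, then keep scores >= threshold."""
    index.add(list(right), embedder.embed(list(right.values())))
    candidates = []
    for left_id, vec in zip(left, embedder.embed(list(left.values()))):
        for right_id, score in index.search(vec, top_k):
            if score >= threshold:
                candidates.append((left_id, right_id, round(score, 3)))
    return candidates


left_records = {"A1": "Apple Inc", "A2": "Microsoft Corp"}
right_records = {"B1": "Apple Incorporated", "B2": "Microsoft", "B3": "Alphabet"}

pairs = block(
    CharCountEmbedder(), BruteForceIndex(),
    left_records, right_records, top_k=2, threshold=0.7,
)
print(pairs)
```

Because `block` depends only on the two protocols, either toy class could be replaced by another implementation with the same methods without changing the other, which is the property the paragraph above describes.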

Reconcile differing schemas

Each Source carries its own serializer, so two tables with different column names can share one template. The table whose columns differ passes a column_mapping that maps its column names onto the template's fields:

template = "Name: {name}, City: {city}"
left  = Source(df_a, id_column="id_a", serializer=TemplateSerializer(template))
right = Source(df_b, id_column="id_b", serializer=TemplateSerializer(
    template, column_mapping={"company_name": "name", "headquarters": "city"}))

result = linker.link(left, right)   # one call, no mutation
result.to_frame()                   # left_id, right_id, similarity, match, confidence, reason
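
As a rough illustration of what such a mapping achieves, the sketch below renames a row's columns to the template's field names and then fills the template. The `serialize` helper is hypothetical and written only for this page; it assumes the mapping direction shown above (source column to template field) and is not TemplateSerializer's actual implementation.

```python
from typing import Optional


def serialize(row: dict, template: str, column_mapping: Optional[dict] = None) -> str:
    """Rename source columns to template fields, then fill the template."""
    mapping = column_mapping or {}
    # Columns without a mapping entry keep their own name.
    fields = {mapping.get(col, col): value for col, value in row.items()}
    # str.format ignores unused keyword arguments such as the id column.
    return template.format(**fields)


template = "Name: {name}, City: {city}"
row_a = {"id_a": "A1", "name": "Apple Inc", "city": "Cupertino"}
row_b = {
    "id_b": "B1",
    "company_name": "Apple Incorporated",
    "headquarters": "Cupertino, CA",
}

text_a = serialize(row_a, template)
text_b = serialize(
    row_b, template, {"company_name": "name", "headquarters": "city"}
)
print(text_a)  # Name: Apple Inc, City: Cupertino
print(text_b)  # Name: Apple Incorporated, City: Cupertino, CA
```

Both rows end up in the same textual shape, so downstream components only ever see the template's vocabulary, never either table's column names.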

Full example

examples/01_end_to_end_linkage.py
"""Example 01 — End-to-End Dense Linkage (full control).

Explicitly assembled components: a dense blocker (SentenceTransformer
embeddings + a FAISS index) then an LLM matcher. Vector indexes live in
``denselinkage.indexing`` (their own port, parallel to embedders).

The prompt carries ONLY the semantic question: the matcher owns output
structure and returns typed ``MatchDecision``s, so a brittle "Answer YES or
NO" instruction is neither needed nor wanted.

NOTE: Runs with the heavy extras installed (``pip install "denselinkage[all]"``)
and an ``OPENAI_API_KEY`` — it uses ``SentenceTransformerEmbedder``,
``FaissFlatIndex``, and a ``LangChainMatcher`` over a live LLM. It is type-checked
and compiled in CI but not executed there (needs the extras + a key). For a
runnable end-to-end on the dependency-free stack, see ``00`` / ``04``.
"""

import pandas as pd
from langchain_openai import ChatOpenAI

from denselinkage import DenseLinker, LabeledPairs, Source, TemplateSerializer
from denselinkage.blocking import DenseBlocker
from denselinkage.embedding import SentenceTransformerEmbedder
from denselinkage.indexing import FaissFlatIndex
from denselinkage.matching import LangChainMatcher, RetryPolicy
from denselinkage.metrics import linkage_metrics


def main() -> None:
    df_a = pd.DataFrame(
        {
            "id_a": ["A1", "A2", "A3"],
            "name": ["Apple Inc", "Microsoft Corp", "Google LLC"],
            "city": ["Cupertino", "Redmond", "Mountain View"],
        }
    )
    df_b = pd.DataFrame(
        {
            "id_b": ["B1", "B2", "B3"],
            "company_name": ["Apple Incorporated", "Microsoft", "Alphabet"],
            "headquarters": ["Cupertino, CA", "Redmond, WA", "Mountain View, CA"],
        }
    )
    gold = LabeledPairs.from_pairs([("A1", "B1"), ("A2", "B2"), ("A3", "B3")])

    # Blocker: embedder and vector index injected independently.
    blocker = DenseBlocker(
        embedder=SentenceTransformerEmbedder("all-MiniLM-L6-v2"),
        vector_index=FaissFlatIndex(),
        similarity_threshold=0.80,  # retrieve top_k, then keep >= this
        top_k=5,
    )

    # Matcher: the LLM is injected; model / operational / domain config stay
    # separate. The prompt is just the question — no format instruction.
    matcher = LangChainMatcher(
        llm=ChatOpenAI(model="gpt-4o-mini", temperature=0.0, seed=42),
        prompt=(
            "Are these two records the same real-world entity?\n"
            "Record A: {record_a}\n"
            "Record B: {record_b}"
        ),
        retry=RetryPolicy(max_retries=3),
        max_concurrency=8,
    )

    # The linker is pure config — no data, no schema, nothing fitted.
    linker = DenseLinker(blocker=blocker, matcher=matcher)

    # Schema travels with each frame. df_b maps its columns onto the shared
    # template; the linker never learns either schema.
    template = "Name: {name}, City: {city}"
    left = Source(df_a, id_column="id_a", serializer=TemplateSerializer(template))
    right = Source(
        df_b,
        id_column="id_b",
        serializer=TemplateSerializer(
            template,
            column_mapping={"company_name": "name", "headquarters": "city"},
        ),
    )

    result = linker.link(left, right)  # one call, no mutation

    print("\n--- Match Results ---")
    # Fixed schema, independent of input column names:
    # left_id, right_id, similarity (float), match (bool),
    # confidence (float|None), reason (str|None). One row per *decided*
    # pair (matches AND non-matches); pairs the matcher could not decide
    # are in result.errors, not rows here.
    print(result.to_frame())

    print("\n--- Evaluation Metrics ---")
    metrics = linkage_metrics(result, gold=gold)  # -> LinkageMetrics
    print(f"Precision: {metrics.precision:.4f}")
    print(f"Recall:    {metrics.recall:.4f}")
    print(f"F1 Score:  {metrics.f1:.4f}")
    if metrics.n_errors:
        print(f"(errored pairs excluded from P/R: {metrics.n_errors})")


if __name__ == "__main__":
    main()
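
For readers who want to sanity-check the numbers printed in the evaluation step, the sketch below computes precision, recall, and F1 over sets of (left_id, right_id) pairs using the standard set-based definitions. The `pair_metrics` function and the sample predictions are hypothetical; how linkage_metrics handles edge cases is not shown on this page, so treat this only as an approximation of the idea, with errored pairs assumed to be left out of `predicted` already.

```python
def pair_metrics(predicted, gold):
    """Set-based precision, recall, and F1 over (left_id, right_id) pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = {("A1", "B1"), ("A2", "B2"), ("A3", "B3")}
# Illustrative predictions: two correct matches and one false positive.
predicted = {("A1", "B1"), ("A2", "B2"), ("A2", "B1")}

precision, recall, f1 = pair_metrics(predicted, gold)
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```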

Note

This example uses the heavy adapters (SentenceTransformerEmbedder, FaissFlatIndex, LangChainMatcher) behind the [all] extra and a live LLM, so it needs those extras and an OPENAI_API_KEY to run. It is type-checked and compiled in CI but not executed there. See Semantic + LLM matching for a walk-through of the knobs.

See also