Semantic + LLM matching

The dependency-free stack (Linking) blocks with lexical n-gram similarity and decides with a similarity threshold. The headline stack swaps in three heavy adapters to cover what lexical matching misses, namely semantic renames (Google / Alphabet, which have no character bigram or trigram in common) and pairs that need a judgment call:

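| Stage | Dependency-free | Heavy adapter | Extra |
|-------|---------------------|-----------------------------|-------------------------|
| Embed | HashedNGramEmbedder | SentenceTransformerEmbedder | [sentence-transformers] |
| Index | NumpyFlatIndex | FaissFlatIndex | [faiss] |
| Match | ThresholdMatcher | LangChainMatcher | [langchain] |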

They are drop-in behind the same ports — the linker, serializers, metrics and clustering are unchanged. Each swap is independent: take semantic embeddings without an LLM matcher, or an LLM matcher over lexical blocking.
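To make the lexical gap concrete, here is a minimal sketch that compares character trigrams directly. It is an illustration only: `char_ngrams` and `jaccard` are hypothetical helpers, and plain Jaccard overlap stands in for the hashed n-gram embedding that the dependency-free stack uses.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Return the set of character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of the two strings' character n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


# A suffix change keeps most trigrams; a rename keeps none.
print(jaccard("Google Inc", "Google LLC"))
print(jaccard("Google", "Alphabet"))  # 0.0
```

The rename scores zero because no trigram survives it, so no lexical threshold can recover the pair. That is the case the semantic embedder is meant to cover.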

Install

pip install "denselinkage[all]"      # all three extras
# or pick what you need:
pip install "denselinkage[sentence-transformers]"
pip install "denselinkage[faiss]"
pip install "denselinkage[langchain]"

The core still imports without any of them — import denselinkage pulls in no heavy backend. A missing extra raises a ModuleNotFoundError naming the exact pip install to run, and only when you construct the adapter.
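The deferred-import behaviour described above can be sketched as follows. This is not the library's code: `OptionalBackendAdapter` is a hypothetical class, and the message wording is an assumption. It only shows the general pattern of importing the backend inside the constructor and re-raising with an install hint.

```python
import importlib


class OptionalBackendAdapter:
    """Hypothetical adapter whose backend is imported only on construction."""

    def __init__(self, module_name: str, extra: str) -> None:
        try:
            # Importing here, not at module top level, keeps the core import light.
            self._backend = importlib.import_module(module_name)
        except ModuleNotFoundError as exc:
            raise ModuleNotFoundError(
                f"{module_name!r} is not installed; run: "
                f'pip install "denselinkage[{extra}]"'
            ) from exc


# Defining the class costs nothing; only construction touches the backend.
try:
    OptionalBackendAdapter("backend_that_is_not_installed", "faiss")
except ModuleNotFoundError as err:
    print(err)
```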

Semantic embeddings — SentenceTransformerEmbedder

from denselinkage.embedding import SentenceTransformerEmbedder

embedder = SentenceTransformerEmbedder("all-MiniLM-L6-v2")

The argument is any sentence-transformers checkpoint. all-MiniLM-L6-v2 (384-dim) is a small, fast default; larger models (e.g. all-mpnet-base-v2, 768-dim) trade speed for accuracy. The model loads eagerly, so a bad name fails immediately.

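encode returns L2-normalized float32 vectors, so the inner product equals cosine, which gives the same similarity semantics as HashedNGramEmbedder. A similarity_threshold you tuned on the lexical stack therefore keeps its meaning here: it is still a cosine cutoff. The score distribution of a semantic model differs from the lexical one, though, so re-check the value on your own data.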
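A short check of the identity behind that claim, using plain Python lists rather than the library's float32 arrays. The helper names are illustrative, not part of denselinkage.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
# After normalization the plain inner product is already the cosine.
print(cosine(a, b), dot(l2_normalize(a), l2_normalize(b)))
```

Both values agree up to floating-point rounding (0.96 for these vectors), which is why an inner-product index can report cosine scores without a separate normalization step at query time.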

Note

The first construction downloads the model (cached afterwards under ~/.cache/huggingface). Embedding is CPU-bound; pass batch_size= and show_progress=True to encode for large reference sets.

ANN index — FaissFlatIndex

from denselinkage.indexing import FaissFlatIndex

vector_index = FaissFlatIndex()   # exact inner-product (IndexFlatIP) search

A drop-in for NumpyFlatIndex behind the VectorIndex port — same neighbours, same scores (a differential test pins them together), because it uses FAISS’s IndexFlatIP and so reports cosine for the normalized vectors above. Reach for it when the brute-force numpy search becomes the bottleneck on a large reference set; for a few thousand records NumpyFlatIndex is fine.
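For intuition, an exact flat inner-product search amounts to scoring the query against every reference vector and keeping the best top_k. The sketch below uses only the standard library; `flat_search` is a hypothetical helper and is not how either index is implemented internally.

```python
import heapq


def flat_search(query, reference, top_k):
    """Exhaustively score `query` against each row and return the best top_k.

    Returns (row_index, score) tuples, highest score first.
    """
    scored = (
        (sum(q * r for q, r in zip(query, row)), index)
        for index, row in enumerate(reference)
    )
    best = heapq.nlargest(top_k, scored)
    return [(index, score) for score, index in best]


# Unit-length reference vectors, so each score is a cosine similarity.
reference = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
print(flat_search([0.6, 0.8], reference, top_k=2))
```

The work grows linearly with the number of reference vectors, which is why the choice of backend only starts to matter on a large reference set.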

Note

Incremental extended() and persistence of a FAISS-backed index are out of scope for v1 — the Reference Store persists the numpy stack only.

LLM matching — LangChainMatcher

The matcher replaces the similarity gate with a language-model judgment.

from langchain_openai import ChatOpenAI
from denselinkage.matching import LangChainMatcher, RetryPolicy

matcher = LangChainMatcher(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0.0, seed=42),
    prompt=(
        "Are these two records the same real-world entity?\n"
        "Record A: {record_a}\n"
        "Record B: {record_b}"
    ),
    retry=RetryPolicy(max_retries=3),
    max_concurrency=8,
)

The llm is injected — any LangChain chat model works (ChatOpenAI, ChatAnthropic, …); model and operational config live on the model object, not on the matcher. Set the provider’s credentials in the environment (e.g. OPENAI_API_KEY).

Prompt contract. The prompt carries only the semantic question and must reference {record_a} and {record_b} — the matcher fills them with the two records’ serialized text. Do not ask for an output format: the matcher binds structured output and returns typed MatchDecisions, so you never parse text and a brittle “answer YES or NO” instruction is neither needed nor wanted. Each decision carries is_match plus optional confidence / rationale when the model supplies them.
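The placeholder requirement can be checked before any model call. The sketch below is an assumption about how such a check might look, not the matcher's own validation: `check_prompt` and `render` are hypothetical helpers built on the standard library's string.Formatter.

```python
from string import Formatter

REQUIRED_FIELDS = {"record_a", "record_b"}


def check_prompt(prompt: str) -> None:
    """Raise ValueError if the template lacks a required placeholder."""
    fields = {name for _, name, _, _ in Formatter().parse(prompt) if name}
    missing = REQUIRED_FIELDS - fields
    if missing:
        raise ValueError(f"prompt is missing placeholders: {sorted(missing)}")


def render(prompt: str, record_a: str, record_b: str) -> str:
    """Fill the two placeholders with the records' serialized text."""
    check_prompt(prompt)
    return prompt.format(record_a=record_a, record_b=record_b)


prompt = (
    "Are these two records the same real-world entity?\n"
    "Record A: {record_a}\n"
    "Record B: {record_b}"
)
print(render(prompt, "name: Google", "name: Alphabet"))
```

The template asks only the semantic question; nothing in it describes an answer format.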

Operational knobs.

  • retry=RetryPolicy(max_retries=…, backoff_seconds=…) — per-pair retries on a transient backend error.

  • max_concurrency=… — how many pairs are sent to the model in parallel.
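The two knobs can be pictured with a small standard-library sketch. It rests on several assumptions: max_retries counts attempts after the first, the backoff is a constant sleep, a transient error is modelled as ConnectionError, and parallelism is a thread pool. `call_with_retry` and `decide_all` are hypothetical names, not the matcher's API.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def call_with_retry(decide, pair, max_retries=3, backoff_seconds=0.0):
    """Call decide(pair), retrying on ConnectionError up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return decide(pair)
        except ConnectionError:
            if attempt == max_retries:
                raise  # retries exhausted
            time.sleep(backoff_seconds)  # constant backoff (assumed)


def decide_all(decide, pairs, max_concurrency=8, max_retries=3, backoff_seconds=0.0):
    """Decide every pair with at most max_concurrency calls in flight."""
    def worker(pair):
        return call_with_retry(decide, pair, max_retries, backoff_seconds)

    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # pool.map preserves the input order of `pairs`.
        return list(pool.map(worker, pairs))
```

Retries apply per pair, so one slow or flaky call does not restart the rest of the batch.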

Failures are soft. A pair the matcher cannot decide after its retries becomes a MatchError in result.errors — never an exception into the batch, so one bad call cannot abort a run. Errored pairs are excluded from precision/recall and counted as LinkageMetrics.n_errors:

result = linker.link(left, right)
for err in result.errors:
    print(err.reason)
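The soft-failure rule can be sketched independently of the library. In the example below, `PairError` is a hypothetical stand-in for MatchError, and `decide_softly` and `precision_recall` are illustrative helpers; the point is only that a failed pair is recorded rather than raised, and that metrics are computed over decided pairs alone.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]


@dataclass
class PairError:
    """Hypothetical record of a pair that could not be decided."""
    pair: Pair
    reason: str


def decide_softly(decide: Callable[[Pair], bool], pairs: List[Pair]):
    """Decide each pair; collect failures instead of aborting the batch."""
    decisions: Dict[Pair, bool] = {}
    errors: List[PairError] = []
    for pair in pairs:
        try:
            decisions[pair] = decide(pair)
        except Exception as exc:
            errors.append(PairError(pair, f"{type(exc).__name__}: {exc}"))
    return decisions, errors


def precision_recall(decisions: Dict[Pair, bool], truth: Dict[Pair, bool]):
    """Precision and recall over decided pairs only; errored pairs are absent."""
    tp = sum(1 for p, m in decisions.items() if m and truth[p])
    fp = sum(1 for p, m in decisions.items() if m and not truth[p])
    fn = sum(1 for p, m in decisions.items() if not m and truth[p])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Reporting the error count alongside precision and recall matters: a run with many errored pairs can show good metrics over a much smaller decided set.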

Tip

Cost & determinism. Every surviving candidate pair is one LLM call, so the number of calls is bounded by top_k × queries. Tune top_k / similarity_threshold on the blocker (or pre-filter with SimilarityThresholdFilter) to keep the matcher’s workload down. Use temperature=0.0 (and a seed where supported) for more reproducible decisions.
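A back-of-the-envelope version of that estimate, with made-up candidate scores. `max_llm_calls` and `surviving_pairs` are hypothetical helpers, and the tuple layout (query id, reference id, score) is an assumption for illustration.

```python
def max_llm_calls(n_queries: int, top_k: int) -> int:
    """Upper bound on matcher calls: every query keeps at most top_k candidates."""
    return n_queries * top_k


def surviving_pairs(candidates, similarity_threshold: float):
    """Keep only candidates whose blocking score reaches the threshold."""
    return [c for c in candidates if c[2] >= similarity_threshold]


# Illustrative blocker output for 2 queries with top_k=3.
candidates = [
    ("q1", "r1", 0.93), ("q1", "r2", 0.81), ("q1", "r3", 0.42),
    ("q2", "r4", 0.79), ("q2", "r5", 0.88), ("q2", "r6", 0.30),
]
budget = max_llm_calls(n_queries=2, top_k=3)
kept = surviving_pairs(candidates, similarity_threshold=0.80)
print(f"{len(kept)} of at most {budget} calls")
```

Here the threshold halves the matcher's workload; the real saving depends on the score distribution of your data.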

Putting it together

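from denselinkage import DenseLinker
from denselinkage.blocking import DenseBlocker
from denselinkage.embedding import SentenceTransformerEmbedder
from denselinkage.indexing import FaissFlatIndex

# `matcher` is the LangChainMatcher built in the previous section;
# `left` and `right` are the two record sources to link.
blocker = DenseBlocker(
    embedder=SentenceTransformerEmbedder("all-MiniLM-L6-v2"),
    vector_index=FaissFlatIndex(),
    similarity_threshold=0.80,   # retrieve top_k, then keep >= this
    top_k=5,
)
linker = DenseLinker(blocker=blocker, matcher=matcher)
result = linker.link(left, right)   # the same call as the lexical stack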

The full runnable assembly, including Source / serializers and evaluation, is examples/01_end_to_end_linkage.py; the deduplication shape is examples/02_deduplication.py. Both type-check and compile in CI, but running them requires the extras and an OPENAI_API_KEY.

See also