Semantic + LLM matching
The dependency-free stack (Linking) blocks with lexical n-gram similarity and decides with a similarity threshold. The headline stack swaps in three heavy adapters for what lexical matching misses: semantic renames (Google / Alphabet, which share no character n-gram longer than a single letter) and pairs that need a judgment call:
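| Stage | Dependency-free | Heavy adapter | Extra |
|---|---|---|---|
| Embed | HashedNGramEmbedder | SentenceTransformerEmbedder | sentence-transformers |
| Index | NumpyFlatIndex | FaissFlatIndex | faiss |
| Match | similarity threshold | LangChainMatcher | langchain |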
They are drop-in behind the same ports — the linker, serializers, metrics and clustering are unchanged. Each swap is independent: take semantic embeddings without an LLM matcher, or an LLM matcher over lexical blocking.
Install
pip install "denselinkage[all]" # all three extras
# or pick what you need:
pip install "denselinkage[sentence-transformers]"
pip install "denselinkage[faiss]"
pip install "denselinkage[langchain]"
The core still imports without any of them — import denselinkage pulls in no
heavy backend. A missing extra raises a ModuleNotFoundError naming the exact
pip install to run, and only when you construct the adapter.
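The deferred-import behaviour described above can be pictured with a small sketch. The helper `require_extra`, the class `LazyAdapter`, and the message wording are hypothetical and are not taken from denselinkage's internals; the sketch only shows the heavy import running at construction time rather than at package import.

```python
import importlib
from types import ModuleType


def require_extra(module_name: str, extra: str) -> ModuleType:
    """Import an optional backend, or fail with the pip command to run.

    Hypothetical helper for illustration; not part of denselinkage.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            f"{module_name!r} is not installed. "
            f'Run: pip install "denselinkage[{extra}]"'
        ) from exc


class LazyAdapter:
    """Stand-in adapter: the heavy import runs in __init__, not at module import."""

    def __init__(self, module_name: str, extra: str) -> None:
        self._backend = require_extra(module_name, extra)
```

Defining `LazyAdapter` costs nothing; only `LazyAdapter("faiss", "faiss")` would attempt the import and, if it is missing, raise with the install hint.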
Semantic embeddings: SentenceTransformerEmbedder
from denselinkage.embedding import SentenceTransformerEmbedder
embedder = SentenceTransformerEmbedder("all-MiniLM-L6-v2")
The argument is any sentence-transformers
checkpoint. all-MiniLM-L6-v2 (384-dim) is a small, fast default; larger models
(e.g. all-mpnet-base-v2, 768-dim) trade speed for accuracy. The model loads
eagerly, so a bad name fails immediately.
encode returns L2-normalized float32 vectors, so the inner product equals
cosine — identical similarity semantics to HashedNGramEmbedder. A
similarity_threshold you tuned on the lexical stack therefore keeps its meaning
here.
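The claim that the inner product of L2-normalized vectors equals cosine similarity can be checked with a few lines of plain Python. This is a standalone sketch with made-up vectors; it does not call either embedder.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity computed from the raw, unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

# After normalization, a plain inner product already is the cosine.
inner = dot(l2_normalize(a), l2_normalize(b))
assert math.isclose(inner, cosine(a, b))
```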
Note
The first construction downloads the model (cached afterwards under
~/.cache/huggingface). Embedding is CPU-bound; pass batch_size= and
show_progress=True to encode for large reference sets.
ANN index: FaissFlatIndex
from denselinkage.indexing import FaissFlatIndex
vector_index = FaissFlatIndex() # exact inner-product (IndexFlatIP) search
A drop-in for NumpyFlatIndex behind the VectorIndex port — same neighbours,
same scores (a differential test pins them together), because it uses FAISS’s
IndexFlatIP and so reports cosine for the normalized vectors above. Reach for it
when the brute-force numpy search becomes the bottleneck on a large reference set;
for a few thousand records NumpyFlatIndex is fine.
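Conceptually, a flat inner-product index scores the query against every reference vector and keeps the best `top_k`. The pure-Python sketch below illustrates that exhaustive search; the function name is hypothetical, and real backends (a numpy matrix product, or FAISS's IndexFlatIP) compute the same scores far more efficiently.

```python
import heapq
from typing import Sequence


def flat_inner_product_search(
    reference: Sequence[Sequence[float]],
    query: Sequence[float],
    top_k: int,
) -> list[tuple[int, float]]:
    """Exact search: score the query against every reference vector.

    Returns (reference_index, score) pairs, best score first.
    """
    scored = (
        (sum(q * r for q, r in zip(query, ref)), idx)
        for idx, ref in enumerate(reference)
    )
    best = heapq.nlargest(top_k, scored)
    return [(idx, score) for score, idx in best]


# Unit-length reference vectors, so each score is a cosine similarity.
reference = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
hits = flat_inner_product_search(reference, query=[1.0, 0.0], top_k=2)
```

Because every reference vector is visited, the cost of one query grows linearly with the size of the reference set, which is why the backend matters once that set is large.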
Note
Incremental extended() and persistence of a FAISS-backed index are out of scope
for v1 — the Reference Store persists the numpy stack only.
LLM matching: LangChainMatcher
The matcher replaces the similarity gate with a language-model judgment.
from langchain_openai import ChatOpenAI
from denselinkage.matching import LangChainMatcher, RetryPolicy
matcher = LangChainMatcher(
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0.0, seed=42),
    prompt=(
        "Are these two records the same real-world entity?\n"
        "Record A: {record_a}\n"
        "Record B: {record_b}"
    ),
    retry=RetryPolicy(max_retries=3),
    max_concurrency=8,
)
The llm is injected — any LangChain chat model works (ChatOpenAI,
ChatAnthropic, …); model and operational config live on the model object, not on
the matcher. Set the provider’s credentials in the environment (e.g.
OPENAI_API_KEY).
Prompt contract. The prompt carries only the semantic question and must
reference {record_a} and {record_b} — the matcher fills them with the two
records’ serialized text. Do not ask for an output format: the matcher binds
structured output and returns typed MatchDecisions, so you never parse text and
a brittle “answer YES or NO” instruction is neither needed nor wanted. Each
decision carries is_match plus optional confidence / rationale when the model
supplies them.
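The placeholder half of the prompt contract can be checked before any model call. The sketch below is an assumption about how such a check might look, using only the standard library; `validate_prompt` and `render_prompt` are hypothetical names, and how LangChainMatcher actually substitutes the records is not shown in this page.

```python
from string import Formatter

REQUIRED_FIELDS = {"record_a", "record_b"}


def validate_prompt(prompt: str) -> None:
    """Raise ValueError if the template lacks a required placeholder."""
    fields = {name for _, name, _, _ in Formatter().parse(prompt) if name}
    missing = REQUIRED_FIELDS - fields
    if missing:
        raise ValueError(f"prompt is missing placeholders: {sorted(missing)}")


def render_prompt(prompt: str, record_a: str, record_b: str) -> str:
    """Fill the two placeholders with the records' serialized text."""
    validate_prompt(prompt)
    return prompt.format(record_a=record_a, record_b=record_b)


prompt = (
    "Are these two records the same real-world entity?\n"
    "Record A: {record_a}\n"
    "Record B: {record_b}"
)
text = render_prompt(prompt, "name: Google LLC", "name: Alphabet Inc.")
```

Note that the rendered text contains no output-format instruction, in line with the contract above.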
Operational knobs.
retry=RetryPolicy(max_retries=…, backoff_seconds=…): per-pair retries on a transient backend error.
max_concurrency=…: how many pairs are sent to the model in parallel.
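To make the two knobs concrete, here is a minimal standard-library sketch of per-pair retries with backoff and a bounded worker pool. It is not the matcher's implementation: the function names, the exponential backoff schedule, and treating every exception as transient are all assumptions made for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, TypeVar

P = TypeVar("P")
D = TypeVar("D")


def call_with_retry(
    judge: Callable[[P], D], pair: P, max_retries: int, backoff_seconds: float
) -> D:
    """Call judge(pair); on failure, wait and retry up to max_retries more times."""
    for attempt in range(max_retries + 1):
        try:
            return judge(pair)
        except Exception:  # assumption: every error is treated as transient
            if attempt == max_retries:
                raise  # retries exhausted; the sketch re-raises for brevity
            time.sleep(backoff_seconds * (2 ** attempt))  # assumed schedule
    raise AssertionError("unreachable for max_retries >= 0")


def decide_pairs(
    judge: Callable[[P], D],
    pairs: Sequence[P],
    max_retries: int = 3,
    backoff_seconds: float = 0.5,
    max_concurrency: int = 8,
) -> list[D]:
    """Judge all pairs with at most max_concurrency calls in flight."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(
            pool.map(
                lambda p: call_with_retry(judge, p, max_retries, backoff_seconds),
                pairs,
            )
        )
```

`pool.map` returns results in input order, so decisions stay aligned with their pairs regardless of which call finishes first.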
Failures are soft. A pair the matcher cannot decide after its retries becomes a
MatchError in result.errors — never an exception into the batch, so one bad
call cannot abort a run. Errored pairs are excluded from precision/recall and
counted as LinkageMetrics.n_errors:
result = linker.link(left, right)
for err in result.errors:
    print(err.reason)
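The effect of excluding errored pairs from precision and recall can be seen in a small worked example. The `Outcome` type and `score` function below are hypothetical stand-ins, assuming plain pairwise precision/recall; they are not the library's LinkageMetrics.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    """One judged pair; is_match is None when the pair errored (hypothetical type)."""

    truth: bool
    is_match: Optional[bool]


def score(outcomes: list[Outcome]) -> dict[str, float]:
    """Pairwise precision/recall over decided pairs, plus an error count."""
    decided = [o for o in outcomes if o.is_match is not None]
    tp = sum(1 for o in decided if o.is_match and o.truth)
    fp = sum(1 for o in decided if o.is_match and not o.truth)
    fn = sum(1 for o in decided if not o.is_match and o.truth)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "n_errors": len(outcomes) - len(decided),
    }


outcomes = [
    Outcome(truth=True, is_match=True),    # true positive
    Outcome(truth=True, is_match=False),   # false negative
    Outcome(truth=False, is_match=True),   # false positive
    Outcome(truth=True, is_match=None),    # errored: excluded, counted separately
]
metrics = score(outcomes)
```

Had the errored pair been counted as a missed match, recall in this example would drop from 1/2 to 1/3, which is why reporting it separately keeps backend outages from distorting the quality numbers.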
Tip
Cost & determinism. Every surviving candidate pair is one LLM call, so cost
scales with top_k × queries — tune top_k / similarity_threshold on the
blocker (or pre-filter with SimilarityThresholdFilter) to keep the matcher’s
workload down. Use temperature=0.0 (and a seed where supported) for
reproducible decisions.
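A back-of-the-envelope estimate of the matcher's workload follows directly from the cost rule above. The helper names and the example scores are illustrative assumptions, not library functions or measured data.

```python
def max_llm_calls(n_queries: int, top_k: int) -> int:
    """Upper bound on LLM calls: every query keeps all top_k candidates."""
    return n_queries * top_k


def surviving_calls(
    scores_per_query: list[list[float]], similarity_threshold: float
) -> int:
    """LLM calls actually needed once candidates below the threshold are dropped."""
    return sum(
        1
        for scores in scores_per_query
        for score in scores
        if score >= similarity_threshold
    )


# 10,000 queries at top_k=5 can cost up to 50,000 calls before filtering.
upper_bound = max_llm_calls(n_queries=10_000, top_k=5)

# Two made-up queries with three retrieved candidates each.
example_scores = [[0.95, 0.82, 0.40], [0.79, 0.60, 0.10]]
needed = surviving_calls(example_scores, similarity_threshold=0.80)
```

In the toy example only two of the six retrieved candidates clear the 0.80 cutoff, so the matcher would be called twice rather than six times.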
Putting it together
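from denselinkage import DenseLinker
from denselinkage.blocking import DenseBlocker
from denselinkage.embedding import SentenceTransformerEmbedder
from denselinkage.indexing import FaissFlatIndex
# matcher is the LangChainMatcher built above; left / right are the two inputs being linked
blocker = DenseBlocker(
    embedder=SentenceTransformerEmbedder("all-MiniLM-L6-v2"),
    vector_index=FaissFlatIndex(),
    similarity_threshold=0.80,  # retrieve top_k, then keep >= this
    top_k=5,
)
linker = DenseLinker(blocker=blocker, matcher=matcher)
result = linker.link(left, right)  # the same call as the lexical stack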
The full runnable assembly — with Source / serializers and evaluation — is
examples/01_end_to_end_linkage.py;
the deduplication shape is
examples/02_deduplication.py.
Both type-check and compile in CI but need the extras + an OPENAI_API_KEY to run.
See also
Choosing components — when each fork is worth its cost.
Linking — the orchestration verbs and the dependency-free path.
Reusing an index — persist embeddings to skip re-encoding.