denselinkage
Record linkage and deduplication — dense blocking, optional LLM matching, and evaluation built in.
Find the records that refer to the same real-world entity, across two datasets (linkage) or within one (deduplication). denselinkage shrinks the all-pairs comparison with embedding-based blocking, decides each candidate pair with a pluggable matcher (a similarity threshold or an LLM), then clusters and scores the result. The core depends only on numpy and pandas; FAISS, sentence-transformers, and LangChain are opt-in extras you add when you need them.
from denselinkage import DenseLinker, LabeledPairs, Source
from denselinkage.metrics import linkage_metrics
linker = DenseLinker.with_defaults() # sensible embedder + index + matcher
result = linker.link(Source(df_a, id_column="id"), Source(df_b, id_column="id"))
result.to_frame() # left_id, right_id, similarity, match, ...
linkage_metrics(result, gold=LabeledPairs.from_pairs([("A1", "B1")]))
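To make the blocking and threshold-matching idea concrete, here is a minimal numpy-only sketch of the general technique. It does not use denselinkage or reflect its internals: the character-trigram embedder, the helper names (trigrams, embed, top_k_candidates), the sample names, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import numpy as np

left = ["Acme Corporation", "Globex Inc"]
right = ["Globex Incorporated", "ACME Corp", "Initech LLC"]


def trigrams(text):
    """Character trigrams of the lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def embed(texts, vocab):
    """Toy lexical embedder: trigram counts, L2-normalised per row."""
    vecs = np.zeros((len(texts), len(vocab)))
    for row, text in enumerate(texts):
        for gram in trigrams(text):
            vecs[row, vocab[gram]] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-12)


def top_k_candidates(left_vecs, right_vecs, k):
    """Blocking: keep only the k most similar right rows per left row."""
    sims = left_vecs @ right_vecs.T  # cosine similarity (rows are unit length)
    k = min(k, sims.shape[1])
    top = np.argsort(-sims, axis=1)[:, :k]
    return [
        (i, int(j), float(sims[i, j]))
        for i in range(sims.shape[0])
        for j in top[i]
    ]


# One shared vocabulary so both tables live in the same vector space.
all_grams = sorted({g for text in left + right for g in trigrams(text)})
vocab = {gram: index for index, gram in enumerate(all_grams)}

candidates = top_k_candidates(embed(left, vocab), embed(right, vocab), k=2)

# Threshold matcher: accept a candidate when its similarity clears the bar.
THRESHOLD = 0.5  # illustrative value, not a recommended default
matches = [(i, j) for i, j, sim in candidates if sim >= THRESHOLD]

for i, j in matches:
    print(left[i], "<->", right[j])
```

Blocking with k=2 scores four candidate pairs instead of all six; on real tables the saving grows with the size of the right-hand table, which is why a brute-force index like this one is typically swapped for an approximate one at scale.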
Highlights
Dependency-free core: pip install denselinkage installs just numpy + pandas; the heavy backends are opt-in, and a CI job proves they never leak into the core.
Swap any stage: the embedder, vector index, and matcher are independent ports, so you can go from lexical to semantic, brute-force to FAISS, or threshold to LLM with no pipeline rewrite.
End to end: block → match → cluster → evaluate, with linkage, blocking, and clustering (B³) metrics included.
Immutable & typed: single link / dedupe / match_pairs calls, a frozen 1.0 contract, strict mypy, and 100% branch coverage.
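For readers unfamiliar with the cluster and evaluate stages, the sketch below shows one common way to do both: connected components over matched pairs, scored with B³ precision and recall. It is a standalone illustration under assumptions, not denselinkage's implementation; the function names, record ids, and gold clusters are made up for the example.

```python
from collections import defaultdict


def connected_components(ids, matched_pairs):
    """Cluster records by treating each matched pair as an undirected edge."""
    parent = {record: record for record in ids}

    def find(record):
        # Walk to the root, halving the path as we go.
        while parent[record] != record:
            parent[record] = parent[parent[record]]
            record = parent[record]
        return record

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for record in ids:
        clusters[find(record)].add(record)
    return list(clusters.values())


def b_cubed(predicted, gold):
    """B-cubed precision and recall, averaged over every record."""
    pred_of = {record: cluster for cluster in predicted for record in cluster}
    gold_of = {record: cluster for cluster in gold for record in cluster}
    records = list(gold_of)
    precision = sum(
        len(pred_of[r] & gold_of[r]) / len(pred_of[r]) for r in records
    ) / len(records)
    recall = sum(
        len(pred_of[r] & gold_of[r]) / len(gold_of[r]) for r in records
    ) / len(records)
    return precision, recall


ids = ["r1", "r2", "r3", "r4", "r5"]
matched_pairs = [("r1", "r2"), ("r2", "r3")]  # hypothetical matcher output
gold = [{"r1", "r2"}, {"r3", "r4"}, {"r5"}]   # hypothetical ground truth

predicted = connected_components(ids, matched_pairs)
precision, recall = b_cubed(predicted, gold)
print(sorted(sorted(c) for c in predicted))
print(round(precision, 3), round(recall, 3))
```

Here the single wrong link r2-r3 pulls r3 into the r1/r2 cluster, so precision drops for all three records, while recall drops only for r3 and r4, whose gold cluster was split.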
New here? The tutorial links two tables stage by stage; the Semantic + LLM guide covers the production stack.
Install, then link two tables in under five minutes.
Task recipes: linking, deduplication, the semantic + LLM stack, and evaluation.
The curated surface — prelude, adapters, and the core contract.
Ports and adapters, the spec→artifact law, and the design record.
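As a rough picture of what a port means here, the sketch below defines a matcher interface with two interchangeable implementations. Every name in it (Matcher, ThresholdMatcher, JudgeMatcher, match_candidates) is hypothetical and chosen for illustration; denselinkage's actual port definitions may look quite different. The judge callable stands in for an LLM call so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class Matcher(Protocol):
    """Hypothetical port: decide whether one candidate pair is a match."""

    def decide(self, left: str, right: str, similarity: float) -> bool: ...


@dataclass(frozen=True)
class ThresholdMatcher:
    """Adapter that accepts a pair when its similarity clears a threshold."""

    threshold: float

    def decide(self, left: str, right: str, similarity: float) -> bool:
        return similarity >= self.threshold


@dataclass(frozen=True)
class JudgeMatcher:
    """Adapter that delegates to any callable judge, e.g. a wrapped LLM."""

    judge: Callable[[str, str], bool]

    def decide(self, left: str, right: str, similarity: float) -> bool:
        return self.judge(left, right)


def match_candidates(candidates, matcher: Matcher):
    """The pipeline depends only on the port, never on a concrete adapter."""
    return [(l, r) for l, r, sim in candidates if matcher.decide(l, r, sim)]


candidates = [
    ("Acme Corp", "ACME Corporation", 0.67),
    ("Acme Corp", "Apex Corp", 0.55),
]


def same_first_word(left: str, right: str) -> bool:
    # Stand-in for an LLM judgement; a real judge would call a model.
    return left.split()[0].lower() == right.split()[0].lower()


by_threshold = match_candidates(candidates, ThresholdMatcher(threshold=0.5))
by_judge = match_candidates(candidates, JudgeMatcher(judge=same_first_word))
print(len(by_threshold), len(by_judge))
```

Because match_candidates only sees the Matcher protocol, swapping the threshold for the judge changes one argument and nothing else in the pipeline.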