denselinkage

Record linkage and deduplication — dense blocking, optional LLM matching, and evaluation built in.

Find the records that refer to the same real-world entity, either across two datasets (linkage) or within one (deduplication). denselinkage shrinks the all-pairs comparison with embedding-based blocking, decides each candidate pair with a pluggable matcher (a similarity threshold or an LLM), then clusters and scores the result. The core depends only on numpy and pandas; FAISS, sentence-transformers, and LangChain are opt-in extras you add when you need them.

from denselinkage import DenseLinker, LabeledPairs, Source
from denselinkage.metrics import linkage_metrics

# df_a and df_b are pandas DataFrames that each have an "id" column
linker = DenseLinker.with_defaults()        # sensible embedder + index + matcher
result = linker.link(Source(df_a, id_column="id"), Source(df_b, id_column="id"))
result.to_frame()                           # left_id, right_id, similarity, match, ...
linkage_metrics(result, gold=LabeledPairs.from_pairs([("A1", "B1")]))
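
To make the block-then-match idea concrete, here is a minimal numpy-only sketch of dense blocking followed by a threshold matcher. It is not denselinkage's implementation: `embed`, `block_and_match`, the character-trigram hashing embedder, and the default `k` and `threshold` values are all assumptions chosen for illustration.

```python
import zlib

import numpy as np


def embed(texts, dim=1024, n=3):
    """Toy lexical embedder (an assumption, not the library's):
    hash character n-grams into a fixed-size, L2-normalised vector."""
    out = np.zeros((len(texts), dim))
    for row, text in enumerate(texts):
        padded = f" {text.lower()} "
        for i in range(len(padded) - n + 1):
            gram = padded[i : i + n]
            # crc32 is stable across processes, unlike the built-in hash().
            out[row, zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    norms = np.linalg.norm(out, axis=1, keepdims=True)
    return out / np.where(norms == 0, 1.0, norms)


def block_and_match(left, right, k=2, threshold=0.5):
    """Blocking: keep only the k most similar right records per left record.
    Matching: flag a candidate pair when its cosine similarity >= threshold.
    Returns (left_index, right_index, similarity, match) tuples."""
    sims = embed(left) @ embed(right).T          # cosine similarity (rows are unit length)
    k = min(k, sims.shape[1])
    top_k = np.argsort(-sims, axis=1)[:, :k]     # brute-force nearest neighbours
    pairs = []
    for i, cols in enumerate(top_k):
        for j in cols:
            score = float(sims[i, j])
            pairs.append((i, int(j), score, score >= threshold))
    return pairs


if __name__ == "__main__":
    left = ["Acme Corporation", "Globex Inc"]
    right = ["ACME Corp.", "Globex Incorporated", "Initech LLC"]
    for pair in block_and_match(left, right):
        print(pair)
```

With `k` candidates per left record, the matcher sees `len(left) * k` pairs instead of `len(left) * len(right)`, which is the saving that blocking is meant to provide. A vector index such as FAISS would replace the brute-force `argsort` step when the right-hand side is large.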

Highlights

Minimal-dependency core: pip install denselinkage pulls in only numpy and pandas; the heavy backends are opt-in, and a CI job checks that they never leak into the core.
Swap any stage: the embedder, vector index, and matcher are independent ports, so you can go lexical → semantic, brute-force → FAISS, or threshold → LLM without rewriting the pipeline.
End to end: block → match → cluster → evaluate, with linkage, blocking, and clustering (B³) metrics included.
Immutable & typed: single link / dedupe / match_pairs calls, a frozen 1.0 contract, strict mypy, and 100% branch coverage.
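
For readers unfamiliar with the B³ clustering metric mentioned above, the following is a minimal sketch of how B³ precision, recall, and F1 are commonly defined. The function name `b_cubed` and its dict-based inputs are hypothetical and are not part of denselinkage's API; the library's own metric functions may differ in signature and edge-case handling.

```python
from collections import defaultdict


def b_cubed(predicted, gold):
    """Compute B-cubed precision, recall, and F1.

    predicted, gold: dicts mapping each record id to a cluster label.
    Both must cover exactly the same record ids.
    """
    if not predicted or predicted.keys() != gold.keys():
        raise ValueError("predicted and gold must cover the same non-empty set of ids")

    # Invert the label maps into cluster-label -> set of member ids.
    pred_clusters = defaultdict(set)
    gold_clusters = defaultdict(set)
    for rid, label in predicted.items():
        pred_clusters[label].add(rid)
    for rid, label in gold.items():
        gold_clusters[label].add(rid)

    precision = 0.0
    recall = 0.0
    for rid in predicted:
        p = pred_clusters[predicted[rid]]    # predicted cluster containing rid
        g = gold_clusters[gold[rid]]         # gold cluster containing rid
        overlap = len(p & g)                 # always >= 1, since rid is in both
        precision += overlap / len(p)
        recall += overlap / len(g)

    n = len(predicted)
    precision /= n
    recall /= n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because the score is averaged per record rather than per pair, a single oversized predicted cluster lowers precision for every record in it, while splitting a true cluster lowers recall for every record that lost a neighbour.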