denselinkage.linkage.DenseLinker

class denselinkage.linkage.DenseLinker(*, blocker: Blocker | None = None, matcher: Matcher)

Bases: object

Pure config. link(a, b) == index(a).query(b).

The constructor is keyword-only so that blocker can be optional while matcher stays required, and to rule out positional ambiguity: callers always write DenseLinker(blocker=..., matcher=...).

blocker is optional: developers who already have candidate pairs from rule-based or external blocking do not need one. link, block, dedupe, and index require a blocker and raise ValueError if it is None; match_pairs does not, because it is inference with no blocking step (nothing is learned on the linker, so the immutable-config contract holds). with_defaults always yields a linker with a blocker.

classmethod with_defaults(*, blocker: Blocker | None = None, matcher: Matcher | None = None) -> DenseLinker

Low-floor entry point: wires the dependency-free reference stack (HashedNGramEmbedder + NumpyFlatIndex behind DenseBlocker, plus ThresholdMatcher). Pass blocker= / matcher= to override either half. The default stack is lexical (character n-gram hashing): it recovers abbreviations, punctuation differences, and typos, but not semantic renames. Imports are local so that importing denselinkage stays light.
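Why a lexical stack catches typos but not semantic renames can be seen in a toy version of character n-gram hashing. The sketch below is an assumption-laden illustration, not HashedNGramEmbedder: the function names, the trigram size, the 256-bucket width, and the use of zlib.crc32 are all choices made here for the example.

```python
import math
import zlib


def hashed_ngram_vector(text: str, n: int = 3, dim: int = 256) -> list[float]:
    """Count character n-grams into `dim` hash buckets (illustrative only)."""
    padded = f" {text.lower()} "
    vec = [0.0] * dim
    for i in range(len(padded) - n + 1):
        gram = padded[i : i + n]
        # crc32 is stable across processes, unlike the built-in hash().
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    return vec


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


base = hashed_ngram_vector("International Business Machines")
# A one-character typo keeps most trigrams, so similarity stays high.
typo_sim = cosine(base, hashed_ngram_vector("Internatonal Business Machines"))
# A semantic rename shares almost no trigrams, so similarity is low.
rename_sim = cosine(base, hashed_ngram_vector("Big Blue"))
```

Here typo_sim comes out well above rename_sim, which is the behaviour the default stack's description implies; a semantic embedder would be needed to pull the rename closer.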

index(source: Source) -> LinkageIndex

Build the prepared linkage state by delegating indexing to self.blocker.build (which returns a fresh BlockingIndex — this frozen config is never mutated) and composing it with self.matcher.

Raises ValueError if blocker is None. The RecordReader seam can raise UnknownIdColumn, EmptySource, or DuplicateRecordId; InvalidTopK is raised if the blocker's top_k <= 0, and DimensionMismatch if the embedder width differs from the index. These five are all denselinkage.core.errors subclasses.

Two-table linkage.

Raises ValueError if blocker is None; otherwise the same denselinkage.core.errors taxonomy as index (UnknownIdColumn, EmptySource, DuplicateRecordId, InvalidTopK, DimensionMismatch), evaluated for each of left/right.

block(left: Source, right: Source, *, top_k: int | None = None, similarity_threshold: float | None = None) -> list[CandidatePair]

Blocking-only two-table affordance: returns the blocker's CandidatePair objects for left / right without matching. Just as link(a, b) == index(a).query(b), block(a, b) == index(a).candidates(b). Feed the result to blocking_metrics / pair_completeness_at_k.

top_k / similarity_threshold override the blocker’s spec for this call (e.g. a large top_k to sweep pair-completeness). Raises ValueError if blocker is None (via index()); otherwise the same denselinkage.core.errors taxonomy as link().
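A per-call top_k override and a pair-completeness sweep might look like the following. This is a self-contained sketch under stated assumptions: the block and pair_completeness functions here are hypothetical stand-ins operating on a precomputed score table, not the library's block method or its pair_completeness_at_k, and default_top_k plays the role of the blocker's own spec.

```python
from collections import defaultdict
from typing import Optional

Pair = tuple[str, str]  # (left_id, right_id)


def block(
    scores: dict[Pair, float],
    *,
    default_top_k: int = 1,
    top_k: Optional[int] = None,
    similarity_threshold: Optional[float] = None,
) -> set[Pair]:
    """Keep, per right-hand record, the best k left-hand records."""
    k = default_top_k if top_k is None else top_k  # per-call override
    by_right: dict[str, list[tuple[float, str]]] = defaultdict(list)
    for (left_id, right_id), score in scores.items():
        if similarity_threshold is None or score >= similarity_threshold:
            by_right[right_id].append((score, left_id))
    kept: set[Pair] = set()
    for right_id, scored in by_right.items():
        for _, left_id in sorted(scored, reverse=True)[:k]:
            kept.add((left_id, right_id))
    return kept


def pair_completeness(candidates: set[Pair], truth: set[Pair]) -> float:
    """Fraction of true matching pairs that survive blocking."""
    return len(candidates & truth) / len(truth) if truth else 1.0


scores = {
    ("l1", "r1"): 0.9, ("l2", "r1"): 0.8,
    ("l1", "r2"): 0.7, ("l2", "r2"): 0.6,
}
truth = {("l2", "r1"), ("l1", "r2")}

# Sweep k to see how many true pairs each candidate budget retains.
sweep = {k: pair_completeness(block(scores, top_k=k), truth) for k in (1, 2)}
```

In this toy data, raising k from 1 to 2 lifts pair completeness from 0.5 to 1.0, which is the kind of trade-off a large top_k sweep is meant to expose.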

dedupe(source: Source) -> LinkageResult

Single-table dedupe (self-pairs suppressed).

Raises ValueError if blocker is None; otherwise the same denselinkage.core.errors taxonomy as index.
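Self-pair suppression can be illustrated by comparing a table against itself and skipping identical ids. The sketch assumes dedupe amounts to querying a source against its own records; the dedupe_candidates function, the dict-based source, and the normalised-name comparison are hypothetical. Whether mirrored pairs such as (a, b) and (b, a) are also collapsed is not stated above, so the sketch leaves both in.

```python
def dedupe_candidates(names: dict[str, str]) -> list[tuple[str, str]]:
    """Pair each record with every other record sharing its normalised name."""
    pairs = []
    ids = sorted(names)
    for query_id in ids:
        for hit_id in ids:
            if hit_id == query_id:
                continue  # self-pair: a record trivially matches itself
            if names[hit_id].strip().lower() == names[query_id].strip().lower():
                pairs.append((query_id, hit_id))
    return pairs


records = {"a": "Acme Corp", "b": "acme corp ", "c": "Zenith"}
duplicates = dedupe_candidates(records)
```

Without the identity check, every record would appear as its own best match and drown out the genuine duplicates.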

match_pairs(candidates: Sequence[CandidatePair]) -> LinkageResult

Matcher-only path: score externally supplied candidate pairs (e.g. rule-based / pre-blocked) with self.matcher, skipping blocking. Does not require blocker. Result flows through the same LinkageResult / metrics path as link.

To build the input from a DataFrame of id-pairs, use denselinkage.candidate_pairs_from_frame (an id-pair frame + the two sources -> list[CandidatePair]). No Source-validation errors are raised here, because the method takes pre-built CandidatePair objects (whose similarity_score may be None); backend matcher failures surface per-pair as MatchError, never as exceptions.
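The "failures surface per-pair, never as exceptions" behaviour can be sketched as a loop that converts a matcher failure into a result entry. All names below (SketchPair, SketchDecision, SketchMatchError, flaky_matcher, and this free-standing match_pairs) are hypothetical; the real CandidatePair, MatchError, and LinkageResult types may carry different fields.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Union


@dataclass(frozen=True)
class SketchPair:
    left_id: str
    right_id: str
    similarity_score: Optional[float] = None  # external pairs may carry no score


@dataclass(frozen=True)
class SketchDecision:
    pair: SketchPair
    is_match: bool


@dataclass(frozen=True)
class SketchMatchError:
    pair: SketchPair
    reason: str


def match_pairs(
    candidates: Sequence[SketchPair],
    matcher: Callable[[SketchPair], bool],
) -> list[Union[SketchDecision, SketchMatchError]]:
    results: list[Union[SketchDecision, SketchMatchError]] = []
    for pair in candidates:
        try:
            results.append(SketchDecision(pair, matcher(pair)))
        except Exception as exc:  # a backend failure affects only this pair
            results.append(SketchMatchError(pair, str(exc)))
    return results


def flaky_matcher(pair: SketchPair) -> bool:
    # Simulated backend that fails on one record and matches on id suffix.
    if pair.right_id == "r2":
        raise RuntimeError("backend timeout")
    return pair.left_id[1:] == pair.right_id[1:]


results = match_pairs(
    [SketchPair("l1", "r1"), SketchPair("l2", "r2", similarity_score=0.4)],
    flaky_matcher,
)
```

Recording the failure alongside successful decisions keeps a single bad pair from discarding the rest of a batch, at the cost of requiring callers to check each entry's type.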