Choosing components

denselinkage ships a default stack that needs no optional extras, plus heavier adapters for when you outgrow it. This page is a guide to choosing between them.

The default stack

`DenseLinker.with_defaults()` wires the following components:

| Stage | Default | Required extra |
| --- | --- | --- |
| Embedder | `HashedNGramEmbedder` (lexical) | none |
| Vector index | `NumpyFlatIndex` | none |
| Blocker | `DenseBlocker` | none |
| Matcher | `ThresholdMatcher` | none |

This stack runs with only NumPy and pandas installed. It suits small-to-medium datasets and matches that can be recovered from surface text.

Lexical vs semantic embeddings

- Lexical (`HashedNGramEmbedder`): hashes character n-grams. It recovers matches that differ only by abbreviation, punctuation, or typos (Apple Inc / Apple Incorporated). It is fast and needs no extra, but it misses semantic renames (Google → Alphabet).
- Semantic (`SentenceTransformerEmbedder`, extra `[sentence-transformers]`): sentence embeddings that capture meaning. Reach for it when matches depend on world knowledge rather than shared characters.
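To make the lexical case concrete, here is a minimal sketch of character n-gram hashing using only NumPy and the standard library. It is not the `HashedNGramEmbedder` implementation: the function names, the trigram size, the 4096-bucket dimension, and the use of `zlib.crc32` are all assumptions chosen for illustration.

```python
import zlib

import numpy as np


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Lowercase the text and return its overlapping character n-grams."""
    text = text.lower()
    if len(text) < n:
        return [text] if text else []
    return [text[i : i + n] for i in range(len(text) - n + 1)]


def hashed_ngram_embed(text: str, dim: int = 4096, n: int = 3) -> np.ndarray:
    """Count n-grams into `dim` hash buckets and L2-normalise the result."""
    vec = np.zeros(dim, dtype=np.float32)
    for gram in char_ngrams(text, n):
        # crc32 is stable across runs, unlike Python's salted built-in hash().
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def cosine(a: str, b: str) -> float:
    """Cosine similarity of two strings under the hashed n-gram embedding."""
    return float(hashed_ngram_embed(a) @ hashed_ngram_embed(b))


# Illustrative pairs taken from the text above.
lexical_pair = cosine("Apple Inc", "Apple Incorporated")
rename_pair = cosine("Google", "Alphabet")
print(f"lexical pair: {lexical_pair:.2f}, rename pair: {rename_pair:.2f}")
```

The first pair shares most of its trigrams, so it scores well above the second pair, which shares none. That gap is the one a semantic embedder is meant to close.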

Scaling vector search

`NumpyFlatIndex` is exact and needs no extra, but it is brute-force: every query is compared against every reference vector. For large reference sets, switch to `FaissFlatIndex` (extra `[faiss]`), which implements the same `VectorIndex` port and is a drop-in replacement.
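The cost of exact brute-force search is easiest to see in code. The sketch below is a hypothetical `BruteForceIndex` written for this page, not the `NumpyFlatIndex` source. It assumes cosine similarity and a `search(query, k)` method, neither of which is confirmed by the library.

```python
import numpy as np


class BruteForceIndex:
    """Exact cosine search: one matrix product against every stored vector."""

    def __init__(self, vectors: np.ndarray) -> None:
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        # Leave all-zero rows untouched instead of dividing by zero.
        self._unit = vectors / np.where(norms == 0, 1.0, norms)

    def search(self, query: np.ndarray, k: int) -> tuple[np.ndarray, np.ndarray]:
        q = query / max(float(np.linalg.norm(query)), 1e-12)
        scores = self._unit @ q          # O(N * d): touches every row
        k = min(k, scores.shape[0])
        top = np.argsort(-scores)[:k]    # highest similarity first
        return top, scores[top]


# Synthetic reference set: 1000 random 64-dimensional vectors.
rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 64))
index = BruteForceIndex(reference)

# Querying with a stored vector should return that vector's own row first.
ids, scores = index.search(reference[42], k=3)
print(ids[0], round(float(scores[0]), 3))
```

Each query costs time proportional to the number of reference vectors times their dimension, which is why a flat index stops being comfortable as the reference set grows.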

Threshold vs LLM matching

- `ThresholdMatcher` (default): decides on the blocker’s similarity score. Zero cost and fully deterministic. Use it as a second, stricter gate above the blocker’s retrieval threshold.
- `LangChainMatcher` (extra `[langchain]`): an LLM reads each pair and returns a typed decision with a rationale. Use it for hard pairs where similarity alone is ambiguous, and pair it with a `RetryPolicy`.
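The two-gate idea can be sketched in a few lines. Everything here is hypothetical: the `Candidate` type, the `threshold_match` function, the threshold values 0.5 and 0.8, and the scores in the example are made up for illustration and are not denselinkage's API or defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    left: str
    right: str
    score: float  # similarity reported by the blocker


def threshold_match(
    candidates: list[Candidate],
    retrieve_at: float = 0.5,
    match_at: float = 0.8,
) -> tuple[list[Candidate], list[Candidate]]:
    """Return (accepted, ambiguous) pairs under two score gates."""
    if match_at < retrieve_at:
        raise ValueError("match_at should not be below retrieve_at")
    accepted: list[Candidate] = []
    ambiguous: list[Candidate] = []
    for cand in candidates:
        if cand.score < retrieve_at:
            continue  # gate 1: never retrieved as a candidate
        if cand.score >= match_at:
            accepted.append(cand)  # gate 2: confident match
        else:
            ambiguous.append(cand)  # retrieved, but similarity is inconclusive
    return accepted, ambiguous


# Made-up scores, for illustration only.
pairs = [
    Candidate("Apple Inc", "Apple Incorporated", 0.91),
    Candidate("Apple Inc", "Applied Inc", 0.67),
    Candidate("Apple Inc", "Alphabet", 0.12),
]
accepted, ambiguous = threshold_match(pairs)
print([c.right for c in accepted], [c.right for c in ambiguous])
```

Pairs that land between the two gates are the ones where similarity alone does not settle the question, which is the situation an LLM-based matcher is intended for.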

Rule of thumb

Start with `with_defaults()`. Swap the embedder first if you are missing semantic matches, then the index if search is too slow, and the matcher last if similarity cannot separate true pairs from false ones. Every swap is one constructor argument; see Custom components.
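The "one constructor argument" pattern relies on components sharing a port. The sketch below illustrates that general design with `typing.Protocol`. `MiniLinker`, `ExactIndex`, and `ChunkedIndex` are hypothetical stand-ins and the `search` signature is an assumption; consult Custom components for the real constructor arguments.

```python
from typing import Protocol

import numpy as np


class VectorIndex(Protocol):
    """Hypothetical port: anything with this method can be plugged in."""

    def search(self, query: np.ndarray, k: int) -> list[int]: ...


class ExactIndex:
    """Scores every stored vector with a single matrix product."""

    def __init__(self, vectors: np.ndarray) -> None:
        self._vectors = vectors

    def search(self, query: np.ndarray, k: int) -> list[int]:
        scores = self._vectors @ query
        return np.argsort(-scores)[:k].tolist()


class ChunkedIndex:
    """Same ranking as ExactIndex, but scans the vectors in fixed-size chunks."""

    def __init__(self, vectors: np.ndarray, chunk: int = 128) -> None:
        self._vectors = vectors
        self._chunk = chunk

    def search(self, query: np.ndarray, k: int) -> list[int]:
        parts = [
            self._vectors[i : i + self._chunk] @ query
            for i in range(0, len(self._vectors), self._chunk)
        ]
        return np.argsort(-np.concatenate(parts))[:k].tolist()


class MiniLinker:
    """Depends only on the port, so the index is a single swappable argument."""

    def __init__(self, index: VectorIndex) -> None:
        self._index = index

    def candidates(self, query: np.ndarray, k: int = 3) -> list[int]:
        return self._index.search(query, k)


rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 16))
query = rng.normal(size=16)

exact_ids = MiniLinker(ExactIndex(vectors)).candidates(query)
chunked_ids = MiniLinker(ChunkedIndex(vectors)).candidates(query)
print(exact_ids == chunked_ids)
```

Because `MiniLinker` never names a concrete index class, replacing one backend with another changes only the argument passed at construction time.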