Choosing components
denselinkage ships a dependency-free default stack and heavier adapters for when you outgrow it. This page is the decision guide.
The default stack
DenseLinker.with_defaults() wires:
| Stage | Default | Extra |
|---|---|---|
| Embedder | HashedNGramEmbedder | none |
| Vector index | NumpyFlatIndex | none |
| Blocker | | none |
| Matcher | ThresholdMatcher | none |
It runs with only NumPy and pandas — good for small-to-medium data and for matches recoverable from surface text.
Lexical vs semantic embeddings
- Lexical (HashedNGramEmbedder): character n-gram hashing. Recovers abbreviations, punctuation, and typos (Apple Inc/Apple Incorporated). Fast, dependency-free. Misses semantic renames (Google→Alphabet).
- Semantic (SentenceTransformerEmbedder, extra [sentence-transformers]): sentence embeddings that capture meaning. Reach for it when matches need world knowledge rather than shared characters.
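To make the lexical/semantic contrast concrete, here is a minimal, stdlib-only sketch of character n-gram hashing in general. It is not the HashedNGramEmbedder implementation: the function names, the trigram size, the 256-bucket dimension, and the use of CRC32 are all assumptions chosen for illustration.

```python
import math
import zlib


def char_ngrams(text: str, n: int = 3) -> list[str]:
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def hashed_ngram_vector(text: str, dim: int = 256, n: int = 3) -> list[float]:
    """Count n-grams into `dim` hash buckets, then L2-normalise the counts."""
    vec = [0.0] * dim
    for gram in char_ngrams(text, n):
        # crc32 is stable across runs, unlike Python's salted built-in hash().
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product of two unit vectors, which equals their cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


near = cosine(hashed_ngram_vector("Apple Inc"), hashed_ngram_vector("Apple Incorporated"))
far = cosine(hashed_ngram_vector("Google"), hashed_ngram_vector("Alphabet"))
print(f"Apple Inc vs Apple Incorporated: {near:.2f}")
print(f"Google vs Alphabet: {far:.2f}")
```

The first pair shares most of its trigrams, so it scores well above the second pair, which shares none. That is the gap a semantic embedder is meant to close.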
Scaling vector search
NumpyFlatIndex is exact and dependency-free but
brute-force. For large reference sets, switch to
FaissFlatIndex (extra [faiss]) — same
VectorIndex port, drop-in replacement.
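"Brute-force" here means each query is scored against every reference vector, so the cost per query grows linearly with the size of the reference set. The sketch below shows that idea with the standard library only. It is not NumpyFlatIndex's code, and flat_search is a hypothetical helper name. A NumPy version would typically replace the inner loop with a single matrix-vector product.

```python
import heapq


def flat_search(query, reference, k=2):
    """Exact top-k by inner product: score the query against every reference vector."""
    scored = (
        (sum(q * r for q, r in zip(query, vec)), idx)
        for idx, vec in enumerate(reference)
    )
    # nlargest keeps the k best (score, index) pairs, highest score first.
    return heapq.nlargest(k, scored)


# Three unit-length reference vectors and one unit-length query.
reference = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
]
hits = flat_search([0.8, 0.6, 0.0], reference, k=2)
for score, idx in hits:
    print(idx, round(score, 2))
```

Because nothing is approximated, the result is exact. The trade-off is that every added reference row adds work to every query.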
Threshold vs LLM matching
- ThresholdMatcher (default): decides on the blocker’s similarity score. Zero cost, fully deterministic. Use it as the second gate above the blocker’s retrieval threshold.
- LangChainMatcher (extra [langchain]): an LLM reads each pair and returns a typed decision with a rationale. Use it for hard pairs where similarity alone is ambiguous; pair it with a RetryPolicy.
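The "second gate" idea can be sketched as two thresholds applied in sequence: a permissive one at retrieval and a stricter one at decision time. The code below is an illustration under assumptions, not the ThresholdMatcher API. Candidate, retrieve, decide, both threshold values, and the scores are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    left: str
    right: str
    score: float  # similarity score attached by the blocker (made-up values below)


def retrieve(pairs, retrieval_threshold):
    """First gate (blocker side): keep pairs worth a closer look."""
    return [p for p in pairs if p.score >= retrieval_threshold]


def decide(pairs, match_threshold):
    """Second gate (matcher side): accept only pairs at or above the stricter cut-off."""
    return [(p, p.score >= match_threshold) for p in pairs]


pairs = [
    Candidate("Apple Inc", "Apple Incorporated", 0.91),
    Candidate("Apple Inc", "Applied Materials", 0.62),
    Candidate("Apple Inc", "Alphabet", 0.18),
]
decisions = decide(retrieve(pairs, retrieval_threshold=0.50), match_threshold=0.85)
for pair, is_match in decisions:
    print(pair.left, "|", pair.right, "|", "match" if is_match else "no match")
```

Pairs that fall between the two thresholds, like the middle one here, are the ambiguous cases where an LLM-based matcher may be worth its cost.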
Rule of thumb
Start with with_defaults(). Swap the embedder first if you are missing
semantic matches, the index if search is too slow, and the matcher last
if similarity cannot separate true from false pairs. Every swap is one
constructor argument — see Custom components.
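The general pattern behind a one-argument swap is constructor injection against a shared interface. The sketch below uses toy stand-ins only: ToyLinker, Matcher, FixedThreshold, AlwaysAccept, and the matcher keyword are hypothetical names, not denselinkage's actual classes or parameters. Custom components documents the real ones.

```python
from typing import Optional, Protocol


class Matcher(Protocol):
    """Hypothetical port: anything with this method can be plugged in."""

    def is_match(self, score: float) -> bool: ...


class FixedThreshold:
    """Toy default matcher: accept scores at or above a fixed cut-off."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def is_match(self, score: float) -> bool:
        return score >= self.threshold


class AlwaysAccept:
    """Toy replacement matcher, used only to show that the swap takes effect."""

    def is_match(self, score: float) -> bool:
        return True


class ToyLinker:
    """Toy linker: falls back to a default matcher unless one is injected."""

    def __init__(self, matcher: Optional[Matcher] = None) -> None:
        self.matcher = matcher if matcher is not None else FixedThreshold(0.85)

    def link(self, scores):
        return [self.matcher.is_match(s) for s in scores]


default_result = ToyLinker().link([0.9, 0.5])
swapped_result = ToyLinker(matcher=AlwaysAccept()).link([0.9, 0.5])
print(default_result, swapped_result)
```

Only the constructor call changes between the two runs; the calling code around link stays the same.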