Choosing components

denselinkage ships a default stack that needs no optional extras, plus heavier adapters for when you outgrow it. This page is the decision guide.

The default stack

DenseLinker.with_defaults() wires:

Stage         Default                         Extra
Embedder      HashedNGramEmbedder (lexical)   none
Vector index  NumpyFlatIndex                  none
Blocker       DenseBlocker                    none
Matcher       ThresholdMatcher                none

This stack runs with only NumPy and pandas installed, which makes it a good fit for small-to-medium data and for matches that are recoverable from surface text.
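To see why a flat index suits small-to-medium data, it helps to look at what brute-force search does: every query is compared against every stored vector. The sketch below is a generic NumPy illustration of that idea. FlatIndexSketch and its search method are hypothetical stand-ins written for this page, not the implementation or API of NumpyFlatIndex.

```python
import numpy as np


class FlatIndexSketch:
    """Hypothetical brute-force cosine index (not the library's NumpyFlatIndex)."""

    def __init__(self, vectors: np.ndarray) -> None:
        # Normalise rows once so a dot product equals cosine similarity.
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        self._vectors = vectors / np.clip(norms, 1e-12, None)

    def search(self, query: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
        # Cost grows linearly with the number of stored vectors.
        q = query / max(float(np.linalg.norm(query)), 1e-12)
        scores = self._vectors @ q
        k = min(k, scores.shape[0])
        top = np.argsort(-scores)[:k]
        return [(int(i), float(scores[i])) for i in top]


index = FlatIndexSketch(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
hits = index.search(np.array([1.0, 0.1]), k=2)
print(hits)
```

Because each search scans the whole matrix, this approach is simple and exact, but query time scales with the number of records.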

Lexical vs semantic embeddings

  • Lexical (HashedNGramEmbedder): character n-gram hashing. Recovers matches that differ only by abbreviation, punctuation, or typos (Apple Inc / Apple Incorporated). Fast, and needs no extra. Misses semantic renames (Google / Alphabet).

  • Semantic (SentenceTransformerEmbedder, extra [sentence-transformers]): sentence embeddings that capture meaning. Reach for it when matches need world knowledge rather than shared characters.
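The difference between the two cases can be seen with a small, self-contained sketch of character n-gram hashing. This is an illustration of the general technique only: the function name, the trigram size, the 1024-dimension vector, and the use of zlib.crc32 as the hash are all assumptions made for the example, not the behaviour of HashedNGramEmbedder.

```python
import zlib

import numpy as np


def hashed_ngram_embed(text: str, n: int = 3, dim: int = 1024) -> np.ndarray:
    """Hash character n-grams into a fixed-size, L2-normalised count vector."""
    padded = f" {text.lower()} "  # pad so word boundaries form n-grams too
    vec = np.zeros(dim, dtype=np.float64)
    for i in range(len(padded) - n + 1):
        gram = padded[i : i + n]
        # crc32 is deterministic across runs, unlike the built-in hash() for str.
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Inputs are already unit length, so the dot product is the cosine.
    return float(a @ b)


# The pair that shares most of its characters scores high.
surface_sim = cosine(
    hashed_ngram_embed("Apple Inc"), hashed_ngram_embed("Apple Incorporated")
)
# The renamed pair shares no trigrams, so it scores near zero.
rename_sim = cosine(hashed_ngram_embed("Google"), hashed_ngram_embed("Alphabet"))

print(f"surface: {surface_sim:.2f}  rename: {rename_sim:.2f}")
```

A lexical embedder can only reward overlapping characters, which is why a rename such as Google / Alphabet needs a semantic embedder instead.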

Threshold vs LLM matching

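  • ThresholdMatcher (default): decides on the blocker’s similarity score. Zero cost, fully deterministic. Use it as a second gate applied on top of the blocker’s retrieval threshold: the blocker decides which pairs become candidates, and the matcher decides which candidates count as matches.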

  • LangChainMatcher (extra [langchain]): an LLM reads each pair and returns a typed decision with a rationale. Use it for hard pairs where similarity alone is ambiguous; pair it with a RetryPolicy.
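The two gates, and the band of pairs that fall between them, can be sketched in plain Python. Everything here is hypothetical and for illustration only: ScoredPair, triage, the threshold values, and the scores are made up for the example and are not part of denselinkage. Routing the in-between band to a more expensive matcher is one possible design, not something the library is known to do automatically.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ScoredPair:
    left: str
    right: str
    score: float  # similarity score, assumed to lie in [0, 1]


def triage(
    pairs: Iterable[ScoredPair],
    retrieval_threshold: float = 0.5,
    match_threshold: float = 0.8,
) -> dict[str, list[ScoredPair]]:
    """Split pairs into dropped, ambiguous, and match using two score gates."""
    if match_threshold < retrieval_threshold:
        raise ValueError("match_threshold must not be below retrieval_threshold")
    result: dict[str, list[ScoredPair]] = {"match": [], "ambiguous": [], "dropped": []}
    for pair in pairs:
        if pair.score >= match_threshold:
            result["match"].append(pair)  # passes both gates
        elif pair.score >= retrieval_threshold:
            result["ambiguous"].append(pair)  # retrieved, but similarity alone is unclear
        else:
            result["dropped"].append(pair)  # never becomes a candidate
    return result


# Made-up scores, chosen only to land one pair in each bucket.
pairs = [
    ScoredPair("Apple Inc", "Apple Incorporated", 0.91),
    ScoredPair("Acme Ltd", "Acme Holdings", 0.63),
    ScoredPair("Google", "Alphabet", 0.07),
]
buckets = triage(pairs)
print({name: len(items) for name, items in buckets.items()})
```

If the ambiguous bucket stays large no matter where the thresholds sit, that is a sign that similarity alone cannot separate true from false pairs.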

Rule of thumb

Start with with_defaults(). Swap the embedder first if you are missing semantic matches, then the index if search is too slow, and the matcher last if similarity cannot separate true from false pairs. Every swap is one constructor argument; see Custom components.