Reusing an index
link and dedupe build everything in one call. When you query the same
reference dataset many times — a parameter sweep, streaming queries, or matching
several frames against one master table — build the index once and reuse it.
idx = linker.index(left) # build the blocking index once -> LinkageIndex
idx.query(right_a) # -> LinkageResult
idx.query(right_b) # reuses the built index; no re-embedding of `left`
index() returns a
LinkageIndex — the prepared, per-dataset state
separated from the linker’s configuration.
Why this is safe to reuse
The index is an immutable artifact. Internally a
Blocker is a stateless spec whose build()
produces a fresh BlockingIndex holding the
indexed vectors; the spec mutates neither itself nor its inputs. So one built
index answers many queries with no risk of cross-query state leaking, and the
same DenseLinker can build many indexes over different datasets.
This is the spec → artifact law that runs through the whole library; see Architecture.
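The spec → artifact split described above can be illustrated with a small, library-independent sketch. The names ToyBlockerSpec and ToyIndex are hypothetical stand-ins, not denselinkage classes, and the dot-product lookup is only a placeholder for whatever the real BlockingIndex does internally.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ToyIndex:
    """Immutable artifact (hypothetical): holds the indexed vectors only."""
    vectors: Tuple[Tuple[float, ...], ...]

    def query(self, q: Sequence[float]) -> int:
        """Return the position of the stored vector with the largest dot product."""
        scores = [sum(a * b for a, b in zip(v, q)) for v in self.vectors]
        return max(range(len(scores)), key=scores.__getitem__)


@dataclass(frozen=True)
class ToyBlockerSpec:
    """Stateless spec (hypothetical): configuration only, no per-dataset state."""
    normalize: bool = True

    def build(self, rows: Sequence[Sequence[float]]) -> ToyIndex:
        # Copy into tuples so the artifact never aliases the caller's lists.
        out = []
        for row in rows:
            norm = sum(x * x for x in row) ** 0.5 if self.normalize else 1.0
            norm = norm or 1.0  # guard against all-zero rows
            out.append(tuple(x / norm for x in row))
        return ToyIndex(vectors=tuple(out))


spec = ToyBlockerSpec()
data_a = [[1.0, 0.0], [0.0, 2.0]]
data_b = [[3.0, 3.0]]
index_a = spec.build(data_a)  # one spec ...
index_b = spec.build(data_b)  # ... many independent artifacts
first = index_a.query([0.0, 1.0])
second = index_a.query([1.0, 0.0])
```

Because both classes are frozen and build() copies its input, repeated queries cannot change the index, and building a second index leaves the first one and the input data untouched.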
Sweeping query-time parameters
top_k and similarity_threshold are query-time parameters with sensible
defaults from the blocker spec, so a threshold/top_k sweep reuses one built
index instead of rebuilding it — the expensive embedding work happens once.
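A sweep of this kind might look like the sketch below. It is a self-contained toy, not the denselinkage API: CountingEmbedder and ToyIndex are hypothetical stand-ins, and passing top_k and similarity_threshold as keyword arguments to query is an assumption about how query-time parameters are supplied. The call counter exists only to show that the reference set is embedded once, however many parameter combinations are tried.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]


class CountingEmbedder:
    """Hypothetical stand-in embedder: normalized counts of the letters a-e."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, text: str) -> Vector:
        self.calls += 1
        counts = [float(text.count(ch)) for ch in "abcde"]
        norm = sum(c * c for c in counts) ** 0.5 or 1.0
        return tuple(c / norm for c in counts)


class ToyIndex:
    """Hypothetical index: embeds the reference set once, at build time."""

    def __init__(self, reference: Sequence[str], embedder: CountingEmbedder) -> None:
        self._vectors = tuple(embedder(t) for t in reference)  # the expensive step

    def query(self, q: Vector, top_k: int = 2,
              similarity_threshold: float = 0.0) -> List[int]:
        # Score every reference vector, keep the top_k best, then apply the threshold.
        scored = [(sum(a * b for a, b in zip(v, q)), i)
                  for i, v in enumerate(self._vectors)]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [i for score, i in scored[:top_k] if score >= similarity_threshold]


embedder = CountingEmbedder()
index = ToyIndex(["aab", "abc", "cde"], embedder)  # 3 embedding calls
query_vec = embedder("ab")                         # 1 more for the query

results: Dict[Tuple[int, float], List[int]] = {}
for top_k, threshold in product([1, 2], [0.5, 0.9]):
    results[(top_k, threshold)] = index.query(
        query_vec, top_k=top_k, similarity_threshold=threshold
    )
```

After the loop, embedder.calls is still 4: three calls for the reference set and one for the query, regardless of how many (top_k, threshold) pairs were evaluated.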
Persisting a built index
Building the index embeds the whole reference set — the expensive step. Save the
built LinkageIndex and reload it later to query new
data with no re-embedding:
idx = linker.index(left)
idx.save("master_index") # writes a vectors.npy + meta.json bundle

# Later (for example, in a new process): reload the saved bundle.
from denselinkage import LinkageIndex
from denselinkage.embedding import HashedNGramEmbedder
from denselinkage.matching import ThresholdMatcher
reloaded = LinkageIndex.load(
"master_index",
embedder=HashedNGramEmbedder(n_features=1024, ngram=3), # must match the saved model
matcher=ThresholdMatcher(threshold=0.5),
)
reloaded.query(right) # reuses the stored embeddings of `left`
The store records the embedder’s model_id and embedding_dim as provenance:
if the embedder you pass to load doesn’t match, load raises
IncompatibleStore, because a persisted index cannot be
queried with a different embedding model. The matcher is not persisted, so supply
one at load; any matcher will do. Persistence currently covers the dependency-free reference stack;
other backends raise NotImplementedError.
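The provenance check can be pictured with a minimal, standard-library-only sketch. It is not the library's implementation: save_store, load_store, and ToyIncompatibleStore are hypothetical names, the model_id string is made up, and vectors are written as JSON here rather than as vectors.npy so the example has no NumPy dependency.

```python
import json
import tempfile
from pathlib import Path
from typing import List


class ToyIncompatibleStore(Exception):
    """Hypothetical stand-in for the library's IncompatibleStore error."""


def save_store(path: Path, vectors: List[List[float]], model_id: str) -> None:
    """Write the vectors plus a meta.json recording the embedder's provenance."""
    path.mkdir(parents=True, exist_ok=True)
    meta = {"model_id": model_id, "embedding_dim": len(vectors[0])}
    (path / "meta.json").write_text(json.dumps(meta))
    (path / "vectors.json").write_text(json.dumps(vectors))


def load_store(path: Path, model_id: str, embedding_dim: int) -> List[List[float]]:
    """Reload the vectors, refusing an embedder that differs from the saved one."""
    meta = json.loads((path / "meta.json").read_text())
    saved = (meta["model_id"], meta["embedding_dim"])
    given = (model_id, embedding_dim)
    if saved != given:
        raise ToyIncompatibleStore(f"store built with {saved}, loaded with {given}")
    return json.loads((path / "vectors.json").read_text())


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "master_index"
    save_store(store, [[0.1, 0.2], [0.3, 0.4]], model_id="hashed-ngram-3")

    # Matching provenance: the stored vectors come back unchanged.
    loaded = load_store(store, model_id="hashed-ngram-3", embedding_dim=2)

    # Mismatched provenance: the load is rejected instead of returning
    # vectors that the new embedder could not meaningfully be compared against.
    try:
        load_store(store, model_id="other-model", embedding_dim=2)
        mismatch_rejected = False
    except ToyIncompatibleStore:
        mismatch_rejected = True
```

Checking provenance at load time, rather than at query time, means an incompatible embedder fails immediately instead of silently producing meaningless similarity scores.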