Linking two tables
link(left, right) matches records across two datasets. The
quickstart uses with_defaults(); this page
shows full component control — assembling the blocker and matcher yourself.
Assemble the components
A DenseLinker is pure configuration. You inject a
Blocker (here a
DenseBlocker composing an embedder and a vector
index) and a Matcher:
from denselinkage import DenseLinker, Source, TemplateSerializer
from denselinkage.blocking import DenseBlocker
from denselinkage.embedding import SentenceTransformerEmbedder
from denselinkage.indexing import FaissFlatIndex
from denselinkage.matching import LangChainMatcher

linker = DenseLinker(
    blocker=DenseBlocker(
        embedder=SentenceTransformerEmbedder("all-MiniLM-L6-v2"),
        vector_index=FaissFlatIndex(),
        similarity_threshold=0.80,  # retrieve top_k, then keep >= this
        top_k=5,
    ),
    matcher=LangChainMatcher(llm=...),
)
The embedder and the vector index are independent ports — swap the index (NumPy ↔ FAISS) without touching the embedder, and vice versa.
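To make the "independent ports" idea and the `top_k` / `similarity_threshold` comment concrete, here is a minimal pure-Python sketch. It is not denselinkage's implementation: the `VectorIndex` protocol, `BruteForceIndex`, and `candidates` are hypothetical names invented for illustration, and cosine similarity is assumed as the score.

```python
from __future__ import annotations

import math
from typing import Protocol, Sequence

Vector = Sequence[float]


class VectorIndex(Protocol):
    """Hypothetical port: stores vectors and returns the nearest ones."""

    def add(self, ids: Sequence[str], vectors: Sequence[Vector]) -> None: ...

    def search(self, query: Vector, top_k: int) -> list[tuple[str, float]]: ...


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class BruteForceIndex:
    """Exhaustive search; a stand-in for a NumPy or FAISS flat index."""

    def __init__(self) -> None:
        self._items: list[tuple[str, Vector]] = []

    def add(self, ids: Sequence[str], vectors: Sequence[Vector]) -> None:
        self._items.extend(zip(ids, vectors))

    def search(self, query: Vector, top_k: int) -> list[tuple[str, float]]:
        scored = [(item_id, cosine(query, vec)) for item_id, vec in self._items]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


def candidates(
    index: VectorIndex, query: Vector, top_k: int, threshold: float
) -> list[tuple[str, float]]:
    """Retrieve top_k neighbours first, then keep those scoring >= threshold."""
    return [(i, s) for i, s in index.search(query, top_k) if s >= threshold]


index = BruteForceIndex()
index.add(["B1", "B2", "B3"], [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]])
kept = candidates(index, [1.0, 0.0], top_k=2, threshold=0.9)
print(kept)
```

Because `candidates` depends only on the `search` method, any index with that shape could be swapped in without changing how the query vectors are produced, which is the separation the paragraph above describes.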
Reconcile differing schemas
Each Source carries its own serializer, so two tables with different column
names share one template via column_mapping:
template = "Name: {name}, City: {city}"
left = Source(df_a, id_column="id_a", serializer=TemplateSerializer(template))
right = Source(df_b, id_column="id_b", serializer=TemplateSerializer(
    template, column_mapping={"company_name": "name", "headquarters": "city"}))

result = linker.link(left, right)  # one call, no mutation
result.to_frame()  # left_id, right_id, similarity, match, confidence, reason
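The effect of `column_mapping` can be pictured with plain dictionaries. The sketch below is an illustration, not TemplateSerializer's actual code: `render` is a hypothetical helper, and it assumes the mapping goes from source column name to template field name, as the snippet above suggests.

```python
from __future__ import annotations


def render(
    row: dict[str, str],
    template: str,
    column_mapping: dict[str, str] | None = None,
) -> str:
    """Rename the row's columns to template field names, then fill the template."""
    mapping = column_mapping or {}
    # Columns without a mapping entry keep their own name.
    fields = {mapping.get(col, col): value for col, value in row.items()}
    # str.format ignores extra keyword arguments such as the id column.
    return template.format(**fields)


template = "Name: {name}, City: {city}"
row_a = {"id_a": "A1", "name": "Apple Inc", "city": "Cupertino"}
row_b = {
    "id_b": "B1",
    "company_name": "Apple Incorporated",
    "headquarters": "Cupertino, CA",
}

text_a = render(row_a, template)
text_b = render(row_b, template, {"company_name": "name", "headquarters": "city"})
print(text_a)
print(text_b)
```

Both rows end up as text in the same shape, so the embedder and matcher see comparable strings regardless of the original column names.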
Full example
"""Example 01: End-to-End Dense Linkage (full control).

Explicitly assembled components: a dense blocker (SentenceTransformer
embeddings + a FAISS index) then an LLM matcher. Vector indexes live in
``denselinkage.indexing`` (their own port, parallel to embedders).

The prompt carries ONLY the semantic question: the matcher owns output
structure and returns typed ``MatchDecision``s, so a brittle "Answer YES or
NO" instruction is neither needed nor wanted.

NOTE: Runs with the heavy extras installed (``pip install "denselinkage[all]"``)
and an ``OPENAI_API_KEY``. It uses ``SentenceTransformerEmbedder``,
``FaissFlatIndex``, and a ``LangChainMatcher`` over a live LLM. It is type-checked
and compiled in CI but not executed there (needs the extras + a key). For a
runnable end-to-end on the dependency-free stack, see ``00`` / ``04``.
"""

import pandas as pd
from langchain_openai import ChatOpenAI

from denselinkage import DenseLinker, LabeledPairs, Source, TemplateSerializer
from denselinkage.blocking import DenseBlocker
from denselinkage.embedding import SentenceTransformerEmbedder
from denselinkage.indexing import FaissFlatIndex
from denselinkage.matching import LangChainMatcher, RetryPolicy
from denselinkage.metrics import linkage_metrics


def main() -> None:
    df_a = pd.DataFrame(
        {
            "id_a": ["A1", "A2", "A3"],
            "name": ["Apple Inc", "Microsoft Corp", "Google LLC"],
            "city": ["Cupertino", "Redmond", "Mountain View"],
        }
    )
    df_b = pd.DataFrame(
        {
            "id_b": ["B1", "B2", "B3"],
            "company_name": ["Apple Incorporated", "Microsoft", "Alphabet"],
            "headquarters": ["Cupertino, CA", "Redmond, WA", "Mountain View, CA"],
        }
    )
    gold = LabeledPairs.from_pairs([("A1", "B1"), ("A2", "B2"), ("A3", "B3")])

    # Blocker: embedder and vector index injected independently.
    blocker = DenseBlocker(
        embedder=SentenceTransformerEmbedder("all-MiniLM-L6-v2"),
        vector_index=FaissFlatIndex(),
        similarity_threshold=0.80,  # retrieve top_k, then keep >= this
        top_k=5,
    )

    # Matcher: the LLM is injected; model / operational / domain config stay
    # separate. The prompt is just the question, with no format instruction.
    matcher = LangChainMatcher(
        llm=ChatOpenAI(model="gpt-4o-mini", temperature=0.0, seed=42),
        prompt=(
            "Are these two records the same real-world entity?\n"
            "Record A: {record_a}\n"
            "Record B: {record_b}"
        ),
        retry=RetryPolicy(max_retries=3),
        max_concurrency=8,
    )

    # The linker is pure config: no data, no schema, nothing fitted.
    linker = DenseLinker(blocker=blocker, matcher=matcher)

    # Schema travels with each frame. df_b maps its columns onto the shared
    # template; the linker never learns either schema.
    template = "Name: {name}, City: {city}"
    left = Source(df_a, id_column="id_a", serializer=TemplateSerializer(template))
    right = Source(
        df_b,
        id_column="id_b",
        serializer=TemplateSerializer(
            template,
            column_mapping={"company_name": "name", "headquarters": "city"},
        ),
    )

    result = linker.link(left, right)  # one call, no mutation

    print("\n--- Match Results ---")
    # Fixed schema, independent of input column names:
    #   left_id, right_id, similarity (float), match (bool),
    #   confidence (float|None), reason (str|None). One row per *decided*
    #   pair (matches AND non-matches); pairs the matcher could not decide
    #   are in result.errors, not rows here.
    print(result.to_frame())

    print("\n--- Evaluation Metrics ---")
    metrics = linkage_metrics(result, gold=gold)  # -> LinkageMetrics
    print(f"Precision: {metrics.precision:.4f}")
    print(f"Recall: {metrics.recall:.4f}")
    print(f"F1 Score: {metrics.f1:.4f}")
    if metrics.n_errors:
        print(f"(errored pairs excluded from P/R: {metrics.n_errors})")


if __name__ == "__main__":
    main()
Note
This example uses the heavy adapters (SentenceTransformerEmbedder,
FaissFlatIndex, LangChainMatcher) behind the [all] extra and a live LLM, so
it needs those extras and an OPENAI_API_KEY to run. It is type-checked and
compiled in CI but not executed there. See Semantic + LLM matching
for a walk-through of the knobs.
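For readers who want to interpret the precision, recall, and F1 printed by the example, the sketch below computes the standard pair-level definitions over sets of (left_id, right_id) tuples. It is an assumption that linkage_metrics follows these definitions; `precision_recall_f1` is a hypothetical helper, and the predicted pairs are made up for illustration rather than taken from a real run.

```python
from __future__ import annotations

Pair = tuple[str, str]


def precision_recall_f1(
    predicted: set[Pair], gold: set[Pair]
) -> tuple[float, float, float]:
    """Pair-level scores: predicted holds pairs judged a match, gold the true pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


gold = {("A1", "B1"), ("A2", "B2"), ("A3", "B3")}
# Illustrative outcome: two true pairs found, the third one missed.
predicted = {("A1", "B1"), ("A2", "B2")}

precision, recall, f1 = precision_recall_f1(predicted, gold)
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
```

Pairs that the matcher could not decide would appear in neither set under this sketch, which is one plausible reading of "errored pairs excluded from P/R".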
See also
Reusing an index: amortize the build across many queries.
Evaluation: score the run with linkage_metrics.
Choosing components: lexical vs semantic vs LLM.