Evaluation & metrics

The evaluators are pure functions over already-computed outputs. Each takes ground truth as a keyword-only gold argument and returns a typed report. They live in denselinkage.metrics.

Ground truth

Express gold matches as LabeledPairs:

from denselinkage import LabeledPairs

gold = LabeledPairs.from_pairs([("A1", "B1"), ("A2", "B2")])

To tune on one slice and evaluate on another, hold out a test split. The split is seeded for reproducibility and made by pair, so a record may appear in both halves, which is fine for tuning a scalar threshold:

train, test = gold.split(test_size=0.2, seed=0)   # -> (LabeledPairs, LabeledPairs)
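To make "seeded" and "split by pair" concrete, here is a minimal sketch of such a split in plain Python. It is not the library's implementation: `split_pairs` is a hypothetical helper, and the shuffle-then-cut strategy is an assumption.

```python
import random

def split_pairs(pairs, test_size=0.2, seed=0):
    """Hypothetical helper: seeded shuffle of pairs, then cut off a test slice."""
    shuffled = sorted(pairs)                 # sort first so input order cannot change the split
    random.Random(seed).shuffle(shuffled)    # private RNG, so the global random state is untouched
    n_test = round(len(shuffled) * test_size)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]

pairs = [(f"A{i}", f"B{i}") for i in range(10)]
train_pairs, test_pairs = split_pairs(pairs, test_size=0.2, seed=0)
```

Because the unit being shuffled is the pair, nothing stops the same record id from landing in both slices when it takes part in several gold pairs.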

Linkage quality

linkage_metrics() scores decided pairs against gold:

from denselinkage.metrics import linkage_metrics

m = linkage_metrics(result, gold=gold)     # -> LinkageMetrics
print(m.precision, m.recall, m.f1)
print(m.n_errors)                          # undecided pairs, excluded from P/R
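For reference, pairwise precision, recall, and F1 can be sketched over plain sets of pairs. This is only an illustration of the standard definitions; `pair_prf` is a hypothetical helper, and it does not model undecided pairs or the LinkageMetrics type.

```python
def pair_prf(predicted, gold):
    """Hypothetical helper: set-based precision, recall, and F1 over pairs."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)                       # pairs that are both predicted and gold
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold_pairs = {("A1", "B1"), ("A2", "B2")}
predicted_pairs = {("A1", "B1"), ("A3", "B9")}       # one hit, one false positive
precision, recall, f1 = pair_prf(predicted_pairs, gold_pairs)
```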

Direction matters for dedupe

Pairs from a link run are directed (left vs. right is meaningful); pairs from a dedupe run are not. Control this with the directed flag:

linkage_metrics(result, gold=gold, directed=False)   # canonicalize unordered pairs

Use directed=False when scoring a dedupe run so ("1","2") and ("2","1") count as the same gold pair.
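One way to picture the canonicalization is to sort each pair before comparing. The sketch below assumes that approach; `canonical` is a hypothetical helper, and the library may normalize pairs differently.

```python
def canonical(pair):
    """Hypothetical helper: order a pair so (a, b) and (b, a) compare equal."""
    a, b = pair
    return (a, b) if a <= b else (b, a)

gold_pairs = {("1", "2")}
predicted_pairs = {("2", "1")}

# Directed comparison: the orientation differs, so nothing matches.
directed_hits = len(predicted_pairs & gold_pairs)

# Undirected comparison: canonicalize both sides first.
undirected_hits = len(
    {canonical(p) for p in predicted_pairs} & {canonical(p) for p in gold_pairs}
)
```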

Blocking quality

Before matching, check that blocking actually retrieved the true pairs — otherwise no matcher can recover them:

from denselinkage import DenseLinker
from denselinkage.metrics import blocking_metrics, pair_completeness_at_k

# `block` returns the blocker's candidate pairs without matching. Use a `top_k`
# at least as large as the biggest k you want to sweep pair-completeness over.
candidates = DenseLinker.with_defaults().block(left, right, top_k=10)
bm = blocking_metrics(candidates, gold=gold, ks=[1, 5, 10])  # -> BlockingMetrics
print(bm.pc_at(5))
pc = pair_completeness_at_k(candidates, gold=gold, k=5)       # recall@k of the blocker

For a single-table dedupe blocker, query the index against its own source: DenseLinker.with_defaults().index(source).candidates(source).
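Pair completeness at k is the fraction of gold pairs whose right-hand record appears among the top k candidates for its left-hand record. The sketch below assumes candidates are available as a dict from left id to a ranked list of right ids; that dict is a stand-in for the library's candidate object, and `pair_completeness` is a hypothetical helper.

```python
def pair_completeness(ranked, gold, k):
    """Hypothetical helper: share of gold pairs found in the top-k candidates."""
    gold = set(gold)
    if not gold:
        return 0.0
    found = sum(1 for left, right in gold if right in ranked.get(left, [])[:k])
    return found / len(gold)

# Assumed layout: left id -> right ids, best candidate first.
ranked = {
    "A1": ["B1", "B7", "B3"],   # gold match at rank 1
    "A2": ["B9", "B4", "B2"],   # gold match at rank 3
    "A3": ["B5"],               # gold match never retrieved
}
gold_pairs = [("A1", "B1"), ("A2", "B2"), ("A3", "B3")]

pc_at_1 = pair_completeness(ranked, gold_pairs, k=1)
pc_at_3 = pair_completeness(ranked, gold_pairs, k=3)
```

Here pc@1 is 1/3 and pc@3 is 2/3. The pair for A3 stays missing at every k, which is exactly the kind of loss no matcher can repair.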

Clustering quality

After connected_components(), measure over- and under-merging with B³ (Bagga–Baldwin):

from denselinkage.metrics import clustering_metrics

cm = clustering_metrics(clusters, gold=gold)            # -> ClusteringMetrics
print(cm.b3_precision, cm.b3_recall, cm.b3_f1)
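B³ averages a per-record score: for each record, precision is the share of its predicted cluster that truly belongs with it, and recall is the share of its true cluster that was recovered. The sketch below assumes both sides are given as lists of sets of record ids, which differs from the pair-based gold used above; `b_cubed` is a hypothetical helper.

```python
def b_cubed(predicted_clusters, gold_clusters):
    """Hypothetical helper: B-cubed precision, recall, and F1 over record ids."""
    pred_of = {r: frozenset(c) for c in predicted_clusters for r in c}
    gold_of = {r: frozenset(c) for c in gold_clusters for r in c}
    records = pred_of.keys() & gold_of.keys()        # score only records present on both sides
    if not records:
        return 0.0, 0.0, 0.0
    precision = sum(len(pred_of[r] & gold_of[r]) / len(pred_of[r]) for r in records) / len(records)
    recall = sum(len(pred_of[r] & gold_of[r]) / len(gold_of[r]) for r in records) / len(records)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

predicted_clusters = [{"a", "b", "c"}, {"d"}]        # "c" wrongly merged, "d" wrongly split off
gold_clusters = [{"a", "b"}, {"c", "d"}]
b3_p, b3_r, b3_f1 = b_cubed(predicted_clusters, gold_clusters)
```

In this example the over-merge of "c" lowers precision to 2/3, and the split of {"c", "d"} lowers recall to 3/4.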

Threshold tuning

tune_threshold() sweeps the decision threshold over scored candidates and returns the full P/R/F1 curve as a ThresholdSweep; pick an operating point with best_f1() or at_recall(target):

from denselinkage import DenseLinker
from denselinkage.metrics import tune_threshold

candidates = DenseLinker.with_defaults().block(left, right, top_k=10)
sweep = tune_threshold(candidates, gold=gold)    # -> ThresholdSweep
threshold, m = sweep.best_f1()                    # the F1-optimal cut
print(threshold, m.precision, m.recall, m.f1)

By default the grid is the candidates’ distinct scores — the only cuts that change the prediction — so best_f1() finds the true optimum.
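A sweep over distinct scores can be sketched as follows. It assumes a pair is predicted a match when its score is greater than or equal to the threshold, which may not be the library's convention; `sweep_thresholds` and the (pair, score) input layout are hypothetical.

```python
def sweep_thresholds(scored, gold):
    """Hypothetical helper: P/R/F1 at every distinct score, cutting at score >= threshold."""
    gold = set(gold)
    curve = []
    for threshold in sorted({score for _, score in scored}):
        predicted = {pair for pair, score in scored if score >= threshold}
        tp = len(predicted & gold)
        precision = tp / len(predicted)              # never empty: threshold is one of the scores
        recall = tp / len(gold) if gold else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        curve.append((threshold, precision, recall, f1))
    return curve

scored = [
    (("A1", "B1"), 0.9),
    (("A2", "B2"), 0.7),
    (("A3", "B9"), 0.6),   # not a gold pair
    (("A4", "B4"), 0.4),
]
gold_pairs = {("A1", "B1"), ("A2", "B2"), ("A4", "B4")}

curve = sweep_thresholds(scored, gold_pairs)
best_threshold, best_p, best_r, best_f1 = max(curve, key=lambda row: row[3])
```

Any threshold strictly between two adjacent scores yields the same predictions as the higher of the two, so checking the distinct scores alone covers every distinct outcome.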

Separating blocker recall from matcher recall

A low end-to-end recall has two possible culprits: the blocker never surfaced the true pair, or the matcher saw it and said no. adjusted_metrics() separates them. It reports recall_adjusted = matcher.recall × pc@k, where matcher.recall is measured only over the gold pairs that blocking actually surfaced:

from denselinkage.metrics import adjusted_metrics

adj = adjusted_metrics(result, candidates, gold=gold, k=10)  # -> AdjustedMetrics
print(adj.blocking_recall_at_k, adj.recall_adjusted, adj.f1_adjusted)
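A small worked example shows how the two factors combine. The sets below are made up for illustration, and the code only reproduces the recall formula stated above; it does not cover f1_adjusted.

```python
gold_pairs = {("A1", "B1"), ("A2", "B2"), ("A3", "B3"), ("A4", "B4")}

# Gold pairs the blocker surfaced within the top k (illustrative data).
surfaced = {("A1", "B1"), ("A2", "B2"), ("A3", "B3")}

# Surfaced gold pairs the matcher accepted (illustrative data).
accepted = {("A1", "B1"), ("A2", "B2")}

pc_at_k = len(surfaced & gold_pairs) / len(gold_pairs)      # blocker recall: 3 of 4
matcher_recall = len(accepted & surfaced) / len(surfaced)   # matcher recall: 2 of 3
recall_adjusted = matcher_recall * pc_at_k                  # end-to-end: 2 of 4
```

The product comes out to 0.5, which is the share of all gold pairs that were both surfaced and accepted. Reading the factors separately shows that one miss came from the blocker and one from the matcher.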