denselinkage.core.results.LabeledPairs

class denselinkage.core.results.LabeledPairs(pairs: frozenset[tuple[str, str]])

Bases: object

The gold set of true matches. The same type is used everywhere gold pairs are consumed.

Pairs are stored exactly as given, as ordered tuples (left_id, right_id); no symmetrization happens at construction. How evaluation compares gold and result pairs depends on the verb:

  • link (two sources): the order is meaningful. A gold (a, b) matches only a result pair whose left id is a and right id is b.

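  • dedupe (one source): left/right is arbitrary, so metrics canonicalize both gold and result pairs to an unordered key (frozenset({a, b})) before comparing. A gold (a, b) therefore still matches a result reported as (b, a), instead of silently counting as both a missed match (lower recall) and a false positive (lower precision).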
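The two comparison modes above can be illustrated with a minimal sketch. The helper true_positives and its verb argument are hypothetical and are not part of denselinkage; the real logic lives in the metric functions.

```python
def true_positives(gold, predicted, *, verb):
    """Hypothetical helper: count predicted pairs that appear in the gold set."""
    if verb == "link":
        # Two sources: (left_id, right_id) order is meaningful, compare as-is.
        return len(set(gold) & set(predicted))
    if verb == "dedupe":
        # One source: canonicalize each pair to an unordered key first.
        gold_keys = {frozenset(pair) for pair in gold}
        pred_keys = {frozenset(pair) for pair in predicted}
        return len(gold_keys & pred_keys)
    raise ValueError(f"unknown verb: {verb!r}")


gold = frozenset({("a", "b"), ("c", "d")})
predicted = {("b", "a"), ("c", "d")}  # first pair is reversed

link_tp = true_positives(gold, predicted, verb="link")      # ("b", "a") does not match
dedupe_tp = true_positives(gold, predicted, verb="dedupe")  # ("b", "a") matches
```

Under the ordered comparison the reversed pair is both a miss and a false positive; under the unordered comparison it is a hit.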

See the docstrings of linkage_metrics and pair_completeness_at_k for which comparison each one applies.

split(*, test_size: float, seed: int | None = None) → tuple[LabeledPairs, LabeledPairs]

Partition the gold pairs into (train, test).

test_size is the fraction in [0.0, 1.0] routed to test: round(test_size * n) of the n gold pairs go to test and the rest go to train. Pairs are sorted before a seeded shuffle, so the split is reproducible for a given seed; seed=None gives a nondeterministic split. Raises ValueError if test_size is outside [0.0, 1.0].
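The behavior described above can be sketched as a standalone function. This is an illustration under assumptions, not the library's implementation: split_pairs is a hypothetical name, it works on plain frozensets rather than LabeledPairs, and taking the first round(test_size * n) shuffled pairs as the test half is a guess at a detail the docstring does not state.

```python
import random


def split_pairs(pairs, *, test_size, seed=None):
    """Hypothetical sketch of a reproducible pair-level (train, test) split."""
    if not 0.0 <= test_size <= 1.0:
        raise ValueError("test_size must be in [0.0, 1.0]")
    # Sorting first makes the shuffle independent of set iteration order.
    ordered = sorted(pairs)
    # random.Random(None) seeds from the OS or clock, so it is nondeterministic.
    random.Random(seed).shuffle(ordered)
    n_test = round(test_size * len(ordered))
    # Assumption: the first n_test shuffled pairs form the test half.
    test = frozenset(ordered[:n_test])
    train = frozenset(ordered[n_test:])
    return train, test


gold = frozenset((f"l{i}", f"r{i}") for i in range(10))
train, test = split_pairs(gold, test_size=0.3, seed=7)
```

Note that Python's built-in round rounds exact halves to the nearest even integer, so round(2.5) is 2 rather than 3.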

The split is pair-level: a record/entity may appear in both halves, which is fine for tuning a scalar decision threshold. For entity-disjoint evaluation (e.g. of a trained matcher) split by gold cluster instead.
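For the entity-disjoint case, one possible approach is to group gold pairs into connected components (gold clusters) and assign whole clusters to one half. The sketch below is an assumption-laden illustration: cluster_split is a hypothetical function that is not part of denselinkage, test_size here is a fraction of clusters rather than of pairs, and record ids are assumed to be unique across sources (if left and right ids can collide, tag each id with its side first).

```python
import random
from collections import defaultdict


def cluster_split(pairs, *, test_size, seed=None):
    """Hypothetical sketch: split gold pairs so no record id spans both halves."""
    if not 0.0 <= test_size <= 1.0:
        raise ValueError("test_size must be in [0.0, 1.0]")

    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two ids of every gold pair into one cluster.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Group pairs by the cluster of their left id.
    clusters = defaultdict(list)
    for pair in sorted(pairs):
        clusters[find(pair[0])].append(pair)

    # Sort the groups so the seeded shuffle is reproducible.
    groups = sorted(clusters.values())
    random.Random(seed).shuffle(groups)
    n_test = round(test_size * len(groups))
    test = frozenset(p for group in groups[:n_test] for p in group)
    train = frozenset(p for group in groups[n_test:] for p in group)
    return train, test


gold = frozenset({("a", "b"), ("b", "c"), ("d", "e"), ("f", "g")})
cl_train, cl_test = cluster_split(gold, test_size=1 / 3, seed=0)
train_ids = {i for pair in cl_train for i in pair}
test_ids = {i for pair in cl_test for i in pair}
```

Because clusters differ in size, the fraction of pairs that ends up in test can deviate from test_size.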