denselinkage.clustering.connected_components

denselinkage.clustering.connected_components(result: LinkageResult, *, all_record_ids: Iterable[str] | None = None) -> ClusteringResult

Connected-components clustering: transitively close the matched pairs in result and label each record with its component id. Convenience form of ConnectedComponentsClusterer; kept in the prelude.

Nodes are every record the pipeline compared, that is, every record appearing in result.decisions (matched or not) or in result.errors. A record that was never matched, including one whose pairs the matcher could not decide, therefore becomes its own singleton cluster. Edges are the matched pairs only. Clustering is transitive: if A matches B and B matches C, all three share a cluster even if A and C were never matched directly. Cluster ids are 0..n_clusters-1, assigned deterministically by each component’s smallest record id.
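The node, edge and labelling rules above can be illustrated with a small union-find sketch. This is not the library's implementation: label_components is a hypothetical helper, plain lists stand in for the contents of result, and it assumes that "assigned by smallest record id" means components are numbered in ascending order of their smallest id (compared as strings).

```python
from collections.abc import Iterable


def label_components(
    nodes: Iterable[str],
    matched_pairs: Iterable[tuple[str, str]],
) -> dict[str, int]:
    """Hypothetical sketch: label each node with a connected-component id."""
    # Every compared record starts as the root of its own singleton component.
    parent = {n: n for n in nodes}

    def find(x: str) -> str:
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        # Tolerate endpoints that were not listed in `nodes`.
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Attach the larger root under the smaller one, so each root
            # is always the smallest record id in its component.
            small, large = sorted((root_a, root_b))
            parent[large] = small

    # Assumption: number components by ascending smallest record id.
    roots = sorted({find(n) for n in parent})
    cluster_id = {root: i for i, root in enumerate(roots)}
    return {n: cluster_id[find(n)] for n in parent}


# "a"-"b" and "b"-"c" matched; "d" and "e" were compared but never matched.
labels = label_components(["a", "b", "c", "d", "e"], [("a", "b"), ("b", "c")])
print(labels)  # {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 2}
```

Here "a" and "c" share cluster 0 although they were never matched directly, and the unmatched records "d" and "e" each form a singleton.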

A record that produced no candidate pair at all (e.g. dedupe with a small top_k whose only neighbour was itself) does not appear in result. Pass all_record_ids to seed the clustering universe with the full id set: every listed record is labelled, and an unmatched one becomes its own singleton, so clustering_metrics reports a complete B³ over all records instead of a score inflated by the omission of the dropped records. Ids are stringified to match record ids; None (the default) limits the universe to the records seen in result.
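The effect of all_record_ids on the set of labelled records can be sketched as follows. seed_universe is a hypothetical helper, not part of denselinkage, and it assumes the universe is the union of the ids seen in result and the stringified ids passed in.

```python
from collections.abc import Iterable
from typing import Optional


def seed_universe(
    seen_ids: Iterable[str],
    all_record_ids: Optional[Iterable[object]] = None,
) -> set[str]:
    """Hypothetical sketch: the set of record ids that will receive a label."""
    # Records that appeared in at least one compared pair.
    universe = {str(i) for i in seen_ids}
    if all_record_ids is not None:
        # Assumption: seeded ids are stringified and unioned with the seen
        # ids; any id not seen in a pair later becomes a singleton cluster.
        universe.update(str(i) for i in all_record_ids)
    return universe


# Record 3 produced no candidate pair, so it is absent from the seen ids.
default_universe = seed_universe({"1", "2"})
seeded_universe = seed_universe({"1", "2"}, all_record_ids=[1, 2, 3])
print(sorted(default_universe))  # ['1', '2']
print(sorted(seeded_universe))   # ['1', '2', '3']
```

With the default, record 3 is missing from the labels entirely; with the seeded universe it is present and would be scored as a singleton.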