Instance matching strategy

Purpose

To create a record matching strategy that steadily improves the accuracy of instance matching.

Matching Strategy

  • As each record comes in, we generate match points for it. Match points are just UUIDs derived from the data in specified fields.

  • We then use those points to find cluster records with the same match points.

  • If we match 0 clusters, we add a new cluster record for the current resource.

  • If we match exactly 1 cluster record then 1 of 3 things happens:

    1. If the current resource generates a low number of points (currently 1 or 2), we treat this as a match and add it to the cluster, but log the fact that the resource generated a small number of comparison points.

    2. If the current resource generates many points for comparison but only 1 of them matches, we treat this as not confident enough to cluster at this stage, so a new cluster record is created for this resource. As the process continues, that resource may be consumed if richer records also match it, but initially we don't cluster it. We really need a second step here.

    3. If multiple match points match the single cluster record, we add this resource to that cluster.

  • If multiple cluster records match from multiple different points, we currently just combine the clusters and add the resource. This needs extension and better "merging" rules, but this is what we have at the moment.
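
The decision rules above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the actual implementation: the field list, the namespace, the use of uuid5, and the names match_points, Cluster, and assign are all hypothetical. The text only says that match points are UUIDs based on data in specified fields and that a "low number" is currently 1 or 2.

```python
import logging
import uuid
from dataclasses import dataclass, field

log = logging.getLogger("matching")

# Assumptions for illustration: the namespace and the field list are made up.
MATCH_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "example:match-points")
MATCH_FIELDS = ("isbn", "issn", "lccn", "title")
LOW_POINT_THRESHOLD = 2  # "currently 1 or 2" in the text


def match_points(record: dict) -> set:
    """Return one deterministic UUID per populated match field."""
    return {
        uuid.uuid5(MATCH_NAMESPACE, f"{name}:{record[name]}")
        for name in MATCH_FIELDS
        if record.get(name)
    }


@dataclass(eq=False)  # compare clusters by identity, not by content
class Cluster:
    points: set = field(default_factory=set)
    members: list = field(default_factory=list)


def assign(record: dict, clusters: list) -> Cluster:
    """Place a record in a cluster following the rules described above."""
    points = match_points(record)
    # Clusters sharing at least one point, with the number of shared points.
    hits = [(c, len(c.points & points)) for c in clusters if c.points & points]

    if not hits:
        # 0 clusters matched: start a new cluster.
        target = Cluster()
        clusters.append(target)
    elif len(hits) == 1:
        cluster, shared = hits[0]
        if len(points) <= LOW_POINT_THRESHOLD:
            # Case 1: few points generated; accept the match but log it.
            log.info("low point count (%d) for matched record", len(points))
            target = cluster
        elif shared == 1:
            # Case 2: many points but only one matched; not confident enough.
            target = Cluster()
            clusters.append(target)
        else:
            # Case 3: several points matched the same cluster.
            target = cluster
    else:
        # Several clusters matched: combine them into the first one.
        target = hits[0][0]
        for other, _ in hits[1:]:
            target.points |= other.points
            target.members.extend(other.members)
            clusters.remove(other)

    target.points |= points
    target.members.append(record)
    return target
```

In this sketch a resource that falls under case 2 keeps its own cluster, but because it still shares a point with the existing cluster, a later record matching that point hits both clusters and triggers the combine step. That is one possible reading of "may be consumed" above.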

Refinements

Data normalization

  • “Data coherence”: multiple variant strings with the same meaning are reduced to a single equivalent standardized string.

    • Example 1: “Vol.”, “v.”, and “volume” have the same meaning. As a result, for the sake of coherence, we can distill all three strings down to one string, “Volume”.

    • Example 2: Large print.

    • Example 3:
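
As an illustration of Example 1 only, the reduction of volume variants could be sketched in Python as below. The function name normalize_volume and the regular expression are hypothetical. The rule that a variant is rewritten only when a digit follows is an added assumption, intended to avoid touching unrelated uses of "v." such as "Smith v. Jones".

```python
import re

# Assumption: only rewrite a variant when it directly precedes a number.
_VOLUME_VARIANT = re.compile(r"\b(?:volume|vol\.?|v\.)\s*(?=\d)", re.IGNORECASE)


def normalize_volume(text: str) -> str:
    """Reduce 'Vol.', 'v.', and 'volume' before a number to 'Volume '."""
    return _VOLUME_VARIANT.sub("Volume ", text)
```

Normalizing before match points are generated would mean that strings differing only in such variants produce the same UUID, which is presumably the point of the refinement.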
