Testing clustering with the DCB hub discovery scaffold app

Introduction

The DCB hub discovery scaffold app is a lightweight search application that gives users an “under the hood” view of the DCB union catalog. For our purposes, it can be used to identify clustered instance records and their corresponding holdings. When users click on a search results entry, they are taken to a full record display that shows not just holdings but enough information to discern whether clustering is happening according to the current algorithm.

To show which parts of the scaffold app help identify clustering, a marked-up version of the detailed record display follows this section. To reproduce this display on the scaffold, search for "piggies" and select the first result: Piggies / written by Don and Audrey Wood; illustrated by Don Wood. (1st ed).

Scaffold screenshot

image-20230926-003831.png

Terms

Selected record ID: the UUID of a DCB instance record. One DCB instance record, originating from a contributing library’s catalog, is selected to represent the related clustered bib records.

Cluster ID: the DCB’s UUID for the group of clustered instances.

Local bib record ID: the local bib record unique identifier.

Identifiers (): the record identifiers used to determine whether one instance is the same as another instance for the purposes of clustering.

Guidance

  • Find a few records in your library’s catalog that you know can be found in other working group member library catalogs. I have found popular works and western classics (e.g., To Kill a Mockingbird, Plato’s The Republic, Great Expectations) to be a likely source of records in common.

  • When you have identified a handful of records and have confirmed their presence in the scaffold app, copy the title, the DCB cluster bib id, and the URL to a document used for your testing.

  • For those records that are already clustered, create a copy of each from within your catalog and suppress each record before saving it. You will use these as your test records.

  • With each test record, remove one of the identifiers and then unsuppress the bib. Observe what happens with the clustering/unclustering in DCB hub and record it in a Jira testing feedback issue comment field. Repeat this process until your source record is unclustered, then begin to add the identifiers back one by one, recording results in the Jira testing feedback issue as you go, until the record returns to its original state and is reclustered. Please note that in production the harvesting process is set to run every 4 to 24 hours (configured according to the amount of expected traffic). For our testing, we’ve set the DCB to harvest each library every 2 minutes so that changes can be tested effectively.

  • Please record any other relevant observations.
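The remove-then-restore cycle described above can be laid out as a checklist before you start. The following is a minimal sketch only: the function name `plan_steps` and the identifier labels are illustrative assumptions, not part of DCB or the scaffold app. It lists every step of a full cycle; in practice you would stop removing identifiers as soon as the record is unclustered and restore only what you removed.

```python
def plan_steps(identifiers):
    """Return (action, identifier, identifiers left on the bib) for a full
    remove-then-restore cycle. Hypothetical helper for planning test notes."""
    present = list(identifiers)
    steps = []
    for ident in identifiers:            # removal phase, one identifier at a time
        present.remove(ident)
        steps.append(("remove", ident, list(present)))
    for ident in reversed(identifiers):  # restore phase, last removed first
        present.append(ident)
        steps.append(("add", ident, list(present)))
    return steps


# Print a numbered checklist; the labels here are placeholders for whatever
# identifiers your test record actually carries.
for n, (action, ident, left) in enumerate(plan_steps(["ISBN", "OCLC", "LCCN"]), start=1):
    print(f"{n}. {action} {ident}; identifiers now on bib: {', '.join(left) or 'none'}; observed: ")
```

Each printed line can be pasted into the Jira testing feedback comment and completed after the next harvest cycle has run.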

Clustering algorithm

Currently we have the "quick" match process, which is designed to be the first-iteration algorithm and is expected to capture the vast majority of matches.

  • As each record comes in, DCB generates match points for it. These are treated as UUIDs and are based on the data in specified fields (e.g., ISBN, ISSN, OCLC, LCCN, Title Key).

  • We then use those points to find cluster records with the same match points.

  • If there are no matches on clusters, then we add a new cluster record for the current resource.

  • If there is a match on exactly 1 cluster record then 1 of 3 things happens:

    • If the current resource generates a low number of points (currently 1 or 2), then the records are considered a match. The resource is added to the cluster, and the fact that it generated a small number of comparison points is logged.

    • If the current resource generates many points for comparison but there is a match on only 1 field, this is treated as not confident enough at this stage to cluster, and so a new cluster record is created for this resource.

    • If multiple match points match a single cluster record, the resource is added to that cluster.

  • If multiple cluster records match from multiple different points, the matching clusters currently add the resource.
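The decision steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DCB's implementation: the names `match_points`, `Cluster`, and `assign` are hypothetical, match points are derived here with `uuid.uuid5` as one way to get a deterministic UUID from field data, clusters are assumed to accumulate the match points of their members, and the multiple-cluster case is read as "add the resource to every matching cluster".

```python
import logging
import uuid
from dataclasses import dataclass, field

log = logging.getLogger("cluster-sketch")

LOW_POINT_THRESHOLD = 2  # "currently 1 or 2" in the description above


def match_points(identifiers):
    """One deterministic UUID per non-empty (field name, value) pair.
    Assumption: uuid5 over a normalized "name:value" string."""
    return {
        uuid.uuid5(uuid.NAMESPACE_URL, f"{name.lower()}:{value.strip().lower()}")
        for name, value in identifiers.items()
        if value and value.strip()
    }


@dataclass
class Cluster:
    cluster_id: uuid.UUID = field(default_factory=uuid.uuid4)
    points: set = field(default_factory=set)
    members: list = field(default_factory=list)

    def add(self, record_id, points):
        self.members.append(record_id)
        self.points |= points  # assumption: a cluster keeps its members' points


def assign(record_id, identifiers, clusters):
    """Place one record and return a label for the branch that was taken."""
    points = match_points(identifiers)
    matched = [c for c in clusters if c.points & points]

    if not matched:
        outcome = "new"  # no cluster shares a match point
    elif len(matched) > 1:
        for cluster in matched:  # assumed reading of the multiple-cluster case
            cluster.add(record_id, points)
        return "multiple"
    elif len(points) <= LOW_POINT_THRESHOLD:
        log.info("%s matched with only %d point(s)", record_id, len(points))
        matched[0].add(record_id, points)
        return "added-low-points"
    elif len(matched[0].points & points) == 1:
        outcome = "new-low-confidence"  # many points, only one in common
    else:
        matched[0].add(record_id, points)
        return "added"

    fresh = Cluster()
    fresh.add(record_id, points)
    clusters.append(fresh)
    return outcome
```

When testing, the branch label is the thing to compare against what the scaffold app shows: removing identifiers from a test record should move it from the "added" branch toward "new-low-confidence" or "added-low-points", depending on how many identifiers remain.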

Operated as a Community Resource by the Open Library Foundation