Clean up data in OpenRefine (KBART files)
When viewing a GOKb project in OpenRefine, you will see a series of error and warning messages in the left-hand navigation pane. Fixing these errors ensures that the data being ingested into GOKb is of uniformly high quality. The messages primarily check that all required fields are present and that the data in each cell is formatted correctly.
Error messages must be resolved before you will be allowed to ingest a file into GOKb. Warning messages do not need to be resolved – they highlight missing data that you may want to add, but that is not required.
Steps
Step 1: Open your project
- From the GOKb tab in OpenRefine, click the "check out" link next to the project you wish to open.
- If your project doesn't appear in the list, you may need to check it in for the first time.
- If your project is checked out by another user, you will not be able to open or edit it. Checked-out projects list the email address of the current user, and you can contact them if necessary.
Step 2: Rename columns
- GOKb requires a standardized name for each column that will be ingested.
- For KBART files, you can use a macro to rename all of the columns in one step.
- Click Edit in any cell and then right-click to show the "Apply Macro" option.
- Click "Apply Macro" and then search for the KBART column transformation.
- Double-click the KBART macro and then wait while the application processes.
- Your columns will be renamed and you should see several error messages disappear.
Step 3: Add missing columns
- KBART files are generally missing several required columns for GOKb, which must be added manually.
- Each additional missing column should generate an error message.
- Add each column by clicking the menu button next to the error message and selecting Quick Resolution>Append a Blank Column.
- You will need to populate each blank column with appropriate data.
- The list of GOKb columns contains information about what type of data should be in each column.
- The following three columns will always need to be added, and will require you to look up a controlled value to populate the data.
- platform.host.name
- package.name
- org.publisher.name (Note that your KBART file will contain a column called "publisher_name". Publishers often use this column for information that GOKb would consider an imprint rather than a publisher. See Step 5 below for more information about how to address this column.)
- The following columns are optional, and you may choose to add them if your data happens to contain extra information about these fields. If you are missing this information, you can omit these columns:
- TIPPPayment
- TIPPStatus
- Title.OAStatus
- The following column is no longer used and you can ignore this warning in all cases:
- PrimaryTIPP
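Before working through the error messages, it can help to check an exported header row against the column names listed above. The following is a minimal sketch, not part of GOKb or OpenRefine: the function name `missing_columns` and the sample header are hypothetical, and the column lists are copied from this step.

```python
# Column names taken from the lists in Step 3 of this page.
REQUIRED_COLUMNS = ["platform.host.name", "package.name", "org.publisher.name"]
OPTIONAL_COLUMNS = ["TIPPPayment", "TIPPStatus", "Title.OAStatus"]


def missing_columns(header, wanted):
    """Return the names in `wanted` that do not appear in `header`, in order."""
    present = {name.strip() for name in header}
    return [name for name in wanted if name not in present]


# Hypothetical header row from a partly prepared project.
header = ["publication_title", "package.name", "TIPPStatus"]
print("Required but missing:", missing_columns(header, REQUIRED_COLUMNS))
print("Optional and absent:", missing_columns(header, OPTIONAL_COLUMNS))
```

Every name reported as required but missing corresponds to a column that should be added with Quick Resolution>Append a Blank Column.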
Step 4: Address invalid data
- GOKb requires that the data in certain columns be formatted a certain way.
- You will receive an error message for each column that contains improperly formatted data.
- Common data errors include:
One or more rows contains invalid dates in the column "DateFirstPackageIssue" or One or more rows contains invalid dates in the column "DateLastPackageIssue"
- The easiest way to resolve these errors is to select Quick Resolution>Attempt Automatic Conversion.
- If the conversion is not successful, see the dates page.
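If automatic conversion fails, it may be useful to see which values are the problem. The sketch below assumes that valid dates follow the ISO 8601 style that KBART recommends (YYYY, YYYY-MM, or YYYY-MM-DD) and that blank cells are acceptable. Both are assumptions, as is the function name `is_valid_date`; the dates page remains the authority on what GOKb accepts.

```python
import re
from datetime import datetime

# Assumed shapes: YYYY, YYYY-MM, or YYYY-MM-DD (ISO 8601 style).
DATE_SHAPE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")
ACCEPTED_FORMATS = ("%Y-%m-%d", "%Y-%m", "%Y")


def is_valid_date(value):
    """Return True if `value` is blank or a real calendar date in an assumed format."""
    text = value.strip()
    if not text:
        return True  # assumption: a blank cell means "no date", not an error
    if not DATE_SHAPE.match(text):
        return False
    for fmt in ACCEPTED_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            continue
    return False


# Hypothetical sample values from a date column.
for sample in ["1998-01-15", "1998-01", "1998", "01/15/1998", "1998-02-30"]:
    print(sample, is_valid_date(sample))
```

The shape check runs first so that loosely written values such as single-digit months are rejected, and `strptime` then rejects impossible dates such as February 30.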
Data in the column "KBARTEmbargo" must follow the KBART guidelines for an embargo.
Data in the column "CoverageDepth" must follow the KBART guidelines for a coverage depth.
- Follow the link in the error message to see the correct format for KBART embargoes and coverage depths.
- Since publishers format this information in a variety of ways, you'll need to use the features of OpenRefine to figure out how to resolve each error on a case-by-case basis.
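As a rough illustration of what well-formed values look like, the sketch below checks embargo and coverage depth strings. The embargo pattern (a type of R or P, a length, a unit of D, M, or Y, and an optional second part after a semicolon) and the three coverage depth values reflect one reading of the KBART recommended practice and should be treated as assumptions; the guidelines linked from the error message are authoritative. The function names are hypothetical.

```python
import re

# Assumed embargo syntax: type (R or P), length, unit (D, M, or Y),
# with an optional second part after a semicolon, e.g. "R10Y;P30D".
EMBARGO_PATTERN = re.compile(r"[RP][1-9]\d*[DMY](;[RP][1-9]\d*[DMY])?")

# Assumed controlled list of coverage depth values.
COVERAGE_DEPTHS = {"fulltext", "selected articles", "abstracts"}


def is_valid_embargo(value):
    """Return True if `value` is blank or matches the assumed embargo syntax."""
    text = value.strip()
    return text == "" or EMBARGO_PATTERN.fullmatch(text) is not None


def is_valid_coverage_depth(value):
    """Return True if `value` is blank or one of the assumed controlled values."""
    text = value.strip()
    return text == "" or text in COVERAGE_DEPTHS


# Hypothetical values as publishers might supply them.
for sample in ["P1Y", "R10Y;P30D", "12 months", "1 year"]:
    print(repr(sample), is_valid_embargo(sample))
for sample in ["fulltext", "full text", "abstracts"]:
    print(repr(sample), is_valid_coverage_depth(sample))
```

Values that fail these checks, such as "12 months", are the ones that need a case-by-case transformation in OpenRefine.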
One or more rows contain duplicated data for the column "title.identifier.issn."
One or more rows contain duplicated data for the column "title.identifier.eissn."
- It is important to resolve these errors, because duplicate ISSNs and eISSNs will cause two titles to merge into a single record in GOKb.
- To fix these errors, choose Quick Resolution>Facet. This will show you all of the duplicated values.
- For each group of duplicates, you will need to research the titles involved and determine which one the ISSN correctly belongs to.
- You can leave the ISSN as is for the correct title.
- For the incorrect titles, you must delete the ISSN. You can either leave the field blank or populate it with the correct ISSN if one exists. Click "Capture Edit" to save this change for future package updates.
- For more information on how to address repeat errors in a package, see Dealing with repeated errors in OpenRefine projects.
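The facet in OpenRefine groups rows that share an identifier. The same grouping can be sketched outside OpenRefine as shown below, for example to prepare a list of titles to research. This is an illustrative sketch only: the function name `find_duplicates`, the `publication_title` column, and the sample rows are hypothetical, and rows are assumed to be dictionaries keyed by column name.

```python
from collections import defaultdict


def find_duplicates(rows, column):
    """Group rows by the value in `column` and keep groups with more than one row.

    Blank values are skipped, since an empty identifier is not a duplicate.
    """
    groups = defaultdict(list)
    for row in rows:
        value = (row.get(column) or "").strip()
        if value:
            groups[value].append(row)
    return {value: members for value, members in groups.items() if len(members) > 1}


# Hypothetical rows; the ISSN values are placeholders, not real identifiers.
rows = [
    {"publication_title": "Journal A", "title.identifier.issn": "1111-1111"},
    {"publication_title": "Journal B", "title.identifier.issn": "1111-1111"},
    {"publication_title": "Journal C", "title.identifier.issn": "2222-2222"},
    {"publication_title": "Journal D", "title.identifier.issn": ""},
]

for issn, members in find_duplicates(rows, "title.identifier.issn").items():
    titles = [member["publication_title"] for member in members]
    print(issn, "is shared by", titles)
```

Each group printed this way corresponds to one set of titles to research before deciding which title keeps the ISSN.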
Step 5: Review additional fields
- Your GOKb project may contain additional fields that are not standard, but that may still be useful. You can load these fields into GOKb as custom columns.
- There are some common fields that occur in KBART files that we would like to preserve when present.
- To keep this data standardized, please use the following custom field names:
KBART column name | GOKb custom column name |
---|---|
title_id (when using a provider's proprietary ID) | title.identifier.{authorizedprovidername} Ex. title.identifier.sage |
title_id (when using a DOI) | title.identifier.doi |
preceding_publication_identifier | gokb.ti.precedingPublicationID |
publisher_name (when used for imprints) | gokb.ti.imprint |
Next step
Once you've resolved all of your error messages and any warning messages you choose, you will be ready to Ingest a Project Into GOKb.
Operated as a Community Resource by the Open Library Foundation