Dealing with repeated errors in OpenRefine projects

This page documents ways to address repeated errors in package files during the OpenRefine ingest process. Repeated errors may include missing or incorrect ISSNs or eISSNs, missing dates, spelling errors, and other typos.

Macros

We have created macros for existing GOKb data providers and will continue to create new macros as needed. To run a macro in an OpenRefine project:

  • In any cell, click Edit
  • Right-click in the box and select Apply Macro
  • In the search box, type the name of your provider to locate the correct macro
  • Click Ok to run the macro.
    • Note: the macro will run automatically and should only take a few seconds. When the macro functions are complete, you should see fewer Error and Warning messages in the left-hand column.

Capture-Edit

In addition to macros, you can save changes you make at the cell level by selecting "Capture Edit" in the dialog box. This generates valid JSON, which can be copied and reused the next time you update the package. For example, to save a correction to a missing or incorrect ISSN:

  • Click in the cell you would like to edit
  • Update the ISSN to the correct one
  • Check the box next to Capture Edit
  • Check to see that the PublicationTitle is selected in the drop-down menu
  • Click Apply to complete the change and save the edit
    • To view the saved edit, go to the Undo/Redo tab and you will see your edit in the last documented step.

  • After you have completed all edits for this package, go to the Undo/Redo tab and select Extract. Copy the JSON and paste it into a text file. You can reuse this JSON the next time you update the package, so you do not have to recreate each editing step.
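Before reusing a saved file, it can help to confirm that the pasted text is still valid JSON. The sketch below shows one way to check this in Python. It assumes the extracted history is a JSON array of operation objects that each carry an "op" field, as in standard OpenRefine operation histories; the output of the GOKb extension may differ. The function name check_operations and the sample operation are hypothetical.

```python
import json


def check_operations(text):
    """Parse an extracted operation history and return a list of problems.

    Assumption: the history is a JSON array of objects, each with an "op"
    key. This matches standard OpenRefine exports, but has not been
    confirmed for GOKb-specific operations.
    """
    try:
        operations = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]

    if not isinstance(operations, list):
        return ["expected a JSON array of operations"]

    problems = []
    for index, operation in enumerate(operations):
        if not isinstance(operation, dict) or "op" not in operation:
            problems.append(f"operation {index} has no 'op' field")
    return problems


# Hypothetical sample for illustration only; real extracted JSON will differ.
sample = '[{"op": "core/mass-edit", "description": "Mass edit cells in column PublicationTitle"}]'
print(check_operations(sample))  # prints []
```

An empty list means the text parsed cleanly and every entry looks like an operation. Any other result lists what to fix before pasting the JSON back into OpenRefine.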

Document repeated errors

If you work with the same data every month, you'll quickly realize how frustrating it is to fix the same errors again and again. One useful strategy is to document repeated errors so you don't have to research them each time you process a file. You can use an Excel template to document two kinds of errors: cell-level errors and rows to delete.

For cell-level errors, you should document the title that is affected, the field name of the cell that you have to change, and the original and new values of that cell.

For entire rows that need to be deleted, you will need to document enough information to identify the row in the future – usually the title and at least one identifier.

Download the template
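To illustrate how documented errors could be replayed against a new file, here is a minimal Python sketch. It assumes the documented errors are available as two lists: cell-level fixes (title, field name, original value, new value) and rows to delete (title plus one identifier). The dictionary keys, the "ISSN" field name, the sample values, and the function name apply_documented_fixes are hypothetical and are not taken from the template itself.

```python
def apply_documented_fixes(rows, cell_fixes, rows_to_delete):
    """Return a corrected copy of rows without modifying the input.

    rows: list of dicts, one per title row.
    cell_fixes: list of dicts with keys title, field, original, new.
    rows_to_delete: list of dicts with keys title, identifier_field, identifier.
    """
    cleaned = []
    for row in rows:
        # Skip rows that match a documented deletion on both title and identifier.
        delete = any(
            row.get("PublicationTitle") == d["title"]
            and row.get(d["identifier_field"]) == d["identifier"]
            for d in rows_to_delete
        )
        if delete:
            continue

        fixed = dict(row)
        for fix in cell_fixes:
            # Only replace a cell that still holds the documented original value,
            # so a correction already made by the supplier is not overwritten.
            if (
                fixed.get("PublicationTitle") == fix["title"]
                and fixed.get(fix["field"]) == fix["original"]
            ):
                fixed[fix["field"]] = fix["new"]
        cleaned.append(fixed)
    return cleaned


# Made-up sample data for illustration.
rows = [
    {"PublicationTitle": "Journal A", "ISSN": "1234-567"},
    {"PublicationTitle": "Journal B", "ISSN": "0000-0000"},
]
cell_fixes = [
    {"title": "Journal A", "field": "ISSN", "original": "1234-567", "new": "1234-5679"},
]
rows_to_delete = [
    {"title": "Journal B", "identifier_field": "ISSN", "identifier": "0000-0000"},
]

result = apply_documented_fixes(rows, cell_fixes, rows_to_delete)
print(result)
```

Matching on the original value as well as the title is a deliberate choice in this sketch: if the supplier corrects a value at the source, the documented fix no longer applies and the cell is left alone.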

Work with the supplier

Consider working with the data supplier (usually a publisher or aggregator) to see if they are willing to fix the errors at the source. A good way to start the conversation is to ask whether they would be interested in receiving notice of errors in their data and, if so, in what format they would prefer to receive it.

Operated as a Community Resource by the Open Library Foundation