Dealing with repeated errors in OpenRefine projects
This page documents ways to address repeated errors in package files during the OpenRefine ingest process. Repeated errors may include missing or incorrect ISSNs or eISSNs, missing dates, spelling errors, and other typos.
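Incorrect ISSNs of the kind described above can often be spotted before any manual research, because the last character of an ISSN is a check digit computed from the first seven digits. The following is a minimal Python sketch of that check, not part of GOKb or OpenRefine; the function names `issn_check_digit` and `is_valid_issn` are hypothetical and chosen only for illustration.

```python
import re

# Accepts "1234-5678", "12345678", and a trailing X/x check character.
ISSN_PATTERN = re.compile(r"^([0-9]{4})-?([0-9]{3})([0-9Xx])$")


def issn_check_digit(first_seven):
    """Return the expected check character for the first seven ISSN digits."""
    # Weights run 8, 7, 6, 5, 4, 3, 2 across the seven digits.
    total = sum(int(digit) * weight
                for digit, weight in zip(first_seven, range(8, 1, -1)))
    remainder = (11 - total % 11) % 11
    # A computed value of 10 is written as the letter X.
    return "X" if remainder == 10 else str(remainder)


def is_valid_issn(value):
    """Return True if value is well formed and its check digit matches."""
    match = ISSN_PATTERN.match(value.strip())
    if not match:
        return False
    first_seven = match.group(1) + match.group(2)
    return issn_check_digit(first_seven) == match.group(3).upper()
```

A check like this only confirms that an ISSN is internally consistent. It cannot tell you whether the ISSN belongs to the title in that row, so a passing value may still need to be corrected.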
Macros
We have created macros for existing GOKb data providers and will continue to create new macros as needed. To run a macro in an OpenRefine project:
- In any cell, click Edit
- Right-click in the box and select Apply Macro
- In the search box, type the name of your provider to locate the correct macro
- Click OK to run the macro
- Note: the macro will run automatically and should only take a few seconds. When the macro functions are complete, you should see fewer Error and Warning messages in the left-hand column.
Capture-Edit
In addition to macros, you can save changes you make at the cell level by selecting "Capture Edit" in the dialog box. This generates valid JSON, which you can copy and reuse the next time you update the package. For example, to save a correction to a missing or incorrect ISSN:
- Click in the cell you would like to edit
- Update the ISSN to the correct one
- Check the box next to Capture Edit
- Check to see that the PublicationTitle is selected in the drop-down menu
- Click Apply to complete the change and save the edit
- To view the saved edit, go to the Undo/Redo tab and you will see your edit in the last documented step.
- After you have completed all edits for this package, go to the Undo/Redo tab and select Extract. Copy and paste the JSON into a text file. You can reuse this code the next time you update the package, so that you do not have to recreate the editing steps.
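Before reusing a saved text file, it can help to confirm that the copied JSON is still intact and to see which steps it contains. The Python sketch below assumes the extracted history is a JSON array of objects that each carry an `op` key and, usually, a `description` key, which is how OpenRefine exports its operation history. The exact fields that Capture Edit writes are not documented here, so the `SAMPLE` text and the function name `summarize_operations` are hypothetical.

```python
import json


def summarize_operations(json_text):
    """Parse extracted operation JSON and return (op, description) pairs."""
    operations = json.loads(json_text)  # raises ValueError on malformed JSON
    if not isinstance(operations, list):
        raise ValueError("expected a JSON array of operations")
    summary = []
    for index, operation in enumerate(operations):
        if not isinstance(operation, dict) or "op" not in operation:
            raise ValueError(f"entry {index} is not an operation object")
        summary.append((operation["op"], operation.get("description", "")))
    return summary


# Hypothetical example of saved history; real files will contain more fields.
SAMPLE = """
[
  {"op": "core/mass-edit", "description": "Correct ISSN for one title"},
  {"op": "core/row-removal", "description": "Remove duplicate row"}
]
"""

for op, description in summarize_operations(SAMPLE):
    print(f"{op}: {description}")
```

If the file was truncated or altered when it was pasted, `json.loads` fails with an error that points at the position of the problem, which is easier to diagnose than a failed import into the project.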
Document repeated errors
If you work with the same data every month, you'll quickly realize how frustrating it is to fix the same errors again and again. One useful strategy is to document repeated errors so you don't have to research them each time you process a file. You can use an Excel template to document two kinds of errors: cell-level errors and rows to delete.
For cell-level errors, you should document the title that is affected, the field name of the cell that you have to change, and the original and new values of that cell.
For entire rows that need to be deleted, you will need to document enough information to identify the row in the future – usually the title and at least one identifier.
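The two kinds of documented errors described above map naturally onto two small tables, and once they are written down they can also be applied mechanically. The Python sketch below is an illustration under stated assumptions, not a GOKb tool: the column names `publication_title` and `print_identifier`, the keys used in each log entry, and the function name `apply_documented_fixes` are all hypothetical.

```python
def apply_documented_fixes(rows, cell_fixes, rows_to_delete,
                           title_col="publication_title",
                           id_col="print_identifier"):
    """Return a cleaned copy of rows using the documented error logs.

    cell_fixes:     dicts with keys title, field, original, new
    rows_to_delete: dicts with keys title, identifier
    """
    delete_keys = {(d["title"], d["identifier"]) for d in rows_to_delete}
    cleaned = []
    for row in rows:
        # Drop rows whose title and identifier both match a documented deletion.
        if (row.get(title_col), row.get(id_col)) in delete_keys:
            continue
        fixed = dict(row)  # copy so the input rows are left unchanged
        for fix in cell_fixes:
            # Only change the cell if it still holds the documented original
            # value, so a supplier-side correction is never overwritten.
            if (fixed.get(title_col) == fix["title"]
                    and fixed.get(fix["field"]) == fix["original"]):
                fixed[fix["field"]] = fix["new"]
        cleaned.append(fixed)
    return cleaned


# Hypothetical data for illustration only.
rows = [
    {"publication_title": "Journal A", "print_identifier": "0378-5954"},
    {"publication_title": "Journal B", "print_identifier": "2434-561X"},
]
cell_fixes = [
    {"title": "Journal A", "field": "print_identifier",
     "original": "0378-5954", "new": "0378-5955"},
]
rows_to_delete = [{"title": "Journal B", "identifier": "2434-561X"}]

result = apply_documented_fixes(rows, cell_fixes, rows_to_delete)
```

Recording the original value alongside the new one is what makes the log safe to reuse: if the supplier has already corrected a cell, the documented fix no longer matches and is skipped.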
Work with the supplier
Consider working with the data supplier (usually a publisher or aggregator) to see whether they are willing to fix the errors at the source. You can start the conversation by asking whether they would like to be notified of errors in their data and, if so, in what format they prefer to receive them.
Operated as a Community Resource by the Open Library Foundation