De-duplicating Lists

De-duplicate based on a single column

You can de-duplicate a list in Refine based on the data values in a single column. This approach relies on a function called 'Blank Down' - this works through line by line, and removes a value from a cell if it is identical to the value in the same column in the immediately preceding row.

To de-duplicate based on the PublicationTitle column, use the following process:

  • Click on the PublicationTitle column header and choose 'Sort...' from the drop down
  • Choose 'Sort cell values as text', and choose a-z or z-a - the order here does not matter, then click OK
  • A new 'Sort' option will appear at the top of the sheet - click this and choose 'Reorder rows permanently' in the drop down
  • Click on the PublicationTitle column header and choose 'Edit Cells' -> 'Blank down'
    • This will remove the value in a cell where it is identical to the value in the row above. This means that in each group of duplicate rows only the first one will have the title in the PublicationTitle cell
  • Click on the PublicationTitle column header and choose 'Facet' -> 'Customised Facet' -> 'Facet by blank'
  • Use the Facet to filter the list to those rows with a blank PublicationTitle
  • Click on the 'All' column header (always the left most column) and choose 'Edit rows' -> 'Remove all matching rows'
  • Remove the Facet - and you have removed all rows that have a duplicate PublicationTitle

NB This approach relies on the PublicationTitle column having no blank values to start with. This seems like a relatively safe bet, but if there was a situation where you have blank PublicationTitles for a reason, you would need to exclude such rows from your work before you started the process above - you can do this by identifying the rows you want to exclude from the process, starring (or flagging) them, creating a facet by star (or flag) and choose the facet to only work with unstarred (or unflagged) rows.

Operated as a Community Resource by the Open Library Foundation