De-duplicate data

Data de-duplication is about finding records in a table of data that are the same as others and removing them. This is a simple process when looking at records that are identical and gets more complicated when looking for 'similar' duplicated data to remove.

Different scenarios require different tools.

Duplicate values in column(s)

This is the most common use case for identical duplicate values (if you're a MS Excel user, think of the Remove duplicates option).

  1. Add a Group step /action.
  2. Move the column(s) with values you want to deduplicate to the right.
  3. (Optional) Delete the default Count column (assuming you don’t want to know the number of duplicate values per group).

If there are other columns in your table you don’t want to de-duplicate, but want to keep:

  1. Add a new Aggregate.
    a) Change the column selection type to Multiple.
    b) Move all the columns to the right (except those in step 2 above).
    c) Select the aggregate type as First value.
  2. Click Apply.

Group values into a list

There are a number of aggregates in addition to First value such as Count, Sum, Maximum, Last value etc.

One of these is Grouped values. Instead of returning the first value in the column, for each group it will collate all values into a comma separated list. Find out more about the Group step.

Duplicate values in a list

A list of comma separated values might contain duplicate values. There are several useful List Functions such as deduplicating values, determining the most/least common value or add a frequency count for each value.

Similar duplicate values

Values that are not exactly the same; are of different datatype, case, spacing, spelling, etc. will require some pre-processing before they can be de-duplicated. Either transform the column(s) or clone them and transform the cloned values, so these can be matched whilst retaining the original values. Find out more about the Transform step.

Finding similar values can be challenging in larger datasets. The Relationships option can help to highlight data inconsistencies in your data (such as United States/USA/United States of America). Find out more about Identify data conflicts.

Similar customer records

The Find duplicates Workflow step uses powerful standardization and matching algorithms to group together records containing similar contact data.

Complex unsorted records

For duplicate unsorted records with complex merge requirements, use the Harmonize duplicates Workflow step. It takes records that have already been identified as duplicates (clustered) and allows you to define the preferred method to merge records together or to identify the best record in the cluster.

Harmonize duplicates can keep all columns, including some that may be added in future, without needing to update step settings or the downstream steps that data is being fed into. This is better than the group approach where columns are assumed to be fixed.

First value also assumes that data can be pre-sorted into the required order. The Harmonize duplicates step doesn't require sorted data and can use a custom Score to determine the priority.