When combining values from multiple records (i.e. either the Select and complete best record or Merge best values processing method was chosen), we strongly recommend that your harmonization rules are applied to column groups instead of individual columns.

For example, you wouldn't want to use Address Line 1 column from one record and Address Line 2 from another as they could represent two entirely separate addresses.

Instead, you'd group all the address columns together and apply business rules to the group in order to find the best overall address.
Similarly, you may want to group columns such as {Latitude and Longitude}, and {DOBYear, DOBMonth, DOBDay}.

Create a Column Group

To create a column group, click Additional options in the Harmonize duplicates step then click New group.

There are two ways to specify which columns will be added to the group:

  • Column List: pick the individual columns by column name (click Edit);
  • Data Tag: select all columns with the required tag.

The column group allows you to choose any of the value priority types to apply to all columns in the group together.
You will often want to apply different priority types to different column groups (for example you might favor using mailing addresses from your e-commerce delivery system, but email addresses from a social media system).

All columns not targeted by a column group will be processed by the default processing behavior as defined in the Harmonize duplicates step.

Record Priority

Harmonization models can be applied to each column group and are selected from the record priority setting.

When configuring column groups, you are able to select one of the 'smart' models for the record priority.

These models utilize neural networks for value comparisons, each having been trained for distinct intended applications.

String similarity model (beta)

A Smart Harmonize model which is applicable to generic string types such as names; attempts to select the ideal cluster value using length, Jaro-Winkler and Soundex comparisons.

Formats model (beta)

A Smart Harmonize model which is applicable to columns with few desirable formats; attempts to select the most appropriate cluster value given strict and simplified format distributions within its column.

Values model (beta)

A Smart Harmonize model which is applicable to columns with few desirable values; attempts to select the most appropriate cluster value given the distribution of the value and its format within its column.

Overlapping groups

Each column can only be processed once but could be targeted by multiple column groups (e.g. if it has multiple data tags applied). In this case, the priority is given to the first column group in the list, and that column will be ignored by any subsequent column groups.

To make this more obvious, each column group has a list of Affected columns containing the names of all columns targeted by this group. If any columns in this list are also targeted by any column groups higher up the list, then they are shown to be crossed-out because they will be ignored in this instance.

If you have more than one column group, you can move the groups up/down in the list to change the order they will be processed in.

Ignoring incorrect values

In some cases, it's possible to exclude low quality/incorrect values from the harmonization results. If you specify a Filter column for a column group, then only the records that contain the Filter value in this column will be considered.

For example, you may have an email column in your data but want to make sure that no invalid emails make it into your 'surviving' record.

To achieve this

  1. Create a column that dictates whether the email is valid or not (see the Validate emails step).
  2. Create a column group containing just the 'Email'
  3. Set the 'Is Valid Email' column as the filter column, and the filter value would be true.

The resulting harmonization would ignore any values from the 'Email' column where the 'Is Valid Email' column does not contain true, so all emails in the output are valid.

This is used in situations where you would rather have no email address than an invalid one.
Filter columns are applied before the prioritization process takes place (so, for example, the 'most common' priority rule would only count occurrences of valid emails, and ignore all invalid emails).

Another situation where Filter Columns are useful is where a source system outputs an 'inactive' or 'do not use' flag against legacy or outdated records, where you may want to avoid using such data in some or all fields.

If you want to apply more complex criteria when deciding which values should be ignored by harmonization, it's often easiest to add a Transform step earlier in the workflow, to output a new column indicating whether or not each value should be considered.