Data Quality user documentation | Record/value priority

The record/value priority setting determines the criteria applied when selecting the best record or values. There are four methods for deciding which records/values take priority.

Most populated (default option)

This option is only available when the Select best record or Select and complete best record processing method is selected.

Selects the record with the highest number of non-null column values. This option is essentially selecting the 'most complete' record without any regard to the quality of the data within it.
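The counting rule above can be illustrated with a minimal Python sketch. This is not Data Studio's implementation: the function names and sample records are hypothetical, and it assumes that only None counts as null (how empty strings are treated is not stated here).

```python
from typing import Optional


def populated_count(record: dict[str, Optional[str]]) -> int:
    """Count the columns that hold a non-null value (assumption: only None is null)."""
    return sum(value is not None for value in record.values())


def most_populated(cluster: list[dict[str, Optional[str]]]) -> dict[str, Optional[str]]:
    """Return the record with the most non-null columns.

    max() keeps the first record on a tie; real tie-breaking is not modelled here.
    """
    return max(cluster, key=populated_count)


# Hypothetical cluster of duplicate records.
cluster = [
    {"name": "J Smith", "email": None, "phone": None},
    {"name": "John Smith", "email": "j.smith@example.com", "phone": None},
]
best = most_populated(cluster)
```

Note that the sketch only counts values. It never inspects their content, which matches the point above that data quality is not considered.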

Total length
Selects the record/value that has the highest total character count (the length of the value in each column added together).
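As a rough illustration of the Total length rule, the following Python sketch sums the character count of every column in a record. The function names and data are hypothetical, and treating null values as zero length is an assumption.

```python
from typing import Optional


def total_length(record: dict[str, Optional[str]]) -> int:
    """Sum the character counts of all columns (assumption: null counts as zero)."""
    return sum(len(value) for value in record.values() if value is not None)


def longest_record(cluster: list[dict[str, Optional[str]]]) -> dict[str, Optional[str]]:
    """Return the record with the highest total character count."""
    return max(cluster, key=total_length)


# Hypothetical cluster: the second record has fewer populated columns
# but a higher total character count.
cluster = [
    {"name": "J Smith", "city": "Leeds"},
    {"name": "Jonathan Smith", "city": None},
]
best = longest_record(cluster)
```

In this sample, Most populated would prefer the first record, while Total length prefers the second.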

Score based
This is the most powerful and flexible of the options: essentially a custom method for determining the priority of records/values. You identify a column that contains a score for the record/value. The column can be either:

a numeric column - the record with the highest value will be selected
a column containing true/false values - the priority will be given to records with true values

Note that alphanumeric values are currently not supported. If unsure, you can use the Profile step to check the format of the score column before using it.

The Lowest score first option allows you to flip the priority so that records with the lowest score (or that are false) are selected where possible.
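The selection logic described above can be sketched in Python as follows. The function name, the lowest_first parameter and the sample data are hypothetical stand-ins for the step's settings, not its actual API.

```python
def select_by_score(cluster: list[dict], score_column: str, lowest_first: bool = False) -> dict:
    """Pick the record with the highest score, or the lowest if lowest_first is set.

    Scores must be numeric or boolean; alphanumeric scores are rejected,
    mirroring the note that they are not supported.
    """
    for record in cluster:
        if not isinstance(record[score_column], (int, float, bool)):
            raise TypeError(f"unsupported score value: {record[score_column]!r}")
    chooser = min if lowest_first else max
    # In Python, True > False, so max() prefers true values and min() prefers false ones.
    return chooser(cluster, key=lambda record: record[score_column])


# Hypothetical numeric scores.
numeric_cluster = [{"id": 1, "score": 3}, {"id": 2, "score": 9}, {"id": 3, "score": 5}]
highest = select_by_score(numeric_cluster, "score")
lowest = select_by_score(numeric_cluster, "score", lowest_first=True)

# Hypothetical true/false scores.
flag_cluster = [{"id": 1, "verified": False}, {"id": 2, "verified": True}]
preferred = select_by_score(flag_cluster, "verified")
```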

Typically, a score column (or set of columns) will be added in a previous Transform step, where all of the power of Data Studio functions can be used to execute the business logic required to determine the suitability of a record or its values.

Example 1:
Your records contain a Date last updated column and you want the surviving record to be whichever was updated most recently. To do this, add a new Days since updated column in a Transform step and use the appropriate functions to derive its value from the Date last updated column. This new column can now be used as the 'score' column. Because a smaller number of days means a more recent update, enable the Lowest score first option so that the most recently updated record is selected.
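A minimal Python sketch of this derivation is shown below. In Data Studio the column would be built with Transform step functions; the Python names, the fixed reference date and the sample records here are all hypothetical. Since the score-based method selects the highest value by default, the sketch picks the lowest value, which corresponds to using Lowest score first.

```python
from datetime import date


def days_since_updated(last_updated: date, today: date) -> int:
    """Number of whole days between the last update and the reference date."""
    return (today - last_updated).days


# Fixed reference date so the example is reproducible (an assumption for illustration).
today = date(2024, 6, 30)

cluster = [
    {"id": 1, "date_last_updated": date(2024, 6, 1)},
    {"id": 2, "date_last_updated": date(2024, 6, 20)},
]

# Add the derived score column to each record.
scored = [
    {**record, "days_since_updated": days_since_updated(record["date_last_updated"], today)}
    for record in cluster
]

# Fewer days means a more recent update, so the lowest score wins.
most_recent = min(scored, key=lambda record: record["days_since_updated"])
```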

Example 2:
You have a column indicating the source system for the record and you have a preference for certain systems in terms of the quality of the data they're likely to contain. A new 'score' column could be added that derives a score on the basis of the source system. You can apply even more complex rules, such as preferring different source systems for customer records from different categories or geographies.
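One way to picture such a score column is a simple lookup from source system to score, sketched here in Python. The system names, score values and default are invented for illustration; in practice the mapping would be expressed with Transform step functions.

```python
# Hypothetical preference ranking: a higher score means a more trusted source.
SOURCE_SCORES = {"CRM": 3, "Billing": 2, "Legacy": 1}


def source_score(record: dict, default: int = 0) -> int:
    """Look up the score for a record's source system (unknown systems get the default)."""
    return SOURCE_SCORES.get(record["source_system"], default)


cluster = [
    {"id": 1, "source_system": "Legacy"},
    {"id": 2, "source_system": "CRM"},
    {"id": 3, "source_system": "Billing"},
]

# Add the derived score column, then pick the record with the highest score.
scored = [{**record, "score": source_score(record)} for record in cluster]
best = max(scored, key=lambda record: record["score"])
```

A more complex rule, such as a different ranking per customer category or geography, could be modelled by keying the lookup on a pair of values rather than on the source system alone.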

Most common

This option is only available when the Merge best values record processing method is selected.

When there are more than two records in a cluster and they contain inconsistent values, it often makes sense to choose the value that appears the highest number of times.

For independent columns, this is straightforward: the value that occurs most frequently within the column is selected. Processing multiple columns together is more complex. First, the most common non-null value in the first column of the group is identified, and any rows that do not share that value are discarded. This is repeated on the second and subsequent columns until a unique row has been identified.
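The difference between the two cases can be sketched in Python. This is an illustrative reading of the description above, not the product's code: the function names and data are hypothetical, and the handling of frequency ties (first value encountered wins) and of columns that are entirely null (skipped) are assumptions.

```python
from collections import Counter
from typing import Optional

Row = dict[str, Optional[str]]


def most_common_value(rows: list[Row], column: str) -> Optional[str]:
    """Most frequent non-null value in a single, independent column."""
    values = [row[column] for row in rows if row[column] is not None]
    return Counter(values).most_common(1)[0][0] if values else None


def most_common_row(rows: list[Row], columns: list[str]) -> Row:
    """Narrow a group of columns down to one row, column by column."""
    candidates = list(rows)
    for column in columns:
        top_value = most_common_value(candidates, column)
        if top_value is None:
            continue  # assumption: nothing to filter on if the column is all null
        # Discard rows that do not share the most common value.
        candidates = [row for row in candidates if row[column] == top_value]
        if len(candidates) == 1:
            break
    # Any remaining rows agree on every filtered column, so the first is representative.
    return candidates[0]


rows = [
    {"city": "Leeds", "postcode": "LS1 4AP"},
    {"city": "Leeds", "postcode": "LS1 4AP"},
    {"city": "Leeds", "postcode": "LS1 4AB"},
    {"city": "York", "postcode": "YO1 7HH"},
    {"city": None, "postcode": "YO1 7HH"},
    {"city": None, "postcode": "YO1 7HH"},
]
grouped = most_common_row(rows, ["city", "postcode"])
independent_postcode = most_common_value(rows, "postcode")
```

In this sample, treating the columns independently would pair the city Leeds with the postcode YO1 7HH, whereas processing them as a group keeps a city and postcode that actually occur together.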

Tie-breaking

Any of the processing rules can still result in two or more records/values being tied (same number of non-null columns, same total length, same score, etc.).

In this case, the tie is first broken using the Total length rule. If the records/values are still tied after that, they are sorted by the values of the first column, then the second, and so on, and the record at the top of the sorted list is selected.
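The two tie-breaking stages can be sketched in Python as follows. The sort direction and the treatment of nulls are not specified above, so the ascending order and the null-as-empty-string handling used here are assumptions, as are the function names and sample data.

```python
from typing import Optional

Record = dict[str, Optional[str]]


def total_length(record: Record) -> int:
    """Sum the character counts of all columns (assumption: null counts as zero)."""
    return sum(len(value) for value in record.values() if value is not None)


def break_tie(tied: list[Record], columns: list[str]) -> Record:
    """Resolve a tie: Total length first, then a column-by-column sort."""
    longest = max(total_length(record) for record in tied)
    remaining = [record for record in tied if total_length(record) == longest]
    if len(remaining) == 1:
        return remaining[0]
    # Assumption: ascending sort on the first column, then the second, and so on,
    # with nulls treated as empty strings. The top of the sorted list wins.
    ordered = sorted(remaining, key=lambda record: tuple(record[c] or "" for c in columns))
    return ordered[0]


# Hypothetical records that tied under the chosen priority method.
tied = [
    {"name": "Ann Lee", "city": "York"},
    {"name": "Ann Lea", "city": "York"},
    {"name": "A Lee", "city": "York"},
]
winner = break_tie(tied, ["name", "city"])
```

Here the third record drops out at the Total length stage, and the remaining two are separated by sorting on the name column.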
