Record/value priority

This determines the criteria that will be applied when selecting the best record/values. There are four methods for selecting which records/values to give priority to.

Most populated (default option)

Selects the record with the highest number of non-null column values. This option is essentially selecting the 'most complete' record without any regard to the quality of the data within it.

Total length
Selects the record/value that has the highest total character count (the length of the value in each column added together).

Score based
This is the most powerful and flexible of all the options - it's essentially a custom method for determining the priority of records/values. You identify a column which will contain a score for the record/value:

  • a numeric column - the record with the highest value will be selected
  • a column containing true/false values - the priority will be given to records with true values

The Lowest score first option allows you to flip the priority so that records with the lowest score (or that are false) are selected where possible.

Typically, a score column (or set of columns) will be added in a previous Transform step, where all of the power of Data Studio functions can be used to execute the business logic required to determine the suitability of a record or its values.

Example 1:
Your records contain a Date last updated column and you want the surviving record to be based on whichever record was updated the most recently. To do this, add a new Days since updated column in a Transform step and using the appropriate functions derive this value from the Date last updated column. This new column can now be used as the 'score' column.

Example 2:
You have a column indicating the source system for the record and you have a preference for certain systems in terms of the quality of the data they're likely to contain. A new 'score' column could be added that derives a score on the basis of the source system. You can apply even more complex rules, such as preferring different source systems for customer records from different categories or geographies.

Most common

When there are more than two records in a cluster and they contain inconsistent values, it often makes sense to choose the value that appears the highest number of times.

For independent columns, this is straightforward and the value that occurs most frequently within the column will be selected. However, processing multiple columns together is more complex. Firstly, the most common non-null value in the first column in the group is identified, then any rows that do not share the value are discarded. This is repeated on the second and the subsequent columns until a unique row has been identified.

Tie-breaking

The processing rules can always result in two records/values being tied (same number of non-null columns, same length values, same score, etc).

In this case, we first tie-break by using the Total length rule. If we're still tied after that, then we sort the values of the first column, then second, and so on, until we have a record at the top of the sorting.