Identify data conflicts

Any Dataset or View can be analyzed, using the Relationships button in Explore mode, to:

  • Ensure that the expected relationships (e.g. a Primary Key) hold true.
  • Discover previously unknown relationships in your data.
  • Find bad data where the expected relationships do not hold (because of missing/misspelled values or correct but different values, e.g. Manhattan vs New York City as a locale).
  • Determine if tables can be normalized or split into multiple smaller tables removing duplicated data using the left-hand-side as a common key.

Each relationship is one or more left-hand columns tested against a single right-hand column within the same table of data. The relationship is said to be a dependency when each left-hand-side values only matches to a single right-hand-side value. Where this does not hold true, the rows with nulls or many right-hand values will be in conflict.

Example of analyzed relationships.

Quality score
The selected columns will all be tested against each other as both the left and right side of the relationship. Each relationship will then be given a Quality score, which is the percentage of rows where the dependency holds.

The quality score is calculated as [(T-C) / T] * 100, where T is the total number of rows and C is the count of the fewest conflicting rows preventing the dependency.

Examples:

  • A left-hand column whose values are completely unique, such as a primary key, will always return a quality score of 100% for every column it is tested against.
  • A left-hand column, such as salutation ("Mr.", "Mrs."…) will contain a few values repeated multiple times. Testing this against the column gender ("M", "F") might be expected to always return the same value, i.e. Mr=M for all rows, so a quality score of 100%, but generic entries such as "Dr" would return both M and F for different rows and so lower the score.
  • Imagine an Orders table where each order can have multiple rows for each product purchased. Using order_id as the left-hand-side would return a low score with many conflicts a column like quantity. However, combining "order_id, line_item" as the left-hand-side you are able to determine the right-hand-side and so get a high quality score.

Analyze relationship settings

Relationship quality threshold (%)
The default value of 98% will return all dependencies and any relationships with few conflicts, which may be worth drilling into to see if there are data quality issues that can be fixed to improve the score. Setting this below 90% will likely return uninteresting and coincidental relationships, which for larger data can also significantly increase the analysis time.

Left-hand columns combined
Up to three columns can be combined automatically on the left side. Selecting Complex analysis will find the best possible quality score, but will increase the analysis time. Unselecting means that if a relationship is found that meets the threshold then no further columns will be added to the left-hand side.

Quality status
A status of 'Estimated' means that the Quality score is close to the relationship quality threshold. Data Studio can determine this quickly without needing to count every single value. This significantly reduces the total time taken to perform the analysis so that focus can be put on the relationships that have the best likelihood of a dependency, which will have a 'Confirmed' status. Estimates are never based on a sample of data and the score is always rounded down. If it is a relationship you are particularly interested in confirming then it is possible to try again with a different threshold value setting.

Conflicts

Once the relationships have been analyzed, any scores lower than 100% indicate that there are rows in conflict. This means that there are multiple different values (including nulls) in the right-hand column against the same left-hand column.

To see what these values are for a specific relationship, select the desired relationship in the data grid and select the Conflicts action (or choose it when right-clicking the row).

The Conflicts drilldown alternates in row colour for each new left-hand side value to better show when moving through different conflicts, and highlights the right-hand column value in blue.

Each conflict has a:

  • Row count, which is useful to determine if there is a correct value. For example if 3 rows contain United States but 997 rows contain United States of America, the latter is more likely to be the recognized value.
  • Conflict reason, which is either Different right-hand side values, Null on left-hand side or Null on right-hand side.

Selecting one or more conflicts enables the Rows action, which will display all the original rows from the data that have that conflict.