The Compare records section of the workbench lets you compare two records from your duplicate store to see exactly why they were or were not matched as duplicates by the Find duplicates step. This makes it easier to identify any configuration changes that may be needed.

Search records

To begin, enter a search term to find the two records that you would like to compare. The search is case-insensitive, and the term can appear in any column of the input records. The whole search term doesn't need to match a single record; for example, you can enter multiple unique IDs separated by spaces.
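The search behavior described above can be pictured with a small Python sketch. This is only an illustration and not the product's implementation: the function name matches, the sample records, and the use of simple substring matching on space-separated terms are all assumptions.

```python
def matches(record: dict[str, str], search: str) -> bool:
    """Return True if any space-separated term appears in any column.

    Assumption: terms are compared as case-insensitive substrings.
    The product's actual tokenization may differ.
    """
    terms = search.lower().split()
    values = [str(value).lower() for value in record.values()]
    return any(term in value for term in terms for value in values)


# Hypothetical input records with a unique ID column.
records = [
    {"id": "A100", "forename": "John", "surname": "Doe"},
    {"id": "B200", "forename": "Jon", "surname": "Doe"},
    {"id": "C300", "forename": "Mary", "surname": "Smith"},
]

# Two unique IDs separated by a space return both records.
hits = [r["id"] for r in records if matches(r, "a100 b200")]
```

Here hits contains the IDs of the first two records, which could then be selected for comparison.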

If Lucene query syntax was enabled on the load duplicate store screen, you can also use it to perform a more specific search. For example, to search for all records that have a forename field starting with "John" and a surname field of "Doe", you could use the search: forename:john* AND surname:doe. When using Lucene query syntax, use the input dataset column names in lowercase as field names.
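If you build such searches often, a small helper can apply the lowercase field name convention for you. The following Python sketch is hypothetical: build_query is not part of the product, and it assumes simple single-word values, so it does not quote phrases or escape special Lucene characters.

```python
def build_query(criteria: dict[str, str]) -> str:
    """Join column/value pairs into a Lucene-style query string.

    Field names are lowercased as the documentation requires.
    Values are lowercased too, mirroring the documented example.
    """
    return " AND ".join(
        f"{column.lower()}:{value.lower()}" for column, value in criteria.items()
    )


# Column names as they might appear in the input dataset (assumed).
query = build_query({"Forename": "John*", "Surname": "Doe"})
# query == "forename:john* AND surname:doe"
```

The resulting string can be pasted into the search box as is.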

Once you have carried out your search, you will be presented with a list of records to choose from. Pick the two records that you would like to compare by selecting the checkbox to the left of each record and then click the Analyze button.

View standardization

You will then be able to view each record in its standardized form so you can see the final values used by the Find duplicates step to determine if the two records are duplicates. The following example shows two records after they have been standardized:

standardization_example.png

Blocking keys

Below standardization, you will find a comparison of each of the generated blocking keys. A blocking key highlighted in green means this blocking key resulted in the two records being scored together. If no blocking keys are highlighted, it means the two records did not reach the scoring stage and will not be considered a match regardless of how they are evaluated against the match rules.
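The idea behind the highlighting can be sketched in a few lines of Python. The key names and values below are invented for illustration, and treating empty key values as non-matching is an assumption rather than documented behavior.

```python
def shared_blocking_keys(keys_a: dict[str, str], keys_b: dict[str, str]) -> list[str]:
    """Return the names of blocking keys whose non-empty values are identical."""
    return [
        name
        for name, value in keys_a.items()
        if value and keys_b.get(name) == value
    ]


# Hypothetical blocking keys generated for two standardized records.
record_a = {"postcode_surname": "SW1A|DOE", "phone": "", "email": "jdoe@example.com"}
record_b = {"postcode_surname": "SW1A|DOE", "phone": "", "email": "john.doe@example.com"}

shared = shared_blocking_keys(record_a, record_b)

# The pair only reaches the scoring stage if at least one key is shared.
reaches_scoring = bool(shared)
```

In this example only postcode_surname is shared, which corresponds to a single blocking key highlighted in green.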

Visualize rules

The visualize rules section shows how the two selected records are evaluated against the rules. The leftmost nodes represent the different match levels. Click a match level to expand it and show the tree of rules that were evaluated to decide whether the pair of records matched. You can further expand any rule node to see the sub-rules that compose it, giving a complete breakdown of the final match result.

Rule nodes are displayed in one of four ways:

  • A green rule node means the rule evaluated to true and could lead to a potential duplicate match.
  • A red rule node means the rule evaluated to false so the rule doesn't consider the two records a match.
  • A brown rule node means it was not necessary to evaluate the rule so it was skipped.
  • A dotted green line leading to a green node means that the rule wasn't explicitly evaluated but inferred to have passed based on the results of other, stricter rules.

In addition, an AND node is used to represent the logical AND operation for rules. All rule nodes directly following an AND node must pass for that node to be considered a pass.

In the following example:

visualize_rules.png

  • The highest match rule to pass (green rule node) indicates that these two records matched at L2 (shown here as MATCH.PROBABLE).
  • The L1 match rule (MATCH.CLOSE) failed because the ADDRESS.CLOSE rule failed (red rule node).
  • The FORENAMESANDADDRESS.PROBABLE node is brown, meaning it was skipped. This is because the MATCH.PROBABLE rule is an OR condition (second path without the AND node) so only one of its sub-rules needs to pass.
  • The NAME.PROBABLE rule was inferred to have passed (dotted green line to a green node). The NAME.CLOSE rule had already passed, and because NAME.CLOSE is a stricter match level than NAME.PROBABLE, the Find duplicates step infers that NAME.PROBABLE passes as well.
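The example above can be mimicked with a short Python sketch. It is only a rough model and not how the Find duplicates step is implemented: the Rule class, the STRICTER mapping, the ADDRESS.PROBABLE rule, the exact tree shape, and the assumption that both AND and OR stop evaluating as soon as the outcome is known are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    name: str
    op: str = "LEAF"              # "LEAF", "AND", or "OR"
    result: bool = False          # outcome of a leaf rule (assumed known here)
    children: list["Rule"] = field(default_factory=list)


# Hypothetical: a rule mapped to a stricter rule whose pass implies its own.
STRICTER = {"NAME.PROBABLE": "NAME.CLOSE"}


def mark_skipped(rule: Rule, states: dict[str, str]) -> None:
    """Mark a rule and all of its sub-rules as skipped (brown)."""
    states.setdefault(rule.name, "skipped")
    for child in rule.children:
        mark_skipped(child, states)


def evaluate(rule: Rule, states: dict[str, str]) -> bool:
    stricter = STRICTER.get(rule.name)
    if stricter and states.get(stricter) == "passed":
        states[rule.name] = "inferred"      # dotted green line
        return True
    if rule.op == "LEAF":
        passed = rule.result
    else:
        # AND keeps going while children pass; OR keeps going while they fail.
        keep_going = rule.op == "AND"
        passed = keep_going
        for child in rule.children:
            if passed == keep_going:
                passed = evaluate(child, states)
            else:
                mark_skipped(child, states)
    states[rule.name] = "passed" if passed else "failed"   # green or red
    return passed


match_close = Rule("MATCH.CLOSE", "AND", children=[
    Rule("NAME.CLOSE", result=True),
    Rule("ADDRESS.CLOSE", result=False),
])
match_probable = Rule("MATCH.PROBABLE", "OR", children=[
    Rule("MATCH.PROBABLE/AND", "AND", children=[
        Rule("NAME.PROBABLE"),
        Rule("ADDRESS.PROBABLE", result=True),
    ]),
    Rule("FORENAMESANDADDRESS.PROBABLE", result=True),
])

states: dict[str, str] = {}
for level in (match_close, match_probable):   # strictest level first
    evaluate(level, states)
```

After running, states records MATCH.CLOSE as failed, NAME.PROBABLE as inferred, FORENAMESANDADDRESS.PROBABLE as skipped, and MATCH.PROBABLE as passed, matching the colors described in the list above.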