This section covers key concepts of Aperture Match.
A duplicate store is a reusable data object in Aperture Data Studio that holds records identified as duplicates during data matching processes. It allows you to persist, manage, and reuse these records across workflows, supporting ongoing data quality efforts.
Each match between two records will have one of the following confidence levels:
| Match status | Description |
|---|---|
| Exact (0) | Each individual field that makes up the record matches exactly. |
| Close (1) | Records might have some fields that match exactly, and some fields that are very similar. |
| Probable (2) | Records might have some fields that match exactly, some fields that are very similar, and some fields that differ a little more. |
| Possible (3) | Most fields in the records are similar in some way, but do not match exactly. |
| None (4) | Records do not match. |
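The numeric codes in the table are ordered from strongest to weakest match, so a lower number means higher confidence. A minimal sketch of how that ordering might be modelled in code follows; the `MatchLevel` enum and `is_at_least` helper are hypothetical names used for illustration and are not part of any Aperture API.

```python
from enum import IntEnum


class MatchLevel(IntEnum):
    """Match levels from the table above; a lower value is a stronger match."""
    EXACT = 0
    CLOSE = 1
    PROBABLE = 2
    POSSIBLE = 3
    NONE = 4


def is_at_least(level: MatchLevel, threshold: MatchLevel) -> bool:
    """Return True when `level` is as strong as, or stronger than, `threshold`."""
    return level <= threshold
```

For example, `is_at_least(MatchLevel.CLOSE, MatchLevel.PROBABLE)` is true, because a Close match is stronger than a Probable one.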
Rules and blocking keys define how records are grouped into clusters within a duplicate store. Aperture Match first creates blocks of similar records, which reduces the number of records that need to be compared and makes duplicate detection more efficient. Blocks are generated from a blocking key, which is made up of a combination of record elements.
Every pair of records in the resulting block is then compared using a set of matching rules, which are logical expressions that control the match level returned.
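The two stages described above (blocking, then pairwise comparison within each block) can be illustrated with a small sketch. This is not how Aperture Match is implemented or configured; the sample records, the `blocking_key` function, and the toy `match_level` rule are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "name": "Ann Smith", "postcode": "AB1 2CD"},
    {"id": 2, "name": "Ann Smith", "postcode": "AB1 2CD"},
    {"id": 3, "name": "Anne Smith", "postcode": "AB1 2CD"},
    {"id": 4, "name": "Bob Jones", "postcode": "ZZ9 9ZZ"},
]


def blocking_key(record):
    """Hypothetical key: postcode without spaces plus the first letter of the name."""
    return record["postcode"].replace(" ", "") + record["name"][0].upper()


def match_level(a, b):
    """Toy matching rule returning 0 (Exact), 1 (Close), or 4 (None)."""
    if a["name"] == b["name"] and a["postcode"] == b["postcode"]:
        return 0
    if a["postcode"] == b["postcode"] and a["name"][:3] == b["name"][:3]:
        return 1
    return 4


def find_matches(records):
    # Stage 1: group records into blocks by their blocking key.
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)

    # Stage 2: compare every pair of records, but only within a block.
    matches = []
    for block in blocks.values():
        for a, b in combinations(block, 2):
            level = match_level(a, b)
            if level < 4:
                matches.append((a["id"], b["id"], level))
    return matches


matches = find_matches(records)
```

In this sketch, record 4 falls into its own block and is never compared with the others, so only three comparisons are made instead of the six needed to compare all four records with each other.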
Combinations of matching rules and blocking keys can be stored in Step settings. During duplicate store configuration, rules and keys can either be entered manually or selected from the available sets in Step settings.
Additional sets of matching rules can be defined for use when searching the duplicate store using the bulk or real-time Search steps. Search rules can be configured to use all, or a subset of, the blocking keys defined for the store.
A cluster is a collection of records that have been identified as representing the same entity using the rules defined for the duplicate store. Each top-level cluster is identified by a unique cluster ID.
Aperture Match introduces a new approach to clustering. The previous version of Find Duplicates assigned a single match level to an entire cluster, which can hide useful information for clusters with more than two records. Hierarchical clustering allows each record to belong to multiple clusters, each with a different match level. For example, a record could be in an “Exact” level cluster with one other record and also in a “Probable” level cluster with that same record and two other records.
This approach provides greater visibility into the structure of match results, particularly for larger clusters. To see which records are most similar to a given record, look at the records that share its “Exact” level cluster. To see records that are at least slightly similar, look at its “Possible” level cluster. The “Possible” level cluster is a superset of the “Exact” level cluster.
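One way to picture hierarchical clusters is to build a separate grouping for each match level, linking records whenever they are connected by a match at that level or stronger. The sketch below does this with a simple union-find structure. It is an illustrative assumption about how such a hierarchy could be derived, not a description of Aperture Match internals; the record IDs, sample matches, and the `clusters_at` function are hypothetical.

```python
LEVELS = {"Exact": 0, "Close": 1, "Probable": 2, "Possible": 3}


def clusters_at(record_ids, matches, max_level):
    """Group records connected by matches at `max_level` or stronger."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of any two records matched strongly enough.
    for a, b, level in matches:
        if level <= max_level:
            parent[find(a)] = find(b)

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), set()).add(r)
    # Map each record to the full membership of its cluster at this level.
    return {r: frozenset(groups[find(r)]) for r in record_ids}


# Hypothetical pairwise matches: (record, record, match level).
record_ids = ["A", "B", "C", "D", "E"]
matches = [("A", "B", 0), ("B", "C", 2), ("C", "D", 2), ("D", "E", 3)]

hierarchy = {
    name: clusters_at(record_ids, matches, level)
    for name, level in LEVELS.items()
}
```

Here `hierarchy["Exact"]["A"]` contains only A and B, while `hierarchy["Probable"]["A"]` contains A, B, C, and D, mirroring the example of a record that sits in a small “Exact” level cluster and a larger “Probable” level cluster. Because each weaker level only adds links, a record's cluster at a weaker level always contains its cluster at a stronger level.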
Cluster refinement gives you more control over how matched records are grouped into clusters. After records are clustered based on the configured matching rules, you can manually refine clusters in one of two ways: match level review, which presents clusters containing lower confidence matches (Probable or Possible) for review, or selective review, where you search for specific records or clusters to review and refine.