Duplicate store objects enable Duplicate stores to be managed and persisted in Data Studio. Stores can be both created and updated using the Find duplicates step, and custom settings can be maintained for each store.
When the Find duplicates step is first run, the configured store is established in the Postgres database. Subsequent runs using the same store insert or update records in it. Once the store has been established, it can also be used with the Find duplicates query and Find duplicates delete steps, as well as with Real-time Workflow steps.
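Because the store is persisted in the Postgres database under its External label (described below), it can be inspected directly if needed. The following is a minimal sketch using psycopg2; the connection details and table name are hypothetical, not part of Data Studio:

```python
# Minimal sketch: inspecting a Duplicate store directly in Postgres.
# Connection details and the table name ("customer_store") are hypothetical;
# the real name is the store's External label (Duplicate store ID).
import psycopg2

conn = psycopg2.connect(
    host="dedupe-db.example.com",
    dbname="datastudio",
    user="readonly",
    password="secret",
)
with conn, conn.cursor() as cur:
    # Each run of the Find duplicates step inserts or updates rows here.
    cur.execute("SELECT count(*) FROM customer_store")
    print("records currently in store:", cur.fetchone()[0])
conn.close()
```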
This screen provides information on each of the stores accessible in the current Space. From here users can create, edit, delete, and clear Duplicate stores, and set their sharing options. If Duplicate stores are enabled for cluster management, there is an additional option to Review Duplicate stores: clicking this button at the top of the screen takes you to a screen that displays all reviewable stores, where you can choose to perform a match level or selective review. More information can be found in the cluster refinement section.
From the Duplicate stores list screen, click Create new Duplicate store, or, to edit an existing store, select the Edit details action.
The External label (Duplicate store ID) is the name of the store in the Postgres database. This will be generated automatically based on the Duplicate store name but formatted to be compliant with external label restrictions and to be unique in this environment.
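The exact formatting and uniqueness rules are handled by Data Studio, but a rough sketch of this kind of normalization, assuming (hypothetically) that only lowercase letters, digits, and underscores are permitted, might look like:

```python
import re

def to_external_label(store_name: str, existing_labels: set[str]) -> str:
    """Illustrative only: derive a Postgres-safe, unique label from a store
    name. Data Studio's actual restrictions and uniqueness scheme may differ."""
    label = re.sub(r"[^a-z0-9]+", "_", store_name.lower()).strip("_")
    candidate, n = label, 1
    while candidate in existing_labels:  # ensure uniqueness in this environment
        n += 1
        candidate = f"{label}_{n}"
    return candidate

print(to_external_label("Customer Duplicates (UK)", {"customer_duplicates_uk"}))
# -> customer_duplicates_uk_2
```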
Selecting Include timestamp columns will store timestamps for record creation and last update in the Duplicate store. The ‘Created timestamp’ and ‘Updated timestamp’ columns will be included in the output of the Find duplicates step to indicate when records were created and last modified.
The server location for the Duplicate store defaults to the server configured in Settings > Workflow steps > Find duplicates.
On the next screen, click Add blocking keys to either select an existing set of blocking keys (created via Step settings > Find duplicates settings) or enter your blocking keys manually. You can also set the blocking key limit on this screen: the threshold above which the potential matches generated by a blocking key value are ignored. Once you have added blocking keys and clicked Apply, you can edit the individual keys to set key-specific blocking key limits. It is recommended that you consult a support representative before changing this value from the default of 500.
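To illustrate what the limit does, here is a rough Python sketch of blocking (the record shape and key function are hypothetical): records that share a blocking key value become potential matches, but a value shared by more records than the limit is skipped entirely:

```python
from collections import defaultdict
from itertools import combinations

BLOCKING_KEY_LIMIT = 500  # the default limit described above

def candidate_pairs(records, make_key, limit=BLOCKING_KEY_LIMIT):
    """Illustrative sketch of blocking: records sharing a blocking key value
    become potential matches, unless the group exceeds the limit."""
    groups = defaultdict(list)
    for rec in records:
        groups[make_key(rec)].append(rec)
    for group in groups.values():
        if len(group) > limit:
            continue  # too common a value: its potential matches are ignored
        yield from combinations(group, 2)

rows = [
    {"id": 1, "postcode": "SW1A 1AA"},
    {"id": 2, "postcode": "SW1A 2BB"},
    {"id": 3, "postcode": "M1 1AE"},
]
# Hypothetical blocking key: the first four characters of the postcode.
for a, b in candidate_pairs(rows, make_key=lambda r: r["postcode"][:4]):
    print(a["id"], "vs", b["id"])  # -> 1 vs 2
```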
To add matching rules for your store, click Rules, then Add rules, and either select an existing set of rules or enter your rules manually. You can set the rules' purpose to either Clustering or Search only. Clustering rules are used to form clusters as records are added to the store. Search only rules can be used when searching the Duplicate store for matching records using the bulk or real-time search Workflow steps. Only one clustering ruleset can be specified for a store, but multiple sets of search only rules can be added. Clustering rules can only be added or edited when the store is not populated, whereas search only rules can be added at any time. If you do not require a clustered store, there is no need to add clustering rules.
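Conceptually, clustering rules decide cluster membership as records arrive, while search only rules return matches without altering clusters. A rough union-find sketch of the clustering side, with an entirely hypothetical clustering rule (real rules are the matching rules you configure):

```python
class DuplicateClusters:
    """Illustrative union-find sketch of clustering as records are added."""

    def __init__(self, matches):
        self.matches = matches  # clustering rule: (record, record) -> bool
        self.parent = {}        # record id -> parent id (union-find forest)

    def find(self, rid):
        while self.parent[rid] != rid:
            rid = self.parent[rid]
        return rid

    def add(self, rid, record, store):
        self.parent[rid] = rid
        for other_id, other in store.items():
            if self.matches(record, other):
                # merge the two clusters
                self.parent[self.find(other_id)] = self.find(rid)
        store[rid] = record

store = {}
clusters = DuplicateClusters(matches=lambda a, b: a["email"] == b["email"])
clusters.add(1, {"email": "jo@example.com"}, store)
clusters.add(2, {"email": "jo@example.com"}, store)
assert clusters.find(1) == clusters.find(2)  # records 1 and 2 share a cluster
```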
When adding a set of rules, you can select a subset of the available blocking keys to be associated with the rules. All are selected by default. Only the associated blocking keys will be used to generate potential matches.
Selecting Enable cluster management will make the store available for review on the Review Duplicate stores list screen. This setting cannot be changed when the store is populated.
You can share Duplicate stores globally or with specific Spaces. From the Duplicate stores list page, click Sharing options. To use a Duplicate store that's been shared with you, click Include from another Space.
Before using the Find duplicates step, it is important to tag your data to make mapping columns easier. If the columns in your data are tagged already, the Find duplicates step will recognize the tagged columns and automatically assign a relevant Find duplicates column mapping to them. Otherwise, you can manually assign the column mappings based on your knowledge of the data.
This step will only recognize the following system-defined tags:
It is important to map your columns as accurately as possible before using the Find duplicates step to make the matching process more efficient. For example, mapping a column as Address when it contains primarily company or name information will lead to less accurate results.
Additionally, using the more granular address element mappings such as Premise, Street, and Locality, as opposed to the higher-level Address mapping (provided your data is divided in such a way), means that less effort is required to identify individual address components.
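As a hypothetical illustration of the difference (the column names here are invented):

```python
# One concatenated column forces the step to parse every address component:
coarse_mappings = {"FULL_ADDRESS": "Address"}

# Pre-split data can use the granular element mappings directly:
granular_mappings = {
    "BUILDING": "Premise",
    "STREET":   "Street",
    "TOWN":     "Locality",
}
```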
For more information on how Find duplicates utilizes these column mappings, you can refer to the advanced configuration page.
You can apply different rulesets to columns with the same tag by using group IDs.
For example, you may have delivery and billing addresses that you want to treat differently. You would tag both as an address but create separate group IDs, allowing you to apply different rulesets: accept only an exact match for the billing address, but a close match for the delivery address.
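A hypothetical sketch of how such mappings and rule intents might be expressed (these structures are illustrative, not Data Studio configuration):

```python
# Both columns carry the Address tag, but distinct group IDs keep them apart:
column_mappings = [
    {"column": "BILLING_ADDRESS",  "tag": "Address", "group_id": "billing"},
    {"column": "DELIVERY_ADDRESS", "tag": "Address", "group_id": "delivery"},
]

# Different rule intents per group:
ruleset_by_group = {
    "billing":  "exact match only",
    "delivery": "close match accepted",
}
```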
For bulk operations, refer to the Workflow steps.
For real-time operations, such as instant matching or deleting records to maintain data integrity, refer to Real-time Workflows.