Duplicate store objects enable Duplicate stores to be managed and persisted in Data Studio. Stores can be both created and updated using the Find duplicates step, and custom settings can be maintained for each store.
When the Find duplicates step is first run, the configured store is established in the Postgres database. Subsequent runs using the same store insert or update records in it. Once the store has been established, it can also be used with the Find duplicates query and Find duplicates delete steps, as well as with Real-time Workflow steps.
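Because the store is persisted in the Postgres database under its External label (described below), it can be inspected directly if needed. The following is a minimal sketch using psycopg2; the connection details and table name are hypothetical, not part of Data Studio:

```python
# Minimal sketch: inspecting a Duplicate store directly in Postgres.
# Connection details and the table name ("customer_store") are hypothetical;
# the real name is the store's External label (Duplicate store ID).
import psycopg2

conn = psycopg2.connect(
    host="dedupe-db.example.com",
    dbname="datastudio",
    user="readonly",
    password="secret",
)
with conn, conn.cursor() as cur:
    # Each run of the Find duplicates step inserts or updates rows here.
    cur.execute("SELECT count(*) FROM customer_store")
    print("records currently in store:", cur.fetchone()[0])
conn.close()
```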
This screen provides information on each of the stores accessible in the current Space. From here users can create, edit, delete, and clear Duplicate stores, and set their sharing options. If Duplicate stores are enabled for cluster management, there is an additional option to Review Duplicate stores: clicking this button at the top of the screen takes you to a screen that displays all reviewable stores, where you can choose to perform a match level or selective review. More information can be found in the cluster refinement section.
From the Duplicate stores list screen, click Create new Duplicate store, or, to edit an existing store, select the Edit details action.
The External label (Duplicate store ID) is the name of the store in the Postgres database. This will be generated automatically based on the Duplicate store name but formatted to be compliant with external label restrictions and to be unique in this environment.
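The exact formatting and uniqueness rules are handled by Data Studio, but a rough sketch of this kind of normalization, assuming (hypothetically) that only lowercase letters, digits, and underscores are permitted, might look like:

```python
import re

def to_external_label(store_name: str, existing_labels: set[str]) -> str:
    """Illustrative only: derive a Postgres-safe, unique label from a store
    name. Data Studio's actual restrictions and uniqueness scheme may differ."""
    label = re.sub(r"[^a-z0-9]+", "_", store_name.lower()).strip("_")
    candidate, n = label, 1
    while candidate in existing_labels:  # ensure uniqueness in this environment
        n += 1
        candidate = f"{label}_{n}"
    return candidate

print(to_external_label("Customer Duplicates (UK)", {"customer_duplicates_uk"}))
# -> customer_duplicates_uk_2
```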
Selecting Include timestamp columns will store timestamps for record creation and last update in the Duplicate store. The ‘Created timestamp’ and ‘Updated timestamp’ columns will be included in the output of the Find duplicates step to indicate when records were created and last modified.
The server location for the Duplicate store defaults to the server configured in Settings > Workflow steps > Find duplicates.
On the next screen, click Add blocking keys to either select an existing set of blocking keys (created via Step settings > Find duplicates settings) or enter your blocking keys manually. You can also set the blocking key limit on this screen: the threshold above which the potential matches generated by a blocking key value are ignored. Once you have added blocking keys and clicked Apply, you can edit the individual keys to set key-specific blocking key limits. It is recommended that you consult a support representative before changing this value from the default of 500.
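To illustrate what the limit does, here is a rough Python sketch of blocking (the record shape and key function are hypothetical): records that share a blocking key value become potential matches, but a value shared by more records than the limit is skipped entirely:

```python
from collections import defaultdict
from itertools import combinations

BLOCKING_KEY_LIMIT = 500  # the default limit described above

def candidate_pairs(records, make_key, limit=BLOCKING_KEY_LIMIT):
    """Illustrative sketch of blocking: records sharing a blocking key value
    become potential matches, unless the group exceeds the limit."""
    groups = defaultdict(list)
    for rec in records:
        groups[make_key(rec)].append(rec)
    for group in groups.values():
        if len(group) > limit:
            continue  # too common a value: its potential matches are ignored
        yield from combinations(group, 2)

rows = [
    {"id": 1, "postcode": "SW1A 1AA"},
    {"id": 2, "postcode": "SW1A 2BB"},
    {"id": 3, "postcode": "M1 1AE"},
]
# Hypothetical blocking key: the first four characters of the postcode.
for a, b in candidate_pairs(rows, make_key=lambda r: r["postcode"][:4]):
    print(a["id"], "vs", b["id"])  # -> 1 vs 2
```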
To add matching rules for your store, click Rules, then Add rules, and either select an existing set of rules or enter your rules manually. You can set the rules' purpose to either Clustering or Search only. Clustering rules are used to form clusters as records are added to the store. Search only rules can be used when searching the Duplicate store for matching records using the bulk or real-time search Workflow steps. Only one clustering ruleset can be specified for a store, but multiple sets of search only rules can be added. Clustering rules can only be added or edited when the store is not populated, whereas search only rules can be added at any time. If you do not require a clustered store, there is no need to add clustering rules.
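Conceptually, clustering rules decide cluster membership as records arrive, while search only rules return matches without altering clusters. A rough union-find sketch of the clustering side, with an entirely hypothetical clustering rule (real rules are the matching rules you configure):

```python
class DuplicateClusters:
    """Illustrative union-find sketch of clustering as records are added."""

    def __init__(self, matches):
        self.matches = matches  # clustering rule: (record, record) -> bool
        self.parent = {}        # record id -> parent id (union-find forest)

    def find(self, rid):
        while self.parent[rid] != rid:
            rid = self.parent[rid]
        return rid

    def add(self, rid, record, store):
        self.parent[rid] = rid
        for other_id, other in store.items():
            if self.matches(record, other):
                # merge the two clusters
                self.parent[self.find(other_id)] = self.find(rid)
        store[rid] = record

store = {}
clusters = DuplicateClusters(matches=lambda a, b: a["email"] == b["email"])
clusters.add(1, {"email": "jo@example.com"}, store)
clusters.add(2, {"email": "jo@example.com"}, store)
assert clusters.find(1) == clusters.find(2)  # records 1 and 2 share a cluster
```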
When adding a set of rules, you can select a subset of the available blocking keys to be associated with the rules. All are selected by default. Only the associated blocking keys will be used to generate potential matches.
Selecting Enable cluster management will make the store available for review on the Review Duplicate stores list screen. This setting cannot be changed when the store is populated.
You can share Duplicate stores globally or with specific Spaces. From the Duplicate stores list page, click Sharing options. To use a Duplicate store that's been shared with you, click Include from another Space.
Before using the Find duplicates step, it is important to tag your data to make mapping columns easier. If the columns in your data are tagged already, the Find duplicates step will recognize the tagged columns and automatically assign a relevant Find duplicates column mapping to them. Otherwise, you can manually assign the column mappings based on your knowledge of the data.
This step will only recognize the following system-defined tags:
It is important to map your columns as accurately as possible before using the Find duplicates step to make the matching process more efficient. For example, mapping a column as Address when it contains primarily company or name information will lead to less accurate results.
Additionally, using the more granular address element mappings such as Premise, Street, and Locality, as opposed to the higher-level Address mapping (provided your data is divided in such a way), means that less effort is required to identify individual address components.
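As a hypothetical illustration of the difference (the column names here are invented):

```python
# One concatenated column forces the step to parse every address component:
coarse_mappings = {"FULL_ADDRESS": "Address"}

# Pre-split data can use the granular element mappings directly:
granular_mappings = {
    "BUILDING": "Premise",
    "STREET":   "Street",
    "TOWN":     "Locality",
}
```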
For more information on how Find duplicates utilizes these column mappings, you can refer to the advanced configuration page.
You can apply different rulesets to columns with the same tag by using group IDs.
For example, you may have delivery and billing addresses that you want to treat differently. You would tag both as an address but create separate group IDs, allowing you to apply different rulesets: accept only an exact match for the billing address, but a close match for the delivery address.
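A hypothetical sketch of how such mappings and rule intents might be expressed (these structures are illustrative, not Data Studio configuration):

```python
# Both columns carry the Address tag, but distinct group IDs keep them apart:
column_mappings = [
    {"column": "BILLING_ADDRESS",  "tag": "Address", "group_id": "billing"},
    {"column": "DELIVERY_ADDRESS", "tag": "Address", "group_id": "delivery"},
]

# Different rule intents per group:
ruleset_by_group = {
    "billing":  "exact match only",
    "delivery": "close match accepted",
}
```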
For bulk operations, refer to the Workflow steps.
For real-time operations, such as instant matching or deleting records to maintain data integrity, refer to Real-time Workflows.