Try our use cases

Goal

I want to see how the quality of my customers' data changes over time.

Task

Create a snapshot of data and analyze the trends.

Prerequisites

  • You're licensed to use the Analyze trends workflow step
  • The following sample data files are available in Data Studio:
    • Customer V1.csv
    • Customer V3.csv

1) Create a validation workflow

  1. Go to Workflow Designer and click Create a new workflow.
  2. Enter a name (e.g. Analyze trends) and click Submit.
  3. In the Available data sources tab on the left-hand side, click Sample Data Source.
  4. Drag and drop the Customer V1 file.
  5. Open the Workflow steps (second) tab.
  6. Drag the Transform step and connect it to the source file.
  7. Drag and connect the Validate step to the Transform one. The workflow should look like this:
  8. We want to create three separate validation rules to check that:
    • First Order Date is a date
    • Customer Id is not null and
    • The email syntax is valid
  9. Click Configure Rules in the Validate step.
  10. Click Add new rule, give it a name (e.g. First order date is date) and click Create.
  11. Search for Date and drag that in.
  12. Click <Input value> and select First Order Date to check that values in this column are dates.
  13. Click Apply in the top menu to save changes.
  14. Create the second rule. Click Configure Rules again and then Add new rule.
  15. Give it a name (e.g. Customer Id is not null) and click Create.
  16. Search for Null and drag that in.
  17. Click <Input value> and select Customer Id to check that there's a value.
  18. Click the result value to toggle it (changing the check from is null to is not null).
  19. Click Apply in the top menu to save changes.
  20. Create the last rule. Click Configure Rules and then Add new rule.
  21. Give it a name (e.g. Email syntax is valid) and click Create.
  22. Search for Matches expression (under Compare) and drag that in.
  23. Click <Input value> and select Email.
  24. Click <Comparison value> and search for Email Address (under Email Expressions). This will check that the email syntax is valid.
  25. Click Apply in the top menu to save changes. The workflow will look like this:
  26. In the Validate step, click Results by rule to view the results:
  27. Click Close in the top menu to go back to the workflow.
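
For reference, the three rules configured above boil down to simple per-column checks. The sketch below shows the same logic in pandas, assuming the Customer Id, First Order Date and Email columns from the sample file; the email pattern is a simplified stand-in for Data Studio's built-in Email Address expression.

    import pandas as pd

    df = pd.read_csv("Customer V1.csv")

    # Rule 1: First Order Date is a date (values that can't be parsed fail)
    first_order_is_date = pd.to_datetime(df["First Order Date"], errors="coerce").notna()

    # Rule 2: Customer Id is not null
    customer_id_not_null = df["Customer Id"].notna()

    # Rule 3: the email syntax is valid (simplified pattern)
    email_syntax_valid = df["Email"].fillna("").str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

    # "Results by rule": passed/failed counts per rule
    for name, passed in [("First order date is date", first_order_is_date),
                         ("Customer Id is not null", customer_id_not_null),
                         ("Email syntax is valid", email_syntax_valid)]:
        print(f"{name}: {int(passed.sum())} passed, {int((~passed).sum())} failed")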

2) Save results as a snapshot

  1. To see how these results change over time, we have to first create a copy of the current view by taking a snapshot.
  2. Open the Workflow steps (second) tab.
  3. Drag the Take snapshot step and connect it to the Results for analysis node in the Validate step.
  4. Click on the auto-generated snapshot name (Snapshot01), enter a name (e.g. Validation Results) and press enter. The workflow will look like this:
  5. Click Execute in the top menu to execute the workflow (this will also create the snapshot).
  6. By default, the Scheduled start will be set to Now, so just click Execute.
  7. Click Dismiss in the Job Completed dialog to get back to the workflow.
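
Conceptually, taking a snapshot persists a timestamped copy of the Validate step's results so that later runs can be compared against it. A minimal sketch of the idea (the file names are illustrative only, not product output):

    import datetime
    import pandas as pd

    # Hypothetical export of the Validate step's "Results for analysis" output
    results = pd.read_csv("validation_results.csv")

    # Persist a timestamped copy so later runs can be compared against it
    stamp = datetime.datetime.now().strftime("%Y-%m-%dT%H%M%S")
    results.to_csv(f"Validation Results {stamp}.csv", index=False)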

3) Replace the source file

  1. To simulate a different version of the source file, we'll use a different sample file. In the Available data sources (first) tab on the left-hand side, the Sample Data Source should be visible.
  2. Drag and drop the Customer V3 file.
  3. Remove the original file (Rows for Customer V1) by clicking X.
  4. Connect Rows for Customer V3 to the Transform step. The workflow should now look like this:
  5. In the Validate step, click Results by rule to see how the results have changed:
  6. Click Close in the top menu to get back to the workflow.
  7. Click Execute in the top menu to execute the workflow (this will also create the snapshot).
  8. By default, the Scheduled start will be set to Now, so just click Execute.
  9. Click Dismiss in the Job Completed dialog to get back to the workflow.

4) Analyze the trends

  1. To analyze trends using the two snapshots we've created, you can either create a new workflow entirely or use the one we've just created. We'll use the same one. Open the Workflow steps (second) tab.
  2. Drag the Use snapshot range step in.
  3. Click Undefined Workflow and select Analyze trends (or the name of the workflow we've just created).
  4. Click Undefined Snapshot and select Validation Results (or the name of the first snapshot we've created).
  5. From the Workflow steps (second) tab, drag the Analyze trends step and connect it to the Use Snapshot Range one.
  6. The tagged columns from the snapshot will be automatically picked up. By default, the Failed Rows metric will be selected.
  7. Click Show data in the Analyze trends step to see the results:
  8. Click Close in the top menu to get back to the workflow.

To see other metrics (such as passed rows), click on Failed Rows and select the required one. To view results as a chart, click Show Chart.
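
To make the comparison concrete: the Analyze trends step lines up the same metric (here Failed Rows) across the snapshots in the range. Below is a hedged pandas sketch of that calculation, using hypothetical snapshot exports and an assumed boolean Passed column per rule:

    import pandas as pd

    # Hypothetical exports of the two snapshots taken above
    snapshots = {
        "Customer V1": pd.read_csv("validation_results_v1.csv"),
        "Customer V3": pd.read_csv("validation_results_v3.csv"),
    }

    def failed_rows(results):
        # Failed Rows metric: per rule, count the rows where the rule did not pass
        return results.groupby("Rule")["Passed"].apply(lambda passed: int((~passed).sum()))

    trend = pd.DataFrame({label: failed_rows(r) for label, r in snapshots.items()})
    print(trend)                # failed rows per rule, one column per snapshot
    print(trend.diff(axis=1))   # change from one snapshot to the next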

Goal

I want to combine several data sources into one to have a global view of my customers' data.

Task

Combine data from three source files containing customer information (contact/order details) based on the customer ID and order number.

Prerequisites

The following sample data files are available in Data Studio:

  • Customer V1.csv
  • Purchase Order Header.csv
  • Purchase Order Detail.csv

Steps

  1. Go to Workflow Designer and click Create a new workflow.
  2. Enter a name (e.g. Combine data sources) and click Submit.
  3. In the Available data sources tab on the left-hand side, click Sample Data Source.
  4. Drag and drop the following files, one at a time:
    • Customer V1
    • Purchase Order Header
    • Purchase Order Detail
  5. Drag one of the purchase order files on top of the other one and select Join left in the Drop actions dialog that appears. This will connect the files in a Join step.
  6. In the Join step, click to see the suggested joins. Only exact column name matches will be suggested.
  7. Click on the suggested Order Id ⇐⇒ Order Id join to select these columns.
  8. Click Show data in the Join step to view the joined data. Click Close in the top menu to return to the workflow.
  9. The next step is to join this result to the Customer V1 file.
  10. Open the Workflow steps (second) tab on the left-hand side.
  11. Drag and drop another Join step.
  12. Connect this Join step to Rows for Customer V1 and the first Join step:
  13. In the second Join step, click to see the suggested joins.
  14. Click on the suggested Customer Id ⇐⇒ Customer Id join to apply it.
  15. Click Show data in the last step to view the results of this join. You will see that the join returns 0 rows. Click Close to return to the workflow.
  16. Click Keys in the last Join step. Viewing the two columns side by side, we can see that we don't need the characters preceding the '-' in the Customer Id values from the Customer V1 file. Click Close to return to the workflow.
  17. From the Workflow steps (second) tab on the left-hand side, drag the Transform step. We now have to connect the Transform step to both the Customer V1 and the last Join step manually.
  18. Click on the Undefined source node in the Transform step and drag it to the Rows for Customer V1.
  19. Now click on the end node in the Transform step and drag it to the top node in the last Join step.
  20. Click Arrange in the top menu to auto-arrange the steps. The workflow should look like this:
  21. In the Transform step, click Show data.
  22. Right-click on the Customer Id column header and select Add transformation.
  23. Search for After and drag it in.
  24. Click Suffix value, type in '-' then press enter.
  25. Click Apply in the top menu. You will see that the hyphen has been removed from the Customer Id values and an icon appears next to the header to indicate a transformed column.
  26. Click Close in the top menu to return to the workflow.
  27. In the Transform step, click Undefined column and select Customer Id.
  28. Click Show data on the second Join step to view the final result. You should now see the combination of your customer contact details and order details in one view.
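
The whole workflow above can be summarised in a few lines of pandas, which may help clarify what each step contributes. This is only an illustrative sketch using the sample file names and the Order Id / Customer Id join columns from the steps above; the join types are assumptions:

    import pandas as pd

    customers = pd.read_csv("Customer V1.csv")
    order_header = pd.read_csv("Purchase Order Header.csv")
    order_detail = pd.read_csv("Purchase Order Detail.csv")

    # First Join step: the two purchase order files joined on Order Id (Join left)
    orders = order_header.merge(order_detail, on="Order Id", how="left")

    # Transform step: keep only the text after the '-' in Customer Id,
    # so the values match the format used in the order files
    customers["Customer Id"] = customers["Customer Id"].str.split("-").str[-1]

    # Second Join step: customer contact details joined to the order details on Customer Id
    combined = customers.merge(orders, on="Customer Id")
    print(combined.head())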

You can also change the title of each step and give it a more descriptive name. For example: double-click on the Transform title, type in Remove '-' and press enter; rename the first Join step to say 'Join customer IDs':

Goal

I want my new marketing campaign to only target customers with deliverable postal and email addresses.

Task

Validate customer addresses and emails, clean them and remove duplicate records.

Prerequisites

  • You're licensed to use the Validate addresses, Validate emails, and Find duplicates workflow steps.
  • Experian Match and Experian Batch have been configured according to your license.
  • The following sample data files are available in Data Studio: AUS Find Duplicates Sample.csv, GBR Find Duplicates Sample.csv, and USA Find Duplicates Sample.csv.

1) Tag your data

One of the benefits of having your data tagged is to allow the Workflow Designer to apply intelligent defaults in your workflow steps, significantly speeding up workflow creation.

1.1) Enable auto-tagging

You can set up auto-tagging if you would like Data Studio to automatically detect columns containing names, addresses, and other customer-related data within a file. You can also easily train the system to recognize types of data that are specific to your organization (or not yet included in Data Studio's knowledge base).

To find out more about auto-tagging and how to enable it, head to this page.

1.2) Manually tag your data

Alternatively, you may choose to tag your data manually, using the following steps.

  1. In Data Explorer, click Sample Data Source.

  2. Right-click on either the AUS Find Duplicates Sample.csv, GBR Find Duplicates Sample.csv, or USA Find Duplicates Sample.csv (depending on which country you would like to work with) and select Preview and configure.

  3. Open the Headings tab.

  4. Click Multi select.

  5. Select the following headings: Address1, Address2, Address3, Town, County and Postcode.

  6. Right-click and select Tag columns. Click Yes to confirm that you want to tag multiple headings.

  7. Click Edit. Under System, select Address then click Tag.

  8. Tag the remaining headings one by one by right-clicking and selecting Tag column:

    Heading    Tag
    Name       Name
    Town       Locality
    County     Province
    Postcode   Postal Code
    Email      Email
    Dob        Date
  9. Click Apply in the top menu to save changes.

Here's an example of tagged columns:

2) Create the workflow

The workflow will validate postal addresses and emails and then remove duplicate records.

2.1) Validate addresses
  1. Go to Workflow Designer and click Create a new workflow.
  2. Enter a name (e.g. Validate and clean) and click Submit.
  3. In the Available data sources tab on the left-hand side click Sample Data Source.
  4. Drag and drop the Find Duplicates Sample file you would like to work with (AUS, GBR, or USA).
  5. Open the Workflow steps (second) tab.
  6. Drag the Validate addresses step and connect it to the data source. Because we've tagged the data already, the step will automatically pick up the address columns (Selected columns will show 6/10).
  7. To confirm that the correct columns have been auto-selected, click Selected columns. In the Validated list, you should see Address1, Address2, Address3, Town, County and Postcode. Click to confirm.
  8. Click Select country and pick United Kingdom from the list.
  9. Click Show data. Scroll to the right to see the validation results. Click Close when done.

An example of a correctly configured Validate Address step (for GBR):

Filter out unwanted rows:

  1. Drag the Split step and connect it to the Validate addresses one.
  2. Click Filter then Create. We're interested in the following results only: Verified Correct, Good Full Match and Tentative Full Match.
  3. Search for Equals under Multi Compare and drag that in.
  4. Click <Input value> and select Address Validate: MatchResult.
  5. Click <Comparison value>, type in Verified Correct and press enter.
  6. Repeat steps 4-5 with the <Comparison value> for Good Full Match and Tentative Full Match.
  7. Click Apply in the top menu to save changes.

  8. In the Split step, click Show passing rows to see the rows which passed the filter. Click Close to get back to your workflow.
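
The filter configured above is effectively an "is one of" check on the address match result. For illustration only (the input here is a hypothetical export of the Validate addresses output, using the Address Validate: MatchResult column referenced above):

    import pandas as pd

    # Hypothetical export of the Validate addresses step's output
    validated = pd.read_csv("validated_addresses.csv")

    accepted = ["Verified Correct", "Good Full Match", "Tentative Full Match"]
    is_deliverable = validated["Address Validate: MatchResult"].isin(accepted)

    passing_rows = validated[is_deliverable]    # what Show passing rows displays
    failing_rows = validated[~is_deliverable]   # what Show failing rows displays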

Tidy up results:

  1. Drag the Transform step and connect it to the Show passing rows node in the Split step.

  2. Click Columns.

  3. Click Multi select.

  4. Select Address1, Address2, Address3, Town, County and Postcode, then right-click and select Hide. Click Yes to confirm.

  5. Tag the validated address columns listed below, one at a time. Right-click on the column, select Tag column, then Edit to add a tag:

    Column                          Tag
    Address Validate: addressLine1  Address
    Address Validate: addressLine2  Address
    Address Validate: addressLine3  Address
    Address Validate: locality      Locality
    Address Validate: province      Province
    Address Validate: postalCode    Postal Code
    Address Validate: country       Country
  6. Click to save changes. The workflow should look like this:

2.2) Validate emails
  1. Drag the Validate emails step and connect it to the Transform one. The previously tagged Email field should be picked up automatically.
  2. Click <Select validation type> and select Domain level.
  3. Click Show data. Scroll to the right to see the validation results. Click Close when done.
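
Domain-level validation goes beyond checking syntax: it also checks that the domain part of the address exists and can receive mail. Data Studio performs this check internally; purely as an illustration, a similar check could be written with dnspython (an assumed third-party library, not part of the product):

    import dns.exception
    import dns.resolver  # pip install dnspython

    def domain_accepts_mail(email: str) -> bool:
        if "@" not in email:
            return False
        domain = email.rsplit("@", 1)[1]
        try:
            # A domain that publishes MX records is set up to receive mail
            dns.resolver.resolve(domain, "MX")
            return True
        except dns.exception.DNSException:
            return False

    print(domain_accepts_mail("someone@example.com"))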

An example of a correctly configured Validate Email step:


Filter out unwanted rows:

  1. Drag the Split step and connect it to the Validate emails one.
  2. Click Filter then Create. We're interested in validated emails only.
  3. Search for True and drag that in.
  4. Click <Input value> and select Email Domain: Result.
  5. Click Apply in the top menu to save changes.
  6. In the Split step, click Show passing rows to see all the rows which passed the filter.
  7. Click Close to get back to your workflow. It should look like this:
2.3) Remove duplicate records
  1. Drag the Find duplicates step and connect it to the Show passing rows node in the Split step.

  2. Select the 'GBR_Individual_Default' blocking key.

  3. Select the 'GBR_Individual_Default' ruleset.

  4. Click Show data to start the matching process. Note that this might take a minute. Scroll to the right to see the match status results. Click Close when done.

Filter out unwanted rows:

  1. Drag the Split step and connect it to the Find duplicates one.
  2. Click Filter then Create. We're interested in exact matches only.
  3. Search for Equals (standardised) under Compare and drag that in.
  4. Click <Input value> and select Duplicates: Match Status.
  5. Click <Comparison value>, type in 0 and press enter.
  6. Click Apply in the top menu to save changes.
  7. Drag the Group step and connect it to the Show passing rows in the Split step.
  8. Click Columns.
  9. Right-click on Duplicates: Cluster ID and select Group by.
  10. Click to save changes.
  11. Click Show data to see the exact records. Click Close when done.
  12. Before exporting the results, we want to combine these exact matches with other records for comparison.
  13. Drag the Union step and connect it to the Group one.
  14. Click Undefined source and connect it to the Show failing rows in the Split step.
  15. Click Show data in the Union step to see the records, out of the initial 103, that we have now identified as deliverable.
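
For orientation, the Find duplicates / Split / Group / Union chain above amounts to: keep one record per cluster of exact matches, then add back the records that weren't exact duplicates. A rough pandas sketch, assuming a hypothetical export of the Find duplicates output with the Duplicates: Match Status and Duplicates: Cluster ID columns used above:

    import pandas as pd

    # Hypothetical export of the Find duplicates step's output
    records = pd.read_csv("find_duplicates_output.csv")

    exact_matches = records[records["Duplicates: Match Status"] == 0]   # Split: passing rows
    other_records = records[records["Duplicates: Match Status"] != 0]   # Split: failing rows

    # Group step: one representative row per cluster of exact matches
    deduplicated = exact_matches.groupby("Duplicates: Cluster ID", as_index=False).first()

    # Union step: deduplicated exact matches plus the remaining records
    deliverable = pd.concat([deduplicated, other_records], ignore_index=True)
    print(len(deliverable), "records identified as deliverable")
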
2.4) Export data
  1. We want to export data into two files: one with validated and cleaned results, and another with the rest of the records.
  2. Drag the Export step and connect it to the Union one.
  3. Click Settings to specify the file type (.csv will be selected by default) and enter a file name.
  4. Click Apply.

To export all the remaining results, we need to combine them first.

  1. Drag the Union step and connect it to Show failing rows in the first two Split steps.
  2. Drag the Export step and connect it to this new Union one.
  3. Click Execute in the top menu to execute the workflow.
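
The two Export steps simply write two files: the validated, cleaned records from the first Union, and the combined failing rows from the first two Split steps. Sketched in the same illustrative style (all file names are hypothetical):

    import pandas as pd

    # Hypothetical exports of the relevant step outputs
    deliverable = pd.read_csv("deliverable_records.csv")        # output of the first Union
    address_failing = pd.read_csv("address_failing_rows.csv")   # failing rows, address Split
    email_failing = pd.read_csv("email_failing_rows.csv")       # failing rows, email Split

    deliverable.to_csv("validated_and_cleaned.csv", index=False)
    pd.concat([address_failing, email_failing], ignore_index=True).to_csv("remaining_records.csv", index=False)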

The final workflow should look like this:
