Use business rules to measure the quality, completeness and accuracy of your data.
A rule is a Function that returns a true or false value, which translates to a pass or fail. This can be either a default Function (like ‘Contains’ or ‘Is Date’) or a custom one.
The easiest way to get started is to Explore your data, Profile the desired columns and then use the Suggest validation rules action. This will analyze the data in these columns, and suggest formulas to be applied to them as validation rules. Uncheck any suggestions that are not useful and apply to create a new Workflow containing a Validate step, with initial rules and rule groups defined and ready for further changes if required.
On the Validate step, click Rules to view and edit any existing rules and to Add rule or Add group.
Each rule has a name and summary, a parent rule group and a Function definition. The Function can either be a pre-existing Function in Data Studio, or a custom Function created in the Function builder. When building a custom Function, pick one of the relevant columns (or a parameter) as the input value and ensure the Function returns true/false as the result.
Rule groups represent logical collections of validation rules and can have different thresholds set. For example, a group of compliance rules may have a fail threshold of 99%, but a data quality group of rules may use a lower threshold and be increased over time as data and processes are improved.
The Status result column for each rule is based on these thresholds, so Red below the fail threshold, Green at or above the pass threshold, and Amber in between.
Each group has a name and description, pass and fail threshold, plus an optional column to weight the results. As well as counting the number of Passed rows and Failed rows, the results will also include Passed weight and Failed weight columns, which contain a sum of the chosen column for all the passing/failing rows. For example, weighting each rule’s results by total sales allows you to prioritize data quality issues for the highest spending customers with the most overall impact.
Rules can be set to ignore or skip values that do not need to be validated, and so should not impact the rule pass rate. For example, where 10 emails are being checked; 5 are valid, 2 invalid and 3 are blank, the pass rate would be 5/7 (rather than 5/10 if all values were considered).
The Ignore Null values checkbox can be selected when first creating a rule using an existing Function. However, values other than null can also be ignored using the Ignore literal value when designing a custom Function.
Validation results are available in several formats:
Similar to the Source and Profile step, the Validate step has a Source metadata dropdown that enables you to include the lineage metadata of input data. This information (such as file name or batch ID) is useful to include as additional columns in the aggregated results.
Validate emails based on the format or domain address.
Select the Email column and pick one of the two available Validation type options:
Format Check: Checks whether the value matched a valid email format. Returns either true or false.
Examples of valid and invalid email formats:
Only one email can be validated at once; lists of emails as seen in the last example will be rejected.
Domain Level: Checks whether the value has a domain that exists and is an email server. This option returns both an overall validity result (true or false) in the Email domain: Result column, and additional information in the Email domain: Error column describing the reason for failure. The possible outcomes are:
|Valid||True||Domain exists and is a mail server.|
|Bad format||False||Email format is invalid.|
|Invalid domain||False||Domain validation check failed. The domain may not exist, or may have been flagged as illegitimate, disposable, harmful, nondeterministic or unverifiable.|
|Invalid name||False||Local part validation failed. For example it may have been identified as a spam trap or role account such as "firstname.lastname@example.org".|
|Failed to perform DNS lookup||False||An error occurred when attempting to perform the DNS lookup.|
Domain level validation results are cached with results refreshed every 30 days. The cache validity is configurable in Settings > Workflow steps by changing the Email validation cache validity setting.
Click Show step results to view the results.
Validate global phone numbers.
Once connected to the data, you have to specify the following:
Click Show step results to view the results. The following columns will be appended to your data:
Use this step to validate and enrich addresses in bulk using Experian Batch, depending on your license.
If your data has address columns tagged already, this step will automatically pick up all the columns tagged as addresses and list as Selected columns.
If you are using AddressBase Premium data, you can enable key searching by specifying which columns contain UPRNs and UDPRNs.
To enrich valid addresses, choose one of the available Additional datasets. The additional datasets that are available to you will depend on your license.
Using Additional options you can specify how the validated addresses will be returned:
Find out how to configure Experian Batch for the Validate addresses step.
An address cleansed in Data Studio will result in one of the following possible results:
|Verified Correct||Experian Batch verified the input address as a good-quality match to a complete address. No corrections or formatting changes were necessary.|
|Good Match||Experian Batch verified the input address as a good-quality match to a complete address, although minor corrections or formatting changes may have been applied.|
|Good Premise Partial||Experian Batch was not able to find a full match to a correct address, but found a good match to premise level by excluding organization or sub-premise details.|
|Tentative Match||Experian Batch found a match to a complete address, but the overall differences between the input and cleaned addresses are significant enough to reduce the confidence in the match.|
|Multiple Matches||Experian Batch found more than one correct address which matched the input address. This means that no single address could be matched with high confidence.|
|Poor Match||Experian Batch found a match to an address, but with low confidence. This often means that the cleaned address is not deliverable.|
|Partial Match||Experian Batch was unable to find a full correct address which matched the input address. This often occurs when the property number is missing from the input address.|
|Foreign Address||Experian Batch could not find a matching address because the input address referred to a different country.|
|Unmatched||Experian Batch was unable to match the input address to any correct address.|
|Processing Failure||The address input has not returned any results and may be of a bad format. Please report this to whoever manages Aperture Data Studio for your organization.|