Data Quality user documentation

Transform, manipulate, analyze and export your data with re-usable graphical workflows.

If you have issues or questions with workflows, try the troubleshooting.

Available sources

This menu lists all your data sources both raw files and database connections.

My files contains files available to you only; Data sources lists all the available database connections.

Workflow steps

Transforming your data is easy with Aperture Data Studio. Visual workflows are built using a variety of drag-and-drop steps.

Some steps (e.g. Find duplicates or Validate addresses) will have to be configured and licensed before you can use them. Find out about licensing.

When creating/modifying workflows, the top menu allows you to:

create a .pdf report on each workflow by clicking Report,
re-use a workflow by saving it as a .wfl file: click Export,
save the workflow output as a .csv, .txt, .xls, .xlsx, .xml, .dat or .json file by clicking Execute (single output only),
auto-arrange the workflow steps by clicking Arrange,
change the cursor to be able to select multiple steps by clicking Select. To switch back, click Move.

By default, the auto-connect for the workflow steps is enabled, allowing two steps with input/output nodes to be automatically linked when moved close to each other. You can disable this by clicking Disable snap; click Enable snap to turn the auto-connect back on.

To import one or more workflows (.wfl), click in the top menu.

Once added to the workspace, each workflow step can be renamed: double-click on the title of the step and enter the required name.

This step allows you to see data quality trends over time.

It works best when combined with the Use snapshot range step, using snapshots taken from the Validate step (and connected to the Results for analysis node).

Note that connecting from the Results for analysis node will mean that the output columns are automatically tagged for trend analysis.

In the step dialog, specify the following:

Category - the column that determines the version of results (e.g. execution time)
Series - the column that determines the data series (e.g. rule name)
Metric - the column containing the names of the metrics (e.g. processed rows)
Value - the column containing the values for the metrics (e.g. 90)
Metric (for display) - which metric (e.g. passed rows) you wish to view across each Category and Series. Options will be populated from the Metric column. This affects both the data and chart outputs.
All (Max Count) - the maximum number of snapshots you would like to be displayed within your specified snapshot range. You can set '0' indicating 'All' (to return all available snapshots), as well as numbers greater than 0.
Any (Max age) - the maximum number of days in which you'd like to data to be captured.
Seconds (How often to sample) - how often during each day within your snapshot range you would like data to be captured. You can choose Seconds, Minutes, Hours, Days, Months, or Years. If more than one snapshot is available for a given sample size (e.g. day, hour, etc), the most recent snapshot in each sample will be returned.

There are two options for displaying the trend results:

Evenly space chart categories - ensures the time points are displayed evenly along the x-axis (horizontal) of your results chart. This is useful if you have captured data at various times during your snapshot range and would prefer to see a more easy-to-read chart.
Show data - view as a grid with a visual indication (red or green triangle) for each value of the selected metric
Show chart - creates a line graph showing changes over time of the selected metric, for each series

You can also use this step to create regular reports, providing insight into the quality of your data.

The branch step allows you to duplicate data from one input so that it can be used for further workflow steps.

The rows from the source data will be duplicated (branched) into two outputs. To create more than two duplicates of the data, you can link several branch steps.

Click Show data in the step dialog to view the results.

The chart step allows you to create interactive charts of your data at any point in your workflow.

Including this step within the workflow will not affect the data connected to it. Connect this step to one input to create various types of charts. Click Bar chart (2D) in the step dialog to change the type of chart.

To add more columns to the chart (up to 10), click Add data series column. To further customize the chart, click Additional options.

Chart type	The type of chart you wish to draw (e.g. bar, line, pie).
Data label column	The column for labeling the x-axis. For example, if your data is split by year, use this setting to label the bars with their respective year.
Split series	Specify whether to display all the data series on the same chart or split them up.
Title	The title of your graph, displayed at the top.
Description	The description of your graph, displayed under the title.
X-axis label	Labels the X-axis (at the bottom).
Y-axis label	Labels the Y-axis (on the left).
Show legend	Specify whether to display the legend at the bottom.
Show trendline	Specify whether to calculate and draw a trendline on the graph - a straight line of the best fit for your data.
Show X/Y zero line	Specify whether to draw a line at the origin for the axis (useful if you have negative data).
Logarithmic X/Y axis	Specify whether to convert the selected axis to a logarithmic scale.
Show horizontal/ vertical grid lines	Specify whether to draw grid lines in the respective axis.
Stacked?	Specify whether to stack your data on top of each other.
Show section labels	(Pie chart only) Specify whether to show or hide the labels on segments.

Click Show chart in the step dialog to view your data as an interactive chart. Click View data to also show the data alongside the chart.

The chart will be based on the top 1,000 rows.

The export step allows you to export the workflow's output data to a file, DBMS or Azure service.

Each workflow can contain multiple export steps, so you can export data at any point. Each export step can be configured individually.

Click Settings in the step dialog to specify where and how the data will be exported.

Exporting to a file

Select File as the Workflow output type.
Specify the Export File Type (.csv, .txt, .xls or .xlsx, .xml, .dat or .json)
Choose the Data store type - the export location: user's or global (server's) export directory. Note that pre v1.4 workflows will continue to be exported to the server's export directory.
Enter the name of the file in Server filename
Enable Append date/time to filename to automatically add this information to the name of the file
Enable Overwrite server file to allow a file with the same name to be exported more than once (the latest copy will replace the previous one). If this option is disabled, you will get an error when trying to export to the same file name more than once.

To execute workflows with multiple outputs, the export settings defined in each export step have to be used.

Specify the Character set (most sets are supported)

Advanced settings

Column name style determines how the column names will be displayed in the export file:
- Display name - the column name as it appears in Data Studio
- External name - the column name as it appears in the source file
- Alias - the column's alternate name
- Name - the column name
- Camel case - each word or abbreviation in the middle of the phrase will begin with a capital letter (with no spaces or punctuation)
- Title case - the first letter of every word will be capitalized
- Humanised - the first word will be capitalized and underscores turned into spaces
- Database style - the column name will be in upper case and underscores will be used to separate words
- Lower case - the column name will be in lower case
- Upper case - the column name will be in upper case
- Don't export headers - the column names will be excluded from the exported file
End of line delimiter - specify the end of a line delimiter:
- Windows (CR+LF) - carriage return + line feed
- UNIX (LF) - line feed only
- Macintosh (CR) - carriage return only
- Custom - define your own control

Exporting to a DBMS

Select DBMS as the Workflow output type.
Choose a DBMS from the list of available ones
Enter the New Table name or select an existing table
Mode determines how the data will be inserted into the target system:
- Insert - rows are inserted into the table. The equivalent SQL statement is:
  "INSERT INTO tablename (column1, …) VALUES (value1, …)"
- Update - rows are updated. The mapping dialog will show which source columns are mapped to which target columns and which ones have been specified as keys. You can change the mapping and keys at this stage. The keys referred to are the columns that will appear in the equivalent SQL WHERE clause:
  "UPDATE tablename SET (column1=value1, …) WHERE (columnName=value, … )"
- Delete - rows are deleted. The keys are the column names used in the SQL WHERE clause:"DELETE FROM tablename WHERE columnName=value, … "
- Insert or update (if insert fails) - if a row insert fails (there is a key collision), then the statement will be retried as an update.
- Update or insert (if update fails) - if the update fails (no record found with the given key), then the statement will be retried as an insert.
- Create table - selecting this option will automatically create a table in the database. You will have to specify a name for the new table which will be validated against the data source.

Advanced settings

Commit:
- At the End - Any errors encountered during the export will result in the termination of execution and no changes will be applied to the target.
- ln batches - You may specify how many statements (i.e. the batch size - max is 1,000) can execute prior to a commit. Any statements issued after the last commit will be lost if there are any errors.
- Restartable - This is similar to commit in batches; however, all statements that will be executed have been written to a transaction log file first. If there are any errors during the execution that are not data related (e.g. a network error or a power outage) then you may restart the export and it will resume statement execution from the last commit onwards.
Error action - specify what will happen when an error occurs during the export:
- Abort - the export will fail when an error occurs;
- Abort at End - the export will complete and report on any errors in the log file;
- Abort After Percent - you can specify a percentage threshold for allowed errors: if(error count/rowCount)100 > percentage* then export will fail;
- Abort After Count - you can specify an absolute error threshold: if the number of errors exceeds this count then export will fail;
- Ignore - errors will be logged but ignored: the export will complete regardless.
Precheck - enable for data to be checked before starting the export. Use this together with the Error action (above). The following checks will be carried out for every cell:
- if primary key(s) have been defined for the target - check that the data in the primary key column(s) is unique and not null;
- if a target column is not nullable - check that the data is not null;
- check that the source data type is compatible with the target column's data type;
- for numeric data - check that the target column's scale and precision are large enough;
- for alphanumeric data - check that the target column's width is large enough.
Truncate - if Mode is Insert, you may truncate the remote table before applying any inserts.
Pre SQL - you can enter a SQL script that will be executed before to the export. The SQL may contain any valid statements and will be executed as is. Note that you have to include a COMMIT statement at the end of your statement.
Post SQL - you can enter a SQL script that will be executed after the export. The SQL may contain any valid statements and will be executed as is. Note that you have to include a COMMIT statement at the end of your statement.

Exporting to Azure

Select Azure as the Workflow output type.
Choose a DBMS from the list of available ones
Specify the Export File Type (.csv, .txt, .xls or .xlsx, .xml, .dat or .json)
Enter the name of the file in Server filename
Enable Append date/time to filename to automatically add this information to the name of the file
Enable Overwrite server file to allow a file with the same name to be exported more than once (the latest copy will replace the previous one). If this option is disabled, you will get an error when trying to export to the same file name more than once.

To execute workflows with multiple outputs, the export settings defined in each export step have to be used.

Specify the Character set (most sets are supported)

Advanced settings

Column name style determines how the column names will be displayed in the export file:
- Display name - the column name as it appears in Data Studio
- External name - the column name as it appears in the source file
- Alias - the column's alternate name
- Name - the column name
- Camel case - each word or abbreviation in the middle of the phrase will begin with a capital letter (with no spaces or punctuation)
- Title case - the first letter of every word will be capitalized
- Humanised - the first word will be capitalized and underscores turned into spaces
- Database style - the column name will be in upper case and underscores will be used to separate words
- Lower case - the column name will be in lower case
- Upper case - the column name will be in upper case
- Don't export headers - the column names will be excluded from the exported file
End of line delimiter - specify the end of a line delimiter:
- Windows (CR+LF) - carriage return + line feed
- UNIX (LF) - line feed only
- Macintosh (CR) - carriage return only
- Custom - define your own control

The filter step allows you to filter data using a variety of functions.

See Function editor for more details.

The Find duplicates step uses powerful standardization and matching algorithms to group together records containing similar contact data (e.g. name, address, email, phone) and keep that information within a duplicate store. Each group of records, known as a cluster, is assigned a unique cluster ID and a match level. The step provides out-of-the-box functionality for the United Kingdom, Australia, and the United States, but is also completely configurable down to the most granular name and contact elements.

Find out how to use, configure and troubleshoot this step.

This step allows you to kick off custom events at any point in your workflow.

Note that to use this step, at least one custom event has to exist.

Tick the Only fire if input has rows checkbox for the step to be kicked off only when the input has data.

Select the required Custom event then specify values for the available event data items (the event data name will be displayed).

For example, you want to kick off an event when the Validate step has failures. Select the 'Validation failed' custom event (see the example in Custom events section above) and specify the reason for the failure in the 'Reason' event data item. This reason will now be available as an event data item when creating/editing notifications.

Another example is using a Filter step to identify records that you have particular interest in. The Fire event step can then follow the Filter step but only fire when the Filter step is producing records.

Grouping allows you to group values from one input into one or more columns and aggregate the remaining columns, if required.

Click Grouping in the step dialog then drag and drop the required columns up, into the Grouped list. Remove columns from this list by simply dragging them down.

Click to apply the changes and Show data in the step dialog to view the results.

You also configure grouped columns and use aggregate functions on the grouped values when in Show data view: right-click on the grouped column heading and select Add aggregate. In the dialog that appears, specify the name/description for the new column, the source column to be used for aggregation and the required function.

For some aggregation functions, you can choose to only process unique values by enabling the Distinct option.

Harmonization is the process of deduplication - taking multiple records that are deemed to be for the same underlying real-world subject and producing a single resulting record.

Find out how to use this step.

The Join step allows you to join two inputs into one based on one or more columns from each input.

To specify the columns you want to join, click Left Columns or Right Columns.
The Join step provides three matching options which determine how keys will be compared:

Exact match: rows will be joined if key values on the left-hand side and right-hand side are exactly the same, including data type and casing.
Ignore datatype: rows will be joined if keys can be standardized to the same value. A key with the alphanumeric value '1' will be joined with a key that is the integer value 1.
Case-insensitive: rows will be joined if keys match with casing ignored. A key with value 'ABC' will be joined to a key with value 'abc'.

Click on the Venn diagram to specify the type of join you want to apply:

Inner (yellow segment of the Venn) - returns all rows where the join condition is met (keys match)
Left outer (yellow and green segments) - returns all rows from the left-hand table and only the rows from the right-hand table where the join condition is met (keys match)
Left outer without intersection (green segment) - a variant on the basic left outer join: returns only rows from the left-hand table that don't meet the join condition. No rows from the right-hand table are returned.
Right outer (yellow and blue segments) - returns all rows from the right-hand table and only the rows from the left-hand table where the join condition is met (keys match)
Right outer without intersection (blue segment only) - a variant on the basic right outer join: returns only rows from the right-hand table that don't meet the join condition. No rows from the left-hand table are returned.
Full outer (all segments of the Venn)- returns all rows in both the left and right tables. Any time a row has no match in the other table, the select list columns from the other table contains null values.
Full outer without intersection (green and blue segments) - returns all rows in both the left and right tables, excluding rows where the join condition is met.
Cross join or Cartesian product (no segments of the Venn) - returns all rows from the left table. Each row from the left table is combined with all rows from the right table.

Use the All matches menus to specify different combinations of join. The top left menu is for the top join table and the bottom right one for the second join table.

All matches - perform a standard join
Singletons only - will return the first match of a 1 to many match AND any matches that occur only once (this is a normal relational join with duplicates removed)
Duplicates only - will return all the matches of things that are 1 to many only (this is a normal relational join excluding things that match only once)

In the join dialog, click Show data to view the results of the join. Specify which values from the joined columns to use in any following steps: all the matches, only duplicates or only singletons.

Click to see the suggested joins. Only exact matches from the two sources will be suggested. Click on the suggested join to apply it.

The multi view step is a very useful step for viewing your data side by side (horizontally or vertically).

You can link this step to 2-3 data sources to view and compare the data. By default, the horizontal view will be selected. To change to the vertical view, click Horizontal split.

Click Show view to display the results. For more complex views, you can also link this step to another multi-view step.

Only the top input will be passed on to a subsequent non-multi view step.

This step allows you to profile data at any point in a workflow. We examine each value and identify various attributes (such as type, min/max values, count) and determine its uniqueness and completeness.

Click on Profiled columns to select the column(s) you want to profile. In the Profiled columns list on the left, select the required ones and click . To see the results, click Show data in the step dialog.

Note that results might take some time to load.

You can now export or take a snapshot of your profiled data to track data uniqueness over time.

This step allows you to create a sample of the data the step is connected to. You can specify the following sampling settings:

Start at row - the row number from which to start loading
Sample one row in every - specify the frequency of rows to sample
Sample randomly - enable to create a random sample
Limit rows - enable to specify a Row limit that will be used for the sample

To use this step, you have to have at least one JavaScript available in the script library: Glossary > Scripts.

This will allow you to use the script as a workflow step to transform your data.

Click Show data in the step dialog to view the results. The output can then be used as input to any further step(s).

To use this step, you have to have at least one R script available in the script library: Glossary > Scripts.

This will allow you to use the script as a workflow step to transform your data.

Click Show data in the step dialog to view the results.

This step allows you to sort data in ascending/descending order.

Click Sorting in the step, select the required columns and click to save changes.

To change the sort order, click on the column in the Sorted list.

A snapshot is a versioned copy of data taken at a certain point in time (and saved internally in Data Studio as a table with any transformations). Each snapshot is automatically assigned a unique name which you can modify.

Once created, a snapshot will appear in the list of Available data sources and can be used in any workflow.

Take snapshot step

This step will create a snapshot (once the workflow is executed). Each time you execute a workflow with this step, a new version of the snapshot will be created.

If you want to limit the number of snapshots that are created per each step, go to Additional options and turn on Enable version limit. You will then be able to specify how many versions of that particular snapshot that you want to store.

Use latest snapshot step

This step allows you to use the latest available snapshot. First, choose the workflow it belongs to then select the required snapshot.

Use snapshot range step

This step allows you to use several snapshots at a time (it's the union of all the snapshots in the range). If your data changes over time, you can compare the current version to the previously created snapshots. To choose the required snapshot range, find the workflow it belongs to then select the snapshot range.

Click Show data to see the details of all your snapshots in one table.

Find out how to configure an ODBC connection.

Splicing allows you to combine two inputs into one by presenting the values next to each other.

Data in the columns is not combined or merged - the duplicated values are renamed with an appended numeric value. The output – combined values from two sources – can then be used as input to any further step(s).

Click Show data in the step dialog to view the results.

The split step allows you to apply a true/false filter to one input and split the resulting data into two outputs based on these values.

To create a filter, click Filter in the step dialog and Create in the left-hand menu. Specify the necessary filters and click Apply to save the changes.

Click Show passing/Failing rows to view the values that passed/failed the filter. The passing and failing results can be individually linked to any further step(s).

This step allows you to add/remove various functions and arrange/show and hide columns.

See Function editor for more details.

The union step allows you to combine two inputs into one output by presenting the values on top of each other. The source columns with matching headings are combined.

The output – combined values from two sources – can then be used as input to any further step(s).

Click Show data in the step dialog to view the results.

The validation step allows you to identify data that matches the specified rule(s). The results will either be true or false.

To perform validation, you have to set validation rule(s):

select from the list of suggested rules (picked up from the input columns) or
click Configure rules then add a new rule in the step dialog.

You will be taken to the Function editor, allowing you to define the required rule(s). Once created, apply the rule(s) using the Edit validation configuration dialog.

Click View or sort rows to preview the data that will be validated.

To view validation results, click to see the required data: passing, failing and ignored rows. Each of these results can be further analyzed by linking them to other workflow steps.

To view information on each row, click Show data. You can right-click on a failing row and select Reasons for failed rules to see even further detail.

Click Results by rule to view the overall summary of validation results.

This step validates and enriches addresses in bulk.

Addresses will be cleaned by verifying them against the official postally-correct address files for the relevant country. Cleaned addresses are assigned a match result, based on the accuracy of the original address. You can define layouts specifying the number, content and format of the address output columns. Choose one of the available Additional datasets to enrich your data. The datasets that are available to you depend on your license.

Find out how to use, configure and troubleshoot this step.

Use this step to validate emails based on the format or domain address.

The two validation types are available:

Format check - whether the email is in a valid format (a syntax check against the RegEx values as defined in the Glossary)
Domain level - whether the email domain is valid/exists or not (a DNS server check to see if the domain exists and is an email server)

You can customize the way emails are validated:

specify your company's DNS servers: go to Configuration > Validate emails > DNS servers.

We use a Domain Name System (DNS) lookup to check whether the email's domain exists. If not specified, the DNS server will default to Google (8.8.8.8) which may be blocked by your firewall. If blocked, use nslookup to find a DNS server that can be used on your network and add it by clicking Add a new DNS server.

define email address format: go to Glossary > Constants > Email Expressions > Email Address

By default, the validation results are cached with results refreshed every 30 days. To change this, go to Configuration > Step settings > Email step > Email validation cache validity.

Use this step to quickly validate global phone numbers.

Once connected to a data source, you have to specify the following:

Select a phone column - specify the column containing the phone numbers you want to validate
Select country - pick the country to use for phone validation. You can either select a single country or pick a column with country data. For the latter, please ensure that the country names adhere to ISO 3166-1 alpha-2, ISO 3166-1 alpha-3 or ISO3166 country name standards.
Select validation rule - choose the validation rule(s) that will be applied:
- Valid phone number: shows True for phone numbers that have been successfully validated against the selected country and False otherwise.
- Valid phone number for the region: shows True for phone numbers that have been successfully validated against the region of the selected country and False otherwise.
- Possible phone number: shows True for possible phone numbers that have been successfully validated against the selected country and False otherwise.
- Invalid phone number: shows True for phone numbers that have been unsuccessfully validated against the selected country and False otherwise.
- Invalid phone number for the region: shows True for phone numbers that have been unsuccessfully validated against the region of the selected country and False otherwise.
- Not possible phone number: shows True for not possible phone numbers that have been successfully validated against the selected country and False otherwise.

Click Show data to see the validation results. The following columns will be appended to your data:

Validation results – shows the result of the validation rule (one column per each applied rule)
Phone Country - shows the country for the validated phone number.
Phone Number Type – shows the phone type (e.g. mobile or fixed line).

Note that you can also use the Validate phone number function for additional options such as appending the country code.

Workflow outputs

This tab lists all the outputs in your workflow, providing a list view which is useful for larger/complex workflows. To quickly find a specific output, right-click on it and select Locate.

Share workflows

Workflows can be shared with other users connected to the same instance of Data Studio. If you share a workflow, other users will be able to access and modify it.

To share a workflow, go to Workflow Designer, select the required one and click Share.

Note that multiple users can edit the same shared workflow simultaneously, and therefore potentially overwrite each other's changes.

All the shared workflows available to you will appear in the Shared workflows list.

To revert access to a workflow, select it and click Unshare. Note that administrators are able to unshare any workflow.

Execute workflows

A workflow can be executed on an ad-hoc basis or configured to be executed in the future (and at defined intervals, if required).

Go to Workflow Designer and click Execute on the workflow you want to execute now or later.

In the Schedule workflow execution dialog specify when and how often you want to execute the workflow and whether to export the output table.

When scheduling the workflow to be executed later, use the End time type to specify when the End action will be applied. This is the action that will be performed after the workflow is executed. The options are:

None – no action performed after execution
Kill – if the end time has been specified and the execution hasn't yet finished, workflow execution will be stopped
Reschedule – when the process completes or the end time has been reached, the execution will be rescheduled to run again according to the specified time intervals

You can also choose to export the workflow as a file or a script.

Refreshing data sources for workflow execution

When the workflow is re-executed, the data that was most recently read form the data source will be used.

To force all data sources used in a workflow to be automatically refreshed before each execution, the following conditions have to be met:

The refresh setting is enabled: select the required workflow and click Edit details. Enable Refresh on execution and click Apply.
If the data source is a file, it has to be modified more recently than the last time the data was read.
No other workflow that would cause the data source to be refreshed is being executed. This prevents two workflows attempting to refresh the same data source concurrently.

By default, the refresh setting is disabled for all workflows. To enable it for all new workflows, go to Configuration > All server settings and enable Default workflow refresh on execution.

Was this helpful?

Previous: Edit functions

Design workflows

Available sources

Workflow steps

Exporting to a file

Exporting to a DBMS

Exporting to Azure

Take snapshot step

Use latest snapshot step

Use snapshot range step

Workflow outputs

Share workflows

Execute workflows

Refreshing data sources for workflow execution

Aperture Data Studio v1

Get started