Create a Workflow

A Workflow is a sequence of one or more steps that allows you to transform and manipulate your data.

Think of a Workflow step as an action you can perform on your data. Some steps are simple (e.g. sampling data) while others are quite complex (e.g. identifying duplicate records). The power of Workflows lies not only in the flexibility of combining any actions you require but also in the ease of use - simply drag, drop and connect them.

In addition to all the default steps, you can create your own using the Java SDK (or contact us for our consultants to do it for you).

To create a Workflow:

  1. Go to Workflows.
  2. Click Create new Workflow.
  3. Enter a name.
  4. (Optional) Enter a summary and description, e.g. the purpose of the Workflow.
  5. (Optional) Give the Workflow an external label. This will be used in Notification events or referenced by the API.
  6. Click Apply. You're now in the Workflow editor, where you can create, modify, test and run Workflows.

After you create a new Workflow, you're taken into the Workflow editor in Edit mode. Since all Workflows require data, the Source step is automatically added.

Workflow editor options (edit mode):

  • Zoom In/Out within a Workflow.
  • Generate and download the Workflow report.
  • Run (execute) the Workflow.
  • Automatically Arrange steps.
  • Drag select to select Workflow steps (turn on/off).
  • Snap to automatically connect Workflow step nodes together (turn on/off).
  • Select all steps in a Workflow.
  • Delete selection to delete all selected Workflow steps.
  • Available for some steps: Convert to View to save the actions in selected steps as a View.
  • Workflow parameters to create, update or delete parameters in the Workflow.
  • Finish editing to save changes to the Workflow.

Workflow editor options (view mode):

  • Zoom In/Out within a Workflow.
  • Generate and download the Workflow report.
  • Run (execute) the Workflow.
  • View Workflow parameters.
  • Publish this version of the Workflow.
  • Enter Edit mode to make changes.
  • Close the Workflow editor.

Default Workflow steps

Branch

Duplicate data to multiple outputs.

The rows from the source data will be duplicated (branched) into two outputs. To create more than two duplicates of the data, you can link several Branch steps.

Chart

Visualize your data as a bar, pie, or gauge chart. Export as a PDF or an image file (.JPG, .PNG, .SVG). Find out more about the Chart step.

Columns to rows

This step allows you to convert column(s) into rows by clicking Selected columns in the column list on the left.

Specify the following for the new columns:

  • Attributes column heading - this will be populated with the selected column names.
  • Values column heading - this will be populated with the values of the selected columns.

To do the reverse, use the Rows to columns Workflow step.
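As an illustrative sketch (not the product's implementation), the transformation behaves like a "melt" over the selected columns; the row data and heading names below are hypothetical:

```python
def columns_to_rows(rows, selected, attr_heading, value_heading):
    """Convert the selected columns of each row into (attribute, value) rows."""
    out = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in selected}
        for col in selected:
            out.append({**kept, attr_heading: col, value_heading: row[col]})
    return out

rows = [{"ID": 1, "Home phone": "01", "Mobile": "07"}]
result = columns_to_rows(rows, ["Home phone", "Mobile"], "Attributes", "Values")
# One input row becomes one output row per selected column
```

Each input row produces one output row per selected column, with the unselected columns carried through unchanged.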

Export

This step allows you to export your Workflow's output data to a file or an External system that uses a JDBC connection.

Each Workflow can contain multiple Export steps, so you can export data at any point. Each of these steps can be configured individually.

Click Export Settings in the step dialog to specify where and how the data will be exported.

Exporting to a file

  • Specify the Exporter (.csv, .txt, .xls, .xlsx or .json).
  • Choose the Target System - the export location. The default Server export directory will export files into the appropriate Space's export directory on the server. You can also export to an external file system (Apache Hadoop HDFS, Amazon S3, Microsoft Azure Blob Storage, or SFTP server) if one is configured.
  • Enter the name of the file or assign a Workflow parameter for Filename.
    • The Filename is made up of several components. By default, it has two: one for the name of the file and one for a timestamp.
    • The timestamp is the date and time at the point that the Workflow is executed.
    • You can add new components by using @ on your keyboard to select and enter a template item.
    • Components can be deleted by removing the template directly from the Filename.
    • Each component can either be custom text or a timestamp in one of the available datetime formats.
    • An example of the resulting filename is shown as you build up the filename components.
  • To remove the default extension (such as .csv) from exported files, un-select Include default file extension.
    • You can export files with a custom extension by excluding the default and specifying a custom extension at end of the Filename.
  • Enable Overwrite existing to allow a file with the same name to be exported more than once (the latest copy will replace the previous one). If this option is disabled, you will get an error when trying to export to the same file name more than once.
  • Select Compression or encryption to export the output as one of the following:
    • None for a normal export file
    • Compressed only to export as a compressed file. Currently, only .zip files are supported. By default, the Compressed filename will match the Filename, with the extension defined by the compression type. You can change this if required.
      • To add new components, use @ on your keyboard to select and enter a template item.
      • To delete components, remove the template directly from the Compressed filename.
      • Each component can either be custom text or a timestamp in one of the available datetime formats.
      • An example of the resulting filename is shown as you build up the compressed filename components.
    • Compress and encrypt to export as an encrypted file. By default, the Encrypted filename will match the Filename, with the extension defined by the encryption type. You can change this if required.
      • Select an existing key based on the Encryption key name. Find out how to add a new encryption key.
      • To add new components, use @ on your keyboard to select and enter a template item.
      • To delete components, remove the template directly from the Encrypted filename.
      • Each component can either be custom text or a timestamp in one of the available datetime formats.
      • An example of the resulting filename is shown as you build up the encrypted filename components.
  • If Include column names is checked, the first row in the exported file will contain column names. This doesn't apply to .json files.
  • Column name style determines how the column names will be displayed in the export file:
    • None - the column name as it appears in Data Studio.
    • Database style - the column name will be in upper case and underscores will be used to separate words.
    • Lower case - the column name will be in lower case.
    • Upper case - the column name will be in upper case.
    • Humanized - the first word will be capitalized and underscores turned into spaces.
  • Specify the Character set (most sets are supported).
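A sketch of how the Column name style options might map a column name (the exact behavior, e.g. handling of repeated spaces, may differ):

```python
def style_column_name(name: str, style: str) -> str:
    if style == "Database style":   # upper case, underscores separate words
        return name.replace(" ", "_").upper()
    if style == "Lower case":
        return name.lower()
    if style == "Upper case":
        return name.upper()
    if style == "Humanized":        # first word capitalized, underscores to spaces
        return name.replace("_", " ").capitalize()
    return name                     # "None": as it appears in Data Studio

print(style_column_name("First name", "Database style"))  # FIRST_NAME
print(style_column_name("FIRST_NAME", "Humanized"))       # First name
```
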
Custom settings for CSV files

You can specify a delimiter and filename extension for CSV files (created in an Export step or downloaded from a grid of data):

  1. Navigate to the installation folder (by default C:\Program Files\Experian\Aperture Data Studio {version}) and open the server.properties file.

  2. On a new line in the file, add the setting you want to modify:

    Setting name                 Default value   Description
    Data.csvDelimiter            ,               Any string; special keywords can be used (see below)
    Data.csvFilenameExtension    csv             All non-alphanumeric characters will be removed

    Keyword      Value
    BAR          ¦
    COLON        :
    COMMA        ,
    NONE         \0
    PIPE         |
    SEMICOLON    ;
    SPACE        ' '
    TAB          \t

    For example, to change the delimiter to tab, the entry in server.properties should be:

    Data.csvDelimiter=TAB
    

Exporting to an External System using JDBC

  • Select JDBC as the Exporter.
  • Choose a DBMS system from the list of available ones. This is a list of the External System connections configured in Data Studio that use a JDBC driver.
  • Choose a Credential for authenticating with the DBMS. These credentials will have been created alongside the External System definition.
  • Select the target table from the list in Table name. These are the tables that you can access in the target DBMS using the selected credential.
  • Mode determines how the data will be exported into the target system:
    • Insert: New rows are inserted into the table. The equivalent SQL statement is:
      "INSERT INTO tablename (column1, …) VALUES (value1, …)"
    • Update: Existing rows are updated. The mapping dialog will show which source columns are mapped to which target columns and which ones have been specified as keys. You can change the mapping and keys at this stage. The keys referred to are the columns that will appear in the equivalent SQL WHERE clause:
      "UPDATE tablename SET (column1=value1, …) WHERE (columnName=value, … )"
    • Delete: Rows are deleted. The keys are the column names used in the SQL WHERE clause:
      "DELETE FROM tablename WHERE columnName=value, … "
    • Insert or update (if insert fails): If a row insert fails (there is a key collision), then the statement will be retried as an update.
    • Update or insert (if update fails): If the update fails (no record found with the given key), then the statement will be retried as an insert.
  • For Insert mode, you have the option to Truncate table before carrying out the row inserts.
  • For all modes other than Insert, you will need to select one or more Key columns from those in your Workflow data. These columns need to match the key fields in your selected target table.
  • Select Atomic database update if the export should be performed in a single transaction. In this mode the database may lock table(s) for the duration of the update, which can be a considerable time for large exports.
  • Commit batch size: Commit a batch after the specified number of records have been sent to the database. This setting is only available when Atomic database update is unset. Note that the final batch usually has fewer records.
  • Pre SQL: Optionally enter a SQL statement that will be executed before the export. The SQL may contain any valid statements and will be executed on the target database.
  • Post SQL: Optionally enter a SQL statement, similar to Pre SQL but which will be executed after the export has completed.
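The modes above map to SQL statements roughly as follows. This sketch only builds the statement text for illustration (hypothetical table and column names); a real JDBC export would use parameterized statements rather than inlined values:

```python
def build_statement(mode, table, row, keys=()):
    """Build the SQL text for one exported row, per the selected Mode."""
    non_keys = [c for c in row if c not in keys]
    if mode == "Insert":
        names = ", ".join(row)
        values = ", ".join(repr(row[c]) for c in row)
        return f"INSERT INTO {table} ({names}) VALUES ({values})"
    # Update and Delete locate rows via the selected Key columns
    where = " AND ".join(f"{k}={row[k]!r}" for k in keys)
    if mode == "Update":
        sets = ", ".join(f"{c}={row[c]!r}" for c in non_keys)
        return f"UPDATE {table} SET {sets} WHERE {where}"
    if mode == "Delete":
        return f"DELETE FROM {table} WHERE {where}"

row = {"id": 7, "name": "Ann"}
print(build_statement("Update", "customers", row, keys=("id",)))
# UPDATE customers SET name='Ann' WHERE id=7
```

The two combined modes ("Insert or update" and "Update or insert") simply retry the row with the other statement when the first one fails.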

Filter

Filter your data using a function that returns a boolean (true or false) value. If the function returns true, the given row will be passed through to the step output.
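For example (a hypothetical sketch with made-up rows), a row passes only when the function returns true:

```python
rows = [{"name": "Ann", "age": 34}, {"name": "Bob", "age": 17}]

def is_adult(row):
    return row["age"] >= 18   # boolean function: True passes the row through

output = [row for row in rows if is_adult(row)]
print(output)  # [{'name': 'Ann', 'age': 34}]
```
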

Find duplicates

The Find duplicates step uses powerful standardization and matching algorithms to group together records containing similar contact data (e.g. name, address, email, phone) and keep that information within a duplicate store. Each group of records, known as a cluster, is assigned a unique cluster ID and a match level. The step provides out-of-the-box functionality for the United Kingdom, Australia, and the United States, but is also completely configurable down to the most granular name and contact elements.

Find out how to use, configure and troubleshoot this step.

Fire event

Fires a user-defined event from the Workflow.

You can kick off custom events at any point in your Workflow.

Tick the Only fire if input has rows checkbox for the step to be kicked off only when the input has data.

Select the required Custom event then specify values or assign a Workflow parameter for the available event data items (the event data name will be displayed).

For example, you want to kick off an event when the Validate step has failures. Select the 'Validation failed' custom event and specify the reason for the failure in the 'Reason' event data item. This reason will now be available as an event data item when creating/editing notifications.

Another example is using a Filter step to identify records that you have particular interest in. The Fire event step can then follow the Filter step but only fire when the Filter step is producing records.

Group

Group and aggregate column values.

Grouping allows you to group values from one input into one or more columns and aggregate the remaining columns, if required.

Click Grouping in the step dialog then drag and drop the required columns up into the Grouped list. Remove columns from this list by simply dragging them down. Click Apply when done.

You can also configure grouped columns and use aggregate functions on the grouped values when in Show step results view: right-click on the grouped column heading and select Add aggregate. In the dialog that appears, specify the name/description for the new column, the source column to be used for aggregation and the required function.

You can drill into a particular group to see the rows that belong to it by exploring the results, selecting a particular group, right-clicking and selecting View rows in group.

Viewing group rows in other steps

The ability to drill down into the Group step results will be retained if the step is connected to any of the following subsequent steps:

  • Branch
  • Chart
  • Export
  • Filter
  • Fire event
  • Output
  • Sample
  • Sort
  • Split
  • Take Snapshot
  • Transform
  • Union

You can view the drilldown using the right-click options or the toolbar on these steps.

Interactive Datasets

The results of the grouping can be saved to a Dataset by including a Take Snapshot step in your Workflow. Furthermore, if the Interactive option is selected when creating a new Dataset on the Take Snapshot step, you can drill into the data to view the rows in each group in the Dataset.

Once an interactive Dataset has been created, it can be drilled into in various ways, such as:

  • The interactive Dataset itself.
  • A View with the interactive Dataset as the source.
  • A Source Workflow step with the Interactive Dataset as the source.
  • A Source Workflow step, with a View on the Interactive Dataset as the source.

Interactivity also works for multi batch Datasets.

Harmonize duplicates

Harmonization is the process of deduplication - taking multiple records that are deemed to be for the same underlying real-world subject and producing a single resulting record.

Find out how to use this step.

Join

Join two inputs into one based on one or more columns from each input.

To specify the columns you want to join, click Left side columns or Right side columns.

Use the Retention type menus to specify different combinations of join. The top menu is for the first join input and the bottom one for the second join input.

  • All matches: Performs a standard join.
  • Singletons only: Return the first match of a one-to-many match and any matches that occur only once. This is a normal relational join with duplicates removed.
  • Duplicates only: Return only the matches that are one-to-many. This is a normal relational join excluding matches that match only once.

Click on the Venn diagram to specify the type of join you want to apply:

  • Inner (center segment of the Venn): Returns all rows where the join condition is met (keys match).
  • Left outer (left and center segments): Returns all rows from the left-hand input and only the rows from the right-hand input where the join condition is met (keys match).
  • Left outer without intersection (left segment): A variant on the basic left outer join, which returns only rows from the left-hand input that don't meet the join condition. No rows from the right-hand input are returned.
  • Right outer (center and right segments): Returns all rows from the right-hand input and only the rows from the left-hand input where the join condition is met (keys match).
  • Right outer without intersection (right segment only): A variant on the basic right outer join, which returns only rows from the right-hand input that don't meet the join condition. No rows from the left-hand input are returned.
  • Full outer (all segments of the Venn): Returns all rows in both the left and right inputs. If a row from one side of the join has no match in the input on the other side, the columns from the other input will contain null values.
  • Full outer without intersection (left and right segments only): Returns all rows in both the left and right inputs, excluding rows where the join condition is met.
  • Cross join or Cartesian product (no segments of the Venn): Returns all rows from the left input. Each row from the left input is combined with all rows from the right input.
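In plain key-matching terms, the Venn options behave like this sketch (hypothetical data; All matches retention assumed; the cross join, which simply pairs every left row with every right row, is omitted):

```python
def join(left, right, key, how):
    inner = how in ("inner", "left outer", "right outer", "full outer")
    keep_l = how in ("left outer", "left outer without intersection",
                     "full outer", "full outer without intersection")
    keep_r = how in ("right outer", "right outer without intersection",
                     "full outer", "full outer without intersection")
    left_keys = {l[key] for l in left}
    right_keys = {r[key] for r in right}
    out = []
    if inner:   # rows where the join condition is met
        out += [{**l, **r} for l in left for r in right if l[key] == r[key]]
    if keep_l:  # unmatched left rows; right columns would be null
        out += [l for l in left if l[key] not in right_keys]
    if keep_r:  # unmatched right rows; left columns would be null
        out += [r for r in right if r[key] not in left_keys]
    return out

left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
right = [{"id": 2, "b": "p"}, {"id": 3, "b": "q"}]
print(join(left, right, "id", "inner"))                         # one matched row
print(join(left, right, "id", "left outer without intersection"))  # only unmatched left
```
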

Above the Venn diagram, select More options to expose more detailed options for the join.

The Join step provides three matching options which determine how keys will be compared:

  • Exact match: rows will be joined if key values on the left-hand side and right-hand side are exactly the same, including data type and casing.
  • Ignore datatype: rows will be joined if keys can be standardized to the same value. A key with the alphanumeric value '1' will be joined with a key that is the integer value 1.
  • Case-insensitive: rows will be joined if keys match with casing ignored. A key with value 'ABC' will be joined to a key with value 'abc'.

The Do not match nulls option configures how null values are handled in the join. When enabled, left or right column values containing nulls will never match with any other row. This means that rows where the join key contains nulls will never appear in inner join results and will always appear in left/right unmatched results if the selected join type will include left/right unmatched rows.

In the join dialog, click Show step results to view the results of the join. Specify which values from the joined columns to use in any following steps: all the matches, only duplicates or only singletons.

Click to see suggested join keys. Joins will be suggested only where column names match exactly on the two sides of the join. Click on a suggested join to apply it.

Lookup

Allows you to match values from a column in the first input (source) to a column in the second input (the lookup table / lookup input). Values are returned from a second column from the lookup table after an operation is applied to the set of values for the matched rows.

Step configuration

You can use any source in a Workflow as the input for this step: a Dataset, View, or the output from another step. You can even use the same output node for both the source and the lookup input (if you want to do a self-lookup).

To add multiple lookup tables, click Add additional lookup table.

Lookup definition

Click Definitions to open the configuration panel, then Add definitions to define a new lookup. Multiple lookup definitions can be created in a single Lookup step. A lookup definition consists of the following options:

Name

The name of the resulting new column from the lookup definition.

Lookup type

A lookup type is the operation applied to the value (or values, if more than one match exists in the lookup table) returned by the Lookup. These types/operations are applied to the set of values for the return column where the lookup column matches the lookup value.

Lookup type Description
Count Returns a count of the number of matches that the lookup value has in the lookup column.
Count excluding nulls Returns a count of the number of matches that the lookup value has in the lookup column ignoring nulls.
Exists Returns true if the lookup value has a match in the lookup column, false if not.
Not exists Returns false if the lookup value has a match in the lookup column, true if not.
First value Returns the first value (based on input row order) from the return column where the lookup value matches the lookup column.
Last value Returns the last value (based on input row order) from the return column where the lookup value matches the lookup column.
Minimum value Returns the minimum value (based on a sort of the values) from the return column where the lookup value matches the lookup column.
Maximum value Returns the maximum value (based on a sort of the values) from the return column where the lookup value matches the lookup column.
Sum Returns the sum of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be counted as '0'.
Average Returns the average (mean) of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric.
Median Returns the median (based on a sort of the values) of the values in the return column where the lookup value matches the lookup column.
Variance Returns the statistical variance of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric.
Variance (population) Returns the statistical variance of the values in the return column where the lookup value matches the lookup column where it is known that the values represent the whole population. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric.
Standard deviation Returns the standard deviation of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric.
Standard deviation (population) Returns the standard deviation of the values in the return column where the lookup value matches the lookup column where it is known that the values represent the whole population. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values are all non-numeric.
All values Returns a comma separated list of all values in the return column where the lookup value matches the lookup column. The order is as they appear in the input.
Reverse values Returns a comma separated list of all values in the return column where the lookup value matches the lookup column. The order is a reverse sort of the values.
Sorted values Returns a comma separated list of all values in the return column where the lookup value matches the lookup column. The order is a sort of the values.
Return column

The column from the lookup table that you want the above operations to act on. This option is not present if you have selected the Exists or Not Exists lookup types, as no value from the lookup table is returned.

Match type

This is where you define how you want to match values between your lookup value and lookup column. The selected match type will apply to both sides.

Match type Description
Exact Will only match if the value and datatype exactly match.
Ignore datatype Will match if the values match, ignoring the datatype (e.g. alphanumeric '1' will match numeric value 1).
Case-insensitive Will match if the values match, ignoring the casing (e.g. 'experian' will match 'experIAN').
Denoise Will match if the values match after special characters / whitespace etc. are removed (e.g. 'Experian' will match "Ex'per ian").
Format Will match if the formats of the two values are the same (e.g. '123abc' will match '456def').
Soundex Will match if the values match after going through the Soundex phonetic algorithm.
Metaphone Will match if the values match after going through the Metaphone phonetic algorithm.
Double metaphone Will match if the values match after going through the Double metaphone phonetic algorithm.
Distinct checkbox

If selected, each value from the return column will only be used once per lookup value match group. Only present for lookup types which have the potential to operate across multiple values, for example Average and All values.

Default value

The value that is returned if no match for the lookup value is found. This field can be left blank in which case a null value is returned. Some lookup types will never use this (e.g. Exists).

Lookup column

The column from the lookup table that contains the values you want to match on.

Lookup value

The column from the source that contains the values you want to match on.

Add more conditions button

Allows you to define additional pairs of lookup columns (additional match criteria) to further narrow matches.

Example

A source Dataset contains customer information alongside a column with a "discount code":

Customer ID Code
1 A
2 A
3 B
4 C
5 A
6 9

Another dataset contains the "discount code" value alongside the rate that that code represents. This will be the data that we look up into:

Discount Code Rate
A 5
B 10
C 15

We want to define a lookup that will return the appropriate rate for each discount code. The rate will be added as a new column in the customer (source) Dataset.

To do this, in a Workflow add a Lookup step with the customer dataset as the source input and the discount code dataset as the lookup input.

The lookup definition is set up as follows:

  • Name: Discount Rate
  • Lookup type: First value
  • Lookup table: Rows for discount code
  • Return column: Rate
  • Match type: Exact
  • Default value: "NO MATCH"
  • Lookup column: Code
  • Lookup value: Discount Code

Result:

Customer ID Code Discount Rate
1 A 5
2 A 5
3 B 10
4 C 15
5 A 5
6 9 NO MATCH
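The worked example above can be sketched as a First value lookup with a default:

```python
customers = [(1, "A"), (2, "A"), (3, "B"), (4, "C"), (5, "A"), (6, "9")]
rates = [("A", 5), ("B", 10), ("C", 15)]   # lookup table: Discount Code -> Rate

def first_value(lookup_value, table, default="NO MATCH"):
    # "First value": first matching return-column value, in input row order
    for code, rate in table:
        if code == lookup_value:           # Match type: Exact
            return rate
    return default                         # Default value when nothing matches

result = [(cid, code, first_value(code, rates)) for cid, code in customers]
print(result)
# [(1, 'A', 5), (2, 'A', 5), (3, 'B', 10), (4, 'C', 15), (5, 'A', 5), (6, '9', 'NO MATCH')]
```

Customer 6 has code '9', which has no match in the lookup column, so the configured default "NO MATCH" is returned.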

Lookup or Join?

Lookups and Joins can both achieve many of the same things when combining two data sources. However, their performance is optimized for different scenarios, so for large data volumes it is important to understand these differences before choosing which one to use in your Workflow.

In general, we recommend lookups for use-cases where the lookup table is small (e.g. a domain file of a few thousand, typically unique, values) or when you need to perform an operation on the results (e.g. calculate an average for the matched values). Joins are better suited for large-to-large file operations.

More specifically, for high cardinality columns (e.g. true unique IDs), lookups can take at least twice as long as an equivalent join. For low-cardinality (i.e. low uniqueness) / high row count keys, lookup indexes will build quickly compared to the corresponding join index.

Map to target

This step is used to map columns in your data to a target schema definition. The target schema must be a Dataset that is already loaded into Data Studio.

Map to target is often used in combination with the Export step to populate a target database schema but can also be used to rename multiple columns easily.

Auto-map functionality will attempt to map columns automatically where possible, based on column names in the source (input columns) and the target (output columns). You can also map columns manually.

If an input column is not selected in the mapping for a given target column, the values in the target column will be null.

If Include in output is unchecked for a target column, that column is not included in the step's output. Use this setting when mapping to a schema that includes auto-generated columns in the target database.

Output

When a re-usable Workflow uses the Output step, the step's data will be available in other Workflows.

Click Show step results to view the result from the previous step connected to the Output step.

Profile

Profile your data at any point in a Workflow.

We examine each value and identify various attributes (such as type, min/max values, count) and determine its uniqueness and completeness.

Click Profiled columns to select the column(s) you want to profile. In the profiled columns list on the left, select the required ones and click Apply. To see the results, click Show step results in the step dialog.

You can now export or take a snapshot of your profiled data to track data uniqueness over time.

When profiling data, you can select a column and choose Values or Formats to drill into the value or format information identified from the profiling process, and from there drill down again to the individual rows that contain that selected value or format.

Viewing profiled values, formats and rows in other steps

The ability to drill down into the Profile step results will be retained if the step is connected to any of the following subsequent steps:

  • Branch
  • Chart
  • Export
  • Filter
  • Fire event
  • Output
  • Sample
  • Sort
  • Split
  • Take Snapshot
  • Transform
  • Union

You can view the drilldown using the right-click options or the toolbar on these steps.

Interactive Datasets

The results of the profiling can be saved to a Dataset by including a Take Snapshot step in your Workflow. Furthermore, if the Interactive option is selected when creating a new Dataset on the Take Snapshot step, you can drill into the data to view the profiled values and formats in the Dataset. The underlying rows of the drilled down values or formats can also subsequently be drilled into.

Once an interactive Dataset has been created, it can be drilled into in various ways, such as:

  • The interactive Dataset itself.
  • A View with the interactive Dataset as the source.
  • A Source Workflow step with the Interactive Dataset as the source.
  • A Source Workflow step, with a View on the Interactive Dataset as the source.

Interactivity also works for multi batch Datasets.

Lineage metadata

To further improve data quality, Data Studio can capture lineage metadata of data sources. Lineage metadata can be included in Workflow outputs and subsequently used for further integration or processing.

In the Profile step, lineage metadata can be included in the output using the dropdown under More Options.

Multi view

View up to three inputs side by side (horizontally or vertically). Nest these steps for more flexibility over layout.

Click on Vertical split to change it to horizontal.

To make a report available to Data Consumer users, click Expose to Data Consumer users.

To view the report, click Show report.

Rows to columns

This step allows you to convert row(s) into columns by clicking Settings.

  • Column to split - this is the column containing the row values you want to convert to new column names.
  • Values column - this is the column containing the row values that will be populated in the new columns.
  • Click Profile for new values to automatically identify unique values from the Column to split column names (you can edit these as required). You can also manually enter a new column name, as long as it's an existing row value from the chosen column.

The step results will consist of the new column(s) and all other columns except for the Column to split and Values column.

To do the reverse, use the Columns to rows Workflow step.

Create a pivot table

You can use this step together with Group and Export to create a pivot table.

For example, you have the following car sales data:

Year Brand Sold Product ID Model
2019 Honda 10 1 HRV
2019 Mazda 15 2 CX5
2020 Honda 7 3 Civic Si
2019 BMW 3 4 3 Series
2019 Honda 5 5 CRV
2020 Mazda 50 6 Mazda 3

What you're interested in is the amount of Honda and Mazda sales per year:

Year Honda Mazda
2019 15 15
2020 7 50

To achieve that:

  1. Add a Source step with the relevant Dataset.

  2. Connect the Rows to columns step.

  3. Click Settings to configure it:
    • Select Brand as the Column to split.
    • Select Sold as the Values column.
    • Click Profile for new values. The values Honda, Mazda, BMW should be automatically populated.
    • Remove the BMW value and save changes.

  4. Connect the Group step.

  5. Click Grouping to configure it:
    • Remove Count from the Grouped and aggregated columns.
    • Move Year from the Ungrouped columns to the Grouped and aggregated columns.
    • Click Add aggregate and enter the following values:
      • Column selection type: Multiple
      • Selected columns: Honda and Mazda
      • Aggregate: Sum
    • Click Apply twice, to save changes and to apply changes to the groupings.

  6. Connect the Export step.

  7. Click Show step results to preview the results and export as required.
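The steps above can be sketched as follows (using 0 rather than a null for non-matching cells so the sums line up; the product's handling of empty cells may differ):

```python
from collections import defaultdict

sales = [
    (2019, "Honda", 10), (2019, "Mazda", 15), (2020, "Honda", 7),
    (2019, "BMW", 3), (2019, "Honda", 5), (2020, "Mazda", 50),
]
brands = ["Honda", "Mazda"]   # "Profile for new values", with BMW removed

# Rows to columns: one new column per brand, populated from Sold
wide = [{"Year": y, **{b: (sold if brand == b else 0) for b in brands}}
        for y, brand, sold in sales]

# Group on Year with a Sum aggregate over the new columns
totals = defaultdict(lambda: dict.fromkeys(brands, 0))
for row in wide:
    for b in brands:
        totals[row["Year"]][b] += row[b]

print(dict(totals))  # {2019: {'Honda': 15, 'Mazda': 15}, 2020: {'Honda': 7, 'Mazda': 50}}
```

This reproduces the target table: Honda sold 15 in 2019 (10 + 5) and 7 in 2020; Mazda sold 15 in 2019 and 50 in 2020.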

Sample

Sample and limit your data.

This step allows you to create a sample of the data the step is connected to. You can specify the following sampling settings:

  • Start at row - the row number from which to start loading
  • How often to sample - specify the frequency of rows to sample
  • Sample randomly - tick to create a random sample
  • Limit rows - tick to specify a Row limit that will be used for the sample

Click Show step results to view the results.
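The combined effect of the non-random settings can be sketched as follows. This is an approximation for illustration only; the function and parameter names are hypothetical, not the product's internals:

```python
def sample(rows, start_at=1, every=1, limit=None):
    """Approximate the non-random sampling settings: start at a
    1-based row number, keep every Nth row from there, and
    optionally cap the number of sampled rows."""
    out = rows[start_at - 1::every]
    return out if limit is None else out[:limit]

data = list(range(1, 11))  # rows 1..10
print(sample(data, start_at=3, every=2, limit=3))  # [3, 5, 7]
```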

Sort your data in ascending/descending order.

Click Sorting in the step, select the required columns and click Apply to save changes.

To change the sort order, click on the column in the Sorted list.

Click Show step results to view the results.

Your data source, often the first step in a Workflow. Select the required Dataset, View or Issue list to start manipulating and transforming your data.

Enable or disable the Delete batches on completion setting to automatically remove processed batches from a Dataset.

Enable or disable the Allow auto-refresh setting to automatically load a new batch of data into the selected Dataset or the Dataset used by the selected View, when the Workflow is executed.

For Issue lists, enable the Include resolved setting if issues that have already been marked as resolved should be processed in the Workflow.

Replaceable sources

Enable Can supply source when executed to make the data source replaceable, so that a different source can be provided when the Workflow is executed.

Enable Must be supplied to make it mandatory for a new data source to be assigned at Workflow execution. This option is available only if Can supply source when executed is selected.

There are two ways to replace the source on execution of a Workflow:

  1. Replace the Dataset used in the Source step with an alternative Dataset. The default Dataset option allows you to specify a different Dataset for this Workflow at execution.
  2. Upload a new file of data to be used as the new source. The File upload option will only appear when the Workflow configuration is suitable. This option allows you to use a different, temporary file for this Workflow execution. Note that the columns of this file have to be identical to the target Dataset and will be automatically mapped. Currently, only Datasets sourced from files (and not External systems) are supported.

The source can be replaced either when executing the Workflow directly (via the user interface or the REST API), or by making the Workflow re-usable and embedding it in a parent Workflow.

Source batch sampling

This feature allows the Source step to output only a sample set of batches instead of all of them. A multi-batch Dataset has to be selected as the source for these options to be available.

Take last

This option will allow you to output one of the following:

  • all batches that were created within a given timeframe from the current time.
  • the last given number of batches.
  • all batches.

Sample every

This option will only output the last batch created within each interval of the time unit provided.

For example, if you have three batches with these timestamps:

  • Batch 1: 2020-01-01 9:00
  • Batch 2: 2020-01-01 11:00
  • Batch 3: 2020-01-01 11:59

Depending on the unit of time selected, the following would be the output:

Day

  • Batch 3: 2020-01-01 11:59

Hour

  • Batch 1: 2020-01-01 9:00
  • Batch 3: 2020-01-01 11:59

Minute

  • Batch 1: 2020-01-01 9:00
  • Batch 2: 2020-01-01 11:00
  • Batch 3: 2020-01-01 11:59
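One way to model this "last batch per interval" behaviour is to truncate each batch timestamp to the chosen unit and keep the latest batch per truncated value. This is a sketch under that assumption, not the product's implementation:

```python
from datetime import datetime

batches = [
    ("Batch 1", datetime(2020, 1, 1, 9, 0)),
    ("Batch 2", datetime(2020, 1, 1, 11, 0)),
    ("Batch 3", datetime(2020, 1, 1, 11, 59)),
]

def sample_every(batches, unit):
    """Keep only the last batch created within each interval of `unit`."""
    truncate = {
        "day":    lambda t: t.replace(hour=0, minute=0),
        "hour":   lambda t: t.replace(minute=0),
        "minute": lambda t: t,
    }[unit]
    last_per_interval = {}
    for name, ts in batches:  # batches are in creation order
        last_per_interval[truncate(ts)] = name
    return list(last_per_interval.values())

print(sample_every(batches, "day"))     # ['Batch 3']
print(sample_every(batches, "hour"))    # ['Batch 1', 'Batch 3']
print(sample_every(batches, "minute"))  # ['Batch 1', 'Batch 2', 'Batch 3']
```

The three outputs reproduce the Day, Hour and Minute examples above.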

Lineage metadata

To further improve data quality, Data Studio can capture lineage metadata of data sources. Lineage metadata can be included in Workflow outputs and subsequently used for further integration or processing.

In the Source step, the available lineage metadata depends on the source type and can be included in the output using the dropdown under More Options after a Dataset has been selected.

Unique row IDs

Each row in a Dataset has a row ID which is guaranteed to be unique across all Datasets. These IDs can optionally be included in the output of the Source step. The IDs are generated in ascending order over time, so it's possible to use them to sort rows of data by the time they were first loaded into Data Studio.

Combine two inputs into one by presenting the values next to each other.

Data in the columns is not combined or merged; duplicate column names are renamed with an appended numeric value. The output – combined values from two sources – can then be used as input to any further step(s).

Click Show step results to view the results.
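The renaming of duplicate column names can be sketched as follows. The exact suffix format (a space plus a number) is an assumption for illustration, not necessarily what the product appends:

```python
def combine_side_by_side(headers_a, headers_b):
    """Combine two sets of column headings side by side, renaming any
    duplicate from the second input with an appended numeric value.
    (The suffix format here is an assumption.)"""
    seen = set(headers_a)
    combined = list(headers_a)
    for h in headers_b:
        name, n = h, 1
        while name in seen:  # bump the suffix until the name is unique
            n += 1
            name = f"{h} {n}"
        seen.add(name)
        combined.append(name)
    return combined

print(combine_side_by_side(["ID", "Name"], ["ID", "Score"]))
# ['ID', 'Name', 'ID 2', 'Score']
```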

Apply a true/false filter to one input, splitting it into passing and failing rows.

To create a filter, click Filter in the step dialog and Create in the left-hand menu. Specify the necessary filters and click Apply to save the changes.

Click Show passing rows/Show failing rows to view the values that passed/failed the filter. The passing and failing results can be individually linked to any further step(s).

Snapshots are copies of data that allow you to track changes over time or store your results for use elsewhere. A Snapshot is a type of Dataset that's created by and stored in Data Studio.

Taking Snapshots

Connect the Take Snapshot step's input to the data you'd like to capture in a Snapshot. When the Workflow is run, a new batch of data will be created.

This data can then be added to an existing or new Dataset:

  • Select an existing one from the Dataset list.

  • Use Create new Dataset to create a new one.

    • Name - add a name as it will appear in Data Studio.
    • Summary - an optional short summary of the Dataset.
    • Description - an optional, longer description.
    • Dataset type - choose one of the options: Single batch writes only the latest data to the Dataset and keeps no history; Multi batch retains older data in the Dataset, allowing for trend analysis.
    • Interactivity - This option will only be available if you're making a copy of Results by rule or Results by group from the Validate step. Selecting Interactive (with drill down) will result in a Dataset that allows users to drill into the underlying data and view the passing/failing rows for the rules/rule groups.
    • Add Batch Timestamp Column - When selected, an additional column will be added to the Dataset, recording the timestamp that the batch was created. This option is useful for trend analysis.
    • Allow automatic batch deletion - This will ensure that data batches which have been used in a Workflow are deleted after they have been processed (i.e. after the Workflow using those batches has been run). This option is intended to be used so that no batch is processed through a Workflow twice. It's used in conjunction with the Delete batches on completion setting in the Source step.
    • Allow auto-refresh - Not applicable to Snapshots.
    • Publish to ODBC - Make the Dataset visible through ODBC connections.

Using Snapshots

Snapshots can be used like other Datasets:

  • As a Workflow source - add a Source step to a Workflow and select the required Snapshot Dataset.
  • As a View source - choose the required Snapshot Dataset when creating a View.

Snapshot schema changes

You may want to update the schema if columns have been added/removed when writing data to a Snapshot.

Resolving Snapshot warnings

A summary of the unresolved changes will appear as a warning on the step. To fix this, you can change the input or resolve the conflicts. Click Update schema: {X} schema change(s) to open the Update Schema dialog (where {X} is the number of changes found) to see a consolidated list of existing, new and missing columns.

You can now:

  • Include or Exclude new columns in the Snapshot's schema. You can change this at any point later.
  • Set all missing mandatory columns as Optional. Optional columns will be retained and filled with empty values in the target Dataset.

Click Apply to update the schema. The step will become valid and you'll be able to view the step results/run the Workflow.

Record issues in your data so that they can be tracked, organized and resolved.

Find out how to use this step.

Manipulate and configure your data.

This step allows you to add/remove various Functions and arrange or exclude columns.

Clicking the Columns row on any Transform Workflow step will open the panel where you can select the columns you want to work with.

Moving columns

Moving columns allows you to change the order in which they appear in the data. This can be done by selecting the columns and pressing the up/down arrow button next to the columns list.

Columns can also be moved by dragging them up/down the list.

Renaming columns

Rename a column by selecting the column and clicking Rename on the toolbar. You will then be presented with a modal box prompting you to enter the name and an optional summary.

If multiple columns are renamed at the same time, the modal box will provide you with an opportunity to apply a common prefix or suffix to the originating column name. Alternatively, you will be able to rename all columns individually.

Excluding columns

Select the columns you want to exclude and click Exclude in the toolbar. This can also be done by hovering over the column name and clicking the inline icon.

To make columns visible again, select the columns and click the Include button in the toolbar.

Transforming and creating columns

A column can be transformed by applying one or more Functions to it. These Functions range from a simple action, such as converting a column to uppercase, to more complex multiple-column operations.
In either case, a transformation can create a new column or change the data in an existing one.

Suggestions

Before applying any custom transformations, you can use the Suggest option to analyze the data being explored and return some recommended transformations based on its content.

Any of these suggestions can be applied to the data by checking their selection boxes and clicking Apply.

The transformation suggestions can also be created as new columns rather than replacing the original data, by checking New column? in the table and specifying a column name.

Applying a Function

  1. Select the columns
  2. Decide to transform the data in the existing columns, or create new ones for the results
    • To change existing columns click Existing column on the toolbar
    • To create new columns click New column on the toolbar
  3. Choose an existing Function from the list (these are grouped into expandable categories)
  4. After choosing your Function you will be presented with options to configure it
  5. Click Apply transformation to apply your Function to the columns

Creating an advanced Function

  1. Select the columns
  2. Click Existing column or New column (see Applying a Function above for more info)
  3. Click Create your own function
  4. Create a Function with as many Function steps as you like
  5. Click Apply to apply your Function to the columns

If the column already has a transformation Function applied you will first be prompted to choose whether to 'Append to existing functions' or 'Replace existing functions'.

Duplicating columns

Duplicating columns can be done by selecting the columns and clicking Duplicate on the toolbar. The duplicated column will be placed after the original column and will be renamed with the prefix 'Copy of '.

Editing a transformation

  1. Select the column that is already transformed
  2. Click Edit on the toolbar
  3. Either edit the options in the configuration form or start over with a completely new Function

Removing a transformation

To remove the Functions applied to columns, select them and click Delete on the toolbar.

A created column can only be deleted entirely rather than just having its transformation removed, because created columns can't exist without a Function.

Combine two or more inputs into one output by presenting the values on top of each other. The source columns with matching headings will be combined.

The output – combined values from two or more sources – can then be used as input to any further step(s).

Click Show step results to view the results.

Use business rules to measure the quality, completeness and accuracy of your data.

A rule is a Function that returns a true or false value, which translates to a pass or fail. This can be either a default Function (like ‘Contains’ or ‘Is Date’) or a custom one.

Creating rules and groups

The easiest way to get started is to Explore your data, Profile the desired columns and then use the Suggest validation rules action. This will analyze the data in these columns, and suggest formulas to be applied to them as validation rules. Uncheck any suggestions that are not useful and apply to create a new Workflow containing a Validate step, with initial rules and rule groups defined and ready for further changes if required.

On the Validate step, click Rules to view and edit any existing rules and to Add rule or Add group.

Each rule has a name and summary, a parent rule group and a Function definition. The Function can either be a pre-existing Function in Data Studio, or a custom Function created in the Function builder. When building a custom Function, pick one of the relevant columns (or a parameter) as the input value and ensure the Function returns true/false as the result.

Rule groups represent logical collections of validation rules and can have different thresholds set. For example, a group of compliance rules may have a fail threshold of 99%, but a data quality group of rules may use a lower threshold and be increased over time as data and processes are improved.

The Status result column for each rule is based on these thresholds: Red below the fail threshold, Green at or above the pass threshold, and Amber in between.

Each group has a name and description, pass and fail threshold, plus an optional column to weight the results. As well as counting the number of Passed rows and Failed rows, the results will also include Passed weight and Failed weight columns, which contain a sum of the chosen column for all the passing/failing rows. For example, weighting each rule’s results by total sales allows you to prioritize data quality issues for the highest spending customers with the most overall impact.
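A sketch of how the thresholds and weighting column combine, based on the description above. The function shape, threshold values and tuple layout are illustrative assumptions, not the product's internals:

```python
def rule_summary(results, weights, pass_threshold=0.95, fail_threshold=0.90):
    """results: list of True/False rule outcomes, one per row;
    weights: the optional weighting column (e.g. total sales per row)."""
    passed = sum(results)
    failed = len(results) - passed
    # Weighted counterparts: sum the weighting column over pass/fail rows
    passed_weight = sum(w for r, w in zip(results, weights) if r)
    failed_weight = sum(w for r, w in zip(results, weights) if not r)
    rate = passed / len(results)
    if rate < fail_threshold:
        status = "Red"
    elif rate >= pass_threshold:
        status = "Green"
    else:
        status = "Amber"
    return passed, failed, passed_weight, failed_weight, status

# 8 of 10 rows pass, but the two failures belong to high-spending customers,
# so the failed weight dwarfs the passed weight
results = [True] * 8 + [False] * 2
weights = [100] * 8 + [5000, 7000]
print(rule_summary(results, weights))
# (8, 2, 800, 12000, 'Red')
```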

Ignoring values in validation rules

Rules can be set to ignore or skip values that do not need to be validated, so that they don't impact the rule pass rate. For example, if 10 emails are checked and 5 are valid, 2 invalid and 3 blank, the pass rate would be 5/7 (rather than 5/10 if all values were considered).

The Ignore Null values checkbox can be selected when first creating a rule using an existing Function. However, values other than null can also be ignored using the Ignore literal value when designing a custom Function.

Viewing results

Validation results are available in several formats:

  • Show passing rows contains the rows that pass all of the defined rules.
  • Show failing rows contains the rows that fail at least one validation rule.
  • Show all rows is a combination of the two above, comprising the entire input data alongside the rules and a Pass/Fail result for each.
  • Show results by rule summarizes/aggregates the results for each rule with a count of passes and fails.
    • Show results by rule for analysis is the same data, un-pivoted into rows.
  • Show results by group is similar to the above, but useful for a higher level view of aggregated results when there are many rules.
    • Show results by group for analysis is the same data, un-pivoted into rows.

Similar to the Source and Profile steps, the Validate step has a Source metadata dropdown that enables you to include the lineage metadata of input data. This information (such as file name or batch ID) is useful to include as additional columns in the aggregated results.

This step validates and enriches addresses in bulk.

Addresses will be cleaned by verifying them against the official postally-correct address files for the relevant country. Cleaned addresses are assigned a match result, based on the accuracy of the original address. You can define layouts specifying the number, content and format of the address output columns. Choose one of the available Additional datasets to enrich your data. The datasets that are available to you depend on your license.

Find out how to use, configure and troubleshoot this step.

Validate emails based on the format or domain address.

Select the Email column and pick one of the two available Validation type options:

  • Format Check: Checks whether the value matches a valid email format. Returns either true or false.
    Examples of valid and invalid email formats:

    Format                                             Result
    info@gmail.com                                     Valid
    first.second-name@gmail.com                        Valid
    first.name+tag@gmail.com                           Valid
    name@info@gmail.com                                Invalid
    name"not"right@test.com                            Invalid
    another.test.com                                   Invalid
    name@incorrect_domain.com                          Invalid
    com.domain@name                                    Invalid
    first_address@gmail.com, second_address@gmail.com  Invalid

    Only one email can be validated at once; lists of emails as seen in the last example will be rejected.

  • Domain Level: Checks whether the value has a domain that exists and is an email server. This option returns both an overall validity result (true or false) in the Email domain: Result column, and additional information in the Email domain: Error column describing the reason for failure. The possible outcomes are:

    Error                         Result  Description
    Valid                         True    Domain exists and is a mail server.
    Bad format                    False   Email format is invalid.
    Invalid domain                False   Domain validation check failed. The domain may not exist, or may have been flagged as illegitimate, disposable, harmful, nondeterministic or unverifiable.
    Invalid name                  False   Local part validation failed. For example, it may have been identified as a spam trap or role account such as "admin@server.com".
    Failed to perform DNS lookup  False   An error occurred when attempting to perform the DNS lookup.

Domain level validation results are cached with results refreshed every 30 days. The cache validity is configurable in Settings > Workflow steps by changing the Email validation cache validity setting.

Click Show step results to view the results.
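As a rough illustration of a format-only check, the sketch below uses a deliberately simple regular expression. It is far less thorough than a real email validator (and is not the product's implementation), but it classifies the format examples above correctly:

```python
import re

# Simplistic pattern: letters/digits/._+- in the local part, then a
# domain of letter/digit/hyphen labels containing at least one dot.
EMAIL_FORMAT = re.compile(r"^[A-Za-z0-9._+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")

def looks_like_email(value):
    return bool(EMAIL_FORMAT.match(value))

print(looks_like_email("first.name+tag@gmail.com"))   # True
print(looks_like_email("name@info@gmail.com"))        # False (two @ signs)
print(looks_like_email("name@incorrect_domain.com"))  # False (underscore in domain)
print(looks_like_email("com.domain@name"))            # False (no dot in domain)
```

Note that a comma-separated list of addresses also fails the pattern, matching the "only one email at once" behaviour described above.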

Validate global phone numbers.

Once connected to the data, you have to specify the following:

  • Select phone column - specify the column containing the phone numbers you want to validate
  • Select country - pick the country to use for phone validation. You can either select a single country or pick a column with country data. For the latter, ensure that the country values adhere to the ISO 3166-1 alpha-2, ISO 3166-1 alpha-3 or ISO 3166 country name standards.
  • Select validation type - choose the validation rule(s) that will be applied:
    • Valid phone: shows True for phone numbers that are valid for the selected country and False otherwise.
    • Valid phone region: shows True for phone numbers that are valid for the region of the selected country and False otherwise.
    • Possible phone: shows True for phone numbers that are possible for the selected country and False otherwise.
    • Invalid phone: shows True for phone numbers that are not valid for the selected country and False otherwise.
    • Invalid phone region: shows True for phone numbers that are not valid for the region of the selected country and False otherwise.
    • Not possible phone: shows True for phone numbers that are not possible for the selected country and False otherwise.

Click Show step results to view the results. The following columns will be appended to your data:

  • Validation results – shows the result of the validation rule (one column per applied rule).
  • Phone Country – shows the country for the validated phone number.
  • Phone Number Type – shows the phone type (e.g. mobile or fixed line).

Explore step results

After clicking Show step results at any point in the Workflow, Explore mode lets you act on the results. Each applied action will be listed in the Actions panel on the right. Clicking an action in this panel lets you go back and see the data as it was before that action was applied.

To make any of these explore actions permanent, click Add to Workflow. Each action will insert a new Workflow step into the current Workflow from whichever step was previously opened.

Available actions:

  • Filter - apply a basic filter, or select advanced to build a Function
  • Columns - exclude/include selected columns from View
  • Infer - infer data types for a column, such as treating blank values as nulls or a column of numeric values as date information
  • Relationships - analyze relationships between columns in your data. Find out more.
  • Chart - visualize the data (similar to the Chart step)

The remaining actions (Sample, Sort, Group, Transform and Profile) are the same as their equivalent Workflow steps.

Workflow parameters

Workflow parameters allow you to configure the settings of your Workflow steps without hard-coding values. You can create a Workflow parameter and assign it to any Workflow step setting that supports the parameter's datatype.

You can also set a Workflow parameter to be configurable at runtime, so that its value is supplied when the Workflow is executed. Any step the parameter is assigned to will use that value for that one execution.

To add Workflow parameters:

  1. Open an existing/create a new Workflow.
  2. Click Workflow parameters.
  3. Click Add parameter.
  4. Enter the Parameter name.
  5. Select a Datatype. Some Workflow steps only accept certain types of parameter.
  6. (Optional) Enter a Value. This will be used as the value for the Workflow steps it's assigned to. If the Workflow parameter value is empty, each Workflow step will use its own default value.
  7. (Optional) Enter a Summary. This will become the description of the Workflow parameter at runtime.
  8. (Optional) Tick Configurable on runtime to allow the Workflow parameter to be configured at runtime.
  9. (Optional) Tick Required to make configuring the Workflow parameter at runtime mandatory. This option will appear only if Configurable on runtime is ticked.
  10. Click Apply to save changes and start to use the Workflow parameter in the Workflow.

Sharing Workflows

By default, Workflows will only be available to users who can access the Space that the Workflow has been created in. You can share published Workflows with either specific or all Spaces.

Typically, Workflows are shared in order for them to be embedded into other Workflows.

To share a Workflow:

  1. Go to Workflows.
  2. Click Options > Sharing options.
  3. Select the required option:
  • Shared with some Spaces - select the required Space(s) from the list of available ones.
  • Shared with all Spaces - select this option to share the Workflow with all the available Spaces.
  4. Click Apply to save changes.

The Shared? value for the Workflow will be Global or Limited, depending on the option you've chosen.

If other users have shared a Workflow with a Space you have access to, you can include that Workflow in your current Space:

  1. Go to Workflows.
  2. Click Include from another Space. This will list all the Workflows shared with you.
  3. Select the required Workflow(s).
  4. Click Apply to save changes.

For each included Workflow, you can View properties, run/execute it, or Remove from Space.

Workflow reports

You can create reports in HTML format for valid and complete Workflows by clicking Report in the Workflow editor. The report contains a comprehensive and detailed summary of the Workflow, allowing you to examine the details in a portable format.

A Workflow report generally consists of three sections:

  1. Contents: The summary, version history, Workflow parameters and a high-level abstracted graphical view of the entire Workflow, highlighting the relationships between steps.
  2. Workflow steps: Detailed information (e.g. position, column summaries, options, expressions) of each Workflow step.
  3. Output column lineage and Function glossary: Details of lineage for all output columns in the Workflow as well as any Functions used.