Create a Workflow

A Workflow is a sequence of one or more steps that allows you to transform and manipulate your data.

Think of a Workflow step as an action you can perform on your data. Some steps are simple (e.g. sampling data) while others are quite complex (e.g. identifying duplicate records). The power of Workflows lies not only in the flexibility of combining any actions you require but also in their ease of use - simply drag, drop and connect them.

In addition to all the default steps, you can also create your own using the Java SDK (or contact us to have our consultants do it for you).

To create a Workflow:

  1. Go to Workflows.
  2. Click Create new Workflow.
  3. Enter a name.
  4. (Optional) Enter a summary and description, e.g. the purpose of the Workflow.
  5. (Optional) Give the Workflow an external label, which will allow it to be referenced later using the REST API, or in a Notification event.
  6. Click Apply. You're now in the Workflow editor, where you can create, modify, test and run Workflows.

After you create a new Workflow, you're taken into the Workflow editor in Edit mode. Since all Workflows require data, the Source step is automatically added.

Workflow editor options (edit mode):

  • Zoom the Workflow In/Out.
  • Generate and download the Workflow Report.
  • Run (execute) the Workflow.
  • Automatically Arrange steps.
  • Drag select to select Workflow steps: on / off.
  • Snap to automatically join Workflow step nodes together: on / off.
  • Select all to select all steps in a Workflow.
  • Delete selection to delete all selected Workflow steps.
  • Convert to View to save the actions in selected steps as a View. This action is only possible for some steps.
  • Workflow parameters to create, update or delete parameters in the Workflow.
  • Finish editing to save changes to the Workflow.

Workflow editor options (view mode):

  • Zoom the Workflow In/Out.
  • Generate and download the Workflow Report.
  • Run (execute) the Workflow.
  • Workflow parameters to view Workflow parameters in the Workflow.
  • Publish this version of the Workflow.
  • Enter Edit mode.
  • Close the Workflow editor.

Available default steps

Duplicate data to multiple outputs.

The rows from the source data will be duplicated (branched) into two outputs. To create more than two duplicates of the data, you can link several Branch steps.

Visualize your data as a bar, pie, or gauge chart. Export as a PDF or an image file (.JPG, .PNG, .SVG). Find out about the Chart page.

The export step allows you to export your Workflow's output data to a file or an External System that uses a JDBC connection.

Each Workflow can contain multiple export steps, so you can export data at any point. Each export step can be configured individually.

Click Export Settings in the step dialog to specify where and how the data will be exported.

Exporting to a file

  • Specify the Exporter (.csv, .txt, .xls, .xlsx or .json).
  • Choose the Target System - the export location. The default Server export directory will export files into the appropriate Space's export directory on the server. You can also export to an external file system (Apache Hadoop HDFS, Amazon S3, Microsoft Azure Blob Storage, or SFTP server) if one is configured.
  • Enter the name of the file or assign a Workflow parameter for Filename.
    • The Filename is made up of several components. By default, it will have two components: one for the name of the file and a second one for a timestamp.
    • The timestamp is the date and time at the point that the Workflow is executed.
    • You can add new components by clicking on Add filename component or Insert.
    • Components can be deleted by clicking on Delete.
    • Each component can be either custom text, or a timestamp in one of the available datetime formats.
    • An example of the resulting filename is shown as you build up the filename components.
  • Enable Overwrite existing to allow a file with the same name to be exported more than once (the latest copy will replace the previous one). If this option is disabled, you will get an error when trying to export to the same file name more than once.
  • If Include column names is checked, the first row in the exported file will contain column names. This doesn't apply to .json files.
  • Column name style determines how the column names will be displayed in the export file:
    • None - the column name as it appears in Data Studio.
    • Database style - the column name will be in upper case and underscores will be used to separate words.
    • Lower case - the column name will be in lower case.
    • Upper case - the column name will be in upper case.
    • Humanized - the first word will be capitalized and underscores turned into spaces.
  • Specify the Character set (most sets are supported).
Custom settings for CSV files

You can specify a delimiter and filename extension for CSV files (created in an Export step or downloaded from a grid of data):

  1. Navigate to the installation folder (by default C:\Program Files\Experian\Aperture Data Studio {version}) and open the server.properties file.

  2. On a new line in the file, add the setting you want to modify:

    The available settings are:

    • Data.csvDelimiter - default value: , (comma). Any string; special keywords can be used (see below).
    • Data.csvFilenameExtension - default value: csv. All non-alphanumeric characters will be removed.

    The special delimiter keywords are:

    • BAR - ¦
    • COLON - :
    • COMMA - ,
    • NONE - \0
    • PIPE - |
    • SEMICOLON - ;
    • SPACE - ' '
    • TAB - \t

    For example, to change the delimiter to tab, the entry in server.properties should be:

    Data.csvDelimiter=TAB
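
    You can change both settings at the same time. For example (an illustrative combination, not a requirement), to produce pipe-delimited files with a .txt extension, add:

    Data.csvDelimiter=PIPE
    Data.csvFilenameExtension=txt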
    

Exporting to an External System using JDBC

  • Select JDBC as the Exporter.
  • Choose a DBMS from the list of available systems. This is a list of the External System connections configured in Data Studio that use a JDBC driver.
  • Choose a Credential for authenticating with the DBMS. These credentials will have been created alongside the External System definition.
  • Select the target table from the list in Table name. These are the tables that you can access in the target DBMS using the selected credential.
  • Mode determines how the data will be exported into the target system:
    • Insert: New rows are inserted into the table. The equivalent SQL statement is:
      "INSERT INTO tablename (column1, …) VALUES (value1, …)"
    • Update: Existing rows are updated. The mapping dialog will show which source columns are mapped to which target columns and which ones have been specified as keys. You can change the mapping and keys at this stage. The keys referred to are the columns that will appear in the equivalent SQL WHERE clause:
      "UPDATE tablename SET (column1=value1, …) WHERE (columnName=value, … )"
    • Delete: Rows are deleted. The keys are the column names used in the SQL WHERE clause: "DELETE FROM tablename WHERE columnName=value, … "
    • Insert or update (if insert fails): If a row insert fails (there is a key collision), then the statement will be retried as an update.
    • Update or insert (if update fails): If the update fails (no record found with the given key), then the statement will be retried as an insert.
  • For Insert mode, you have the option to Truncate table before carrying out the row inserts.
  • For all modes other than Insert, you will need to select one or more Key columns from those in your Workflow data. These columns need to match the key fields in your selected target table.
  • Select Atomic database update if the export should be performed in a single transaction. In this mode the database may lock table(s) for the duration of the update, which can be a considerable time for large exports.
  • Commit batch size: Commit a batch after the specified number of records have been sent to the database (see the sketch after this list). This setting is only available when Atomic database update is unset. Note that the final batch usually has fewer records.
  • Pre SQL: Optionally enter a SQL statement that will be executed before the export. The SQL may contain any valid statements and will be executed on the target database.
  • Post SQL: Optionally enter a SQL statement, similar to Pre SQL but which will be executed after the export has completed.
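
To make the batching behaviour concrete, here is a minimal JDBC sketch of what an Insert-mode export with a Commit batch size behaves like. It is illustrative only - the table name, columns and connection details are hypothetical, and it is not Data Studio's actual implementation.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class BatchInsertSketch {
        public static void main(String[] args) throws SQLException {
            // Placeholder connection details - replace with a real JDBC URL and credentials.
            try (Connection conn = DriverManager.getConnection("jdbc:postgresql://host/db", "user", "pass")) {
                conn.setAutoCommit(false); // commit manually, one batch at a time
                String sql = "INSERT INTO customers (customer_id, name) VALUES (?, ?)";
                int batchSize = 500;       // plays the role of the step's Commit batch size
                int pending = 0;
                try (PreparedStatement ps = conn.prepareStatement(sql)) {
                    for (int id = 1; id <= 10_000; id++) {
                        ps.setInt(1, id);
                        ps.setString(2, "Customer " + id);
                        ps.addBatch();
                        if (++pending == batchSize) {
                            ps.executeBatch();
                            conn.commit(); // each committed batch remains in the database even if a later batch fails
                            pending = 0;
                        }
                    }
                    if (pending > 0) {     // the final batch usually has fewer records
                        ps.executeBatch();
                        conn.commit();
                    }
                }
            }
        }
    }

With Atomic database update selected, the equivalent behaviour would be a single commit at the end of the export rather than one commit per batch.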

Filter your data.

You can use a variety of Functions to do this.

The Find duplicates step uses powerful standardization and matching algorithms to group together records containing similar contact data (e.g. name, address, email, phone) and keep that information within a duplicate store. Each group of records, known as a cluster, is assigned a unique cluster ID and a match level. The step provides out-of-the-box functionality for the United Kingdom, Australia, and the United States, but is also completely configurable down to the most granular name and contact elements.

Find out how to use, configure and troubleshoot this step.

Fires a user-defined event from the Workflow.

You can kick off custom events at any point in your Workflow.

Tick the Only fire if input has rows checkbox for the step to be kicked off only when the input has data.

Select the required Custom event then specify values or assign a Workflow parameter for the available event data items (the event data name will be displayed).

For example, say you want to fire an event when the Validate step has failures. Select the 'Validation failed' custom event and specify the reason for the failure in the 'Reason' event data item. This reason will then be available as an event data item when creating or editing notifications.

Another example is using a Filter step to identify records of particular interest. The Fire event step can then follow the Filter step but only fire when the Filter step produces records.

Group and aggregate column values.

Grouping allows you to group values from one input into one or more columns and aggregate the remaining columns, if required.

Click Grouping in the step dialog then drag and drop the required columns up, into the Grouped list. Remove columns from this list by simply dragging them down. Click Apply when done.

You can also configure grouped columns and use aggregate functions on the grouped values in the Show step results view: right-click on the grouped column heading and select Add aggregate. In the dialog that appears, specify the name/description for the new column, the source column to be used for aggregation and the required function.
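
As an illustration of the concept (plain Java, not part of the product; requires Java 16+ for records), grouping by one column and aggregating another with an average looks like this:

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class GroupingSketch {
        // A row with a grouping column ("region") and a column to aggregate ("amount").
        record Sale(String region, double amount) {}

        public static void main(String[] args) {
            List<Sale> rows = List.of(
                    new Sale("North", 100.0),
                    new Sale("North", 50.0),
                    new Sale("South", 75.0));

            // Group by "region" and aggregate the remaining column with an average,
            // analogous to adding an Average aggregate on a grouped column.
            Map<String, Double> averageByRegion = rows.stream()
                    .collect(Collectors.groupingBy(Sale::region,
                            Collectors.averagingDouble(Sale::amount)));

            System.out.println(averageByRegion); // e.g. {South=75.0, North=75.0} (map order not guaranteed)
        }
    }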

Harmonization is the process of deduplication - taking multiple records that are deemed to be for the same underlying real-world subject and producing a single resulting record.

Find out how to use this step.

Join two inputs into one based on one or more columns from each input.

To specify the columns you want to join, click Left side columns or Right side columns.

Use the Retention type menus to specify different combinations of join. The top menu is for the first join table and the bottom one for the second join table.

  • All matches: Performs a standard join.
  • Singletons only: Returns the first match of each one-to-many match and any matches that occur only once. This is a normal relational join with duplicates removed.
  • Duplicates only: Returns only the matches that are one-to-many. This is a normal relational join excluding matches that occur only once.

Click on the Venn diagram to specify the type of join you want to apply:

  • Inner (yellow segment of the Venn) - returns all rows where the join condition is met (keys match)
  • Left outer (yellow and green segments) - returns all rows from the left-hand table and only the rows from the right-hand table where the join condition is met (keys match)
  • Left outer without intersection (green segment) - a variant on the basic left outer join: returns only rows from the left-hand table that don't meet the join condition. No rows from the right-hand table are returned.
  • Right outer (yellow and blue segments) - returns all rows from the right-hand table and only the rows from the left-hand table where the join condition is met (keys match)
  • Right outer without intersection (blue segment only) - a variant on the basic right outer join: returns only rows from the right-hand table that don't meet the join condition. No rows from the left-hand table are returned.
  • Full outer (all segments of the Venn) - returns all rows in both the left and right tables. Any time a row has no match in the other table, the columns from the other table contain null values.
  • Full outer without intersection (green and blue segments) - returns all rows in both the left and right tables, excluding rows where the join condition is met.
  • Cross join or Cartesian product (no segments of the Venn) - returns all rows from the left table. Each row from the left table is combined with all rows from the right table.

Above the Venn diagram select More options to expose more detailed options for the join.

The Join step provides three matching options which determine how keys will be compared:

  • Exact match: rows will be joined if key values on the left-hand side and right-hand side are exactly the same, including data type and casing.
  • Ignore datatype: rows will be joined if keys can be standardized to the same value. A key with the alphanumeric value '1' will be joined with a key that is the integer value 1.
  • Case-insensitive: rows will be joined if keys match with casing ignored. A key with value 'ABC' will be joined to a key with value 'abc'.

The Do not match nulls option configures how null values are handled in the join. When enabled, left or right column values containing nulls will never match any other row. This means that rows where the join key contains nulls will never appear in inner join results, and will always appear in left/right unmatched results if the selected join type includes unmatched rows.
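
As an illustration of how these options interact, here is a small, hypothetical hash-join sketch in Java (not Data Studio's implementation) showing a case-insensitive key comparison and the effect of Do not match nulls:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Locale;
    import java.util.Map;

    public class JoinMatchingSketch {
        // Normalize a key for a case-insensitive comparison; an Exact match would return the key unchanged.
        static String normalize(String key) {
            return key == null ? null : key.toLowerCase(Locale.ROOT);
        }

        public static void main(String[] args) {
            List<String[]> left = List.of(new String[]{"ABC", "left-1"}, new String[]{null, "left-2"});
            List<String[]> right = List.of(new String[]{"abc", "right-1"}, new String[]{null, "right-2"});

            boolean doNotMatchNulls = true; // when enabled, null keys never match any other row

            // Index the right-hand rows by normalized key.
            Map<String, List<String[]>> rightByKey = new HashMap<>();
            for (String[] row : right) {
                String key = normalize(row[0]);
                if (key == null && doNotMatchNulls) continue;
                rightByKey.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
            }

            // Probe with the left-hand rows: matched pairs form the inner join result.
            for (String[] row : left) {
                String key = normalize(row[0]);
                List<String[]> matches = (key == null && doNotMatchNulls)
                        ? List.of()
                        : rightByKey.getOrDefault(key, List.of());
                if (matches.isEmpty()) {
                    System.out.println(row[1] + " -> unmatched (appears only in outer join results)");
                } else {
                    matches.forEach(m -> System.out.println(row[1] + " joins " + m[1]));
                }
            }
        }
    }

Here 'ABC' joins 'abc' because of the case-insensitive comparison, while the rows with null keys remain unmatched.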

In the join dialog, click Show step results to view the results of the join. Specify which values from the joined columns to use in any following steps: all the matches, only duplicates or only singletons.

Suggested joins are also available: only exact matches from the two sources will be suggested. Click on a suggested join to apply it.

Allows you to match values from a column in the first input (source) to a column in the second input (the lookup table / lookup input). Values are returned from a second column from the lookup table after an operation is applied to the set of values for the matched rows.

Step configuration

You can use any source in a Workflow as the input for this step: a Dataset, View, or the output from another step. You can even use the same output node for both the source and the lookup input (if you want to do a self-lookup).

To add multiple lookup tables, click Add additional lookup table.

Lookup definition

Click Definitions to open the configuration panel, then Add definitions to define a new lookup. Multiple lookup definitions can be created in a single Lookup step. A lookup definition consists of the following options:

Name

The name of the resulting new column from the lookup definition.

Lookup type

A lookup type is the operation applied to the value (or values, if more than one match exists in the lookup table) returned by the Lookup. These types/operations are applied to the set of values for the return column where the lookup column matches the lookup value.

  • Count - Returns a count of the number of matches that the lookup value has in the lookup column.
  • Count excluding nulls - Returns a count of the number of matches that the lookup value has in the lookup column, ignoring nulls.
  • Exists - Returns true if the lookup value has a match in the lookup column, false if not.
  • Not exists - Returns false if the lookup value has a match in the lookup column, true if not.
  • First value - Returns the first value (based on input row order) from the return column where the lookup value matches the lookup column.
  • Last value - Returns the last value (based on input row order) from the return column where the lookup value matches the lookup column.
  • Minimum value - Returns the minimum value (based on a sort of the values) from the return column where the lookup value matches the lookup column.
  • Maximum value - Returns the maximum value (based on a sort of the values) from the return column where the lookup value matches the lookup column.
  • Sum - Returns the sum of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be counted as '0'.
  • Average - Returns the average (mean) of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values is all non-numeric.
  • Median - Returns the median (based on a sort of the values) of the values in the return column where the lookup value matches the lookup column.
  • Variance - Returns the statistical variance of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values is all non-numeric.
  • Variance (population) - As Variance, but where it is known that the values represent the whole population.
  • Standard deviation - Returns the standard deviation of the values in the return column where the lookup value matches the lookup column. Non-numeric values will be excluded from the calculation. Returns '0' by default if the set of return column values is all non-numeric.
  • Standard deviation (population) - As Standard deviation, but where it is known that the values represent the whole population.
  • All values - Returns a comma-separated list of all values in the return column where the lookup value matches the lookup column. The order is as they appear in the input.
  • Reverse values - Returns a comma-separated list of all values in the return column where the lookup value matches the lookup column. The order is a reverse sort of the values.
  • Sorted values - Returns a comma-separated list of all values in the return column where the lookup value matches the lookup column. The order is a sort of the values.

Return column

The column from the lookup table that you want the above operations to act on. This option is not present if you have selected the Exists or Not Exists lookup types, as no value from the lookup table is returned.

Match type

This is where you define how you want to match values between your lookup value and lookup column. The selected match type will apply to both sides.

  • Exact - will only match if the value and datatype match exactly.
  • Ignore datatype - will match if the values match, ignoring the datatype (e.g. alphanumeric '1' will match numeric value 1).
  • Case-insensitive - will match if the values match, ignoring the casing (e.g. 'experian' will match 'experIAN').
  • Denoise - will match if the values match after special characters, whitespace etc. are removed (e.g. 'Experian' will match 'Ex'per ian').
  • Format - will match if the formats of the two values are the same (e.g. '123abc' will match '456def').
  • Soundex - will match if the values match after going through the Soundex phonetic algorithm.
  • Metaphone - will match if the values match after going through the Metaphone phonetic algorithm.
  • Double metaphone - will match if the values match after going through the Double Metaphone phonetic algorithm.

Distinct checkbox

If selected, each value from the return column will only be used once per lookup value match group. Only present for lookup types which have the potential to operate across multiple values, for example Average and All values.

Default value

The value that is returned if no match for the lookup value is found. This field can be left blank in which case a null value is returned. Some lookup types will never use this (e.g. Exists).

Lookup column

The column from the lookup table that contains the values you want to match on.

Lookup value

The column from the source that contains the values you want to match on.

Add more conditions button

Allows you to define additional pairs of lookup columns (additional match criteria) to further narrow matches.

Example

A source Dataset contains customer information alongside a column with a "discount code":

Customer ID Code
1 A
2 A
3 B
4 C
5 A
6 9

Another dataset contains the "discount code" value alongside the rate that code represents. This is the data that we will look up into:

Discount Code Rate
A 5
B 10
C 15

We want to define a lookup that will return the appropriate rate for each discount code. The rate will be added as a new column in the customer (source) Dataset.

To do this, in a Workflow add a Lookup step with the customer dataset as the source input and the discount code dataset as the lookup input.

The lookup definition is set up as follows:

  • Name: Discount Rate
  • Lookup type: First value
  • Lookup table: Rows for discount code
  • Return column: Rate
  • Match type: Exact
  • Default value: "NO MATCH"
  • Lookup column: Code
  • Lookup value: Discount Code

Result:

Customer ID Code Discount Rate
1 A 5
2 A 5
3 B 10
4 C 15
5 A 5
6 9 NO MATCH
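
As an illustrative sketch only (plain Java, not how Data Studio evaluates the step), a First value lookup with an Exact match and a default value behaves like a simple map lookup over the tables above:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class LookupSketch {
        public static void main(String[] args) {
            // Lookup table from the example: Code -> Rate.
            Map<String, Integer> rateByCode = new LinkedHashMap<>();
            rateByCode.put("A", 5);
            rateByCode.put("B", 10);
            rateByCode.put("C", 15);

            // Source rows from the example: Customer ID -> Code.
            Map<Integer, String> codeByCustomer = new LinkedHashMap<>();
            codeByCustomer.put(1, "A");
            codeByCustomer.put(2, "A");
            codeByCustomer.put(3, "B");
            codeByCustomer.put(4, "C");
            codeByCustomer.put(5, "A");
            codeByCustomer.put(6, "9");

            // First value lookup with Exact matching and a default of "NO MATCH".
            // (Here each code appears once in the lookup table; if several rows matched,
            // First value would take the first one in input row order.)
            for (Map.Entry<Integer, String> row : codeByCustomer.entrySet()) {
                Object discountRate = rateByCode.containsKey(row.getValue())
                        ? rateByCode.get(row.getValue())
                        : "NO MATCH";
                System.out.println(row.getKey() + "\t" + row.getValue() + "\t" + discountRate);
            }
        }
    }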

Lookup or Join?

Lookups and Joins can both achieve many of the same things when combining two data sources. However, their performance is optimized for different scenarios, so for large data volumes it is important to understand these differences before choosing which one to use in your Workflow.

In general, we recommend lookups for use cases where the lookup table is small (e.g. a domain file of a few thousand, typically unique, values) or when you need to perform an operation on the results (e.g. calculate an average of the matched values). Joins are better suited to large-to-large file operations.

More specifically, for high-cardinality columns (e.g. truly unique IDs), lookups can take at least twice as long as an equivalent join. For low-cardinality (i.e. low uniqueness), high row count keys, lookup indexes will build quickly compared to the corresponding join index.

The Map to target step is used to map columns in your data to a target schema definition. The target schema must be a Dataset that is already loaded into Data Studio. Map to target is often used in combination with the Export step to populate a target database schema. The step can also be used to rename multiple columns easily.

Auto-map functionality will attempt to map columns automatically where possible, based on column names in the source (input columns) and the target (output columns). Column mapping can also be done manually.

If an input column is not selected in the mapping for a given target column, the values in the target column will be null.

If Include in output is unchecked for a target column, that column is not included in the step's output. Use this setting when mapping to a schema that includes auto-generated columns in the target database.

When a re-usable Workflow uses the Output step, the step's data will be available in other Workflows.

Click Show step results to view the results from the previous step connected to the Output step.

Profile your data at any point in a Workflow.

We examine each value and identify various attributes (such as type, min/max values, count) and determine its uniqueness and completeness.

Click on Profiled columns to select the column(s) you want to profile. In the Profiled columns list on the left, select the required ones and click Apply. To see the results, click Show step results in the step dialog.

You can now export or take a snapshot of your profiled data to track data uniqueness over time.

Lineage metadata

To further improve data quality, Data Studio supports the capturing of lineage metadata of data sources. Lineage metadata can be included in workflow outputs and subsequently be used for further integration or processing.

In the Profile step, lineage metadata can be selected to be included in the output using the dropdown chooser provided under "More Options".

View up to three inputs side by side (horizontally or vertically). Nest these steps for more flexibility over layout.

Click on Vertical split to change it to horizontal.

To make a report available to Data Consumer users, click Expose to Data Consumer users.

To view the report, click Show report.

Sample and limit your data.

This step allows you to create a sample of the data the step is connected to. You can specify the following sampling settings:

  • Start at row - the row number from which to start loading
  • How often to sample - specify the frequency of rows to sample
  • Sample randomly - tick to create a random sample
  • Limit rows - tick to specify a Row limit that will be used for the sample

Click Show step results to view the results.

Sort your data in ascending/descending order.

Click Sorting in the step, select the required columns and click Apply to save changes.

To change the sort order, click on the column in the Sorted list.

Click Show step results to view the results.

Your data source, always the first step in a Workflow.

Select the required Dataset or View to start manipulating and transforming your data.

Enable or disable the Delete batches on completion setting to automatically remove processed batches from a Dataset.

Enable or disable the Allow auto-refresh setting to automatically load a new batch of data into the selected Dataset, or the Dataset used by the selected View, when the Workflow is executed.

Replaceable sources

This feature allows a Workflow to be run using a different data source. You have two options:

  1. Dataset
  2. File upload

The default Dataset option allows you to specify a different Dataset for this Workflow execution.

The File upload option will only appear when the Workflow configuration is suitable. This option allows you to use a different, temporary file for this Workflow execution. Note that the columns of this file have to be identical to the target Dataset and will be automatically mapped. Currently, only Datasets sourced from files (and not external systems) are supported.

Enable Can supply source when executed to make the data source replaceable at Workflow execution, allowing a different source to be supplied when the Workflow is run.

Enable Must be supplied to make it mandatory for a new data source to be assigned at Workflow execution.

Source batch sampling

This feature allows the Source step to output a sample set of batches instead of all of them.

Take last

This option will allow you to output one of the following:

  • all batches which were created within a given timeframe from the current time.
  • the last given number of batches.
  • all batches.

Sample every

This option will only output the last batch created within each interval of the unit of time given.

For example, if three batches are created with the given timestamps:

  • Batch 1: 2020-01-01 9:00
  • Batch 2: 2020-01-01 11:00
  • Batch 3: 2020-01-01 11:59

This would be the output depending on the unit of time selected:

Day
  • Batch 3: 2020-01-01 11:59
Hour
  • Batch 1: 2020-01-01 9:00
  • Batch 3: 2020-01-01 11:59
Minute
  • Batch 1: 2020-01-01 9:00
  • Batch 2: 2020-01-01 11:00
  • Batch 3: 2020-01-01 11:59

Lineage metadata

To further improve data quality, Data Studio supports the capturing of lineage metadata of data sources. Lineage metadata can be included in workflow outputs and subsequently be used for further integration or processing.

In the Source step, lineage metadata can be selected to be included in the output using the dropdown chooser provided. This chooser will only appear under "More Options" after a dataset has been selected. The lineage metadata available depends on the source type of the step.

Unique row IDs

Each row in a Dataset has a row ID, which is guaranteed to be unique across all Datasets. These IDs can optionally be included in the output of the Source step. The IDs are generated in ascending order over time, so it's possible to use them to sort rows of data by the time they were first loaded into Data Studio.

Combine two inputs into one by presenting the values next to each other.

Data in the columns is not combined or merged - duplicate column names are renamed with an appended numeric value. The output – combined values from two sources – can then be used as input to any further step(s).

Click Show step results to view the results.

Apply a true/false filter to one input, splitting it into passing and failing rows.

To create a filter, click Filter in the step dialog and Create in the left-hand menu. Specify the necessary filters and click Apply to save the changes.

Click Show passing rows or Show failing rows to view the values that passed or failed the filter. The passing and failing results can be individually linked to any further step(s).

Snapshots are copies of your data, allowing you to track data changes over time or store your results in a Dataset for use elsewhere. Snapshots are Datasets that are created by and stored in Data Studio.

The Take snapshot step creates a new batch of data in a Dataset when the Workflow containing the step is executed. The new batch of data created by the step can either be added to an existing Dataset, or a new Dataset can be created to hold the batch. The dataset created by the step can be used like any other dataset - to be explored on its own or used as a source in a workflow or view.

Taking Snapshots

Add the Take snapshot step to your workflow and connect its input to the data you'd like to capture in a snapshot. The step creates a new batch of data in its associated dataset from the data connected to its input when the workflow containing the step is executed.

To use the step you will need to select the target Dataset into which the data will be written. You can either configure the snapshot to use an existing dataset or create a new one:

  • Select an existing dataset from the Dataset dropdown list.
  • Use Create new Dataset to create a new dataset.
    • Name - the name of the dataset as it will appear in Data Studio.
    • Summary - a short summary of the dataset.
    • Description - a longer description of the dataset.
    • Dataset type - Single batch will write the latest data to the dataset and keep no history. Multi batch will retain older data in the dataset allowing for trend analysis.
    • Interactivity - This option will only be available if you are making a copy of Results by rule or Results by group from the Validate step. Selecting Interactive (with drill down) will result in a dataset that allows users to drill into the underlying data and view the passing or failing rows for the rules or rule groups in the dataset.
    • Add Batch Timestamp Column - When this option is selected, an additional column is added to the dataset recording the timestamp when the batch was created. This option is useful for trend analysis.
    • Allow automatic batch deletion - This option will ensure the data batches which have been used in a workflow are deleted after they have been processed (i.e. after the workflow using those batches has been executed). This is intended to be used so that no batch is processed through a workflow twice. It is used in conjunction with the Delete batches on completion setting in a workflow's source step.
    • Allow auto-refresh - Not applicable to snapshots.
    • Publish to ODBC - When enabled this dataset will be visible through ODBC connections.

Using Snapshots

Snapshots can be used like other Datasets.

  • As a Workflow source - Add a Source step to a Workflow and select the snapshot Dataset in order to use that Dataset in your Workflow.
  • As a View source - When creating a view choose the snapshot Dataset in order to use it as the basis of a View.

Updating the schema

Update Schema - This option will appear when the step's input schema does not match the selected target Dataset's schema. Once the snapshot step has been configured, if its input data's schema changes it will display a warning icon. You will have to click on the Update Schema option to review and triage the changes in the Update Schema dialog before they take effect.

  • Columns added - if columns are added to the input schema, the snapshot step will display a warning, and using the Update Schema dialog you can include one or more of the new columns in the target dataset. Until this is done the snapshot step will still execute, but will only write the original columns' data to the dataset.
  • Columns removed (and optionally added) - if columns have been removed from the input schema, the snapshot step will display a warning and become invalid, causing workflow execution to fail. To resolve this, either change the input or open the Update Schema dialog, which displays the differences and allows you to mark the missing columns as optional. Optional columns are not removed from the target dataset; they are retained and filled with empty values.

Manipulate and configure your data.

This step allows you to add/remove various Functions and arrange/show and hide columns.

Clicking the Columns row on any Transform Workflow step will open the panel where you can select the columns you want to work with.

Moving columns

Moving columns allows you to change the order in which they appear in the data. This can be done by selecting the columns and pressing the up/down arrow button next to the columns list.

Columns can also be moved by dragging them up/down the list.

Renaming columns

Rename a column by selecting the column and pressing the Rename button in the toolbar. You will then be presented with a modal box prompting you to enter the name and an optional summary.

If multiple columns are renamed at the same time, the modal box will provide you with an opportunity to apply a common prefix or suffix to the originating column name. Alternatively, you will be able to rename all columns individually.

Hiding columns

Select the columns that need hiding and press the Hide button in the toolbar. This can also be done by hovering over the column name and pressing the inline icon.

To make columns visible again, select the columns and press the Show button in the toolbar.

Transforming and creating columns

A column can be transformed by applying one or many Functions to it. These Functions range from a simple action, such as converting a column to uppercase, to more complex multiple column operations.
Whatever it may be, transforming a column can be done to create a new column or to change the data in an existing one.

Applying a Function

  1. Select the columns
  2. Decide to transform the data in the existing columns, or create new ones for the results
    • To change existing columns click the Existing column button from the toolbar
    • To create new columns click the New column button in the toolbar
  3. Choose an existing Function from the list (these are grouped into expandable categories)
  4. After choosing your Function you will be presented with options to configure it
  5. Click the Apply transformation button to apply your Function to the columns

Creating an advanced Function

  1. Select the columns
  2. Click Existing column or New column (see Applying a Function above for more info)
  3. Click the Create your own function link
  4. Create a Function with as many Function steps as you like
  5. Click the Apply button to apply your Function to the columns

If the column already has a transformation Function applied you will first be prompted to choose whether to 'Append to existing functions' or 'Replace existing functions'.

Duplicating columns

Duplicating columns can be done by selecting the columns and pressing the Duplicate button in the toolbar. The duplicated column will be placed after the original column and will be renamed with the prefix 'Copy of '.

Editing a transformation

  1. Select the column that is already transformed
  2. Click the Edit transformation button from the toolbar
  3. Either edit the options in the configuration form or start over with a completely new Function

Removing a transformation

To remove the Functions applied to columns, select them and click the Remove transformation button from the toolbar.

You can only remove the transformation from input columns, because created columns can't exist without a Function.

Combine two inputs into one output by presenting the values on top of each other. The source columns with matching headings will be combined.

The output – combined values from two sources – can then be used as input to any further step(s).

Click Show step results to view the results.

Assess the quality of your data by defining rules.

To perform validation, you have to set validation rule(s). In the step dialog, click Rules to either:

  • select from the list of available rules or
  • create a new rule and apply it

If you're creating a new rule, click Add group first. Enter the required details, including the pass and fail values, and click Add.

To view validation results, click on the required option:

  • Show passing rows
  • Show failing rows
  • Show all rows
  • Results by rule
  • Results by rule for analysis
  • Results by group
  • Results by group for analysis

If you are viewing Results by rule or Results by group, you can view the passing or failing rows for that rule/rule group using the right-click options or the toolbar buttons.

More options

  • Show group name in rule results - can be disabled if the group name is not required, e.g. when all rules belong to a single group.
  • Show rule results in failing rows - can be disabled if failure information per rule is not required in the final output.

Lineage metadata

To further improve data quality, Data Studio supports the capturing of lineage metadata of data sources. Lineage metadata can be included in workflow outputs and subsequently be used for further integration or processing.

In the Validate step, lineage metadata can be selected to be included in the output using the dropdown chooser provided under "More Options". The metadata is included in two outputs for the Validate step:

  • Show results by rule
  • Show results by group

Interactive Datasets

The results of the validation can be saved to a Dataset by including a Take snapshot step in your workflow. Furthermore, if the Interactive (with drill down) option is selected when creating a new Dataset on the Take snapshot step, you can drill into the data to view the passing or failing rows for the rules/rule groups in the Dataset. Interactivity is supported for the Results by rule and Results by group outputs.

This step validates and enriches addresses in bulk.

Addresses will be cleaned by verifying them against the official postally-correct address files for the relevant country. Cleaned addresses are assigned a match result, based on the accuracy of the original address. You can define layouts specifying the number, content and format of the address output columns. Choose one of the available Additional datasets to enrich your data. The datasets that are available to you depend on your license.

Find out how to use, configure and troubleshoot this step.

Validate emails based on the format or domain address.

Select the Email column and pick one of the two available Validation type options:

  • Format Check: Checks whether the value matches a valid email format. Returns either true or false.
    Examples of valid and invalid email formats:

    • info@gmail.com - Valid
    • first.second-name@gmail.com - Valid
    • first.name+tag@gmail.com - Valid
    • name@info@gmail.com - Invalid
    • name"not"right@test.com - Invalid
    • another.test.com - Invalid
    • name@incorrect_domain.com - Invalid
    • com.domain@name - Invalid
    • first_address@gmail.com, second_address@gmail.com - Invalid

    Only one email address can be validated at a time; lists of addresses, as in the last example, will be rejected. (A simplified format-check sketch appears at the end of this section.)

  • Domain Level: Checks whether the value has a domain that exists and is an email server. This option returns both an overall validity result (true or false) in the Email domain: Result column, and additional information in the Email domain: Error column describing the reason for failure. The possible outcomes are:

    • Valid (Result: true) - Domain exists and is a mail server.
    • Bad format (Result: false) - Email format is invalid.
    • Invalid domain (Result: false) - Domain validation check failed. The domain may not exist, or may have been flagged as illegitimate, disposable, harmful, nondeterministic or unverifiable.
    • Invalid name (Result: false) - Local part validation failed. For example, it may have been identified as a spam trap or role account such as "admin@server.com".
    • Failed to perform DNS lookup (Result: false) - An error occurred when attempting to perform the DNS lookup.

Domain level validation results are cached with results refreshed every 30 days. The cache validity is configurable in Settings > Workflow steps by changing the Email validation cache validity setting.

Click Show step results to view the results.
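
The exact rules the Format Check applies are built into the step, but as a rough illustration, a simplified pattern like the one in this hypothetical Java sketch reproduces the example outcomes listed above (it is not the product's actual check):

    import java.util.List;
    import java.util.regex.Pattern;

    public class EmailFormatSketch {
        // Simplified, illustrative pattern: one local part, a single '@', then a dotted domain
        // with no underscores or extra '@' characters. Real email validation is more involved.
        private static final Pattern SIMPLE_EMAIL =
                Pattern.compile("^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*\\.[A-Za-z]{2,}$");

        public static void main(String[] args) {
            List<String> samples = List.of(
                    "info@gmail.com",             // Valid
                    "name@incorrect_domain.com",  // Invalid: underscore in the domain
                    "another.test.com");          // Invalid: no '@'
            for (String s : samples) {
                System.out.println(s + " -> " + (SIMPLE_EMAIL.matcher(s).matches() ? "Valid" : "Invalid"));
            }
        }
    }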

Validate global phone numbers.

Once connected to the data, you have to specify the following:

  • Select phone column - specify the column containing the phone numbers you want to validate
  • Select country - pick the country to use for phone validation. You can either select a single country or pick a column with country data. For the latter, ensure that the country names adhere to the ISO 3166-1 alpha-2, ISO 3166-1 alpha-3 or ISO 3166 country name standards.
  • Select validation type - choose the validation rule(s) that will be applied (a related code sketch appears at the end of this section):
    • Valid phone: shows True for phone numbers that are valid for the selected country and False otherwise.
    • Valid phone region: shows True for phone numbers that are valid for the region of the selected country and False otherwise.
    • Possible phone: shows True for phone numbers that are possible for the selected country and False otherwise.
    • Invalid phone: shows True for phone numbers that are not valid for the selected country and False otherwise.
    • Invalid phone region: shows True for phone numbers that are not valid for the region of the selected country and False otherwise.
    • Not possible phone: shows True for phone numbers that are not possible for the selected country and False otherwise.

Click Show step results to view the results. The following columns will be appended to your data:

  • Validation results – shows the result of the validation rule (one column per applied rule).
  • Phone Country - shows the country for the validated phone number.
  • Phone Number Type – shows the phone type (e.g. mobile or fixed line).
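
The step itself requires no coding, but the distinctions it draws - valid versus possible numbers, validity for a region, country and number type - are the same ones exposed by open-source libraries such as Google's libphonenumber. The Java sketch below shows the equivalent checks as an analogy; it is not a description of the step's internal implementation, and the number used is just an example:

    import com.google.i18n.phonenumbers.NumberParseException;
    import com.google.i18n.phonenumbers.PhoneNumberUtil;
    import com.google.i18n.phonenumbers.Phonenumber.PhoneNumber;

    public class PhoneValidationSketch {
        public static void main(String[] args) throws NumberParseException {
            PhoneNumberUtil util = PhoneNumberUtil.getInstance();

            // "GB" plays the role of the selected country.
            PhoneNumber number = util.parse("+44 20 7946 0018", "GB");

            System.out.println("Valid phone:        " + util.isValidNumber(number));
            System.out.println("Valid phone region: " + util.isValidNumberForRegion(number, "GB"));
            System.out.println("Possible phone:     " + util.isPossibleNumber(number));
            System.out.println("Phone country:      " + util.getRegionCodeForNumber(number));
            System.out.println("Phone number type:  " + util.getNumberType(number)); // e.g. FIXED_LINE or MOBILE
        }
    }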

Workflow parameters

Workflow parameters allow you to configure the settings of your Workflow steps without specifying a fixed value. You can create a Workflow parameter and assign it to any Workflow step setting that supports the Workflow parameter's datatype.

You can also set a Workflow parameter to be configurable at runtime, which allows you to set the parameter value when the Workflow is executed. Any step that the Workflow parameter is assigned to will use that value for that one-time execution.

To add Workflow parameters:

  1. Open an existing or create a new Workflow.
  2. Click Workflow parameters.
  3. Click Add parameter.
  4. Enter the Parameter name.
  5. Select the Datatype. Some Workflow steps only accept certain types of Workflow parameter.
  6. (Optional) Enter a Value. This will be used as the value for the Workflow steps that the parameter is assigned to. If the Workflow parameter value is empty, each Workflow step will use its own default value.
  7. (Optional) Enter a Summary. This will become the description of the Workflow parameter at runtime.
  8. (Optional) Tick Runtime configurable to allow the Workflow parameter to be configured at runtime.
  9. (Optional) Tick Required to make it mandatory to configure the Workflow parameter at runtime.
  10. Click Apply to save changes and start using the Workflow parameter in the Workflow.

Sharing Workflows

By default, Workflows will only be available to users who can access the Space in which the Workflow has been created. You can also share published Workflows with either specific Spaces or all Spaces.

To share a Workflow:

  1. Go to Workflows.
  2. Click Options > Sharing options.
  3. Select the required option:
    • Shared with some Spaces - select Spaces from the list of available Spaces to share this Workflow with.
    • Shared with all Spaces - select this option to share the Workflow with all the available Spaces.
  4. Click Apply to save changes.

The Shared? column of the Workflow will now show Global or Limited, depending on the option that you've chosen.

If other users have shared a Workflow with a Space you have access to, you can include that Workflow in your current Space:

  1. Go to Workflows.
  2. Click Include from another Space. This will list all the Workflows that have been shared with you.
  3. Select the required Workflow(s).
  4. Click Apply to save changes.

For a shared Workflow in the Workflow list, you'll have the options to View properties, Run and Remove from Space.

Workflow reports

You can create reports in HTML format for valid and complete Workflows by clicking Report in the Workflow editor. The report contains a comprehensive and detailed summary of the Workflow, allowing you to examine the details in a portable format.

A Workflow report generally consists of three sections:

  1. Contents: The summary, version history, Workflow parameters and a high-level abstracted graphical view of the entire Workflow, highlighting the relationships between steps.
  2. Workflow steps: Detailed information (e.g. position, column summaries, options, expressions) of each Workflow step.
  3. Output column lineage and Function glossary: Details of lineage for all output columns in the Workflow as well as any Functions used.