A Dataset in Data Studio is the container in which data is stored. It consists of the metadata describing a set of input data, including the name, schema, parsing options, settings and tags.

The actual data (the values in rows and columns) contained within a Dataset is referred to as a batch of data.

The source of the Dataset could be a file, a database table, a DDL file, or a Snapshot.

External label specifications

External labels for Datasets must:

  • contain only alphanumeric characters or underscores,
  • be less than 255 characters long and
  • not start with a number or an underscore.
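
As an illustration only (not part of Data Studio), the following Python sketch checks a proposed label against these rules; the example labels are made up:

    import re

    # First character must be a letter; the rest may be letters, digits or underscores.
    EXTERNAL_LABEL_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

    def is_valid_external_label(label: str) -> bool:
        """Return True if the label satisfies the rules listed above."""
        return len(label) < 255 and bool(EXTERNAL_LABEL_PATTERN.fullmatch(label))

    is_valid_external_label("Customer_Master")   # True
    is_valid_external_label("_customers")        # False - starts with an underscore
    is_valid_external_label("2024 sales")        # False - starts with a digit and contains a space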

Data Studio supports most of the common file formats.

To upload a file:

  1. Go to Datasets.

  2. Click Add Dataset.

  3. Select Upload file for a locally stored file.

  4. Click Next.

  5. If uploading a local file, click Browse for file, select the required file, and click Next.

  6. Select from the available parsers for your file type. Click Browse metadata file if you're uploading files that have associated metadata (e.g. .dat files).

  7. (Optional) Change the charset and language/region of the file from a list of defaults.

  8. Select how Data Studio should parse the data in each column.

  9. If the input data has quoted values and you require the row to be terminated at an end of line character, regardless of whether it is inside or outside quotes, tick the option Parse newline characters within quoted sections as new rows. Otherwise, quoted newline characters will be incorporated into the cell data.

  10. You can preview how your data values will be loaded. For files with more than one table/sheet, a preview of each Dataset will be shown. Click Next.

  11. (Optional) You can automatically tag your data.

  12. Data Studio will suggest some default settings for the data type for each column - in most cases, you won't need to manually change anything.

  13. (Optional) You can annotate your data before loading it in Data Studio: add a description for each column, change the data type, and manually add or edit tags. Click Next.

  14. (Optional) You can edit the Dataset name and add a summary (this will appear in the Dataset list) and a longer description of what your data contains.

  15. The Dataset will have an External label, used to identify it from systems external to Data Studio.

  16. Select the required Batch settings: Single batch or Multi batch.

  17. (Optional) You can allow Workflows to delete the contents of this Dataset after processing by selecting Allow automatic batch deletion.

  18. (Optional) Select Enable server dropzone to have a dropzone folder created on the Data Studio server for this Dataset. This allows new files to be uploaded to the Dataset automatically from the dropzone folder.

  19. Choose the Compression level: Row-based to store data by row, Column-based to store data by column, or None to not use compression for this Dataset. The value selected by default is determined by what's been specified in Settings > Performance > Compression.

  20. (Optional) You can select to make this Dataset available via ODBC and OData by selecting Publish to ODBC and OData.

  21. Click Finish to complete the configuration and start the data load.

An External system is a remote database or file system from which Data Studio can extract data. The supported systems are:

  • JDBC
  • Apache Hadoop HDFS
  • Amazon S3
  • Microsoft Azure Blob Storage
  • An SFTP server

To load data from an external system:

  1. Go to Datasets.

  2. Click Add new Dataset.

  3. Select External system as the source type.

  4. Select a system from the available ones. If no systems are available, click System not listed? Add new and follow the steps to configure an External system.

  5. Select a credential to be associated with the Dataset.

  6. Select the data you want to load.

    • If the connection type of your External system is JDBC, you will have the option to load data from a Table or database view directly, or load data based on a SQL Database query.
    • For other connection types, select the file you want to load.
  7. Click Next to see a preview of the data to be loaded. Click Next again to annotate columns.

  8. (Optional) You can automatically tag your data.

  9. (Optional) You can annotate your data before loading it in Data Studio: add a description for each column and manually add or edit tags. Click Next.

  10. (Optional) You can edit the Dataset name and add a summary (this will appear in the Dataset list) and a longer description of what your data contains.

  11. (Situationally required) You can enter an External label for the Dataset in order to identify it from systems external to Data Studio. Note: An External label is required if your Dataset will have a dropzone or will be made available via ODBC and OData.

  12. Select the required Batch settings: Single batch or Multi batch.

  13. (Optional) Allow Workflows to delete the contents of this Dataset after processing by selecting Allow automatic batch deletion.

  14. (Optional) If the External system type is a file system (SFTP, AWS S3 or Azure Blob), you can select to have an external system dropzone available for this Dataset, to automatically upload new files to the Dataset from a folder located in the External system. Note that this requires an External label (see step 11).

  15. (Optional) If the External system type is a file system (SFTP, AWS S3 or Azure Blob), you can also select Enable server dropzone to have a server dropzone available for this Dataset if you want to be able to automatically upload new files from a folder on the Data Studio server.

  16. Choose the Compression level: Row-based to store data by row, Column-based to store data by column, or None to not use compression. The value selected by default is determined by what's been specified in Settings > Performance > Compression.

  17. (Optional) You can select to make this Dataset available via ODBC and OData by selecting Publish to ODBC and OData. The option will not be selectable if the Space has not been configured to allow publishing.

  18. Click Finish and the loaded data will appear in the Datasets list.

Dataset batches

A batch of data is zero or more rows of data loaded into a Dataset at a point in time. By default, a Dataset will contain only a single batch of data, and often the first batch will be created at the same time as the Dataset itself.

When a Dataset is refreshed, either by uploading a new file, re-fetching data from a database table, or taking a new Snapshot, the old batch is replaced by a new batch. However, a Dataset can optionally contain multiple batches.

When creating or editing a Dataset, you can choose whether it will contain only one batch, or whether it may contain many.

  • Single batch: If new data is uploaded into this Dataset, the existing data is over-written, and the latest batch will be the only one available.
  • Multi batch: New data uploaded to the Dataset is appended to existing data. Old batches will remain unless removed manually or as part of an automated process within a Workflow.

The primary use case for a multi batch Dataset is to process data that has accumulated over a period of time. For example, if a "new customers" file is created daily, you can upload all the files from the past week to create a single Dataset consisting of multiple batches representing daily additions. Multi batch Datasets are also useful for storing historical metrics, such as daily validation rule results, added as new batches to a Dataset using a Take Snapshot step. Each execution of the validation rules Workflow will add another set of metrics as a new batch to the same Dataset, allowing you to analyze quality trends over time.

Selecting and viewing batches

You can view the batch or batches within a Dataset from the Datasets page using Options > Show batches.

If a multi batch Dataset is selected in a Workflow's Source step, options in the step will allow you to configure which batches to select, or to sample batches.

Adding new batches

New batches of data can be added to your Dataset in three different ways:

  1. Manually. From the Datasets page:
    • Use Options > Upload new data to upload a new file, for Datasets that were originally created using a file load (i.e. the System is "Imported file").
    • Use Options > Refresh data to pull the latest data from a data source, for Datasets that were originally created from an External system.
  2. From a Workflow. Use the Take Snapshot step to update any existing Dataset with a new batch.
  3. Automatically from a Dropzone. If an External label has been given to a Dataset, you can configure Data Studio to watch a "dropzone" folder and load new files that appear in that folder as new batches of data, provided they closely match the target Dataset's schema.
    • For Datasets that were originally created using a file load or an SFTP, AWS S3 or Azure Blob External system, a dropzone folder on the Data Studio server can be used.
    • For Datasets that were originally created from an SFTP, AWS S3 or Azure Blob External system, a dropzone folder can also be configured on the External system.

Automatic batch deletion

When creating or editing a multi batch Dataset, you can specify a limit for the maximum number of batches that a Dataset will contain. If a number between 2 and 999 (inclusive) is specified, older batches will be automatically deleted when new batches are loaded into that Dataset, so that the number of retained batches does not exceed the specified maximum. If this setting is left blank, no automatic deletion of batches will occur.
When a Dataset is deleted, all batches within it are also deleted. However, it is possible to delete selected batches while retaining the Dataset.
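
As an illustration of the retention rule described above (not Data Studio's implementation), keeping only the newest batches up to a configured maximum could look like the following sketch, where the batch names are made up:

    def apply_batch_limit(batches, max_batches=None):
        """Keep only the newest max_batches batches.
        batches is assumed to be ordered oldest-first; None means no limit, so nothing is deleted."""
        if max_batches is None:
            return batches
        return batches[-max_batches:]

    daily_batches = ["monday", "tuesday", "wednesday", "thursday"]
    apply_batch_limit(daily_batches, max_batches=3)   # ['tuesday', 'wednesday', 'thursday']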

The Allow automatic batch deletion option, used with multi batch Datasets, will ensure that the batches of data which have been used in a Workflow are automatically deleted from the Dataset after they have been processed (i.e. after the Workflow processing those batches has been executed). This can be used to ensure that no batch is processed through a Workflow twice. It should be used in conjunction with the Delete batches on completion setting in the Workflow's Source step.

Batch timestamp

When creating a new Dataset using the Take Snapshot step, you have the option to add a batch timestamp column. When a new batch of data is added to the Dataset, the date and time that the batch was added is included in the batch timestamp column. In a multi batch Dataset, the timestamps can be used to differentiate data from separate batches, and for further analysis, aggregation and charting.

Dataset parsing options

When loading a Dataset from a file, there are three parsing options:

  • Automatically detect the type of data for each column (default) - this option previews the first 20 lines of the file and automatically determines the most suitable data type for each column.
    If values in columns beyond the preview limit have a different data type, these are loaded as alphanumeric.
    Values are parsed on load. For example, the value "0123" would be loaded as the numeric value 123 using this option, and whitespace-only values will become NULL (see the sketch after this list).
  • Treat all columns as alphanumeric - this option will not attempt to carry out any parsing on the values on load. All values will be loaded as alphanumeric. This will include retaining whitespace-only values and leading zeros.
  • Allow columns to support mixed datatypes - the data type defined for each column is Unknown, allowing values to be parsed as any type.
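
The sketch below mimics the difference between the first two options for a single value; it is illustrative only and is not Data Studio's parser:

    def parse_value_autodetect(raw: str):
        """'Automatically detect...' behaviour for a column detected as numeric."""
        if raw.strip() == "":
            return None          # whitespace-only values become NULL
        try:
            return int(raw)      # "0123" is loaded as the numeric value 123
        except ValueError:
            return raw           # other values remain alphanumeric

    def parse_value_alphanumeric(raw: str):
        """'Treat all columns as alphanumeric' behaviour: no parsing on load."""
        return raw               # leading zeros and whitespace-only values are retained

    parse_value_autodetect("0123")        # 123
    parse_value_autodetect("   ")         # None
    parse_value_alphanumeric("0123")      # '0123'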

Fixed width files metadata

The metadata for fixed width files (.dat or .txt) is loaded from a CSV file. The following columns are accepted in this CSV file. The order of the columns does not matter, as long as the headers are labeled correctly.

Col name Required? Description
COLUMN Required The name of the column. Headers in the data file are not used.
START_POSITION Required Start position of the column. 1 indexed, i.e. the first character in the file is position 1.
END_POSITION Required End position of the column. This is the last character included in the column. An END_POSITION of 12 will include the character at position 12.
LENGTH Required Length of the column. This is needed for validation.
TYPE Required The data type of the column. ALPHANUMERIC, NUMERIC or DATE.
SUMMARY Optional Column description.
TAGS Optional List of data tags that can be assigned to the column. Multiple values must be surrounded with double quote and separated with comma. E.g. "Name, Surname".
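
For example, a metadata file for a hypothetical fixed width file with three columns could look like this (the column names, positions and tags are illustrative only):

    COLUMN,START_POSITION,END_POSITION,LENGTH,TYPE,SUMMARY,TAGS
    CUSTOMER_ID,1,10,10,NUMERIC,Unique customer identifier,
    FULL_NAME,11,40,30,ALPHANUMERIC,Customer name,"Name, Surname"
    DATE_OF_BIRTH,41,50,10,DATE,Date of birth,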

Auto tagging columns

Data Studio can automatically determine your column data tags for you. Clicking the Auto tag button will prompt Data Studio to analyze the data in your columns and assign data tags to them appropriately. The amount of time taken to auto tag will vary depending on the number of columns in your Dataset. You can review the changes and edit the tags manually afterwards; no manual tags that have already been assigned will be removed by this process.

You may also cancel the auto tagging process before it has fully analyzed all the columns in your Dataset if you wish. If you do so, tags that were assigned before cancellation will be maintained.

If any columns were assigned tags (whether the process was fully completed or cancelled), a Remove all tags button will be available once the auto tag process has finished; clicking it will remove all tags, both manual and automatically assigned.

Auto tagging can be done while creating a Dataset, or on an existing Dataset using the Annotate columns option.

Annotating columns for an existing Dataset

The process for annotating the columns of an existing Dataset is similar to the process followed when creating Datasets. To reach the Annotate columns screen for an existing Dataset:

  1. Go to Datasets.
  2. On the required Dataset, expand the Options and click Annotate columns.

When the Auto tag button is pressed, Data Studio will analyze the data in your columns and assign data tags to them appropriately. Automatic data tags will be added for all columns, except where those tags have already been added manually.

Language and region

The Language and Region setting controls how date and numeric values are parsed on load (if detecting data types).

For example:

  • Dates: English (United Kingdom) will parse the value "01/02/2020" as 1st February 2020; English (United States) will parse the same value as 2nd January 2020.
  • Numbers: English (United Kingdom) will parse the value "1,001" as the number 1001; German will parse the same value as the number 1.001.
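
The Python sketch below reproduces the date example with explicit format strings, purely to illustrate how the same value changes meaning with day-first versus month-first precedence; Data Studio applies the precedence automatically based on the selected locale:

    from datetime import datetime

    value = "01/02/2020"
    datetime.strptime(value, "%d/%m/%Y").date()   # 2020-02-01 - day first, as in English (United Kingdom)
    datetime.strptime(value, "%m/%d/%Y").date()   # 2020-01-02 - month first, as in English (United States)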

Dates and Times

Loading timestamps with timezone

Some database systems and file formats can store time zone-aware date and time values.

In the following CSV example, row 1 is a timestamp with a timezone +3 hrs. Row 2 is in UTC. Data Studio is not timezone aware, and when timestamps are loaded into Data Studio, values are converted into UTC.

Row  Timestamp in CSV file        Timestamp loaded into Data Studio
1    01/01/2013 01:00:00+0300     31-Dec-2012 22:00:00
2    01/01/2013 01:00:00          01-Jan-2013 01:00:00
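
The same conversion can be reproduced outside Data Studio, for example with Python's datetime, to confirm what the loaded value will look like (illustrative only):

    from datetime import datetime, timezone

    ts = datetime.strptime("01/01/2013 01:00:00+0300", "%d/%m/%Y %H:%M:%S%z")
    ts.astimezone(timezone.utc)   # 2012-12-31 22:00:00+00:00, i.e. 31-Dec-2012 22:00:00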

Loading dates with two-digit years

A watershed (also known as a century cutoff) is used to calculate the century that any two digit year is in. The default is a floating watershed of 50 years. This can be configured in Settings > Data handling with the option to use a fixed or floating watershed with configurable values for both.

  • A fixed watershed will interpret any year after the given value to be in the last century.
  • A floating watershed is similar to the fixed one, but it is relative to the current year. The given value is added to the current two-digit year, and any two-digit years above the result will be considered to be in the last century. For example, if the floating watershed value is 10 and the current year is 2022, the watershed year will be 10 + 22 = 32, so any years after 32 will be converted to 19XX and years below will be converted to 20XX (see the sketch below). This is useful when parsing older dates such as dates of birth.
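
The sketch below illustrates the floating watershed rule described above; it is not Data Studio's implementation, and the handling of a year exactly on the watershed is assumed here to resolve to the current century:

    def expand_two_digit_year(yy, watershed_value=50, current_year=2022):
        """Expand a two-digit year using a floating watershed."""
        watershed = (current_year % 100) + watershed_value   # e.g. 22 + 10 = 32
        return (1900 + yy) if yy > watershed else (2000 + yy)

    expand_two_digit_year(20)       # 2020 - default floating watershed of 50 years
    expand_two_digit_year(80)       # 1980
    expand_two_digit_year(45, 10)   # 1945 - with a floating value of 10 in 2022, years after 32 become 19XX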

Supported date and time formats

Valid Input Comments
02-Jan-1970 Date elements separated by hyphen, shortened month name.
04-January-1970 Date elements separated by hyphen, using full month name.
15.Dec.1970 Date elements separated by period.
15/Dec/1970 Date elements separated by forward slash.
01-01-20 For 2-digit years the watershed will determine the century. Using the default watershed, this will be parsed to 2020.
01-01-80 Will be parsed to 1980 using the default watershed.
1970-12-15 Year precedence.
12/15/2020 Month precedence. The locale (language and region) is used to set the month precedence setting in the parsing rules which determines the possible expected formats. So 10/28/2020 will not parse with a UK locale (but will for US) because 28 is an invalid month value, and there would be no US locale parsing as a fallback.
Valid Input Comments
01-01-1970T12:34:56 Basic date/time without millis, using T time designator
01-01-1970 12:34:56 Basic date/time without millis
01-01-1970T12:34 Basic date/time without seconds, using T time designator
01-01-1970 12:34 Basic date/time without seconds
01-01-1970 1234 Hour and minute without separator. Becomes 01/01/1970 12:34
01-01-1970 234 Hour and minute without separator. Becomes 01/01/1970 02:34
01-01-1970T12:34:56.987 With milliseconds
01-01-1970T12:34:56.9 Becomes 01/01/1970 12:34:56.900
01-01-1970T123456.987 Without time separators. Becomes 01/01/1970 12:34:56.987
01-01-1970T10:00:00Z With timezone Zulu
01-01-1970 10:00:00+01:00 Becomes 01/01/1970 09:00:00
01-01-1970 10:00:00+01:00[Europe/Paris] Becomes 01/01/1970 09:00:00
01-01-1970T10:00:00-01:00 Becomes 01/01/1970 11:00:00
02-Jan-1990 12:01:02Z[UTC] Becomes 02/01/1990 12:01:02
Valid Input Comments
1:02:03.004 Time with milliseconds
1:02:03 Time with hours, minutes and seconds
10:02 Time with hours and minutes
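
The following shows a few of these inputs parsed with Python's strptime, purely to confirm the expected results; Data Studio detects the format automatically and does not require format strings:

    from datetime import datetime

    datetime.strptime("02-Jan-1970", "%d-%b-%Y")                    # 1970-01-02 00:00:00
    datetime.strptime("1970-12-15", "%Y-%m-%d")                     # year precedence
    datetime.strptime("01-01-1970 12:34:56", "%d-%m-%Y %H:%M:%S")   # basic date/time without millis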

Import data from row

This option lets you specify the row from which data will be imported when uploading a Dataset, allowing you to 'clean up' the data before it's loaded into Data Studio; any rows before that row are skipped. The value is also retained for the next batch upload or data refresh.

For example, setting the value to 5 will skip the first 4 rows in the Dataset when it's uploaded.

If Use first row for column names is checked, the first row of data will be considered as the header and will not count towards the number of lines of data to skip.
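
The sketch below mimics this behaviour outside Data Studio, using a hypothetical file name: with Use first row for column names checked and Import data from row set to 5, the header row is kept and the next four data rows are skipped.

    import csv
    from itertools import islice

    with open("new_customers.csv", newline="") as f:   # hypothetical file
        reader = csv.reader(f)
        header = next(reader)                  # first row supplies the column names
        rows = list(islice(reader, 4, None))   # skip the first 4 data rows, keep the rest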

Compression levels

To reduce disk space usage, you can compress data immediately after it's been loaded into Data Studio. You can set the level of compression that will be applied by default in Settings > Performance > Compression > Set a default compression level.

Data Studio offers two levels of data compression:

  1. Row-based - loaded data will be stored row by row and then compressed. This is the highest level of compression but may affect the Workflow performance.
    An index file contains metadata for each row of the Dataset, including a seek position into a data file. The data file stores the Dataset rows in a proprietary format. It’s possible to randomly access individual cells within the row, but it requires iterating over preceding cells.

    • Advantages: Efficiency in 'reading' data, and very fast access to individual cells in a Dataset.
    • Disadvantages: Iterating data in columns is not very efficient.
  2. Column-based - loaded data will be stored column by column and then compressed. This is a medium level of compression that may have some impact on the Workflow performance. This is the default compression option used by Data Studio.

    • Advantages: Very fast iterations over a single column and good compression, depending on the data.
    • Disadvantages: Not efficient in 'reading' individual rows.

For example, consider the following table of data:

Column 1 Column 2 Column 3
A1 B1 C1
A2 B2 C2
A3 B3 C3

Using the row-based compression, data will be stored as A1B1C1, A2B2C2, etc., while using the column-based compression, data will be stored as ABC, where A stores A1A2A3, B stores B1B2B3 and so on.
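
The difference in layout can be illustrated with a few lines of Python; this only shows the grouping of values, not how Data Studio physically stores or compresses data:

    rows = [["A1", "B1", "C1"],
            ["A2", "B2", "C2"],
            ["A3", "B3", "C3"]]

    row_based = ["".join(r) for r in rows]            # ['A1B1C1', 'A2B2C2', 'A3B3C3']
    column_based = ["".join(c) for c in zip(*rows)]   # ['A1A2A3', 'B1B2B3', 'C1C2C3']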