A Dataset stores the schema (column layout) of the data that will be loaded, including the parsing options, settings, data tags and data batch details.

You don't have to store data within Data Studio. You can create an empty Dataset with column headers and load data in batches. This lets you re-use the same Dataset by replacing data batches whenever needed.

Data within Datasets is stored in one or more batches. Each batch has an ID, date, timestamp and row count.

When creating or editing a Dataset, choose whether it will contain one or more batches. Choose one of the following:

  • Overwrite (default) to create a single-batch Dataset. New data loaded into this Dataset will overwrite any existing data. The latest data will be the only data available, meaning you're using the freshest data.
  • Add to create a multi-batch Dataset. New data loaded into this Dataset will be added/appended to existing data. Older batches of data will remain, so this is useful for historical data accumulated over time, such as weekly sales metrics or daily validation results.

A multi-batch Dataset allows you track data vintage and trend changes over time. It also allows older data to be automatically deleted.

Selecting and viewing batches

You can view the batch details within a Dataset from the Datasets list screen using Actions > Show batches.

The Source step in a Workflow provides an option to select All batches (default), last N batches, Batches created in the latest time period or to take a sample batch over a time period.

Deleting batches

To delete manually, go to the batch details within a Dataset in the Datasets list using Actions > Show batches. Select one or more batches then Delete and confirm deletion.

Datasets also have an Allow automatic batch deletion setting. Checking this provides the Delete batches on completion option in the Source step using this Dataset. When the Workflow is executed, it can be set to automatically delete any batches of data that are processed successfully, which will ensure that no batch of data is processed through a Workflow twice.

Automatic batch deletion

Each multi-batch Dataset has an optional limit on the number of batches it can have. When new data is loaded into that Dataset, if the specified number of batches would exceed the limit, the oldest batch will be automatically deleted. If this setting is left blank, no automatic deletion of batches will occur.

Automatically delete old data

For compliance or housekeeping reasons, we recommend to automatically delete old data. The Automatically delete older batches after N days setting is checked daily for every Dataset that has this setting enabled and delete any batches that are older than today minus the specified number of days.

Create a Dataset

To create, go to Datasets and click Add Dataset. This will launch a wizard to guide you through the process.

First, choose to either Upload file or connect to an External system such as a cloud file store or a database.

Loading a file

Data Studio supports most common file formats:.csv, .txt, .xls, .xlsx, .json, .psv, .sas7bdat, .dat, .metro2, as well as encrypted files or password protected files with these extensions: .aes, .enc, .gpg, .pgp, .zip, .7zip, .xls, .xlsx.

The metadata for fixed width files (.dat or .txt) is loaded in a .csv file. The following columns are accepted in this csv file. The order of columns do not matter, as long as the headers are labelled correctly.

Col name Required? Description
COLUMN Required The name of the column. Headers in the data file are not used.
START_POSITION Required Start position of the column. 1 indexed, i.e. the first character in the file is position 1.
END_POSITION Required End position of the column. This is the last character included in the column. An END_POSITION of 12 will include the character at position 12.
LENGTH Required Length of the column. This is needed for validation.
TYPE Required The data type of the column. ALPHANUMERIC, NUMERIC or DATE.
SUMMARY Optional Column description.
TAGS Optional List of Data tags to be assigned to the column. Multiple values must be surrounded with double quote and separated with comma. E.g. "Name, Surname".

Data Studio will analyze the file and automatically determine the best parser options to extract the data from the file, but these can be manually amended.

File parser is auto-selected based on the file extension, but a different parser can be selected from the list.

Character set is auto-selected by comparing first 64k characters against a universal library.

Language and region is auto-selected to match the host machine OS locale, unless overridden in the Data Studio installation config files.
The setting controls how date and numeric values are parsed on load (if automatically detecting data types). For example:

  • Date: 'English (United Kingdom)' will parse the value "01/02/2020" as 1st February 2020, 'English (United States)' will parse the same value as 2nd January 2020
  • Numeric: 'English (United Kingdom)' will parse the value "1,001" as the number 1001, 'German' will parse the same value as "1.001"

Delimiter and Quote character are auto-selected from the first 20 rows of data, but can be selected from the list or a custom value entered.

Import data from row can be used with .csv, .psv, and .txt delimited files to specify a number of rows to skip.

Use first row for column names - is selected by default, unselecting will name columns 'Column 1, Column 2, etc.'

Parse newline characters within quoted sections as new rows - always terminate row at an end of line character. By default newline characters within Quote characters, i.e. paragraphs of text, will be incorporated into the cell data.

Column datatype options:

  • Automatically detect the type of data for each column - sets the datatype of each column based on the first 20 rows. This impacts how data is parsed later on using the Language and region setting. Examples: a Numeric column will automatically convert "0123" to "123" and a whitespace will be replaced with null.
  • Treat all columns as alphanumeric - will set Alphanumeric for all columns.
  • Allow columns to support mixed datatypes - will set Unknown datatype for all columns.

Date only

Valid Input Comments
02-Jan-1970 Date elements separated by hyphen, shortened month name.
04-January-1970 Date elements separated by hyphen, using full month name.
15.Dec.1970 Date elements separated by period.
15/Dec/1970 Date elements separated by forward slash.
01-01-20 For 2-digit years the watershed will determine the century. Using default watershed, this will be parsed to 2020.
01-01-80 Will be parsed to 1980 using the default watershed.
1970-12-15 Year precedence.
12/15/2020 Month precedence. The locale (language and region) is used to set the month precedence setting in the parsing rules which determines the possible expected formats. So 10/28/2020 will not parse with a UK locale (but will for US) because 28 is an invalid month value, and there would be no US locale parsing as a fallback.

Date and time

Valid Input Comments
01-01-1970T12:34:56 Basic date/time without millis, using T time designator
01-01-1970 12:34:56 Basic date/time without millis
01-01-1970T12:34 Basic date/time without seconds, using T time designator
01-01-1970 12:34 Basic date/time without seconds
01-01-1970 1234 Hour and minute without separator. Becomes 01/01/1970 12:34
01-01-1970 234 Hour and minute without separator. Becomes 01/01/1970 02:34
01-01-1970T12:34:56.987 With milliseconds
01-01-1970T12:34:56.9 Becomes 01/01/1970 12:34:56.900
01-01-1970T123456.987 Without time separators. Becomes 01/01/1970 12:34:56.987
01-01-1970T10:00:00Z With timezone Zulu
01-01-1970 10:00:00+01:00 Becomes 01/01/1970 09:00:00
01-01-1970 10:00:00+01:00[Europe/Paris] Becomes 01/01/1970 09:00:00
01-01-1970T10:00:00-01:00 Becomes 01/01/1970 11:00:00
02-Jan-1990 12:01:02Z[UTC] Becomes 02/01/1990 12:01:02

Time only

Valid Input Comments
1:02:03.004 Time with milliseconds
1:02:03 Time with hours, minutes and seconds
10:02 Time with hours and minutes

Some database systems and file formats can store time zone-aware date and time values.

In the following CSV example, row 1 is a timestamp with a timezone +3 hrs. Row 2 is in UTC. Data Studio is not timezone aware, and when timestamps are loaded into Data Studio, values are converted into UTC.

row timestamp in CSV file timestamp loaded into Data Studio
1 01/01/2013 01:00:00+0300 31-Dec-2012 22:00:00
2 01/01/2013 01:00:00 01-Jan-2013 01:00:00

A watershed (also known as a century cutoff) is used to calculate the century that any two digit year is in. The default is a floating watershed of 50 years. This can be configured in Settings > Data handling with the option to use a fixed or floating watershed with configurable values for both.

  • A fixed watershed will interpret any year after the given value to be in the last century.
  • A floating watershed is similar to the fixed one but it's relative to the current year. The given value is added to the current year and any two digit years above will be considered to be in the last century. For example, if the floating watershed value is 10 and the current year is 2022, the watershed year will be 10 + 22, so any years after 32 will be converted to 19XX and years below it will be converted to 20XX. This is useful when parsing older dates such as date of birth.

If a .zip file contains multiple files or sub-folders the wizard will provide an option File name filter pattern field to apply a filter or mask to load only some of the files in the .zip.

Examples:

  • ‘MyFilename.xlsx’ to load only a specific file.
  • ‘MyFoldername/*.csv’ to load all .csv files in a specific folder.
  • ‘MyFoldername/*_2025.txt’ to load all .txt files with similar name from a specific folder.

All files included will be ‘unioned’ and loaded into a single batch within a Dataset. It is assumed they have the same type and an identical schema that matches that of the existing Dataset.

If creating a new Dataset, the first file encountered will be selected for the wizard to build out the schema. A current limitation only allows one .xlsx file to be loaded.

Preview data

Once loaded, a preview is shown containing the first few rows of data, allowing you to check the file/table and settings are correct before proceeding. For multiple tables or files like Excel containing more than one sheet, the first few rows of each table are shown in the preview.

Annotate columns

Each column in a Dataset can have a summary description and Data tags assigned. The Optional setting, if selected, will allow data to be loaded in future without this column being included.

An existing Dataset’s column annotations can be changed by selecting Actions > Annotate columns.

Data tags

Data tags are labels shown next to column headers. As well as making it easier to find the correct column, they help improve the user experience by providing relevant defaults and suggestions in Workflow steps.

We recommend to Auto tag your data, which will analyze the data and automatically apply data tags.

Column datatypes

The Datatype of a column can have a value of Alphanumeric, Date, Numeric or Unknown. This can change how file values are automatically parsed and help users to quickly identify and find the correct column in future.

The first time data is loaded into a Dataset, the datatypes are suggested by looking at the first 20 lines of data, but these can be updated during creation or in future from the Actions > Annotate columns page.

Dataset details and settings

Datasets have a Name, Summary, Description and an External Label that can be updated from the Actions > Edit details page.

To reduce disk space usage (by up to 80%) Data Studio will automatically compress your data as it's being loaded in, but compression may impact Workflow performance. The default compression applied can be updated in Settings > Performance. Each Dataset can have a different compression-level from the default. Any change of compression will be applied to data loaded from that point onwards.

  • Column-based compression is best for fast iterations of a single column, such as data validation or transformation where a Workflow is processing every value in a column.
  • Row-based compression often takes up least disk space and can be faster to compress the data. It creates a file that indexes each row, making it quick to access individual cells within the row.
  • No compression – the Dataset, including the data, will take up more disk space than the size of the file being loaded. However, loading data and processing Workflows will be fastest, so this is the best option for processing data in Data Studio without storing it long-term.

Loading new data

The Dataset batch settings will determine if data will overwrite or be added to any existing data.

To load a new version of a file into a file Dataset:

  • Manually: either from the Datasets list screen, by going to Actions > Upload new data or clicking into a Dataset (to check it's the correct one), then Upload data and selecting a required file. If the file structure has been amended, i.e. the schema no longer matches or spreadsheet names have changed, you will be prompted to configure and annotate columns.
  • Automatically: using a Dataset dropzone.

To load new data into an external system Dataset:

  • Manually: from the Datasets list screen, go to Actions > Refresh data (you can select multiple Datasets to refresh).
  • Automatically: ensure the Dataset has Allow auto-refresh enabled, create a Workflow with a Source step that has this Dataset as input and the Allow auto refresh option enabled.

Sharing Datasets

A Dataset cannot be shared with another Space, so the data will only be visible to users with access to that Space.

However, a Dataset can be the source of one or more Views or Charts which can then be shared with other Spaces.