A Dataset in Data Studio is the container in which data is stored. It consists of the metadata describing a set of input data, including the name, schema, parsing options, settings and tags.

The actual data (the values in rows and columns) contained within a Dataset is referred to as a batch of data.

The source of the Dataset could be a file, a database table, a DDL file, or a Snapshot.

External label specifications

External labels for Datasets must:

  • contain only alphanumeric characters or underscores,
  • be less than 255 characters long and
  • not start with a number or an underscore.
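
As an illustration only (not part of Data Studio), the following Python sketch checks a proposed label against these rules; the example labels are made up:

    import re

    # First character must be a letter; the rest may be letters, digits or underscores.
    EXTERNAL_LABEL_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")

    def is_valid_external_label(label: str) -> bool:
        """Return True if the label satisfies the rules listed above."""
        return len(label) < 255 and bool(EXTERNAL_LABEL_PATTERN.fullmatch(label))

    is_valid_external_label("Customer_Master")   # True
    is_valid_external_label("_customers")        # False - starts with an underscore
    is_valid_external_label("2024 sales")        # False - starts with a digit and contains a space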

Data Studio supports most of the common file formats.

To upload a file:

  1. Go to Datasets.

  2. Click Add Dataset.

  3. Select Upload file for a locally stored file.

  4. Click Next.

  5. If uploading a local file, click Browse for file, select the required file, and click Next.

  6. Select from the available parsers for your file type. Click Browse metadata file if you're uploading files that have associated metadata (e.g. .dat files).

  7. (Optional) Change the charset and language/region of the file from a list of defaults.

  8. Select how Data Studio should parse the data in each column.

  9. If the input data has quoted values and you require the row to be terminated at an end of line character, regardless of whether it is inside or outside quotes, tick the option Parse newline characters within quoted sections as new rows. Otherwise, quoted newline characters will be incorporated into the cell data.

  10. You can preview how your data values will be loaded. For files with more than one table/sheet, a preview of each Dataset will be shown. Click Next.

  11. (Optional) You can automatically tag your data.

  12. Data Studio will suggest some default settings for the data type for each column - in most cases, you won't need to manually change anything.

  13. (Optional) You can annotate your data before loading it in Data Studio: add a description for each column, change the data type, and manually add or edit tags. Click Next.

  14. (Optional) You can edit the Dataset name and add a summary (this will appear in the Dataset list) and a longer description of what your data contains.

  15. The Dataset will have an External label, used to identify it from systems external to Data Studio.

  16. Select the required Batch settings: Single batch or Multi batch.

  17. (Optional) You can allow Workflows to delete the contents of this Dataset after processing by selecting Allow automatic batch deletion.

  18. (Optional) Select Enable server dropzone to have a dropzone folder created on the Data Studio server for this Dataset. This allows new files to be uploaded to the Dataset automatically from the dropzone folder.

  19. Choose the Compression level: Row-based to store data by row, Column-based to store data by column, or None to not use compression for this Dataset. The value selected by default is determined by what's been specified in Settings > Performance > Compression.

  20. (Optional) You can select to make this Dataset available via ODBC and OData by selecting Publish to ODBC and OData.

  21. Click Finish to complete the configuration and start the data load.

An External system is a remote database or file system from which Data Studio can extract data. The supported systems are:

  • JDBC
  • Apache Hadoop HDFS
  • Amazon S3
  • Microsoft Azure Blob Storage
  • An SFTP server

To load data from an external system:

  1. Go to Datasets.

  2. Click Add new Dataset.

  3. Select External system as the source type.

  4. Select a system from the available ones. If no systems are available, click System not listed? Add new and follow the steps to configure an External system.

  5. Select a credential to be associated with the Dataset.

  6. Select the data you want to load.

    • If the connection type of your External system is JDBC, you will have the option to load data from a Table or database view directly, or load data based on a SQL Database query.
    • For other connection types, select the file you want to load.
  7. Click Next to see a preview of the data to be loaded. Click Next again to annotate columns.

  8. (Optional) You can automatically tag your data.

  9. (Optional) You can annotate your data before loading it in Data Studio: add a description for each column and manually add or edit tags. Click Next.

  10. (Optional) You can edit the Dataset name and add a summary (this will appear in the Dataset list) and a longer description of what your data contains.

  11. (Situationally required) You can enter an External label for the Dataset in order to identify it from systems external to Data Studio. Note: An External label is required if your Dataset will have a dropzone or will be made available via ODBC and OData.

  12. Select the required Batch settings: Single batch or Multi batch.

  13. (Optional) Allow Workflows to delete the contents of this Dataset after processing by selecting Allow automatic batch deletion.

  14. (Optional) If the External system type is a file system (SFTP, AWS S3 or Azure Blob), you can select to have an external system dropzone available for this Dataset, to automatically upload new files to the Dataset from a folder located in the External system. Note that this requires an External label (see step 11).

  15. (Optional) If the External system type is a file system (SFTP, AWS S3 or Azure Blob), you can also select Enable server dropzone to have a server dropzone available for this Dataset if you want to be able to automatically upload new files from a folder on the Data Studio server.

  16. Choose the Compression level: Row-based to store data by row, Column-based to store data by column, or None to not use compression. The value selected by default is determined by what's been specified in Settings > Performance > Compression.

  17. (Optional) You can select to make this Dataset available via ODBC and OData by selecting Publish to ODBC and OData. The option will not be selectable if the Space has not been configured to allow publishing.

  18. Click Finish and the loaded data will appear in the Datasets list.

Dataset batches

A batch of data is zero or more rows of data loaded into a Dataset at a point in time. By default, a Dataset will contain only a single batch of data, and often the first batch will be created at the same time as the Dataset itself.

When a Dataset is refreshed, either by uploading a new file, re-fetching data from a database table, or taking a new Snapshot, the old batch is replaced by a new batch. However, a Dataset can optionally contain multiple batches.

When creating or editing a Dataset, you can choose whether it will contain only one batch, or whether it may contain many.

  • Single batch: If new data is uploaded into this Dataset, the existing data is over-written, and the latest batch will be the only one available.
  • Multi batch: New data uploaded to the Dataset is appended to existing data. Old batches will remain unless removed manually or as part of an automated process within a Workflow.

The primary use case for a multi batch Dataset is to process data that has accumulated over a period of time. For example, if a "new customers" file is created daily, you can upload all the files from the past week to create a single Dataset consisting of multiple batches representing daily additions. Multi batch Datasets are also useful for storing historical metrics, such as daily validation rule results, added as new batches to a Dataset using a Take Snapshot step. Each execution of the validation rules Workflow will add another set of metrics as a new batch to the same Dataset, allowing you to analyze quality trends over time.

Selecting and viewing batches

You can view the batch or batches within a Dataset from the Datasets page using Options > Show batches.

If a multi batch Dataset is selected in a Workflow's Source step, options in the step will allow you to configure which batches to select, or to sample batches.

Adding new batches

New batches of data can be added to your Dataset in three different ways:

  1. Manually. From the Datasets page:
    • Use Options > Upload new data to upload a new file, for Datasets that were originally created using a file load (i.e. the System is "Imported file").
    • Use Options > Refresh data to pull the latest data from a data source, for Datasets that were originally created from an External system.
  2. From a Workflow. Use the Take Snapshot step to update any existing Dataset with a new batch.
  3. Automatically from a Dropzone. If an External label has been given to a Dataset, you can configure Data Studio to watch a "dropzone" folder and load new files that appear in that folder as new batches of data, provided they closely match the target Dataset's schema.
    • For Datasets that were originally created using a file load or an SFTP, AWS S3 or Azure Blob External system, a dropzone folder on the Data Studio server can be used.
    • For Datasets that were originally created from an SFTP, AWS S3 or Azure Blob External system, a dropzone folder can also be configured on the External system.

Automatic batch deletion

When creating or editing a multi batch Dataset, you can specify a limit for the maximum number of batches that a Dataset will contain. If a number between 2 and 999 (inclusive) is specified, older batches will be automatically deleted when new batches are loaded into that Dataset, so that the number of retained batches does not exceed the specified maximum. If this setting is left blank, no automatic deletion of batches will occur.
When a Dataset is deleted, all batches within it are also deleted. However, it is possible to delete selected batches while retaining the Dataset.
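
As an illustration of the retention rule described above (not Data Studio's implementation), keeping only the newest batches up to a configured maximum could look like the following sketch, where the batch names are made up:

    def apply_batch_limit(batches, max_batches=None):
        """Keep only the newest max_batches batches.
        batches is assumed to be ordered oldest-first; None means no limit, so nothing is deleted."""
        if max_batches is None:
            return batches
        return batches[-max_batches:]

    daily_batches = ["monday", "tuesday", "wednesday", "thursday"]
    apply_batch_limit(daily_batches, max_batches=3)   # ['tuesday', 'wednesday', 'thursday']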

The Allow automatic batch deletion option, used with multi batch Datasets, will ensure that the batches of data which have been used in a Workflow are automatically deleted from the Dataset after they have been processed (i.e. after the Workflow processing those batches has been executed). This can be used to ensure that no batch is processed through a Workflow twice. It should be used in conjunction with the Delete batches on completion setting in the Workflow's Source step.

Batch timestamp

When creating a new Dataset using the Take Snapshot step, you have the option to add a batch timestamp column. When a new batch of data is added to the Dataset, the date and time that the batch was added is included in the batch timestamp column. In a multi batch Dataset, the timestamps can be used to differentiate data from separate batches, and for further analysis, aggregation and charting.

Dataset parsing options

When loading a Dataset from a file, there are three parsing options:

  • Automatically detect the type of data for each column (default) - this option previews the first 20 lines of the file and automatically determines the most suitable data type for each column.
    If values in columns beyond the preview limit have a different data type, these are loaded as alphanumeric.
    Values are parsed on load. For example, the value "0123" would be loaded as the numeric value 123 using this option, and whitespace-only values will become NULL (see the sketch after this list).
  • Treat all columns as alphanumeric - this option will not attempt to carry out any parsing on the values on load. All values will be loaded as alphanumeric. This will include retaining whitespace-only values and leading zeros.
  • Allow columns to support mixed datatypes - the data type defined for each column is Unknown, allowing values to be parsed as any type.
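
The sketch below mimics the difference between the first two options for a single value; it is illustrative only and is not Data Studio's parser:

    def parse_value_autodetect(raw: str):
        """'Automatically detect...' behaviour for a column detected as numeric."""
        if raw.strip() == "":
            return None          # whitespace-only values become NULL
        try:
            return int(raw)      # "0123" is loaded as the numeric value 123
        except ValueError:
            return raw           # other values remain alphanumeric

    def parse_value_alphanumeric(raw: str):
        """'Treat all columns as alphanumeric' behaviour: no parsing on load."""
        return raw               # leading zeros and whitespace-only values are retained

    parse_value_autodetect("0123")        # 123
    parse_value_autodetect("   ")         # None
    parse_value_alphanumeric("0123")      # '0123'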

Fixed width files metadata

The metadata for fixed width files (.dat or .txt) is loaded from a CSV file. The following columns are accepted in this CSV file. The order of the columns does not matter, as long as the headers are labeled correctly.

Col name Required? Description
COLUMN Required The name of the column. Headers in the data file are not used.
START_POSITION Required Start position of the column. 1 indexed, i.e. the first character in the file is position 1.
END_POSITION Required End position of the column. This is the last character included in the column. An END_POSITION of 12 will include the character at position 12.
LENGTH Required Length of the column. This is needed for validation.
TYPE Required The data type of the column. ALPHANUMERIC, NUMERIC or DATE.
SUMMARY Optional Column description.
TAGS Optional List of data tags that can be assigned to the column. Multiple values must be surrounded with double quote and separated with comma. E.g. "Name, Surname".
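
For example, a metadata file for a hypothetical fixed width file with three columns could look like this (the column names, positions and tags are illustrative only):

    COLUMN,START_POSITION,END_POSITION,LENGTH,TYPE,SUMMARY,TAGS
    CUSTOMER_ID,1,10,10,NUMERIC,Unique customer identifier,
    FULL_NAME,11,40,30,ALPHANUMERIC,Customer name,"Name, Surname"
    DATE_OF_BIRTH,41,50,10,DATE,Date of birth,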

Auto tagging columns

Data Studio can automatically determine your column data tags for you. Clicking the Auto tag button will prompt Data Studio to analyze the data in your columns and assign data tags to them appropriately. The amount of time taken to auto tag will vary depending on the number of columns in your Dataset. You can review the changes and edit the tags manually afterwards; no manual tags that have already been assigned will be removed by this process.

You may also cancel the auto tagging process before it has fully analyzed all the columns in your Dataset if you wish. If you do so, tags that were assigned before cancellation will be maintained.

If any columns were assigned tags (whether the process was fully completed or cancelled), a Remove all tags button will be available once the auto tag process has finished; clicking it will remove all tags, both manual and automatically assigned.

Auto tagging can be done while creating a Dataset, or on an existing Dataset using the Annotate columns option.

Annotating columns for an existing Dataset

The process for annotating the columns of an existing Dataset is similar to the process followed when creating Datasets. To reach the Annotate columns screen for an existing Dataset:

  1. Go to Datasets.
  2. On the required Dataset, expand the Options and click Annotate columns.

When the Auto tag button is pressed, Data Studio will analyze the data in your columns and assign data tags to them appropriately. Automatic data tags will be added for all columns, except where those tags have already been added manually.

Language and region

The Language and Region setting controls how date and numeric values are parsed on load (if detecting data types).

For example:

  • Dates: English (United Kingdom) will parse the value "01/02/2020" as 1st February 2020; English (United States) will parse the same value as 2nd January 2020.
  • Numbers: English (United Kingdom) will parse the value "1,001" as the number 1001; German will parse the same value as the number 1.001.
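
The Python sketch below reproduces the date example with explicit format strings, purely to illustrate how the same value changes meaning with day-first versus month-first precedence; Data Studio applies the precedence automatically based on the selected locale:

    from datetime import datetime

    value = "01/02/2020"
    datetime.strptime(value, "%d/%m/%Y").date()   # 2020-02-01 - day first, as in English (United Kingdom)
    datetime.strptime(value, "%m/%d/%Y").date()   # 2020-01-02 - month first, as in English (United States)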

Dates and Times

Loading timestamps with timezone

Some database systems and file formats can store time zone-aware date and time values.

In the following CSV example, row 1 is a timestamp with a timezone +3 hrs. Row 2 is in UTC. Data Studio is not timezone aware, and when timestamps are loaded into Data Studio, values are converted into UTC.

Row  Timestamp in CSV file        Timestamp loaded into Data Studio
1    01/01/2013 01:00:00+0300     31-Dec-2012 22:00:00
2    01/01/2013 01:00:00          01-Jan-2013 01:00:00
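
The same conversion can be reproduced outside Data Studio, for example with Python's datetime, to confirm what the loaded value will look like (illustrative only):

    from datetime import datetime, timezone

    ts = datetime.strptime("01/01/2013 01:00:00+0300", "%d/%m/%Y %H:%M:%S%z")
    ts.astimezone(timezone.utc)   # 2012-12-31 22:00:00+00:00, i.e. 31-Dec-2012 22:00:00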

Loading dates with two-digit years

A watershed (also known as a century cutoff) is used to calculate the century that any two digit year is in. The default is a floating watershed of 50 years. This can be configured in Settings > Data handling with the option to use a fixed or floating watershed with configurable values for both.

  • A fixed watershed will interpret any year after the given value to be in the last century.
  • A floating watershed is similar to the fixed one, but it is relative to the current year. The given value is added to the current two-digit year, and any two-digit years above the result will be considered to be in the last century. For example, if the floating watershed value is 10 and the current year is 2022, the watershed year will be 10 + 22 = 32, so any years after 32 will be converted to 19XX and years below will be converted to 20XX (see the sketch below). This is useful when parsing older dates such as dates of birth.
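
The sketch below illustrates the floating watershed rule described above; it is not Data Studio's implementation, and the handling of a year exactly on the watershed is assumed here to resolve to the current century:

    def expand_two_digit_year(yy, watershed_value=50, current_year=2022):
        """Expand a two-digit year using a floating watershed."""
        watershed = (current_year % 100) + watershed_value   # e.g. 22 + 10 = 32
        return (1900 + yy) if yy > watershed else (2000 + yy)

    expand_two_digit_year(20)       # 2020 - default floating watershed of 50 years
    expand_two_digit_year(80)       # 1980
    expand_two_digit_year(45, 10)   # 1945 - with a floating value of 10 in 2022, years after 32 become 19XX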

Supported date and time formats

Valid Input Comments
02-Jan-1970 Date elements separated by hyphen, shortened month name.
04-January-1970 Date elements separated by hyphen, using full month name.
15.Dec.1970 Date elements separated by period.
15/Dec/1970 Date elements separated by forward slash.
01-01-20 For 2-digit years the watershed will determine the century. Using the default watershed, this will be parsed to 2020.
01-01-80 Will be parsed to 1980 using the default watershed.
1970-12-15 Year precedence.
12/15/2020 Month precedence. The locale (language and region) is used to set the month precedence setting in the parsing rules which determines the possible expected formats. So 10/28/2020 will not parse with a UK locale (but will for US) because 28 is an invalid month value, and there would be no US locale parsing as a fallback.
Valid Input Comments
01-01-1970T12:34:56 Basic date/time without millis, using T time designator
01-01-1970 12:34:56 Basic date/time without millis
01-01-1970T12:34 Basic date/time without seconds, using T time designator
01-01-1970 12:34 Basic date/time without seconds
01-01-1970 1234 Hour and minute without separator. Becomes 01/01/1970 12:34
01-01-1970 234 Hour and minute without separator. Becomes 01/01/1970 02:34
01-01-1970T12:34:56.987 With milliseconds
01-01-1970T12:34:56.9 Becomes 01/01/1970 12:34:56.900
01-01-1970T123456.987 Without time separators. Becomes 01/01/1970 12:34:56.987
01-01-1970T10:00:00Z With timezone Zulu
01-01-1970 10:00:00+01:00 Becomes 01/01/1970 09:00:00
01-01-1970 10:00:00+01:00[Europe/Paris] Becomes 01/01/1970 09:00:00
01-01-1970T10:00:00-01:00 Becomes 01/01/1970 11:00:00
02-Jan-1990 12:01:02Z[UTC] Becomes 02/01/1990 12:01:02
Valid Input Comments
1:02:03.004 Time with milliseconds
1:02:03 Time with hours, minutes and seconds
10:02 Time with hours and minutes
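
The following shows a few of these inputs parsed with Python's strptime, purely to confirm the expected results; Data Studio detects the format automatically and does not require format strings:

    from datetime import datetime

    datetime.strptime("02-Jan-1970", "%d-%b-%Y")                    # 1970-01-02 00:00:00
    datetime.strptime("1970-12-15", "%Y-%m-%d")                     # year precedence
    datetime.strptime("01-01-1970 12:34:56", "%d-%m-%Y %H:%M:%S")   # basic date/time without millis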

Import data from row

This option lets you specify the row from which data will be imported when uploading a Dataset, allowing you to 'clean up' the data before it's loaded into Data Studio; any rows before that row are skipped. The value is also retained for the next batch upload or data refresh.

For example, setting the value to 5 will skip the first 4 rows in the Dataset when it's uploaded.

If Use first row for column names is checked, the first row of data will be considered as the header and will not count towards the number of lines of data to skip.
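
The sketch below mimics this behaviour outside Data Studio, using a hypothetical file name: with Use first row for column names checked and Import data from row set to 5, the header row is kept and the next four data rows are skipped.

    import csv
    from itertools import islice

    with open("new_customers.csv", newline="") as f:   # hypothetical file
        reader = csv.reader(f)
        header = next(reader)                  # first row supplies the column names
        rows = list(islice(reader, 4, None))   # skip the first 4 data rows, keep the rest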

Compression levels

To reduce disk space usage, you can compress data immediately after it's been loaded into Data Studio. You can set the level of compression that will be applied by default in Settings > Performance > Compression > Set a default compression level.

Data Studio offers two levels of data compression:

  1. Row-based - loaded data will be stored row by row and then compressed. This is the highest level of compression but may affect the Workflow performance.
    An index file contains metadata for each row of the Dataset, including a seek position into a data file. The data file stores the Dataset rows in a proprietary format. It’s possible to randomly access individual cells within the row, but it requires iterating over preceding cells.

    • Advantages: Efficiency in 'reading' data, and very fast access to individual cells in a Dataset.
    • Disadvantages: Iterating data in columns is not very efficient.
  2. Column-based - loaded data will be stored column by column and then compressed. This is a medium level of compression that may have some impact on the Workflow performance. This is the default compression option used by Data Studio.

    • Advantages: Very fast iterations over a single column and good compression, depending on the data.
    • Disadvantages: Not efficient in 'reading' individual rows.

For example, consider the following table of data:

Column 1 Column 2 Column 3
A1 B1 C1
A2 B2 C2
A3 B3 C3

Using the row-based compression, data will be stored as A1B1C1, A2B2C2, etc., while using the column-based compression, data will be stored as ABC, where A stores A1A2A3, B stores B1B2B3 and so on.
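
The difference in layout can be illustrated with a few lines of Python; this only shows the grouping of values, not how Data Studio physically stores or compresses data:

    rows = [["A1", "B1", "C1"],
            ["A2", "B2", "C2"],
            ["A3", "B3", "C3"]]

    row_based = ["".join(r) for r in rows]            # ['A1B1C1', 'A2B2C2', 'A3B3C3']
    column_based = ["".join(c) for c in zip(*rows)]   # ['A1A2A3', 'B1B2B3', 'C1C2C3']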