A Dataset in Data Studio is the container in which data is stored. It consists of the metadata describing a set of input data, including the name, schema, parsing options, settings and tags. The actual data (the values in rows and columns) contained within a Dataset is referred to as a batch of data.
The source of the Dataset could be a file, a database table, a DDL file, or a snapshot taken in a Workflow.
Data Studio supports most common file formats.
To upload a file:
Go to Datasets.
Click Add Dataset.
Select Upload file for a locally stored file, or Server import directory to get a file from a pre-defined import directory.
If uploading a local file, click Browse for file and select the required one.
Select from the available parsers for your file type. Select Browse metadata file if you're uploading files that have associated metadata (e.g. .dat files).
(Optional) Change the charset and language/region of the file from the system defaults.
Select how Data Studio should parse the data in each column.
If the input data has quoted values and you require the row to be terminated at an end of line character, regardless of whether it is inside or outside quotes, tick the option Parse newline characters within quoted sections as new rows. Otherwise, quoted newline characters will be incorporated into the cell data.
You can preview how your data values will be loaded. For files with more than one table/sheet, a preview of each Dataset will be shown.
(Optional) You can automatically tag your data.
Data Studio will suggest default settings for the data type of each column; in most cases, you won't need to change anything manually.
(Optional) You can annotate your data before loading it in Data Studio: add a description for each column, change the data type, and manually add or edit tags.
(Optional) You can edit the Dataset name and add a summary (this will appear in the Dataset list) and a longer description of what your data contains.
The Dataset will be given an External label, which is used to identify it from systems external to Data Studio.
Select the required Batch setting: Single batch or Multi batch.
(Optional) You can allow Workflows to delete the contents of this Dataset after processing by selecting Allow automatic batch deletion.
Enable or disable the Allow auto-refresh setting.
(Optional) Select Enable server dropzone to have a dropzone folder created on the Data Studio server for this Dataset. New files placed in this folder will be automatically uploaded to the Dataset.
Choose the Compression level: Row-based to store data by row, Column-based to store data by column or None to not use compression for this Dataset. The value selected by default is determined by what's been specified in Settings > Performance > Compression. Find out more about the compression levels.
(Optional) You can make this Dataset available via ODBC and OData by selecting Publish to ODBC and OData.
Click Finish to complete the configuration and start the data load.
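The quoted-newline parsing option described in the steps above can be illustrated with Python's csv module; the sample data is invented:

```python
import csv
import io

# Hypothetical sample: row 1's second field contains a quoted newline.
raw = 'id,comment\n1,"line one\nline two"\n2,plain\n'

# Option unticked: quoted newline characters are kept inside the cell value.
rows = list(csv.reader(io.StringIO(raw)))
print(rows[1])    # ['1', 'line one\nline two']

# Option ticked: every newline terminates a row, even inside quotes --
# mimicked here by splitting the raw text into lines before parsing.
forced = [next(csv.reader([line])) for line in raw.splitlines()]
print(forced[1])  # ['1', 'line one']  (the quoted value is cut short)
print(forced[2])  # ['line two"']      (the remainder becomes its own row)
```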
An External system is a remote database system or file system from which Data Studio can extract data. The supported systems are:
To load data from an external system:
A batch of data is zero or more rows of data loaded into a Dataset at a point in time. By default, a Dataset will contain only a single batch of data, and often the first batch will be created at the same time as the Dataset itself.
When a Dataset is refreshed, either by uploading a new file, re-fetching data from a database table, or taking a new snapshot, the old batch is replaced by a new batch. However, a Dataset can optionally contain multiple batches.
When creating or editing a Dataset, you can choose whether it will contain only one batch, or whether it may contain many.
The primary use case for a multi batch Dataset is to process data that has accumulated over a period of time. For example, if a "new customers" file is created daily, you can upload all the files from the past week to create a single Dataset consisting of multiple batches, each representing a daily addition. Multi batch Datasets are also useful for storing historical metrics, such as daily validation rule results, added as new batches to a Dataset using a Take snapshot step. Each execution of the validation rules Workflow will add another set of metrics as a new batch to the same Dataset, allowing you to analyze quality trends over time.
You can view the batch or batches within a Dataset from the Datasets page using Options > Show batches.
If a multi batch Dataset is selected in a Workflow's Source step, options in the step will allow you to configure which batches to select, or to sample batches.
New batches of data can be added to your Dataset in three different ways:
When creating or editing a multi batch Dataset, you can specify the maximum number of batches that the Dataset will contain. If a number between 2 and 999 (inclusive) is specified, the oldest batches will be automatically deleted to stay within that limit as new batches are loaded into the Dataset. If this setting is left blank, no automatic deletion of batches will occur.
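This rolling deletion behaves like a bounded queue: loading a new batch evicts the oldest once the limit is reached. A minimal Python sketch, assuming a limit of 3 batches (the batch names are invented):

```python
from collections import deque

# A Dataset limited to 3 batches behaves like a bounded queue:
# appending a fourth batch evicts the oldest one.
batches = deque(maxlen=3)
for day in ["mon", "tue", "wed", "thu"]:
    batches.append(f"batch-{day}")

print(list(batches))  # ['batch-tue', 'batch-wed', 'batch-thu']
```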
When a Dataset is deleted, all batches within it are also deleted. However, it is possible to delete selected batches while retaining the Dataset.
The Allow automatic batch deletion option, used with multi batch Datasets, will ensure that the batches of data which have been used in a Workflow are automatically deleted from the Dataset after they have been processed (i.e. after the Workflow processing those batches has been executed). This can be used to ensure that no batch is processed through a Workflow twice. It should be used in conjunction with the Delete batches on completion setting in the Workflow's Source step.
When creating a new Dataset using the Take snapshot step, you have the option to add a batch timestamp column. When a new batch of data is added to the Dataset, the date and time that the batch was added is included in the batch timestamp column. In a multi batch Dataset, the timestamps can be used to differentiate data from separate batches, and for further analysis, aggregation and charting.
When loading a Dataset from a file there are three parsing options.
The metadata for fixed width files (.dat or .txt) is loaded from a CSV file. The following columns are accepted in this CSV file; the order of the columns does not matter, as long as the headers are labeled correctly.
|Header||Required?||Description|
|COLUMN||Required||The name of the column. Headers in the data file are not used.|
|START_POSITION||Required||Start position of the column, 1-indexed (the first character in the file is position 1).|
|END_POSITION||Required||End position of the column. This is the last character included in the column, so an END_POSITION of 12 includes the character at position 12.|
|LENGTH||Required||Length of the column. This is used for validation.|
|TYPE||Required||The data type of the column: ALPHANUMERIC, NUMERIC or DATE.|
|TAGS||Optional||List of data tags that can be assigned to the column. Multiple values must be surrounded with double quotes and separated with commas, e.g. "Name, Surname".|
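The sketch below shows how such a metadata file could drive fixed-width parsing. The column headers follow the table above; the metadata values, record contents and parsing logic are invented for illustration and are not Data Studio's implementation:

```python
import csv
import io

# Hypothetical metadata describing two columns of a fixed-width .dat file.
metadata_csv = """COLUMN,START_POSITION,END_POSITION,LENGTH,TYPE,TAGS
NAME,1,10,10,ALPHANUMERIC,"Name, Surname"
AGE,11,13,3,NUMERIC,
"""
columns = list(csv.DictReader(io.StringIO(metadata_csv)))

# Slice one fixed-width record using the 1-indexed, inclusive positions.
record = "John Smith 42"
parsed = {}
for col in columns:
    start = int(col["START_POSITION"]) - 1    # 1-indexed -> 0-indexed
    end = int(col["END_POSITION"])            # inclusive end == Python slice end
    assert int(col["LENGTH"]) == end - start  # LENGTH is used for validation
    parsed[col["COLUMN"]] = record[start:end].strip()

print(parsed)  # {'NAME': 'John Smith', 'AGE': '42'}
```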
Data Studio can automatically determine your column data tags for you. Clicking the Auto tag button prompts Data Studio to analyze the data in your columns and assign appropriate data tags. The time taken to auto tag will vary depending on the number of columns in your Dataset. You can review the changes and edit the tags manually afterwards; any manual tags that have already been assigned will not be removed by this process.
You may also cancel the auto tagging process before it has fully analyzed all the columns in your Dataset if you wish. If you do so, tags that were assigned before cancellation will be maintained.
If any columns were assigned tags (whether the process completed or was cancelled), a Remove all tags button will be available once the auto tag process has finished. Clicking it will remove all tags, both manual and automatically assigned.
Auto tagging can be done while creating a Dataset, or on an existing Dataset using the Annotate columns option.
The process for annotating the columns of an existing Dataset is similar to the process followed when creating Datasets. To reach the Annotate columns screen for an existing Dataset:
When the Auto tag button is pressed, Data Studio will analyze the data in your columns and assign data tags to them appropriately. Automatic data tags will be added to all columns where tags have not already been added manually.
The Language and Region setting controls how date and numeric values are parsed on load (if detecting data types).
Some database systems and file formats can store time zone-aware date and time values.
In the following CSV example, row 1 is a timestamp with a timezone offset of +3 hours; row 2 is in UTC. Data Studio is not timezone aware, and when timestamps are loaded into Data Studio, values are converted into UTC.
|row||timestamp in CSV file||timestamp loaded into Data Studio|
|1||01/01/2013 01:00:00+0300||31-Dec-2012 22:00:00|
|2||01/01/2013 01:00:00||01-Jan-2013 01:00:00|
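The conversion in row 1 can be reproduced with Python's datetime module, as a minimal sketch of the same UTC normalization:

```python
from datetime import datetime, timezone

# Row 1 from the table above: a timestamp with a +3 hour offset.
ts = datetime.strptime("01/01/2013 01:00:00+0300", "%d/%m/%Y %H:%M:%S%z")

# Converting to UTC reproduces the value stored after load.
utc = ts.astimezone(timezone.utc)
print(utc.strftime("%d-%b-%Y %H:%M:%S"))  # 31-Dec-2012 22:00:00
```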
A watershed (also known as a century cutoff) is used to calculate the century that any two digit year is in. The default is a floating watershed of 50 years. This can be configured in Settings > Data handling with the option to use a fixed or floating watershed with configurable values for both.
|Example value||Notes|
|02-Jan-1970||Date elements separated by hyphen, shortened month name.|
|04-January-1970||Date elements separated by hyphen, using full month name.|
|15.Dec.1970||Date elements separated by period.|
|15/Dec/1970||Date elements separated by forward slash.|
|01-01-20||For 2-digit years the watershed will determine the century. Using default watershed, this will be parsed to 2020.|
|01-01-80||Will be parsed to 1980 using the default watershed.|
|12/15/2020||Month precedence. The locale (language and region) is used to set the month precedence setting in the parsing rules which determines the possible expected formats. So 10/28/2020 will not parse with a UK locale (but will for US) because 28 is an invalid month value, and there would be no US locale parsing as a fallback.|
|01-01-1970T12:34:56||Basic date/time without millis, using T time designator|
|01-01-1970 12:34:56||Basic date/time without millis|
|01-01-1970T12:34||Basic date/time without seconds, using T time designator|
|01-01-1970 12:34||Basic date/time without seconds|
|01-01-1970 1234||Hour and minute without separator. Becomes 01/01/1970 12:34|
|01-01-1970 234||Hour and minute without separator. Becomes 01/01/1970 02:34|
|01-01-1970T12:34:56.9||Becomes 01/01/1970 12:34:56.900|
|01-01-1970T123456.987||Without time separators. Becomes 01/01/1970 12:34:56.987|
|01-01-1970T10:00:00Z||With timezone Zulu|
|01-01-1970 10:00:00+01:00||Becomes 01/01/1970 09:00:00|
|01-01-1970 10:00:00+01:00[Europe/Paris]||Becomes 01/01/1970 09:00:00|
|01-01-1970T10:00:00-01:00||Becomes 01/01/1970 11:00:00|
|02-Jan-1990 12:01:02Z[UTC]||Becomes 02/01/1990 12:01:02|
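The watershed calculation can be sketched as follows. The exact rule Data Studio applies is not spelled out here, so this floating 50-year window (50 years back, 49 years forward from the current year) is an assumption that reproduces the two-digit-year examples in the table above:

```python
from datetime import date

def resolve_two_digit_year(yy, watershed=50, today=None):
    # Assumed floating-watershed rule: choose the century that places the
    # year inside the window [current - watershed, current + 99 - watershed].
    current = (today or date.today()).year
    candidate = (current // 100) * 100 + yy
    if candidate > current + 99 - watershed:
        candidate -= 100
    elif candidate < current - watershed:
        candidate += 100
    return candidate

# Fixed reference date so the results are reproducible:
print(resolve_two_digit_year(20, today=date(2024, 1, 1)))  # 2020 (as in the table)
print(resolve_two_digit_year(80, today=date(2024, 1, 1)))  # 1980 (as in the table)
```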
This option lets you specify the number of rows that will be skipped when uploading a Dataset, allowing you to 'clean up' the data before it's loaded into Data Studio. The value is also retained for the next batch upload or data refresh.
For example, setting the value to 5 will skip the first 4 rows in the Dataset when it's uploaded.
If Use first row for column names is checked, the first row of data will be considered as the header and will not count towards the number of lines of data to skip.
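The skip behaviour (a value of 5 skips the first 4 rows, so data starts at row 5) can be sketched in Python; the file contents are invented:

```python
import io

# Hypothetical file with 4 junk rows before the real header row.
raw = "junk1\njunk2\njunk3\njunk4\nname,age\nAda,36\n"

# A setting of 5 means data starts at row 5: skip the first 4 rows.
skip_to = 5
lines = io.StringIO(raw).read().splitlines()
kept = lines[skip_to - 1:]

print(kept)  # ['name,age', 'Ada,36']
```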
To reduce disk space usage, you can compress data immediately after it's been loaded into Data Studio. You can set the level of compression that will be applied by default in Settings > Performance > Compression > Set a default compression level.
Data Studio offers two levels of data compression:
Row-based - loaded data will be stored row by row and then compressed. This is the highest level of compression but may affect the Workflow performance.
An index file contains metadata for each row of the Dataset, including a seek position into a data file. The data file stores the Dataset rows in a proprietary format. Individual cells within a row can be accessed randomly, but this requires iterating over the preceding cells.
Column-based - loaded data will be stored column by column and then compressed. This is a medium level of compression that may have some impact on the Workflow performance. This is the default compression option used by Data Studio.
For example, consider the following table of data:

|Column 1||Column 2||Column 3|
|A1||B1||C1|
|A2||B2||C2|
|A3||B3||C3|

Using row-based compression, the data will be stored row by row: A1B1C1, A2B2C2, A3B3C3, and so on. Using column-based compression, the data will be stored column by column: column A storing A1A2A3, column B storing B1B2B3, and so on.
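The two storage orders can be sketched in Python, assuming a hypothetical 3x3 table of values A1 to C3:

```python
# Toy illustration of row-based vs column-based storage order.
table = [
    ["A1", "B1", "C1"],
    ["A2", "B2", "C2"],
    ["A3", "B3", "C3"],
]

# Row-based: each row is concatenated and compressed as a unit.
row_based = ["".join(row) for row in table]
print(row_based)     # ['A1B1C1', 'A2B2C2', 'A3B3C3']

# Column-based: each column is concatenated and compressed as a unit.
column_based = ["".join(col) for col in zip(*table)]
print(column_based)  # ['A1A2A3', 'B1B2B3', 'C1C2C3']
```

Which layout suits a given workload depends on how the data is read back: reading a whole row touches every column's storage in the column-based layout, which is one reason the choice can affect Workflow performance.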