A Dataset in Data Studio is the container in which data is stored. It consists of the metadata describing a set of input data, including the name, schema, parsing options, settings and tags. The actual data (the values in rows and columns) contained within a Dataset is referred to as a batch of data.
The source of the Dataset could be a file, a database table, a DDL file, or a snapshot taken in a Workflow.
Data Studio supports most common file formats:
To upload a file:
An External system is a remote database system or file system from which Data Studio can extract data. The supported systems are:
To load data from an external system:
A batch of data is zero or more rows of data loaded into a Dataset at a point in time. By default, a Dataset will contain only a single batch of data, and often the first batch will be created at the same time as the Dataset itself.
When a Dataset is refreshed, either by uploading a new file, re-fetching data from a database table, or taking a new snapshot, the old batch is replaced by a new batch. However, a Dataset can optionally contain multiple batches.
When creating or editing a Dataset, you can choose whether it will contain only one batch, or whether it may contain many.
The primary use case for a multi batch Dataset is processing data that has been accumulated over a period of time. For example, if a "new customers" file is created daily, you can upload all the files from the past week to create a single Dataset consisting of multiple batches representing daily additions. Multi batch Datasets are also useful for storing historical metrics, such as daily validation rule results, added as new batches to a Dataset using a Take snapshot step. Each execution of the validation rules Workflow adds another set of metrics as a new batch to the same Dataset, allowing you to analyze quality trends over time.
You can view the batch or batches within a Dataset from the Datasets page using Options > Show batches.
If a multi batch Dataset is selected in a Workflow's Source step, options in the step will allow you to configure which batches to select, or to sample batches.
New batches of data can be added to your Dataset in three different ways:
When creating or editing a multi batch Dataset, you can specify the maximum number of batches that the Dataset will retain. If a number between 2 and 999 (inclusive) is specified, older batches are automatically deleted as new batches are loaded, so that the Dataset never exceeds the specified maximum. If this setting is left blank, no automatic deletion of batches will occur.
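The retention behaviour described above can be sketched as a simple queue. The 2-999 bounds come from the text; the batch names and the seven-day scenario are illustrative assumptions:

```python
from collections import deque

def add_batch(batches, new_batch, max_batches=None):
    """Append a new batch; if a limit (2-999) is set, drop the oldest
    batches so the Dataset never exceeds the limit. A blank limit
    (None) means no automatic deletion occurs."""
    batches.append(new_batch)
    if max_batches is not None and 2 <= max_batches <= 999:
        while len(batches) > max_batches:
            batches.popleft()  # oldest batch is deleted automatically
    return batches

# Ten daily loads into a Dataset limited to 7 batches:
batches = deque()
for day in range(1, 11):
    add_batch(batches, f"daily-load-{day}", max_batches=7)
print(list(batches))  # only the 7 most recent batches remain
```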
When a Dataset is deleted, all batches within it are also deleted. However, it is possible to delete selected batches while retaining the Dataset.
The Allow automatic batch deletion option, used with multi batch Datasets, will ensure that the batches of data which have been used in a Workflow are automatically deleted from the Dataset after they have been processed (i.e. after the Workflow processing those batches has been executed). This can be used to ensure that no batch is processed through a Workflow twice. It should be used in conjunction with the Delete batches on completion setting in the Workflow's Source step.
When creating a new Dataset using the Take snapshot step, you have the option to add batch timestamp column. When a new batch of data is added to the Dataset, the date and time that the batch was added is included in the batch timestamp column. In a multi batch Dataset, the timestamps can be used to differentiate data from separate batches, and for further analysis, aggregation and charting.
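One way to picture the batch timestamp column, sketched in Python; the row and column names are illustrative stand-ins, not Data Studio identifiers:

```python
from datetime import datetime, timezone

def add_batch_with_timestamp(dataset, rows):
    """Append a new batch, stamping every row with the load time
    (an illustrative stand-in for the batch timestamp column)."""
    batch_ts = datetime.now(timezone.utc)
    for row in rows:
        dataset.append({**row, "batch_timestamp": batch_ts})
    return dataset

dataset = []
add_batch_with_timestamp(dataset, [{"customer": "Alice"}])
add_batch_with_timestamp(dataset, [{"customer": "Bob"}])
# Rows from different batches can now be grouped, aggregated
# and charted by their batch_timestamp value.
```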
When loading a Dataset from a file there are three parsing options.
The metadata for fixed width files (.dat or .txt) is loaded from a CSV file. The following columns are accepted in this CSV file. The order of the columns does not matter, as long as the headers are labeled correctly.
| Header | Required? | Description |
|---|---|---|
| COLUMN | Required | The name of the column. Headers in the data file are not used. |
| START_POSITION | Required | Start position of the column, 1-indexed, i.e. the first character in the file is position 1. |
| END_POSITION | Required | End position of the column. This is the last character included in the column, so an END_POSITION of 12 will include the character at position 12. |
| LENGTH | Required | Length of the column. This is needed for validation. |
| TYPE | Required | The data type of the column: ALPHANUMERIC, NUMERIC or DATE. |
| TAGS | Optional | List of data tags that can be assigned to the column. Multiple values must be surrounded with double quotes and separated with commas, e.g. "Name, Surname". |
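The metadata-driven layout described above can be sketched in Python. The column names, positions, and sample record here are illustrative, not taken from a real Data Studio file:

```python
import csv
import io

# Illustrative metadata CSV in the format described above: 1-indexed,
# inclusive START_POSITION/END_POSITION, LENGTH used for validation.
metadata_csv = """COLUMN,START_POSITION,END_POSITION,LENGTH,TYPE,TAGS
NAME,1,10,10,ALPHANUMERIC,"Name, Surname"
AGE,11,13,3,NUMERIC,
"""

columns = list(csv.DictReader(io.StringIO(metadata_csv)))

def parse_fixed_width(line, columns):
    """Slice one fixed-width record according to the metadata."""
    record = {}
    for col in columns:
        start = int(col["START_POSITION"]) - 1  # convert to 0-indexed
        end = int(col["END_POSITION"])          # inclusive end -> Python slice end
        assert end - start == int(col["LENGTH"])  # LENGTH validates the positions
        record[col["COLUMN"]] = line[start:end].strip()
    return record

print(parse_fixed_width("Jane Smith 42", columns))
```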
Data Studio can automatically determine column data tags for you. Clicking the Auto tag button prompts Data Studio to analyze the data in your columns and assign appropriate data tags. The time taken to auto tag varies with the number of columns in your Dataset. You can review the changes and adjust the tags manually afterwards; no manually assigned tags will be removed by this process.
You may also cancel the auto tagging process before it has fully analyzed all the columns in your Dataset if you wish. If you do so, tags that were assigned before cancellation will be maintained.
If any columns were assigned tags (whether the process completed or was cancelled), a Remove all tags button will be available once the auto tag process has finished. Clicking it removes all tags, both manual and automatically assigned.
Auto tagging can be done while creating a Dataset, or on an existing Dataset using the Annotate columns option.
The process for annotating the columns of an existing Dataset is similar to that followed when creating one. To reach the Annotate columns screen for an existing Dataset:
When the Auto tag button is pressed, Data Studio will analyze the data in your columns and assign data tags appropriately. Automatic data tags are added to all columns, except where tags have already been assigned manually.
The Language and Region setting controls how date and numeric values are parsed on load (if detecting data types).
Some database systems and file formats can store time zone-aware date and time values.
In the following CSV example, row 1 is a timestamp with a +3 hour offset and row 2 is in UTC. Data Studio is not timezone aware: when timestamps are loaded into Data Studio, values are converted to UTC.
| Row | Timestamp in CSV file | Timestamp loaded into Data Studio |
|---|---|---|
| 1 | 01/01/2013 01:00:00+0300 | 31-Dec-2012 22:00:00 |
| 2 | 01/01/2013 01:00:00 | 01-Jan-2013 01:00:00 |
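The conversion in row 1 of the table can be reproduced with Python's standard datetime handling, as a sketch of what "converted to UTC" means here:

```python
from datetime import datetime, timezone

# Row 1 of the table: a timestamp carrying a +03:00 offset.
ts = datetime.strptime("01/01/2013 01:00:00+0300", "%d/%m/%Y %H:%M:%S%z")

# Converting to UTC reproduces the value shown in the table.
utc = ts.astimezone(timezone.utc)
print(utc.strftime("%d-%b-%Y %H:%M:%S"))  # 31-Dec-2012 22:00:00
```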
| Example value | Description |
|---|---|
| 02-Jan-1970 | Date elements separated by hyphens, shortened month name. |
| 04-January-1970 | Date elements separated by hyphens, using the full month name. |
| 15.Dec.1970 | Date elements separated by periods. |
| 15/Dec/1970 | Date elements separated by forward slashes. |
| 01-01-20 | For 2-digit years, the watershed determines the century. Using the default watershed, this will be parsed to 2020. |
| 01-01-80 | Will be parsed to 1980 using the default watershed. |
| 12/15/2020 | Month precedence. The locale (language and region) sets the month precedence in the parsing rules, which determines the possible expected formats. So 10/28/2020 will not parse with a UK locale (but will with a US locale) because 28 is an invalid month value, and there is no fallback to US locale parsing. |
| 01-01-1970T12:34:56 | Basic date/time without milliseconds, using the T time designator. |
| 01-01-1970 12:34:56 | Basic date/time without milliseconds. |
| 01-01-1970T12:34 | Basic date/time without seconds, using the T time designator. |
| 01-01-1970 12:34 | Basic date/time without seconds. |
| 01-01-1970 1234 | Hour and minute without separator. Becomes 01/01/1970 12:34. |
| 01-01-1970 234 | Hour and minute without separator. Becomes 01/01/1970 02:34. |
| 01-01-1970T12:34:56.9 | Becomes 01/01/1970 12:34:56.900. |
| 01-01-1970T123456.987 | Without time separators. Becomes 01/01/1970 12:34:56.987. |
| 01-01-1970T10:00:00Z | With Zulu (UTC) timezone designator. |
| 01-01-1970 10:00:00+01:00 | Becomes 01/01/1970 09:00:00. |
| 01-01-1970 10:00:00+01:00[Europe/Paris] | Becomes 01/01/1970 09:00:00. |
| 01-01-1970T10:00:00-01:00 | Becomes 01/01/1970 11:00:00. |
| 02-Jan-1990 12:01:02Z[UTC] | Becomes 02/01/1990 12:01:02. |
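Two of the parsing rules in the table can be sketched in Python: the 2-digit-year watershed and locale-driven month precedence. The watershed value of 50 is an illustrative assumption (the table does not state Data Studio's default), and the strptime format strings stand in for the locale parsing rules:

```python
from datetime import datetime

def expand_two_digit_year(yy, watershed=50):
    """Map a 2-digit year to a century using a watershed (pivot) year.
    NOTE: 50 is an assumed watershed, for illustration only."""
    return 2000 + yy if yy < watershed else 1900 + yy

def try_parse(value, month_first):
    """Attempt to parse using the month precedence implied by locale."""
    fmt = "%m/%d/%Y" if month_first else "%d/%m/%Y"
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None  # no fallback to the other locale's precedence

print(expand_two_digit_year(20))                   # 2020, matching the table
print(expand_two_digit_year(80))                   # 1980, matching the table
print(try_parse("10/28/2020", month_first=True))   # parses under a US locale
print(try_parse("10/28/2020", month_first=False))  # None: 28 is not a valid month
```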
This option lets you specify the number of rows that will be skipped when uploading a Dataset, allowing you to 'clean up' the data before it's loaded into Data Studio. The value is also retained for the next batch upload or data refresh.
For example, setting the value to 5 will skip the first 4 rows in the Dataset when it's uploaded.
If Use first row for column names is checked, the first row of data will be considered as the header and will not count towards the number of lines of data to skip.
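One reading of the skip/header interaction, sketched in Python; the sample file and the exact order in which Data Studio applies the two settings are assumptions for illustration:

```python
import csv
import io

# Illustrative upload: a header row followed by two junk lines to skip.
raw = """id,name
# generated 2024-01-01
# source: crm export
1,Alice
2,Bob
"""

def load(text, skip_value, use_first_row_as_header):
    """Drop leading rows per the skip setting; the header row, when
    enabled, does not count towards the number of lines to skip."""
    rows = list(csv.reader(io.StringIO(text)))
    header = None
    if use_first_row_as_header:
        header = rows.pop(0)
    return header, rows[skip_value - 1:]  # e.g. a value of 3 skips 2 rows

header, data = load(raw, 3, True)
print(header)  # ['id', 'name']
print(data)    # the two comment lines were skipped
```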