Data Quality user documentation

Data tags allow you to enrich source data by specifying additional information about what type of data each column contains. Tagged columns improve the user experience by providing sensible defaults and suggestions in Workflow steps such as Find duplicates and Validate addresses. Data Studio can use machine learning algorithms to automatically tag columns.

Tags are saved as part of the Dataset, so any new batch of data added to the Dataset will retain the same tags.

Data tags appear next to the column name in data grids, so if you're not familiar with your data they provide an overview at a glance.

There are two types of data tags:

System defined (cannot be modified, used to drive system behavior)
User defined

Tags in Data Studio are hierarchical: a parent tag (e.g. Address) can have multiple children (e.g. Postal Code, City).

While you can't modify system defined tags, you can create a child tag (e.g. PO box) and assign it to a system tag (e.g. Address). Once created, it will appear in the User defined list under that system parent tag.
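The parent and child relationship described above can be pictured with a small Python model. This is only an illustrative sketch, not Data Studio's internal representation or API: the `Tag` class and its `add_child`, `rename` and `path` methods are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Tag:
    """Hypothetical model of a data tag; not part of Data Studio."""
    name: str
    system_defined: bool = False
    # Excluded from repr/compare to avoid infinite parent <-> child recursion.
    parent: Optional["Tag"] = field(default=None, repr=False, compare=False)
    children: List["Tag"] = field(default_factory=list)

    def add_child(self, name: str) -> "Tag":
        """Create a user defined child tag, even under a system defined parent."""
        child = Tag(name=name, system_defined=False, parent=self)
        self.children.append(child)
        return child

    def rename(self, new_name: str) -> None:
        """Only user defined tags can be modified."""
        if self.system_defined:
            raise PermissionError(f"'{self.name}' is system defined and cannot be modified")
        self.name = new_name

    def path(self) -> str:
        """Return the tag's position in the hierarchy, e.g. 'Address > PO box'."""
        parts = []
        node: Optional[Tag] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return " > ".join(reversed(parts))


address = Tag("Address", system_defined=True)   # system tag: read-only
po_box = address.add_child("PO box")            # user defined child
print(po_box.path())
```

In this model, calling `address.rename(...)` raises an error, while `po_box.rename(...)` succeeds, mirroring the rule that system defined tags cannot be changed but can still act as parents.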

Only users with a role that includes the "Create and Edit Data Tags" capability will be able to view and manage tags.

Create a tag

To create a new tag, follow these steps:

Go to System > Data tags.
Click the Create new tag button.
Enter a name and, optionally, assign a parent tag.
Click Apply.

The newly created tag will appear in the list.

Manage tags

To manage existing tags:

Go to System > Data tags.
On the required data tag, click on Options > Edit details.
Change the name and, optionally, the parent tag.
Click Apply.
The tag will be updated in the list.

Any data tag can have multiple training datasets. A training dataset is used to train a fingerprint, which allows Data Studio to learn to recognize data with similar properties so that it can be tagged automatically.

Training datasets

To create or view the Training datasets for a data tag:

Go to System > Data tags.
On the required data tag, click on Options > Manage training datasets.
The list of training datasets for the selected data tag will appear.

When creating a new training dataset you will need to define:

Name for the training dataset.
Source type, either an existing Dataset or an existing View.
Name of the Dataset or View to use.
Column from that source that contains the values to be used for training.
Threshold, a percentage value between 35 and 99 that determines how similar the data must be to the fingerprint in order for the tag to be applied automatically.
Threshold suggestion
To help you find the best threshold, select a source in the box on the right-hand side. The table shows a match value, which represents the smallest possible threshold needed for the column to be tagged. Click Use as threshold to apply this value as the threshold.
Whether Strict length validation should be performed. If set, any tagged data must closely match the mean length of the training data.
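The settings listed above can be summarized as a simple record with two checks. The following Python sketch is purely illustrative: `TrainingDatasetSettings` and its field names are hypothetical and do not correspond to a Data Studio API, and it assumes that the 35 to 99 range is inclusive.

```python
from dataclasses import dataclass

# Assumption: both ends of the documented 35-99 range are allowed.
MIN_THRESHOLD = 35
MAX_THRESHOLD = 99
SOURCE_TYPES = ("Dataset", "View")


@dataclass
class TrainingDatasetSettings:
    """Hypothetical record of the values entered when creating a training dataset."""
    name: str
    source_type: str          # "Dataset" or "View"
    source_name: str          # name of the existing Dataset or View
    column: str               # column holding the training values
    threshold: float          # percentage similarity required for auto tagging
    strict_length_validation: bool = False

    def __post_init__(self) -> None:
        if self.source_type not in SOURCE_TYPES:
            raise ValueError(f"source_type must be one of {SOURCE_TYPES}")
        if not MIN_THRESHOLD <= self.threshold <= MAX_THRESHOLD:
            raise ValueError(
                f"threshold must be between {MIN_THRESHOLD} and {MAX_THRESHOLD}"
            )


# Example values are made up for illustration.
settings = TrainingDatasetSettings(
    name="UK postcodes",
    source_type="Dataset",
    source_name="Customer addresses",
    column="Postcode",
    threshold=80,
    strict_length_validation=True,
)
print(settings)
```

A threshold outside the documented range, or a source type other than Dataset or View, is rejected as soon as the record is created.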

Once the training dataset has been created, only its name and threshold can be modified.

Data Studio will train a fingerprint using the specified source and column. Once the fingerprint is trained, it can be used for auto tagging new Datasets. When auto tagging, Data Studio will analyze the input columns and tag them if a fingerprint match is found.

Training datasets can be disabled, which will prevent the training data from being included in the auto tagging process.
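As a rough mental model of the auto tagging decision, the sketch below applies a tag when a column's match value reaches the threshold of an enabled training dataset. It makes several assumptions: the `Fingerprint` class and `tags_for_column` function are hypothetical names, the match values are supplied as inputs because this page does not describe how Data Studio computes them, and the comparison is assumed to be "greater than or equal to".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Fingerprint:
    """Hypothetical stand-in for a trained fingerprint of one training dataset."""
    training_dataset: str
    tag: str
    threshold: float
    enabled: bool = True


def tags_for_column(match_values: Dict[str, float],
                    fingerprints: List[Fingerprint]) -> List[str]:
    """Return the tags to apply to one column.

    match_values maps a training dataset name to the column's match value
    (a percentage) against that dataset's fingerprint.
    """
    tags: List[str] = []
    for fp in fingerprints:
        if not fp.enabled:
            continue  # disabled training datasets are skipped entirely
        value = match_values.get(fp.training_dataset, 0.0)
        # Assumption: the tag applies when the match value meets the threshold.
        if value >= fp.threshold and fp.tag not in tags:
            tags.append(fp.tag)
    return tags


# Made-up fingerprints and match values for illustration.
fingerprints = [
    Fingerprint("UK postcodes", "Postal Code", threshold=80),
    Fingerprint("City names", "City", threshold=70),
    Fingerprint("Old postcodes", "Postal Code", threshold=60, enabled=False),
]
match_values = {"UK postcodes": 72.0, "City names": 91.5, "Old postcodes": 95.0}
print(tags_for_column(match_values, fingerprints))
```

Here only City is applied: the column falls short of the UK postcodes threshold, and the Old postcodes training dataset is ignored because it is disabled, even though its match value is high.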
