Data Quality user documentation | Profile and understand data

Before improving the quality of your data, you should profile your data.

To start, ensure that you have access to at least one Dataset. How do I add a Dataset?

Profile data

Profiling a set of data is the process of analyzing the values in each column to calculate a range of per-column attributes, such as minimum and maximum values, value and null counts, uniqueness and completeness of values.

Statistics on common formats and outlier/rare values are calculated for each column.

Any Dataset or View can be profiled, either using the Profile step in a Workflow, or in Explore mode. When profiling data in Explore mode, you can right-click a column name to drill down to that column's unique Values or Formats, and from there drill down again to the individual rows that contain that selected value or format.

Profile attributes

Attribute	Description
Name	The name of the column being profiled.
Summary	The description of the column being profiled.
Uniqueness	The uniqueness of the column (unique count as a percentage of row count).
Completeness	The completeness of the column (percentage values in the column that are non-null).
Row Count	The number of values in the column.
Has Nulls	Does the column have a null count greater than zero?
Dominant Datatype	The Datatype that is most common across values in this column.
Format Count	The number of unique format patterns for all the values in this column.
Shortest Length	The length of the shortest value in the column.
Longest Length	The length of the longest value in the column.
Rare Values	Whether there are unexpected infrequent values in the column.
Frequent Values	Whether there are unexpectedly frequent values in the column.
Long Values	Whether there are abnormally long values in the column.
Missing Values	Whether there are rare occurrences of NULLs in the column.

Includes the core attributes as well as:

Attribute	Description
Unique Count	The number of unique values in the column (same as the count of grouped values. Nulls are counted as a value).
Minimum	The minimum value in the column (first alphabetically, earliest date, lowest numeric value).
Maximum	The maximum value in the column (last alphabetically, latest date, highest numeric value).
Overall Datatype	The Datatype that would be required to store all values. If a column has both Numeric and Date datatypes, the overall type would be Alphanumeric.
Sum	The sum of all numeric values in this column.
Standard Deviation	The standard deviation of all numeric values in this Column. Non-numeric values are treated as zero.
Average	The average of all numeric values in this column (sum of numbers / number count).
Precision	The overall numeric precision for this column, which is the largest number of digits in a number including digits on both sides of the decimal point.
Scale	The overall numeric scale for this column which is the most number of digits to the right of the decimal point.
Zero Count	The number of rows in the Table containing a value in this column that is zero
Negative Values	The count of numeric values in this column that are less than zero
Least Common Value	The overall least frequently occurring value in the column.
Least Common Count	The count of the number of time the least common value occurs in the column.
Most Common Value	The overall most frequently occurring value in the column.
Most Common Count	The count of the number of time the most common value occurs in the column.
Least Common Format	The least frequently occurring format pattern in this column.
Least Common Format Count	The count of the number of times the least frequently occurring format pattern occurs in this column.
Most Common Format	The most frequently occurring format pattern in this column.
Most Common Format Count	The count of the number of times the most frequently occurring format pattern occurs in this column.
Average Length	The average lengths of all values in the column.
Length Deviation	The standard deviation of the lengths of all values in the column.
Frequency Deviation	The standard deviation of the frequency of occurrence of each value across the set of values in the column.
Format Frequency Deviation	The standard deviation of the frequency of occurrence of each format pattern across the set of format patterns in the column.
Alphanumeric Uniqueness	The uniqueness of the alphanumeric values in the column (unique alphanumeric values as a percentage of all alphanumeric values).
Alphanumeric Unique Count	The number of unique alphanumeric values in the column.
Alphanumeric Completeness	The completeness of the alphanumeric values column (values in the column that are alphanumeric as a percentage of the total row count).
Alphanumeric Count	The count of alphanumeric values in the column.
Alphanumeric Minimum	The first alphanumeric value in the column alphabetically.
Alphanumeric Maximum	The last alphanumeric value in the column alphabetically.
Number Uniqueness	The uniqueness of the numeric values in the column (unique numeric values in the column as a percentage of all numeric values).
Number Unique Count	The number of unique numeric values in the column.
Number Completeness	The completeness of the numeric values column (values in the column that are numeric as a percentage of the total row count).
Number Count	The count of numeric values in the column.
Number Minimum	The lowest numeric value in the column.
Number Maximum	The highest numeric value in the column.
Check Sum	The checksum for all the values in the column.
Date Uniqueness	The uniqueness of the date values in the column (unique date values in the column as a percentage of all date values).
Date Unique Count	The number of unique date values in the column.
Date Completeness	The completeness of the date values column (values in the column that are date as a percentage of the total row count).
Date Count	The count of date values in the column.
Date Minimum	The earliest date value in the column.
Date Maximum	The latest date value in the column.
Nullity	The percentage of rows in the table where the value in this column is null.
Null Count	The number of rows in the table containing a value in this column that is null.
Key Check	Whether the data in the column denotes this is a perfect key or is a key that is broken.
Rare Formats	Whether there are unexpected infrequent formats in the column.
Short Values	Whether there are abnormally short values in the column.
Low Amounts	Whether there are abnormally low numeric values in the column.
High Amounts	Whether there are abnormally high numeric values in the column.
Sequence	Whether the values in the column are sequential numbers.
Average Frequency	The average frequency of occurrence of a value across the set of values in the column.
Average Format Frequency	The average of frequency of occurrence of a format pattern across the set of values in the column.
Sum Squared	The sum of the squares of all numeric values in the column
Length Sum Squared	The sum of the square of lengths of all values in the column.
Length Sum	The sum of the lengths of all values in the column.
Sum Squared Of Frequency	The sum of the square of frequency of occurrence of each value in the column.
Sum Squared Of Format Frequency	The sum of the square of frequency of occurrence of each format pattern of values in the column.
Data Tags	Data tags assigned to the column.
Sensitive Data Tags	Data tags with sensitive flag assgined.

Interactivity

Profile results are interactive in that users can click in to see a list of the Values or Formats for a chosen column. This list contains the row count and percentage distribution to understand which are the dominant items.

It is then possible to further drilldown to the specific Rows in the original underlying data that contain selected value(s) or format(s).

Format pattern

One aspect of Data Studio's profiling process is to generate a format pattern (and simple format pattern) for each value replacing any letter with 'A' or number with '9'. These format patterns can then be used to carry out tasks like:

Examining unique format counts - profiling groups values in a column by the format pattern.
Validating that values conform to a defined set of valid formats.
Extract values that match a given format.
Lookup values in another data source based, using the format pattern as the match criteria.

The format pattern is calculated using the following rules:

Any alphabetic character is A.
Any numeric character is 9.
Any whitespace (space, carriage return, linefeed, tab) is S.
The characters @, . (period), , (comma), _ (underscore), - (hyphen), and + are unchanged.
Any braces (including < and >) become ( and ).
The characters / (forward slash), \ (backslash), | (pipe) are /.
All types of quote (double, single, grave accent) become " (double quote).
Any currency symbols such as £, €, ¥ become €.
Other non-alphabetic chars, for example &, =, ®, ‡, %, ¬, ´ (the acute accent), become # . This applies to characters with the following Unicode category designations (unless a previous rule already handles them):
- Mn: Nonspacing Mark.
- Me: Enclosing Mark.
- Cf: Format.
- Ps: Start Punctuation.
- Pe: End Punctuation.
- Pc: Connector Punctuation.
- Po: Other Punctuation.
- Sm: Math Symbol, e.g. =, ÷.
- Sk: Modifier symbol.
- So: Other symbol.
Any other character not previously covered, such as the left and right single and double quotation marks, become ? (question mark).

Simple format pattern

The simple format pattern is calculated in the same way as the format pattern, except that it removes consecutive repeated format characters after the first one.

Format pattern examples

Value	Format	Simple format
MNOPQ3344ZZ	AAAAA9999AA	A9A
SK10 2ED	AA99S9AA	A9SA
1-888-397-3742	9-999-999-9999	9-9-9-9
test@gmail.com	AAAA@AAAAA.AAA	A@A.A
($250)	(€999)	(€9)
H&J Ltd.	A#ASAAA.	A#ASA.
{1/2}34	(9/9)99	(9/9)9

Outliers calculation

An outlier is a value that appears to deviate markedly from other members of the sample in which it occurs. For data, it is a value which has an unusual attribute or characteristic, for example:

An unusually large salary amount in the HR system
A very frequently occurring postcode in the "goods returned" system
A strangely-formatted product code in the sales system.

The Outliers profile attributes in Data Studio highlight values which are very different from the average (mean or expected) value in some way:

Key check (column values are unique)
Rare or frequent values
Short or long values
Missing values (rare nulls)
Low or high amounts (numeric values)
Rare formats
(is a) sequence

For characteristics based on the relative sizes of a set of values, Data Studio uses a statistical calculation based on standard deviations and the average. The outliers statistics will highlight values which are at the extremes of a normal distribution curve, in other words those occurring no more than twice per thousand values.

For characteristics based on a single statistic, Data Studio uses a threshold value of rarity threshold of less than one in 1000 to determine whether something unusual is present. Records whose values are more than 3.3 standard deviations (the SD tolerance) above or below the average (mean) value are rare/high/low.

Some outlier statistics require at least two unique values to be present in the data set to ever return 'Yes'.

Suggest validation rules from Profile results

Profile statistics can be quickly converted into validation rulesusing the Suggest validation rules action. Selecting this action will bring up a list of suggested rules for each profiled column of data.

Once all rules and settings have been applied, a new Workflow will be automatically created containing a Validate step with the rules, which can be further edited if required.

Customize the Profile output

The statistical columns included and the order they appear in can be updated. Go to Step settings > Profile step settings and configure as required.

To define your own profile statistics, click Add attribute and select a Function. The value type returned by the Function determines the statistical column options:

Boolean (true/false). For example: 'Equals' can calculate the Count and Frequency
Numeric. For example: 'Length' can calculate the Total, Average, Minimum and Maximum

These calculated Profile columns can then be re-ordered/removed just like the standard Profile columns.

A pre-set created in the 'System space' will be availble to select in all Spaces.

Setting a default Profile output

You can set one of your Profile step settings as the Default profile preset for the Space. The default will be automatically selected when profiling data, which will be Core statistics if nothing has been set.

The User-defined default within a Space will take priority over a default set in the 'System space'.

Quick Profile

Any Dataset or View can be profiled in two clicks by right-clicking on it (or selecting the Actions menu) in the list screen and selecting Profile. This will include all columns in the table and use the default profile preset statistics.

Was this helpful?

Previous: AUS credit data (CCR)

Next: Transform and fix data

Profile and understand data