Discover and profile data

Before creating a Single Customer View (SCV) of your data, you should profile your data.

Profile data

Profiling a set of data is the process of analyzing the values in each column to calculate a range of per-column attributes, such as minimum and maximum values, value and null counts, uniqueness and completeness of values.

Statistics on outlier values (existence of rare or common values) and value formats (using the format pattern) are also calculated for each column.

profiled data

Any Dataset or View can be profiled, either using the Profile step in a Workflow, or in Explore mode. When profiling data in Explore mode, you can right-click a column name to drill down to that column's unique Values or Formats, and from there drill down again to the individual rows that contain that selected value or format.

Profile attributes

Default statistics

Attribute Description
Name The name of the column being profiled.
Description The description of the column being profiled.
Uniqueness The uniqueness of the column (unique count as a percentage of row count).
Completeness The completeness of the column (percentage values in the column that are non-null).
Row Count The number of values in the column.
Has Nulls Does the column have a null count greater than zero?
Dominant Datatype The Datatype that is most common across values in this column.
Format Count The number of unique format patterns for all the values in this column.
Shortest Length The length of the shortest value in the column.
Longest Length The length of the longest value in the column.
Rare Values Whether there are unexpected infrequent values in the column.
Frequent Values Whether there are unexpectedly frequent values in the column.
Long Values Whether there are abnormally long values in the column.
Missing Values Whether there are rare occurrences of NULLs in the column.

All Statistics

Includes the attributes included in the default statistics, and these additional attributes:

Attribute Description
Unique Count The number of unique values in the column (same as the count of grouped values. Nulls are counted as a value).
Minimum The minimum value in the column (first alphabetically, earliest date, lowest numeric value).
Maximum The maximum value in the column (last alphabetically, latest date, highest numeric value).
Overall Datatype The Datatype that would be required to store all values. If a column has both Numeric and Date datatypes, the overall type would be Alphanumeric.
Sum The sum of all numeric values in this column.
Standard Deviation The standard deviation of all numeric values in this Column. Non-numeric values are treated as zero.
Average The average of all numeric values in this column (sum of numbers / number count).
Precision The overall numeric precision for this column, which is the largest number of digits in a number including digits on both sides of the decimal point.
Scale The overall numeric scale for this column which is the most number of digits to the right of the decimal point.
Zero Count The number of rows in the Table containing a value in this column that is zero
Negative Values The count of numeric values in this column that are less than zero
Least Common Value The overall least frequently occurring value in the column.
Least Common Count The count of the number of time the least common value occurs in the column.
Most Common Value The overall most frequently occurring value in the column.
Most Common Count The count of the number of time the most common value occurs in the column.
Least Common Format The least frequently occurring format pattern in this column.
Least Common Format Count The count of the number of times the least frequently occurring format pattern occurs in this column.
Most Common Format The most frequently occurring format pattern in this column.
Most Common Format Count The count of the number of times the most frequently occurring format pattern occurs in this column.
Average Length The average lengths of all values in the column.
Length Deviation The standard deviation of the lengths of all values in the column.
Frequency Deviation The standard deviation of the frequency of occurrence of each value across the set of values in the column.
Format Frequency Deviation The standard deviation of the frequency of occurrence of each format pattern across the set of format patterns in the column.
Alphanumeric Uniqueness The uniqueness of the alphanumeric values in the column (unique alphanumeric values as a percentage of all alphanumeric values).
Alphanumeric Unique Count The number of unique alphanumeric values in the column.
Alphanumeric Completeness The completeness of the alphanumeric values column (values in the column that are alphanumeric as a percentage of the total row count).
Alphanumeric Count The count of alphanumeric values in the column.
Alphanumeric Minimum The first alphanumeric value in the column alphabetically.
Alphanumeric Maximum The last alphanumeric value in the column alphabetically.
Number Uniqueness The uniqueness of the numeric values in the column (unique numeric values in the column as a percentage of all numeric values).
Number Unique Count The number of unique numeric values in the column.
Number Completeness The completeness of the numeric values column (values in the column that are numeric as a percentage of the total row count).
Number Count The count of numeric values in the column.
Number Minimum The lowest numeric value in the column.
Number Maximum The highest numeric value in the column.
Date Uniqueness The uniqueness of the date values in the column (unique date values in the column as a percentage of all date values).
Date Unique Count The number of unique date values in the column.
Date Completeness The completeness of the date values column (values in the column that are date as a percentage of the total row count).
Date Count The count of date values in the column.
Date Minimum The earliest date value in the column.
Date Maximum The latest date value in the column.
Nullity The percentage of rows in the table where the value in this column is null.
Null Count The number of rows in the table containing a value in this column that is null.
Key Check Whether the data in the column denotes this is a perfect key or is a key that is broken.
Rare Formats Whether there are unexpected infrequent formats in the column.
Short Values Whether there are abnormally short values in the column.
Low Amounts Whether there are abnormally low numeric values in the column.
High Amounts Whether there are abnormally high numeric values in the column.
Sequence Whether the values in the column are sequential numbers.
Average Frequency The average frequency of occurrence of a value across the set of values in the column.
Average Format Frequency The average of frequency of occurrence of a format pattern across the set of values in the column.
Sum Squared The sum of the squares of all numeric values in the column
Length Sum Squared The sum of the square of lengths of all values in the column.
Length Sum The sum of the lengths of all values in the column.
Sum Squared Of Frequency The sum of the square of frequency of occurrence of each value in the column.
Sum Squared Of Format Frequency The sum of the square of frequency of occurrence of each format pattern of values in the column.

Format pattern

One aspect of Data Studio's profiling process is to generate a format pattern for each value. The format patterns can then be used to carry out tasks like:

  • Examining unique format counts - profiling groups values in a column by the format pattern.
  • Validating that values conform to a defined set of valid formats.
  • Extract values that match a given format.
  • Lookup values in another data source based, using the format pattern as the match criteria.

Deriving the format pattern

The format pattern is calculated using the following rules:

  1. Any alphabetic character is A.
  2. Any numeric character is 9.
  3. Any whitespace (space, carriage return, linefeed, tab) is S.
  4. The characters @, . (period), , (comma), _ (underscore), - (hyphen), and + are unchanged.
  5. Any braces (including < and >) become ( and ).
  6. The characters / (forward slash), \ (backslash), | (pipe) are /.
  7. All types of quote (double, single, grave accent) become " (double quote).
  8. Any currency symbols such as £, , ¥ become .
  9. Other non-alphabetic chars, for example &, =, ®, , %, ¬, ´ (the acute accent), become # . This applies to characters with the following Unicode category designations (unless a previous rule already handles them):
    • Mn: Nonspacing Mark.
    • Me: Enclosing Mark.
    • Cf: Format.
    • Ps: Start Punctuation.
    • Pe: End Punctuation.
    • Pc: Connector Punctuation.
    • Po: Other Punctuation.
    • Sm: Math Symbol, e.g. =, ÷.
    • Sk: Modifier symbol.
    • So: Other symbol.
  10. Any other character not previously covered, such as the left and right single and double quotation marks, become ? (question mark).

Simple format pattern

The simple format pattern is calculated in the same way as the format pattern, except that it removes consecutive repeated format characters after the first one.

Format pattern examples

Value Format Simple format
MNOPQ3344ZZ AAAAA9999AA A9A
SK10 2ED AA99S9AA A9SA
1-888-397-3742 9-999-999-9999 9-9-9-9
test@gmail.com AAAA@AAAAA.AAA A@A.A
($250) (€999) (€9)
H&J Ltd. A#ASAAA. A#ASA.
{1/2}34 (9/9)99 (9/9)9

Outliers calculation

An outlier is a value that appears to deviate markedly from other members of the sample in which it occurs. For data, it is a value which has an unusual attribute or characteristic, for example:

  • An unusually large salary amount in the HR system
  • A very frequently occurring postcode in the "goods returned" system
  • A strangely-formatted product code in the sales system.

The Outliers profile attributes in Data Studio highlight values which are very different from the average (mean or expected) value in some way:

  • Key check (column values are unique)
  • Rare or frequent values
  • Short or long values
  • Missing values (rare nulls)
  • Low or high amounts (numeric values)
  • Rare formats
  • (is a) sequence

For characteristics based on the relative sizes of a set of values, Data Studio uses a statistical calculation based on standard deviations and the average. The outliers statistics will highlight values which are at the extremes of a normal distribution curve, in other words those occurring no more than twice per thousand values.

For characteristics based on a single statistic, Data Studio uses a threshold value of rarity threshold of less than one in 1000 to determine whether something unusual is present. Records whose values are more than 3.3 standard deviations (the SD tolerance) above or below the average (mean) value are rare/high/low.

Some outlier statistics require at least two unique values to be present in the data set to ever return 'Yes'.

Suggest validation rules

After using the Profile action to profile data, the resulting profiling and data analysis can be quickly converted into validation rules using the Suggest validation rules action. Selecting this action will bring up a list of suggested rules to apply to each profiled column of data.

Each data column will form its own rule group, with all suggested rules for that data column listed as rules within that group. These rules can be individually deselected and therefore excluded from the final validation rules if required.

An overall Pass at: and Fail below: setting will also be displayed with default values of 100% and 95% respectively. These values can be adjusted should they be deemed too strict for the validation required.

Once all rules and settings have been applied, a new Workflow will be created comprising the original source data that was profiled, connected to a Validate step. This Validate step will contain applied validation groups and rules within them from the previous panel, which can be edited later if required.

This Workflow can then be run on a regular basis to ensure the source data conforms to the validation rules consistently.