Products and services you submit data to will often have specific requirements for the layout, schema, file type, and so on that the data must meet before it is accepted. The process of aligning and transforming data to meet those requirements is called data preparation; its goal is to deliver data in a format the third party can use.
Aperture Data Studio is an effective tool for data preparation, which is just one of the use cases it can handle. Workflow steps can merge multiple files, rename and reorder columns, hide or mask values that should not be included, filter out records, transform specific values, and export the data in the desired format. Once this process has been defined, it can be automated so that any new data is processed automatically, ready for the third-party service.
Let’s say each month your internal system generates a file containing 2 million rows of data. The file contains sensitive PII that must be removed before the records can be submitted to an external marketing service, which only accepts data in its own fixed format. Moreover, the column names do not match the naming and order of the target system’s schema.
First, confirm the specific requirements and adopt them. Once you have accounted for all format specifications, you can set up a workflow that includes all necessary transformation steps and prepares the data for export in the required output format. If this is expected to be a recurring use case, Aperture Data Studio lets you automate and schedule the workflow for future runs.
If, for example, the target file should be in CSV format and the schema (column names and order) includes the columns CustomerID, RegistrationDate, and Address, you may need to make the following considerations:
To confirm the required format and file type, request a sample file from the provider of the external system. To correctly adopt their format, prepare two datasets in Aperture Data Studio—one representing the target format and the other your internal source data. First, create DatasetA and load the sample file. Then, create DatasetB and load your internal file into it. This step will help you transform and map your data correctly for submission.
Data tags enrich source data by specifying additional information about what type of data each column contains. Data Studio can use machine learning algorithms to automatically tag columns. Tags are saved as part of the Dataset, so any new batch of data added to the Dataset will retain the same tags.
In the created DatasetB, select Auto tag to automatically apply data tags to the columns. This will help you automatically match fields in DatasetA and DatasetB during transformation and mapping steps.
Transformation involves building workflows that consist of various actions and steps. These workflows allow you to manipulate, clean, and configure your data in a structured manner using a large variety of native or custom Functions.
For example, you can hide sensitive PII, convert text in a column to uppercase, change the order of columns, exclude or rename columns, or create new ones.
To hide the PII data in DatasetB, create a new Workflow that uses this Dataset as the Source step and add a Transformation step to it. Apply relevant Functions to the data. For example, the Hash code function applied to the columns containing PII will return a generated hash code for the input value based on the specified algorithm. The Format pattern function will return the format pattern of an input value. This helps validate that the values conform to the defined format.
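These Functions are configured in the Data Studio UI rather than in code, but their effect can be sketched in plain Python. The snippet below is an illustration only, not Data Studio's implementation: the hashing mimics a hash-based masking step, and the `format_pattern` helper approximates a format-pattern output (the exact pattern symbols the product uses may differ); the email value is a made-up example.

```python
import hashlib

def hash_value(value: str, algorithm: str = "sha256") -> str:
    """Return a hex digest of the input, as a hash-based masking step would,
    so the original PII value is no longer recoverable."""
    return hashlib.new(algorithm, value.encode("utf-8")).hexdigest()

def format_pattern(value: str) -> str:
    """Approximate a format pattern: uppercase -> A, lowercase -> a, digit -> 9,
    other characters pass through. Useful for validating value shapes."""
    out = []
    for ch in value:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)

# Illustrative values only.
print(hash_value("jane.doe@example.com"))  # 64-character hex digest for sha256
print(format_pattern("Cust-42b"))          # Aaaa-99a
```

Comparing format patterns across a column quickly surfaces rows that do not conform to the expected shape, such as a date stored as free text.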
If, for example, your schema contains the column name Cust_ID, but the external system requires the name CustomerID, you can use the column renaming function in Aperture Data Studio to adjust this.
If the external system requires dates in YYYY-MM-DD format but the internal format is MM/DD/YYYY, you can apply a Format date conversion function to ensure compatibility.
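The same conversion, expressed outside Data Studio as a quick Python sketch (parse the internal MM/DD/YYYY value, re-emit it as ISO 8601 YYYY-MM-DD):

```python
from datetime import datetime

def convert_date(value: str) -> str:
    """Convert a date string from MM/DD/YYYY to YYYY-MM-DD (ISO 8601)."""
    return datetime.strptime(value, "%m/%d/%Y").strftime("%Y-%m-%d")

print(convert_date("07/04/2023"))  # 2023-07-04
```

Parsing with `strptime` first, rather than rearranging substrings, also rejects values that are not valid dates.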
The Map to target step is used to align and transform your source Dataset so that it matches the structure and format of the target one. Add this step to the Workflow. Select DatasetA as the target schema.
The auto-map functionality will map columns automatically where possible, based on column names in the source (input columns) and the target (output columns). If any columns remain unmapped, review them manually and select the appropriate source column for each.
Add an Export step to the Workflow and apply the settings for the appropriate file type, compression, encryption, and file name.
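The Export step's options are set in the UI; as a rough equivalent, the sketch below writes the mapped rows to a gzip-compressed CSV. The file name, column list, and sample row are assumptions for illustration, and encryption is omitted.

```python
import csv
import gzip

# Illustrative data matching the target schema from earlier in the article.
columns = ["CustomerID", "RegistrationDate", "Address"]
rows = [
    {"CustomerID": "C-1001", "RegistrationDate": "2023-07-04", "Address": "1 Main St"},
]

# Write a compressed CSV; gzip.open in text mode handles the compression.
with gzip.open("export_customers.csv.gz", "wt", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)
```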
Automation allows an action, such as running a Workflow, to happen automatically whenever a particular event occurs, such as data being loaded into a Dataset. Alternatively, a Schedule can run one or more Workflows at regular intervals.
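The schedule-based variant amounts to running the same workflow repeatedly at a fixed interval. A minimal sketch, where `run_workflow` is a hypothetical stand-in for the whole prepare-and-export pipeline:

```python
import time

def run_workflow() -> str:
    """Placeholder for the real prepare-and-export pipeline."""
    return "exported"

def run_on_schedule(interval_seconds: float, iterations: int) -> list:
    """Run the workflow a fixed number of times, pausing between runs."""
    results = []
    for _ in range(iterations):
        results.append(run_workflow())
        time.sleep(interval_seconds)
    return results

print(run_on_schedule(0.0, 3))  # ['exported', 'exported', 'exported']
```

In practice the scheduling itself lives inside Data Studio (or an external scheduler such as cron); the point is only that the defined Workflow becomes a repeatable unit.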