Technical recommendations

Aperture Data Studio is a self-contained, web-based application that runs on most Java-compliant operating systems on commodity hardware. It takes full advantage of 64-bit architectures, is multi-threaded, and scales linearly.

To avoid performance drops during critical operations, we strongly recommend excluding the directories containing Aperture Data Studio data files from anti-virus checks, and scheduling any system sweeps to run outside of office hours and data loading periods.

Client

Aperture Data Studio is a web-based application, so all that is required on the client side is a web browser. We strongly recommend using Chrome to access the application.

A full production deployment is shown below.

  • Find duplicates server: only required if deduplication is needed.
  • Electronic updates: this optional component is only required if address validation is needed.

Aperture Data Studio Server

Our recommendations depend on the size of your workload: small, medium or large.

Recommendations below are for single users (max 2-3 concurrent users) or when processing fewer than 10 million rows of data.

  • Operating System (OS): Windows Server 2019, 2016, 2012 R2, or Windows 10 64-bit; supported Linux distributions.
  • Processor (CPU): 8 virtual cores, the equivalent of an Intel quad-core CPU.
  • Memory (RAM): 16 GB dedicated to Aperture Data Studio. You may require more (e.g. 32 GB) for the OS and any other software.
  • Disk (HDD, SSD, etc.): consumer SSDs (as fast as possible).

Recommendations below are for 4-10 users or when processing 10-100 million rows of data.

  • Operating System (OS): Windows Server 2019, 2016, 2012 R2, or Windows 10 64-bit; supported Linux distributions.
  • Processor (CPU): 16 virtual cores, the equivalent of dual Intel quad-core CPUs.
  • Memory (RAM): 32 GB dedicated to Aperture Data Studio. You may require more (e.g. 48 GB) for the OS and any other software.
  • Disk (HDD, SSD, etc.): enterprise SSDs (as fast as possible).

Recommendations below are for more than 10 users or when processing 100 million - 1 billion rows of data.

  • Operating System (OS): Windows Server 2019, 2016, 2012 R2, or Windows 10 64-bit; supported Linux distributions.
  • Processor (CPU): 16 virtual cores, the equivalent of dual Intel quad-core CPUs.
  • Memory (RAM): 128 GB dedicated to Aperture Data Studio. You may require more for the OS and any other software.
  • Disk (HDD, SSD, etc.): enterprise provisioned SSDs (as fast as possible).
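As a rough illustration, the tier choice above can be sketched in code. The thresholds come directly from the three tiers described in this section; the function name and structure are our own, not part of the product:

```python
# Illustrative helper: pick a sizing tier from the recommendations above.
# Thresholds are taken from the text; names are ours, not part of Data Studio.
def sizing_tier(concurrent_users: int, rows: int) -> str:
    if concurrent_users > 10 or rows > 100_000_000:
        return "large"   # 16 vCores, 128 GB RAM, enterprise provisioned SSDs
    if concurrent_users >= 4 or rows >= 10_000_000:
        return "medium"  # 16 vCores, 32 GB RAM, enterprise SSDs
    return "small"       # 8 vCores, 16 GB RAM, consumer SSDs
```

Treat the output as a starting point only; concurrency patterns and data shape can push a workload into the next tier up.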

Find Duplicates Server

The recommended size for the find duplicates server is in line with the sizing for the Aperture Data Studio server. However, RAM and CPU requirements can vary depending on the type of deduplication taking place and the number of records.

Use the Aperture Data Studio server recommendations as a starting point, but vary the sizing based on analysis of your data and the processing time required for your job.

CPUs: Adding more cores improves performance as the number of records to be processed increases.

RAM: Adding more RAM improves performance when the average cluster size is larger, or when there are many large clusters of duplicate records.

General recommendations

Virtual cores refer to the cores of a virtual machine or a cloud instance from Azure/AWS; on physical hardware, they correspond to the CPU threads available to the server. Slower core speeds still affect performance, but the dominant factor is how many cores are available for use by Data Studio.

Single core speed: Has the most impact on view operations (joining, sorting, grouping, lookups, expression evaluation, etc.).

Multi-core speed: Affects sorting, grouping, profiling and the ability to support concurrent user activity/workflow execution.

In production, we recommend allocating 50% of available RAM to Data Studio, with at least 16 GB for Data Studio out of 32 GB overall, so that the OS and other software can run without impacting performance.
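The allocation rule above (half of total RAM, with a 16 GB floor for Data Studio) can be sketched as a small calculation; the helper name is illustrative, not part of the product:

```python
# Sketch of the guidance above: allocate 50% of total RAM to Data Studio,
# with a floor of 16 GB; the remainder is left for the OS and other software.
def data_studio_ram_gb(total_ram_gb: int) -> int:
    return max(16, total_ram_gb // 2)
```

For example, a 128 GB host would give Data Studio 64 GB, while on a 32 GB host the 16 GB floor applies.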

Memory amount: Affects joining, profiling, lookup index creation and, to an extent, sorting and grouping.

Disks can have a significant impact on load/profile times for Data Studio, as well as on address validation and writing to snapshots. You should therefore use the fastest drives possible, ideally enterprise SSDs (provisioned versions if hosted). We recommend using RAID-0 for data storage.

Average disk throughput is the biggest determining factor (IOPS matter far less); it affects profiling, loading and exporting. For large data sets, it can also impact joining, sorting and grouping.

Disk size

Database size after load/profile can be estimated easily: the loaded database is around 80%, and the profiled database around 165%, of the original file size.
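Using the percentages above, the on-disk footprint can be estimated with a short sketch; the function name and rounding are our own, illustrative choices:

```python
# Estimate Data Studio database size from the source file size,
# using the rough ratios above: ~80% after load, ~165% after profiling.
def estimated_db_size_gb(source_file_gb: float, profiled: bool = False) -> float:
    ratio = 1.65 if profiled else 0.80
    return round(source_file_gb * ratio, 2)
```

So a 100 GB source file would need roughly 80 GB once loaded, or 165 GB once profiled, before allowing headroom for snapshots and exports.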

Based on your requirements, we suggest the following Amazon Web Services (AWS) instance families: i3, c4, c5 or m4.

These AWS instance families provide the best performance for Data Studio without over-provisioning CPU/RAM, while offering disks that reduce the impact of shared resources in cloud environments, which can otherwise cause bottlenecks for Data Studio.

For standard usage, we recommend higher-end Data Studio instances with standard (default) storage on the standard tier, not the basic one. For maximum performance, use the appropriate amount of premium storage for your needs in conjunction with the chosen Data Studio instance.

These Azure instance types provide the best performance for Data Studio without over-provisioning CPU/RAM.


For smaller data sets, Aperture Data Studio can be easily run on a virtual machine (VM). For larger data sets or intensive usage, it's worth taking the following into account to ensure maximum performance:

  • The VM should be allocated dedicated CPUs. This ensures the full performance is available to Aperture Data Studio, rather than the potentially reduced performance that comes from sharing CPU resources across multiple VMs or applications on the server.
  • If possible, use directly attached storage for the VM. Many VMs are quick builds on shared hardware resources, which can negatively impact performance for a database storage engine like Aperture Data Studio.
  • Where a physical disk is shared between VMs on a host, multiple processes trying to access the disk across those VMs can lead to high contention, significantly impacting Data Studio's performance. For best performance, we recommend that storage is dedicated to the VM hosting Data Studio's database.

For particularly large data sets, we recommend using a dedicated server instead of a VM.