Install Data Studio on Windows

Aperture Data Studio is a high-performance, self-contained, web-based application for data management. It runs on most Java-compliant operating systems on commodity hardware. Data Studio takes full advantage of 64-bit architectures, and is multi-threaded and linearly scalable.

View technical recommendations before installation. Note that your setup requirements will depend on the size of your data and Aperture Data Studio usage.

To install Data Studio, download and run the Experian Aperture Data Studio Setup executable.

  1. Choose an installation type. Select Typical or Complete to install all components, or Custom to optionally unselect the Find duplicates workbench or JDBC driver components.
  2. Review installation folder locations on disk.
  3. To view and modify paths to folders used by Data Studio, uncheck Use destination drive as the basis for Data Studio paths.
  4. Review configurable locations on disk. Click Settings to view and edit paths to individual database folders. Edit locations based on the decision made in the previous step. These locations can also be modified after installation.
  5. Click Install.

The installer can be run silently from the command line using the following switches:

  • /exenoui: Launches the installer exe without a UI
  • /qn: Set the UI level to no UI

Locations for the data folders and the address validation component can be set from the command line using the following parameters:

Parameter name | Description | Default folder if not set
DIRNAME_ROOT | The root database directory. | C:\ApertureDataStudio
DIRNAME_DATA | The root variable data file directory. | {DIRNAME_ROOT}\data
DIRNAME_IMPORT | The directory used for importing files from the server. | {DIRNAME_DATA}\import
DIRNAME_EXPORT | The root directory used for exporting files. | {DIRNAME_DATA}\export
DIRNAME_LOG | The log file directory. | {DIRNAME_DATA}\log
DIRNAME_RESOURCE | The resources directory. | {DIRNAME_DATA}\resource
DIRNAME_REPOSITORY | The directory containing the repository and repository backups. | {DIRNAME_DATA}\repository
DIRNAME_TEMP | The temporary files directory. | {DIRNAME_DATA}\temp
MATCH_DATABASE_PATH_WINDOWS | The directory for duplicates stores created by the Find duplicates Workflow step. | {DIRNAME_DATA}\experianmatch
SERVER_ADDRESSVALIDATEINSTALLPATH | The Address Validate runtime directory. | C:\ProgramData\Experian\addressValidate\runtime

To perform an upgrade, add the parameter AI_UPGRADE="Yes".

New install example

"Experian Aperture Data Studio 2.3.9 Setup.exe" /exenoui /qn DIRNAME_DATA="C:\DATA" DIRNAME_LOG="C:\LOG"
This will install Data Studio silently, setting the DIRNAME_DATA and DIRNAME_LOG directories.

Upgrade example

"Experian Aperture Data Studio 2.3.9 Setup.exe" /exenoui /qn AI_UPGRADE="Yes"
This will upgrade Data Studio from a previous version, using the same configuration settings.
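
The two examples above follow the same pattern, so the command line can be assembled from a setup file name and a set of directory parameters. The following is a minimal sketch only: the function name build_silent_install_command is hypothetical, and it simply reproduces the switches and NAME="value" parameters documented above without running anything.

```python
def build_silent_install_command(setup_exe, directories=None, upgrade=False):
    """Assemble the silent-install command line as a single string.

    setup_exe   -- file name of the setup executable (quoted, as it contains spaces)
    directories -- optional mapping of documented parameter names to folders,
                   for example {"DIRNAME_DATA": r"C:\\DATA"}
    upgrade     -- when True, append AI_UPGRADE="Yes"
    """
    parts = [f'"{setup_exe}"', "/exenoui", "/qn"]
    for name, folder in (directories or {}).items():
        parts.append(f'{name}="{folder}"')
    if upgrade:
        parts.append('AI_UPGRADE="Yes"')
    return " ".join(parts)


print(build_silent_install_command(
    "Experian Aperture Data Studio 2.3.9 Setup.exe",
    {"DIRNAME_DATA": r"C:\DATA", "DIRNAME_LOG": r"C:\LOG"},
))
```

The resulting string can be pasted into a command prompt or a deployment script. Substitute the file name of the setup executable you actually downloaded.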

When the installation is complete:

  1. Open a web browser and go to http://localhost:7701/ or use the shortcut on your desktop. Note that it may take a minute for the service to start.
  2. Log in with the super-user username and password:
    • Username: administrator.
    • Password: administrator (you'll be prompted to change the password).
  3. In the Update license page, request a license, or enter your update code if you already have it.
  4. Start using Aperture Data Studio.
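
Because the service may take a minute to start, an unattended install script may want to wait until the web server answers before continuing. The sketch below is an illustration rather than part of the product: the function names, the retry count and the delay are assumptions, and only the URL comes from the step above.

```python
import time
import urllib.error
import urllib.request


def is_up(url="http://localhost:7701/", timeout=5):
    """Return True if anything answers at the URL, even with an error status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True  # the server responded, so the service is running
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timed out, or similar


def wait_for_service(probe=is_up, attempts=12, delay_seconds=10, sleep=time.sleep):
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

Calling wait_for_service() with the defaults polls for roughly two minutes before giving up.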

By default Data Studio is configured to use a variable amount of memory, 66% of what is available, up to a maximum of 16GB. This is useful for initial testing and evaluation where Data Studio is likely to be installed on a machine with many other services running.

When moving into a dedicated dev, test, UAT or production environment it is recommended to modify the memory settings to optimise the resources available and to obtain consistent behaviour.

Memory settings may need to be further tuned based on the results observed once in use with the production data, users and regular workload.

It is generally recommended that 50% of the available RAM on the server is dedicated to Aperture Data Studio. Using more may impact performance and stability by restricting resources available for the operating system and file system cache.

To change the memory settings edit the Aperture Data Studio Service 64bit.ini file in the installation directory. The default memory settings will be found in the Virtual Machine Parameters setting.

    Virtual Machine Parameters=-Xms66:1000:16000P -Xmx66:1000:16000P

For example, to change this to use 32GB of RAM, modify the setting to:

    Virtual Machine Parameters=-Xms32G -Xmx32G

-Xms is the minimum amount of heap space the process will use, -Xmx is the maximum. These are followed by the amount of memory to be used as a whole integer and either M or G. M is megabytes and G is gigabytes.

The Data Studio service will need to be restarted after these changes are made.
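
As a worked example of the 50% guideline, the setting line can be derived from the amount of RAM on the server. This is a rough sketch: the function name heap_setting_line is hypothetical, the RAM figure is supplied by hand, and the result is rounded down to whole gigabytes because the setting takes a whole integer.

```python
def heap_setting_line(total_ram_gb, fraction=0.5):
    """Build the Virtual Machine Parameters line for a fixed heap size.

    total_ram_gb -- physical RAM on the server, in gigabytes
    fraction     -- share of RAM to dedicate to Data Studio (0.5 = 50%)
    """
    heap_gb = max(1, int(total_ram_gb * fraction))  # whole gigabytes, at least 1
    return f"Virtual Machine Parameters=-Xms{heap_gb}G -Xmx{heap_gb}G"


print(heap_setting_line(64))
```

For a server with 64GB of RAM this gives the 32GB setting shown above. Setting -Xms and -Xmx to the same value fixes the heap size, which matches the goal of consistent behaviour.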

The metadata repository is the main underlying database that stores all information in Data Studio other than the actual data being processed. This includes users, environments, workflows, functions, datasets (only the metadata not the data itself) and views.

The repository is a SQLite relational database which stores the metadata information for all environments in your Data Studio instance. This consists of a single file, the repository.db, located in the \data\repository folder in the database root directory.

Backups of this repository are configured by modifying the repositories.json file which can be found in the installation root directory.

The default file will look as follows:

[ {
  "id" : "5f3b8528-49d5-4a05-802e-c5437468126b",
  "name" : "Default",
  "datastoreType" : "SQLITE",
  "directory" : "C:\\ApertureDataStudio\\data\\repository",
  "filename" : "repository.db",
  "backupOptions" : {
    "enabled" : false,
    "intervalInMinutes" : 2,
    "backupHistoryIntervalInMinutes" : 1440,
    "maxHistoricBackups" : 7,
    "backupOnStartup" : true
  }
} ]

To configure backups, change the "enabled" setting to true.

    "enabled" : true,

The Data Studio service will need to be restarted after these changes are made.
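
If the change needs to be scripted, for example across several environments, the file can be edited with a JSON parser rather than by hand. The following is a minimal sketch that assumes the file has the structure shown above; the function name enable_backups is hypothetical. Stop the service or work on a copy before overwriting the real file.

```python
import json
from pathlib import Path


def enable_backups(config_path, repository_name="Default"):
    """Set backupOptions.enabled to true for the named repository.

    Returns True if a matching repository entry was found and the file rewritten.
    """
    path = Path(config_path)
    repositories = json.loads(path.read_text(encoding="utf-8"))
    changed = False
    for repository in repositories:
        if repository.get("name") == repository_name:
            repository.setdefault("backupOptions", {})["enabled"] = True
            changed = True
    if changed:
        # Other settings (intervals, history size) are written back unchanged.
        path.write_text(json.dumps(repositories, indent=2), encoding="utf-8")
    return changed
```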

When backups are enabled, two levels of backup will be created.

  1. An active backup will be taken every intervalInMinutes and will be placed in a file called repository-backup.db alongside the original repository.
  2. Historic backups will be taken of the active backup every backupHistoryIntervalInMinutes and placed into a backup folder within the repository folder. The default is to back up once a day. The most recent historical backup will be named repository-backup-current.db.

When replicating for disaster recovery it is safe to back up the files within the \repository\backup folder. The active repository.db and repository-backup.db files should not be copied as they are being actively written to.
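
A replication job that follows this rule copies only the contents of the backup subfolder and never touches the two live files. The sketch below illustrates the idea; the function name and the destination folder are hypothetical, and a real job would more likely use your existing backup tooling.

```python
import shutil
from pathlib import Path


def copy_historic_backups(repository_dir, destination_dir):
    """Copy files from <repository_dir>/backup to destination_dir.

    repository.db and repository-backup.db live in repository_dir itself,
    so they are never read by this function.
    """
    backup_dir = Path(repository_dir) / "backup"
    destination = Path(destination_dir)
    destination.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in sorted(backup_dir.iterdir()):
        if source.is_file():
            shutil.copy2(source, destination / source.name)  # keeps timestamps
            copied.append(source.name)
    return copied
```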

Aperture Data Studio logging

Log4j 2 can be configured using the log4j2.xml file found in your installation directory.

The logging configuration file is made up of two main sections: Appenders and Loggers. Appenders define where log information is written to. The default appender configuration specifies the following, in order:

  1. The appender type, which is a RollingFile appender.
  2. The log file name, datastudio.log, in the configured Data Studio log folder.
  3. The pattern for archived logs. By default, a folder is created for each month, containing compressed archives with the date as part of the file name. The archive extension can be changed to zip if required.
  4. The pattern that log messages follow. See the Log4j 2 documentation for details.
  5. The policies that cause a file rollover. By default, log files roll over when they reach 250MB or once a day.
  6. The number of rollover files to retain. By default, once there are more than 20, the oldest is deleted.

The Logger section defines what information gets written.

The log levels available are trace, debug, info, warn and error. The default log configuration logs informational level events. In production, raise this to warn or error to avoid large log files.
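
Changing the level can also be scripted. The sketch below assumes the level is set on a Root element inside the Loggers section, which is the standard Log4j 2 layout; the file shipped with Data Studio may instead set levels on named Logger elements, so check it first. The function name set_root_log_level is hypothetical.

```python
import xml.etree.ElementTree as ET

ALLOWED_LEVELS = {"trace", "debug", "info", "warn", "error"}


def set_root_log_level(xml_text, level):
    """Return the configuration XML with the Root logger level changed."""
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unsupported log level: {level}")
    configuration = ET.fromstring(xml_text)
    root_logger = configuration.find("./Loggers/Root")
    if root_logger is None:
        raise ValueError("no Loggers/Root element found in the configuration")
    root_logger.set("level", level)
    return ET.tostring(configuration, encoding="unicode")
```

ElementTree discards XML comments when parsing, so keep a copy of the original log4j2.xml if its comments matter to you.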

Find duplicates logging

Find duplicates also uses Log4j 2 for logging. The log4j2.xml file can be found in the Tomcat webapps directory Tomcat x.x/webapps/match-rest-api-x.x.x/WEB-INF/classes.

By default Find duplicates is configured to log error messages only, apart from the following cases.

  1. The REST API will log warning messages to the findDuplicates.log file.
  2. Standardize will log warning messages to the findDuplicates.log file.
  3. The core matching engine will log warning messages to the findDuplicatesCore.log file.

By default the Data Studio server serves on port 7701 and includes a self-signed certificate to support secure HTTPS connections.

In many cases a certificate signed by a trusted certificate authority (CA) should be applied to the server to enable encryption of client-server communications and appropriately identify your server to clients.

To add an SSL certificate to the keystore and configure Data Studio to use the default HTTPS port (443):

  1. Log in with a user that has Manage Communication Settings capabilities (typically an administrator), and browse to Settings > Communication.
  2. Change the REST web server TCP/IP port to 443 (or your desired port).
  3. Turn on Use Secure Sockets Layer (SSL).
  4. Enter certificate file location information:
    • Key file: The path to the file containing the private key used to generate the certificate (for example C:\my certs\CAkey.pem or C:\my certs\CAkey.key). Do not use quotes in the path.
    • Certificate file: The path to the file containing the CA's certificate chain (for example C:\my certs\CARootcert.pem or C:\my certs\CARootcert.crt). If you have root and intermediate certificates in separate files, combine these first. Again, do not use quotes in the path.
    • Key passphrase: The pass phrase with which the key file (or combined certificate and key file) has been encoded.
  5. Click Save and restart the Data Studio server for the settings to take effect.
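
Where the root and intermediate certificates arrive as separate PEM files, combining them amounts to concatenating the files into one. The following is a minimal sketch: the function name combine_pem_files is hypothetical, it only handles PEM (text) files, and the certificates are written in the order the caller passes them, so confirm the expected order with your certificate authority.

```python
from pathlib import Path

PEM_HEADER = "-----BEGIN CERTIFICATE-----"


def combine_pem_files(source_paths, output_path):
    """Concatenate PEM certificate files into one file, in the order given.

    Returns the number of files combined.
    """
    blocks = []
    for source in source_paths:
        text = Path(source).read_text(encoding="ascii").strip()
        if PEM_HEADER not in text:
            raise ValueError(f"{source} does not look like a PEM certificate")
        blocks.append(text)
    # One newline between certificates and one at the end of the file.
    Path(output_path).write_text("\n".join(blocks) + "\n", encoding="ascii")
    return len(blocks)
```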

The Data Studio database is made up of a number of different folders, which by default are all located under the database root directory.

Most of the database folder locations can be configured during installation. To change them after installation:

  1. Navigate to the installation folder (by default C:\Program Files\Experian\Aperture Data Studio {version}) and open the server.properties file.

  2. On a new line in the file, add the setting you want to modify:

    Setting name | Default value | Description
    DirName.ROOT | C:/ApertureDataStudio |
    DirName.DATA | %ROOT%/data |
    DirName.REPOSITORY | %DATA%/repository | Contains the metadata repository, and backups if backup is configured.
    DirName.IMPORT | %ROOT%/import | Contains files that can be loaded from the server's import directory.
    DirName.EXPORT | %DATA%/export | Contains files exported to the server's export directory.
    DirName.LOG | %DATA%/log | Contains server log files.
    DirName.TEMP | %DATA%/temp | Contains temporary files.
    DirName.TEMP_WORKFLOWREPORT | %TEMP%/workflowreport | Contains temporary files.
    DirName.TEMP_FILEUPLOADS | %TEMP%/fileuploads | Contains temporary files.
    DirName.CACHE | %DATA%/cache | Contains caches generated by workflow steps.
    DirName.CACHE_STEP | %CACHE%/step | Contains caches generated by custom workflow steps.
    DirName.RESOURCE | %DATA%/resource | Contains all loaded data and indexes created during workflow execution or during interactive data exploration.
    DirName.DRIVERS_JDBC | %ROOT%/drivers/jdbc | The location of JDBC drivers used for External System connections.
    Dataset.DROPZONE | %DATA%/datasetdropzone | Contains dataset-specific file dropzones.
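
The default values are nested: each %NAME% placeholder refers to another setting, so a folder's final location depends on the settings above it. The sketch below shows how a few of the defaults expand. It is an illustration under the assumption that placeholders are substituted in this straightforward way, not a description of Data Studio's internal logic, and the resolve function is hypothetical.

```python
import re

# A subset of the defaults listed above, keyed by the name used in placeholders.
DEFAULTS = {
    "ROOT": "C:/ApertureDataStudio",
    "DATA": "%ROOT%/data",
    "LOG": "%DATA%/log",
    "TEMP": "%DATA%/temp",
    "TEMP_WORKFLOWREPORT": "%TEMP%/workflowreport",
}


def resolve(name, settings):
    """Expand %NAME% placeholders in a setting until none remain."""
    return re.sub(
        r"%([A-Z_]+)%",
        lambda match: resolve(match.group(1), settings),
        settings[name],
    )


print(resolve("TEMP_WORKFLOWREPORT", DEFAULTS))

# Overriding only ROOT moves every folder that is defined relative to it.
moved = {**DEFAULTS, "ROOT": "D:/aperturedatastudio"}
print(resolve("LOG", moved))
```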

For example, to change the location of the database root directory to the D:\ drive, the entry in server.properties should be:

DirName.ROOT=D:\\aperturedatastudio

To change the location of the export directory to a network location using the UNC path, the setting should be similar to:

DirName.EXPORT=\\\\myfileserver.myorg.local\\datastudio\\fileexports
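
Both examples double every backslash, because a single backslash is an escape character in a Java properties file. A small helper can produce the entry from an ordinary Windows path. This is a sketch only, and the function name properties_line is hypothetical.

```python
def properties_line(setting, windows_path):
    """Return a server.properties entry with backslashes doubled."""
    escaped = windows_path.replace("\\", "\\\\")
    return f"{setting}={escaped}"


print(properties_line("DirName.ROOT", r"D:\aperturedatastudio"))
print(properties_line("DirName.EXPORT", r"\\myfileserver.myorg.local\datastudio\fileexports"))
```

The two calls reproduce the entries shown above.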

If you're changing DirName.ROOT, DirName.DATA or DirName.REPOSITORY, you need to ensure that the path to the repository.db file is also changed in the repositories.json file. This file is located in the same folder as the server.properties file.

To configure Data Studio to write to a network drive, use the full UNC path and make sure that the account running the Data Studio Database Service (by default the Local System account) has write access to the network drive folder location. Only DirName.IMPORT, DirName.EXPORT and Dataset.DROPZONE are suitable options for network locations.

The Data Studio service will need to be restarted for changes to take effect.

To perform domain-level validation of email addresses using Data Studio's Validate Emails step, the location of an accessible Domain Name System (DNS) server should be configured.

The default server used is 8.8.8.8, the Google DNS, but network or firewall configuration settings may prevent you from accessing this IP address for DNS.

To check if the default Google DNS server is accessible from the Data Studio server, open a command line console and run:

nslookup experian.com 8.8.8.8

If this is successful, the command will return:

Server:  dns.google
Address:  8.8.8.8
Non-authoritative answer:
Name:    experian.com

If the DNS server is not accessible, the command will report DNS request timed out. In this case:

  1. Locate an accessible DNS server. Running ipconfig /all from the command line on the server Data Studio is installed on should return IP addresses for DNS servers that can be accessed.
  2. In the Data Studio UI browse to Settings > Workflow steps, and enter an IP in the DNS Servers field.
  3. Restart the Data Studio server for the settings to take effect.
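
The nslookup check above can be automated when several candidate DNS servers need testing. The following sketch runs the same command and applies a simple text test to its output; the function names are hypothetical, and the test (no "timed out" message and a Name: line present) is an assumption based on the sample output shown above.

```python
import subprocess


def dns_lookup_succeeded(nslookup_output):
    """Interpret nslookup output: True if a name was returned without a timeout."""
    text = nslookup_output.lower()
    return "timed out" not in text and "name:" in text


def check_dns_server(server="8.8.8.8", hostname="experian.com"):
    """Run nslookup against the given DNS server and report whether it answered."""
    result = subprocess.run(
        ["nslookup", hostname, server],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return dns_lookup_succeeded(result.stdout + result.stderr)
```

Calling check_dns_server with each IP address returned by ipconfig /all shows which of them can be entered in the DNS Servers field.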

Aperture Data Studio supports LDAP (Lightweight Directory Access Protocol) and SAML (Security Assertion Markup Language) for SSO (single sign-on) authentication.

Should you wish to configure Data Studio to use either of these methods of authentication as opposed to Data Studio's own internal username/password method of authentication, follow the associated steps.

Before creating any users, you should ensure that the password policy and other security settings enforced by Data Studio meet your security requirements.

To view and change the settings, browse to Settings > Security:

  1. Review and edit the Password policy settings. This defines the standard that users' passwords must reach when using internal authentication.
  2. Review the Maximum login attempts, and Account lockout period configuration settings, which control how failed login attempts are handled.
  3. Review the Session timeout setting, which controls how long a user session can be inactive before it expires and the user is logged out.
  4. To prevent simultaneous logins from multiple devices using the same access credentials, check Restrict one session per user.

If installing Find duplicates in a system for processing more than 1 million records, we strongly recommend following these steps to deploy the Find duplicates server onto a separate instance from Data Studio.

In cases where Find duplicates will not handle large volumes, the Find duplicates server that's embedded with Data Studio can be used, and no separate installation is needed.

If you are licensed for postal address validation, follow these steps to configure Validate addresses.