Data Quality user documentation

Installing a separate instance

Overview

The default setup of the Find duplicates step is to run embedded within Data Studio. While suitable for testing and small data sets, this is not recommended for production use. Processing large data sets through Find duplicates can use a significant proportion of system resources (CPU/memory) and impact other users and performance of other workflows when run within Data Studio.

When ready to move to production, or to process larger volumes, it is recommended to run Find duplicates as a separate instance.

A separate instance can be:

• Local – the server is deployed on the same machine as Data Studio or
• Remote – the server is deployed on a separate machine (recommended).

Requirements

Before installing a separate instance of the Find duplicates server, make sure you meet the requirements below.

1. .NET Core 2.1 Runtime or above.
2. Java JRE 8.
3. An application server capable of deploying a war file. We recommend Apache Tomcat.

Hardware requirements differ depending on the number of records the Find duplicates step has to process, the quality of data, and its level of duplication.

Number of users and records               Requirements
Up to 3 users and 10 million records      Small workload requirements
Up to 10 users and 100 million records    Medium workload requirements
Over 10 users or 100 million records      Large workload requirements
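
The sizing table above can be read as a simple decision rule. The sketch below is illustrative only: the function name workload_tier is hypothetical, and it assumes the limits are inclusive and that exceeding either limit of a tier moves the workload up to the next tier.

```python
def workload_tier(users: int, records: int) -> str:
    """Map a user count and record count to a tier from the sizing table.

    Assumption: "up to" limits are inclusive, and exceeding either the
    user limit or the record limit of a tier moves you to the next tier.
    """
    if users < 0 or records < 0:
        raise ValueError("users and records must be non-negative")
    if users <= 3 and records <= 10_000_000:
        return "small"
    if users <= 10 and records <= 100_000_000:
        return "medium"
    return "large"
```

For example, workload_tier(2, 50_000_000) falls in the medium tier because the record count exceeds the small limit even though the user count does not.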

Installation

To install a separate instance of the Find duplicates server follow these steps:

The Find duplicates step uses an external service named GdqStandardizeServer to perform input standardization. This service has to be running for the Find duplicates step to work correctly.

Setup

Prior to installing or starting the service, you need to copy the Standardize directory:

1. On the machine where you have installed Data Studio, find the Standardize directory e.g. C:\Program Files\Experian\Aperture Data Studio {version number}\Standardize.
2. Copy the Standardize directory to a location of your choice on your remote machine. We recommend:
• C:\Program Files\Experian\Standardize if you are using Windows.
• /home/<user>/experian/Standardize if you are using Linux.

Installing the service

Once you have copied the Standardize directory, you need to install Experian.Gdq.Standardize.Web as a service. Follow the instructions below.

Windows

There are two ways to install Standardize on Windows: using PowerShell or using the Windows Command Prompt. If PowerShell is disabled in your environment, use the Command Prompt option.

PowerShell
1. In an Administrator PowerShell prompt, navigate to the Standardize directory.
2. Run .\GdqStandardizeServiceManager.ps1 install. This will register the service in the Windows Services console.
3. Run .\GdqStandardizeServiceManager.ps1 start. This will attempt to start the service.

Windows Command Prompt
1. Run the Windows Command Prompt as an administrator.
2. Run SC create GdqStandardizeServer binpath= "{path to Standardize directory}\Experian.Gdq.Standardize.Web.exe action:run" start= "auto". This will register the service in the Windows Services console.
3. Run SC start GdqStandardizeServer. This will attempt to start the service.
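
Because the start command only attempts to start the service, it can be useful to confirm the resulting state. The following is a minimal sketch, not part of the product: it assumes an English-language Windows installation where sc query prints a STATE line such as "STATE : 4 RUNNING", and the helper names are hypothetical.

```python
import re
import subprocess

SERVICE_NAME = "GdqStandardizeServer"  # service name used in the steps above


def parse_service_state(sc_output: str):
    """Return the state word (for example RUNNING) from `sc query` output.

    Returns None when no STATE line is present, such as when the
    service is not installed.
    """
    match = re.search(r"^\s*STATE\s*:\s*\d+\s+(\w+)", sc_output, re.MULTILINE)
    return match.group(1) if match else None


def query_service_state(name: str = SERVICE_NAME):
    """Run `sc query <name>` (Windows only) and return the parsed state."""
    result = subprocess.run(["sc", "query", name], capture_output=True, text=True)
    return parse_service_state(result.stdout)
```

Calling query_service_state() on the machine hosting the service should return RUNNING once the service has started; any other value suggests checking the Windows Services console.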

Linux
1. Navigate to the Standardize directory.
2. As an administrator, run ./install_linux_prereqs.sh. This will install further prerequisites required for the service.
3. As an administrator, run ./deploy_gdqs.sh. This will install and start the service.

Install the latest stable version of your chosen application server. We recommend using Apache Tomcat.

Install Apache Tomcat

Prerequisites

You will need to install a 64-bit Java JDK or JRE before you can run Apache Tomcat. Tomcat 9.0 requires Java 8 or higher. Refer to the Apache Tomcat documentation for more information on Java version support.

Install on Windows

1. Download and install Apache Tomcat.
• You will need the 64-bit version of Tomcat. We recommend downloading and running the 64-bit Windows Service Installer.
2. Confirm that the Apache Tomcat service is running as the LOCAL SYSTEM account:
• Open the Windows Services console.
• Right-click the Apache Tomcat service, select Properties, and then open the Log On tab.
• Ensure the Local System account option is selected.
3. Navigate to http://localhost:8080 to confirm that Tomcat is running.

Install on Linux

• Tomcat's installation directory.
• Tomcat's installation sub-directories, including /conf, /webapps, /work, /temp, /db and /logs.

Deploy the web application

The Find duplicates server is deployed like any other web application by copying the supplied war file to the Apache Tomcat \webapps directory.

To check that your deployment was successful, go to: http://localhost:{port}/match-rest-api-{VersionNumber}/match/docs/index.html. The default Tomcat port is 8080.
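
The deployment check above can also be scripted. This is a sketch under stated assumptions: the function names are hypothetical, the version string you pass in must match your war file (any version shown here is a placeholder, not a real release), and an HTTP 200 response is taken to mean the documentation page was served.

```python
import urllib.error
import urllib.request


def docs_url(version: str, host: str = "localhost", port: int = 8080,
             scheme: str = "http") -> str:
    """Build the documentation URL used to confirm the deployment."""
    return f"{scheme}://{host}:{port}/match-rest-api-{version}/match/docs/index.html"


def is_deployed(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx status.
        return False
```

For instance, is_deployed(docs_url("x.y.z")) with your own version number in place of x.y.z returns False until Tomcat has unpacked the war file and started the application.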

Memory tuning

Allocate as much memory to the Tomcat JVM as possible: set the maximum heap size as high as you can while leaving sufficient memory for the operating system and any other running processes.

Setting memory settings on Windows

To set the minimum and maximum values for the memory usage:

1. Navigate to the \bin folder in the Tomcat installation location.
2. Run tomcat9w.exe
3. In the Java tab, set the minimum and maximum memory pool.

Encrypting the connection

When deploying a remote instance of the Find duplicates server, it can be set up to support an encrypted connection (HTTPS). Follow the steps within your application web server documentation to achieve this.

Configuring SSL/TLS with Apache Tomcat

If using Apache Tomcat, refer to the SSL/TLS configuration how-to guide for detailed steps and supported protocols.

If you already have a PKCS12 (.pkcs12 .pfx .p12) file containing the certificate chain and private key, this is the easiest way to configure SSL for Find duplicates using Tomcat. The PKCS12 file can be used directly as a keystore for Tomcat.

Locate Tomcat's main configuration file, /conf/server.xml, under the installation root directory. If this is the first time you are configuring TLS for Tomcat, first uncomment the SSL Connector element by removing the comment tags <!-- and -->. Then edit the Connector, which listens on port 8443 by default and uses JSSE, as follows:

<Connector port="8443" protocol="org.apache.coyote.http11.Http11NioProtocol"
SSLEnabled="true"
scheme="https"
secure="true"
keystoreFile="C:/path/to/my/certificate.pfx"
keystoreType="PKCS12"
clientAuth="false"
SSLProtocol="TLSv1.2"
/>
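
After editing server.xml, a quick script can confirm that the TLS Connector is no longer commented out and carries the attributes shown above. This is a rough sketch with hypothetical function names; the checks simply mirror the sample Connector and are not an exhaustive validation of Tomcat's TLS configuration.

```python
import xml.etree.ElementTree as ET


def tls_connectors(server_xml: str) -> list:
    """Return the attributes of every active Connector with SSLEnabled="true".

    Connectors that are still inside XML comments are skipped because the
    parser ignores comments.
    """
    root = ET.fromstring(server_xml)
    return [
        dict(conn.attrib)
        for conn in root.iter("Connector")
        if conn.get("SSLEnabled", "").lower() == "true"
    ]


def check_connector(attrs: dict) -> list:
    """Return a list of problems found in one TLS connector's attributes."""
    problems = []
    if attrs.get("scheme") != "https":
        problems.append('scheme should be "https"')
    if attrs.get("secure") != "true":
        problems.append('secure should be "true"')
    if not attrs.get("keystoreFile"):
        problems.append("keystoreFile is missing")
    return problems
```

Pass the contents of conf/server.xml to tls_connectors; an empty result usually means the Connector element is still commented out.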


Test the connection by browsing to https://localhost:{port}/match-rest-api-{VersionNumber}/match/docs/index.html. The default Tomcat HTTPS port is 8443.

Using a private CA root certificate

The JRE used by Data Studio will validate certificate trust. By default, the certificate must have a valid trust chain referencing a public Certificate Authority (CA). If a private CA is used to create the certificate, it must be added to the Java truststore being used by Data Studio. This can be achieved as follows:

C:\Program Files\Experian\Aperture Data Studio {version number}\java64\jre\lib\security>..\..\bin\keytool.exe -importcert -keystore cacerts -file c:\path\to\ca\ca.cert.pem

The default password for the cacerts truststore is "changeit".

Logging

The instructions in this section only apply if you wish to configure logging information in greater detail.

When the Find duplicates server has been deployed using Tomcat, the findDuplicates.log and findDuplicatesCore.log files can be found in CATALINA_HOME\logs. Logging is handled by the Log4j 2 framework. The logging behaviour can be changed by updating the deployed log4j2.xml file, as described below.

On Linux, the log file path(s) must be specified explicitly in the log4j2.xml configuration file as shown below:

<Property name="LOG_DIR">${sys:catalina.home}/logs</Property>
<Property name="ARCHIVE">${sys:catalina.home}/logs/archive</Property>

Log Levels

The log level is specified for each major component of the deduplication process within its own section of the log4j2 configuration file under the XML section <loggers>. For example:

<Logger name="com.experian.match.rest.api" level="WARN" additivity="false">
    <AppenderRef ref="findDuplicatesLog"/>
</Logger>
<Logger name="com.experian.match.actorsys" level="WARN" additivity="false">
    <AppenderRef ref="findDuplicatesCoreLog"/>
</Logger>


This specifies that the logs will have a log level of WARN, which is the recommended default for all components. Each component can have the logging level increased or decreased to change the granularity in the log file.

The components that may be individually configured are:

Component                       Description
com.experian.match.rest.api     The overall application's web controllers; the level set here is the default to be applied if none of the below are configured.
com.experian.match.actorsys     The core deduplication logic.
com.experian.standardisation    The API that interfaces to the standalone standardisation component.

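The log levels in the log4j2.xml file follow the hierarchy presented in the table below, ordered from most to least verbose. Setting a level also enables every level listed below it: for example, if you set the log level to DEBUG, you will also get INFO, WARN, ERROR, and FATAL events.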

Level    Description
ALL      All levels.
TRACE    Designates finer-grained informational events than DEBUG.
DEBUG    Granular information, use this level to debug a package.
INFO     Informational messages that highlight the progress of the application at coarse-grained level.
WARN     Potentially harmful situations.
ERROR    Error events that might still allow the application to continue running.
FATAL    Severe error events that will presumably lead the application to abort.
OFF      The highest possible rank. Intended to turn off logging.
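
The hierarchy in the table can be expressed as a simple ordering. The sketch below is only an illustration of that rule (the names LEVELS and is_logged are hypothetical and it is not Log4j code): an event is written when its level sits at or below the configured level in the table.

```python
# Levels in the same order as the table above, most verbose first.
LEVELS = ["ALL", "TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL", "OFF"]


def is_logged(event_level: str, configured_level: str) -> bool:
    """Return True if an event at event_level is written by a logger
    configured at configured_level.

    Raises ValueError for names that are not in the table, and for
    event levels ALL and OFF, which are thresholds rather than event levels.
    """
    if event_level in ("ALL", "OFF"):
        raise ValueError("events are logged at TRACE to FATAL, not ALL or OFF")
    return LEVELS.index(event_level) >= LEVELS.index(configured_level)
```

With a logger configured at DEBUG, is_logged("INFO", "DEBUG") is True while is_logged("TRACE", "DEBUG") is False.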

Logging outputs

By default, the Find duplicates server is set to output the logs to CATALINA_HOME\logs within two separate log files called findDuplicates.log and findDuplicatesCore.log.

To change this, edit the below section of the log4j2.xml file:

<RollingFile name="findDuplicatesLog"
             fileName="${LOG_DIR}/findDuplicates.log"
             filePattern="${ARCHIVE}/findDuplicates.log.%d{yyyy-MM-dd}.gz">
    <PatternLayout pattern="${PATTERN}"/>
    <Policies>
        <TimeBasedTriggeringPolicy/>
        <SizeBasedTriggeringPolicy size="1 MB"/>
    </Policies>
    <DefaultRolloverStrategy max="2000"/>
</RollingFile>
<RollingFile name="findDuplicatesCoreLog"
             fileName="${LOG_DIR}/findDuplicatesCore.log"
             filePattern="${ARCHIVE}/findDuplicatesCore.log.%d{yyyy-MM-dd}-%i.gz">
    <PatternLayout pattern="${PATTERN}"/>
    <Policies>
        <TimeBasedTriggeringPolicy/>
        <SizeBasedTriggeringPolicy size="1000 MB"/>
    </Policies>
    <DefaultRolloverStrategy max="100"/>
</RollingFile>


Adjusting the fileName attribute allows you to change the name and location; for example, you may choose to output logging from all components into a single file, or different file names than the ones above.

Once you've installed a separate instance of the Find duplicates server, you can configure it in Data Studio:

1. Go to Settings > Workflow steps.

2. Toggle on Remote find duplicates server: Enabled.

3. Specify the following to connect to your Find duplicates server:

Item                                       Description
Remote find duplicates server: Hostname    The IP address or the machine name. Don't include the protocol (http://).
Remote find duplicates server: Path        The location of the folder that was created when the war file was deployed. Default: match-rest-api-{VersionNumber}
Remote find duplicates server: Port        The port number (8080 by default) used by the service.

4. If you deployed the Find duplicates server to use HTTPS, toggle on Use Secure Sockets Layer (SSL) to encrypt the connection.

5. Click Test connection to ensure that the server information has been entered correctly and you can connect. If you receive a licensing error, this means that the server can be found and must now be licensed.
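
Before entering the values in Data Studio, it may help to sanity-check them. The following sketch uses hypothetical helper names and assumes the three settings combine into a URL of the form scheme://hostname:port/path; how Data Studio builds the URL internally is not documented here, so treat base_url as an approximation for manual testing only.

```python
def validate_settings(hostname: str, path: str, port: int) -> list:
    """Return a list of problems with the three connection settings."""
    problems = []
    if not hostname.strip():
        problems.append("Hostname is empty")
    elif "://" in hostname:
        problems.append("Hostname must not include the protocol")
    if not path.strip("/"):
        problems.append("Path is empty")
    if not 1 <= port <= 65535:
        problems.append("Port must be between 1 and 65535")
    return problems


def base_url(hostname: str, path: str, port: int, use_ssl: bool = False) -> str:
    """Assumed URL that the settings describe (not confirmed product behaviour)."""
    scheme = "https" if use_ssl else "http"
    return f"{scheme}://{hostname}:{port}/{path.strip('/')}"
```

An empty list from validate_settings means none of these basic problems were found; opening the resulting base_url followed by /match/docs/index.html in a browser is then a reasonable manual check before clicking Test connection.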