Data Quality user documentation

# Installing a separate instance

### Overview

The default setup of the Find duplicates step is to run embedded within Data Studio. While suitable for testing and small data sets, this is not recommended for production use. Processing large data sets through Find duplicates can use a significant proportion of system resources (CPU/memory) and impact other users and performance of other workflows when run within Data Studio.

When you are ready to move to production or to process larger volumes, we recommend configuring Find duplicates to run as a separate instance.

A separate instance can be:

• Remote – the server is deployed on a separate machine (recommended) or
• Local – the server is deployed on the same machine as Data Studio.

### Requirements

Before installing a separate instance of the Find duplicates server, make sure you meet the requirements below.

Hardware requirements differ depending on the number of records the Find duplicates step has to process, the quality of data, and its level of duplication.

| Number of records | Requirements |
| --- | --- |
| Up to 1 million | Small workload requirements |

### Windows installation

To install a separate instance of the Find duplicates server as a Windows service, follow these steps:

2. Choose an installation type. Select Typical to install the Find Duplicates Service, Find Duplicates Standardize Service, Find Duplicates Workbench and Java Runtime Environment, or select Custom to choose only specific components.
3. Review the installation folder location.
4. Configure custom settings: depending on which components were selected, the relevant settings may be customized here, including the Find Duplicates database path, port and maximum cluster size. To continue with the default settings, click Next.
5. Click Install.
6. Verify the service has started successfully: when the installation has completed click the link to open the Find Duplicates Swagger endpoint or navigate to http://localhost:{port}/swagger-ui/index.html, where port is 8080 or a custom value configured above.
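The verification in the last step can also be scripted, for example as part of a post-install check. The following is a minimal sketch in Python, which is not part of the product and is used here only for illustration. The function names are hypothetical, and the only assumption taken from the steps above is the Swagger address format.

```python
import urllib.error
import urllib.request


def swagger_url(host: str = "localhost", port: int = 8080, https: bool = False) -> str:
    """Build the Swagger UI address described in the verification step."""
    scheme = "https" if https else "http"
    return f"{scheme}://{host}:{port}/swagger-ui/index.html"


def service_is_up(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200, False on a connection or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts and non-2xx HTTP responses.
        return False
```

Calling `service_is_up(swagger_url(port=8080))` on the server returns `True` once the service is listening. Pass the custom port if you changed it during installation.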

#### Memory tuning

The Find Duplicates service runs as a Java executable. For the best performance ensure that you have as much memory allocated to the JVM as possible.

The maximum heap size should be set as high as possible while allowing sufficient memory for the operating system and any other running processes.

To set the minimum and maximum values for the memory usage, use the JVM parameters Xms and Xmx in the Find Duplicates configuration file:

1. Navigate to the Find Duplicates installation folder and open Find Duplicates.ini in a text editor.
2. Find the line starting with Virtual Machine Parameters=
3. Append your preferred minimum and maximum JVM parameters. For example, to set a minimum of 2 GB and maximum of 8 GB use the following: -Xms2g -Xmx8g
4. Save Find Duplicates.ini and restart the Find Duplicates service.
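If you manage several servers, the edit in steps 2 and 3 can be automated. The sketch below is one possible approach in Python, not a supplied tool. It assumes that the options on the Virtual Machine Parameters= line are separated by spaces and that none of them contains a quoted space. The function name is hypothetical.

```python
import re

PREFIX = "Virtual Machine Parameters="


def set_heap_options(ini_text: str, xms: str, xmx: str) -> str:
    """Return ini_text with -Xms/-Xmx set on the 'Virtual Machine Parameters=' line."""
    lines = []
    found = False
    for line in ini_text.splitlines():
        if line.startswith(PREFIX):
            found = True
            # Drop any existing heap options so they are not duplicated.
            # Assumption: options are space-separated with no quoted spaces.
            options = [opt for opt in line[len(PREFIX):].split()
                       if not re.match(r"-Xm[sx]", opt)]
            options += [f"-Xms{xms}", f"-Xmx{xmx}"]
            line = PREFIX + " ".join(options)
        lines.append(line)
    if not found:
        raise ValueError(f"no line starting with {PREFIX!r}")
    return "\n".join(lines) + "\n"
```

Read Find Duplicates.ini into a string, pass it through `set_heap_options(text, "2g", "8g")`, write the result back, and then restart the service as described in step 4. Keep a backup of the original file.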

#### Encrypting the connection

The Find duplicates service can be set up to support an encrypted connection (HTTPS).

##### Configuring SSL
1. Navigate to the Find Duplicates installation directory and open the find_duplicates.properties file.

2. Set the server.ssl.enabled property to true.

3. Uncomment the remaining lines (remove the initial # character) and populate with your keystore settings:

server.ssl.enabled=true
server.ssl.key-alias=your_key_alias
server.ssl.key-store=file:///path/to/my/certificate.pfx
server.ssl.key-store-type=PKCS12


See below for an explanation of each property.

server.ssl.key-alias: The alias for the server key and certificate in the keystore. If the keystore only contains one key this property can be omitted.

server.ssl.key-store: The pathname of the keystore file in URI format. If the keystore file is in the Find duplicates installation directory, only the filename is required. To specify an absolute file path, the file URI scheme format must be used: file://[hostname]/path. The hostname may be omitted if the path is local to the Find Duplicates server. Any spaces in the path must be percent-encoded (a space becomes %20), and on Windows backslash characters must be replaced by forward slashes. For example, a local keystore file path of C:\Find duplicates\certificate.pfx would be formatted as file:///C:/Find%20duplicates/certificate.pfx.

server.ssl.key-store-type: The keystore certificate format. If omitted this defaults to JKS.

server.ssl.key-store-password: The password to access the keystore file.

server.ssl.key-password: The keystore certificate password. Typically this is identical to the keystore password, in which case it can be omitted.

4. Open the Find Duplicates.ini file and find the line starting with Virtual Machine Parameters=. Find the server.port setting and update it to your preferred port. The default SSL port is 8443.

5. Restart the Find Duplicates service.

Test the connection by browsing to https://localhost:{port}/swagger-ui/index.html using the port number configured above.
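The file URI format required by server.ssl.key-store is easy to get wrong by hand. As an illustration only, the following Python sketch converts a local Windows path into that format using the standard library. The function name is hypothetical, and the sketch covers local paths with a drive letter, not paths that include a hostname.

```python
from urllib.parse import quote


def keystore_uri(windows_path: str) -> str:
    """Convert a local Windows path to the file:/// form used by server.ssl.key-store."""
    # Backslashes become forward slashes, as the property description requires.
    forward = windows_path.replace("\\", "/")
    # Percent-encode spaces and other reserved characters; keep '/' and the drive colon.
    return "file:///" + quote(forward, safe="/:")


print(keystore_uri(r"C:\Find duplicates\certificate.pfx"))
# prints file:///C:/Find%20duplicates/certificate.pfx
```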

##### Using a private CA root certificate

The JRE used by Data Studio will validate certificate trust. By default, the certificate must have a valid trust chain referencing a public Certificate Authority (CA). If a private CA is used to create the certificate, it must be added to the Java truststore being used by Data Studio. This can be done by running the following command on the machine where Data Studio is installed:

"C:\Program Files\Experian\Aperture Data Studio {version number}\java64\jre\bin\keytool.exe" -import -trustcacerts -alias myCA -file "path\to\myCA.pem" -keystore "path\to\cacerts"

The default password for the cacerts truststore is changeit.
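Because the keytool path contains spaces, quoting mistakes are a common cause of failure when the command is scripted. One way to avoid them, sketched here in Python as an assumption rather than a supplied utility, is to pass the arguments as a list so that no shell quoting is needed. The function names are hypothetical, and keytool will still prompt for the truststore password.

```python
import subprocess


def keytool_import_args(keytool: str, ca_file: str, cacerts: str, alias: str = "myCA") -> list[str]:
    """Build the argument list for the keytool import command described above."""
    return [
        keytool,
        "-import", "-trustcacerts",
        "-alias", alias,
        "-file", ca_file,
        "-keystore", cacerts,
    ]


def import_private_ca(keytool: str, ca_file: str, cacerts: str, alias: str = "myCA") -> None:
    """Run keytool; raises CalledProcessError if keytool exits with an error."""
    # Arguments are passed as a list, so paths with spaces need no extra quotes.
    subprocess.run(keytool_import_args(keytool, ca_file, cacerts, alias), check=True)
```

Run the script from an elevated prompt if the cacerts file is in a protected location such as Program Files.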

#### Logging

Logging is handled by the log4j framework. The logging behavior can be changed by updating the deployed log4j2.xml file, as described below.

##### Log Levels

The log level is specified for each major component of the deduplication process within its own section of the log4j2 configuration file under the XML section <loggers>. For example:

<Logger name="com.experian.match.rest.api" level="WARN" additivity="false">
  <AppenderRef ref="findDuplicatesLog"/>
</Logger>
<Logger name="com.experian.match.actorsys" level="WARN" additivity="false">
  <AppenderRef ref="findDuplicatesCoreLog"/>
</Logger>


This specifies that the logs will have a log level of WARN, which is the recommended default for all components. Each component can have the logging level increased or decreased to change the granularity in the log file.

The components that may be individually configured are:

| Component | Description |
| --- | --- |
| com.experian.match.rest.api | The overall application's web controllers; the level set here is the default to be applied if none of the below are configured. |
| com.experian.match.actorsys | The core deduplication logic. |
| com.experian.standardisation | The API that interfaces to the standalone standardisation component. |

The log levels in the log4j2.xml file follow the hierarchy presented in the table below. Therefore, if you set the log level to DEBUG, you will get all the levels below DEBUG as well.

| Level | Description |
| --- | --- |
| ALL | All levels. |
| TRACE | Designates finer-grained informational events than DEBUG. |
| DEBUG | Granular information, use this level to debug a package. |
| INFO | Informational messages that highlight the progress of the application at coarse-grained level. |
| WARN | Potentially harmful situations. |
| ERROR | Error events that might still allow the application to continue running. |
| FATAL | Severe error events that will presumably lead the application to abort. |
| OFF | Suppress all logging. |

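To make the hierarchy concrete, the short Python sketch below models which event levels are written for a given configured level. It is an illustration of the threshold rule described above, not code from the product or from log4j, and the function name is hypothetical.

```python
# Levels ordered from most verbose to least verbose, matching the table above.
LEVELS = ["ALL", "TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL", "OFF"]


def emitted_levels(configured: str) -> list[str]:
    """Return the event levels written to the log for a configured logger level."""
    index = LEVELS.index(configured)
    # ALL and OFF are thresholds only, not levels that events are logged at.
    return [level for level in LEVELS[index:] if level not in ("ALL", "OFF")]


print(emitted_levels("DEBUG"))  # prints ['DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL']
print(emitted_levels("WARN"))   # prints ['WARN', 'ERROR', 'FATAL']
```
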
##### Logging outputs

By default, the Find duplicates server is set to output the logs to {your Find Duplicates installation directory}\logs within two separate log files called findDuplicates.log and findDuplicatesCore.log.

To change this, edit the section of the log4j2.xml file shown below:

<RollingFile name="findDuplicatesLog"
             fileName="${LOG_DIR}/findDuplicates.log"
             filePattern="${ARCHIVE}/findDuplicates.log.%d{yyyy-MM-dd}.gz">
  <PatternLayout pattern="${PATTERN}"/>
  <Policies>
    <TimeBasedTriggeringPolicy/>
    <SizeBasedTriggeringPolicy size="1 MB"/>
  </Policies>
  <DefaultRolloverStrategy max="2000"/>
</RollingFile>
<RollingFile name="findDuplicatesCoreLog"
             fileName="${LOG_DIR}/findDuplicatesCore.log"
             filePattern="${ARCHIVE}/findDuplicatesCore.log.%d{yyyy-MM-dd}-%i.gz">
  <PatternLayout pattern="${PATTERN}"/>
  <Policies>
    <TimeBasedTriggeringPolicy/>
    <SizeBasedTriggeringPolicy size="1000 MB"/>
  </Policies>
  <DefaultRolloverStrategy max="100"/>
</RollingFile>


Adjusting the fileName attribute allows you to change the name and location; for example, you may choose to output logging from all components into a single file, or different file names than the ones above.

The Find Duplicates Service runs using Spring Boot with an embedded Tomcat server. CORS settings for the embedded Tomcat server are passed via JVM parameters. For example, to allow connections to any resource from any domain, find the line starting with Virtual Machine Parameters= in the Find Duplicates.ini file in the Find Duplicates install directory and append the following value:

-Dcors.allowed.origins=*


Additional Tomcat CORS settings can be configured in the same way as shown above using the corresponding JVM parameter for each:

cors.allowed.origins
cors.allowed.methods
cors.preflight.maxage
cors.support.credentials


For more details on CORS configuration refer to the Tomcat CORS Filter documentation.
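As an illustration of how several of these settings combine into one string to append to the Virtual Machine Parameters= line, the following Python sketch builds the -D options from a dictionary. The function name and the example origin and methods are hypothetical. The accepted value formats are assumed to follow the Tomcat CORS Filter documentation and should be checked there.

```python
# The four setting names listed above; anything else is rejected to catch typos.
KNOWN_CORS_SETTINGS = {
    "cors.allowed.origins",
    "cors.allowed.methods",
    "cors.preflight.maxage",
    "cors.support.credentials",
}


def cors_jvm_options(settings: dict[str, str]) -> str:
    """Format CORS settings as space-separated -Dname=value JVM parameters."""
    unknown = set(settings) - KNOWN_CORS_SETTINGS
    if unknown:
        raise ValueError(f"unknown CORS setting(s): {sorted(unknown)}")
    return " ".join(f"-D{name}={value}" for name, value in settings.items())


# Hypothetical values: restrict access to one origin and two HTTP methods.
print(cors_jvm_options({
    "cors.allowed.origins": "https://studio.example.com",
    "cors.allowed.methods": "GET,POST",
}))
# prints -Dcors.allowed.origins=https://studio.example.com -Dcors.allowed.methods=GET,POST
```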

#### Configure a remote Standardize instance

1. Download and run the Experian Find Duplicates Setup executable on the machine where the Find Duplicates Service will be installed.
2. From the Choose Setup Type screen select Custom.
3. Click on the Standardize feature in the feature tree and choose to not install it on this machine by selecting Entire feature will be unavailable.
4. On the Configure Settings screen set the hostname and port for the remote Standardize service.
5. Click Next and then Install.
6. Download and run the Experian Find Duplicates Setup executable on the machine where the Standardize service will be installed.
7. Click on each feature except Standardize in the feature tree and choose to not install each by selecting Entire feature will be unavailable.
8. On the Configure Settings screen set the hostname and port for the remote Standardize service.
9. Click Next and then Install.
10. Verify the Standardize server is running by navigating to the version endpoint at http://{hostname}:{port}/api/version.

Once you've installed a separate instance of the Find duplicates server, you can configure it in Data Studio:

1. Go to Configuration > Step settings > Find duplicates.

2. Toggle on Remote server: Enabled.

3. Specify the following to connect to your Find duplicates server:

| Item | Description |
| --- | --- |
| Remote server: Hostname | The IP address or the machine name. Do not include the protocol (http://). |
| Remote server: Path | The base path of the Data Studio REST API. This should be left blank when using the Find Duplicates Windows service. |
| Remote server: Port | The port number used by the service. Default: 8080 |

4. If you configured the Find duplicates server to use HTTPS, toggle on Remote server: Use https (TLS/SSL) to encrypt the connection.

5. Click Test connection to ensure that the server information has been entered correctly and you can connect. If you receive a licensing error, this means that the server can be found and must now be licensed.

### Manual web server deployment

Follow the steps below to deploy Find duplicates manually to a separate web server:

Before installing a separate instance of the Find duplicates server, make sure you meet the requirements below.

1. .NET Core 2.1 Runtime or above.
2. Java JRE 8.
3. An application server capable of deploying a war file. We recommend Apache Tomcat.

The Find duplicates step uses an external service named GdqStandardizeServer to perform input standardization. This service has to be running for the Find duplicates step to work correctly.

#### Setup

Prior to installing or starting the service, you need to copy the Standardize directory:

1. On the machine where you have installed Data Studio, find the Standardize directory e.g. C:\Program Files\Experian\Aperture Data Studio {version number}\Standardize.
2. Copy the Standardize directory to a location of your choice on your remote machine. We recommend:
• C:\Program Files\Experian\Standardize if you are using Windows.
• /home/<user>/experian/Standardize if you are using Linux.

#### Installing the service

Once you have copied the Standardize directory, you need to install Experian.Gdq.Standardize.Web as a service. Follow the instructions below.

##### Windows

There are two methods for installing Standardize: the first uses PowerShell and the second uses the Windows command prompt. If PowerShell is disabled in your environment, use the command prompt option.

###### PowerShell
1. In an Administrator PowerShell prompt, navigate to the Standardize directory.
2. Run .\GdqStandardizeServiceManager.ps1 install. This will register the service in the Windows Services console.
3. Run .\GdqStandardizeServiceManager.ps1 start. This will attempt to start the service.
###### Windows command prompt
1. Run the Windows Command Prompt as an administrator.
2. Run SC create GdqStandardizeServer binpath= "{path to Standardize directory}\Experian.Gdq.Standardize.Web.exe action:run" start= "auto". This will register the service in the Windows Services console.
3. Run SC start GdqStandardizeServer. This will attempt to start the service.
##### Linux
1. Navigate to the Standardize directory.
2. As an administrator, run ./install_linux_prereqs.sh. This will install further prerequisites required for the service.
3. As an administrator, run ./deploy_gdqs.sh. This will install and start the service.

Install the latest stable version of your chosen application server. We recommend using Apache Tomcat.

#### Install Apache Tomcat

##### Prerequisites

You will need to install a 64-bit Java JDK or JRE before you can run Apache Tomcat. Tomcat 9.0 requires Java 8 or higher. Refer to the Apache Tomcat documentation for more information on Java version support.

##### Install on Windows
• You will need the 64-bit version of Tomcat. We recommend downloading and running the 64-bit Windows Service Installer.
2. Confirm that the Apache Tomcat service is running as the LOCAL SYSTEM account:
• Open the Windows service manager.
• Right-click the Apache Tomcat service, select Properties and then the Log On tab.
• Ensure the Local System account option is selected.
3. Navigate to http://localhost:8080 to confirm that Tomcat is running.
##### Install on Linux

• Tomcat's installation directory.
• Tomcat's installation sub-directories, including /conf, /webapps, /work, /temp, /db and /logs.

#### Deploy the web application

The Find duplicates server is deployed like any other web application by copying the supplied war file to the Apache Tomcat \webapps directory.

To check that your deployment was successful, go to: http://localhost:{port}/match-rest-api-{VersionNumber}/match/docs/index.html. The default Tomcat port is 8080.
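The copy step can be scripted if you deploy to more than one server. The following Python sketch is one possible approach and not a supplied tool. The function name is hypothetical, and the war file name and Tomcat location you pass in depend on your installation.

```python
import shutil
from pathlib import Path


def deploy_war(war_file: Path, tomcat_home: Path) -> Path:
    """Copy the war file into Tomcat's webapps directory and return the new path."""
    webapps = tomcat_home / "webapps"
    if not webapps.is_dir():
        raise FileNotFoundError(f"{webapps} does not exist; check the Tomcat installation path")
    target = webapps / war_file.name
    # copy2 also preserves the file's timestamps.
    shutil.copy2(war_file, target)
    return target
```

With automatic deployment enabled in Tomcat, the war file is unpacked into a folder of the same name under webapps, which is the folder name that appears in the URL above.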

#### Memory tuning

Ensure that you have as much memory allocated to the Tomcat JVM as possible.

The maximum heap size should be set as high as possible while allowing sufficient memory for the operating system and any other running processes.

##### Setting memory settings on Windows

To set the minimum and maximum values for the memory usage:

1. Navigate to the \bin folder in the Tomcat installation location.
2. Run tomcat9w.exe
3. In the Java tab, set the minimum and maximum memory pool.

#### Encrypting the connection

When deploying a remote instance of the Find duplicates server, it can be set up to support an encrypted connection (HTTPS). Follow the steps within your application web server documentation to achieve this.

##### Configuring SSL/TLS with Apache Tomcat

If using Apache Tomcat, refer to the SSL/TLS configuration how-to guide for detailed steps and supported protocols.

If you already have a PKCS12 (.pkcs12, .pfx or .p12) file containing the certificate chain and private key, this is the easiest way to configure SSL for Find duplicates using Tomcat. The PKCS12 file can be used directly as a keystore for Tomcat.

Locate Tomcat's main configuration file, /conf/server.xml, in the installation root directory. If this is the first time you are configuring TLS for Tomcat, you will first need to uncomment the SSL Connector element by removing the comment tags <!-- and -->. Then edit the Connector that listens on port 8443 by default using JSSE, for example:

<Connector port="8443" protocol="org.apache.coyote.http11.Http11NioProtocol"
SSLEnabled="true"
scheme="https"
secure="true"
keystoreFile="C:/path/to/my/certificate.pfx"
keystoreType="PKCS12"
clientAuth="false"
SSLProtocol="TLSv1.2"
/>


Test the connection by browsing to https://localhost:{port}/match-rest-api-{VersionNumber}/match/docs/index.html. The default Tomcat HTTPS port is 8443.
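If the HTTPS address does not respond, a common cause is that the SSL Connector in server.xml is still commented out. The Python sketch below, offered as an illustration with a hypothetical function name, lists the Connector elements that are active and have SSLEnabled set to true. Commented-out elements are ignored by the XML parser.

```python
import xml.etree.ElementTree as ET


def https_connectors(server_xml: str) -> list[dict[str, str]]:
    """Return the attributes of every active Connector with SSLEnabled="true"."""
    root = ET.fromstring(server_xml)
    return [
        dict(connector.attrib)
        for connector in root.iter("Connector")
        if connector.attrib.get("SSLEnabled", "").lower() == "true"
    ]
```

Read conf/server.xml into a string and pass it to `https_connectors`. An empty list means no active SSL Connector was found. Otherwise, check that the reported port and keystoreFile values match what you intended.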

##### Using a private CA root certificate

The JRE used by Data Studio will validate certificate trust. By default, the certificate must have a valid trust chain referencing a public Certificate Authority (CA). If a private CA is used to create the certificate, it must be added to the Java truststore being used by Data Studio. This can be done by running the following command on the machine where Data Studio is installed:

"C:\Program Files\Experian\Aperture Data Studio {version number}\java64\jre\bin\keytool.exe" -import -trustcacerts -alias myCA -file "path\to\myCA.pem" -keystore "path\to\cacerts"

The default password for the cacerts truststore is changeit.

#### Logging

The instructions in this section only apply if you wish to configure logging information in greater detail.

When the Find duplicates server has been deployed using Tomcat, the findDuplicates.log and findDuplicatesCore.log files can be found in CATALINA_HOME\logs. Logging is handled by the log4j framework. The logging behavior can be changed by updating the deployed log4j2.xml file, as described below.

On Linux, the log file path(s) must be specified explicitly in the log4j2.xml configuration file as shown below:

<Property name="LOG_DIR">${sys:catalina.home}/logs</Property>
<Property name="ARCHIVE">${sys:catalina.home}/logs/archive</Property>

##### Log Levels

The log level is specified for each major component of the deduplication process within its own section of the log4j2 configuration file under the XML section <loggers>. For example:

<Logger name="com.experian.match.rest.api" level="WARN" additivity="false">
  <AppenderRef ref="findDuplicatesLog"/>
</Logger>
<Logger name="com.experian.match.actorsys" level="WARN" additivity="false">
  <AppenderRef ref="findDuplicatesCoreLog"/>
</Logger>


This specifies that the logs will have a log level of WARN, which is the recommended default for all components. Each component can have the logging level increased or decreased to change the granularity in the log file.

The components that may be individually configured are:

| Component | Description |
| --- | --- |
| com.experian.match.rest.api | The overall application's web controllers; the level set here is the default to be applied if none of the below are configured. |
| com.experian.match.actorsys | The core deduplication logic. |
| com.experian.standardisation | The API that interfaces to the standalone standardisation component. |

The log levels in the log4j2.xml file follow the hierarchy presented in the table below. Therefore, if you set the log level to DEBUG, you will get all the levels below DEBUG as well.

| Level | Description |
| --- | --- |
| ALL | All levels. |
| TRACE | Designates finer-grained informational events than DEBUG. |
| DEBUG | Granular information, use this level to debug a package. |
| INFO | Informational messages that highlight the progress of the application at coarse-grained level. |
| WARN | Potentially harmful situations. |
| ERROR | Error events that might still allow the application to continue running. |
| FATAL | Severe error events that will presumably lead the application to abort. |
| OFF | Suppress all logging. |

##### Logging outputs

By default, the Find duplicates server is set to output the logs to CATALINA_HOME\logs within two separate log files called findDuplicates.log and findDuplicatesCore.log.

To change this, edit the section of the log4j2.xml file shown below:

<RollingFile name="findDuplicatesLog"
             fileName="${LOG_DIR}/findDuplicates.log"
             filePattern="${ARCHIVE}/findDuplicates.log.%d{yyyy-MM-dd}.gz">
  <PatternLayout pattern="${PATTERN}"/>
  <Policies>
    <TimeBasedTriggeringPolicy/>
    <SizeBasedTriggeringPolicy size="1 MB"/>
  </Policies>
  <DefaultRolloverStrategy max="2000"/>
</RollingFile>
<RollingFile name="findDuplicatesCoreLog"
             fileName="${LOG_DIR}/findDuplicatesCore.log"
             filePattern="${ARCHIVE}/findDuplicatesCore.log.%d{yyyy-MM-dd}-%i.gz">
  <PatternLayout pattern="${PATTERN}"/>
  <Policies>
    <TimeBasedTriggeringPolicy/>
    <SizeBasedTriggeringPolicy size="1000 MB"/>
  </Policies>
  <DefaultRolloverStrategy max="100"/>
</RollingFile>


Adjusting the fileName attribute allows you to change the name and location; for example, you may choose to output logging from all components into a single file, or different file names than the ones above.

Once you've installed a separate instance of the Find duplicates server, you can configure it in Data Studio:

1. Go to Configuration > Step settings > Find duplicates.

2. Toggle on Remote server: Enabled.

3. Specify the following to connect to your Find duplicates server:

| Item | Description |
| --- | --- |
| Remote server: Hostname | The IP address or the machine name. Do not include the protocol (http://). |
| Remote server: Path | The location of the folder that was created when the war file was read. Default: match-rest-api-{VersionNumber} |
| Remote server: Port | The port number used by the service. Default: 8080 |

4. If you configured the Find duplicates server to use HTTPS, toggle on Remote server: Use https (TLS/SSL) to encrypt the connection.

5. Click Test connection to ensure that the server information has been entered correctly and you can connect. If you receive a licensing error, this means that the server can be found and must now be licensed.
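To see how the hostname, path, port and HTTPS settings relate to the address you tested in a browser, the Python sketch below combines them into a base URL. It illustrates the relationship only. How Data Studio assembles the address internally is an assumption, and the function name, host names and version number are hypothetical.

```python
def remote_server_url(hostname: str, port: int = 8080, path: str = "", use_https: bool = False) -> str:
    """Combine the Remote server settings into the base URL they describe."""
    if "://" in hostname:
        # Mirrors the guidance in the table: the hostname must not include the protocol.
        raise ValueError("Hostname must not include the protocol (http:// or https://)")
    scheme = "https" if use_https else "http"
    base = f"{scheme}://{hostname}:{port}"
    path = path.strip("/")
    return f"{base}/{path}" if path else base


# Tomcat deployment with a hypothetical version number in the path.
print(remote_server_url("dedupe01", 8080, "match-rest-api-1.0.0"))
# prints http://dedupe01:8080/match-rest-api-1.0.0
```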