The tool described in this article can be used by a Member Association to verify their current level of data duplication. "Standardization Prototype"(named so for historical reasons) uses exactly the same algorithms of duplicate detection as Connect ID Service. It takes a file as an input, processes it, and produces a text output specifying all the potential duplicates with scoring. Note that the tool works locally only - it will not allow to find international duplicates - only local ones.
Tool may be treated as a preliminary data quality assessment. The next step - when the percentage of local duplicates is acceptable - will be to load the data to Pre-production environment and assess how many international duplicates there are.
The following is the list of main features of the Standardization Prototype/Console:
- tool finds duplicated data
- tool has a scoring mechanism assigning a higher score to records which are closer (i.e. more similar) than others
- tool runs in memory (does not need an Internet access)
- tool can read data from the Excel file of the following format
- First Name
- Last Name
- Birth Date
- tool produces (to both console and output file) an execution report
- tool hashes input data using SHA-512 algorithm
INSTALLATION AND RUNNING
Since the console spawns its child ElasticSearch server, you need to have:
1. Java 8 or higher (tested with Java 11)
2. Elasticsearch 6.8.7 (https://www.elastic.co/downloads/past-releases/elasticsearch-6-8-7 )
3. Elasticsearch ICU Plugin (https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-icu/analysis-icu-6.8.7.zip)
The console searches for a Java installation by checking for the existence of 2 things:
- Presence of JAVA_HOME environmental variable, which should point to your Java installation (e.g. C:\jdk-11.0.5)
- PATH environmental variable being present (containing an entry to JRE's bin directory, e.g. C:\jdk-11.0.5\bin\)
If any of these is not present, then the console tries locate Java in the console's directory as follows: given that the console is located in C:\Fifa.ConnectId.Console\, then it assumes that JRE (Java Runtime Environment) is located in C:\Fifa.ConnectId.Console\jre (so no JRE version numbers allowed).
The Elasticsearch has to be downloaded into the console's directory in order to work:
C:\Fifa.ConnectId.Console\elasticsearch\ (no version numbers allowed!, see the screenshot below for reference).
The last thing is to unzip and copy the ICU plugin to Elasticsearch'es plugin directory as follows:
The current version of the tool is delivered in the form of zip-packed set of files containing an executable file <Fifa.ConnectId.Console.exe >.
The following steps are needed to install and use the tool:
- download the current version of the zip-packed file, e.g. <Fifa.ConnectId.Console-1.0.1244.zip > to disk
- unpack the file
RUNNING THE TOOL
The tool should can be run from the console by executing file <Fifa.ConnectId.Console.exe> with a number of parameters. If the program is run without parameters, a parameter guide will be printed on the console, providing details on the parameters usage.
In order to run the program and see results, at least one parameter should be provided – Excel file with input data.
- run the following command using command line:
Fifa.ConnectId.Console.exe –n -f "<input_file_name>"
- this will execute the program in memory and the inbuilt database with first name variants will be used
Note that each run of the tool is a clean run, i.e. DB is always empty at the beginning of the process. That includes both the database with the data and self-learning DB. Unless “-l” option is used, the input file is treated both as a source of records to compare and a feed to self-learning DB. Effectively it means that the program searches for duplicates within the input file.
During the execution, program produces logs onto the console. Console logs are mostly used to show progress to the user. Full analysis of the results should be done based on the output file produced. Output file name has format “output_" + <date> + "_" + <time> and the file contains the following information:
- summary of options the program has been run with
- list of all transformations used by the algorithm with default scores
- execution time of certain steps (with total)
- number of duplicates found
- list of all duplicates found
total execution time does not equal the sum of the steps displayed. This is expected as it is only the most important steps that are listed with their execution times - not all of them.
the program is able to import date in any format as long as the column is formatted as "Date" in Excel. If the column is not a date, program will accept the following date formats:
If the program does not run (or starts and produces an error message), make sure that:
- Microsoft Visual C++ 2010 x64 Redistributable library is installed (vcredist_x64.exe) from https://www.microsoft.com/en-us/download/details.aspx?id=14632
In case of an error please locate the following log files and send them to Making Waves:
It is also possible that you see another console window blink and disappear and the main console program will end with an error (e.g. timeout).
The possible reason is the lack of JAVA_HOME environmental variable (See Prerequisites section above). Set it and check if the program runs.
If you tried everything, do the following:
- find Fifa.ConnectId.Console.exe.config file
- change <add key="elasticInNewWindow" value="true" /> to <add key="elasticInNewWindow" value="false" />
- start the program again and you will see an error message in the main console window. Send it to email@example.com and we will help you
ADVANCED TESTING AND RESULTS ANALYSIS (SECTION FOR DEVELOPERS)
There is a number of additional features which can be used for testing and troubleshooting purposes:
- tool can learn from one file (self-learning DB) but find duplicates in another
- tool can be configured to ignore birth dates
- tool can be taught that certain first names or last names are distinct despite looking or sounding similar
- tool produces a csv file with the list of all duplicates found
FILE WITH DUPLICATES
In order to speed up the analysis of the results returned by the tool, an additional file is produced as an output. It contains the list of all duplicates found. Having a csv format, the file is easy to understand and can be further processed in Excel.
The most interesting feature of this file, however, is that it is in the format that can be understood by the tool. An example usage scenario is the following:
- Full deduplication is run on the file containing e.g. 200 000 records
- An output file is produced containing 1000 records regarded as potential duplicates
- If the number of duplicates is considered to be too high (too many false positives), additional testing can be done using the output file only.
- Deduplication with some of the transformations turned off is run on the output file (with 1000 records)
- Observations can be made if less duplicates are found
It might be expected that if the full deduplication process is run once on the input file with 200 000 records and then run again on the output file with 1000 potential duplicates, the result will be exactly the same (same algorithm gives us the same 1000 potential duplicates). That is not always the case, however, as the tool uses self-learning algorithms. With an input of 200 000 records, the tool can learn something that it won’t learn with the input as small as 1000 records.
In order to reduce the number of false positives, a feature was introduced that allows to tell the tool that certain pairs of first or last names, although similar, are in fact distinct. If such a configuration is applied, the names that could normally be identified as duplicates with a certain score, will be regarded as different ones. One example could be “Mirek” and “Marek”. These are two distinct names in Polish, however a number of transformations would consider them to be duplicate candidates, for example fuzzy search, as levenshtein distance between them equals 1. Double metaphone transformation would also regard them as duplicates as their pronunciation is similar.
For the sake of the prototype only, exclusions configuration can be added not as a separate file, but as an additional tab in the input file. The Excel file that needs to be processed would then contain the following tabs:
- [mandatory] tab with input data
- [optional] tab containing pairs of distinct first names
- [optional] tab containing pairs of distinct last names
For the exact format of the file, see an example file: “example_DistinctNames.xlsx”.
Tool introduces the scoring mechanism. The list of potential duplicates is sorted by the value between 0 and 1. The higher the value, the most probable the record is to be a duplicate. For example, if a potential duplicate was found based on the exact match transformation, it will have the score of 1 – which should be understood as “it is a 100% duplicate”.
The tool can be configured to ignore duplicate candidates below a certain score. Example usage:
Fifa.ConnectId.Console.exe –t 1.0 -f "<input_file_name>"
If the program is run with the options above it will only find exact matching records.
IGNORE BIRTH DATES (-B)
This option should only be used for testing purposes. It makes the tool ignore the column with birth dates completely – that allows the tester to focus on deduplication for first and last names. The rationale behind it, is that in most cases birth date is a very good differentiator, i.e. if the birth date column is kept, a very low number of duplicates is identified. Ignoring it will increase the number of duplicates.
Please note, that using (“-b”) option is equivalent to making all the values in the birth date column identical.
[SELF] LEARNING FROM A DIFFERENT FILE (-L)
Input file takes a double role in processing:
- it serves as set of records to find duplicates in
- but it is also used for teaching the self-learning DB
In certain cases a tester might want to control how teaching the self-learning DB influences the results of processing a specific input file. Option (“-l”) will come handy in such a situation. Example usage:
Fifa.ConnectId.Console.exe –l <self_learning_file> –f <input_file>
In order for the duplicate detection algorithm to be anything beyond simple exact matching of first name, last name and birth date, the tool performs a series of transformations on the input data, generating hashes after each combination of transformations. As a result, multiple hashes are stored for each record, along with an information, which transformations led to their generation. Different combinations of transformations generate a different score.
For the testing purposes it is possible to turn off certain transformations with (“-d”) option. Example:
Fifa.ConnectId.StandardizationPrototype.Console.exe –d Translate SelfLearning -f "<input_file_name>"
This will disable 2 transformations (details in the table below).
The table below shows all the transformations that are currently supported along with examples.
|Transformation code name||Description||Example|
|none||Two of the transformations are mandatory. They are always applied to the input data and cannot be turned off with “-D” option. These are:||„John SMIth” -> „john smith”|
„ joHn smiTH „ -> „john smith”
Produces version of first and last name that is lower case. Accounts for all lower/upper case mistakes in typing.
|JOHN SmiTh -> john smith|
|“ John Smith “ -> “John Smith”|
|DoubleMetaphone||Produces representation of the name which is based on the pronunciation rather than spelling. Similarly pronounced names will have a similar representation.||Steven Merton -> STFN MRTN|
Stephen Martin -> STFN MRTN
|FirstNameVariants||Produces a number of variants of the first name based on relations between first names.||Botros à Peter [ENG], Peter [GER], Pierre [FRA], etc.|
|NormalizeChars||Removes diacritics.||Dzierżawski -> Dzierzawski|
|SelfLearning||Compares the name to the existing database. Produces variants of the name that already exist in the DB which are “close” to the original in terms of Levenshtein distance. The more frequent a certain variant found in the self-learning DB is, the higher the score.|
Algorithm performs the same operation for both first name and last name separately.
This transformation will account for simple spelling mistakes.
A customized version of Levenshtein distance is used that takes into consideration characteristics of the input (e.g. length of the name).
|Tohmas -> Thomas|
Mart -> Marta
|SwapBirthDate||Produces a variant of writing birth date to account for swapping days with months.||12/04/1978 -> 04/12/1978|
|SwapNames||Produces a variant with first and last names swapped to account for such mistakes.||John Smith -> Smith John|
|Translate||Produces a variant of the name with the first name translated into English.|
The final implementation of this method may be different – a number of approaches needs to be tested in order to get meaningful results.
E.g. Translation proves to work better than Transliteration in some cases (Hebrew).
|Mateusz -> Matthew|
אביחי -> Avihai
|Transliterate||Produces a transliterated version of the first and last name.||Андрей Печонкин -> Andrey Pechonkin|
EXPERIMENTING WITH SCORING ALGORITHM
For the testing purposes only, it is possible to modify scoring values for different transformations. The configuration file can be located in the application folder:
File can be opened in a text editor and scoring is to be found in <appSettings> node. For example:
<add key="TransformationScore.CharFoldingTransformation" value="0.92" />
For the troubleshooting purposes, scoring values are also displayed in the output file.
An experimental flag has been introduced in order to limit memory consumption. When “-o” is specified (turned off by default), the data will not be stored in a hashed format but in open text. That allows to decrease memory consumption by around 40%. This flag should only be used for testing purposes.