The tool described in this article can be used by a Member Association to verify their current level of data duplication. "Standardization Prototype"(named so for historical reasons) uses exactly the same algorithms of duplicate detection as Connect ID Service. It takes a file as an input, processes it, and produces a text output specifying all the potential duplicates with scoring. Note that the tool works locally only - it will not allow to find international duplicates - only local ones.

Tool may be treated as a preliminary data quality assessment. The next step - when the percentage of local duplicates is acceptable - will be to load the data to Pre-production environment and assess how many international duplicates there are.


FEATURES

The following is the list of main features of the Standardization Prototype/Console:

  • tool finds duplicated data
  • tool has a scoring mechanism assigning a higher score to records which are closer (i.e. more similar) than others
  • tool runs in memory (does not need an Internet access)
  • tool can read data from the Excel file of the following format
    • ID
    • First Name
    • Last Name
    • Birth Date
  • tool produces (to both console and output file) an execution report
  • tool hashes input data using SHA-512 algorithm

INSTALLATION AND RUNNING

PREREQUISITES

Since the console spawns its child ElasticSearch server, you need to have:

1. Java 8 or higher (tested with Java 11)

2. Elasticsearch 6.8.7 (https://www.elastic.co/downloads/past-releases/elasticsearch-6-8-7 )

3. Elasticsearch ICU Plugin (https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-icu/analysis-icu-6.8.7.zip)


The console searches for a Java installation by checking for the existence of 2 things:

- Presence of JAVA_HOME environmental variable, which should point to your Java installation (e.g. C:\jdk-11.0.5)

- PATH environmental variable being present (containing an entry to JRE's bin directory, e.g. C:\jdk-11.0.5\bin\)


If any of these is not present, then the console tries locate Java in the console's directory as follows: given that the console is located in C:\Fifa.ConnectId.Console\, then it assumes that JRE (Java Runtime Environment) is located in C:\Fifa.ConnectId.Console\jre (so no JRE version numbers allowed).


The Elasticsearch has to be downloaded into the console's directory in order to work:

C:\Fifa.ConnectId.Console\elasticsearch\ (no version numbers allowed!, see the screenshot below for reference).


The last thing is to unzip and copy the ICU plugin to Elasticsearch'es plugin directory as follows:


INSTALLATION PROCESS

The current version of the tool is delivered in the form of zip-packed set of files containing an executable file <Fifa.ConnectId.Console.exe >.

The following steps are needed to install and use the tool:

  • download the current version of the zip-packed file, e.g. <Fifa.ConnectId.Console-1.0.1244.zip > to disk
  • unpack the file

RUNNING THE TOOL

The tool should can be run from the console by executing file <Fifa.ConnectId.Console.exe> with a number of parameters. If the program is run without parameters, a parameter guide will be printed on the console, providing details on the parameters usage.

In order to run the program and see results, at least one parameter should be provided – Excel file with input data.

Example:

  • run the following command using command line:

Fifa.ConnectId.Console.exe –n -f "<input_file_name>"

  • this will execute the program in memory and the inbuilt database with first name variants will be used


EXECUTION TIME

Note that each run of the tool is a clean run, i.e. DB is always empty at the beginning of the process. That includes both the database with the data and self-learning DB. Unless “-l” option is used, the input file is treated both as a source of records to compare and a feed to self-learning DB. Effectively it means that the program searches for duplicates within the input file.

During the execution, program produces logs onto the console. Console logs are mostly used to show progress to the user. Full analysis of the results should be done based on the output file produced. Output file name has format “output_" + <date> + "_" + <time> and the file contains the following information:

  • summary of options the program has been run with
  • list of all transformations used by the algorithm with default scores
  • execution time of certain steps (with total)
  • number of duplicates found
  • list of all duplicates found

REMARKS

total execution time does not equal the sum of the steps displayed. This is expected as it is only the most important steps that are listed with their execution times - not all of them.

the program is able to import date in any format as long as the column is formatted as "Date" in Excel. If the column is not a date, program will accept the following date formats:

  • "dd-MM-yyyy"
  • "dd.MM.yyyy"
  • "yyyy-MM-dd"
  • "yyyy.MM.dd"

TROUBLESHOOTING

If the program does not run (or starts and produces an error message), make sure that:

In case of an error please locate the following log files and send them to Making Waves:

  • /Logs/Fifa.ConnectId.Console.log
  • /elasticsearch/logs/elasticsearch.log


It is also possible that you see another console window blink and disappear and the main console program will end with an error (e.g. timeout).

The possible reason is the lack of JAVA_HOME environmental variable (See Prerequisites section above). Set it and check if the program runs.


If you tried everything, do the following:

  • find Fifa.ConnectId.Console.exe.config file
  • change <add key="elasticInNewWindow" value="true" /> to <add key="elasticInNewWindow" value="false" />
  • start the program again and you will see an error message in the main console window. Send it to support.id@fifa.org and we will help you


ADVANCED TESTING AND RESULTS ANALYSIS (SECTION FOR DEVELOPERS)

There is a number of additional features which can be used for testing and troubleshooting purposes:

  • tool can learn from one file (self-learning DB) but find duplicates in another
  • tool can be configured to ignore birth dates
  • tool can be taught that certain first names or last names are distinct despite looking or sounding similar
  • tool produces a csv file with the list of all duplicates found

FILE WITH DUPLICATES

In order to speed up the analysis of the results returned by the tool, an additional file is produced as an output. It contains the list of all duplicates found. Having a csv format, the file is easy to understand and can be further processed in Excel.

The most interesting feature of this file, however, is that it is in the format that can be understood by the tool. An example usage scenario is the following:

  • Full deduplication is run on the file containing e.g. 200 000 records
  • An output file is produced containing 1000 records regarded as potential duplicates
  • If the number of duplicates is considered to be too high (too many false positives), additional testing can be done using the output file only.
  • Deduplication with some of the transformations turned off is run on the output file (with 1000 records)
  • Observations can be made if less duplicates are found

It might be expected that if the full deduplication process is run once on the input file with 200 000 records and then run again on the output file with 1000 potential duplicates, the result will be exactly the same (same algorithm gives us the same 1000 potential duplicates). That is not always the case, however, as the tool uses self-learning algorithms. With an input of 200 000 records, the tool can learn something that it won’t learn with the input as small as 1000 records.

EXCLUSIONS FEATURE

In order to reduce the number of false positives, a feature was introduced that allows to tell the tool that certain pairs of first or last names, although similar, are in fact distinct. If such a configuration is applied, the names that could normally be identified as duplicates with a certain score, will be regarded as different ones. One example could be “Mirek” and “Marek”. These are two distinct names in Polish, however a number of transformations would consider them to be duplicate candidates, for example fuzzy search, as levenshtein distance between them equals 1. Double metaphone transformation would also regard them as duplicates as their pronunciation is similar.

For the sake of the prototype only, exclusions configuration can be added not as a separate file, but as an additional tab in the input file. The Excel file that needs to be processed would then contain the following tabs:

  • [mandatory] tab with input data
  • [optional] tab containing pairs of distinct first names
  • [optional] tab containing pairs of distinct last names

For the exact format of the file, see an example file: “example_DistinctNames.xlsx”.

SCORING

Tool introduces the scoring mechanism. The list of potential duplicates is sorted by the value between 0 and 1. The higher the value, the most probable the record is to be a duplicate. For example, if a potential duplicate was found based on the exact match transformation, it will have the score of 1 – which should be understood as “it is a 100% duplicate”.

The tool can be configured to ignore duplicate candidates below a certain score. Example usage:


Fifa.ConnectId.Console.exe –t 1.0 -f "<input_file_name>"


If the program is run with the options above it will only find exact matching records.

IGNORE BIRTH DATES (-B)

This option should only be used for testing purposes. It makes the tool ignore the column with birth dates completely – that allows the tester to focus on deduplication for first and last names. The rationale behind it, is that in most cases birth date is a very good differentiator, i.e. if the birth date column is kept, a very low number of duplicates is identified. Ignoring it will increase the number of duplicates.

Please note, that using (“-b”) option is equivalent to making all the values in the birth date column identical.

[SELF] LEARNING FROM A DIFFERENT FILE (-L)

Input file takes a double role in processing:

  • it serves as set of records to find duplicates in
  • but it is also used for teaching the self-learning DB

In certain cases a tester might want to control how teaching the self-learning DB influences the results of processing a specific input file. Option (“-l”) will come handy in such a situation. Example usage:

Fifa.ConnectId.Console.exe –l <self_learning_file> –f <input_file>


TRANSFORMATIONS (-D)

In order for the duplicate detection algorithm to be anything beyond simple exact matching of first name, last name and birth date, the tool performs a series of transformations on the input data, generating hashes after each combination of transformations. As a result, multiple hashes are stored for each record, along with an information, which transformations led to their generation. Different combinations of transformations generate a different score.

For the testing purposes it is possible to turn off certain transformations with (“-d”) option. Example:


Fifa.ConnectId.StandardizationPrototype.Console.exe –d Translate SelfLearning -f "<input_file_name>"


This will disable 2 transformations (details in the table below).

The table below shows all the transformations that are currently supported along with examples.

Transformation code nameDescriptionExample
noneTwo of the transformations are mandatory. They are always applied to the input data and cannot be turned off with “-D” option. These are:
  • LowerCase
  • Trimming
This approach allows to give the maximum score (i.e. 1) for names with most basic mistakes – case or adding a white space.
„John   SMIth” -> „john smith”
„ joHn smiTH  „ -> „john smith”
LowerCase[mandatory transformation]
Produces version of first and last name that is lower case. Accounts for all lower/upper case mistakes in typing.
JOHN SmiTh -> john smith
Trimming[mandatory transformation]
Removes whitespaces.
“  John   Smith  “ -> “John Smith”
CharFoldingtbd
DoubleMetaphoneProduces representation of the name which is based on the pronunciation rather than spelling. Similarly pronounced names will have a similar representation.Steven Merton -> STFN MRTN
Stephen Martin -> STFN MRTN
FirstNameVariantsProduces a number of variants of the first name based on relations between first names.
 
Botros à Peter [ENG], Peter [GER], Pierre [FRA], etc.
NormalizeCharsRemoves diacritics.Dzierżawski -> Dzierzawski
SelfLearningCompares the name to the existing database. Produces variants of the name that already exist in the DB which are “close” to the original in terms of Levenshtein distance. The more frequent a certain variant found in the self-learning DB is, the higher the score.
Algorithm performs the same operation for both first name and last name separately.
This transformation will account for simple spelling mistakes.
A customized version of Levenshtein distance is used that takes into consideration characteristics of the input (e.g. length of the name).
Tohmas -> Thomas
Mart -> Marta
SwapBirthDateProduces a variant of writing birth date to account for swapping days with months.12/04/1978 -> 04/12/1978
SwapNamesProduces a variant with first and last names swapped to account for such mistakes.John Smith -> Smith John
TranslateProduces a variant of the name with the first name translated into English.
The final implementation of this method may be different – a number of approaches needs to be tested in order to get meaningful results.
E.g. Translation proves to work better than Transliteration in some cases (Hebrew).
Mateusz -> Matthew
אביחי -> Avihai
TransliterateProduces a transliterated version of the first and last name.Андрей Печонкин -> Andrey Pechonkin

EXPERIMENTING WITH SCORING ALGORITHM

For the testing purposes only, it is possible to modify scoring values for different transformations. The configuration file can be located in the application folder:

Fifa.ConnectId.Console.exe.config

File can be opened in a text editor and scoring is to be found in <appSettings> node. For example:


<add key="TransformationScore.CharFoldingTransformation" value="0.92" />


For the troubleshooting purposes, scoring values are also displayed in the output file.

MEMORY CONSUMPTION

An experimental flag has been introduced in order to limit memory consumption. When “-o” is specified (turned off by default), the data will not be stored in a hashed format but in open text. That allows to decrease memory consumption by around 40%. This flag should only be used for testing purposes.