Introduction

During the person registration scenario, Connect ID performs a duplicate check based on first name, last name, date of birth and gender. If potential duplicates are detected, the system tries to contact all MAs "owning" the potential duplicates to get personal details. In most implementations, the detailed personal information coming from other MAs is displayed to the human operator so that they can determine whether the record is a duplicate or not.

Some MAs implementing Connect ID have chosen to also show the user information about the duplicate scoring (the similarity between the original record and a potential duplicate found). This information comes not from other MAs but directly from Connect ID:

  • score
  • queried hash
  • queried date of birth
  • matched hash
  • matched date of birth

This article explains how to interpret the values of these fields (e.g. "Exact" or "SwapNames") should you need to show them to the end user. An example of what can be shown to the user can be found in this article.


The first chapter discusses the details of the duplicate detection and scoring algorithm, while the next one lists all potential values of the fields above with explanations. If you don't need to understand how the algorithm works, jump to the chapter "All variant types explained".


Duplicate detection and scoring algorithm

To better understand the meaning of different fields, we need to first look at how Connect ID stores data and performs duplicate detection.


First, we will assume that Person A has already been registered in Connect ID. Effectively, this means that a number of variants of Person A's name and date of birth have been stored. For example, if we register "John Smith" born on "1990-05-31", the system will store the following variants of the name and date of birth (names are hashed, so we will sometimes refer to them as hashes):


| Name | Variant/hash | Explanation | Date of birth | Explanation |
|---|---|---|---|---|
| John Smith | john smith | Type of hash: "Exact". The name is stored in lower case and this hash has a score of 1.0. | 1990-05-31 | Type: "Original". The exact date of birth is stored alongside the hash, with a score of 1.0. The total score for this variant is 1.0 x 1.0 = 1.0. |
| | as above | as above | 1990-05-30 | Type: "OneDayBefore". A date differing by one day is stored alongside the hash, with a score of 0.6. The total score for this variant is 1.0 x 0.6 = 0.6. |
| | as above | as above | 1990-06-01 | Type: "OneDayAfter". A date differing by one day is stored alongside the hash, with a score of 0.6. The total score for this variant is 1.0 x 0.6 = 0.6. |
| | smith john | Type of hash: "SwapNames". The algorithm stores a hash for the swapped names and gives it a score of 0.9. | 1990-05-31 | Type: "Original". The exact date of birth is stored alongside the hash, with a score of 1.0. The total score for this variant is 0.9 x 1.0 = 0.9. |
| | as above | as above | 1990-06-01 | Type: "OneDayAfter". A date differing by one day is stored alongside the hash, with a score of 0.6. The total score for this variant is 0.9 x 0.6 = 0.54. |
| | (...) | (...) | (...) | (...) |


Many more variants (up to 1000) are stored in Connect ID in order to detect duplicates better. The ones given above are enough, however, to explain how duplicates are detected and how the score is calculated.
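The variant generation described above can be illustrated with a minimal sketch. It is not the Connect ID implementation: it covers only the "Exact" and "SwapNames" name variants and the "Original", "OneDayBefore" and "OneDayAfter" date variants, it keeps names in plain text instead of hashing them, and the names `Variant` and `generate_variants` are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Partial scores as quoted in the table above. The real system stores many
# more variant types; only these five are sketched here.
NAME_SCORES = {"Exact": 1.0, "SwapNames": 0.9}
DATE_SCORES = {"Original": 1.0, "OneDayBefore": 0.6, "OneDayAfter": 0.6}


@dataclass(frozen=True)
class Variant:
    name: str         # normalized name (Connect ID stores a hash instead)
    name_type: str    # e.g. "Exact"
    birth_date: date
    date_type: str    # e.g. "Original"
    score: float      # name score multiplied by date score


def generate_variants(first_name, last_name, birth_date):
    """Build every combination of the sketched name and date variants."""
    first, last = first_name.strip().lower(), last_name.strip().lower()
    names = {"Exact": f"{first} {last}", "SwapNames": f"{last} {first}"}
    dates = {
        "Original": birth_date,
        "OneDayBefore": birth_date - timedelta(days=1),
        "OneDayAfter": birth_date + timedelta(days=1),
    }
    return [
        Variant(name, name_type, day, date_type,
                round(NAME_SCORES[name_type] * DATE_SCORES[date_type], 4))
        for name_type, name in names.items()
        for date_type, day in dates.items()
    ]


variants = generate_variants("John", "Smith", date(1990, 5, 31))
for v in variants:
    print(v.name, v.name_type, v.birth_date, v.date_type, v.score)
```

For "John Smith" born on 1990-05-31 this produces six variants, including "smith john" / 1990-06-01 with a total score of 0.54, as in the table.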


Now let's consider Person B, whom we want to register. As before, the system will create a number of variants for Person B and then try to register the person. The algorithm will try to match any variant of Person B with any of the existing variants. If a duplicate is found, Connect ID will return the following information:

  • queried hash - the type of the hash of Person B for which a duplicate has been found
  • queried date of birth - the type of the date of birth variant of Person B for which a duplicate has been found
  • matched hash - the type of the hash of the duplicate found that matched Person B's hash
  • matched date of birth - the type of the date of birth variant of the duplicate found that matched Person B's date of birth
  • score - the product of the partial scores of all of the above (a value between 0 and 1).


Let's look at a couple of examples to understand this better. We will continue using the example of Person A and the hashes above. We will consider several different Persons B, all of which match Person A one way or another.


| Person B example | Queried hash | Queried date of birth | Matched hash | Matched date of birth | Score |
|---|---|---|---|---|---|
| John Smith, 1990-05-31 | "Exact", because it was the exact hash ("john smith") for which a duplicate has been found. | "Original", because it was the exact date of birth for which a duplicate has been found. | "Exact", because it was the exact hash of Person A that was found. | "Original", because it was the exact date of birth of Person A that was found. | 1.0 x 1.0 x 1.0 x 1.0 = 1.0 |
| John Smith, 1990-06-01 | as above | "Original", because a duplicate has been found for the date "1990-06-01". | as above | "OneDayAfter", because when the system searched for "1990-06-01", it found a variant of Person A's date of birth: the date moved by one day. | 1.0 x 1.0 x 1.0 x 0.6 = 0.6 |
| John Smith, 1990-06-02 | as above | "OneDayBefore", because a duplicate has been found when the system searched for the non-exact variant "1990-06-01". | as above | "OneDayAfter", because when the system searched for "1990-06-01", it found a variant of Person A's date of birth: the date moved by one day. | 1.0 x 0.6 x 1.0 x 0.6 = 0.36 |
| Smith John, 1990-06-02 | "Exact", because it was the exact hash ("smith john") for which a duplicate has been found. | as above | "SwapNames", because it was the non-exact swapped-names hash of Person A that matched "smith john". | as above | 1.0 x 0.6 x 0.9 x 0.6 = 0.324 |
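The four examples above can be reproduced with a short sketch. It is only an illustration under stated assumptions: just two name variants and three date variants are generated, names are compared in plain text rather than as hashes, and the helper names `build_variants` and `best_match` are hypothetical. When two shared variants give the same total score, the sketch prefers the one where the queried side is less transformed; this tie-break is an assumption chosen to agree with the table, not documented Connect ID behaviour.

```python
from datetime import date, timedelta
from itertools import product

# Partial scores quoted earlier in this article.
NAME_SCORES = {"Exact": 1.0, "SwapNames": 0.9}
DATE_SCORES = {"Original": 1.0, "OneDayBefore": 0.6, "OneDayAfter": 0.6}


def build_variants(first_name, last_name, birth_date):
    """Map each (name, date) variant to its (hash type, date type)."""
    first, last = first_name.strip().lower(), last_name.strip().lower()
    names = {"Exact": f"{first} {last}", "SwapNames": f"{last} {first}"}
    dates = {
        "Original": birth_date,
        "OneDayBefore": birth_date - timedelta(days=1),
        "OneDayAfter": birth_date + timedelta(days=1),
    }
    return {(names[nt], dates[dt]): (nt, dt) for nt, dt in product(names, dates)}


def best_match(stored, queried):
    """Return the best-scoring shared variant, or None if nothing matches."""
    candidates = []
    for key in stored.keys() & queried.keys():
        matched_hash, matched_dob = stored[key]
        queried_hash, queried_dob = queried[key]
        queried_part = NAME_SCORES[queried_hash] * DATE_SCORES[queried_dob]
        matched_part = NAME_SCORES[matched_hash] * DATE_SCORES[matched_dob]
        result = {
            "queriedHash": queried_hash,
            "queriedDateOfBirth": queried_dob,
            "matchedHash": matched_hash,
            "matchedDateOfBirth": matched_dob,
            "score": round(queried_part * matched_part, 6),
        }
        # Sort key: total score first, then the queried-side score (assumed tie-break).
        candidates.append((result["score"], queried_part, result))
    if not candidates:
        return None
    return max(candidates, key=lambda c: (c[0], c[1]))[2]


person_a = build_variants("John", "Smith", date(1990, 5, 31))
examples = [
    ("John", "Smith", date(1990, 5, 31)),
    ("John", "Smith", date(1990, 6, 1)),
    ("John", "Smith", date(1990, 6, 2)),
    ("Smith", "John", date(1990, 6, 2)),
]
for first, last, dob in examples:
    print(first, last, dob, best_match(person_a, build_variants(first, last, dob)))
```

Run against Person A, the four Persons B score 1.0, 0.6, 0.36 and 0.324 respectively, with the same variant types as listed in the table.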


All variant types explained

The table below lists all potential values of queriedHash, matchedHash, queriedDateOfBirth and matchedDateOfBirth, with a description of each.


| Transformation code name | Description | Example |
|---|---|---|
| Exact | Two of the transformations are mandatory: LowerCase and Trimming. They are always applied to the input data and cannot be turned off with the "-D" option. This approach gives the maximum score (i.e. 1) to names with the most basic mistakes: wrong letter case or extra white space. | "John   SMIth" -> "john smith"; " joHn smiTH " -> "john smith" |
| Original | The original value of the date of birth has been used for duplicate matching, either in the input (queriedDateOfBirth) or in the record found (matchedDateOfBirth). If both are "Original", the date of birth has been matched exactly. The most common case is that one value is "Original" and the other is one of the date transformations explained below, e.g. "IncorrectDay". In that case it is recommended to use the description of the respective date transformation. | |
| CharFolding | Replaces certain letters with the most common alternative without diacritics. | Müller -> Mueller |
| FirstNameVariants | Produces a number of variants of the first name based on relations between first names. | Botros [ARA], Peter [ENG], Peter [GER], Pierre [FRA], etc. |
| NormalizeChars | Removes diacritics. | Dzierżawski -> Dzierzawski |
| SelfLearning | Compares the name to the existing database. Produces variants of the name that already exist in the DB and are "close" to the original in terms of Levenshtein distance. The more frequent a variant found in the self-learning DB is, the higher the score. The algorithm performs the same operation separately for the first name and the last name. This transformation accounts for simple spelling mistakes. A customized version of Levenshtein distance is used that takes characteristics of the input into consideration (e.g. the length of the name). | Tohmas -> Thomas; Mart -> Marta |
| Tokenize | Some but not all parts of a multi-part name match. This applies only to multi-part names. The transformation allows matching a name that is a subset of another name. | "John Smith-Gunderson" -> "John Smith", "John Gunderson" |
| SwapBirthDate | Produces a variant of the birth date to account for swapping the day with the month. | 12/04/1978 -> 04/12/1978 |
| IncorrectDay | Produces a variant of the date of birth which matches if the year and month are the same but the day is different. | 12/04/1978 -> xx/04/1978 (all days will match) |
| IncorrectYear | Produces a variant of the date of birth which matches if the day and month are the same but the year is different. | 12/04/1978 -> 12/04/xx (all years will match) |
| IncorrectMonth | Produces a variant of the date of birth which matches if the year and day are the same but the month is different. | 12/04/1978 -> 12/xx/1978 (all months will match) |
| OneDayAfter | Produces a variant of the date of birth which is 1 day after the correct date. | 2001-05-31 -> 2001-06-01 |
| OneDayBefore | Produces a variant of the date of birth which is 1 day before the correct date. | 2001-05-31 -> 2001-05-30 |
| SwapNames | Produces a variant with the first and last names swapped to account for such mistakes. | John Smith -> Smith John |
| Translate | Produces a variant of the name with the first name translated into English by Google Translate and by MS Cognitive Services (formerly Bing Translate). Translation proves to work better than transliteration in some cases (Hebrew). | Mateusz -> Matthew; אביחי -> Avihai |
| Transliterate | Produces a transliterated version of the first and last name. | Андрей Печонкин -> Andrey Pechonkin |
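To make a few of these transformations more concrete, here is a rough sketch of "Exact", "NormalizeChars", "CharFolding", "SwapBirthDate" and "IncorrectDay". All function names and the `CHAR_FOLDS` table are hypothetical and the real Connect ID rules may differ. In particular, the folding table below is a tiny invented sample, and stripping diacritics through Unicode decomposition does not handle letters such as "ł" that have no decomposed form.

```python
import unicodedata
from datetime import date


def exact(name):
    """Mandatory normalization: lower-case, trim, collapse repeated spaces."""
    return " ".join(name.lower().split())


def normalize_chars(name):
    """Remove diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Hypothetical sample of a folding table (assumption, not the real mapping).
CHAR_FOLDS = {"ü": "ue", "ö": "oe", "ä": "ae", "ß": "ss"}


def char_folding(name):
    """Replace selected letters with a common spelling without diacritics."""
    composed = unicodedata.normalize("NFC", name)
    return "".join(CHAR_FOLDS.get(ch, ch) for ch in composed)


def swap_birth_date(birth_date):
    """Swap day and month; return None when the result is not a valid date."""
    try:
        return date(birth_date.year, birth_date.day, birth_date.month)
    except ValueError:
        return None


def incorrect_day_key(birth_date):
    """Key that ignores the day, so any day in the same month and year matches."""
    return (birth_date.year, birth_date.month)


print(exact(" joHn smiTH "))                 # john smith
print(normalize_chars("Dzierżawski"))        # Dzierzawski
print(char_folding("Müller"))                # Mueller
print(swap_birth_date(date(1978, 4, 12)))    # 1978-12-04
print(incorrect_day_key(date(1978, 4, 12)) == incorrect_day_key(date(1978, 4, 30)))  # True
```

Note the difference between the two diacritic transformations: with this sketch, `normalize_chars("Müller")` gives "Muller", while `char_folding("Müller")` gives "Mueller".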



More advanced topics

The fact that non-exact variants of Person B can be matched with non-exact variants of Person A has an interesting side effect:

  • by design, Connect ID is able to match dates that are 1 day apart (because we have the OneDayAfter and OneDayBefore variants). In practice, it can also match dates that are 2 days apart (because a non-exact OneDayAfter variant will be matched with a non-exact OneDayBefore variant).
  • by design, Connect ID is able to match names that are 1 letter apart (e.g. "Smith" will be matched with "Smit", because they share the variant "Smit"). In practice, it can sometimes match names that are 2 letters apart ("Smith" and "Amit" will be matched as well, because they share the non-exact variant "Smit").

Such matches are not very precise and they will have a low score because the partial scores will be multiplied. It may help, though, to understand why certain records pop up as potential duplicates.
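Both side effects can be checked with a small sketch. It uses the plain textbook Levenshtein distance, whereas Connect ID uses a customized version, and it uses the 0.6 score quoted earlier for the one-day date variants. The function name `levenshtein` and the variable names are illustrative only.

```python
from datetime import date, timedelta


def levenshtein(a, b):
    """Plain Levenshtein edit distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute
            ))
        previous = current
    return previous[-1]


# Names: each name is one edit away from the shared variant "smit",
# although the two names are two edits away from each other.
print(levenshtein("smith", "smit"))   # 1
print(levenshtein("amit", "smit"))    # 1
print(levenshtein("smith", "amit"))   # 2

# Dates: two dates that are 2 days apart still share one non-exact variant.
dob_a, dob_b = date(1990, 5, 31), date(1990, 6, 2)
variants_a = {dob_a + timedelta(days=k) for k in (-1, 0, 1)}
variants_b = {dob_b + timedelta(days=k) for k in (-1, 0, 1)}
shared = variants_a & variants_b
print(shared)                          # only 1990-06-01 is shared
print(round(0.6 * 0.6, 2))             # combined date score: 0.36
```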