During the person registration scenario, Connect ID performs a duplicate check based on first name, last name, date of birth and gender. If potential duplicates are detected, the system tries to contact all MAs "owning" the potential duplicates to get personal details. In most implementations, the detailed personal information coming from other MAs is displayed to a human operator so that they can determine whether the record is a duplicate or not.
Some MAs implementing Connect ID have chosen to also display duplicate scoring information (the similarity between the original record and a potential duplicate found), which comes not from other MAs but directly from Connect ID:
- queried hash
- queried date of birth
- matched hash
- matched date of birth
This article explains how to interpret the values of these fields (e.g. "Exact" or "SwapNames") should you need to show them to the end user. An example of what can be shown to the user can be found in this article.
The first chapter discusses the details of the duplicate detection and scoring algorithm, while the next one lists all potential values of the fields above with explanations. If you don't need to understand how the algorithm works, jump to the chapter "All variant types explained".
Duplicate detection and scoring algorithm
To better understand the meaning of different fields, we need to first look at how Connect ID stores data and performs duplicate detection.
First, we will assume that Person A has already been registered in Connect ID. Effectively, this means that a number of variants of Person A's name and date of birth have been stored. For example, if we register "John Smith" born on "1990-05-31", the system will store the following variants of the name and date of birth (names are hashed, therefore we will sometimes refer to them as hashes):
|Name||Variant/hash||Hash explanation||Date of birth||Date of birth explanation|
|John Smith||john smith||Type of hash: "Exact". The name is stored in lower case and this hash will have a score of 1.0.||1990-05-31||Type: "Original". Exact date of birth will be stored along the hash, with a score of 1.0. The total score for this variant will be 1.0 x 1.0 = 1.0.|
|as above||as above||as above||1990-05-30||Type: "OneDayBefore". Date different by one day will be stored along the hash, with a score of 0.6. The total score of this variant will be 1.0 x 0.6 = 0.6.|
|as above||as above||as above||1990-06-01||Type: "OneDayAfter". Date different by one day will be stored along the hash, with a score of 0.6. The total score of this variant will be 1.0 x 0.6 = 0.6.|
|as above||smith john||Type of hash: "SwapNames". Algorithm will store a hash for swapped names and give it a score of 0.9.||1990-05-31||Type: "Original". Exact date of birth will be stored along the hash, with a score of 1.0. The total score for this variant will be 0.9 x 1.0 = 0.9.|
|as above||as above||as above||1990-06-01||Type: "OneDayAfter". Date different by one day will be stored along the hash, with a score of 0.6. The total score of this variant will be 0.9 x 0.6 = 0.54.|
Many more variants (up to 1000) are stored in Connect ID in order to detect duplicates better. However, the ones given above are enough to explain how duplicates are detected and how the score is calculated.
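The way the stored variants and their total scores arise can be illustrated with a minimal sketch. It covers only the two name transformations and three date transformations from the table above and uses the partial scores quoted there. The function names are hypothetical, and names are kept as plain text instead of being hashed, purely for readability; this is not the actual Connect ID implementation.

```python
from datetime import date, timedelta

# Partial scores quoted in the table above; everything else is illustrative.
HASH_SCORES = {"Exact": 1.0, "SwapNames": 0.9}
DATE_SCORES = {"Original": 1.0, "OneDayBefore": 0.6, "OneDayAfter": 0.6}


def name_variants(first, last):
    """Return {hash type: name variant} for the two name transformations shown."""
    first, last = first.strip().lower(), last.strip().lower()
    return {"Exact": f"{first} {last}", "SwapNames": f"{last} {first}"}


def date_variants(dob):
    """Return {date type: date} for the three date transformations shown."""
    one_day = timedelta(days=1)
    return {
        "Original": dob,
        "OneDayBefore": dob - one_day,
        "OneDayAfter": dob + one_day,
    }


def stored_variants(first, last, dob):
    """Pair every name variant with every date variant and score the pair."""
    rows = []
    for hash_type, name in name_variants(first, last).items():
        for date_type, day in date_variants(dob).items():
            # Total score of a variant = name partial score x date partial score.
            score = HASH_SCORES[hash_type] * DATE_SCORES[date_type]
            rows.append((name, hash_type, day.isoformat(), date_type, round(score, 2)))
    return rows


for row in stored_variants("John", "Smith", date(1990, 5, 31)):
    print(row)
```

For "John Smith" born on 1990-05-31 the sketch produces six rows, including the five listed in the table.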
Now let's consider Person B, whom we want to register. As before, the system will create a number of variants for Person B and then try to register the person. The algorithm will try to match any variant of Person B with any of the existing variants. If a duplicate is found, Connect ID will return the following information:
- queried hash - the type of Person B's hash for which a duplicate has been found
- queried date of birth - the type of Person B's date of birth variant for which a duplicate has been found
- matched hash - the type of the duplicate's hash that matched Person B's hash
- matched date of birth - the type of the duplicate's date of birth variant that matched Person B's date of birth
- score - the product of the partial scores of all of the above (a value between 0 and 1).
Let's look at a couple of examples to understand this better. We will continue using the example of Person A and their hashes above. We will consider a couple of different Persons B, all of which will match Person A one way or another.
|Person B example||Queried hash||Queried date of birth||Matched hash||Matched date of birth||Score|
|John Smith, 1990-05-31||"Exact", because it was the exact hash ("john smith") for which a duplicate has been found.||"Original", because it was the exact date of birth for which a duplicate has been found.||"Exact", because it was the exact hash of Person A that was found.||"Original", because it was the exact date of birth of Person A that was found.||1.0 x 1.0 x 1.0 x 1.0 = 1.0|
|John Smith, 1990-06-01||as above||"Original", because a duplicate has been found for the date "1990-06-01".||as above||"OneDayAfter", because when the system searched for "1990-06-01", it found a variant of Person A's date of birth: the date moved by one day.||1.0 x 1.0 x 1.0 x 0.6 = 0.6|
|John Smith, 1990-06-02||as above||"OneDayBefore", because a duplicate has been found when the system searched for a non-exact variant "1990-06-01".||as above||"OneDayAfter", because when the system searched for "1990-06-01", it found a variant of Person A's date of birth: the date moved by one day.||1.0 x 0.6 x 1.0 x 0.6 = 0.36|
|Smith John, 1990-06-02||"Exact", because it was the exact hash ("smith john") for which a duplicate has been found.||as above||"SwapNames", because it was a non-exact swapped-names hash of Person A that was matched with "smith john".||as above||1.0 x 0.6 x 0.9 x 0.6 = 0.324|
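The examples in the table above can be reproduced with a small sketch that intersects the variants of Person A and Person B and multiplies the four partial scores. It is a simplification under stated assumptions: only the "Exact"/"SwapNames" and "Original"/"OneDayBefore"/"OneDayAfter" variants are generated, names are compared as plain text rather than hashes, and the function names are hypothetical. When two shared variants have the same score, the sketch prefers the one whose queried variant is closer to the original input; this tie-break is an assumption chosen to agree with the table, not documented Connect ID behaviour.

```python
from datetime import date, timedelta
from itertools import product

# Partial scores taken from the tables above.
HASH_SCORES = {"Exact": 1.0, "SwapNames": 0.9}
DATE_SCORES = {"Original": 1.0, "OneDayBefore": 0.6, "OneDayAfter": 0.6}


def variants(first, last, dob):
    """Map (name variant, date variant) -> (hash type, date type)."""
    first, last = first.lower(), last.lower()
    names = {"Exact": f"{first} {last}", "SwapNames": f"{last} {first}"}
    day = timedelta(days=1)
    dates = {"Original": dob, "OneDayBefore": dob - day, "OneDayAfter": dob + day}
    return {(names[h], dates[d]): (h, d) for h, d in product(names, dates)}


def best_match(person_a, person_b):
    """Return (queried hash, queried date, matched hash, matched date, score).

    person_a is the stored record, person_b the record being registered.
    Returns None when the two people share no variant.
    """
    stored = variants(*person_a)
    queried = variants(*person_b)
    candidates = []
    for key in stored.keys() & queried.keys():
        q_hash, q_date = queried[key]
        m_hash, m_date = stored[key]
        q_part = HASH_SCORES[q_hash] * DATE_SCORES[q_date]
        m_part = HASH_SCORES[m_hash] * DATE_SCORES[m_date]
        # Sort key: total score first, then the queried partial score (assumed tie-break).
        candidates.append(
            (round(q_part * m_part, 3), round(q_part, 3), q_hash, q_date, m_hash, m_date)
        )
    if not candidates:
        return None
    score, _, q_hash, q_date, m_hash, m_date = max(candidates)
    return q_hash, q_date, m_hash, m_date, score


person_a = ("John", "Smith", date(1990, 5, 31))
for person_b in [
    ("John", "Smith", date(1990, 5, 31)),
    ("John", "Smith", date(1990, 6, 1)),
    ("John", "Smith", date(1990, 6, 2)),
    ("Smith", "John", date(1990, 6, 2)),
]:
    print(person_b, "->", best_match(person_a, person_b))
```

A Person B born on 1990-06-03 shares no date variant with Person A, so the sketch returns no match for that input.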
All variant types explained
The table below shows all potential values of queriedHash, matchedHash, queriedDateOfBirth and matchedDateOfBirth, each with a description.
|Transformation code name||Description||Example|
|Exact||Two of the transformations are mandatory. They are always applied to the input data and cannot be turned off with the "-D" option. These are:||"John SMIth" -> "john smith"; " joHn smiTH " -> "john smith"|
|Original||This value means that the original value of the date of birth has been used for duplicate matching in either the input (queriedDateOfBirth) or the record found (matchedDateOfBirth). If both are "Original", the date of birth has been matched exactly. The most common case is when one value is "Original" and the other is one of the date transformations explained below, e.g. "IncorrectDay". In that case it is recommended to use the description of the respective date transformation.|
|CharFolding||Replaces certain letters with the most common alternative but without diacritics.||Müller --> Mueller|
|FirstNameVariants||Produces a number of variants of the first name based on relations between first names.||Botros [ARA], Peter [ENG], Peter [GER], Pierre [FRA], etc.|
|NormalizeChars||Removes diacritics.||Dzierżawski -> Dzierzawski|
|SelfLearning||Compares the name to the existing database. Produces variants of the name that already exist in the DB and are "close" to the original in terms of Levenshtein distance. The more frequent a certain variant found in the self-learning DB is, the higher the score. The algorithm performs the same operation for the first name and the last name separately. This transformation accounts for simple spelling mistakes. A customized version of Levenshtein distance is used that takes into consideration characteristics of the input (e.g. length of the name).||Tohmas -> Thomas; Mart -> Marta|
|Tokenize||Some but not all parts of multiple names match. This applies only to multiple names. The transformation allows matching a name that is a subset of another name.||"John Smith-Gunderson" --> "John Smith", "John Gunderson"|
|SwapBirthDate||Produces a variant of writing birth date to account for swapping days with months.||12/04/1978 -> 04/12/1978|
|IncorrectDay||Produces a variant of date of birth which is matched if year & month are the same, but day is different.||12/04/1978 --> xx/04/1978 (all days will match)|
|IncorrectYear||Produces a variant of date of birth which is matched if day & month are the same, but year is different.||12/04/1978 --> 12/04/xx (all years will match)|
|IncorrectMonth||Produces a variant of date of birth which is matched if year & day are the same, but month is different.||12/04/1978 --> 12/xx/1978 (all months will match)|
|OneDayAfter||Produces a variant of date of birth which is 1 day after the correct date.||2001-05-31 --> 2001-06-01|
|OneDayBefore||Produces a variant of date of birth which is 1 day before the correct date.||2001-05-31 --> 2001-05-30|
|SwapNames||Produces a variant with first and last names swapped to account for such mistakes.||John Smith -> Smith John|
|Translate||Produces a variant of the name with the first name translated into English by Google Translate and by MS Cognitive Services (formerly Bing Translate). Translation proves to work better than transliteration in some cases (Hebrew).||Mateusz -> Matthew; אביחי -> Avihai|
|Transliterate||Produces a transliterated version of the first and last name.||Андрей Печонкин -> Andrey Pechonkin|
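To make a few of the simpler transformations in the table more concrete, here is a rough sketch of how "Exact", "NormalizeChars" and "SwapBirthDate" could behave. The function names are hypothetical and the behaviour is inferred only from the examples in the table (for instance, that "Exact" lower-cases the name and trims whitespace); the real Connect ID transformations may differ.

```python
import unicodedata
from datetime import date


def exact(name):
    """Lower-case the name and collapse leading, trailing and repeated whitespace.

    Assumed behaviour, inferred from the examples in the "Exact" row.
    """
    return " ".join(name.lower().split())


def normalize_chars(name):
    """Remove diacritics by dropping combining marks after NFD decomposition.

    Letters without a Unicode decomposition are left unchanged.
    """
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def swap_birth_date(dob):
    """Swap day and month; return None when the swapped date does not exist."""
    try:
        return date(dob.year, dob.day, dob.month)
    except ValueError:
        return None


print(exact(" joHn smiTH "))
print(normalize_chars("Dzierżawski"))
print(swap_birth_date(date(1978, 4, 12)))
```

Note that stripping diacritics turns "Müller" into "Muller"; producing "Mueller" is the job of CharFolding, which needs an explicit letter mapping and is not sketched here.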
More advanced topics
The fact that non-exact variants of Person B can be matched with non-exact variants of Person A has an interesting side effect:
- by design Connect ID is able to match dates if they are 1 day apart (because we have OneDayAfter and OneDayBefore variants). In practice, it can match dates which are 2 days apart (because a non-exact OneDayAfter variant will be matched with a non-exact OneDayBefore variant).
- by design Connect ID is able to match names which are 1 letter apart (e.g. "Smith" will be matched with "Smit", because they share a variant "Smit"). In practice, it can sometimes match names which are 2 letters apart ("Smith" and "Amit" will be matched as well because they share a non-exact variant "Smit").
Such matches are not very precise and they will have a low score because the partial scores will be multiplied. It may help, though, to understand why certain records pop up as potential duplicates.
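The name example can be checked with the classic Levenshtein distance. The sketch below uses the standard, unmodified algorithm purely for illustration; as noted earlier, Connect ID uses a customized version, so actual results may differ.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute (or keep) the character
                )
            )
        previous = current
    return previous[-1]


# Both names are one edit away from the shared variant "smit" ...
print(levenshtein("smith", "smit"))
print(levenshtein("amit", "smit"))
# ... but two edits away from each other.
print(levenshtein("smith", "amit"))
```

Because each name is within one edit of "smit", both can produce that variant, and the two records meet on it even though they are two edits apart.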