Introduction

During the person registration scenario, Connect ID performs a duplicate check based on first name, last name, date of birth and gender. If potential duplicates are detected, the system tries to contact all MAs "owning" the potential duplicates to get personal details. In most implementations, the detailed personal information coming from other MAs is displayed to a human operator so that they can determine whether the record is a duplicate or not.

Some MAs implementing Connect ID have chosen to also show the user the duplicate scoring information (the similarity between the original record and a potential duplicate), which comes not from other MAs but directly from Connect ID:

score
queried hash
queried date of birth
matched hash
matched date of birth

This article explains how to interpret the values of these fields (e.g. "Exact" or "SwapNames") should you need to show them to the end user. An example of what can be shown to the user can be found in this article.

The first chapter discusses the details of the duplicate detection and scoring algorithm, while the next one lists all potential values of the fields above with explanations. If you don't need to understand how the algorithm works, jump to the chapter "All variant types explained".

Duplicate detection and scoring algorithm

To better understand the meaning of different fields, we need to first look at how Connect ID stores data and performs duplicate detection.

First, we will assume that Person A has already been registered in Connect ID. Effectively, this means that a number of variants of Person A's name and date of birth have been stored. For example, if we register "John Smith" born on "1990-05-31", the system will store the following variants of the name and date of birth (names are hashed, therefore we will sometimes refer to them as hashes):

Name	Variant/hash	Hash explanation	Date of birth	Date of birth explanation
John Smith	john smith	Type of hash: "Exact". The name is stored in lower case, and this hash has a score of 1.0.	1990-05-31	Type: "Original". The exact date of birth is stored along with the hash, with a score of 1.0. The total score for this variant is 1.0 x 1.0 = 1.0.
	as above	as above	1990-05-30	Type: "OneDayBefore". The date one day earlier is stored along with the hash, with a score of 0.6. The total score for this variant is 1.0 x 0.6 = 0.6.
	as above	as above	1990-06-01	Type: "OneDayAfter". The date one day later is stored along with the hash, with a score of 0.6. The total score for this variant is 1.0 x 0.6 = 0.6.
	smith john	Type of hash: "SwapNames". The algorithm stores a hash for the swapped names and gives it a score of 0.9.	1990-05-31	Type: "Original". The exact date of birth is stored along with the hash, with a score of 1.0. The total score for this variant is 0.9 x 1.0 = 0.9.
	as above	as above	1990-06-01	Type: "OneDayAfter". The date one day later is stored along with the hash, with a score of 0.6. The total score for this variant is 0.9 x 0.6 = 0.54.
	(...)	(...)	(...)	(...)

Many more variants (up to 1000) are stored in Connect ID to improve duplicate detection. The ones given above are enough, however, to explain how duplicates are detected and how the score is calculated.

Now let's consider Person B, whom we want to register. As before, the system will create a number of variants for Person B and then try to register the person. The algorithm will try to match any variant of Person B with any of the existing variants. If a duplicate is found, Connect ID will return the following information:
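The variants in the table above can be reproduced with a minimal sketch. It is only an illustration: the function name build_variants is hypothetical, names are kept as plain lower-case strings instead of real hashes, and only the variant types and scores quoted in this article are modelled.

```python
from datetime import date, timedelta

# Partial scores quoted in the article; the real system stores many more variant types.
HASH_SCORES = {"Exact": 1.0, "SwapNames": 0.9}
# Date variant type -> (offset in days, partial score)
DOB_OFFSETS = {"Original": (0, 1.0), "OneDayBefore": (-1, 0.6), "OneDayAfter": (1, 0.6)}


def build_variants(first_name, last_name, dob):
    """Return (hash_type, name, dob_type, dob, total_score) tuples for one person."""
    names = {
        "Exact": f"{first_name} {last_name}".lower(),
        "SwapNames": f"{last_name} {first_name}".lower(),
    }
    variants = []
    for hash_type, name in names.items():
        for dob_type, (offset, dob_score) in DOB_OFFSETS.items():
            total = HASH_SCORES[hash_type] * dob_score
            variants.append((hash_type, name, dob_type, dob + timedelta(days=offset), total))
    return variants


for variant in build_variants("John", "Smith", date(1990, 5, 31)):
    print(variant)
```

For "John Smith" this yields six combinations: two name variants, each paired with three date variants.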

queried hash - the type of Person B's hash for which a duplicate has been found
queried date of birth - the type of Person B's date of birth variant for which a duplicate has been found
matched hash - the type of the duplicate's hash that matched Person B's hash
matched date of birth - the type of the duplicate's date of birth variant that matched Person B's date of birth
score - the product of the partial scores of the four values above (a value between 0 and 1)
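As an illustration of how the score relates to the other four fields, the sketch below models the returned information as a small record. The class name DuplicateMatch and its attribute names are hypothetical and do not describe the actual Connect ID response format. The partial scores are the ones quoted in this article.

```python
from dataclasses import dataclass

# Partial scores quoted in this article; other variant types are not covered here.
PARTIAL_SCORES = {
    "Exact": 1.0,
    "SwapNames": 0.9,
    "Original": 1.0,
    "OneDayBefore": 0.6,
    "OneDayAfter": 0.6,
}


@dataclass(frozen=True)
class DuplicateMatch:
    """Hypothetical container for the four variant types returned with a duplicate."""
    queried_hash: str
    queried_date_of_birth: str
    matched_hash: str
    matched_date_of_birth: str

    @property
    def score(self) -> float:
        # The score is the product of the four partial scores.
        result = 1.0
        for variant_type in (self.queried_hash, self.queried_date_of_birth,
                             self.matched_hash, self.matched_date_of_birth):
            result *= PARTIAL_SCORES[variant_type]
        return result


match = DuplicateMatch("Exact", "OneDayBefore", "SwapNames", "OneDayAfter")
print(round(match.score, 3))  # 0.324
```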

Let's look at a couple of examples to understand this better. We will continue using the example of Person A and their hashes above, and consider several different Persons B, all of which match Person A in one way or another.

Person B example	Queried hash	Queried date of birth	Matched hash	Matched date of birth	Score
John Smith, 1990-05-31	"Exact", because it was the exact hash ("john smith") for which a duplicate has been found.	"Original", because it was the exact date of birth for which a duplicate has been found.	"Exact", because it was the exact hash of Person A that was found.	"Original", because it was the exact date of birth of Person A that was found.	1.0 x 1.0 x 1.0 x 1.0 = 1.0
John Smith, 1990-06-01	as above	"Original", because a duplicate has been found for the exact date "1990-06-01".	as above	"OneDayAfter", because when the system searched for "1990-06-01", it found a variant of Person A's date of birth: the date moved forward by one day.	1.0 x 1.0 x 1.0 x 0.6 = 0.6
John Smith, 1990-06-02	as above	"OneDayBefore", because a duplicate has been found when the system searched for the non-exact variant "1990-06-01".	as above	"OneDayAfter", because when the system searched for "1990-06-01", it found a variant of Person A's date of birth: the date moved forward by one day.	1.0 x 0.6 x 1.0 x 0.6 = 0.36
Smith John, 1990-06-02	"Exact", because it was the exact hash ("smith john") for which a duplicate has been found.	as above	"SwapNames", because it was the non-exact swapped-name hash of Person A that matched "smith john".	as above	1.0 x 0.6 x 0.9 x 0.6 = 0.324
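The scores in the table above can be checked with a small end-to-end sketch. It rests on several assumptions: plain lower-case names stand in for the real hashes, only the variant types named in this article are generated, and when several variant pairs match, the highest-scoring pair is reported (the article does not state which pair Connect ID returns). The function names variants and best_match are hypothetical.

```python
from datetime import date, timedelta

# Partial scores from the tables above; only these variant types are modelled.
NAME_TYPES = {"Exact": 1.0, "SwapNames": 0.9}
# Date variant type -> (offset in days, partial score)
DOB_TYPES = {"Original": (0, 1.0), "OneDayBefore": (-1, 0.6), "OneDayAfter": (1, 0.6)}


def variants(first, last, dob):
    """Map each (name, date) key to its (hash type, date type, score) triple."""
    names = {"Exact": f"{first} {last}".lower(), "SwapNames": f"{last} {first}".lower()}
    return {
        (name, dob + timedelta(days=offset)):
            (hash_type, dob_type, NAME_TYPES[hash_type] * dob_score)
        for hash_type, name in names.items()
        for dob_type, (offset, dob_score) in DOB_TYPES.items()
    }


def best_match(queried, stored):
    """Return (queried types, matched types, score) for the highest-scoring shared key."""
    candidates = [
        (queried[key][:2], stored[key][:2], queried[key][2] * stored[key][2])
        for key in queried.keys() & stored.keys()
    ]
    # Assumption: the best-scoring pair is reported; None means no duplicate was found.
    return max(candidates, key=lambda c: c[2], default=None)


person_a = variants("John", "Smith", date(1990, 5, 31))
for first, last, dob in [("John", "Smith", date(1990, 5, 31)),
                         ("John", "Smith", date(1990, 6, 1)),
                         ("John", "Smith", date(1990, 6, 2)),
                         ("Smith", "John", date(1990, 6, 2))]:
    match = best_match(variants(first, last, dob), person_a)
    print(first, last, dob, round(match[2], 3))
```

The four scores printed agree with the table: 1.0, 0.6, 0.36 and 0.324. Note that in this simplified model some rows have more than one variant pair with the same top score (for example, "Smith John" can also match through its own swapped-name variant against Person A's exact hash), so the reported types may differ from the table while the score stays the same.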

All variant types explained

The table showing all potential values of queriedHash, matchedHash, queriedDateOfBirth and matchedDateOfBirth can be found in this article.

Gender

If a gender value is provided, the system returns only matches with the same gender value (or with an empty gender). If no value is provided, all matches are returned (as determined by the other fields described above).
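The gender rule can be expressed as a simple filter. This is a sketch only: the function name gender_matches and the gender values "male" and "female" are assumptions, not the actual Connect ID encoding.

```python
from typing import Optional


def gender_matches(queried: Optional[str], stored: Optional[str]) -> bool:
    """Apply the gender rule described above to one candidate match."""
    if not queried:   # no gender supplied: gender does not restrict the result
        return True
    if not stored:    # the stored record has an empty gender: it is still returned
        return True
    return queried == stored


# Hypothetical candidates as (identifier, stored gender) pairs.
candidates = [("A", "male"), ("B", "female"), ("C", None)]
print([pid for pid, gender in candidates if gender_matches("male", gender)])  # ['A', 'C']
```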
