We used the following approach to enable this:
- We imported all the required data elements and cleansed them into a standardized format: retrieving correct latitude-longitude coordinates through web extraction, removing stop words, expanding abbreviations, and similar normalization steps.
- Following this, we used statistical fuzzy-matching techniques such as Jaccard similarity, Jaro-Winkler distance, phonetic matching, and edit-distance measures to identify similar records.
- We then applied business rules specific to the client’s needs to ensure that certain records were not falsely flagged as duplicates.
- The solution enabled the client to integrate multiple data sources into SFDC after removing duplicates, avoiding redundant information.
- The entire pipeline has also been automated to run without any manual intervention, regenerating the output files whenever the input files are updated.
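The cleansing step above (stop-word removal and abbreviation expansion) can be sketched as follows. This is a minimal illustration, not the client's actual rules: the `ABBREVIATIONS` mapping and `STOP_WORDS` set are hypothetical placeholders for whatever dictionaries the real pipeline used.

```python
import re

# Hypothetical abbreviation dictionary; the real mapping would be domain-specific.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "corp": "corporation", "inc": "incorporated"}

# Hypothetical stop words to drop before matching.
STOP_WORDS = {"the", "of", "and"}

def cleanse(text: str) -> str:
    """Lowercase, tokenize, expand abbreviations, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    expanded = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(t for t in expanded if t not in STOP_WORDS)
```

For example, `cleanse("The Acme Corp., 12 Main St.")` yields `"acme corporation 12 main street"`, so superficially different spellings of the same entity converge on one canonical string before matching.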
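Two of the similarity measures named above, Jaccard similarity and Jaro-Winkler distance, can be implemented from their standard definitions. This is a self-contained sketch of the textbook formulas, not the client's code:

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity on word-token sets: |A ∩ B| / |A ∪ B|."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity based on matched characters and transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    match_dist = max(len1, len2) // 2 - 1
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - match_dist), min(i + match_dist + 1, len2)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro-Winkler: boost the Jaro score for a shared prefix of up to 4 chars."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)
```

In practice a candidate pair would be scored with several such measures; for instance `jaro_winkler("MARTHA", "MARHTA")` is roughly 0.961, well above a typical duplicate threshold.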
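The business-rule layer described above can be sketched as a filter applied after similarity scoring. The specific rules and field names here (`postal_code`, `account_type`, the 0.85 threshold) are hypothetical examples, since the actual rules were client-specific:

```python
def is_duplicate(rec_a: dict, rec_b: dict, similarity: float,
                 threshold: float = 0.85) -> bool:
    """A high similarity score alone is not enough; business rules can veto a match."""
    if similarity < threshold:
        return False
    # Hypothetical rule 1: records in different postal codes are
    # distinct branches of the same organization, not duplicates.
    a_zip, b_zip = rec_a.get("postal_code"), rec_b.get("postal_code")
    if a_zip and b_zip and a_zip != b_zip:
        return False
    # Hypothetical rule 2: records with different account types
    # must be kept as separate entries.
    if rec_a.get("account_type") != rec_b.get("account_type"):
        return False
    return True
```

Structuring the rules as veto checks on top of the statistical score is what keeps legitimately similar records (e.g. two branches of the same company) from being falsely flagged.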
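The final bullet's automation (regenerating outputs whenever the inputs change) could be as simple as polling the input file's modification time. This is one possible mechanism, not necessarily the one the team used; a scheduler or file-system event hook would serve equally well:

```python
import os
import time

def check_and_run(input_path: str, process, last_mtime):
    """Run `process` on the input file if its mtime changed; return the new mtime."""
    mtime = os.path.getmtime(input_path)
    if mtime != last_mtime:
        process(input_path)
        return mtime
    return last_mtime

def watch(input_path: str, process, interval: int = 60):
    """Poll forever, reprocessing whenever the input file is updated."""
    last_mtime = None
    while True:
        last_mtime = check_and_run(input_path, process, last_mtime)
        time.sleep(interval)
```

Separating the single check (`check_and_run`) from the loop (`watch`) keeps the change-detection logic testable on its own.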