Idea: Fuzzy matching in De-duplicate?
  •     grahamb March 19, 2013 9:23AM
    Can you add more advanced features to the de-duplicate option please? Right now every field is an exact match, but you really need to be able to build up multiple keys using subsets and phonetics or Soundex within this function. See the labelled Doubletake set-up halfway down this page... http://www.tech4t.co.uk/pages/t4t-doubletake-2.html
  • 8 Comments
  •     tjbate March 19, 2013 1:54PM
    Graham - Omniscope Merge/Join operation blocks support both Soundex and Levenshtein fuzzy matching. Within the Merge/Join block, for each join criterion (no limit) you can specify two Fields and an Operator, which for fuzzy matching can be either Soundex or Levenshtein (at the bottom of the drop-down list).

    You can duplicate both target Fields and entire Merge/Join operations and use either algorithm on a field-by-field basis, directing Matches and Non-matches into different processing flows and, in the case of Levenshtein, even running multiple iterations with different settings for case sensitivity and 'closeness'.

    De-duplication is typically exact-match, and should be used prior to Merge/Joins to ensure that at least one side of the merge's target fields contains no exact-match duplicates, which would otherwise proliferate many-to-many match records unnecessarily (e.g. three records sharing a key on one side joined against two on the other yields six output records).
  •     grahamb March 19, 2013 1:59PM
    Wow - didn't see this as I was looking under de-duplication. Will investigate further. Cheers Graham
  •     grahamb March 19, 2013 2:08PM
    Thomas, to use fuzzy logic to de-duplicate a single file using merge/join, do I need to load the same file twice as merge expects two feeds?
  •     tjbate March 19, 2013 3:57PM
    Graham - Multi-field Merge/Joins use fuzzy logic to permit some leniency in certain fields, so that variants in one field do not necessarily defeat an otherwise sound multi-field merge. When it comes to analysing a single field for 'duplicates' I think what you are really concerned with is eliminating 'variants', in other words, two different spellings for the same thing, leading to a duplication in the count...e.g. too many customers on the list because some are the same customers' names spelled two or more different ways.

    You cannot do a single-field row-count reduction for variants using only the Merge/Join block, nor currently the De-Duplicate block either. Omniscope's current approach to this is in the DataExplorer Table View interface: manual sorting, mass-selection and one-time editing of variants to regularise spellings, which is really pretty quick in most cases. The corrected output then feeds into a multi-file chain solution.

    One of the problems with a purely automated DataManager 'ETL' approach - identifying/classifying variants, then re-spelling the values and filtering out or aggregating the re-spelled duplicates - is deciding which spelling is the correct, standardised and official one. If the correct values can be entered in a DataManager List of Values block, or kept in a separate reference/validation IOK file, you can use the Merge/Join block to isolate mismatches for manual editing... but that is not really your use case here.
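
    As a rough illustration of that reference-list idea (plain JavaScript rather than the DataManager blocks themselves; the field values here are made up):

        // Split incoming values into matches (already standardised) and
        // mismatches (variants that need manual correction).
        const canonical = new Set(['ACME Ltd', 'Visokio Ltd']);  // hypothetical reference list
        const incoming  = ['ACME Ltd', 'Acme Limited', 'Visokio Ltd', 'Visokio'];

        const matches    = incoming.filter(v => canonical.has(v));
        const mismatches = incoming.filter(v => !canonical.has(v));
        // mismatches -> ['Acme Limited', 'Visokio']: route these to manual editing.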

    I have re-classified this as an Idea for further specification/discussion.

  •     steve March 20, 2013 5:35AM
    Adding fuzzy matching to De-Dup is a good idea. But it has a few issues given the two current approaches for fuzzy text matching/searching.

    Soundex is a very primitive way of collapsing a single word into a short identifier. It's fairly specialist, and will often result in many false positives. It only considers the first letter and the following 3 consonants - everything else is disregarded. But, accepting these limitations, it is entirely possible to imagine a soundex de-dup option.
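
    For illustration, here is a minimal sketch of classic American Soundex in JavaScript - not necessarily the exact variant Omniscope uses, but it shows the first-letter-plus-three-digits behaviour described above:

        // Classic American Soundex: keep the first letter, then encode the
        // following consonants as digits, stopping at a 4-character code.
        function soundex(word) {
          const codes = { b:1, f:1, p:1, v:1,
                          c:2, g:2, j:2, k:2, q:2, s:2, x:2, z:2,
                          d:3, t:3, l:4, m:5, n:5, r:6 };
          const s = word.toLowerCase().replace(/[^a-z]/g, '');
          if (!s) return '';
          let out = s[0].toUpperCase();
          let prev = codes[s[0]] || 0;
          for (let i = 1; i < s.length && out.length < 4; i++) {
            const c = codes[s[i]];
            if (c && c !== prev) out += c;                    // new consonant sound
            if (s[i] !== 'h' && s[i] !== 'w') prev = c || 0;  // vowels reset; h/w don't
          }
          return out.padEnd(4, '0');
        }

        // soundex('Robert') === soundex('Rupert') === 'R163' - a match;
        // soundex('Smith')  === soundex('Smyth')  === 'S530' - also a match.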

    Levenshtein distance is much more powerful, but it measures specifically the "distance" or "closeness" between two arbitrary text values, effectively counting the number of edits required to get from a to b.
    The problem with this is, let's say you have values A, B and C. You set your de-dup fuzzy threshold such that A and B are "close" enough, and considered the same. B and C also. But A and C are different. What do you delete?
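
    To make that concrete, here is a sketch of the standard dynamic-programming edit distance (not Omniscope's internal implementation) with three values where "closeness" fails to be transitive:

        // dp[i][j] = edits needed to turn the first i chars of a
        // into the first j chars of b.
        function levenshtein(a, b) {
          const dp = Array.from({ length: a.length + 1 }, (_, i) => [i]);
          for (let j = 1; j <= b.length; j++) dp[0][j] = j;
          for (let i = 1; i <= a.length; i++) {
            for (let j = 1; j <= b.length; j++) {
              dp[i][j] = Math.min(
                dp[i - 1][j] + 1,                                   // deletion
                dp[i][j - 1] + 1,                                   // insertion
                dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
              );
            }
          }
          return dp[a.length][b.length];
        }

        levenshtein('kitten', 'sitten'); // 1 - within a threshold of 1, A matches B
        levenshtein('sitten', 'sittin'); // 1 - and B matches C
        levenshtein('kitten', 'sittin'); // 2 - but A does not match C
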
  •     grahamb March 20, 2013 6:36AM
    Steve, from our 20-plus years' experience with de-duplication, purge, suppression and merge/purge, there is no one-size-fits-all answer to the duplicate record identification problem.

    If you look at the page on our website - www.tech4t.co.uk/dedupe - it outlines a program that we have found to be one of the most flexible tools to tackle this; you can see the parameters in the screenshots. The trick is not to assume that you can use, say, Levenshtein on one field by itself, but instead to use it as part of a pattern match combined with other fields, each subjected to a different treatment. Multiple patterns then form the overall dedupe key. For instance (a code sketch of the pattern-key idea follows the two examples below)...

    Pattern 1
    Full Postcode - Exact match
    Title - Treat Ms as Miss
    First name - Initials (convert)
    Surname - Phonetics
    Address 1 - First 10 characters

    Pattern 2
    Full Postcode - Take Postcode Sector element only
    First name - Phonetics
    Surname - First 8 letters
    Company - use abbreviations table to drop reserved words like 'The', PLC, Ltd
    Address 1 - First 10 characters....
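
    To sketch how one such pattern key could be assembled (hypothetical field names, reusing the soundex() sketch from earlier in the thread as a stand-in for whatever phonetic algorithm Doubletake actually uses):

        // Build Pattern 1's composite key for one record: exact postcode,
        // normalised title, first-name initial, phonetic surname,
        // and the first 10 characters of the address line.
        function pattern1Key(r) {
          const title = r.title === 'Ms' ? 'Miss' : r.title;  // treat Ms as Miss
          return [
            r.postcode,                 // full postcode - exact match
            title,
            r.firstName.charAt(0),      // initials (convert)
            soundex(r.surname),         // phonetics
            r.address1.slice(0, 10),    // first 10 characters
          ].join('|');
        }

        // Records whose keys collide under any of the patterns fall into
        // the same dupe group.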

    Up to 16 patterns can be built into one key, each evaluated separately for the accuracy of the match. Once finished and tested, the key can be saved against the data records for future use, and the output set to export unique records in one file and dupes in another. A dupe group ID is applied to each record so that a link file can be created - URN to Dupe Group - and used to reallocate (re-parent) child transaction records to a new master record.

    Additionally, the dedupe process will create a new 'master' record from a set of duplicates, stacking the different telephone numbers, emails, etc. that may reside within each duplicate record, so the master contains all the information and no customer or contact data is lost in the dedupe process.

    For merge/purge, typically where there is a multi-file input with one master suppression file and X files to merge, the output would show, record by record and file by file, which records matched a record on the master suppression file.

    Does that help?
  •     steve March 20, 2013 7:27AM
    This is the kind of level of complexity and specificity where we prefer to develop tools that allow *you* to build your tailored algorithm yourself.

    Accordingly, we've made soundex and levenshtein available as formula functions in tonight's 2.8 build. I don't think you'll be able to use levenshtein easily, but soundex and first-n-chars keys can be implemented using formula fields, then de-duping on those.

    http://forums.visokio.com/discussion/2066/soundex-and-levenshtein-formula-functions-2.8-

    Also consider writing your own JS operation. See http://forums.visokio.com/discussion/1657/custom-script-operation-2.8
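
    As a rough sketch of that derived-key workflow in plain JavaScript (the Omniscope scripting API itself is not shown; soundex() is the sketch from earlier in the thread, and the field names are made up):

        // De-duplicate on a derived key - phonetic surname plus the first
        // 10 characters of the address - keeping the first record per key.
        function dedupe(records) {
          const seen = new Set();
          return records.filter(r => {
            const key = soundex(r.surname) + '|' + r.address1.slice(0, 10);
            if (seen.has(key)) return false;
            seen.add(key);
            return true;
          });
        }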
  •     grahamb March 20, 2013 7:41AM
    Thanks Steve. Will give this new version a go when ready.
