Annals of Master Data
Effective Use of Machine Learning to Aid Data Quality and Governance
Record matching and deduplication in data integration have been long-standing problems in data management and are prevalent concerns in master data solutions.
The ability to have a single view of customers, counterparties, accounts, and products has a deep impact on how transactions are recorded and hence the ability to accurately trace revenue and opportunities.
Improvements over the years in machine-learning algorithms and processing efficiency have made remarkable headway in match-merge, deduplication, and governance related to these integration problems.
The challenges of consolidating master data records can be summarized as data inconsistency, semantic differences, and structural variations.
Broadly, matching issues can be grouped into three categories:
I have tried my hand at applying a few ML models to these master data issues, and here I share what I learned and some additional resources for you to consider in your own master data management.
Generally, string-matching algorithms compute the number of operations needed to transform one string into another. The more operations required, the less the similarity between the two strings.
Various string-matching algorithms exist to address inconsistencies in names and addresses. Of these string-distance algorithms, some, such as Levenshtein and Jaro-Winkler, are widely used in fuzzy matching algorithms.
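To make the "number of operations" idea concrete, here is a minimal sketch of the Levenshtein edit distance in plain Python. The function names levenshtein and similarity are illustrative choices, not part of any library mentioned in this article, and the normalization into a 0-to-1 score is one common convention among several.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to [0, 1]; 1.0 means identical strings.
    (Assumed normalization: divide by the longer string's length.)"""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
```

The fewer edits required, the closer the similarity score is to 1.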
Another approach to string matching is to convert the strings into numerical vectors. A string input is split into a set of tokens, rather than the complete string being treated as a unit. The idea is to find similar tokens in both sets.
The more tokens the two sets have in common, the greater the similarity between them. A string can be transformed into a token set by splitting it on a delimiter. This way, a sentence can be transformed into word tokens or character n-grams.
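As a rough illustration of the two tokenization styles, the sketch below splits a name into word tokens and into character n-grams. The helper names word_tokens and char_ngrams, the lowercasing step, and the default n of 3 are assumptions made for this example.

```python
from typing import Set


def word_tokens(text: str) -> Set[str]:
    """Split on whitespace and lowercase, returning the unique words."""
    return set(text.lower().split())


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the unique overlapping character n-grams of the text."""
    cleaned = text.lower()
    if len(cleaned) < n:
        # Too short to slide a window; keep the whole string as one token.
        return {cleaned} if cleaned else set()
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


print(sorted(word_tokens("Acme Motor Corporation")))
# ['acme', 'corporation', 'motor']
print(sorted(char_ngrams("corp")))
# ['cor', 'orp']
```

Character n-grams tend to tolerate typos better than whole words, at the cost of larger token sets.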
Typical string-distance algorithms tend to be slow as the number of records grows. Comparing every string against every other string in a set of n strings takes O(n^2) time. That is, if you have 3 million strings, it can take up to about 4.5 trillion compare operations (n(n-1)/2).
Exact-match comparison against unique values stored in a Python set data structure, by contrast, takes O(1) time on average per lookup, so checking all n strings can execute in O(n) time.
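The arithmetic behind the 4.5 trillion figure, and the contrast with a set lookup, can be sketched as follows. The company names are made-up sample values, and the lowercasing is an assumed normalization step; note that a set only finds exact matches after normalization, not fuzzy ones.

```python
def pair_count(n: int) -> int:
    """Number of distinct pairs among n items: n(n-1)/2."""
    return n * (n - 1) // 2


print(pair_count(3_000_000))  # 4499998500000, roughly 4.5 trillion

# Exact matching with a set: one hash lookup per incoming record,
# so n records cost O(n) in total on average.
known = {"acme corporation", "globex inc"}
incoming = ["Acme Corporation", "Initech LLC"]
exact_matches = [name for name in incoming if name.lower() in known]
print(exact_matches)  # ['Acme Corporation']
```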
The Locality-Sensitive Hashing (LSH) method, which generates hashes for the tokens, works on the premise that similar words are hashed into the same bucket. LSH, coupled with special hashing (MinHash), can be thought of as creating the same hash for words with similar prefixes and suffixes.
For example, a search for the word “Corporation” can be limited to words that have the prefix “Cor” and the suffix “tion.” LSH is an approximation, but it is very fast: because you only have to check the words inside the matching bucket or buckets, a lookup is in general very close to O(1).
String-matching algorithm results generally lie in the spectrum as shown:
Best case O(log n) < O(n) < O(n^2) Worst case
Some of these computational time and space problems can be addressed with a brute-force approach, using many computing resources through parallelization.
In one use case, Uber ran its matching jobs on Spark with LSH and finished faster by a full order of magnitude (from about 55 hours with the N^2 method to 4 hours using LSH). You can learn more about the Uber LSH use case.
MinHash and Locality-Sensitive Hashing (LSH) Model
The premise of LSH is that the similarity between two objects can be efficiently approximated by merely using their respective representations.
A popular way to construct an LSH for a similarity function is to carefully pick a hash function family so that each object is represented by a random hash value and the probability that the hashes for two objects collide is precisely their similarity.
The sentences or words are split into a set of words or characters. A Jaccard index can be used to find the similarity between strings by calculating the intersection of words (common in both sentences) over the union of them.
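A minimal sketch of the Jaccard index on word sets is shown here. The function name jaccard and the convention that two empty sets are fully similar are assumptions for this example; the company names are invented.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 1.0  # assumed convention: two empty sets are identical
    return len(a & b) / len(a | b)


left = set("acme motor corporation".split())
right = set("acme motor company".split())

# Shared words: {acme, motor}; all words: {acme, motor, corporation, company}
print(jaccard(left, right))  # 0.5
```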
The MinHash function creates signatures that give a close estimation of similarity of the words/characters they represent; the larger the signatures, the more accurate the estimates.
The process of creating such a signature that achieves those requirements is the MinHash algorithm.
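The signature idea can be sketched in plain Python. This is not the Spark ML implementation; it simulates a family of hash functions by salting SHA-1 with a seed, which is an assumption made so the example stays deterministic and self-contained. The names minhash_signature and estimate_similarity are illustrative.

```python
import hashlib


def _hash(token: str, seed: int) -> int:
    """One member of a simulated hash family: SHA-1 salted with `seed`."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes: int = 128):
    """For each hash function, keep the smallest hash over all tokens."""
    tokens = set(tokens)
    if not tokens:
        raise ValueError("cannot build a signature for an empty token set")
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimate_similarity(sig_a, sig_b) -> float:
    """Fraction of positions where the two signatures agree; this
    approximates the Jaccard index of the underlying token sets."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


a = {"acme", "motor", "corporation"}
b = {"acme", "motor", "company"}
estimate = estimate_similarity(minhash_signature(a, 256),
                               minhash_signature(b, 256))
print(estimate)  # an estimate of the true Jaccard index, which is 0.5
```

Increasing num_hashes narrows the gap between the estimate and the true Jaccard index, at the cost of a longer signature.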
LSH is a technique for finding nearest neighbors. This technique is based on special hashing where the signatures can tell how far apart or near they are to each other; based on this information, LSH groups the words into buckets whose members are approximately similar.
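One common way to turn MinHash signatures into buckets is banding: split each signature into bands, and treat records that share an identical band as candidate matches. The sketch below assumes that scheme, along with a salted SHA-1 hash family and character trigrams; the function names, record keys, and company names are all hypothetical, and this is not how any particular library implements it.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def char_ngrams(text: str, n: int = 3) -> set:
    cleaned = text.lower()
    return {cleaned[i:i + n] for i in range(max(len(cleaned) - n + 1, 1))}


def minhash_signature(tokens, num_hashes: int):
    def h(token: str, seed: int) -> int:
        digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return [min(h(t, seed) for t in tokens) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures: dict, bands: int, rows: int) -> set:
    """Bucket each record once per band; records sharing any bucket
    become candidate pairs. Signatures must have bands * rows entries."""
    buckets = defaultdict(set)
    for key, sig in signatures.items():
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            buckets[(band, chunk)].add(key)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


names = {
    "r1": "Acme Corporation",
    "r2": "Acme Corporation",   # exact duplicate of r1
    "r3": "Zyx Qwv Holdings",   # shares no trigram with r1
}
bands, rows = 16, 4
signatures = {k: minhash_signature(char_ngrams(v), bands * rows)
              for k, v in names.items()}
candidates = lsh_candidate_pairs(signatures, bands, rows)
print(candidates)  # {('r1', 'r2')}
```

Only the candidate pairs then need an exact similarity check, which is where the speed-up over all-pairs comparison comes from. More bands with fewer rows catch less similar pairs but admit more false candidates.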
My experimentation with the Spark ML MinHash library for comparing a few names and addresses of companies gave me these results:
Matching of names/phrases that inconsistently use terms, such as “automobile” and “vehicle,” may require the establishment of similarity semantically. The string-matching techniques explained thus far only identify textual similarities of the names or phrases compared.
A potential solution to address these types of relationships is a neural network model, such as Word2Vec, that is pre-trained on news articles and has encoded semantic concepts, such as the capital cities of countries. It can learn the relationship between countries and their capitals, such as:
Paris - France = Berlin - Germany, or the semantic association of words such as “doctor” and “physician.”
In my experiment with an Apache Spark Word2Vec model, built on a few training terms, a search of the term "Drug" gave me a cosine similarity of 0.99 with the word "Medicine" (where a value of 1 is similar and a value of 0 is entirely different). Similarly, a search of the term "Car" gave me "Motor" with a cosine similarity of 0.65.
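For reference, the score reported above is the cosine of the angle between two word vectors. The sketch below computes it in plain Python; the three-dimensional vectors are made-up toy values, not the embeddings from my Spark model, and real Word2Vec vectors typically have a hundred or more dimensions.

```python
import math


def cosine_similarity(u, v) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings, for illustration only.
drug = [0.9, 0.1, 0.3]
medicine = [0.8, 0.2, 0.3]
car = [0.1, 0.9, 0.2]

print(cosine_similarity(drug, medicine))  # close to 1: similar direction
print(cosine_similarity(drug, car))       # noticeably lower
```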
Mismatches of names/phrases can also arise due to the context in which the information appears. For example, “drug” and “pharmaceutical” mean the same thing in the context of pharma companies, but regular fuzzy matching approaches based on phonetics, edit distance, or lists would not capture this similarity.
When these are applied to organizations, the word-embedding method can recognize, for example, that JDR Drugs and JDR Pharmaceuticals are most likely the same company.
Corporate structural changes through mergers and acquisitions or product hierarchies result in a different set of master data problems. Whole Foods as a subsidiary of Amazon needs inputs of corporate events related to acquisition.
The case of “Chase Manhattan” as currently part of JP Morgan Chase & Co has significant effects on how transactions are linked. Similarly, embedding “Bear Stearns” and “Chase Manhattan” to “JP Morgan Chase and Co” can help connect these companies in a more automated fashion.
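Once such relationships are known, whether from a model or from curated corporate-event data, linking transactions comes down to resolving each name to its current parent. The sketch below assumes a simple hand-built lookup table; the parent_of dictionary and the ultimate_parent function are hypothetical illustrations, not the output of any embedding model.

```python
def ultimate_parent(name: str, parent_of: dict) -> str:
    """Follow parent links until reaching an entity with no parent."""
    seen = set()
    current = name
    while current in parent_of:
        if current in seen:
            raise ValueError(f"cycle detected in hierarchy at {current!r}")
        seen.add(current)
        current = parent_of[current]
    return current


# Hypothetical lookup table built from the examples in this article.
parent_of = {
    "Whole Foods": "Amazon",
    "Chase Manhattan": "JP Morgan Chase & Co",
    "Bear Stearns": "JP Morgan Chase & Co",
}

print(ultimate_parent("Bear Stearns", parent_of))  # JP Morgan Chase & Co
print(ultimate_parent("Amazon", parent_of))        # Amazon
```

The hard part in practice is keeping such a table current, which is where the news feeds and corporate-event inputs come in.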
Typical classification and clustering models group an item based on the similarity of its features to a category to which it belongs. For example, an apple has discerning biological and culinary features that classify it as a fruit, whereas most words themselves have no inherent features or meaning.
It is the linguistic rules that give meaning to words such as “fruit” and “apple” and make them related. Embedding these meanings is what makes neural network models more efficient. You may have to gather data from news articles or legal agreements and consolidated financial statements to obtain features that make Whole Foods a subsidiary of Amazon.
Corporate legal structures or product hierarchies can have complex relationships. A sample of relationship data is Nissan to its subsidiaries and other ventures that are embedded in contracts.
Product hierarchies can have complex relationships, such as a replacement part that can fix multiple models, or reporting groups that classify Turkey as part of Europe in one and part of the Middle East in another.
These legal structures and complex hierarchies can be discerned from news articles and other data feeds and encoded into models, which can eventually be used to discern obscure relationships across companies and even complex product hierarchies.
Comprehensive data feeds through news articles and contributors, and a stronger data governance in ensuring accuracies, are essential to accurately building relationships among these companies. However small or large your customer or product master data set is, it can benefit from a machine-learning model learned from aggregated data, providing you with a single view of your customers and products.
As clichéd as it may sound, no single name-matching method can address all the nuances found in data. The design approach should combine multiple steps including pre-processing, data compilation, and multiple machine-learning models, based on the criticality of the problem.
Models, data collaborators, and vetting mechanisms may be crucial to real-time information gathering related to changes to the corporate structure. A comprehensive solution to master data management requires thinking, collective data aggregation, and continuous learning models.