Applying an Active Learning Algorithm For Entity Deduplication In Graph Data
Presentation At Open Data Science Conference West 2017
Data deduplication, or entity resolution, is a common problem for anyone working with data, especially public data sets. Many real-world datasets do not contain unique IDs; instead, we often use a combination of fields to identify unique entities across records by linking and grouping. This talk shows how we can use active learning techniques to train learnable similarity functions that outperform standard similarity metrics (such as edit or cosine distance) for deduplicating data in a graph database. Active learning is a semi-supervised machine learning technique that incorporates user feedback at each training iteration to ensure that an optimal data point is used for training. Further, we show how these techniques can be enhanced by inspecting the structure of the graph to inform the linking and grouping processes. We will demonstrate how to use open source tools to perform entity resolution on a dataset of campaign finance contributions loaded into the Neo4j graph database, making use of Neo4j, Cypher (the query language for graphs), and Python data science tools.
Links And Resources
So today I'm going to talk about entity resolution in graph data. Entity resolution is also called data deduplication: the idea that, if we have two records, are these the same, yes or no? I'm a software developer at Neo4j; I'm not a data scientist, so sorry about that. I don't work on the core database. Rather, I work on integrations: I build tools to make sure that you can use Neo4j with your favorite web development framework or with your data processing framework, that you can move data around, things like that. I also lead what we call our data journalism accelerator program. One thing that's been interesting in the last few years at Neo4j is that we've seen more and more data journalists wanting to use graph databases and Neo4j to make sense of very connected data for their stories, and we wanted to make sure there were resources available to help data journalists as they go through this process of analyzing datasets. So we created this program, which mostly consists of me working hand in hand with data journalists to try to solve some of their problems and make sense of the data they're working with.

My talk today is actually a case study of some of the challenges we ran into with some of the folks in the data journalism program, and specifically a deep dive into one approach we took to data deduplication. Hopefully this gives you an idea of the challenges that arise when working with this sort of data (although I'm sure we're all familiar with the problems of data deduplication) and also an idea of some possible solutions.

First of all: who's familiar with graph databases or Neo4j? Who's used a graph database? OK, a good amount of folks. I want to spend a little time talking about what a graph database is and what Neo4j is, so that we're all on the same page. Then, like I said, we'll dive into this idea of entity resolution as a case study, using one of the examples from the journalism program, and take a look at a specific approach we used to solve this problem of entity resolution, or deduplication.

So first, let's talk a little bit about Neo4j. I think the best way to explain something, if you're new to it, is in the form of a tweet, so here's Neo4j in 140 characters: open source software that stores and queries data as nodes and relationships, using the Cypher query language, with index-free adjacency. There's a lot going on there, so let's unpack it. First, Neo4j is open source. It's available to download, you can easily spin it up on AWS, there are Docker images you can use, and you can clone it from GitHub and build from source if you want. Next, it's software that stores and queries data: Neo4j is primarily a database, not an in-memory graph processing framework. Its primary responsibility is to take a transaction and durably commit it, so that once the transaction is acknowledged, you know your data is safe. We also want to be able to query our data as a graph once it's committed, and this is where the idea of graph processing comes in. We do this by modeling our data as nodes and relationships: nodes are the entities in our graph, the objects; relationships connect them; and we can store arbitrary key-value pairs on both nodes and relationships.

This specific example comes from the Panama Papers dataset. How many people are familiar with the Panama Papers? Yeah, so this was a leaked dataset from a law firm in Panama that some data journalists got ahold of, and they ended up using Neo4j to make sense of this data because it was so complex and interconnected.
We had legal entities registered in some offshore jurisdiction, the law firm that registered each legal entity, and then shareholders and beneficiaries of those legal entities, which may themselves be legal entities. Ultimately we needed to be able to find: who is the ultimate shareholder, the ultimate natural person, that's a beneficiary of this legal entity? This is the data model that the ICIJ (the data journalism group) came up with to work with this data in Neo4j.

OK, the next part of our tweet: we store and query data as nodes and relationships using the Cypher query language. Cypher is like SQL, but for graphs. Cypher is very much about expressing patterns and matching on those patterns in the graph: we define nodes within parentheses and relationships within brackets, and then we build more complex patterns out of those basic building blocks. In this query, on the first line, we're looking for any Address node where the address property contains "Seattle", so, addresses in Seattle. Then we traverse out from those address nodes to find any officers (these are the natural persons): who has an address in Seattle? And then we traverse out from each person to find any legal entities they're connected to. So this query says: find me people with an address in Seattle, and show me all of the offshore legal entities that they're a shareholder or beneficiary of. If we ran this in Neo4j Browser, the query workbench for working with Neo4j, we'd get a graph visualization something like this, and that's useful: we can see clusters where we have tightly connected groups of people.

But oftentimes the answer to our question is not a graph visualization; instead it's an aggregation and a group by. Maybe we want to know, for everyone with an address in Seattle that has some connection to an offshore entity, what are the most common offshore jurisdictions? That is, for people in Seattle, what jurisdictions are they using to register their offshore legal entities? [Audience question.] Yes, we'll see an example of data import in a minute. And yes, just like SQL, Cypher is something that you write; there are tools for building queries, but it's something you would write by hand.

So this query is very similar to the previous one, but now in our RETURN clause we have an implicit group by. We're doing an aggregation, that count(*), counting the number of paths, and any time we do an aggregation in Cypher we get an implicit group by. So we're grouping by jurisdiction and getting a count, and now instead of a graph visualization our result is a table. We can see that the most common jurisdiction for officers with addresses in Seattle is the British Virgin Islands, followed by "undetermined" (so, missing data), Samoa, and Panama. The point I want to make here is just that with Cypher we can express graph patterns and traverse the graph, but we can also do things you'd be familiar with from SQL, like group bys and aggregations; often the answer to your question is not just some graph traversal, it's an aggregation, or running some algorithm. Cypher is part of the openCypher project, an open source language; there are other graph databases that use Cypher, and other processing frameworks as well. You can use Cypher with Spark, so you can express a Spark job in Cypher.
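The implicit group by described above is just "count per grouping key." As a rough illustration, here is the same idea over plain Python data; the officer/jurisdiction rows below are hypothetical stand-ins for the paths the Cypher query would actually match in the graph:

```python
from collections import Counter

# Hypothetical (officer, jurisdiction) result rows standing in for the
# paths the Cypher query would match in the real Panama Papers graph.
rows = [
    ("officer_1", "British Virgin Islands"),
    ("officer_2", "British Virgin Islands"),
    ("officer_3", "Samoa"),
    ("officer_4", "Panama"),
    ("officer_5", "British Virgin Islands"),
]

# Counting per jurisdiction reproduces the implicit group by that
# Cypher's count(*) aggregation gives us.
counts = Counter(jurisdiction for _, jurisdiction in rows)
for jurisdiction, n in counts.most_common():
    print(jurisdiction, n)
```

With these toy rows, the British Virgin Islands comes out on top, just as in the real query's result table.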
The last part of our tweet was index-free adjacency. This is a very important performance characteristic of graph databases, and what it means is that when we're traversing from one node to another, when we're following relationships, we're not doing an index lookup; we're essentially just chasing pointers, which computers are very good at. The reason this matters is that an index operation is dependent on the size of the dataset. If we're not dependent on the size of the dataset, then a local graph traversal is a constant-time operation, which means its performance doesn't change as we scale up to very large datasets. So our queries scale to large datasets very easily: traversing the local graph is a constant-time operation.

If you want to play around with Neo4j, or with the dataset we're going to look at today, the Neo4j Sandbox is probably the best way to do it; it has lots of guided examples for learning Cypher and playing around with graph data.

OK, so let's take a look at a specific data model. This data comes from campaign finance: the Federal Election Commission releases information about individual contributions and funding for election campaigns. We have information about who made a contribution, where they work, what their occupation is, what state they live in, the committee they made the contribution to, whether that committee is a PAC associated with a corporation, whether that committee funds a specific candidate, whether it's a primary election campaign for a specific candidate, all this kind of information. We're ignoring expenditures in this example (there's also a lot we could do with how that money was spent); this is just where the money comes from and what candidate it's going to. If we think about entity resolution, or deduplication, the problem essentially boils down to: I have two records, maybe two contributors, and I need to know, are these records referring to the same person? So let's take a look at the data and see if we can answer that question.

Like I said, this data comes from the FEC, the Federal Election Commission, and we're going to use three CSV files: one for candidates, one for committees, and one for individual contributions. The first one is our candidates; let me zoom in a little. We have a row for every candidate, and this is actually quite nice, because the first column in the CSV file is a unique ID for each candidate, so we can treat that as a primary key. In Neo4j we call this a uniqueness constraint: we're going to create Candidate nodes, and we create a uniqueness constraint on the FEC ID property that says, at the database level, do not allow any write that creates another Candidate node with an FEC ID we already have. Right, a primary key. This makes our import pretty simple. In Cypher we have the MERGE clause, which is like an upsert, or a get-or-create: it says if we already have a candidate with this FEC ID, don't create a duplicate, just match on it. And we have good tooling for importing from CSV files using Cypher, so this is pretty straightforward.

When we get to committees it's almost exactly the same: the first column is a unique committee ID that the FEC assigns, so we can create a uniqueness constraint on that, and MERGE creates the nodes. We now have Committee nodes and Candidate nodes, no problem. But the next file is individual contributions, and these are a little more tricky, because we don't have any unique ID here. We have a name, an occupation, an employer, a zip code, but nothing we can treat as a unique ID. Remember, this is all public data, so we don't have anything like an SSN.
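The MERGE-on-a-constrained-key import described above behaves like a get-or-create. A rough in-memory analogue, with an invented FEC ID and properties purely for illustration:

```python
# A rough in-memory analogue of what Cypher's MERGE does against a
# uniqueness constraint: match the existing node for a key, or create
# it, never both. The FEC ID and properties below are invented.
candidates_by_fec_id = {}

def merge_candidate(fec_id, properties):
    """Get-or-create a candidate node keyed on its FEC ID."""
    if fec_id not in candidates_by_fec_id:
        candidates_by_fec_id[fec_id] = {"fec_id": fec_id, **properties}
    return candidates_by_fec_id[fec_id]

first = merge_candidate("H0XX00001", {"name": "SOME CANDIDATE"})
second = merge_candidate("H0XX00001", {"name": "SOME CANDIDATE"})
print(first is second)  # True: the second import row matched, no duplicate
```

Running the same row through the import twice leaves exactly one node, which is the whole point of the uniqueness constraint.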
We do have a unique transaction ID for each contribution. So in a naive approach to this import, what can we do? Well, we could create a synthetic unique ID: take the name and zip code, concatenate them together, and treat that as a unique ID. We say anywhere we have the same name and zip code mashed together, that's the same individual, the same contributor; we create a uniqueness constraint on that and do our import. And OK, that works out fine in some cases. Here's Brad. Brad works for LexisNexis as a content engineer, and we can see that he's made a series of contributions to the Mayday PAC and to the Zephyr Teachout for Congress committee; he lives in Colorado. So this approach worked out OK in Brad's case; Brad is fairly consistent in reporting his name and zip code on the disclosure forms. Thanks, Brad.

But this doesn't work out in all cases. Here we have duplicate records: Bertram Finn lists a zip code using the expanded ZIP+4 version, and then he makes another contribution using only the five-digit zip code representation. So we end up with two different unique IDs in that case; we treat that as two different records, and we don't capture that it's the same person making those contributions. And we see that again and again.

We have a similar problem when we get to the employer field. Employer is a bit more difficult, because for employer we just have a name. Here are all the different ways we can spell LexisNexis in the FEC records: "LEXISNEXIS", "LEXIS NEXIS", "LEXISNEXIS RISK SOLUTIONS INC", "LEXISNEXIS RISK SOLUTIONS, INC" with a comma. All of these kinds of problems, which we can clean up a little, but again, it's problematic. So we can do something like an edit distance between company names. This is just the edit distance between different versions of LexisNexis, sorted by distance: a smaller edit distance means a greater probability of being the same entity.
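The edit distance mentioned above is the classic Levenshtein distance: the minimum number of single-character insertions, deletions, and substitutions to turn one string into the other. A minimal sketch, applied to some of the employer-name variants:

```python
def edit_distance(s, t):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

for variant in ["LEXISNEXIS", "LEXIS NEXIS", "LEXISNEXIS RISK SOLUTIONS INC"]:
    print(variant, edit_distance("LEXISNEXIS", variant))
```

"LEXIS NEXIS" is a single edit away (one inserted space), while the longer "RISK SOLUTIONS INC" variant is much farther, which matches the intuition that a smaller distance suggests a higher probability of being the same entity.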
OK, so we've done our initial import and we realize we have some duplicate records, and we want to take a less naive approach to deduplicating this data. Let's take a look at how we do that. But first I want to talk about a few examples that are maybe not unique to graph data, but that show themselves much more clearly when you're working with graph data, as to why having duplicate data in the graph is a problem. The three ideas are: inferred relationships, joining datasets, and what I'll call aggregating traversals.

The first example, going back to the Panama Papers dataset, is this idea of inferred relationships; in graph theory we'd call this triadic closure. If we look here, that top node labeled "Iceland" is an address node, and the purple nodes are two officers that share the same address. So really there's an inferred relationship between them: they live together, because they share the same address. In this case the node on the left, whose name I'm not going to try to pronounce, is actually the former prime minister of Iceland, who failed to disclose that at some point he sold his interest in an offshore legal entity to his wife, who lives at the same address. This was discovered by the ICIJ when they started working with the Panama Papers dataset, basically by exploring the graph: looking for important people, then looking for connections and for these inferred relationships. Basically, looking for triangles and closing the triangle: triadic closure. So if we didn't have one address node, if instead we had some misspelling in the address and treated those as two different address nodes, we wouldn't be able to see these inferred relationships. We wouldn't have been able to see that there was this sale of interest in the Wintris entity to someone who lives at the same address. That's the idea of inferred relationships.
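The triadic-closure idea above (two officers pointing at the same address node implies a relationship between the officers) can be sketched in a few lines. The officer and address names here are placeholders, not the real records:

```python
from itertools import combinations

# Toy graph: officers registered at address nodes, as in the Panama
# Papers model. Names are placeholders invented for illustration.
registered_at = {
    "officer_A": "address_reykjavik",
    "officer_B": "address_reykjavik",
    "officer_C": "address_london",
}

# Group officers by the address node they share.
by_address = {}
for officer, address in registered_at.items():
    by_address.setdefault(address, []).append(officer)

# Closing the triangle: any two officers on the same address node get
# an inferred "lives with" style relationship between them.
inferred = [pair for officers in by_address.values()
            for pair in combinations(sorted(officers), 2)]
print(inferred)
```

Note that if `address_reykjavik` were duplicated under a misspelled variant, officer_A and officer_B would point at different nodes and the inferred pair would silently disappear, which is exactly the failure mode described above.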
Another thing we do with graph data that's really interesting and powerful is this idea of combining datasets and then querying across them. This is something you can do pretty easily in a graph database because of the flexible nature of the data model, and also because it's easy and powerful to use Cypher to express complex traversals across datasets. What's an example? Well, we can take the BuzzFeed Trump World dataset, which has data about people near the Trump administration and their connections, oftentimes to corporations, and combine it with the USA Spending dataset, which has information on the awarding of federal contracts. Then we can ask: is there anyone connected to the Trump administration who's connected to a company that is a recipient of a government contract? If so, it might be interesting to know what the connections are there. And of course there are. Here's one example, where we can see that some advisers to Donald Trump's campaign are connected to Vornado Realty Trust, which is the parent organization of Merchandise Mart Properties, which was awarded a contract by the federal prison system to administer private prisons. That's kind of interesting, especially if you look at some of the news and follow the price of private prison stock after Trump's election. So that's joining datasets.

The third example of why you want to make sure your data isn't duplicated in the graph is this idea of aggregating traversals. We saw this already: for people with an address somewhere, and the offshore entities they're connected to, what are the most common jurisdictions? Well, if we had duplicate data along those paths, of course we're not going to get the right counts. If "British Virgin Islands" doesn't show up as a single canonical representation of that jurisdiction, our counts are going to be off. This is fairly obvious, but I think it's important to point out the reasons we go through this process of deduplicating our data. And there's a similar problem in the campaign finance data. This query asks, basically: who are the most common employers for everyone who made a contribution? "RETIRED" is the most common, and we see "NOT EMPLOYED", "SELF-EMPLOYED", "NONE", and "SELF" showing up several times. Again, we should have some canonical representation of the idea of being self-employed, so that our counts roll up correctly in our aggregations.

OK, so let's take a look at the approach I used with the group that was working with this campaign finance data. (We weren't actually working with campaign finance data; it was something else that I can't really talk about, but you get the same basic idea.) Here are two records. We have different fields in these records, and we want to know: are these the same entity, yes or no? We can look at them and say, oh yeah, this is probably the same person. But of course we need some way to quantify this, to automate it, to do it at scale. So how do we do this?

Let's start with a field-by-field comparison, just on one property. We need some similarity metric, so we can use something like edit distance to compare the two strings: how many edits do we need to make to turn one string into the other? In this case, four. We have this sort of functionality available in Cypher. We can use other approaches too, things like tf-idf that are term-based, looking at a set of words and the probability of those words across our corpus; there are pros and cons to tf-idf.
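To make the term-based idea concrete, here is a simplified token-level similarity: cosine similarity over raw token counts. This is only a stand-in for tf-idf proper (real tf-idf would additionally down-weight tokens that are common across the whole corpus), but it shows why token-based measures shrug off punctuation and word-order noise that character edit distance punishes:

```python
import math
import re
from collections import Counter

def tokens(s):
    """Lowercased alphanumeric tokens, so punctuation differences vanish."""
    return re.findall(r"[a-z0-9]+", s.lower())

def token_cosine(a, b):
    """Cosine similarity over raw token counts; a simplified stand-in
    for tf-idf, which would also down-weight corpus-common tokens."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) \
         * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

print(token_cosine("LEXISNEXIS RISK SOLUTIONS INC",
                   "LEXISNEXIS RISK SOLUTIONS, INC."))  # identical tokens -> 1.0
print(token_cosine("LEXISNEXIS",
                   "LEXISNEXIS RISK SOLUTIONS INC"))    # shares 1 of 4 tokens
```

The comma-and-period variant scores a perfect 1.0 here, while edit distance would count every punctuation character as a separate edit.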
There are also probabilistic extensions to edit distance that take into account the type of edit being made, and we can look at some Markov-model approaches as well. We can also look at Soundex, which is really interesting: it's a sort of phonetic indexing of a string, where we basically get a hash for each string that we can compare. But let's say we're just going to use edit distance. We compute the edit distance for each field, and then we need some way to combine these different similarity measures across all of the fields, because we want to do this comparison at the record level, not at the field level. So we need some weights. As a human, we can look at this and pretty easily come up with weights that make sense: email is probably a lot more important than name or phone number. Then we can do a weighted average on these and get some metric comparing the two records.

OK, but I just made those weights up. How do we do this at scale? How do we automate this process? This is where the idea of active learning comes in. Has anyone worked with active learning approaches? A couple of folks, OK. With active learning, our goal in this case is to learn weights for our similarity distance for each field, and to do this we need labeled pairs: pairs where someone has said "yes, these are duplicates" or "no, these are not duplicates." So it's supervised learning, but we want to minimize the amount of time a human has to spend comparing records and labeling them. When we present the human with a pair of records to label, we want to make sure the algorithm will get the most information out of having a label for that pair. Then, after the human labels some pairs for us, we relearn the weights and iterate, presenting the user with more pairs to improve our estimate of, in this case, the weights.
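Here is the hand-weighted record-level comparison sketched in code. The emails, phone numbers, and weights are invented for illustration (and `difflib.SequenceMatcher` stands in for a per-field similarity metric); the whole point of the learning step that follows is to learn these weights instead of guessing them:

```python
from difflib import SequenceMatcher

def field_sim(a, b):
    """0..1 string similarity; stdlib stand-in for a per-field metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hand-picked guesses: email is assumed to matter far more than name or
# phone. These numbers, and the contact details below, are invented.
weights = {"name": 0.2, "email": 0.6, "phone": 0.2}

record_a = {"name": "Bertram Finn", "email": "bfinn@example.com", "phone": "555-0100"}
record_b = {"name": "Bert Finn",    "email": "bfinn@example.com", "phone": "555-0100"}

# Weighted average of the per-field similarities gives one record-level score.
score = sum(w * field_sim(record_a[f], record_b[f]) for f, w in weights.items())
print(round(score, 3))
```

The matching email and phone dominate the mildly different names, so the pair scores close to 1, which is what a human would conclude at a glance.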
So how do we estimate the weights? We use logistic regression. Many people have used logistic regression; it's very useful for classification. Basically, we just fit the logistic function to our data. Say we have data that looks like this, where, looking at a single field (let's say name), we have the edit distance on one axis and the label, duplicate or not, on the other. We can fit the logistic function to this, and the coefficient becomes the weight for this field.

But we need to make sure we're not comparing across all pairs of data for our algorithm. This is really important: even with relatively small data, say just a thousand rows, that's potentially half a million comparisons we might have to make. This is where the blocking function comes in. It's important to note that in this kind of data, anyway, duplicates are relatively rare; almost all pairs are not duplicates. A blocking function allows us to set some limit on the kinds of pairs we're going to compare, basically to ask: are these in the same ballpark? Just give me the things in that ballpark for comparison. The standard approaches for blocking functions are things like predicate blocks, where maybe we compare records only if a field matches exactly (if two records have the same name, I may be interested in comparing them), or maybe we tokenize the fields and block on shared tokens across records, or on a shared integer, or on n-grams, that kind of thing. But since we're using a database, we can also build indexes. Databases are really good at building indexes, and indexes are very nice for giving us things that are close to each other.
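A minimal predicate-blocking sketch, using invented toy records: instead of scoring all C(n, 2) pairs, we only generate candidate pairs that share a name token.

```python
from itertools import combinations

# Toy contributor records; names and zips are invented for illustration.
records = [
    {"id": 1, "name": "BERTRAM FINN", "zip": "05593"},
    {"id": 2, "name": "BERTRAM FINN", "zip": "055931731"},
    {"id": 3, "name": "ALICE JONES", "zip": "98101"},
    {"id": 4, "name": "BOB SMITH", "zip": "60614"},
]

def token_blocks(records, field):
    """Predicate blocking: group record ids by each token in `field`."""
    blocks = {}
    for r in records:
        for token in r[field].split():
            blocks.setdefault(token, []).append(r["id"])
    return blocks

# Only pairs that land in the same block ever get compared.
candidate_pairs = set()
for ids in token_blocks(records, "name").values():
    candidate_pairs.update(combinations(sorted(ids), 2))

print(sorted(candidate_pairs))  # 1 pair instead of the 6 possible pairs
```

With four records there are six possible pairs; blocking cuts that to one. At a thousand rows the same idea is what turns half a million comparisons into something tractable.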
We can build an index on any of our fields and use the index as our blocking function. And another interesting thing we can do, because we're using a graph database, is use graph algorithms as our blocking function: we can run some community detection algorithm, something like connected components, and use the resulting cluster as our blocking function.

Cool, so let's look at a couple of code snippets to see how we can actually implement something like this. There's a Python library called dedupe that bundles up a lot of this approach, which I really like, and it's easy to use with lots of different tooling. It implements this active learning approach; we can do logistic regression, and we can choose other options there; and we can define our own blocking functions, which is really important. That means we can use it with really any database, as long as we're able to define our blocking function. In this example we call out to Neo4j and send some Cypher, but of course you can do this with a relational database by sending SQL as well. With dedupe we can, at the end of the day, spit out a CSV file, or we can read and write directly from a database using the Python driver for that database.

OK, so how does this work? First we need to do some training. We basically declare: these are the fields we have, these are their types, it's OK if this one is missing a value. We instantiate a Deduper object, then go through the active learning loop: we're presented with two records ("are these duplicates, yes or no?"), the human labels them, and we go through this training process until we eventually reach a steady state. Then we can choose to write out our results, either directly to the database, updating records, or to a CSV file. If we write to a CSV file, we get a synthetic unique ID for each record.
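To make the core loop concrete, here is a minimal from-scratch sketch of the two ingredients described above: logistic regression over per-field similarities (the learned coefficients play the role of the field weights), plus an uncertainty-sampling helper that picks the pair the model is least sure about, which is the pair worth showing to the human. All records, fields, and labels are invented toy data, and `difflib` stands in for the per-field metrics; the dedupe library implements a production-quality version of this:

```python
import math
from difflib import SequenceMatcher

FIELDS = ["name", "employer", "zip"]  # invented field set for the sketch

def features(a, b):
    """Bias term plus one 0..1 similarity per field."""
    return [1.0] + [SequenceMatcher(None, a[f], b[f]).ratio() for f in FIELDS]

def predict(w, x):
    """Logistic function of the weighted field similarities."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

def train(labeled, epochs=500, lr=0.5):
    """Fit the weights by stochastic gradient descent on labeled pairs."""
    w = [0.0] * (len(FIELDS) + 1)
    for _ in range(epochs):
        for x, y in labeled:
            p = predict(w, x)
            w = [wi + lr * (y - p) * xi for wi, xi in zip(w, x)]
    return w

def most_uncertain(w, candidates):
    """The pair the model is least sure about: ask the human about this one."""
    return min(candidates, key=lambda x: abs(predict(w, x) - 0.5))

# Two human-labeled pairs: a duplicate (1) and a non-duplicate (0).
labeled = [
    (features({"name": "BERTRAM FINN", "employer": "ACME", "zip": "05593"},
              {"name": "BERT FINN", "employer": "ACME", "zip": "055931731"}), 1),
    (features({"name": "BERTRAM FINN", "employer": "ACME", "zip": "05593"},
              {"name": "ALICE JONES", "employer": "GLOBEX", "zip": "98101"}), 0),
]
w = train(labeled)
print([round(wi, 2) for wi in w])  # learned weights; index 0 is the bias
```

In the real loop you would alternate: train on what's labeled so far, call `most_uncertain` over the blocked candidate pairs, ask the human, and repeat until the model settles.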
That's the cluster ID. So instead of that silly synthetic primary key on name and zip code, we can just use this new cluster ID. That then allows us to write queries that say: OK, for people who donated to the Mayday PAC, what other political action committees are they donating to? In this case the answer is fairly obvious: if you made a donation to one committee, the most likely committee you're going to donate to again is the same committee; you make some donations to it again over time. But anyway, those are the types of queries we can write. The link I shared earlier (and I'll share my slides) has this for this dataset as well.

Another interesting thing we can do is take this idea of combining datasets and bring in data from the US Congress. For the candidates running for re-election who already serve in Congress, we can look at the bills they're sponsoring and the subjects of those bills; that gives us some inference on the subjects those legislators have domain over. Then we can look at who's making contributions to those legislators: where do those people work, where are those people from, that kind of thing. And there are some interesting things here. I'm from Montana, and for a while Max Baucus, who was a senator from Montana, was on the Senate Finance Committee, and almost every year the major donors to his re-election campaign were not from Montana; they were from New York and worked in the finance industry. So you can chase the money here, and if you look at the incentives, it all sort of makes sense.

Anyway, I want to mention a couple of resources again. The Neo4j Sandbox has the datasets we talked about today, including the Trump World and USA Spending data for government contracts.
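Stepping back to those cluster IDs: once the model has marked certain pairs as duplicates, assigning every record a cluster ID is just transitive grouping of the matched pairs. A tiny union-find sketch of that step, with invented record IDs and matches (dedupe does this clustering for you):

```python
# Group pairs the model marked "duplicate" into clusters, mimicking the
# synthetic cluster IDs written out at the end. Record IDs are invented.
parent = {}

def find(x):
    """Root of x's cluster, with path halving."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    parent[find(a)] = find(b)

matched_pairs = [("rec_1", "rec_2"), ("rec_2", "rec_5")]
for a, b in matched_pairs:
    union(a, b)

records = ["rec_1", "rec_2", "rec_3", "rec_5"]
cluster_id = {r: find(r) for r in records}
print(cluster_id)  # rec_1, rec_2, rec_5 share a cluster; rec_3 stands alone
```

Transitivity is the point: rec_1 and rec_5 were never compared directly, but they end up in one cluster via rec_2, and that shared cluster ID is what replaces the name-plus-zip synthetic key in the graph.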
That one's kind of interesting, as well as this campaign finance data, so if you want to try that out, it's at the Neo4j Sandbox site. And I guess that's the only resource I have to share, but I think we have maybe five minutes for questions if anyone has one; or you can go to lunch, I think lunch is served pretty quickly.

[Audience question.] Yeah, so the question is: can I say more about how we use graph algorithms as blocking functions? Basically, the point of the blocking function is to limit your scope of comparison, and we can do this with a predicate or an index, or we can look at connectedness in the graph and treat that as a sort of index as well. Connected components is an algorithm that essentially says: find the pieces of the graph that are connected to each other. When we use that as our blocking function, it limits our comparisons to within those connected components. The reason we do that is because if you've donated to one campaign, one committee, it's probably likely that you're going to donate to that same committee again. If that's the case, but we missed you (maybe your name spelling was off, so we treated you as two different nodes), then both of your records are going to have this inferred relationship through the committee you've donated to. So that's an example where we're using the structure of the graph, rather than string comparisons, to boost our ability to make comparisons; we're not looking at anything regarding the content of the records themselves, just the structure of the graph. Yes, that's exactly right: we're just trying to make smart comparisons, so that we learn the most information when the human goes to label.
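The structural blocking described in that answer can be sketched with a plain breadth-first connected-components pass over a toy contributor-to-committee graph (all names invented): two donor records that funded the same committee land in the same component and so get compared, while unrelated records never do.

```python
from collections import deque

# Toy bipartite edges: contributor records -> committees they funded.
# donor_A1 and donor_A2 might be the same person recorded twice.
edges = {
    "donor_A1": ["committee_X"],
    "donor_A2": ["committee_X"],
    "donor_B":  ["committee_Y"],
}

# Build an undirected adjacency list.
adj = {}
for donor, committees in edges.items():
    for c in committees:
        adj.setdefault(donor, set()).add(c)
        adj.setdefault(c, set()).add(donor)

def connected_components(adj):
    """BFS over every unvisited node, collecting one component at a time."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            if node in comp:
                continue
            comp.add(node)
            queue.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components

# donor_A1 and donor_A2 share a component through committee_X, so they
# get compared; donor_B sits in a separate component and never does.
for comp in connected_components(adj):
    print(sorted(comp))
```

Nothing here looks at the content of the records; the blocking decision comes purely from the shape of the graph, which is the point of the answer above.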
[Audience question about evaluation.] So, after we go through the learning process, then, yeah: I didn't really go through any evaluation process in this example. Essentially, to evaluate against the entire set we'd have to label the entire set, but dedupe will do some automated cross-validation for you and give you some metrics, which I left out of my presentation today; that's built into dedupe. Cool. I think we're out of time and lunch is served, but I'll be around for questions.