Applying an Active Learning Algorithm For Entity Deduplication In Graph Data

Presentation At Open Data Science Conference West 2017

Data deduplication, or entity resolution, is a common problem for anyone working with data, especially public datasets. Many real-world datasets do not contain unique IDs; instead, we often use a combination of fields to identify unique entities across records by linking and grouping. This talk will show how we can use active learning techniques to train learnable similarity functions that outperform standard similarity metrics (such as edit distance or cosine distance) for deduplicating data in a graph database. Active learning is a semi-supervised machine learning technique that incorporates user feedback at each training iteration, so that the most informative data point is labeled and used for training. Further, we show how these techniques can be enhanced by inspecting the structure of the graph to inform the linking and grouping processes. We will demonstrate how to use open source tools to perform entity resolution on a dataset of campaign finance contributions loaded into the Neo4j graph database, using Cypher (the query language for graphs) and Python data science tools.
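To make the active learning loop concrete, here is a minimal sketch in plain Python. It is not the implementation from the talk: the record fields, the sample records, the `entity` key, and every function name are assumptions made for illustration. The learnable similarity function is a small logistic regression over per-field string similarities (computed with the standard library's `difflib`), and the "user" is simulated by an oracle that reads a ground-truth ID. In practice the oracle would be a person answering "are these two records the same entity?", and the records would come from the graph database.

```python
import math
from difflib import SequenceMatcher
from itertools import combinations

FIELDS = ("name", "city")  # hypothetical record fields


def features(pair):
    """One string-similarity score in [0, 1] per field for a pair of records."""
    a, b = pair
    return [SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in FIELDS]


def predict(weights, x):
    """Learned similarity: probability that a pair is a duplicate."""
    z = weights[0] + sum(w * xi for w, xi in zip(weights[1:], x))
    return 1.0 / (1.0 + math.exp(-z))


def fit(examples, epochs=500, lr=0.5):
    """Train a logistic regression with plain stochastic gradient descent."""
    weights = [0.0] * (len(FIELDS) + 1)  # bias + one weight per field
    for _ in range(epochs):
        for x, y in examples:
            err = predict(weights, x) - y
            weights[0] -= lr * err
            for i, xi in enumerate(x):
                weights[i + 1] -= lr * err * xi
    return weights


def active_learn(pool, oracle, seed, rounds=3):
    """Uncertainty sampling: each round, ask for a label on the pair the
    current model is least sure about (score closest to 0.5), then retrain."""
    pool, labeled = list(pool), list(seed)
    weights = fit(labeled)
    for _ in range(rounds):
        if not pool:
            break
        idx = min(range(len(pool)),
                  key=lambda i: abs(predict(weights, features(pool[i])) - 0.5))
        pair = pool.pop(idx)
        labeled.append((features(pair), oracle(pair)))  # user feedback goes here
        weights = fit(labeled)
    return weights, labeled


# Hypothetical records; "entity" is ground truth used only to simulate the user.
records = [
    {"name": "John Smith", "city": "Seattle", "entity": 1},
    {"name": "Jon Smith", "city": "Seattle", "entity": 1},
    {"name": "Maria Garcia", "city": "Denver", "entity": 2},
    {"name": "Maria Garcia", "city": "Denver CO", "entity": 2},
    {"name": "Wei Chen", "city": "Boston", "entity": 3},
]


def oracle(pair):
    """Simulated user: 1 if both records refer to the same entity, else 0."""
    return int(pair[0]["entity"] == pair[1]["entity"])


pairs = list(combinations(records, 2))
seed = [(features(p), oracle(p)) for p in pairs[:2]]  # one duplicate, one non-duplicate
weights, labeled = active_learn(pairs[2:], oracle, seed, rounds=3)

for a, b in pairs:
    print(f"{a['name']!r} / {b['name']!r}: {predict(weights, features((a, b))):.2f}")
```

The point of the design is that the model, not the user, chooses which pair to label next, so a small number of answers is spent on the ambiguous cases rather than on pairs that are obviously the same or obviously different.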

  • Slides