I'm a software engineer from Brazil and partner at Vinta Software (www.vinta.com.br). At Vinta, we love beautiful high-quality products, from UX to code, and we will defend them against unreasonable deadlines and features. We use mostly React and Django.
What to do when you must find duplicate or related records in one or more datasets and you don't have unique identifiers, like the SSN for US citizens? The answer is to use Data Deduplication techniques and look for matches by cleaning and comparing attributes in a fuzzy way. In this talk, you'll learn with Python examples how to do this without needing any expert Data Science knowledge.
Record Deduplication, or more generally, Record Linkage is the task of finding which records refer to the same entity, like a person or a company. It's used mainly when there isn't a unique identifier in records like Social Security Number for US citizens. This means one can't trivially find duplicate records in a single dataset, neither easily link records from different datasets. Without an identifier, record linkage looks for matches by cleaning and comparing record attributes in a fuzzy way. Imagine you have two datasets with information about people, but without any unique identifier in the records. You have to compare attributes like name, date of birth, and address in a smart way to find which records from the two datasets refer to the same person. A similar approach must be used to dedupe records in a single dataset, so Record Deduplication is a kind of Record Linkage.
There are a number of important applications of data deduplication in government and business. For example, by deduping records from Census data, the Australian government was able to find there were 250,000 fewer people in the country than they previously thought. This reduction impacted the estimations of government agencies and even caused the revision economical projections. Similarly, businesses can use record linkage techniques to enrich their customers' data with publicly available datasets.
In this talk, you'll learn with Python examples the main concepts of Record Deduplication, what kinds of problems can be solved, what's the most common workflow for the process, what algorithms are involved, and which tools and libraries you can use. Although some of the discussed concepts are related to data mining, any intermediate-level Python developer will be able to learn the basics of how to dedupe data using Python.