Andrew Danks

University of Toronto graduate and Software Engineer on the Data Quality team @ Yelp for 5 years. Loves finding ways to leverage data to create better user experiences. Favorite chain is Krispy Krunchy Chicken!

Detecting business chains at scale with PySpark and machine learning

Python & Libraries, AI & Data, DevOps & Automation, Scale & Performance, Intermediate
8/18/2018 | 5:20 PM-6:05 PM | Robertson

Description

You’re driving to Tahoe for the slopes. What’s the one thing more important than the snow report? Of course, where the nearest In-N-Out is!

Yelp has a special chain search for all your guilty pleasures, powered by our Chain Detector system that we just rearchitected.

Learn how we leveraged Spark on AWS and Python 3, moving away from Redshift and Python 2.7, for better scalability and maintainability.

Abstract

Chain Detector is a system at Yelp that automatically detects all chain businesses in the world. It is an important data set that helps us deliver a special user experience in search (“Show me all the Philz Coffee locations in San Francisco”). It also powers an API for these chains to understand what users are saying about them across all of their locations.

The application needs to perform many complex transformations and aggregations on all of our business data and run the results through a machine learning classifier to produce the final data set. It is a hard problem because the data is often imperfect. For example, you may have many listings named “Super Duper” and another named “Super Duper Burgers”. Should they all be classified as the same chain?
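
To make the shape of the computation concrete, here is a minimal, hypothetical PySpark sketch of this kind of name normalization and per-group aggregation. The input path, the regex, and the column names (phone_prefix, category) are illustrative assumptions, not Yelp's production logic; the real features and classifier are far richer.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("chain-detector-sketch").getOrCreate()

    # Hypothetical input location and schema (name, phone_prefix, category, ...).
    businesses = spark.read.parquet("s3://example-bucket/businesses/")

    # Crude name normalization so that "Super Duper" and "Super Duper Burgers"
    # can fall into the same candidate group; real normalization is much richer.
    normalized = businesses.withColumn(
        "normalized_name",
        F.lower(F.regexp_replace("name", r"\s+(burgers|cafe|coffee)$", "")),
    )

    # Aggregate per-group features that a trained classifier could score to
    # decide whether the group really is a single chain.
    candidate_chains = normalized.groupBy("normalized_name").agg(
        F.count("*").alias("location_count"),
        F.countDistinct("phone_prefix").alias("distinct_phone_prefixes"),
        F.collect_set("category").alias("categories"),
    )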

The original system was built in Python 2.7 and relied on Redshift to execute massively parallelized queries. Because this was not fast enough on its own, it also used third-party Python 2.7 libraries to parallelize the Redshift queries even further.

This coupled the optimization logic tightly with the application logic, making it difficult to reason about the core algorithm. It also relied on many unit tests that mocked out Redshift calls, which made the suite prone to regressions and the mocks tedious to maintain. Integration testing with Redshift is not ideal either, because you cannot readily sandbox clusters and only one developer can run tests at a time.

We explored Spark, a fast-growing technology at Yelp, and after being impressed with prototypes, we rearchitected the rest of the system to use Spark instead of Redshift for all of the data manipulation and machine learning. We were able to do this while keeping the application logic intact with no regressions, and while upgrading to Python 3.6 in the process!

Spark allowed us to express the data manipulations and the core algorithm naturally. By setting up proper file system abstractions, it also enabled local, stateless end-to-end testing in pytest -- no more mocking in unit tests! The application is also 4x faster. However, we faced many challenges, because some Python code is not compatible with the Spark paradigm and it can be tricky to get your dependencies set up to run on AWS.
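
As an illustration of what local, stateless end-to-end testing can look like, here is a hedged pytest sketch: a local SparkSession stands in for the cluster and a temporary directory stands in for S3. The fixture shape and the simple read-back assertion are assumptions made to keep the example self-contained, not Yelp's actual test suite.

    import pytest
    from pyspark.sql import SparkSession


    @pytest.fixture(scope="session")
    def spark():
        # Runs Spark inside the test process -- no cluster or AWS access needed.
        session = (
            SparkSession.builder.master("local[2]")
            .appName("chain-detector-tests")
            .getOrCreate()
        )
        yield session
        session.stop()


    def test_end_to_end_style(spark, tmp_path):
        input_path = str(tmp_path / "businesses")

        spark.createDataFrame(
            [("Super Duper", "SF"), ("Super Duper Burgers", "Oakland")],
            ["name", "city"],
        ).write.parquet(input_path)

        # In the real suite this is where the application entry point would run,
        # pointed at the temporary paths; a simple read-back keeps this sketch
        # self-contained.
        result = spark.read.parquet(input_path)
        assert result.count() == 2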

Overall, we now have a faster, more modern, and more maintainable system that has solved our infrastructure issues. It has also allowed us to safely add new features, such as custom overrides submitted by an admin through a UI, while ensuring the final data set is still computed correctly.