Xiao Li is an engineering manager, Apache Spark Committer, and PMC member at Databricks. His main interests are in Spark SQL, data replication, and data integration. Previously, he was an IBM master inventor and an expert on asynchronous database replication and consistency verification. He received his Ph.D. from the University of Florida in 2011.
In this talk, we present Koalas, a new open source project that was announced at the Spark + AI Summit in April. Koalas is a Python package that implements the pandas API on top of Apache Spark, to make the pandas API scalable to big data. Using Koalas, data scientists can make the transition from a single machine to a distributed environment without needing to learn a new framework.
Pandas is the standard tool for data science in Python, and it is typically the first tool data scientists reach for to explore and manipulate a data set. The problem is that pandas does not scale well to big data: it was designed for small data sets that a single machine can handle. Today, when data scientists work with very large data sets, they either have to migrate to PySpark to leverage Spark or downsample their data so that they can use pandas.
This presentation will give a deep dive into the conversion between Spark and pandas dataframes. Through live demonstrations and code samples, you will understand:
- how to effectively leverage both pandas and Spark inside the same code base
- how to leverage powerful pandas concepts such as lightweight indexing with Spark
- technical considerations for unifying the different behaviors of Spark and pandas