Mike Lee Williams does applied research into computer science, statistics and machine learning at Cloudera Fast Forward Labs. While getting his PhD in astrophysics he spent 2% of his time observing the heavens in beautiful far west Texas, and the other 98% trying to figure out how to fit straight lines to data. He once did a postdoc at the Max Planck Institute for Extraterrestrial Physics, which, amazingly, is a real place.
Working in the cloud means you don’t have to deal with hardware. The goal of "serverless" is to also avoid dealing with operating systems. A serverless platform offers instances that run only for the duration of a single function call. These instances have limitations, but a lot of what data scientists do is a perfect fit for this new world! That’s what this talk is about.
In this talk we'll first see the basic idea behind serverless and learn how to deploy a very simple web application to AWS Lambda using Zappa. We'll then look in detail at the "embarrassingly parallel" problems where serverless really shines for data scientists. In particular we'll take a look at PyWren, an ultra-lightweight alternative to heavy big data distributed systems such as Spark. We'll learn how PyWren uses AWS Lambda as its computational backend to churn through huge analytics tasks. PyWren opens up big data to mere mortal data scientists who don't have the budget or engineering support for a long-lived cluster.
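To make "embarrassingly parallel" concrete, here is a minimal sketch of the map pattern the talk describes: the same function applied independently to many inputs, with no communication between calls. It runs locally and uses Python's standard-library thread pool as a stand-in for the serverless backend. The names `estimate_pi` and `parallel_map` are hypothetical and not part of PyWren or AWS Lambda.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def estimate_pi(seed, samples=10_000):
    """Monte Carlo estimate of pi from one independent batch of samples."""
    rng = random.Random(seed)  # a per-task generator keeps each task independent
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples


def parallel_map(func, inputs, max_workers=8):
    """Apply func to every input independently and return results in input order.

    Local stand-in (an assumption for illustration): on a serverless backend,
    each call would run in its own short-lived function instance instead of a thread.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, inputs))


if __name__ == "__main__":
    estimates = parallel_map(estimate_pi, range(16))
    print(sum(estimates) / len(estimates))
```

Because no task depends on another, the work can be split across as many workers as are available. That is the property that lets a tool like PyWren hand each call to a separate AWS Lambda invocation rather than to a long-lived cluster.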