Adam Breindel consults and teaches on Apache Spark, data engineering, and machine learning. He supports instructional initiatives and teaches at Databricks, has taught classes on Apache Spark and on deep learning for O'Reilly, and runs a business helping large firms and startups implement data and ML architectures.
Adam's first full-time job in tech was on neural-net-based fraud detection deployed at North America's largest banks back in 1998. Since then, he's worked with startups where he's enjoyed getting to build the future (e.g., mobile check-in for 2 of America's 5 biggest airlines, 3 years prior to the iPhone). He has also worked in entertainment, insurance, and retail banking, on web, embedded, and server apps, as well as on clustering architectures, APIs, and streaming analytics.
This talk is less about technical successes and failures; it instead focuses on where the tech meets squishy humans: API design, communication, empowering (or not) end users, sometimes even protecting open source from ourselves.
Specifically, we'll look at some well-oiled parts and some rusty frictiony bits that come between PySpark (as a computing ecosystem) and its main user community.
Ever wonder how Apache Spark has proven to be one of the most popular (if not the most popular) large-scale data processing systems of all time, while also managing to frustrate so many users? Ever wonder how many partitions and tasks you should have -- and hey, why can't Spark just figure it out? Ever wonder why Spark has DataFrames and Pandas has DataFrames and it's like they're from two different galaxies? Ever wonder why some bits just don't make sense, and why after 10 years (!) neither the docs nor the vendors just explain how it all works?
There will be no villains in this talk -- the world has enough of those -- instead we're all going to try and be the heroes by discussing some ideas about making things even better for our end users throughout our many roles: coding, documenting, sharing, teaching, as well as managing OSS projects and related businesses.
About me - I'm Adam Breindel, one of the top Spark consultants and instructors worldwide. Having spent over 5 years deeply immersed in the Spark world, I'm the guy who explains Spark to Databricks' and Cloudera's own instructors.
The Bold Adventure of PySpark and Big Data -- Making "Real" Python Work in a JVM World
* Lesson: Paying Attention to User Needs and Desires
Technical Success: Making Scale-Out Python ML Modeling Real
* Responding to Perf Challenges: Making Python a First-Class Citizen with DataFrames
* Lesson: Expanding the Community By Going To Where You Are Needed
Wait, These Are Not the DataFrames You Are Looking For
* How Does this Spark Data-Parallel Thing Work Exactly? If you're not sure, you're not alone
* Lesson: Be Thoughtful with Naming and Metaphors. Explain the Hard Stuff Clearly (Don't Hide It).
SparkML is Inspired By Scikit-Learn! Surely You Know Scikit-Learn...
* "Inspired By" Means Different Things to Different People
* Lesson: It's OK to Make Tradeoffs and Compromises, But Please Communicate Them
Audience Participation: What Do You Think is the Number One Question/Problem Users Have with Spark?
* Figuring Out Partitioning (and Tasks): The Tragic Tale of SPARK-9850 or Will History Repeat with SPARK-23128?
* Can We Fix the Biggest Problem First? Ever? Why Not?
* Lesson: Apache Independent Governance Rules Are Good But Maybe We Need Something Stronger
Does Spark Need Protecting? Or Just Spark Businesses? Flag-Planting, FUD, and Private Expeditionary [Code] Forces
* Lesson: Sometimes We Could Use a "Nudge" to Help Us Protect Our Altruism From Our Own Inexperience Or Selfish Instincts
Finishing the Tech Story on a Positive Note: Arrow Integration, Pandas, GOAI
* Summary: Recap of Lessons