Daniel Chen

Graduate Student at Virginia Tech (VT) in Genetics, Bioinformatics, and Computational Biology (GBCB).

Volunteer instructor for Software Carpentry and Data Carpentry. Author of "Pandas for Everyone".

Pandas for Everyone

8/15/2019 | 1:45 PM-5:15 PM | Twitter


Data science and machine learning have become synonymous with languages like Python. Libraries like NumPy and Pandas have become the de facto standard for working with data. The DataFrame object provided by Pandas gives us the ability to work with the heterogeneous, messy data commonly found in "real world" datasets.

New learners are often drawn to Python and Pandas by all the different and exciting types of models and insights the language can provide, but can be overwhelmed when faced with the initial learning curve.

This is a tutorial for beginners on using the Pandas library in Python for data manipulation. We will go from the basics of loading and looking at a dataset in Pandas for the first time, and begin the process of preparing data for analysis.


The topics covered are:

  • Load and look at slices and views of data
  • Groupby aggregates to summarize data
  • Tidy and reshape data
  • Write functions and apply them to data
  • Encode dummy variables to prepare for analysis and model fit

By the end of this tutorial you should have a solid foundation for working with datasets in Python. The last topic, encoding dummy variables, segues into using other libraries, such as scikit-learn and statsmodels, to fit models on your data.
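
The dummy-variable step can be sketched with `pd.get_dummies`; this uses a small made-up DataFrame, not the tutorial's dataset:

```python
import pandas as pd

# small illustrative DataFrame (a stand-in, not the tutorial's dataset)
df = pd.DataFrame({
    "sex": ["Female", "Male", "Female"],
    "total_bill": [16.99, 10.34, 21.01],
})

# one-hot encode the categorical column; drop_first avoids collinearity
# when the result is handed to statsmodels or scikit-learn
dummies = pd.get_dummies(df, columns=["sex"], drop_first=True)
print(dummies)
```

The resulting numeric columns can be passed directly to a model-fitting function.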

:00 - :45 Pandas DataFrame Basics + break/exercise = Hour 1

Before we start cleaning data, let's begin by covering the basics of the Pandas library. We'll cover importing libraries in Python, and how to load your own datasets into Pandas. From there, you'll typically want to look around your data, so we'll cover various ways we can filter and look at our data, calculate simple aggregate statistics and visualize them. This section will end with how to save our data into files we can share with others.

  • Loading your first dataset
  • Looking at columns, rows, and cells
  • Subsetting columns
  • Subsetting rows
  • Subsetting both columns and rows
  • Boolean subsetting
  • Grouped and aggregated calculations
  • Export/save data

exercise: load the tips dataset and filter rows by gender and total bill amount.
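
The workflow above can be sketched as follows; a hand-built DataFrame stands in for the tips data, which the tutorial would load from a file with `pd.read_csv`:

```python
import pandas as pd

# stand-in for the tips data (real exercise loads it from a CSV)
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
})

# subset columns by name
bills = tips[["total_bill", "tip"]]

# boolean subsetting: rows matching both conditions
big_female = tips[(tips["sex"] == "Female") & (tips["total_bill"] > 20)]

# grouped and aggregated calculation
avg_tip = tips.groupby("sex")["tip"].mean()

# export/save data to share with others
# tips.to_csv("tips_subset.csv", index=False)
print(big_female)
print(avg_tip)
```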

1:00 - 1:45 Tidy data + break/exercise = Hour 2

Knowing what a "clean" and "tidy" dataset looks like will help you spot common data problems and give you an idea of what your final dataset should look like. Once your data is tidy, it can easily be transformed into other shapes you need for analysis. Understanding what kinds of data manipulation steps are needed is language agnostic; it will help you with the "how" no matter which language you use.
This section works through Hadley Wickham's "Tidy Data" paper.

  • What is tidy data
  • Fixing common data problems
  • Columns containing values, not variables
  • Columns containing multiple variables
  • Variables in both rows and columns
  • Multiple observational units in a table (normalization)

exercise: tidy two small datasets used in the R for Data Science book tidy data example
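
The "columns containing values, not variables" case can be sketched with `melt`; the small table below is loosely modeled on the R for Data Science example, with made-up structure standing in for the exercise data:

```python
import pandas as pd

# wide format: the year column headers are values, not variables
wide = pd.DataFrame({
    "country": ["Afghanistan", "Brazil"],
    "1999": [745, 37737],
    "2000": [2666, 80488],
})

# melt into tidy form: one row per (country, year) observation
tidy = wide.melt(id_vars="country", var_name="year", value_name="cases")
print(tidy)
```

Each variable is now its own column and each observation its own row, which is the shape most plotting and modeling functions expect.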

2:00 - 2:45 Applying Functions + dummy variables + break/exercise = Hour 3

Sometimes we need a more complex method to tidy our data. Other times, we need to perform more complex tasks on our data. Here we'll cover how to write functions in Python and how to apply them to our data. This way, if a method does not exist to perform the task we want, or if we want to combine multiple tasks together, we can write our own custom functions to process our data.

  • Writing a Python function
  • Applying functions
  • Vectorized functions

exercise: use the ebola dataset from the tidy section, and instead of using the .str. accessor, write a function to parse out the string.
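
A sketch of the exercise, assuming column names of the form "Status_Country" (e.g., "Cases_Guinea") as in the ebola data; the DataFrame here is a hand-built stand-in:

```python
import pandas as pd

# stand-in for the ebola data's variable column
ebola = pd.DataFrame({"cd_country": ["Cases_Guinea", "Deaths_Liberia"]})

def parse_variable(value):
    """Split a 'Status_Country' string without using the .str accessor."""
    status, country = value.split("_")
    return pd.Series({"status": status, "country": country})

# apply the function to every element of the column;
# returning a Series from the function expands the result into columns
parsed = ebola["cd_country"].apply(parse_variable)
print(parsed)
```

For large datasets, the same logic could be vectorized (e.g., with `numpy.vectorize` or pandas string methods) instead of an element-wise `apply`.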