Graduate Student at Virginia Tech (VT) in Genetics, Bioinformatics, and Computational Biology (GBCB).
Volunteer instructor for Software Carpentry and Data Carpentry. Author of "Pandas for Everyone".
Data science and machine learning have become closely associated with languages like Python.
Libraries like NumPy and Pandas have become the de facto standard when working with data.
The DataFrame object provided by Pandas lets us work with heterogeneous tabular data, the kind commonly found in "real world" datasets.
New learners are often drawn to Python and Pandas by the many exciting models
and insights they make possible, but can feel overwhelmed when faced with the initial learning curve.
This is a tutorial for beginners on using the Pandas library in Python
for data manipulation. We will start with the basics of loading and looking at a dataset in Pandas for the first time, and begin the process of preparing data for analysis.
The topics covered are:
By the end of this tutorial you should have a solid foundation on
working with datasets in Python.
The last topic, encoding dummy variables, segues into using other
libraries such as scikit-learn and statsmodels to fit models on your data.
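As a rough illustration of what encoding dummy variables can look like, here is a minimal sketch using `pd.get_dummies`. The small table and its column names (`total_bill`, `sex`) are made up for this example and are not the tutorial's data.

```python
import pandas as pd

# Hypothetical data: one numeric column and one categorical column.
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "sex": ["Female", "Male", "Male"],
})

# Replace the categorical column with indicator (dummy) columns.
# drop_first=True drops the first category ("Female") so the remaining
# column is not redundant with an intercept term in a model.
encoded = pd.get_dummies(df, columns=["sex"], drop_first=True)

print(encoded)
```

The resulting all-numeric table is the kind of input that modeling libraries generally expect.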
Before we start cleaning data, let's begin by covering the basics of the
Pandas library. We'll cover importing libraries in Python, and how to
load your own datasets into Pandas. From there, you'll typically want to
look around your data, so we'll cover various ways we can filter and
look at our data, calculate simple aggregate statistics and visualize
them. This section will end with how to save our data into files we can
share with others.
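The steps described above can be sketched roughly as follows. This is only an illustration: the data, the column names (`name`, `group`, `score`), and the use of in-memory text buffers in place of real files are all assumptions made so the example runs on its own.

```python
import io

import pandas as pd

# Hypothetical stand-in for a file on disk. In practice you would pass a
# path such as "my_data.csv" to pd.read_csv instead of a text buffer.
raw_csv = io.StringIO(
    "name,group,score\n"
    "ana,a,10.0\n"
    "ben,b,7.5\n"
    "cy,a,9.0\n"
    "dee,b,6.5\n"
)

# Load the data into a DataFrame.
df = pd.read_csv(raw_csv)

# Look around: first rows, (rows, columns), and the type of each column.
print(df.head())
print(df.shape)
print(df.dtypes)

# Filter rows with a boolean condition (both parts must be true).
high_a = df.loc[(df["group"] == "a") & (df["score"] > 9.5)]

# Calculate a simple aggregate statistic: mean score per group.
mean_by_group = df.groupby("group")["score"].mean()

# Save the result as CSV. A buffer is used here; a file path works the same way.
out = io.StringIO()
mean_by_group.to_csv(out)
print(out.getvalue())
```

If matplotlib is installed, `mean_by_group.plot(kind="bar")` would draw a quick bar chart of the aggregated values.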
exercise: load the tips dataset and filter rows by gender and total bill
Knowing what a "clean" and "tidy" dataset looks like will help you spot
common data problems and give you an idea of what your final dataset should
look like. Once your data is tidy, it can be easily transformed into the other
shapes you need for analysis. Understanding which data manipulation
steps are needed comes before the "how" of doing them. That
understanding is language agnostic, so it applies no matter what language you use.
This section goes through Hadley Wickham's "Tidy Data" paper.
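One common problem discussed in that paper is column headers that are values rather than variable names. A minimal sketch of fixing it with `melt` is shown below. The table is invented for illustration (`site`, `visits`, and the year columns are hypothetical names) and is not one of the exercise datasets.

```python
import pandas as pd

# Hypothetical "wide" table: the year columns are values, not variables.
wide = pd.DataFrame({
    "site": ["north", "south"],
    "2019": [12, 30],
    "2020": [15, 28],
})

# Gather the year columns into two variables, so that each row holds
# one observation: a site, a year, and a number of visits.
tidy = wide.melt(id_vars="site", var_name="year", value_name="visits")

# The former column headers are strings; convert them to integers.
tidy["year"] = tidy["year"].astype(int)

print(tidy)
```

Once the data is in this long form, `tidy.pivot(index="site", columns="year", values="visits")` is one way to reshape it back to a wide layout if an analysis needs it.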
exercise: tidy two small datasets used in the R for Data Science book
tidy data example
Sometimes we need a more complex method to tidy our data. Other times,
we need to perform more complex tasks on our data. Here we'll cover how
to write functions in Python and how to apply them to our data. This
way, if a method does not exist to perform the task we want, or if we
want to combine multiple tasks together, we can write our own custom
functions to process our data.
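As a small sketch of this idea, the example below writes two custom functions and applies them to a column and to whole rows. The data and the function names (`label_score`, `describe`) are hypothetical and chosen only for illustration.

```python
import pandas as pd


def label_score(score, cutoff=8.0):
    """Return "high" if score is at least cutoff, otherwise "low"."""
    return "high" if score >= cutoff else "low"


def describe(row):
    """Combine two values from one row into a short text summary."""
    return f"{row['name']}: {row['level']}"


df = pd.DataFrame({
    "name": ["ana", "ben", "cy"],
    "score": [10.0, 7.5, 9.0],
})

# Apply a function to every value in a single column (a Series).
df["level"] = df["score"].apply(label_score)

# Apply a function to every row of the DataFrame with axis=1.
df["summary"] = df.apply(describe, axis=1)

print(df)
```

Extra keyword arguments are passed through to the function, so `df["score"].apply(label_score, cutoff=9.5)` would use a stricter cutoff.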
exercise: use the ebola dataset from the tidy section, and instead of
using the .str. accessor, write a function to parse out the string.