Data analytics in Python benefits from the beautiful API offered by the pandas library. With it, manipulating and analysing data is fast and seamless. In this workshop, we'll take a hands-on approach to performing an exploratory analysis in pandas. We'll begin by importing some real data. Then, we'll clean it, transform it, and analyse it, finishing with some visualisations.
In this hands-on workshop, we'll walk through the exploratory analysis of real-world data. Datasets are often messy, full of holes and inconsistencies, and a data scientist or analyst may spend a large fraction of their time cleaning and preparing data.
Fortunately, pandas makes much of this straightforward. It can import data from many different sources, such as CSV files, Excel spreadsheets, and SQL databases, into a DataFrame object that you can then filter, reshape, and summarise. Analytics with pandas is human-friendly.
Pulling in the data
Starting with some data in CSV form, we'll look at the general properties of our dataset. What columns do we have, and what kind of values do they contain? We'll identify problematic fields, and join two datasets to make one complete DataFrame.
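As a rough sketch of what this first step might look like, the example below reads two small CSV sources, inspects them, and joins them. The station and temperature data, the column names, and the variable names are all invented for illustration; the workshop's actual dataset may look quite different.

```python
import io

import pandas as pd

# Hypothetical CSV contents, inlined so the sketch runs without any files.
readings_csv = io.StringIO(
    "station_id,date,temperature\n"
    "A,2016-01-01,4.5\n"
    "A,2016-01-02,\n"
    "B,2016-01-01,7.1\n"
    "C,2016-01-01,unknown\n"
)
stations_csv = io.StringIO(
    "station_id,city\n"
    "A,Amsterdam\n"
    "B,Berlin\n"
)

readings = pd.read_csv(readings_csv)
stations = pd.read_csv(stations_csv)

# General properties: size, column types, and missing values per column.
shape = readings.shape
column_types = readings.dtypes
missing_per_column = readings.isna().sum()
print(shape)
print(column_types)
print(missing_per_column)

# A left join keeps every reading, even when no matching station exists.
combined = readings.merge(stations, on="station_id", how="left")
print(combined)
```

Here the temperature column is not read as numeric because of the stray "unknown" entry, and station C has no matching city after the join. These are the kinds of problematic fields we'd want to flag at this stage.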
Cleaning the data
We've identified problems with our data, and now it's time to correct them. We'll fill in missing values, drop irrelevant rows, and fix incorrect datatypes.
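A minimal sketch of these three corrections follows, using a small made-up table. The column names, the choice to drop rows without a station ID, and the choice to fill gaps with the median are assumptions for illustration, not necessarily what the workshop will do.

```python
import pandas as pd

# Hypothetical messy data: text where numbers belong, and missing entries.
raw = pd.DataFrame({
    "station_id": ["A", "A", "B", "C", None],
    "date": ["2016-01-01", "2016-01-02", "2016-01-01", "2016-01-01", "2016-01-03"],
    "temperature": ["4.5", None, "7.1", "unknown", "3.0"],
})

cleaned = raw.copy()

# Fix incorrect datatypes; values that cannot be parsed become NaN.
cleaned["temperature"] = pd.to_numeric(cleaned["temperature"], errors="coerce")
cleaned["date"] = pd.to_datetime(cleaned["date"])

# Drop rows we cannot use: here, readings with no station ID.
cleaned = cleaned.dropna(subset=["station_id"])

# Fill the remaining gaps with the median temperature.
median_temperature = cleaned["temperature"].median()
cleaned["temperature"] = cleaned["temperature"].fillna(median_temperature)

print(cleaned.dtypes)
print(cleaned)
```

Whether to fill, drop, or leave a missing value depends on the analysis, so treat the median fill above as one option among several.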
Transforming the data
Next, we'll standardise some numerical fields where we're looking for deviations rather than absolute values, and derive some new columns based on the data we have.
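One common way to standardise a numerical field is the z-score, which expresses each value as a number of standard deviations from the mean. The sketch below assumes that approach and uses invented data and column names; the workshop may use a different scaling.

```python
import pandas as pd

# Hypothetical cleaned data.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2016-01-01", "2016-01-02", "2016-01-03", "2016-01-04"]
    ),
    "temperature": [2.0, 4.0, 6.0, 8.0],
})

# Standardise: subtract the mean, divide by the sample standard deviation.
mean = df["temperature"].mean()
std = df["temperature"].std()
df["temperature_z"] = (df["temperature"] - mean) / std

# Derive new columns from the ones we already have.
df["weekday"] = df["date"].dt.day_name()
df["above_average"] = df["temperature"] > mean

print(df)
```

After this transformation the standardised column has a mean of zero and a standard deviation of one, so deviations are comparable across fields measured in different units.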
Throughout, we'll be generating visualisations, to guide us in where to go next.
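As an illustration of the kind of quick plot that can guide the next step, the sketch below draws a line chart and a histogram straight from a pandas Series. The data and titles are invented, and the non-interactive "Agg" backend is chosen only so that the example runs without a display; in a notebook you would normally leave that line out.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: no display available; omit in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical daily temperatures.
df = pd.DataFrame(
    {"temperature": [2.0, 4.0, 6.0, 8.0, 5.0, 3.0]},
    index=pd.date_range("2016-01-01", periods=6, freq="D"),
)

fig, (ax_line, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# pandas plotting methods draw onto the matplotlib axes we pass in.
df["temperature"].plot(ax=ax_line, title="Temperature over time")
df["temperature"].plot.hist(ax=ax_hist, bins=4, title="Distribution of temperature")

fig.tight_layout()
```

In an interactive session, plt.show() displays the figure, and fig.savefig("some_name.png") writes it to a file.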
You'll need to be fairly comfortable working with Python. We won't be doing anything overly complicated, but having a grasp of Python syntax is expected.
If you want to follow along, please have a working Python setup with pandas and matplotlib installed. Aim for a recent version of pandas. If you're unsure what to install, I recommend getting Python 3 through Anaconda (https://www.continuum.io/downloads). This distribution comes with everything you need and is very friendly.
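A quick way to check your setup before the workshop is to import both libraries and print their versions. This is only a sanity check; if either import fails, the library is not installed in the Python environment you are running.

```python
import matplotlib
import pandas as pd

# Both imports succeeding means the libraries are installed.
print("pandas", pd.__version__)
print("matplotlib", matplotlib.__version__)
```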
PyData is an educational program of NumFOCUS, a 501(c)(3) non-profit organization in the United States. PyData provides a forum for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.
PyData conferences aim to be accessible and community-driven, with novice to advanced level presentations. PyData tutorials and talks bring attendees the latest project features along with cutting-edge use cases.