Note: this cell should run without raising a traceback. Assuming it runs, you can also try printing the value of pd.__version__ to see which version of pandas you have installed.
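As a minimal sketch, the sanity check might look like this (any reasonably recent pandas version is fine):

```python
# Import pandas under its conventional alias and print the installed version.
# If this raises an ImportError, pandas isn't installed in this environment.
import pandas as pd

print(pd.__version__)
```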
We'll be analysing a real-world dataset together. It's about my favourite thing in the world: coffee. This dataset was collected at the Mathematics Institute at the University of Warwick. It's a time-series dataset, describing the total number of coffees made by our espresso machine by a certain date.
We'll be running through an analysis of this dataset as a way to expose you to the pandas API. The aim is to develop a little familiarity with how to work with pandas.
Slides are available at https://github.com/QCaudron/pydata_pandas. One notebook contains solutions; beware of spoilers.
The notebooks contain notes about what we're doing that I'll skip during this workshop, but will try to explain along the way.
The dataset: the total number of coffees made vs. time.
Note: the second line here tells matplotlib to plot directly under the cell where any plotting code is called. We need matplotlib to generate graphs, and without this line, the graphs would appear outside the Jupyter notebook when you called plt.show() - but we just want them to appear without having to do this.
Let's import the coffee data from CSV.
pandas can read from many data formats: CSV, JSON, Excel, HDF5, SQL, and more.
We have an index, and three columns:
Uh-oh. Why is there a string of text, testing, in our coffee numbers? What's going on in the coffees column in the row after that?
df.head(n=10) would show the first ten rows. The default is n=5.
Definitely a string. We'll note this as something to fix after we finish looking around.
.loc uses label-based lookup, which means that the value you pass into the square brackets must be in the index. Another method, .iloc, is integer-location-based, so .iloc[2] would return the third row. In this case, they're the same, but had we changed our index, as we'll see later, things would work differently.
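A toy illustration of the difference, using a deliberately non-default integer index (the frame and its labels are invented for this example):

```python
import pandas as pd

# A frame whose index labels (10, 20, 30) differ from the row positions.
df = pd.DataFrame({"coffees": [1, 4, 9]}, index=[10, 20, 30])

print(df.loc[20])   # label-based: the row whose index label is 20
print(df.iloc[1])   # position-based: the second row (the same row here)
```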
Indexing a dataframe with [] directly returns a pd.Series or pd.DataFrame by searching over columns, not rows. Indexing a pd.Series with [] is like indexing a dataframe with .iloc.
Note: here, we're indexing a series (a pd.Series object). From a pd.DataFrame (here, data), when you access a single column (data["coffees"]), the object returned is a pd.Series. From that, indexing directly with [] works in an integer-location-based manner, and like with numpy arrays, you can take slices.
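A short sketch of the column-then-entry pattern, on a toy frame invented for this example:

```python
import pandas as pd

df = pd.DataFrame({"coffees": [1, 4, 9, 16]})

s = df["coffees"]          # accessing one column gives back a pd.Series
print(type(s).__name__)    # Series
print(s[0])                # first entry
print(s[1:3].tolist())     # slices work much like on numpy arrays
```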
Looks like we also have some missing data - we have 671 rows, but the
coffees column only has 658 entries.
.describe() returns different things based on what's in the dataframe, as we'll see later. For numerical columns, it will return things like the mean, standard deviation, and percentiles. For object columns (strings or datetimes), it will return the most frequent entry and the first and last items. For all columns, .describe() will return the count of objects in that column (not counting NaNs) and the number of unique entries. You can determine what's returned using .describe()'s keyword arguments.
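A quick sketch of how the output changes with the include keyword argument, on invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "coffees": [1.0, 4.0, None],          # numeric, with one NaN
    "contributor": ["A", "B", "A"],       # object (strings)
})

print(df.describe())                      # numeric columns only, by default
print(df.describe(include="object"))      # only the object columns
```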
.isnull() returns a boolean array (an array of Trues and Falses) that you can then use to index the dataframe directly. Here, our boolean array tells us which entries in the coffees column are null, and we use that to index against the full dataframe - so we get back every column in the dataframe, but only those rows where coffees is null.
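The pattern looks like this on a small invented frame:

```python
import pandas as pd

df = pd.DataFrame({
    "coffees": [1.0, None, 9.0],
    "contributor": ["A", "B", "C"],
})

mask = df["coffees"].isnull()   # boolean Series: True where coffees is NaN
print(df[mask])                 # every column, but only the null-coffee rows
```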
The contributor column makes sense as object, because we expect strings there; but surely the timestamp should be a timestamp type, and coffees should be numerical?
It looks like the
timestamp field was read from CSV as a string. That makes sense - CSV files are very basic. We'll have
pandas interpret these strings as datetimes for us automatically.
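One way to do this conversion is pd.to_datetime, sketched here on made-up strings:

```python
import pandas as pd

df = pd.DataFrame({"timestamp": ["2013-01-02 08:00", "2013-01-03 09:30"]})

# Parse the strings into proper datetime64 values.
df["timestamp"] = pd.to_datetime(df["timestamp"])
print(df["timestamp"].dtype)   # datetime64[ns]
```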
Note: here's an example of using direct [] indexing on a pd.Series. We're accessing the first entry, just to see what type of object we have there.
The timestamp column contains strings; these need to be datetimes.
The coffees column contains some null values and at least one string.
.describe(include="all") is describing all attributes of all columns, but some don't make sense based on the column's dtype. For example, the contributor column has no first or last attributes, because those describe the first and last entries in an ordered series. That makes sense for the timestamp - those have an intuitive definition of sorting - but not so much for strings (alphabetical order doesn't really matter when they're arbitrary strings). Similarly, the timestamp column has no mean or other numerical traits. What does it mean to calculate the mean timestamp?
pandas is plotting the coffees against the index, which is just a series of integers.
Calling .plot() on a pd.Series will plot the data against the index. On a pd.DataFrame, the .plot() method allows plotting of one column against another. By default, .plot() renders a line graph, but you can specify which type of plot you'd like - bar, line, histogram, area, scatter, etc.
We have some very uneven spacing in places. We might start by cutting off the last few points of this time-series, which is missing a lot of data.
After mid-March, things start getting spaced rather erratically.
Note: this is another example of boolean indexing. data.timestamp < "2013-03-01" is a boolean array, and can be passed into the dataframe immediately in [], much like with a numpy array.
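A sketch of date-based boolean filtering, with invented rows standing in for the coffee data:

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2013-02-27", "2013-02-28", "2013-03-05"]),
    "coffees": [100, 101, 110],
})

# The comparison yields a boolean Series; passing it into [] keeps
# only the rows where it is True.
early = df[df.timestamp < "2013-03-01"]
print(early)
```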
.value_counts() counts the unique values in a series. It's similar to doing a
.groupby() followed by a
.count(), as we'll see soon.
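The equivalence can be sketched on a toy series of weekday names:

```python
import pandas as pd

s = pd.Series(["Mon", "Tue", "Mon", "Wed", "Mon"])

print(s.value_counts())      # Mon: 3, Tue: 1, Wed: 1 (sorted by count)
print(s.groupby(s).count())  # same counts, grouped by value
```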
Note: this first line could be replaced by weekday_counts = data.weekdays.value_counts(), with the only difference being that that would return a series, whereas here we got back a dataframe.
pd.date_range creates a fixed-frequency DatetimeIndex.
normalize=True ensures these datetimes are at midnight, and not at whatever time the starting point is.
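A small sketch of the effect of normalize=True, with made-up start and end times:

```python
import pandas as pd

# Daily index between two timestamps; normalize=True snaps every
# entry to midnight rather than keeping the 14:30 time of day.
idx = pd.date_range("2013-01-01 14:30", "2013-01-04 14:30",
                    freq="D", normalize=True)
print(idx)
```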
Note : the union of these indices is just a new index where entries from both indices are present. It's sorted by time.
.reindex() keeps any values that conform to the new index, and inserts
NaNs where we have no values.
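The behaviour can be sketched on a tiny invented series with two observed dates:

```python
import pandas as pd

# Observations on two dates only.
s = pd.Series([1.0, 3.0], index=pd.to_datetime(["2013-01-01", "2013-01-03"]))

# A daily index covering the full span.
daily = pd.date_range("2013-01-01", "2013-01-04", freq="D")

# Values on matching dates are kept; dates with no value become NaN.
print(s.reindex(daily))
```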
Note : we use
.shift() here because if we look at the
.diff() between a Monday and a Tuesday, those coffees are attributed to the Tuesday. However, what we want to say is "this many coffees were made at some point on the Monday", so we shift the entire series up one.
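The idea can be sketched with invented cumulative counts for a Monday-to-Wednesday span; shift(-1) is used here as one way to move each difference up one row:

```python
import pandas as pd

# Cumulative coffee counts recorded each morning, Mon-Wed.
counts = pd.Series(
    [100, 103, 107],
    index=pd.to_datetime(["2013-01-07", "2013-01-08", "2013-01-09"]),
)

# diff() attributes Monday-to-Tuesday coffees to the Tuesday row;
# shift(-1) moves each difference up one row, onto the day they were made.
made_per_day = counts.diff().shift(-1)
print(made_per_day)
```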
Wednesday was seminar day...
Note: by default, inner joins are performed. That is, if a row from one of the datasets has an index that isn't in the other dataset, that row is dropped. You can specify whether you want outer, left, or right joins, as well as plenty of other useful options. The pandas API for joining or merging datasets is very well developed.
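A minimal sketch of the how keyword on DataFrame.join, with invented frames and index labels:

```python
import pandas as pd

left = pd.DataFrame({"coffees": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"status": ["ok", "broken"]}, index=["b", "c"])

# Inner: only index labels present in both frames survive.
print(left.join(right, how="inner"))

# Outer: all labels are kept, with NaNs filling the gaps.
print(left.join(right, how="outer"))
```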
Note : the
parse_dates keyword argument takes several values. By passing in a list of strings, we're telling
pandas to attempt to parse the dates in columns with those names.
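A sketch of parse_dates with a list of column names, using an invented CSV snippet in place of the real file:

```python
import io
import pandas as pd

csv_text = "timestamp,status\n2013-01-02,working\n2013-01-09,broken\n"

# parse_dates=["timestamp"] tells pandas to parse that column as datetimes
# while reading, instead of leaving it as strings.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])
print(df["timestamp"].dtype)   # datetime64[ns]
```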
Note: the first line here creates a boolean pd.Series, holding the value True where the machine was working, and False otherwise. Because it's a pd.Series, its index stays the same as that of machine_status, which was a DatetimeIndex. Then, we can plot the boolean series (True appearing as 1, and False appearing as 0), and just quickly scan to see that there are long areas where the coffee machine was operational, with short bouts (thankfully!) of the machine being broken.