Do you have a working Python installation, with the `pandas`

package ?

**Note :** This cell should run without raising a traceback. Assuming it runs, you can also try printing the value of `pd.__version__`

to see what version of `pandas`

you have installed.

- Lapsed computational physicist
- PhD computational neuroscience, postdoc statistical epidemiology
- Data Scientist at CBRE -
`www.cbredev.com`

- ATOM at Seattle

We'll be analysing a real-world dataset together. It's about my favourite thing in the world : **coffee**. This dataset was collected at the Mathematics Institute at the University of Warwick. It's a time-series dataset, describing the **total number of coffees made by our espresso machine** by a certain date.

We'll be running through an analysis of this dataset as a way to expose you to the `pandas`

API. The aim is to develop a little familiarity with how to work with `pandas`

.

Slides are available at https://github.com/QCaudron/pydata_pandas. One notebook contains solutions; **beware of spoilers**.

The notebooks contain **notes** about what we're doing that I'll skip during this workshop, but try to explain on the way.

The `pandas`

API is **enormous**. The documentation is excellent, don't hesitate to look things up.

The dataset : **total number of coffees made vs time**.

- Who are the main contributors to this dataset, and when are contributions generally made ?
- What are the department's weekday coffee habits ?
- How much coffee are people drinking ?

**Note :** The second line here tells `matplotlib`

to plot directly under the cell where any plotting code is called. `pandas`

uses `matplotlib`

to generate graphs, and without this, the graphs would appear outside the Jupyter notebook when you called `plt.show()`

- but we just want them to appear without having to do this.

http://ipython.readthedocs.io/en/stable/interactive/plotting.html#id1

Let's import the coffee data from CSV.

**Note :** `pandas`

can read from many data formats : CSV, JSON, Excel, HDF5, SQL, and more.

Loading output library...

Loading output library...

We have an index, and three columns : `timestamp`

, `coffees`

, and `contributor`

.

Uh-oh. Why is there a string of text, `testing`

, in our coffee numbers ? What's going on in the `coffees`

column in the row after that ?

**Note :** `df.head(n=10)`

would show the first ten rows. The default is `n=5`

.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html

Loading output library...

Definitely a string. We'll note this as something to fix after we finish looking around.

**Note :** `.loc`

uses a label-based lookup, which means that the value you pass into the square brackets must be in the index. Another method, `.iloc`

, is integer-location-based, so `.iloc[2]`

would return the third row. In this case, they're the same, but had we changed our index, as we'll see later, things would work differently.

Indexing a dataframe with `[]`

directly returns a `pd.Series`

or `pd.DataFrame`

by searching over *columns*, not rows. Indexing a `pd.Series`

with `[]`

is like indexing a dataframe with `.iloc`

.

`coffees`

.Loading output library...

**Note :** here, we're indexing a *series* ( a `pd.Series`

object ). From a `pd.DataFrame`

( here, `data`

), when you access a single column ( `data.coffees`

or `data["coffees"]`

), the object returned is a `pd.Series`

. From that, indexing directly with `[]`

works in an integer-location-based manner, and like with numpy arrays, you can take slices ( `[:5]`

).

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html

Loading output library...

Looks like we also have some missing data - we have 671 rows, but the `coffees`

column only has 658 entries.

**Note :** `.describe()`

returns different things based on what's in the dataframe, as we'll see later. For numerical columns, it will return things like the mean, standard deviation, and percentiles. For object columns ( strings or datetimes ), it will return the most frequent entry and the first and last items. For all columns, `.describe()`

will return the count of objects in that column ( not counting NaNs ) and the unique number of entries. You can determine what's returned using `.describe()`

's keyword arguments.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html

`coffees`

is null.Loading output library...

**Note :** `.isnull()`

returns a boolean array ( an array of `True`

s and `False`

s ), that you can then use to index the dataframe directly. Here, our boolean array tells us which entries in the `coffees`

column are null, and we use that to index against the full dataframe - so we get back every column in the dataframe, but only those rows where `coffees`

is null.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isnull.html

Loading output library...

The `contributor`

column makes sense as `object`

, because we expect strings there; but surely the `timestamp`

should be a timestamp-type, and `coffees`

should be numerical ?

`timestamp`

column.It looks like the `timestamp`

field was read from CSV as a string. That makes sense - CSV files are very basic. We'll have `pandas`

interpret these strings as datetimes for us automatically.

**Note :** here's an example of using direct `[]`

indexing on a `pd.Series`

. We're accessing the first entry, just to see what type of object we have there.

- The
`timestamp`

column contains strings; these need to be datetimes - The
`coffees`

column contains some null values and at least one string

Loading output library...

Loading output library...

Loading output library...

`timestamp`

strings to datetime objects.Loading output library...

Loading output library...

Loading output library...

**Note :** `.describe(include="all")`

is describing all attributes of all columns, but some don't make sense based on the column's `dtype`

. For example, the contributor column has no `first`

and `last`

attributes, because those describe the first and last entries in an ordered series. That makes sense for the timestamp - those have an intuitive definition of sorting - but not so much for strings ( alphabetical order doesn't really matter when they're arbitrary strings ). Similary, the timestamp column has no mean or other numerical traits. What does it mean to calculate the mean timestamp ?

Loading output library...

Loading output library...

`pandas`

is plotting the coffees against the index, which is just a series of integers.

**Note :** `.plot()`

on a `pd.Series`

will plot the data against the index. On a `pd.DataFrame`

, the `.plot()`

method allows plotting of one column against another.

By default, `.plot()`

renders a line graph, but you can specify which type of plot you'd like - bar, line, histogram, area, scatter, etc..

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.plot.html

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.plot.html

`plot()`

method rather than that of the series.Loading output library...

Loading output library...

We have some very uneven spacing in places. We might start by cutting off the last few points of this time-series, which is missing a lot of data.

Loading output library...

After mid-March, things start getting spaced rather erratically.

Loading output library...

**Note :** this is another example of boolean indexing. `data.timestamp < "2013-03-01"`

is a boolean array, and can be passed into the dataframe immediately in `[]`

, much like with a `np.ndarray`

.

Loading output library...

Loading output library...

Loading output library...

**Note :** `.value_counts()`

counts the unique values in a series. It's similar to doing a `.groupby()`

followed by a `.count()`

, as we'll see soon.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html

Loading output library...

Loading output library...

Loading output library...

Loading output library...

Loading output library...

**Note :** this first line could be replaced by `weekday_counts = data.weekdays.value_counts()`

, with the only difference being that that would return a series to us, and here, we got back a dataframe.

Loading output library...

Loading output library...

Loading output library...

Loading output library...

**Note :** `pd.date_range`

creates a fixed-frequency DatetimeIndex. `normalize=True`

ensures these datetimes are at midnight, and not at whatever time the starting point is.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.date_range.html

Loading output library...

**Note :** the union of these indices is just a new index where entries from both indices are present. It's sorted by time.

Loading output library...

**Note :** `.reindex()`

keeps any values that conform to the new index, and inserts `NaN`

s where we have no values.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html

`NaN`

s using interpolation.Loading output library...

Loading output library...

Loading output library...

Loading output library...

Loading output library...

**Note :** we use `.shift()`

here because if we look at the `.diff()`

between a Monday and a Tuesday, those coffees are attributed to the Tuesday. However, what we want to say is "this many coffees were made at some point on the Monday", so we shift the entire series up one.

Loading output library...

Loading output library...

Loading output library...

Wednesdays was seminar day...

Loading output library...

Loading output library...

**Note :** by default, inner joins are performed. That is, if a row from one of the datasets has an index that isn't in the other dataset, that row is dropped. You can specify whether you want outer, left, or right joins, as well plenty of other useful options. The `pandas`

API for joining or merging datasets is very developed.

Loading output library...

Loading output library...

Loading output library...

Loading output library...

**Note :** the `parse_dates`

keyword argument takes several values. By passing in a list of strings, we're telling `pandas`

to attempt to parse the dates in columns with those names.

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html

`status`

column ?Loading output library...

Loading output library...

Loading output library...

**Note :** the first line here creates a boolean `pd.Series`

, holding the value `True`

when `machine_status.status`

is `"OK"`

, and `False`

otherwise. Because it's a `pd.Series`

, its index stays the same as that of `machine_status`

, which was a `DatetimeIndex`

. Then, we can plot the boolean series ( `True`

appearing as `1`

, and `False`

appearing as `0`

), and just quickly scan to see that there are long areas where the coffee machine was operations, with short bouts ( thankfully ! ) of the machine being broken.

Loading output library...

Loading output library...

Loading output library...

Loading output library...

Loading output library...

Loading output library...

- Cast columns to the correct dtypes
- Dropped rows with no data
- Truncated the time-series when things got sparse

- Discovered who contributed to the dataset and how much
- Established how contributions varied by day of the week

- Resampled the uneven time-series to regular daily intervals
- Interpolated over missing data using our time index
- Discovered on what days of the week coffee was especially popular
- Joined the dataset with others to enrich the information available
- Derived new columns from existing ones
- Smoothed the time-series by weekly downsampling

- A small number of people contributed most of the data
- Contributions were lacking at the weekends, and Mondays had the most contributions
- Seminar Wednesdays is a strong driver of coffee consumption
- Periods of low coffee drinking correlated strongly with periods where the machine was broken
- A significant dip in consumption occurred in the summer months