Data Visualisation: An Intro to Seaborn

Matplotlib is an incredibly useful and popular visualization tool, but even long-time users are often frustrated by its shortcomings: among other things, its default parameters and its apparent disharmony with DataFrames (which comes as no surprise, seeing as it predates Pandas by several years).

Seaborn provides an API on top of Matplotlib that offers sensible choices for plot style and color defaults, defines simple high-level functions for common statistical plot types, and integrates with the functionality provided by Pandas DataFrames, allowing users to simply pass DataFrame column labels to its plotting functions.

In this Jupyter Notebook we'll explore two different datasets, Iris and Pokemon. One reason Seaborn works so well with DataFrames is that column labels are automatically propagated to plot axes and legends. Let's get on with it!

Exploratory Data Analysis

Pairplots are one of the best ways to visualise the multidimensional relationships within a dataset, and creating one is as easy as calling sns.pairplot. This creates a matrix of axes and shows the relationship for each pair of columns in a DataFrame. By default, it also draws the univariate distribution of each variable on the diagonal axes.

Below we colour the points by species of flower (the "hue" parameter), set the size of each subplot (now done using the "height" parameter, formerly "size") and choose the color palette. We also set the title.

Kernel Density Estimations

The pairplot() function is built on top of a PairGrid object, which can be used directly for more flexibility.

KDE plots are a useful tool for plotting the shape of a distribution. Like a histogram, a KDE plot encodes the density of observations on one axis with height along the other axis. Instead of binning the data, it gives a smooth estimate of the distribution, which Seaborn draws with sns.kdeplot.

Histograms and KDEs can be combined using distplot (deprecated since seaborn 0.11 in favour of histplot() and displot()):

Passing the full two-dimensional dataset to kdeplot, we get a two-dimensional visualization of the data.

Joint Distributions

Probably the simplest and most familiar way to visualize a bivariate distribution is a scatterplot, which is the default plot of the jointplot() function.

In Seaborn you can also pass plot styles as parameters to other plot functions. For example, the kernel density estimation procedure described above can be used as the style in jointplot() to visualize a bivariate distribution.

Fitting Parametric Distributions

Here we are using the distplot() function to plot the distribution of petal lengths (in cm) for the virginica flower only.

You can also use distplot() to fit a parametric distribution to a dataset and visually evaluate how closely it corresponds to the observed data.

Time to explore the Pokemon dataset (much more interesting)!

Categorical Analysis

Here we're just plotting the number of Pokemon in each "Type" category using catplot(). Note that catplot() is the updated name for this function (formerly factorplot()).
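A count plot via catplot() might be sketched as follows. The DataFrame is a made-up stand-in for the Pokemon data, and the column name "Type 1" is an assumption about how the type column is labelled.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(8)

# Made-up stand-in for the Pokemon dataset (NOT the real data); the
# "Type 1" column name is an assumption.
pokemon_like = pd.DataFrame({
    "Type 1": rng.choice(["Water", "Fire", "Grass", "Flying"], size=200),
})

# kind="count" draws one bar per category, with height equal to the
# number of rows in that category.
count_grid = sns.catplot(data=pokemon_like, x="Type 1", kind="count",
                         height=4, aspect=1.5)

# Rotate the category labels so long type names do not overlap.
count_grid.ax.tick_params(axis="x", rotation=45)
```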

Another Look at Joint Distributions

Let's look at the joint distribution of Attack and Defense capabilities for all Pokemon in the dataset, which highlights some pretty strong outliers in these two strength categories!

The joint plot can even do some automatic kernel density estimation and regression, this time on the special abilities.

A nice way to compare distributions between different variables is to use a violin plot (I've dropped a few Pokemon types to keep the plot readable).

Naturally, Flying Pokemon have more of their distribution at the higher end of the speed spectrum.
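A violin plot of speed by type could look roughly like this. The numbers are invented (with Flying deliberately given a higher mean so the shape matches the description), and the "Type 1" and "Speed" column names are assumptions.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(10)

# Invented mean speeds per type (NOT the real data); Flying is set higher
# on purpose to mimic the pattern described in the text.
speed_means = {"Water": 65, "Fire": 75, "Grass": 60, "Flying": 100}

pokemon_like = pd.DataFrame({
    "Type 1": np.repeat(list(speed_means), 40),
    "Speed": np.concatenate([rng.normal(mean, 15, 40)
                             for mean in speed_means.values()]),
})

# Each violin is a mirrored KDE of Speed for one type.
violin_ax = sns.violinplot(data=pokemon_like, x="Type 1", y="Speed")
```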

Subplotting (and Boxplots)

As mentioned already, Seaborn is built on top of Matplotlib, so we are able to display subplots as follows:
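One way to do this, sketched on a made-up stand-in for the Pokemon data (column names are assumptions), is to create the axes with Matplotlib and hand each one to a Seaborn function through its ax parameter:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(11)

# Made-up stand-in for the Pokemon dataset (NOT the real data).
pokemon_like = pd.DataFrame({
    "Type 1": rng.choice(["Water", "Fire", "Grass"], size=150),
    "Attack": rng.normal(75, 25, 150),
    "Defense": rng.normal(70, 25, 150),
})

# Matplotlib creates the figure and a 1x2 row of axes...
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# ...and each Seaborn call draws into the axes it is given.
sns.boxplot(data=pokemon_like, x="Type 1", y="Attack", ax=axes[0])
sns.boxplot(data=pokemon_like, x="Type 1", y="Defense", ax=axes[1])

axes[0].set_title("Attack by type")
axes[1].set_title("Defense by type")
fig.tight_layout()
```

This works for axes-level functions such as boxplot(); figure-level functions like catplot() and jointplot() create their own figure and do not accept an ax argument.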