Introduction: Analysis of Medium Stats

In this notebook, we will analyze my Medium article stats. The functions for scraping and formatting the data were developed in the Development notebook, and here we will focus on looking at the data quantitatively and visually.

Instructions

To apply this analysis to your own Medium data:

  • Go to the stats page https://medium.com/me/stats
  • Make sure to scroll all the way down to the bottom so all the articles are loaded
  • Right click, and hit 'save as'.
  • Save the file as stats.html in the data/ directory. You can also save the responses to do a similar analysis.

# You might need to run this on macOS for multiprocessing to work properly
# see https://stackoverflow.com/questions/50168647/multiprocessing-causes-python-to-crash-and-gives-an-error-may-have-been-in-progr
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

For any of the figures, I recommend opening them in plotly and touching them up. plotly is an incredible library and I highly recommend it as a replacement for whatever plotting library you are using.

Retrieve Statistics

Thanks to a few functions already developed, you can get all of the statistics for your articles in under 10 seconds.

Each of these entries is a separate article. To get the information about each article, we use the next function. This scrapes both the article metadata and the article itself (using requests and BeautifulSoup).
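
The scraping functions themselves live in the Development notebook and are not shown here. As a minimal sketch of the general idea, assuming an article page whose body text sits in ordinary `<p>` tags, BeautifulSoup can pull out the title and a word count. The HTML string below is a made-up stand-in; in practice it would come from something like `requests.get(url).text`.

```python
from bs4 import BeautifulSoup

# Hypothetical article HTML; a real page would be fetched with requests.
html = """<html><head><title>A Hypothetical Article</title></head>
<body><article><h1>A Hypothetical Article</h1>
<p>This is the first paragraph of the article.</p>
<p>And this is the second one.</p></article></body></html>"""

# Parse with the parser from the standard library (no extra dependency).
soup = BeautifulSoup(html, "html.parser")

# Article title from the <title> tag.
title = soup.title.get_text(strip=True)

# Text of every paragraph, then a simple whitespace-based word count.
paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
word_count = sum(len(p.split()) for p in paragraphs)

print(title, word_count)
```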

Analysis

With the comprehensive data, we can do any sort of analysis we want. There's a lot of data here and I'm sure you'll be able to find other interesting things to do with the data.

Correlations

We can start off by looking at correlations. We'll limit this to the published articles for now.

If we are looking at maximizing claps, what do we want to focus on?
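
One way to answer that question is to rank every column by its correlation with claps. The sketch below uses pandas on synthetic data; the column names and values are hypothetical stand-ins for the real stats table, not the author's data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the article stats; column names are hypothetical.
rng = np.random.default_rng(0)
n = 50
views = rng.integers(100, 10_000, size=n)
reads = (views * rng.uniform(0.2, 0.6, size=n)).astype(int)
df = pd.DataFrame({
    "views": views,
    "reads": reads,
    "read_time": rng.integers(3, 30, size=n),
    "claps": (reads * rng.uniform(0.05, 0.3, size=n)).astype(int),
})

# Pairwise Pearson correlations between all numeric columns.
corrs = df.corr()

# Correlation of every other column with claps, strongest first.
claps_corrs = corrs["claps"].drop("claps").sort_values(ascending=False)
print(claps_corrs)
```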

Okay, so most of these occur after the article is released. However, the tag Towards Data Science seems to help quite a bit! It also looks like the read time is negatively correlated with the number of claps.

Correlation Heatmap

Using the plotly Python library, we can very rapidly create great-looking interactive charts.

Here are the available colorscales if you want to try others:

colorscales = ['Greys', 'YlGnBu', 'Greens', 'YlOrRd', 'Bluered', 'RdBu',
        'Reds', 'Blues', 'Picnic', 'Rainbow', 'Portland', 'Jet',
        'Hot', 'Blackbody', 'Earth', 'Electric', 'Viridis', 'Cividis']
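
The plotting helper used for the heatmap is not shown in this notebook. As a rough sketch, a correlation heatmap can be described as a plain plotly figure dictionary (one `heatmap` trace plus a layout) built from `DataFrame.corr()`. The data and column names here are synthetic assumptions; the resulting dictionary could then be handed to `plotly.graph_objects.Figure`.

```python
import numpy as np
import pandas as pd

# Synthetic data with hypothetical column names.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(40, 3)), columns=["views", "reads", "claps"])
corrs = df.corr()

# Figure specification in plotly's dictionary form.
heatmap_figure = {
    "data": [{
        "type": "heatmap",
        "z": corrs.values.tolist(),   # correlation matrix values
        "x": list(corrs.columns),     # column labels
        "y": list(corrs.index),       # row labels
        "colorscale": "Viridis",      # any name from the list above
        "zmin": -1,                   # correlations are bounded in [-1, 1]
        "zmax": 1,
    }],
    "layout": {"title": {"text": "Correlation heatmap"}},
}
```

Swapping the `colorscale` value for another name from the list is enough to restyle the chart.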

Correlations by themselves don't tell us that much. It does not help that most of these are pretty obvious; for example, claps and fans are highly correlated. Sometimes correlations by themselves are useful, but not really in this case.

Scatterplot Matrix

[Output: scatterplot matrix figures]

Histograms

[Output: histogram figures]

Cumulative Plot

[Output: cumulative plot figures]

With Range Slider

The neat part about plotly is we can easily add more elements to our plots. For example, to make a range selector and a range slider, let's just pass in an extra parameter to the function.
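
The helper function and its extra parameter are not shown here, so the following is only a sketch of the underlying plotly settings: a range slider and range selector are properties of the x-axis in the figure layout. The dates and view counts are made up, and the dictionary could be passed to `plotly.graph_objects.Figure` for display.

```python
from datetime import date, timedelta

# Hypothetical daily cumulative view counts.
days = [date(2018, 1, 1) + timedelta(days=i) for i in range(120)]
cumulative_views = [10 * (i + 1) for i in range(120)]

figure = {
    "data": [{
        "type": "scatter",
        "mode": "lines",
        "x": [d.isoformat() for d in days],
        "y": cumulative_views,
    }],
    "layout": {
        "xaxis": {
            "type": "date",
            # Draggable slider under the plot.
            "rangeslider": {"visible": True},
            # Buttons that jump to a fixed window ending at the latest date.
            "rangeselector": {"buttons": [
                {"count": 1, "label": "1m", "step": "month", "stepmode": "backward"},
                {"count": 3, "label": "3m", "step": "month", "stepmode": "backward"},
                {"step": "all"},
            ]},
        },
    },
}
```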

Scatter Plots

[Output: scatter plot figures]

Univariate Linear Regressions

For the linear regressions, we'll focus on articles that were published in Towards Data Science. This makes the relationships clearer because the other articles are a mixed bag. We'll start off using a single variable - univariate - and focusing on linear relationships.

Views Regressed by Word Count

Let's regress the number of views on the word count for articles published in Towards Data Science. We are using statsmodels.api.OLS, which does not add an intercept unless a constant column is supplied, so the fit is forced through the origin. I made this choice because the number of views can never be negative (sometimes we do need an intercept, so I left this as a parameter).
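
To make the no-intercept choice concrete, here is a small NumPy sketch (not the statsmodels code used in the notebook) comparing a fit through the origin with a fit that includes an intercept. The data are synthetic and the variable names are assumptions. In statsmodels, the intercept version corresponds to adding a constant column with `sm.add_constant`.

```python
import numpy as np

# Synthetic word counts and views; not the author's data.
rng = np.random.default_rng(42)
word_count = rng.uniform(500, 5000, size=30)
views = 13.0 * word_count + rng.normal(0, 5000, size=30)

# Least squares through the origin: slope = sum(x * y) / sum(x^2).
slope_origin = np.sum(word_count * views) / np.sum(word_count ** 2)

# Least squares with an intercept: prepend a column of ones.
X = np.column_stack([np.ones_like(word_count), word_count])
intercept, slope = np.linalg.lstsq(X, views, rcond=None)[0]

print(f"through origin: views = {slope_origin:.2f} * words")
print(f"with intercept: views = {intercept:.1f} + {slope:.2f} * words")
```

The slope of the origin-constrained fit is what the "views per extra word" reading refers to.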

This tells us that for every extra word, I get 13 more views! If we look at the plot, there is one outlying data point beyond 5000 words. What happens if I stick to articles under 5000 words published on Towards Data Science?

Now we see that for every extra word, I get 14 more views! However, it looks like I want to keep my articles under 5000 words (about a 25 minute reading time).

Read Ratio Regressed by Reading Time

If we want to fit a model with an intercept, we can use scipy.stats.linregress.

This time, we see that for every additional minute of reading time, the percentage of people who read the article declines by 2.3 percentage points. The intercept says that for an article with a 0-minute reading time, 53% of people would read it!
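
A minimal sketch of this kind of fit with `scipy.stats.linregress`, on synthetic data generated to echo the slope and intercept quoted above (the real article data are not reproduced here):

```python
import numpy as np
from scipy import stats

# Synthetic reading times (minutes) and read ratios (percent).
rng = np.random.default_rng(7)
read_time = rng.uniform(3, 20, size=40)
read_ratio = 53.0 - 2.3 * read_time + rng.normal(0, 3, size=40)

# Ordinary least squares with an intercept.
result = stats.linregress(read_time, read_ratio)

print(f"slope:     {result.slope:.2f} points per minute")
print(f"intercept: {result.intercept:.1f} percent at 0 minutes")
print(f"r:         {result.rvalue:.2f}")
```

The intercept is the fitted read ratio at a reading time of zero, which is an extrapolation outside the range of the data.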

Let's take a look at a few different fits.

This clearly is not the best fit!

Univariate Polynomial Regressions

Next, we'll let the degree of the fit increase above 1. Overfitting (especially with limited data) is definitely going to be the outcome, but we'll let this serve as a lesson about having too many parameters in your model!
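
The overfitting effect can be sketched with `numpy.polyfit` on a small synthetic data set (not the article data). The underlying relationship is linear, yet the training error keeps shrinking as the degree rises, because each higher-degree polynomial has more freedom to chase the noise.

```python
import numpy as np

# A truly linear relationship plus noise, with only 12 points.
rng = np.random.default_rng(3)
x = np.linspace(0, 1, 12)
y = 2.0 * x + rng.normal(0, 0.2, size=x.size)

# Fit polynomials of increasing degree and record the training error.
train_sse = {}
for degree in range(1, 7):
    coeffs = np.polyfit(x, y, deg=degree)
    fitted = np.polyval(coeffs, x)
    train_sse[degree] = float(np.sum((y - fitted) ** 2))

for degree, sse in train_sse.items():
    print(f"degree {degree}: training SSE = {sse:.4f}")
```

A lower training error here does not mean a better model; evaluating on held-out articles would be the way to check.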

Multivariate Regressions

Next, we'll consider more independent variables in our model. For this, we need to break out the exceptional Scikit-Learn library. We'll use linear_model.LinearRegression, which supports multiple independent variables.
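
As a hedged sketch of what such a model looks like, the example below fits `sklearn.linear_model.LinearRegression` to synthetic data with one numeric feature and two 0/1 tag indicators. The feature names and the effect sizes are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features; names and effect sizes are hypothetical.
rng = np.random.default_rng(11)
n = 60
read_time = rng.uniform(3, 25, size=n)
tag_python = rng.integers(0, 2, size=n)      # 1 if the article has the tag
tag_education = rng.integers(0, 2, size=n)
reads = (2000 - 50 * read_time + 800 * tag_python
         - 300 * tag_education + rng.normal(0, 100, size=n))

# One row per article, one column per independent variable.
X = np.column_stack([read_time, tag_python, tag_education])
model = LinearRegression().fit(X, reads)

# The sign of each coefficient shows the direction of its contribution.
for name, coef in zip(["read_time", "tag_python", "tag_education"], model.coef_):
    print(f"{name:>14}: {coef:8.1f}")
print(f"{'intercept':>14}: {model.intercept_:8.1f}")
```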

We can see that some variables contribute positively to the number of reads, while others decrease the number of reads! Evidently, I should decrease the reading time, not use the tag education, and use the tags Towards Data Science and Python.

Extrapolations

The most fun part of this is extrapolating wildly into the future! Using the past stats, we can make estimates for the future using the number of days since publishing.
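
The extrapolation functions are not shown here; a minimal sketch of the idea, assuming a simple linear trend in total views against days since first publishing, is to fit a line with `numpy.polyfit` and evaluate it at future day counts. The numbers are invented for illustration.

```python
import numpy as np

# Hypothetical history: total views recorded every 10 days.
days = np.arange(0, 200, 10)
views = 50.0 * days + 1000.0

# Fit a straight line to the history.
slope, intercept = np.polyfit(days, views, deg=1)

# Evaluate the fitted line one and two years after first publishing.
future_days = np.array([365, 730])
predicted = slope * future_days + intercept

for d, v in zip(future_days, predicted):
    print(f"day {d}: about {v:,.0f} views")
```

Predictions this far outside the observed range rest entirely on the trend continuing unchanged.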

Conclusions

Well, that's about all I have! There is a lot of additional analysis that could be done here, and going forward, I'll be further developing these functions and trying to extract more information. Feel free to use these functions on your own articles, and of course, contribute as needed! Developing this library has been enjoyable, and I look forward to expanding it, so any suggestions are welcome and appreciated.