In this notebook, we will analyze my Medium article stats. The functions for scraping and formatting the data were developed in the
Development notebook, and here we will focus on looking at the data quantitatively and visually.
To apply this analysis to your own Medium data, place your stats files in the data/ directory. You can also save the responses to do a similar analysis.
```shell
# Might need to run this on Mac for multiprocessing to work properly.
# See https://stackoverflow.com/questions/50168647/multiprocessing-causes-python-to-crash-and-gives-an-error-may-have-been-in-progr
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
```
For any of the figures, I recommend opening them in plotly and touching them up.
plotly is an incredible library, and I highly recommend it as a replacement for whatever plotting library you are currently using.
Thanks to a few functions already developed, you can get all of the statistics for your articles in under 10 seconds.
Each of these entries is a separate article. To get the information about each article, we use the next function, which scrapes both the article metadata and the article text itself.
With the comprehensive data, we can do any sort of analysis we want. There's a lot of data here and I'm sure you'll be able to find other interesting things to do with the data.
We can start off by looking at correlations. We'll limit this to the
published articles for now.
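As a rough sketch of this step (the column names and numbers here are made-up stand-ins for the real stats, not the actual data):

```python
import pandas as pd

# Hypothetical subset of the article stats; column names are assumptions.
df = pd.DataFrame({
    'claps':      [180, 50, 1200, 300, 90],
    'fans':       [40, 10, 260, 70, 20],
    'word_count': [2100, 800, 3500, 1800, 1200],
    'read_time':  [8, 3, 14, 7, 5],
})

# In the real data we would first restrict to published articles.
corrs = df.corr()

# Correlations with claps, sorted to see what moves together with claps.
print(corrs['claps'].sort_values(ascending=False))
```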
If we are looking at maximizing claps, what do we want to focus on?
Okay, so most of these occur after the article is released. However, the tag
Towards Data Science seems to help quite a bit! It also looks like the read time is negatively correlated with the number of claps.
Using the plotly Python library, we can very rapidly create great-looking interactive charts.
Here are the available colorscales if you want to try others:
```python
colorscales = ['Greys', 'YlGnBu', 'Greens', 'YlOrRd', 'Bluered', 'RdBu',
               'Reds', 'Blues', 'Picnic', 'Rainbow', 'Portland', 'Jet',
               'Hot', 'Blackbody', 'Earth', 'Electric', 'Viridis', 'Cividis']
```
Correlations by themselves don't tell us that much. It does not help that most of these are pretty obvious, such as the fact that claps and fans will be highly correlated. Sometimes correlations by themselves are useful, but not really in this case.
The neat part about plotly is we can easily add more elements to our plots. For example, to make a range selector and a range slider, let's just pass in an extra parameter to the function.
For the linear regressions, we'll focus on articles that were published in Towards Data Science. This makes the relationships clearer because the other articles are a mixed bag. We'll start off using a single variable - univariate - and focusing on linear relationships.
Let's do a regression of the number of words versus the views for articles published in Towards Data Science. We are using statsmodels.api.OLS, which fits a model without an intercept (effectively setting it to 0) unless we explicitly add a constant term. I made this choice because the number of views can never be negative (sometimes we do need an intercept, so I left this as a parameter).
This tells us that for every extra word, I get 13 more views! If we look at the plot, there is one outlying data point beyond 5000 words. What happens if I stick to articles under 5000 words published on Towards Data Science?
Now we see that for every extra word, I get 14 more views! However, it looks like I want to keep my articles under 5000 words (about a 25 minute reading time).
If we want to fit a model with an intercept, we can do that as well.
This time, we see that for every additional minute of reading time, the percentage of people who read the article declines by 2.3%. For an article with a 0 minute reading time, 53% of people will read it!
Let's take a look at a few different fits.
This clearly is not the best fit!
Next, we'll let the degree of the fit increase above 1. Overfitting (especially with limited data) is definitely going to be the outcome, but we'll let this serve as a lesson about having too many parameters in your model!
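The overfitting lesson can be sketched with np.polyfit (invented data; with six points, a degree-5 polynomial interpolates them exactly):

```python
import numpy as np

# Hypothetical reading-time vs read-ratio data (six points).
x = np.array([3, 5, 8, 11, 15, 20], dtype=float)
y = np.array([46, 42, 35, 28, 19, 7], dtype=float)

# Fit polynomials of increasing degree and compare training error.
for degree in (1, 2, 5):
    coeffs = np.polyfit(x, y, degree)
    sse = float(np.sum((y - np.polyval(coeffs, x)) ** 2))
    print(f'degree {degree}: sum of squared errors = {sse:.4f}')

# The degree-5 fit passes through all six points: zero training error,
# but wild behavior between and beyond the data points.
```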
Next, we'll consider more independent variables in our model. For this, we need to break out the exceptional Scikit-Learn library. We'll use linear_model.LinearRegression, which supports multiple independent variables.
We can see that some variables contribute positively to the number of reads, while others decrease the number of reads! Evidently, I should decrease the reading time, not use the tag education, and use the tags Towards Data Science and Python.
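A sketch of the multivariate fit (the feature names mirror the discussion above, but the matrix of values is fabricated):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical features per article: read time (minutes), then 0/1 tag
# indicators for Towards Data Science, python, and education.
X = np.array([
    [ 8, 1, 1, 0],
    [ 3, 0, 0, 1],
    [14, 1, 0, 0],
    [ 7, 1, 1, 0],
    [ 5, 0, 0, 1],
    [10, 1, 0, 1],
])
reads = np.array([4000, 300, 2500, 3800, 250, 1500])

# One coefficient per feature shows its estimated effect on reads.
model = LinearRegression().fit(X, reads)
for name, coef in zip(['read_time', 'tds', 'python', 'education'], model.coef_):
    print(f'{name}: {coef:.1f}')
```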
The most fun part of this is extrapolating wildly into the future! Using the past stats, we can make estimates for the future using the numbers of days since publishing.
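The extrapolation step amounts to fitting views against days since publication and evaluating the line far past the observed range (made-up numbers; a linear trend is assumed purely for illustration):

```python
import numpy as np

# Hypothetical cumulative views observed at various days since publication.
days = np.array([10, 30, 60, 90, 120], dtype=float)
views = np.array([2000, 5500, 11000, 16800, 22000], dtype=float)

# Fit a straight line, then extrapolate (wildly) a year out.
slope, intercept = np.polyfit(days, views, 1)
one_year = slope * 365 + intercept
print(f'Projected views at day 365: {one_year:.0f}')
```

The further past the data you extrapolate, the less the projection means, which is exactly why this is the "fun" part rather than the rigorous one.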
Well, that's about all I have! There is a lot of additional analysis that could be done here, and going forward, I'll be further developing these functions and trying to extract more information. Feel free to use these functions on your own articles, and of course, contribute as needed! Developing this library has been enjoyable, and I look forward to expanding it so any suggestions are welcome and appreciated.