In this notebook, we will analyze my Medium article stats. The functions for scraping and formatting the data were developed in the `Development` notebook; here we will focus on examining the data quantitatively and visually.

To apply this to your own Medium data:

- Go to your stats page: https://medium.com/me/stats
- Scroll all the way to the bottom so that all of your articles are loaded.
- Right-click and choose 'Save as'.
- Save the file as `stats.html` in the `data/` directory.

You can also save the responses to do a similar analysis.
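Once the page is saved, reading it back in is straightforward. This is a minimal sketch of the loading step, not the notebook's actual parsing code; the file path and helper names are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def parse_stats_page(html):
    """Parse the saved stats page HTML into a BeautifulSoup tree."""
    return BeautifulSoup(html, "html.parser")


def load_stats_page(path="data/stats.html"):
    """Read the saved Medium stats page from disk and parse it."""
    with open(path, "r", encoding="utf-8") as f:
        return parse_stats_page(f.read())
```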


```
# Might need to run this on macOS for multiprocessing to work properly
# see https://stackoverflow.com/questions/50168647/multiprocessing-causes-python-to-crash-and-gives-an-error-may-have-been-in-progr
export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES
```

For any of the figures, I recommend opening them in plotly and touching them up. `plotly` is an incredible library, and I highly recommend it as a replacement for whatever plotting library you are currently using.

Thanks to a few functions already developed, you can get all of the statistics for your articles in under 10 seconds.

Each of these entries is a separate article. To get the information about each article, we use the next function. This scrapes both the article metadata and the article itself (using `requests` and `BeautifulSoup`).
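A hypothetical sketch of that per-article step: fetch the page with `requests`, then pull metadata out of the HTML with `BeautifulSoup`. The specific fields extracted here (title, word count) are assumptions for illustration, not Medium's actual markup or the notebook's real function.

```python
import requests
from bs4 import BeautifulSoup


def parse_article(html):
    """Extract a few illustrative metadata fields from an article's HTML."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.text if soup.title else None
    # Count words in the body only, so the title is not double-counted
    word_count = len(soup.body.get_text().split()) if soup.body else 0
    return {"title": title, "word_count": word_count}


def fetch_article(url):
    """Download an article and parse its metadata."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return parse_article(response.text)
```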

With the comprehensive data, we can do any sort of analysis we want. There's a lot of data here and I'm sure you'll be able to find other interesting things to do with the data.

We can start off by looking at correlations. We'll limit this to the `published` articles for now.
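The correlation step looks roughly like the sketch below, assuming the scraped stats live in a pandas DataFrame with a `published` flag. The column names and numbers here are placeholders, not my actual stats.

```python
import pandas as pd

# Placeholder stand-in for the scraped stats DataFrame
df = pd.DataFrame({
    "published": [True, True, True, False],
    "claps": [120, 300, 50, 0],
    "reads": [400, 900, 150, 10],
    "read_time": [8, 12, 5, 3],
})

# Restrict to published articles, then compute pairwise correlations
published = df[df["published"]]
corrs = published.drop(columns="published").corr()
```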

If we are looking at maximizing claps, what do we want to focus on?

Okay, so most of these occur after the article is released. However, the tag `Towards Data Science` seems to help quite a bit! It also looks like the read time is negatively correlated with the number of claps.

Using the `plotly` Python library, we can very rapidly create great-looking interactive charts.

Here are the available colorscales if you want to try others:

```
colorscales = ['Greys', 'YlGnBu', 'Greens', 'YlOrRd', 'Bluered', 'RdBu', 'Reds', 'Blues', 'Picnic', 'Rainbow', 'Portland', 'Jet', 'Hot', 'Blackbody', 'Earth', 'Electric', 'Viridis', 'Cividis']
```

Correlations by themselves don't tell us that much, and it doesn't help that most of these are fairly obvious: for example, `claps` and `fans` will be highly correlated. Sometimes correlations alone are useful, but not really in this case.

The neat part about plotly is that we can easily add more elements to our plots. For example, to add a range selector and a range slider, we just pass an extra parameter to the function.

For the linear regressions, we'll focus on articles that were published in Towards Data Science. This makes the relationships clearer because the other articles are a mixed bag. We'll start off with a single variable (univariate) and focus on linear relationships.

Let's do a regression of the number of words versus the views for articles published in Towards Data Science. We are using `statsmodels.api.OLS`, which fits with an intercept of 0 unless you explicitly add a constant. I made this choice because the number of views can never be negative (sometimes we do need an intercept, so I left this as a parameter).

This tells us that for every extra word, I get 13 more views! If we look at the plot, there is one outlying data point beyond 5000 words. What happens if I stick to articles under 5000 words published on Towards Data Science?

Now we see that for every extra word, I get 14 more views! However, it looks like I want to keep my articles under 5000 words (about a 25 minute reading time).

If we want to fit a model with an intercept, we can use `scipy.stats.linregress`.
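A sketch of that call on synthetic read-ratio data (the real values come from the scraped stats):

```python
from scipy import stats

# Synthetic data: reading time in minutes vs. percent of viewers who read
read_time = [3, 5, 8, 12, 20]
read_pct = [46, 41, 35, 25, 7]

# linregress fits both a slope and an intercept
result = stats.linregress(read_time, read_pct)
slope, intercept = result.slope, result.intercept
```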

This time, we see that for every additional minute of reading time, the percentage of people who read the article declines by 2.3%. For an article with a 0 minute reading time, 53% of people will read it!

Let's take a look at a few different fits.

This clearly is not the best fit!

Next, we'll let the degree of the fit increase above 1. Overfitting (especially with limited data) is definitely going to be the outcome, but we'll let this serve as a lesson about having too many parameters in your model!
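One way to see the lesson concretely: with `numpy.polyfit`, a degree-5 polynomial on six points has as many parameters as data points, so it passes through every point exactly; the training error collapses to zero while the model learns nothing generalizable. The data below is made up for illustration.

```python
import numpy as np

# Six noisy points with a roughly linear trend
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 3.5, 3.0, 5.5, 4.0, 7.0])

linear = np.poly1d(np.polyfit(x, y, 1))
degree5 = np.poly1d(np.polyfit(x, y, 5))  # as many parameters as points

# Training error: the degree-5 fit interpolates the data exactly
linear_error = np.sum((linear(x) - y) ** 2)
degree5_error = np.sum((degree5(x) - y) ** 2)
```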

Next, we'll consider more independent variables in our model. For this, we need to break out the exceptional Scikit-Learn library. We'll use `linear_model.LinearRegression`, which supports multiple independent variables.
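A sketch of the multivariate fit with scikit-learn. The feature columns (read time plus one-hot tag indicators) and the numbers are assumptions standing in for the real stats DataFrame.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Columns: read_time, tag_towards_data_science, tag_python (one-hot flags)
X = np.array([
    [5, 1, 1],
    [8, 1, 0],
    [12, 0, 1],
    [15, 0, 0],
    [7, 1, 1],
])
reads = np.array([900, 700, 300, 150, 850])

model = LinearRegression().fit(X, reads)
# Pair each coefficient with its feature name for inspection
coefficients = dict(zip(["read_time", "tds_tag", "python_tag"], model.coef_))
```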

We can see that some variables contribute positively to the number of reads, while others decrease it! Evidently, I should decrease the reading time, avoid the Education tag, and use the Towards Data Science and Python tags.

The most fun part of this is extrapolating wildly into the future! Using the past stats, we can make estimates for the future based on the number of days since publication.
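A sketch of one such extrapolation: fit views against days since publication, then evaluate the line a year out. This wildly assumes the past trend continues, and the numbers below are made up.

```python
import numpy as np

# Synthetic cumulative views at various days since publication
days = np.array([10, 30, 60, 90, 120])
views = np.array([500, 1400, 2900, 4300, 5800])

# Linear fit, then extrapolate to one year after publication
slope, intercept = np.polyfit(days, views, 1)
one_year_estimate = slope * 365 + intercept
```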

Well, that's about all I have! There is a lot of additional analysis that could be done here, and going forward, I'll be further developing these functions and trying to extract more information. Feel free to use these functions on your own articles, and of course, contribute as needed! Developing this library has been enjoyable, and I look forward to expanding it, so any suggestions are welcome and appreciated.