Data on the number of public notebooks on GitHub was downloaded from this repository by Peter Parente, a contributor to the Jupyter Project.
He created a script that scrapes the GitHub web search UI for the count, appends the result to a CSV file, and then executes a notebook. The entire collection process is automated and runs on TravisCI on a daily schedule.
I've simply rewritten the plots in Plotly to make the graphs more readable and interactive. Enjoy!
First, let's load the historical data into a DataFrame indexed by date. There might be missing counts for days that we failed to sample, so we build up the expected date range and insert NaNs for the dates we missed. Then we can plot the known notebook counts.
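A minimal sketch of this reindexing step, using a made-up inline series in place of the real CSV (the actual file and column names may differ):

```python
import pandas as pd

# Hypothetical sample of daily hit counts; 2015-01-03 was never sampled.
df = pd.DataFrame(
    {"hits": [100, 110, 130]},
    index=pd.to_datetime(["2015-01-01", "2015-01-02", "2015-01-04"]),
)

# Build the full expected daily date range and reindex,
# which inserts a NaN row for every day we missed.
full_range = pd.date_range(df.index.min(), df.index.max(), freq="D")
df = df.reindex(full_range)
# 2015-01-03 is now present with a NaN count.
```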
Next, let's look at various measurements of change.
The total change in the number of *.ipynb hits between the first day we have data and today is:
The mean daily change for the entire duration is:
The change in hit count between any two consecutive days for which we have data looks like the following:
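The three change measurements above reduce to one-liners on the hit-count series. A sketch on a small made-up series (the real data would be the `hits` column loaded earlier):

```python
import pandas as pd

hits = pd.Series(
    [100, 110, 130, 160],
    index=pd.date_range("2015-01-01", periods=4, freq="D"),
)

# Total change between the first day we have data and the last.
total_change = hits.iloc[-1] - hits.iloc[0]

# Series of changes between consecutive days.
daily_change = hits.diff()

# Mean daily change over the entire duration.
mean_daily_change = daily_change.mean()
```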
The large jumps in the data are from GitHub reporting drastically different counts from one day to the next. Maybe GitHub was rebuilding a search index when we queried or had a search broker out-of-sync with the others?
Let's drop outliers, defined as values more than two standard deviations away from a centered 180-day rolling mean.
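One way to express that filter with pandas rolling windows, shown here on synthetic data with one injected spike (window size and threshold follow the text; `min_periods=1` at the edges is an assumption):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", periods=400, freq="D")
# A noisy upward trend with one large spike, mimicking a bad sample.
hits = pd.Series(np.arange(400, dtype=float) + rng.normal(0, 1, 400), index=idx)
hits.iloc[200] += 500  # injected outlier

# Centered 180-day rolling statistics.
roll = hits.rolling(window=180, center=True, min_periods=1)
deviation = (hits - roll.mean()).abs()

# Flag points more than two rolling standard deviations from the rolling mean,
# then mask them out (replace with NaN).
outliers = deviation > 2 * roll.std()
cleaned = hits.mask(outliers)
```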
Now let's do a simple linear interpolation for missing values and then look at the rolling mean of change.
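The interpolation and smoothing step can be sketched as follows; the tiny series and the 2-day smoothing window here are illustrative (the notebook would apply this to the cleaned hit counts with a longer window):

```python
import numpy as np
import pandas as pd

hits = pd.Series(
    [100.0, np.nan, 120.0, 130.0],
    index=pd.date_range("2015-01-01", periods=4, freq="D"),
)

# Fill missing days by linear interpolation between known neighbors.
filled = hits.interpolate(method="linear")  # the NaN becomes 110.0

# Day-over-day change, smoothed with a rolling mean.
rolling_change = filled.diff().rolling(window=2, min_periods=1).mean()
```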
The model appears to fit seasonal effects present in the early data and replicate them throughout the forecast period. The density of early data versus the sparsity of later data is a likely cause.
Finally, it's nice to celebrate million-notebook milestones. We can use our model to predict when they're going to occur.
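As an illustration of the milestone calculation, here is a sketch that stands in a simple linear fit for the notebook's actual forecasting model (the model, data, and `predict_milestone_date` helper are all hypothetical): invert the fitted trend to find the first day the predicted count reaches a given milestone.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the historical counts: 500k hits growing by 1k/day.
idx = pd.date_range("2015-01-01", periods=100, freq="D")
counts = pd.Series(1000.0 * np.arange(100) + 500_000, index=idx)

# Fit count ~ a * day + b on ordinal day numbers.
days = np.arange(len(counts))
a, b = np.polyfit(days, counts.to_numpy(), 1)

def predict_milestone_date(milestone, start=idx[0]):
    """First date on which the fitted trend reaches `milestone` hits."""
    day = int(np.ceil((milestone - b) / a))
    return start + pd.Timedelta(days=day)

one_million_day = predict_milestone_date(1_000_000)
```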