## Introduction to Linear Regression - Predicting Bike Usage in Madrid

This study looks at how a fairly simple machine learning algorithm might be used to predict bike-share traffic. scikit-learn is widely regarded as the gold standard for machine learning in Python. It offers many learning algorithms for regression, classification, clustering and much more. Today we will explore the sklearn.linear_model module, which provides regression methods in which the target variable is expected to be a linear combination of the independent variables.

Bike-sharing systems are an increasingly important support for multimodal transport systems. There are now hundreds of them in cities around the globe, and the number keeps growing.

The volume of data generated by these systems makes them attractive to researchers, myself included, who want to explore it and build predictive models. Bike-sharing systems function as a sensor network that can be used to study mobility in a city. In this study, we consider the effect of weather (wind and rainfall), availability and service hours as we apply a linear regression model to predict the daily number of bike users in the city of Madrid, and then assess its accuracy.

Let's import the packages we'll need to run this study and set the defaults for the plots.

Now we're going to define two functions that we'll call later: one to train and test our predictive model on our dataset and return the accuracy of its predictions, and another to display the results.

We split the dataset into train and test sets using scikit-learn's KFold cross-validator. With K-fold cross-validation, we test how well our model can be trained on part of our data and then predict the data that was not used to build it. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k − 1 subsamples are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimate.
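To make the splitting procedure concrete, here is a minimal sketch on a toy array rather than the Madrid dataset. The array `X`, the choice of 5 folds and the `random_state` value are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (an assumption for illustration): 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)

# 5 folds; shuffling with a fixed seed keeps the split reproducible.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

validation_counts = np.zeros(len(X), dtype=int)
fold_sizes = []
for train_idx, test_idx in kf.split(X):
    # Each fold trains on k - 1 subsamples and validates on the remaining one.
    fold_sizes.append((len(train_idx), len(test_idx)))
    validation_counts[test_idx] += 1

print(fold_sizes)         # (train size, validation size) for each fold
print(validation_counts)  # how often each sample served as validation data
```

Every sample ends up in the validation set exactly once, which is the property that lets the k fold scores be averaged into one estimate.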

Note that the purpose of cross-validation is model checking, not model building. The model we're actually using is an Ordinary Least Squares Linear Regression model from scikit-learn's linear_model module. This is a form of supervised learning, which consists of learning the link between two datasets: the observed data and an external variable that we are trying to predict, usually called “features” and “target” respectively.

The R², the proportion of variance in the target that the predictions explain, is the score we use to measure the accuracy of our model.
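A training-and-scoring function along these lines might look like the sketch below. It is not the original function from this post: the name `train_and_score`, its parameters and the synthetic demo data are all assumptions, used only to show how K-fold splitting, an OLS fit and the R² score fit together.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold


def train_and_score(X, y, n_splits=5, seed=0):
    """Hypothetical helper: cross-validate an OLS model and return the
    mean R² across folds plus the out-of-fold predictions."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    predictions = np.empty_like(y, dtype=float)
    for train_idx, test_idx in kf.split(X):
        # Fit on k - 1 folds, predict the held-out fold.
        model = LinearRegression().fit(X[train_idx], y[train_idx])
        fold_pred = model.predict(X[test_idx])
        predictions[test_idx] = fold_pred
        scores.append(r2_score(y[test_idx], fold_pred))
    return float(np.mean(scores)), predictions


# Synthetic, nearly linear data (an assumption; not the bike dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

mean_r2, preds = train_and_score(X_demo, y_demo)
print(f"mean R² across folds: {mean_r2:.4f}")
```

Returning the out-of-fold predictions alongside the score makes it easy to plot predicted against actual values afterwards.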

Then we want to display the results of our model's fit and plot them.

Now we're going to load our dataset into pandas. This dataset has the number of riders for each day from August 10th 2015 to August 6th 2017. The number of riders is split between casual and registered, and the two are summed in the 'cnt' column.

Let's first plot the total number of daily riders by date. Then we'll plot the total number of riders per day, separated into registered riders (those who have paid to use the bike service on an as-needed basis for a particular interval) and casual riders (those who pay to use the bikes day-to-day).

So let's first configure the plots:

As one would expect, total bike usage spikes during the spring-to-summer months and tapers off coming into the winter months.

Now let's run our model. We have information about windspeed, rainfall, availability and bike usage hours, all of which likely affect the number of riders each day, and whose effect we'll try to capture.

The total number of riders each day ('cnt') is our target variable - the values we are trying to predict. To create the model, we must "learn" the values of the coefficients, that is, the weights the model assigns to each feature, along with the intercept. Once we've learned these coefficients, we can use the model to predict bike usage.
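As a rough illustration of what "learning the coefficients" means, the sketch below fits an OLS model to noise-free synthetic data and recovers the weights used to generate it. The two features, their coefficient values and the intercept are invented for the example and are not taken from the Madrid dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Two hypothetical features, e.g. wind speed and rainfall (assumed values).
X = rng.uniform(0, 10, size=(300, 2))
true_coefs = np.array([-4.0, -7.0])  # assumed effect of each feature
true_intercept = 100.0               # assumed baseline ridership

# Noise-free target, so the fit should recover the weights almost exactly.
y = true_intercept + X @ true_coefs

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)

# Once the coefficients are learned, prediction is a weighted sum:
# 100 + (-4 * 2) + (-7 * 1) = 85
prediction = model.predict(np.array([[2.0, 1.0]]))
print(prediction)
```

With real data the target is noisy, so the learned coefficients are estimates rather than exact recoveries.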

At first glance our model seems to perform quite well, for a simple linear regression at least. Upon closer inspection it is clear that the model is over-predicting and under-predicting at certain points.

It is important to note that linear regression models tend toward low variance and high bias: under repeated sampling, the fitted models won't differ much from one another (low variance), but on average they won't do a great job of capturing the true relationship (high bias). This is one side of the bias-variance tradeoff in supervised learning. A model with high bias underfits, which limits how well it generalizes beyond the training dataset.

Linear models also rely on several other assumptions, such as the feature variables being independent of one another (no multicollinearity). If such assumptions are violated (which they usually are), the results are less reliable.

Now let's look at the R-squared. It's pretty high, apparently indicating that we have a really well-fitted model. Is this actually the case? R-squared is the proportion of the variance in the response variable that is explained by the linear model, or equivalently the relative reduction in squared error compared with always predicting the mean. The residuals are the differences between the true values of the target variable and the predicted values. In a regression model, we try to minimize these errors by finding the “line of best fit”: the line for which the squared residuals are, in total, as small as possible. This is equivalent to minimizing the Mean Squared Error (MSE) or the Residual Sum of Squares (RSS).
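The relationship between residuals, RSS, MSE and R-squared can be checked by hand on a few numbers. The values below are made up for the example. The manual calculation, 1 minus RSS over the total sum of squares, is compared against scikit-learn's `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions (assumptions for illustration).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

residuals = y_true - y_pred                  # true minus predicted
rss = np.sum(residuals ** 2)                 # Residual Sum of Squares
mse = rss / len(y_true)                      # Mean Squared Error
tss = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares

r2_manual = 1 - rss / tss
print(rss, mse, r2_manual)

# scikit-learn's implementation agrees with the manual formula.
print(r2_score(y_true, y_pred))
```

Since MSE is just RSS divided by the number of samples, minimizing one minimizes the other.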

On the training data, R-squared for a least-squares fit with an intercept lies between 0 and 1, and higher values are (supposedly) better because they mean that more variance is explained by the model. On held-out data it can even be negative, when the model predicts worse than simply using the mean would. However, it is important to note that R-squared is not the optimal way to measure performance. R-squared measures "goodness of fit," but it won't capture underlying factors, and a high R-squared can even be the result of including many features that aren't actually related to the response ('target') variable, since adding features never lowers R-squared on the training data.
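The effect of unrelated features on R-squared is easy to reproduce. In the sketch below, which uses synthetic data with an arbitrary sample size and number of noise columns, the target depends on one feature only. Twenty columns of pure noise are then added and the training R-squared is compared before and after.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 50

# One genuinely informative feature plus a noisy target (assumed setup).
x = rng.normal(size=(n, 1))
y = 2.0 * x[:, 0] + rng.normal(size=n)

# Twenty extra columns of pure noise, unrelated to the target.
noise_features = rng.normal(size=(n, 20))
X_wide = np.hstack([x, noise_features])

# .score() returns R² on the data passed in, here the training data itself.
r2_base = LinearRegression().fit(x, y).score(x, y)
r2_wide = LinearRegression().fit(X_wide, y).score(X_wide, y)

print(f"R² with 1 real feature:     {r2_base:.3f}")
print(f"R² with 20 noise features:  {r2_wide:.3f}")
```

The wider model scores at least as well on the training data even though the extra columns carry no information, which is why training R-squared alone can flatter a model.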

So, is what we have a "good" R-squared value? It's hard to say. The threshold for a good R-squared value depends heavily on the model's domain. So the measure is mostly useful for comparing different models, rather than for judging the accuracy of one particular model.

The linear regression model used in this study is limited by the fact that it can only make good predictions if there is indeed a linear relationship between the features and the response, which is why more complex methods (with higher variance and lower bias) will often outperform linear regression. If our linear model is not the way to go, we could move to more complex models that extend the traditional linear regression model, such as GLMs or heteroskedastic models.

In a future blog post, along with a new model, we can implement better ways to determine its performance, such as adjusted and predicted R-squared values. These two measures address the problems with R-squared discussed above, providing additional information by which we can evaluate a regression model’s explanatory power. We will also test the statistical significance of the predictors by “reducing the model”: the practice of including all candidate predictors and then systematically removing the ones with high p-values until only significant predictors remain. The F-test of overall significance, for example, can be used to determine whether the relationship between the dependent and the independent variables is statistically significant.
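As a small preview, adjusted R-squared can be computed from the ordinary R-squared, the number of samples n and the number of predictors p using the standard formula 1 − (1 − R²)(n − 1)/(n − p − 1). The function name `adjusted_r2` and the example numbers below are assumptions for illustration only.

```python
def adjusted_r2(r2, n_samples, n_predictors):
    """Hypothetical helper: adjusted R² from plain R², the sample count
    and the number of predictors (excluding the intercept)."""
    dof = n_samples - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more samples than predictors + 1")
    return 1 - (1 - r2) * (n_samples - 1) / dof


# Example with assumed numbers: R² = 0.9, 101 samples, 5 predictors.
adj = adjusted_r2(0.9, 101, 5)
print(f"adjusted R²: {adj:.4f}")

# The same R² with more predictors is penalized more heavily.
adj_many = adjusted_r2(0.9, 101, 50)
print(f"adjusted R² with 50 predictors: {adj_many:.4f}")
```

Unlike plain R-squared, this value can go down when a predictor is added that does not improve the fit enough to justify the extra parameter.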

All of this and more will be applied to our dataset above in the next part of the series.