This post was inspired by Section 2.3 of the excellent (and free) textbook, The Elements of Statistical Learning (1). The example analyses were done with scikit-learn (2). The data analyzed was taken from Kaggle (3).

Try for a moment to forget everything that you've learned about statistical modeling and machine learning. Now, consider the problem of predicting a quantity, @@0@@, associated with a set of descriptors @@1@@ is a list of descriptors, which we will take to be real-valued numbers for simplicity.

One common approach to predicting @@2@@ is to assume a particular kind of relationship between @@3@@ and @@4@@ such that

@@5@@

where @@6@@ and @@7@@ are adjustable parameters that are determined by minimizing a measure of error in the predictions over the training set. This approach, known as linear regression may have been the first approach to prediction of scalar values that you learned about, but is this what you would have come up with if you were inventing statistical modeling for the first time?

The application of linear regression entails a fairly strong assumption, namely that there's a linear relationship between the predictors (or descriptors) and the value to be predicted. The documented success of linear models in practice is very often all the justification that one needs for trying linear regression when you come upon a new prediction problem. But is there an even simpler assumption that you could have (or would have) made if you didn't know about the successes of linear models?

It would seem natural to assume that sets of descriptors @@8@@. So if nearest neighbors regression does not assume a linear model, then how does it work?

The predictions by nearest neighbors regression of @@9@@, denoted @@10@@, are given by

@@11@@

In words, the above equation says

Our prediction of @@12@@ for a set of descriptors @@13@@.

That is a strikingly simple and natural way of performing predictions. As with most methods, however, the devil is in the details. In particular, we need 1. a way of choosing @@14@@ 2. a measure of distance between points @@15@@

There are many ways to handle these problems, but a few of the simplest are to (1) train using different values of @@16@@ and find the minimum value of an error measure on a test set and (2) use the Euclidean distance (probably) after scaling the descriptors).

Let's see how some of these ideas work in practice. We will be looking at some housing price data from King's County that can be downloaded from Kaggle (3).

There are some houses that appear more than once in the set. For illustration purposes, let's eliminate the duplicate entries. You'll see why below.

It looks like we have a bunch data types to work with, but to simplify things let's look only at the latitude and longitude columns. These variables will lead to a natural sense of "closeness" and have the same units, so we don't have to worry about scaling them before computing the distances.

Split the data into train and test sets.

Import scikit-learn's nearest neighbor regression function and fit the training data. For now, we will fit using the limiting case of @@0@@, i.e., we will assume that the price of the house is equal to the price of the nearest neighbor in the training set.

Make the predictions @@0@@ for the training and test sets.

Compute the root mean squared error for the training and test sets.

The root mean squared error over the training set is zero! Think about how the prediction is being done. When it comes time to predict the price of a house in the training set, it looks for the closest house in the training set and uses the price of that house as the predicted price. But when you are making predictions on the training set, the closest house is, of course, the house itself!

Now you can understand why we eliminated the duplicate houses in the data set. If the same house shows up twice and sold for two different prices, then even on the training set the RMSE would not be zero. That might be fine in practice, but I wanted to illustrate the above point.

Let's try a few different values of @@0@@ to see what "size" neighborhood works best on the training set.

The test error drops initially when going from 1 to about 5 neighbors and then remains relatively constant. As the number of neighbors @@0@@ approaches 50, the testing and training error are nearly the same.

You may feel as though it would be more natural to average over prices over a specific radius rather than a fixed number of neighbors. There's a method for that! Note that, below, the radius has the same units as the distance that is being computed, in this case degrees latitude/longitude.

The warnings here are telling us an important story about the disadvantage of using a radius-based predictor: sometimes there are no houses in the training set to make predictions on a house in the test set. Those cases are discarded when computing the error metrics, but in practice this would be a problem.

Perhaps unsurprisingly, nearest neighbors regression is doing better than linear regression. This example was concocted such that this would be the case. Nonetheless, this comparison illustrates an important point, namely that thinking about how variables might be related to each other before choosing a particular machine learning model can help quite a bit. Can you see why @@0@@ for the prediction of housing prices ($y$) given the latitude and longitude ($\overline{x}$)?