We'll work with a Kaggle dataset: House Sales in King County, USA.
These are the features of the dataset:
Importing the required libraries:
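A minimal sketch, assuming the usual pandas/matplotlib/seaborn stack used throughout this kind of analysis:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```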
Loading the dataframe:
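Assuming the Kaggle CSV was downloaded locally as `kc_house_data.csv` (the dataset's default file name):

```python
# Load the CSV into a dataframe and peek at the first rows.
df = pd.read_csv('kc_house_data.csv')
df.head()
```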
The first step when analyzing data is cleaning: making sure we've loaded the data correctly and that the values are valid. This is a process that involves multiple steps, but for now we start with our 5-minute check:
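Starting with the dimensions:

```python
df.shape  # (21613, 21)
```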
With `shape` we know that there are 21,613 rows and 21 columns (features). Let's check for red flags in those features:
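```python
df.info()
```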
`info` gives you a quick summary of both the type and the count for each column. In this case the data seems correct: there are no missing values and the types are right.
Our objective is to predict the price of a house based on the features that we know about it. For example, we know that a larger surface area and more bedrooms will correlate with a higher price. But what about the `id` of the house? It's probably just an internal ID and doesn't affect the real price.
That is feature selection: understanding which features are important to the ML model.
With pandas it's extremely simple to exclude columns:
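For example, dropping `id` (a sketch; `drop` returns a new dataframe unless `inplace=True` is passed):

```python
# Drop the internal id column; it carries no pricing signal.
df = df.drop(columns=['id'])
```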
What other variables would you exclude? For this workshop, we'll exclude `long`. We could have done a deeper analysis for `long`, but `zipcode` probably captures enough of the location information.
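Something like:

```python
# Drop long as well; zipcode already encodes the location.
df = df.drop(columns=['long'])
```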
Some variables will have a higher (positive or negative) correlation with the price. We know that the surface area of a house is positively correlated with its price: the larger the house, the higher the price. But what about the others? We can build a simple correlation plot to understand the relationships between variables a little better:
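A sketch using seaborn's heatmap over the pandas correlation matrix (figure size and color map are arbitrary choices here):

```python
# Pairwise correlations between the numeric columns.
corr = df.corr(numeric_only=True)
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
```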
So, for example, we can see that `sqft_living` is highly correlated with the price.
We'll use a simple visualization to get a visual clue about these variables and their correlations:
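One option, sketched with pandas' scatter matrix (the column subset here is illustrative, to keep the grid readable):

```python
# Grid of pairwise scatter plots for a handful of columns.
cols = ['price', 'sqft_living', 'bedrooms', 'bathrooms', 'zipcode']
pd.plotting.scatter_matrix(df[cols], figsize=(12, 12), alpha=0.2)
plt.show()
```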
We see some strange patterns, like the apparent "negative" correlation between `zipcode` and price, something that doesn't make any sense. We'll talk more about this when we explore `zipcode` as a categorical feature later.
Once we identify correlations between variables, we can explore how they're related. For example, we saw other variables that also seem strongly correlated with the price, but are they just linearly correlated?
It doesn't seem so, or at least it's not as clear as with `sqft_living`. There seems to be some sort of polynomial relationship. We can use a logarithmic y-axis to test:
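A sketch; `grade` here is a stand-in for whichever variable is being inspected against the price:

```python
# Scatter plot with a logarithmic y axis.
df.plot(kind='scatter', x='grade', y='price', alpha=0.2)
plt.yscale('log')
plt.show()
```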
It now looks a little bit better. We can use these relationships we've identified to improve our model later.
Linear regression (along with other ML models) is really sensitive to outliers:
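One quick way to spot them, sketched for `bedrooms`:

```python
# Count how many houses have each number of bedrooms.
df['bedrooms'].value_counts().sort_index()
```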
🤔 A house with 33 bedrooms? There's something going on here:
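A look at the offending row:

```python
# Inspect the suspicious record.
df[df['bedrooms'] == 33]
```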
It makes sense for a (really expensive) house to have, let's say 10 bedrooms, but 33 seems like an error.
33 bedrooms and only 1.75 bathrooms? 😅 Clearly an error.
Now, what about those properties without bathrooms? That is strange; let's take a look:
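```python
# Properties reporting zero bathrooms.
df[df['bathrooms'] == 0]
```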
Now that we look at it, it makes a little more sense. Maybe those are just warehouses or other types of storage facilities? Without more information it's difficult to make a decision. This is an important lesson: domain expertise is fundamental when analyzing data. I won't remove any houses for now.
How are other variables doing?
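A quick way to check, as a sketch:

```python
# Summary statistics for every numeric column.
df.describe()
```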
This probably requires a little bit more analysis, but let's proceed.
The `zipcode` feature poses an issue. Machine learning models don't understand "human" features like `zipcode`. For an ML algorithm, a value of `98178` in `zipcode` is "greater" than `98125`, even though for us, knowing the area, zipcode `98125` might have more expensive houses. These are the zipcodes in our dataset:
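A sketch:

```python
# List the zipcodes present and how many houses fall in each
# (there are 70 unique values).
df['zipcode'].value_counts()
```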
Only 70 zipcodes:
Introducing "Dummy Variables":
Dummy variables are the correct way to feed a categorical feature to an ML model. We'll see how to combine these later.
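A sketch with pandas (`prefix` is just a naming choice for the new columns):

```python
# One 0/1 column per zipcode value.
zipcode_dummies = pd.get_dummies(df['zipcode'], prefix='zip')
zipcode_dummies.head()
```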
There's a final IMPORTANT point to discuss: "scaling" and "normalizing" features. It has a mathematical explanation, but basically, what we DON'T want is features that are in completely different units. For example:
The values here are too dissimilar, which will make some algorithms perform poorly and more slowly. We'll then "scale" these features to remove the units. Read more here: Importance of Feature Scaling.
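A sketch with scikit-learn's `StandardScaler` (the column list here is illustrative, not the original one):

```python
from sklearn.preprocessing import StandardScaler

# Rescale the numeric features to zero mean and unit variance.
num_cols = ['sqft_living', 'sqft_lot', 'bedrooms', 'bathrooms']
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
```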
Let's see now how our Linear Regression is performing with these simple modifications:
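An illustrative sketch, assuming the `zipcode_dummies` dataframe from above; the exact feature set and split may differ from the original notebook:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Combine the numeric features with the zipcode dummies.
X = pd.concat(
    [df.drop(columns=['price', 'zipcode']).select_dtypes('number'), zipcode_dummies],
    axis=1,
)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)  # R² on the held-out set; ~0.79 here
```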
0.79! Much better, right? This is just an introduction to how important a good data analysis process is for Machine Learning.