We'll work with a Kaggle dataset: House Sales in King County, USA.

These are the features of the dataset:

- **id**: a notation for a house
- **date**: date the house was sold
- **price**: price (the prediction target)
- **bedrooms**: number of bedrooms per house
- **bathrooms**: number of bathrooms per bedroom
- **sqft_living**: square footage of the home
- **sqft_lot**: square footage of the lot
- **floors**: total floors (levels) in the house
- **waterfront**: whether the house has a view of a waterfront
- **view**: has been viewed
- **condition**: how good the condition is (overall)
- **grade**: overall grade given to the housing unit, based on the King County grading system
- **sqft_above**: square footage of the house apart from the basement
- **sqft_basement**: square footage of the basement
- **yr_built**: year built
- **yr_renovated**: year when the house was renovated
- **zipcode**: zip code
- **lat**: latitude coordinate
- **long**: longitude coordinate
- **sqft_living15**: living room area in 2015 (implies some renovations); this might or might not have affected the lot size area
- **sqft_lot15**: lot size area in 2015 (implies some renovations)

Importing the required libraries:
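The original import cell isn't shown; a typical stack for this kind of analysis (the exact choice of libraries is an assumption) would be:

```python
import numpy as np               # numeric arrays
import pandas as pd              # dataframes
import matplotlib
matplotlib.use('Agg')            # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt  # plotting API
# import seaborn as sns          # optional, often used for correlation heatmaps
```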

Loading the dataframe:
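The loading cell isn't shown either. The Kaggle file is commonly named `kc_house_data.csv` (an assumption); to keep this sketch self-contained, it parses an inline one-row sample with the same 21 columns instead:

```python
import io
import pandas as pd

# Real usage (filename assumed): df = pd.read_csv('kc_house_data.csv')
# Self-contained stand-in: one sample row with the dataset's 21 columns.
sample = io.StringIO(
    "id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,"
    "view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,"
    "zipcode,lat,long,sqft_living15,sqft_lot15\n"
    "7129300520,20141013T000000,221900,3,1.0,1180,5650,1,0,0,3,7,1180,0,"
    "1955,0,98178,47.5112,-122.257,1340,5650\n"
)
df = pd.read_csv(sample)
```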


The first step when analyzing data is cleaning: checking that we've loaded the data correctly and that the values are valid. This is a process that will involve multiple steps, but for now, we start with our *5 minute* check:


With `shape` we know that there are 21,613 rows and 21 columns (features). Let's check for red flags in those features:

`info` gives you a quick summary of both the type and the count for each column. In this case, the data seems correct: there are no missing values and the types are right.
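`df.shape` and `df.info()` are the two calls behind this check. A toy frame (not the Kaggle data) shows what `info` surfaces when something is wrong:

```python
import pandas as pd

toy = pd.DataFrame({
    'price': [221900, 538000, None],  # one deliberately missing value
    'bedrooms': [3, 3, 2],
})
print(toy.shape)  # (3, 2): rows, columns
toy.info()        # 'price' shows 2 non-null out of 3 entries: a red flag
```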

Our objective is to predict the price of a house based on the features that we know about it. For example, we know that a larger surface area and more bedrooms will correlate with a higher price. But what about the `id` of the house? It's probably just an internal ID and doesn't affect the real price.

That is feature selection: understanding which features are important to the ML model.

With pandas, it's extremely simple to exclude columns:
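A minimal sketch on a toy frame (the column values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [7129300520, 6414100192],
    'price': [221900, 538000],
    'bedrooms': [3, 3],
})
df = df.drop(columns=['id'])  # exclude a column that carries no signal
```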


What other variables would you exclude? For this workshop, we'll exclude `date`, `lat` and `long`. We could have done a better analysis for `lat` and `long`, but `zipcode` is probably enough.

Some variables will have a higher (positive or negative) correlation with the price. We know that the surface area of a house is positively correlated with its price: the larger the house, the higher the price. But what about the others? We can build a simple correlation plot to better understand the relationships between the different variables:
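A sketch of the idea on a few illustrative rows; `DataFrame.corr` computes the pairwise Pearson correlations that the plot visualizes:

```python
import pandas as pd

df = pd.DataFrame({
    'price':       [221900, 538000, 180000, 604000],
    'sqft_living': [1180, 2570, 770, 1960],
    'bedrooms':    [3, 3, 2, 4],
})
corr = df.corr()  # pairwise Pearson correlation matrix
# One common rendering (seaborn assumed, not run here):
# import seaborn as sns; sns.heatmap(corr, annot=True)
```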


So, for example, we can see that `sqft_living` is highly correlated with `price`:


We'll use a simple visualization to get a visual clue about these variables and their correlations:


We see some strange patterns, like the apparent "negative" correlation between `zipcode` and price, something that doesn't make any sense. We'll talk more about this when we explore `zipcode` as a categorical feature later.

Once we identify correlations between different variables, we can explore how they're related. For example, we saw `sqft_living` and `price`:
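A scatter plot is the usual way to look at this pair; a minimal sketch with illustrative values:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'sqft_living': [1180, 2570, 770, 1960],
    'price':       [221900, 538000, 180000, 604000],
})
ax = df.plot.scatter(x='sqft_living', y='price')
plt.savefig('sqft_vs_price.png')  # or plt.show() interactively
```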


What about `grade` and `price`?


They also seem strongly correlated, but are they just linearly correlated?


It doesn't seem so, or at least it's not as clear as with `sqft_living`. There seems to be some sort of polynomial relationship. We can use a logarithmic y-axis to test:
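Switching the y-axis to a log scale is a one-liner on the plot axes (the values here are illustrative):

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; drop this line inside a notebook
import pandas as pd

df = pd.DataFrame({
    'grade': [6, 7, 8, 11],
    'price': [180000, 221900, 538000, 1225000],
})
ax = df.plot.scatter(x='grade', y='price')
ax.set_yscale('log')  # a polynomial-ish relation looks closer to linear now
```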


It now looks a little bit better. We can use these relationships we've identified to improve our model later.

Linear regression (along with many other ML models) is really sensitive to outliers:
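`describe` is a quick way to spot such outliers; a sketch on toy values:

```python
import pandas as pd

df = pd.DataFrame({'bedrooms': [3, 3, 2, 4, 33]})
print(df['bedrooms'].describe())  # the max of 33 stands out immediately
```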


🤔 A house with 33 bedrooms? There's something going on here:
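Boolean indexing pulls out the suspicious record (the 1.75-bathroom figure comes from the text below; the other values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'bedrooms':  [3, 33],
    'bathrooms': [1.0, 1.75],
    'price':     [221900, 640000],  # illustrative values
})
suspect = df[df['bedrooms'] == 33]  # filter down to the odd record
```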


It makes sense for a (really expensive) house to have, let's say, 10 bedrooms, but 33 seems like an error.


33 bedrooms and only 1.75 bathrooms? 😅 Clearly an error.

Now, what about those properties without bathrooms? That's strange; let's take a look:
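The same filtering trick works here, this time on `bathrooms` (the values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'bathrooms':   [0.0, 1.0, 0.0],
    'bedrooms':    [0, 3, 1],
    'sqft_living': [844, 1180, 690],  # hypothetical values
})
no_bath = df[df['bathrooms'] == 0]  # properties reporting zero bathrooms
```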


Now that we look at it, it makes a little bit more sense. Maybe those are just warehouses or other types of storage facilities? Without more information, it's difficult to make a decision. This is an important lesson: **domain expertise is fundamental when analyzing data**.

I won't remove any houses for now.

How are other variables doing?


This probably requires a little bit more analysis, but let's proceed.

The `zipcode` feature poses an issue. Machine learning models don't understand "human" features like `zipcode`. For an ML algorithm, a value of `98178` in zipcode is "greater" than `98125`, even though for us, knowing the area, the zipcode `98125` might have more expensive houses. These are the zipcodes in our dataset:
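`value_counts` and `nunique` answer both questions at once; a sketch on three illustrative zipcodes:

```python
import pandas as pd

df = pd.DataFrame({'zipcode': [98178, 98125, 98178, 98028]})
print(df['zipcode'].value_counts())  # frequency of each zipcode
print(df['zipcode'].nunique())       # 3 distinct here; 70 in the full dataset
```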


Only 70 zipcodes:


Introducing "Dummy Variables":
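`pd.get_dummies` turns each zipcode into its own 0/1 column (the values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'zipcode': [98178, 98125, 98028]})
dummies = pd.get_dummies(df['zipcode'], prefix='zip')  # one 0/1 column per zipcode
```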


Dummy variables are the correct way to feed a categorical feature to an ML model. We'll see how to combine these later.

There's a final **IMPORTANT** point to discuss, and that is "scaling" and "normalizing" features. It has a mathematical explanation, but basically, what we **DON'T** want is to have features that are in completely different units. For example:


The values here are too dissimilar, which will make some algorithms perform poorly and more slowly. We'll then "scale" these features to remove the units. Read more here: Importance of Feature Scaling.
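scikit-learn's `StandardScaler` is one standard way to do this: it subtracts each column's mean and divides by its standard deviation (the values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns in wildly different units: square feet vs dollars
X = np.array([
    [1180.0, 221900.0],
    [2570.0, 538000.0],
    [770.0, 180000.0],
])
scaled = StandardScaler().fit_transform(X)  # each column: zero mean, unit variance
```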


We'll now use a really convenient package called sklearn-pandas that lets us scale our features and also create the dummy zip variables:
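sklearn-pandas' `DataFrameMapper` pairs columns with transformers. The same idea in plain scikit-learn (sketched here because it needs no extra package) is `ColumnTransformer`; the column names and values are illustrative:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    'sqft_living': [1180.0, 2570.0, 770.0],
    'zipcode': [98178, 98125, 98028],
})
ct = ColumnTransformer([
    ('scale', StandardScaler(), ['sqft_living']),  # remove the unit
    ('zip', OneHotEncoder(), ['zipcode']),         # dummy variables
])
X = ct.fit_transform(df)  # 1 scaled column + 3 dummy columns
```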

Let's now see how our Linear Regression performs with these simple modifications:
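A minimal sketch of the fit-and-score step on synthetic data (the real run trains on the prepared King County features; `score` returns R², the same metric as the 0.79 reported below):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in: price roughly linear in living area, plus noise
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=200)
price = 280 * sqft + rng.normal(0, 50_000, size=200)
X = sqft.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(X, price, random_state=0)
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)  # R^2 on held-out data
```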


**0.79**! Much better, right? This has been just an introduction to how important a good data analysis process is for Machine Learning.