Purpose: Clean and analyze the used car listings
Dataset: A sample from the original Kaggle dataset. The data was crawled from eBay Kleinanzeigen, the classifieds section of the German eBay website.
All data seems to be loaded.
The column names yearOfRegistration, monthOfRegistration, notRepairDamage, and dateCreate are either wordy or unclear. Let's rename them to something shorter and clearer.
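As a rough sketch of the renaming step, here is how it could look with pandas on a small made-up frame. The old names are spelled as in the note above (check them against `autos.columns` on the real data), and the new name `unrepaired_damage` is my own assumption; `registration_year` and `ad_created` are the names used later in these notes.

```python
import pandas as pd

# Toy frame standing in for the real listings (values are made up).
autos = pd.DataFrame({
    "yearOfRegistration": [2004, 1997],
    "monthOfRegistration": [3, 6],
    "notRepairDamage": ["nein", "ja"],
    "dateCreate": ["2016-03-26 00:00:00", "2016-04-04 00:00:00"],
})

# Old name -> new name. "unrepaired_damage" is a hypothetical choice.
rename_map = {
    "yearOfRegistration": "registration_year",
    "monthOfRegistration": "registration_month",
    "notRepairDamage": "unrepaired_damage",
    "dateCreate": "ad_created",
}

# rename() leaves any column that is not in the map untouched.
autos = autos.rename(columns=rename_map)
print(autos.columns.tolist())
```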
Let's take a quick look at the data.
It looks like nr_of_pictures has mostly one value. Let's find out if that's the case.
The column has 0 for every row. Either it was crawled incorrectly, or the ads really had no pictures. In either case, it doesn't differentiate the listings, so we can drop it.
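One possible way to run this check and drop the column, sketched on a tiny made-up frame (the `price` column is only there so something is left after the drop):

```python
import pandas as pd

# Toy frame: every listing has 0 pictures, as in the real column.
autos = pd.DataFrame({
    "price": [5000, 8500, 1200],
    "nr_of_pictures": [0, 0, 0],
})

# How often does each distinct value occur?
picture_counts = autos["nr_of_pictures"].value_counts()
print(picture_counts)

# A column with a single distinct value carries no information, so drop it.
if autos["nr_of_pictures"].nunique() == 1:
    autos = autos.drop(columns="nr_of_pictures")
```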
Registration year also needs a closer look: year 9999 hasn't happened yet, and year 1000 was long before the car was invented. There are other suspicious data points as well. We will have to decide what to do about them later on.
Also, a power PS of zero is not a valid value and needs further investigation. The minimum postal code doesn't seem to conform with the other values either: it has 4 digits instead of 5, possibly because a leading zero was lost when the column was read as a number.
Price and odometer are numeric but stored as text. Let's convert them to numeric types.
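A minimal sketch of the conversion, assuming the text values look like "$5,000" and "150,000km". That format is an assumption on my part, so adjust the characters being stripped to whatever the real columns contain.

```python
import pandas as pd

# Toy frame with numbers stored as text (the formats are assumed).
autos = pd.DataFrame({
    "price": ["$5,000", "$8,500", "$0"],
    "odometer": ["150,000km", "70,000km", "5,000km"],
})

# Strip the non-numeric characters, then cast to integers.
autos["price"] = (
    autos["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)
autos["odometer"] = (
    autos["odometer"]
    .str.replace("km", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)
print(autos.dtypes)
```

Passing `regex=False` matters for the `$` sign, which would otherwise be read as a regular-expression anchor.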
It looks like there are quite a few outliers.
We omitted about 12,000 records, which gives us a more reasonable price range.
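The filtering could be sketched like this. The bounds `low` and `high` below are hypothetical placeholders, not the cut-offs actually used for the 12,000 records above.

```python
import pandas as pd

# Toy prices, including a few implausible ones.
autos = pd.DataFrame({"price": [0, 1, 500, 4500, 12000, 99999999]})

# Hypothetical bounds; pick real ones after inspecting the distribution.
low, high = 100, 100_000

# between() is inclusive on both ends by default.
in_range = autos["price"].between(low, high)
removed = int((~in_range).sum())
autos_clean = autos[in_range]

print(f"removed {removed} rows")
print(autos_clean["price"].describe())
```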
The odometer values actually look pretty normal, without many extreme values.
Let's look at date_crawled, ad_created, and last_seen.
They include the time as well. Also, they are stored as object type instead of a date or numeric type. To look at their distribution by day, let's take the first 10 characters of each value to extract the year-month-day.
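Here is a small sketch of that slicing on made-up timestamps, assuming the strings follow the "YYYY-MM-DD HH:MM:SS" layout:

```python
import pandas as pd

# Toy timestamps stored as text (object dtype), as in the real columns.
autos = pd.DataFrame({
    "date_crawled": [
        "2016-03-05 14:06:30",
        "2016-03-05 20:11:02",
        "2016-03-21 10:58:10",
        "2016-04-07 09:15:00",
    ]
})

# The first 10 characters of each string are the year-month-day part.
crawl_day = autos["date_crawled"].str[:10]

# Share of listings per day, ordered chronologically.
share_by_day = crawl_day.value_counts(normalize=True).sort_index()
print(share_by_day)
```

The same two lines should work for ad_created and last_seen by swapping the column name.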
The crawl dates range from 2016-03-05 to 2016-04-07.
ad_created has a wider spread of values than the crawl date, which is understandable, since an ad can be created well before it is crawled.
Let's do the same thing for registration_year
Revisiting where we left off, there is definitely incorrect data in registration year. Two obvious ones are
That removed about 2,000 data points, and the distribution looks a little more realistic.
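A sketch of one way to filter the years. The upper bound of 2016 follows from the crawl dates (a car can't be registered after its ad was crawled); the lower bound of 1900 is my own assumption and not necessarily the cut-off used here.

```python
import pandas as pd

# Toy registration years, including the impossible 1000 and 9999.
autos = pd.DataFrame({"registration_year": [1000, 1985, 2004, 2011, 2016, 9999]})

# Assumed plausible window: 1900 (hypothetical) through 2016 (crawl year).
earliest, latest = 1900, 2016

valid = autos["registration_year"].between(earliest, latest)
autos = autos[valid]

# Share of listings per registration year, in chronological order.
year_share = autos["registration_year"].value_counts(normalize=True).sort_index()
print(year_share)
```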
From now on, let's use the data with outliers excluded.
These are the top 10 most frequent brands. Based on this list, I'm going to calculate their average prices.
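Roughly, the brand ranking and the average price per brand could be computed as below. The data is a made-up toy set, so it uses `head(2)` where the real analysis would use `head(10)`.

```python
import pandas as pd

# Toy listings (made-up values).
autos = pd.DataFrame({
    "brand": ["audi", "audi", "bmw", "fiat", "fiat", "fiat"],
    "price": [9000, 11000, 8000, 2000, 3000, 4000],
})

# Most frequent brands; use head(10) on the real data.
top_brands = autos["brand"].value_counts().head(2).index

# Mean price for just those brands, most expensive first.
mean_price = (
    autos[autos["brand"].isin(top_brands)]
    .groupby("brand")["price"]
    .mean()
    .sort_values(ascending=False)
)
print(mean_price)
```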
From the list above, Audi, Mercedes-Benz, and BMW are the 3 most expensive brands in the data set. Renault, Peugeot, and Fiat are the least expensive.
Surprisingly, there appears to be a positive relationship between average odometer reading and average price.
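To put the two averages side by side, something like the following sketch could be used. The numbers are invented for illustration and say nothing about the real data; the column names `mean_price` and `mean_odometer` are my own.

```python
import pandas as pd

# Toy listings (made-up values, not results from the real data set).
autos = pd.DataFrame({
    "brand": ["audi", "audi", "bmw", "bmw", "fiat", "fiat"],
    "price": [9000, 11000, 8000, 6000, 2000, 4000],
    "odometer": [150000, 130000, 150000, 100000, 90000, 70000],
})

# One row per brand with both averages, most expensive brand first.
brand_summary = (
    autos.groupby("brand")[["price", "odometer"]]
    .mean()
    .rename(columns={"price": "mean_price", "odometer": "mean_odometer"})
    .sort_values("mean_price", ascending=False)
)
print(brand_summary)

# Pearson correlation between the two per-brand averages.
corr = brand_summary["mean_price"].corr(brand_summary["mean_odometer"])
print(f"correlation: {corr:.2f}")
```

With only ten brands the correlation is a rough indicator at best, so it is worth reading alongside the table rather than on its own.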