Used Car Sales Analysis


Purpose: Clean and analyze the used car listings

Dataset: Sample from the original Kaggle dataset. The data was cralwed from eBay Kleinanzeigen, a classified section of the German eBay website.

Data Dictionary

  • dateCrawled - When this ad was first crawled. All field-values are taken from this date.
  • name - Name of the car.
  • seller - Whether the seller is private or a dealer.
  • offerType - The type of listing
  • price - The price on the ad to sell the car.
  • abtest - Whether the listing is included in an A/B test.
  • vehicleType - The vehicle Type.
  • yearOfRegistration - The year in which the car was first registered.
  • gearbox - The transmission type.
  • powerPS - The power of the car in PS.
  • model - The car model name.
  • kilometer - How many kilometers the car has driven.
  • monthOfRegistration - The month in which the car was first registered.
  • fuelType - What type of fuel the car uses.
  • brand - The brand of the car.
  • notRepairedDamage - If the car has a damage which is not yet repaired.
  • dateCreated - The date on which the eBay listing was -created.
  • nrOfPictures - The number of pictures in the ad.
  • postalCode - The postal code for the location of the vehicle.
  • lastSeenOnline - When the crawler saw this ad last online.

Data Load

Loading output library...
Loading output library...
Loading output library...

All data seems to be loaded.

Cleaning Columns Names


yearOfRegistration, monthOfRegistration, notRepairDamage, and dateCreate are either wordy or unclear what they mean. Let's change it to something more clearler and less wordy

  • yearOfRegistration to registration_year
  • monthOfRegistration to registration_month
  • notRepairedDamage to unrepaired_damage
  • dateCreated to ad_created
  • The rest of the columnn names from camelcase to snakecase.
Loading output library...

Initial Exploration and Cleaning


Let's take a quick glance on the data

Loading output library...

looks like nr_of_pictures have mostly one value. Let's find out if that's the case.

Loading output library...

The columns has all 0s for its value. It's either crawled wrong, or they really didn't have any pictures in the ad. Either case, it doesn't differentiate the data. We could drop it

Registration Year also needs more evaluation because year 9999 hasn't come yet and year 1000 was even before car was invented. There are other suspicious data points. We will have to decide what to do about it later on.

Loading output library...

Also power PS zero is not valid value. Need further investigation. The minimum value of postal code also doesn't look like it conforms with other values. 4 digits instead of 5.

Price and odometer are numeric but stored as text. Let's change it.

To summarize:

  • nr of pictures have only one value. 0
  • registration_year and postal code have some suspicious values
  • price and odometer have text values instead of numbers.
  • odometer is in km. We will change the column name to odometer_km
Loading output library...

Exploring the Odometer and Price Columns

Loading output library...
Loading output library...
Loading output library...

It looks like there are quite few outliers

  • Car price more than 350,000. Realistically, if the price is that high, it would be on somewhere else not ebay.
  • Car price under 1,000. Unlikely any car would be sold under 1,000 unless it's not functional
Loading output library...

We ommitted about 12,000 record and it gives us more reasonable price range.

Loading output library...
Loading output library...
Loading output library...

odometer values looks actually pretty normal and not many extreme values

Exploring the date columns


let's look at date_crawled, ad_created, and last_seen

Loading output library...
Loading output library...

They include time as well. Also they are in object time instead of date or numeric. To look at their distribution by day, let's cut them by 10 to extract year-month-day.

Loading output library...

Distribution of crawling data is from 2016-03-05 to 2016-04-07

Loading output library...

ad_created has more spread value than crawled date which is understandable

Loading output library...

Let's do the same thing for registration_year

Loading output library...

Dealing with Incorrect Registration Year Data


Revisiting where we left off, there is definately incorrect data in registration year. Two obvious ones are

  • The minimum value is 1000. It's before cars were invented
  • The maximum value is 9999. The data was collected in 2016. You can register car for the future
Loading output library...

that removed about 2,000 data points and the distribution looks little more realistic.

Loading output library...

Exploring Price by Brand


from now on, let's use the data that outliers are excluded.

Loading output library...

Top 10 most frequently registered brand. Now based on the list of the brand, I'm going to calculated their average price.

Loading output library...
Loading output library...

From the list above, Audi, Mercedes Benz, and BMW are the 3 most expesive brand in the data set. Renault, Peugeot, and Fiat are the least expensive.

Exploring Mileage by Brand

Loading output library...
Loading output library...

Storing Aggregate Data in a DataFrame

Loading output library...
Loading output library...
Loading output library...
Loading output library...

This is surprising however, it looks like there are positive relationship between average odometer and average price.