In this notebook I present the process a data scientist or analyst can follow to extract useful information from a dataset. As an example I use the given Acc.csv file, which records accidents in the United Kingdom in 2017. The analysis is split into the steps needed to produce meaningful insights.
The following steps do not use these columns, so there is no need to handle their null values.
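As a minimal sketch of how the null values per column can be counted, the snippet below uses a tiny invented frame in place of Acc.csv. The column name `UnusedColumn` is hypothetical and only stands in for the columns that are dropped later.

```python
import pandas as pd

# Toy stand-in for Acc.csv; the rows and the "UnusedColumn" name are invented.
df = pd.DataFrame({
    "NumberOfVehicles": [2, 1, 3],
    "SpeedLimit": [30, 60, 30],
    "UnusedColumn": ["A1", None, None],
})

# Count the missing values in every column.
null_counts = df.isnull().sum()
print(null_counts)
```

Columns whose count is non-zero but which are not needed later can simply be left out of the selection.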
The basic statistics of the DataFrame are presented above, but most of them are meaningless because the attributes were read into Python with the wrong data type. For example, the describe() output shows that Road_Type is treated as an integer, although it should be a category.
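To illustrate why such statistics are meaningless, here is a small sketch on invented data. The integer codes are made up for the example and are not taken from the metadata file.

```python
import pandas as pd

# Toy frame; the integer codes are invented stand-ins for real Road_Type codes.
df = pd.DataFrame({"Road_Type": [6, 3, 6, 1], "SpeedLimit": [30, 70, 30, 40]})

# Road_Type is inferred as an integer column because it only holds codes.
print(df.dtypes)

# An average of category codes can be computed, but it carries no meaning.
road_type_mean = df["Road_Type"].mean()
print(road_type_mean)
```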
I will select only the columns that I will use for the analysis.
To understand what the coded values in each column represent, I referred to the given metadata Excel file and replaced the codes with their labels.
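One way to do this replacement is with a dictionary per column and `Series.map`. The sketch below is illustrative only: the column names `Severity` and `Day` and the code-to-label mappings are assumptions, and the metadata file remains the authoritative source for the real codes.

```python
import pandas as pd

# Assumed code-to-label lookups; check them against the metadata file.
severity_labels = {1: "Fatal", 2: "Serious", 3: "Slight"}
day_labels = {
    1: "Sunday", 2: "Monday", 3: "Tuesday", 4: "Wednesday",
    5: "Thursday", 6: "Friday", 7: "Saturday",
}

# Toy rows with hypothetical column names.
df = pd.DataFrame({"Severity": [3, 1, 2, 3], "Day": [7, 6, 1, 7]})

# Replace each integer code with its human-readable label.
df["Severity"] = df["Severity"].map(severity_labels)
df["Day"] = df["Day"].map(day_labels)
print(df)
```

Any code that is missing from a dictionary becomes NaN after `map`, which makes incomplete lookups easy to spot.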
I will convert some attributes from object to category.
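A minimal sketch of the conversion, assuming the label columns are named `Severity` and `Road_Type` (only `Road_Type` appears in the text above; the rows are invented):

```python
import pandas as pd

# Toy frame with label columns stored as plain strings.
df = pd.DataFrame({
    "Severity": ["Slight", "Fatal", "Slight"],
    "Road_Type": ["Single carriageway", "Roundabout", "Single carriageway"],
    "SpeedLimit": [30, 60, 30],
})

# Convert the chosen columns to the pandas category dtype in one call.
categorical_cols = ["Severity", "Road_Type"]
df = df.astype({col: "category" for col in categorical_cols})
print(df.dtypes)
```

Listing the columns explicitly keeps numeric attributes such as SpeedLimit untouched.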
One of the important steps in getting a general picture of the dataset is to extract basic statistical information from the numeric attributes. The mean, standard deviation, min, max and quartiles are shown in the above table for the two numeric attributes NumberOfVehicles and SpeedLimit.
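A table like this can be produced with `describe()` restricted to the two numeric attributes. The rows below are invented for illustration:

```python
import pandas as pd

# Toy rows; only the two numeric attributes named in the text are used.
df = pd.DataFrame({
    "NumberOfVehicles": [1, 2, 2, 3],
    "SpeedLimit": [30, 30, 60, 40],
})

# count, mean, std, min, quartiles and max for each numeric column.
summary = df[["NumberOfVehicles", "SpeedLimit"]].describe()
print(summary)
```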
The dataset is now cleaned and ready for the analysis. I will carry out the analysis by answering some queries on the dataset in order to gain insight from the results.
**Find the percentage of all the accidents that are Fatal and occur on Saturday**
So, 0.216% of all the accidents are Fatal and occur on Saturday.
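The percentage depends on the denominator. The sketch below, on invented rows with the hypothetical column names `Severity` and `Day`, computes both the share of all accidents that are Fatal and on a Saturday, and the share of Saturday accidents that are Fatal:

```python
import pandas as pd

# Toy rows; column names and labels are assumptions for illustration.
df = pd.DataFrame({
    "Severity": ["Fatal", "Slight", "Slight", "Serious", "Fatal"],
    "Day": ["Saturday", "Saturday", "Monday", "Friday", "Sunday"],
})

# Share of ALL accidents that are both Fatal and on a Saturday.
fatal_saturday = (df["Severity"] == "Fatal") & (df["Day"] == "Saturday")
pct_of_all = 100 * fatal_saturday.mean()

# Share of SATURDAY accidents that are Fatal (a different denominator).
saturday = df[df["Day"] == "Saturday"]
pct_of_saturday = 100 * (saturday["Severity"] == "Fatal").mean()

print(pct_of_all, pct_of_saturday)
```

The mean of a boolean Series is the fraction of True values, so no explicit division by the row count is needed.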
**Find the number of accidents that happened in Greater Manchester and occurred when it was snowing**
So, 25 accidents happened in Greater Manchester when it was snowing.
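A count like this can be obtained by combining two boolean conditions. In the sketch below the column names `PoliceForce` and `Weather`, the label `"Snowing"` and the rows are all assumptions made for illustration:

```python
import pandas as pd

# Toy rows; column names and labels are hypothetical.
df = pd.DataFrame({
    "PoliceForce": ["Greater Manchester", "Greater Manchester", "Kent", "Greater Manchester"],
    "Weather": ["Snowing", "Fine", "Snowing", "Snowing"],
})

# Both conditions must hold; parentheses are required around each comparison.
mask = (df["PoliceForce"] == "Greater Manchester") & (df["Weather"] == "Snowing")

# Summing a boolean mask counts the rows where it is True.
n_accidents = int(mask.sum())
print(n_accidents)
```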
So, 10% of the accidents that happened in urban areas were due to the driver exceeding the 30 miles per hour speed limit that applies in these areas.
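The original computation is not shown, so the following is only one plausible sketch: it assumes the figure comes from comparing SpeedLimit against 30 within the urban rows. The column name `Area` and the rows are invented.

```python
import pandas as pd

# Toy rows; "Area" is a hypothetical column name.
df = pd.DataFrame({
    "Area": ["Urban", "Urban", "Rural", "Urban", "Urban", "Rural"],
    "SpeedLimit": [30, 40, 60, 30, 20, 70],
})

# Restrict to urban accidents first, so they form the denominator.
urban = df[df["Area"] == "Urban"]

# Share of urban accidents whose recorded value is above 30 mph.
pct_above_30 = 100 * (urban["SpeedLimit"] > 30).mean()
print(pct_above_30)
```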
This graph shows that most of the accidents happened where the speed limit was 30 miles per hour. There is also a significant number of accidents where the speed limit was greater than 60 miles per hour.
Note: The distribution has this shape because, as the plot shows, the SpeedLimit attribute is really categorical rather than numeric. I nevertheless treated it as numeric in order to extract other statistical information from the dataset.
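Because SpeedLimit takes only a handful of distinct values, the counts behind such a plot can also be read directly from `value_counts()`. A small sketch on invented values:

```python
import pandas as pd

# Toy values; real data would come from the SpeedLimit column of the dataset.
df = pd.DataFrame({"SpeedLimit": [30, 30, 30, 60, 70, 40, 30, 60]})

# Number of accidents per distinct speed limit, ordered by the limit itself.
counts = df["SpeedLimit"].value_counts().sort_index()
print(counts)

# The most frequent speed limit in this toy sample.
most_common_limit = counts.idxmax()
print(most_common_limit)
```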
This plot shows that Fatal accidents have a wide interquartile range, so an accident can be fatal at any speed limit. Slight accidents, by contrast, occur mostly at low speed limits, with some outliers.
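The interquartile range that a box plot draws can be checked numerically. The sketch below assumes a hypothetical `Severity` column and uses invented rows:

```python
import pandas as pd

# Toy rows: four Fatal and four Slight accidents with invented speed limits.
df = pd.DataFrame({
    "Severity": ["Fatal"] * 4 + ["Slight"] * 4,
    "SpeedLimit": [30, 40, 60, 70, 30, 30, 30, 40],
})

# First and third quartile of SpeedLimit for each severity level.
quartiles = df.groupby("Severity")["SpeedLimit"].quantile([0.25, 0.75]).unstack()

# Interquartile range = Q3 - Q1, i.e. the height of each box.
iqr = quartiles[0.75] - quartiles[0.25]
print(iqr)
```

A larger value means the middle half of that group's accidents is spread over a wider range of speed limits.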
This plot gives the same result as the previous one.
The accidents involving the most vehicles happened where the speed limit was 40 or 50 miles per hour on Sundays, which is plausible because many people return from weekend trips on that day.
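A reading like this can be cross-checked with a grouped maximum. The sketch assumes a hypothetical `Day` column and that "most vehicles" refers to the largest NumberOfVehicles per group; the rows are invented.

```python
import pandas as pd

# Toy rows; "Day" is a hypothetical column name.
df = pd.DataFrame({
    "Day": ["Sunday", "Sunday", "Monday", "Monday", "Sunday"],
    "SpeedLimit": [40, 50, 30, 40, 40],
    "NumberOfVehicles": [4, 5, 1, 2, 6],
})

# Largest number of vehicles in a single accident per (day, speed limit) pair.
max_vehicles = df.groupby(["Day", "SpeedLimit"])["NumberOfVehicles"].max()
print(max_vehicles)

# The (day, speed limit) combination with the overall largest accident.
top_group = max_vehicles.idxmax()
print(top_group)
```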
Most accidents occurred on Fridays and were labeled as Slight.
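The most frequent combination of day and severity can be found by counting the rows per pair. A sketch on invented rows, with `Day` and `Severity` as assumed column names:

```python
import pandas as pd

# Toy rows; column names and labels are assumptions for illustration.
df = pd.DataFrame({
    "Day": ["Friday", "Friday", "Friday", "Monday", "Saturday", "Friday"],
    "Severity": ["Slight", "Slight", "Serious", "Slight", "Fatal", "Slight"],
})

# Number of accidents for every (day, severity) pair that occurs.
pair_counts = df.groupby(["Day", "Severity"]).size()
print(pair_counts)

# The pair with the highest count.
top_pair = pair_counts.idxmax()
print(top_pair)
```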
So, most of the accidents are Slight and happened on Dry surface.
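A contingency table gives the counts behind this statement. The sketch uses `pd.crosstab` on invented rows; the column name `Surface` and its labels are assumptions.

```python
import pandas as pd

# Toy rows; "Surface" and its labels are hypothetical.
df = pd.DataFrame({
    "Surface": ["Dry", "Dry", "Wet", "Dry", "Wet", "Snow"],
    "Severity": ["Slight", "Slight", "Slight", "Serious", "Fatal", "Slight"],
})

# Rows are surface conditions, columns are severity levels, cells are counts.
table = pd.crosstab(df["Surface"], df["Severity"])
print(table)
```

Combinations that never occur appear as 0 in the table, so every surface and severity pair can be compared directly.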