This notebook presents the segmentation and clustering of the postal code divisions of the city of Toronto, Ontario, Canada, extracted from the Wikipedia article "List of postal codes of Canada: M" (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). The Foursquare API was used to find the venues in each postal code zone, using a radius based on the area covered by each postcode so that zones do not overlap, with a maximum of 100 venues per postal code. Using the K-Means clustering algorithm, the postal codes were grouped by venue density (venues/area), and the result is shown on a map of Toronto.
The BeautifulSoup library is used to scrape the Wikipedia article that lists the Toronto neighborhoods. The neighborhood data, presented as a table in the article, is parsed and stored in a list with one entry per table row, that is, the Postcode, Borough and Neighborhood name.
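The parsing step can be sketched as follows. This is a minimal illustration, not the notebook's actual cell: the inline HTML is a hypothetical stand-in for the downloaded article, the sample rows are illustrative, and the `wikitable` class is assumed to identify the table.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded Wikipedia page (illustrative rows).
html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")

neighborhood_info = []
for row in table.find_all("tr"):
    # Header rows contain only <th> cells, so they yield an empty list here.
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) == 3:
        neighborhood_info.append(cells)

print(neighborhood_info)
```

In a live run, the `html` string would be replaced by the text of the HTTP response for the article URL.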
The neighborhood_info list is then passed to pandas to create a DataFrame.
The returned data has missing values, marked as "Not assigned", in the Borough and Neighborhood columns.
Rows with a "Not assigned" Borough are removed.
"Not assigned" values in the Neighborhood column are replaced with the Borough name of the same row.
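The DataFrame creation and the two cleaning rules can be sketched like this. The column names and the sample rows are assumptions chosen for illustration; the notebook's own names may differ.

```python
import pandas as pd

# Illustrative rows in the same shape as the scraped list.
neighborhood_info = [
    ["M1A", "Not assigned", "Not assigned"],
    ["M3A", "North York", "Parkwoods"],
    ["M7A", "Queen's Park", "Not assigned"],
]
df = pd.DataFrame(neighborhood_info, columns=["Postcode", "Borough", "Neighborhood"])

# Rule 1: drop rows whose borough is not assigned.
df = df[df["Borough"] != "Not assigned"].reset_index(drop=True)

# Rule 2: where the neighborhood is not assigned, fall back to the borough name.
mask = df["Neighborhood"] == "Not assigned"
df.loc[mask, "Neighborhood"] = df.loc[mask, "Borough"]

print(df)
```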
The dataframe has 103 postal codes but 212 rows, because each postal code can contain more than one neighborhood (210 in total). Therefore, the dataframe is grouped by postal code, which yields a dataframe with 103 rows.
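One way to collapse the rows is to group by postcode and borough and join the neighborhood names into a single string. The sketch below assumes a comma-separated join and the column names shown; both are illustrative choices.

```python
import pandas as pd

# Illustrative data: one postcode listed with two neighborhoods.
df = pd.DataFrame({
    "Postcode": ["M5A", "M5A", "M6A"],
    "Borough": ["Downtown Toronto", "Downtown Toronto", "North York"],
    "Neighborhood": ["Harbourfront", "Regent Park", "Lawrence Heights"],
})

# One row per postcode, with its neighborhoods joined by commas.
grouped = (
    df.groupby(["Postcode", "Borough"])["Neighborhood"]
      .agg(", ".join)
      .reset_index()
)

print(grouped)
```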
To add the coordinates to the neighborhood dataframe, a join is performed using the postcodes as keys.
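A minimal sketch of the join with `DataFrame.merge` follows. The coordinate frame is built inline instead of being read from the CSV file, and its column names (`Postal Code`, `Latitude`, `Longitude`) and values are assumptions for illustration.

```python
import pandas as pd

neighborhoods = pd.DataFrame({
    "Postcode": ["M5A", "M6A"],
    "Borough": ["Downtown Toronto", "North York"],
})

# Stand-in for the coordinates file (illustrative values, assumed column names).
coords = pd.DataFrame({
    "Postal Code": ["M6A", "M5A"],
    "Latitude": [43.718518, 43.654260],
    "Longitude": [-79.464763, -79.360636],
})

# Left join keeps every postcode of the neighborhood frame, in its original order.
merged = (
    neighborhoods
    .merge(coords, left_on="Postcode", right_on="Postal Code", how="left")
    .drop(columns="Postal Code")
)

print(merged)
```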
With the coordinates of each postal code, a map of Toronto with markers indicating each postcode position is generated.
The map shows that the postal codes are not evenly spaced and that, with a fixed radius of 500 meters, the areas covered by some of them overlap. Using a different radius for each postcode gives a better venue search, because it avoids misrepresenting the number of venues per postcode through radius values that are too large or too small.
To define the radius used with Foursquare, it is necessary to find the closest neighboring postcode for each postcode.
To explore the distance function, the closest postcode to the first example in the dataframe is found.
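The notebook's distance function is not shown in the text, so the sketch below assumes a great-circle (haversine) distance in meters. The function names and the three sample points are hypothetical.

```python
import math

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points in degrees."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def closest(code, points):
    """Return (other_code, distance_m) for the point nearest to `code`."""
    lat, lng = points[code]
    candidates = (
        (other, haversine_m(lat, lng, *points[other]))
        for other in points if other != code
    )
    return min(candidates, key=lambda pair: pair[1])

# Hypothetical postcode centers (latitude, longitude).
points = {"A": (43.65, -79.38), "B": (43.66, -79.38), "C": (43.70, -79.40)}

nearest_code, nearest_dist = closest("A", points)
print(nearest_code, round(nearest_dist))
```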
A distance column is added to the DataFrame and is used as the radius covered by each postcode.
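The nearest-neighbor distance for every row can be computed in one pass with NumPy broadcasting. This is a sketch under assumptions: the haversine formula, the `Distance` column name, and the sample coordinates are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical postcode centers.
df = pd.DataFrame({
    "Postcode": ["A", "B", "C"],
    "Latitude": [43.65, 43.66, 43.70],
    "Longitude": [-79.38, -79.38, -79.40],
})

lat = np.radians(df["Latitude"].to_numpy())
lng = np.radians(df["Longitude"].to_numpy())

# Pairwise haversine distances (meters) via broadcasting.
dlat = lat[:, None] - lat[None, :]
dlng = lng[:, None] - lng[None, :]
a = np.sin(dlat / 2) ** 2 + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlng / 2) ** 2
dist = 2 * 6371000.0 * np.arcsin(np.sqrt(a))

# Ignore the zero distance of each point to itself.
np.fill_diagonal(dist, np.inf)
df["Distance"] = dist.min(axis=1)

print(df)
```

If strictly non-overlapping circles are required, half of this distance is the largest radius that guarantees it for every pair; the text does not state which convention the notebook uses.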
The map is plotted again using a different radius for each postal code. Overlap is now avoided and more of the city is covered; consequently, more venues are retrieved.
The next step is to explore each postcode to get its venues using the Foursquare API. For that, the credentials must be declared.
In order to get the venues in the perimeter of each postal code, it is necessary to get the geographical coordinates (lat and lng) of each one and add them to the dataframe. The geopy library does not resolve Canadian postcodes, and the geocoder library proved unreliable. For that reason, the coordinates are read from the CSV file 'Geospatial_Coordinates.csv'.
To explore the data returned by the Foursquare API, a maximum of 100 venues from the first postcode are requested in a radius of 500 meters.
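The request can be sketched as a URL builder. This assumes the notebook targets the legacy Foursquare v2 `venues/explore` endpoint; the credentials are placeholders, the version date is a hypothetical value, and `explore_url` is an illustrative helper name. No request is sent here.

```python
from urllib.parse import urlencode

CLIENT_ID = "YOUR_CLIENT_ID"          # placeholder, not a real credential
CLIENT_SECRET = "YOUR_CLIENT_SECRET"  # placeholder, not a real credential
VERSION = "20180605"                  # hypothetical API version date

def explore_url(lat, lng, radius=500, limit=100):
    """Build a venues/explore URL for one point (assumed legacy v2 endpoint)."""
    params = {
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "v": VERSION,
        "ll": f"{lat},{lng}",
        "radius": radius,  # meters
        "limit": limit,    # maximum number of venues returned
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

url = explore_url(43.65, -79.38)
print(url)
```

The resulting URL would then be fetched with an HTTP client and the body decoded as JSON.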
In this case, the relevant fields are venue.categories, venue.location.lat, venue.location.lng and venue.name.
The category (shortName) must then be extracted from the JSON data.
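Flattening the response and extracting the category can be sketched as below. The `sample` dictionary is a hand-written stand-in that mimics the assumed shape of the response (`response.groups[0].items`), and `get_category` is a hypothetical helper name.

```python
import pandas as pd

# Hand-written stand-in for a decoded API response (assumed structure).
sample = {"response": {"groups": [{"items": [
    {"venue": {"name": "Example Cafe",
               "location": {"lat": 43.65, "lng": -79.38},
               "categories": [{"name": "Coffee Shop", "shortName": "Coffee Shop"}]}},
    {"venue": {"name": "Example Park",
               "location": {"lat": 43.66, "lng": -79.39},
               "categories": []}},
]}]}}

def get_category(categories):
    """Return the shortName of the first category, or None if there is none."""
    return categories[0]["shortName"] if categories else None

items = sample["response"]["groups"][0]["items"]
venues = pd.json_normalize(items)  # nested keys become dotted column names

cols = ["venue.name", "venue.categories", "venue.location.lat", "venue.location.lng"]
venues = venues[cols].copy()
venues["venue.categories"] = venues["venue.categories"].apply(get_category)

print(venues)
```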
The next step is to get the venues for each postal code.
There is one postal code for which the Foursquare API returned no venues.
To get a better sense of the best way to cluster the postal codes, it is necessary to analyze the venue data returned by Foursquare.
The minimum number of venues in a postcode is 0 (M5E, which returned no venues), and the maximum is 100, as expected given the limit set in the request sent to the Foursquare API. Half of the postcodes have 26 or fewer venues.
The frequency distribution of the number of venues per postcode is presented next.
Given that each postcode has a different radius passed to the venues request, it is better to represent the venues per postcode in terms of density, that is, venues per area covered for each postcode. In this case, the area is the one covered in the venues search, defined by the distance to the closest postcode.
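A density column can be sketched as follows. The text does not give the exact formula or units, so this assumes a circular search area of radius `Distance` (in meters) and expresses density in venues per square kilometer; the notebook may instead normalize by the radius itself. Column names and values are illustrative.

```python
import math
import pandas as pd

# Illustrative venue counts and search radii (meters).
df = pd.DataFrame({
    "Postcode": ["A", "B"],
    "Venues": [100, 10],
    "Distance": [500.0, 1000.0],
})

# Assumed definition: circular search area, converted to square kilometers.
area_km2 = math.pi * (df["Distance"] / 1000.0) ** 2
df["Density"] = df["Venues"] / area_km2

print(df)
```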
The histogram shows that 60% of the postcodes have a density between 0 and 30 venues per area (expressed as radius). That is expected given that Toronto has a low population density. The last three bars of the plot have very low counts, so they could be merged, leaving five venue density ranges for the clustering.
Next, the postcodes are clustered based on venue density. One important hyperparameter is the number of clusters; based on the previous analysis, a tentative value is five clusters. The elbow method is then used to get a better sense of the optimal number.
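The elbow method can be sketched with scikit-learn by fitting K-Means for a range of k and recording the inertia (within-cluster sum of squares). The data below is synthetic, generated around five arbitrary centers purely for illustration; it is not the notebook's data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic one-dimensional densities around five illustrative centers.
rng = np.random.default_rng(0)
density = np.concatenate(
    [rng.normal(center, 3, 20) for center in (11, 33, 72, 114, 211)]
).reshape(-1, 1)

ks = range(1, 9)
inertias = []
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(density)
    inertias.append(model.inertia_)

# The "elbow" is the k after which the inertia stops dropping sharply.
for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `ks` gives the usual elbow curve; the choice of the elbow remains a visual judgment.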
Using the elbow method, the optimal number of clusters was set to 5, which matches the value suggested by the histogram analysis.
The centroid values of venue density and the number of postcodes per cluster are checked next.
Based on the centroid of each cluster, the clusters can be named as follows:
1. 'Low Venues Density': centroid equal to 11
2. 'Medium-Low Venues Density': centroid equal to 33
3. 'Medium-High Venues Density': centroid equal to 72
4. 'High Venues Density': centroid equal to 114
5. 'Very High Venues Density': centroid equal to 211
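Because K-Means assigns cluster ids in arbitrary order, the names have to be attached after sorting the clusters by centroid. The sketch below shows one way to do that on synthetic data; the data, variable names, and random seed are illustrative assumptions.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

names = ["Low", "Medium-Low", "Medium-High", "High", "Very High"]

# Synthetic densities around five illustrative centers (not the notebook's data).
rng = np.random.default_rng(0)
density = np.concatenate(
    [rng.normal(center, 3, 20) for center in (11, 33, 72, 114, 211)]
).reshape(-1, 1)

model = KMeans(n_clusters=5, n_init=10, random_state=0).fit(density)

# order[i] is the cluster id with the i-th smallest centroid;
# rank[cluster_id] is that cluster's position in the sorted order.
order = np.argsort(model.cluster_centers_.ravel())
rank = np.empty_like(order)
rank[order] = np.arange(len(order))

named = [f"{names[rank[label]]} Venues Density" for label in model.labels_]

print(np.sort(model.cluster_centers_.ravel()).round(1))
print(Counter(named))
```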
The results shown on the map could be useful, among other areas, in:
1. Real estate: as part of a property cost model (venue density could be related to the cost of a property) or as a tool for property search.
2. Epidemiology research: venue density could be related to noise, pollution or crime.