This notebook presents the segmentation and clustering of the postal code divisions of the city of Toronto, Ontario, Canada, extracted from "List of postal codes of Canada: M" in Wikipedia (https://en.wikipedia.org/wiki/List_of_postal_codes_of_Canada:_M). The Foursquare API was used to find the venues in each postal code zone, using a radius based on the area covered by each postcode so that the search areas do not overlap, with a maximum of 100 venues per postal code. Using the K-Means clustering algorithm, the postal codes were grouped by venue density (venues/area), and the result was shown on a map of Toronto.

- Extract data of Toronto neighborhoods from Wikipedia
- Explore and clean neighborhoods dataset
- Get venues
- Analyze venues dataset
- Cluster Postcodes
- Examine Clusters

The BeautifulSoup library is used to scrape the Wikipedia article that lists the Toronto neighborhoods. The neighborhood data, presented as a table in the article, is parsed and stored in a list that contains each row of the table: the Postcode, Borough and Neighborhood name.
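
The parsing step can be sketched as follows. This is a minimal illustration, assuming the table carries the `wikitable` class and has exactly three cells per data row; the inline HTML is a hypothetical miniature of the page, not the real article, and the loop is not necessarily the one used in the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the Wikipedia table; the real page is fetched over HTTP.
html = """
<table class="wikitable">
  <tr><th>Postcode</th><th>Borough</th><th>Neighbourhood</th></tr>
  <tr><td>M1A</td><td>Not assigned</td><td>Not assigned</td></tr>
  <tr><td>M3A</td><td>North York</td><td>Parkwoods</td></tr>
  <tr><td>M5A</td><td>Downtown Toronto</td><td>Harbourfront</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="wikitable")  # class name is an assumption

neighborhood_info = []
for row in table.find_all("tr"):
    # Collect the text of every data cell in this row.
    cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
    if len(cells) == 3:  # the header row has only <th> cells, so it is skipped
        neighborhood_info.append(cells)
```

Each entry of `neighborhood_info` is then a `[Postcode, Borough, Neighborhood]` list, ready to be passed to pandas.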

Then the neighborhood_info list is passed to pandas to create a DataFrame

The data returned has missing info like "Not assigned" boroughs and neighborhoods.

The rows with "Not assigned" Boroughs will be eliminated

The "Not assigned" values in the Neighborhood column will be replaced with the Borough name of the same row
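
Both cleaning steps can be sketched with pandas as below. The column names `Postcode`, `Borough` and `Neighborhood` and the sample rows are assumptions made for illustration; the notebook's own DataFrame may be named and shaped differently.

```python
import pandas as pd

# Hypothetical sample rows; column names are assumed, not taken from the notebook.
df = pd.DataFrame(
    [
        ["M1A", "Not assigned", "Not assigned"],
        ["M3A", "North York", "Parkwoods"],
        ["M7A", "Queen's Park", "Not assigned"],
    ],
    columns=["Postcode", "Borough", "Neighborhood"],
)

# Drop the rows whose borough is "Not assigned".
df = df[df["Borough"] != "Not assigned"].reset_index(drop=True)

# Where the neighborhood is "Not assigned", fall back to the borough name.
missing = df["Neighborhood"] == "Not assigned"
df.loc[missing, "Neighborhood"] = df.loc[missing, "Borough"]
```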

The dataframe has 103 postal codes but 212 rows, because a postal code can contain more than one neighborhood (210 in total). Therefore, the dataframe should be grouped by postal code, resulting in a dataframe with 103 rows.
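
One way to do the grouping is sketched below. Joining the neighborhood names with a comma is an assumption (the notebook does not state how they are combined), and the column names and rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample: one postcode with two neighborhoods, one with a single one.
df = pd.DataFrame(
    [
        ["M5A", "Downtown Toronto", "Harbourfront"],
        ["M5A", "Downtown Toronto", "Regent Park"],
        ["M3A", "North York", "Parkwoods"],
    ],
    columns=["Postcode", "Borough", "Neighborhood"],
)

# One row per postcode; neighborhood names are concatenated into one string.
grouped = (
    df.groupby(["Postcode", "Borough"], as_index=False)["Neighborhood"]
    .agg(", ".join)
)
```

After this step the number of rows equals the number of distinct postcodes.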

To add the coordinates to the neighborhood dataframe, a join is performed using the postcodes as keys
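
A minimal sketch of such a join is shown below. The column names of the coordinates table (`Postal Code`, `Latitude`, `Longitude`) and all values are assumptions for illustration; they are not read from the notebook's data.

```python
import pandas as pd

# Hypothetical neighborhood table (one row per postcode).
neighborhoods = pd.DataFrame(
    {
        "Postcode": ["M3A", "M5A"],
        "Borough": ["North York", "Downtown Toronto"],
        "Neighborhood": ["Parkwoods", "Harbourfront, Regent Park"],
    }
)

# Hypothetical coordinates table; values are illustrative only.
coordinates = pd.DataFrame(
    {
        "Postal Code": ["M5A", "M3A", "M9Z"],
        "Latitude": [43.65, 43.75, 43.70],
        "Longitude": [-79.36, -79.33, -79.50],
    }
)

# Left join keeps every neighborhood row and attaches its coordinates.
merged = neighborhoods.merge(
    coordinates, left_on="Postcode", right_on="Postal Code", how="left"
).drop(columns="Postal Code")
```

A left join is used here so that a postcode without coordinates would show up as a row with missing values rather than silently disappear.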

With the coordinates of each postal code, a map of Toronto with markers indicating the Postcode position is generated

The map shows that the postal codes are not evenly spaced, and that the areas covered by some of them, using a radius of 500 meters, overlap. Using a different radius for each postcode results in a better venue search, because it avoids misrepresenting the number of venues per postcode through radius values that are too large or too small.

To define the radius to use with Foursquare, it's necessary to find the closest neighboring point for each postcode.

To explore the distance function, the closest postcode to the first example in the dataframe is found
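
The notebook's distance function is not shown here; the sketch below assumes a haversine (great-circle) distance and a brute-force search over all other points. The function names and the coordinates are hypothetical.

```python
import math


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two (lat, lng) points in degrees."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def closest_distance(index, points):
    """Distance in meters from points[index] to its nearest other point."""
    lat, lng = points[index]
    return min(
        haversine_m(lat, lng, other_lat, other_lng)
        for i, (other_lat, other_lng) in enumerate(points)
        if i != index
    )


# Illustrative coordinates only, not the notebook's data.
points = [(43.65, -79.38), (43.66, -79.38), (43.70, -79.40)]
distances = [closest_distance(i, points) for i in range(len(points))]
```

Whether the radius is the full distance to the nearest postcode or a fraction of it is a design choice; half the distance is one option that guarantees two neighboring circles cannot overlap.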

A distance column is added to the DataFrame and is used as the search radius for each postcode

The map is plotted using a different radius for each postal code. Not only is overlapping now avoided, but more of the city's area is covered and, consequently, more venues are retrieved

The next step is to explore each postcode and get its venues using the Foursquare API. For that, the credentials must be declared

In order to get the venues within the perimeter of each postal code, it is necessary to get the geographical coordinates (lat and lng) of each one and add them to the dataframe. The geopy library is not compatible with Canadian postcodes, and geocoder is an unreliable library. For that reason, the coordinates are read from the csv file 'Geospatial_Coordinates.csv'.

To explore the data returned by the Foursquare API, a maximum of 100 venues from the first postcode are requested in a radius of 500 meters.
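
Such a request can be sketched as below. The URL is that of Foursquare's legacy v2 `venues/explore` endpoint, which is an assumption here since the notebook's request code is not shown; the helper name `explore_url`, the credential placeholders and the version date are hypothetical. The sketch only builds the URL and does not send it.

```python
from urllib.parse import urlencode


def explore_url(lat, lng, radius, limit, client_id, client_secret, version):
    """Build a venues/explore request URL (legacy v2 endpoint, assumed)."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,            # API version date, format YYYYMMDD
        "ll": f"{lat},{lng}",    # latitude,longitude of the search center
        "radius": radius,        # search radius in meters
        "limit": limit,          # maximum number of venues to return
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)


# Placeholder credentials and illustrative coordinates.
url = explore_url(
    43.65, -79.38, radius=500, limit=100,
    client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET",
    version="20180605",
)
```

The resulting URL would typically be passed to an HTTP client such as `requests.get(url).json()`.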

In this case, the relevant information is venue.categories, venue.location.lat, venue.location.lng and venue.name

It is necessary to extract the Category (shortName) from the JSON data
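
A minimal sketch of that extraction follows. It assumes each returned item has a `venue` object whose `categories` entry is a list of dictionaries with a `shortName` key, consistent with the fields named above; the helper `venue_row` and the sample item are hypothetical.

```python
def venue_row(item):
    """Flatten one returned item into name, category and coordinates."""
    venue = item["venue"]
    categories = venue.get("categories", [])
    # Use the first category's short name; None when the venue has no category.
    category = categories[0]["shortName"] if categories else None
    return {
        "name": venue["name"],
        "category": category,
        "lat": venue["location"]["lat"],
        "lng": venue["location"]["lng"],
    }


# Hypothetical item shaped like the fields listed above.
item = {
    "venue": {
        "name": "Example Cafe",
        "location": {"lat": 43.65, "lng": -79.38},
        "categories": [{"name": "Coffee Shop", "shortName": "Coffee Shop"}],
    }
}
row = venue_row(item)
```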

The next step is to get the venues for each postal code

There is one postal code with no venues returned from the Foursquare API

To get a better sense of the best way to cluster the postal codes, it's necessary to analyze the venue data returned by Foursquare.

The minimum number of venues in a postcode is 0, which corresponds to M5E, the postal code with no venues returned, and the maximum is 100, as expected given the venue limit set in the request sent to the Foursquare API. 50% of the postcodes have 26 or fewer venues.

The frequency distribution of the number of venues per postcode is presented next

Given that each postcode has a different radius passed to the venue request, it's better to represent the venues per postcode in terms of density, that is, venues per area covered for each postcode. In this case, the area covered by the venue search is defined by the distance to the closest postcode.
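
As a sketch of the idea, the density can be computed from the venue count and the search radius. The circular area and the unit (venues per square kilometer) are assumptions made for this illustration; the notebook's exact definition and units may differ, and the function name is hypothetical.

```python
import math


def venue_density(n_venues, radius_m):
    """Venues per square kilometer of a circular search area (assumed unit)."""
    area_km2 = math.pi * (radius_m / 1000.0) ** 2
    return n_venues / area_km2


# Same venue count, different radii: the smaller area yields the higher density.
dense = venue_density(50, 400)
sparse = venue_density(50, 1200)
```

Normalizing by area keeps a postcode with a large search radius from looking busier than a compact one merely because more ground was searched.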

The histogram shows that 60% of the postcodes have a density between 0 and 30 venues per area (expressed as radius). That is expected given that Toronto has a low population density. The last three bars of the plot have very low values, so that data could be merged, leaving 5 venue density ranges for the clustering

Next, the postcodes are clustered based on venue density. One important hyperparameter is the number of clusters; based on the previous analysis, a tentative value is five. The elbow method is then used to get a better sense of the optimal number.
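
The elbow method can be sketched as below, assuming scikit-learn's `KMeans` (the notebook's own implementation is not shown) and a one-dimensional feature. The densities are synthetic values generated for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic one-dimensional densities, for illustration only.
rng = np.random.default_rng(0)
density = np.concatenate(
    [rng.normal(center, 3.0, 20) for center in (10, 35, 70, 115, 210)]
)
X = density.reshape(-1, 1)  # KMeans expects a 2D array (samples, features)

# Fit K-Means for a range of k and record the within-cluster sum of squares.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
```

Plotting `inertias` against `k` and looking for the point where the curve flattens gives the candidate number of clusters.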

Using the elbow method, the optimal number of clusters was determined to be 5, which matches the value suggested by the histogram analysis.

Check the centroid values of venue density and the number of postcodes per cluster

Based on the centroids of each cluster, the cluster names can be defined as:
1. **'Low Venues Density'** with a centroid equal to 11
2. **'Medium-Low Venues Density'** with a centroid equal to 33
3. **'Medium-High Venues Density'** with a centroid equal to 72
4. **'High Venues Density'** with a centroid equal to 114
5. **'Very High Venues Density'** with a centroid equal to 211
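
Because K-Means assigns cluster ids in no particular order, the names can be attached by sorting the clusters by centroid. The sketch below uses the centroid values listed above; the mapping of cluster ids to centroids is hypothetical.

```python
names = [
    "Low Venues Density",
    "Medium-Low Venues Density",
    "Medium-High Venues Density",
    "High Venues Density",
    "Very High Venues Density",
]

# cluster id -> centroid; the id order here is hypothetical.
centroids = {0: 72, 1: 11, 2: 211, 3: 33, 4: 114}

# Sort the cluster ids by centroid value, then pair them with the ordered names.
ordered_ids = sorted(centroids, key=centroids.get)
cluster_names = dict(zip(ordered_ids, names))
```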

The results shown on the map could be useful in, among other areas:
1. Real estate: as part of a property cost model (venue density could be related to the cost of a property) or as a tool for property search.
2. Epidemiology research: venue density could be related to noise, pollution or crime.